RAM in Altera EABs and Xilinx Block Rams

rickman 22 years ago

I am using RAM in a processor design and I am having trouble understanding exactly how best to use these functions for my design. I will be using them to implement stacks, program memory and data memory. Ideally the write function will look like an addressable register where the address, data and enables are set up prior to the clock and the write happens on the clock edge. The read should be async so that I can provide an address and get data after a delay.

The Altera part is an EP1K50 where the EAB read can be async. The write however is only shown as either fully async or fully registered. I recall that I was warned when reading and writing the same address the data out has a longer delay. But I can't seem to find a reference to that. I am also unclear if I can use the write the way I want or if it requires input registers.

The Xilinx part is an XC3S400 with dual port block rams. It seems like the read path must be registered as well as the write path. I think I could live with that if I could read the data that is being written (top of stack) in the same clock cycle. But I believe the docs say that the other port can either read the old data or is invalid. But then I may be able to use a single port ram for a stack. The address would always be pointing to the current TOS, and as soon as a new value was pushed, the next clock edge would read the new data as it is written to the new address.

I don't want to pipeline anything in this design to keep it very simple. Right now the design is pretty clean and the delay paths are pretty short.

Can anyone clarify how these rams work without pipelining?

Rick "rickman" Collins rick.collins@XYarius.com Ignore the reply address. To email me use the above address with the XY removed. Arius - A Signal Processing Solutions Company Specializing in DSP and FPGA design URL http://www.arius.com 4 King Ave 301-682-7772 Voice Frederick, MD 21701-3110 301-682-7666 FAX

Rajeev 22 years ago

Rick,

I wish I had something more constructive to offer... I have a Stratix design and I use read latency of 2 cycles everywhere (one for address in, one for data out.) While one can eliminate the data output register it adds enough ns that it's just not worth it.

I can't help noticing the (huge?) disparity between the 1K50 and the 3S400, and am surprised that you're still using the ACEX parts. In that vein, I'm carrying around the notion that _all_ newer FPGAs require, or will require, registered ports... so why not bite the bullet and go synchronous?

I'm also not sure from your post whether "pipelined" is synonymous with "registered", i.e. whether you're trying to do something like one instruction per clock cycle and/or you can't tolerate the 2 ticks of latency.

Also, what's your desired clock speed?

Regards,

-rajeev-

roller 22 years ago

"rickman" wrote in message news: snipped-for-privacy@yahoo.com...

I don't know exactly how the Spartan3 is related to the Spartan2, but this might help you; check this out:

formatting link

It says that when you write data, one of the ports reads what you're writing. From the Coregen options I'd guess that you can also set it up as read-after-write (this one) or write-after-read (which would read the previous contents, and then write).

Coregen asks you about that too, but the link I gave you doesn't mention anything. Though, if I recall correctly, I also read (somewhere on the Xilinx site) that the latency is dependent on the size of the RAM: bigger gets 2 cycles of latency, but smaller can get 1 cycle, I think. (Sorry, I don't have a link.)

Peter Alfke 22 years ago

Xilinx (Virtex2 or Spartan3) BlockRAM reading while writing: Any write operation also performs a read, and outputs it on the DO output. The user can choose: write before read (= output the data that is being written), or read before write (= output the previous content that is now being overwritten), or "no change" (= keep the old data on the DO lines).
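The three choices described above can be sketched as a small behavioral model. This is only an illustration in Python, not vendor code: the class name `BlockRamPort`, its `clock` method and the zero-initialized memory are assumptions made for the example, and timing (set-up, clock-to-out) is not modeled at all.

```python
class BlockRamPort:
    """Behavioral sketch of ONE synchronous RAM port (hypothetical names)."""

    MODES = ("WRITE_FIRST", "READ_FIRST", "NO_CHANGE")

    def __init__(self, depth, mode="WRITE_FIRST"):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.mem = [0] * depth   # assumption: memory starts out as all zeros
        self.mode = mode
        self.do = 0              # value currently held on the DO output

    def clock(self, addr, we=False, di=0):
        """Apply one active clock edge and return the new DO value."""
        if we:
            old = self.mem[addr]
            self.mem[addr] = di
            if self.mode == "WRITE_FIRST":
                self.do = di     # output the data that is being written
            elif self.mode == "READ_FIRST":
                self.do = old    # output the content that was overwritten
            # NO_CHANGE: DO keeps whatever it showed before this edge
        else:
            self.do = self.mem[addr]  # plain read: DO follows the addressed word
        return self.do


# Same three-edge sequence in every mode; only the last DO value differs.
results = {}
for mode in BlockRamPort.MODES:
    ram = BlockRamPort(depth=16, mode=mode)
    ram.clock(3, we=True, di=11)                   # write 11 to address 3
    ram.clock(3)                                   # read it back, DO = 11
    results[mode] = ram.clock(4, we=True, di=22)   # write 22 to an empty word
print(results)
```

In the model nothing changes on DO without a call to `clock`, which mirrors the point that there is no asynchronous read path.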

Peter Alfke

Symon 22 years ago

Hi Rick, I can offer my experiences with Xilinx blockram. You're correct that both the read and write are synchronous. There are three write options: WRITE_FIRST, READ_FIRST and NO_CHANGE. Carefully (!) read about these in the data sheet. I use WRITE_FIRST almost exclusively, where the "same clock edge that writes the data input (DI) into the memory also transfers DI into the output registers DO".

When I did my processor design, I also used one as a stack. Like your design, I didn't use pipelining. This was to keep the design small and simple. On the BlockRAM I used one port for PUSHING/POPPING registers, and the other for CALL/RETURN subroutine addresses.

The catch with these blockrams is that, if you read from one port whilst you're writing to the *same* address on the other port, the read data is indeterminate. This makes sense if you think about what the BlockRAM is doing. Check out 'Conflict Resolution' in the user guide (I'm looking at ug012 for V2PRO). This means for me that I can't do a POP instruction immediately after doing a CALL subroutine, and I can't do a RETURN immediately after doing a PUSH. No problem to avoid this in the code, of course. It's a weird thing to do anyway. The ModelSIM simulator also warns if conflicts occur and, of course, simulates the RAM accurately.

Good luck! Cheers, Syms.

Peter Alfke 22 years ago

Here is the official Xilinx text (I just rewrote this for the new User Guide). Conflict Avoidance. Virtex-2 BlockRAM is a true dual-port RAM where both ports can access any memory location at any time. When accessing the SAME MEMORY LOCATION from both ports, the user must, however, observe certain restrictions, specified by the clock-to-clock set-up time window. See the following:

There are two fundamentally different situations: The two ports either have a common clock ("Synchronous Clocking"), or the clock frequency or phase is different for the two ports ("Asynchronous Clocking").

Asynchronous Clocking is the more general case, where the active edges of both clocks do not occur simultaneously: There are no timing constraints when both ports perform a read operation on the same location. When one port performs a write operation, the other port must not read- or write-access the same memory location by using a clock edge that falls within the specified forbidden clock-to-clock set-up time window. (If this restriction is ignored, a read operation might read unreliable data, perhaps a mixture of old and new data in this location; a write operation might result in wrong data stored in this location. There is, however, no risk of physical damage to the device.)

Synchronous Clocking is the special case, where the active edges of both port clocks occur simultaneously: There are no timing constraints when both ports perform a read operation. When one port performs a write operation, the other port must not write into the same location, unless both ports write identical data. When one port performs a write operation, the other port can reliably read data from the same location if the write port is in READ_FIRST mode. DATA_OUT will then reflect the previously stored data.

If the write port is in either WRITE_FIRST or in NO_CHANGE mode, then the DATA_OUT on the read port would become invalid (unreliable). Obviously, the read-port's mode setting does not affect this.
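The synchronous-clocking rules above can be restated as a small lookup. The following Python sketch only encodes the text of this post; the function name `other_port_outcome` and its string results are made up for the illustration, and the asynchronous-clocking case (which depends on the clock-to-clock window) is deliberately left out.

```python
WRITE_MODES = ("WRITE_FIRST", "READ_FIRST", "NO_CHANGE")


def other_port_outcome(write_mode, other_op, same_addr, same_data=False):
    """What happens on the OTHER port while one port writes (common clock).

    write_mode -- mode of the port that is writing
    other_op   -- "read" or "write", the operation on the other port
    same_addr  -- True if both ports address the same location
    same_data  -- for two writes: True if both ports write identical data
    """
    if write_mode not in WRITE_MODES:
        raise ValueError(f"unknown write mode: {write_mode}")
    if other_op not in ("read", "write"):
        raise ValueError(f"unknown operation: {other_op}")
    if not same_addr:
        return "ok"                      # different locations never conflict
    if other_op == "write":
        # Two writes to one location are only allowed with identical data.
        return "ok" if same_data else "invalid"
    # Read of the location being written: reliable only in READ_FIRST,
    # and then it returns the previously stored data.
    return "old data" if write_mode == "READ_FIRST" else "invalid"


for mode in WRITE_MODES:
    print(mode, "->", other_port_outcome(mode, "read", same_addr=True))
```

A check like this could sit in a testbench or simulation model to flag same-address accesses before they reach hardware.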

June 2004 Peter Alfke ( this text has not yet been posted on xilinx.com)

John_H 22 years ago

Quoting Peter's text from below, "When one port performs a write operation, the other port must not write into the same location, unless both ports write identical data."

For a one-port dedicated read and one-port dedicated write configuration that I *believe* rickman is pursuing, a little trick could be used: feed the data to *both* write ports and enable the write to the normally read-only port when a RdAddr==WrAddr compare is valid. This increases the effective address setup time but gives the desired WRITE_FIRST functionality without increasing the Clk-to-out time.
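A rough behavioral sketch of this trick, in Python, may make the data flow easier to follow. It is an interpretation of the description above, not a verified implementation: the class name `ForwardingRam`, the single shared clock and the zero-initialized memory are all assumptions, and the extra address-compare delay is not modeled.

```python
class ForwardingRam:
    """Write port A plus a 'read' port B that also writes on an address match."""

    def __init__(self, depth):
        self.mem = [0] * depth   # assumption: memory starts out as all zeros
        self.do_b = 0            # DO of port B, the normally read-only port

    def clock(self, wr_addr, wr_en, wr_data, rd_addr):
        """One common clock edge for both ports; returns port B's DO."""
        # Port B's write enable comes from the RdAddr == WrAddr compare.
        we_b = wr_en and (rd_addr == wr_addr)
        if we_b:
            # Both ports write identical data to the same location, which the
            # conflict rules allow; port B in WRITE_FIRST shows the new data.
            self.mem[rd_addr] = wr_data
            self.do_b = wr_data
        else:
            # Ordinary read of a location that is not being written.
            self.do_b = self.mem[rd_addr]
        if wr_en:
            self.mem[wr_addr] = wr_data   # port A's write
        return self.do_b


ram = ForwardingRam(8)
print(ram.clock(wr_addr=2, wr_en=True, wr_data=5, rd_addr=2))  # new data forwarded
print(ram.clock(wr_addr=3, wr_en=True, wr_data=9, rd_addr=2))  # unrelated read
```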

rickman 22 years ago

In my design it adds a clock cycle delay to have a register on the data out side of the RAM. So that slows things down a lot. I am using the ACEX parts because I need the 5 volt tolerance that has been left behind by the newer parts. For that function, they work very well.

Yes, if you have more than one register in the fetch-decode-execute cycle, then more than one clock cycle is needed and if you want to start a new instruction on every clock (as I do) it would have to be pipelined. Non-pipelined MCUs are *much* simpler and not necessarily slower in the time to execute any given instruction. Pipelining only lets you add more hardware to overlap execution of multiple instructions. You also don't have to deal with throwing away prefetches if you don't pipeline.

After looking at the structure of the Xilinx Spartan 3 block rams, I see that I can't escape the output register. But seeing the mode where the read is done post-write I realized that I can add a mux and an output register which will always reflect the top of the stack without a read delay! I am still not certain it will work ok in the Xilinx part, but this works great in the Altera parts and it speeds up the cycle time a lot. I can decode and execute the current instruction and fetch the next instruction in no more than two levels of logic and one RAM delay per clock cycle. I expect this to run at 60 to 80 MHz without too much trouble. If I work on optimizing the placement and routing, I might even get 100MHz out of this.

rickman 22 years ago

Yes, I saw that. It gave me an idea of how I can deal with the read delay in the Altera part. But I believe the Xilinx part still gives you a two clock delay on reading the new data. I am using the RAM for stacks among other things. So I can use a separate register to always hold the top of stack. But if it pushes to the stack on one clock cycle and on the next clock cycle pops, the data on the output of the Xilinx RAM is still stale. I guess I can use the dual port and always have the read one address below the write.

rickman 22 years ago

But it still has a two cycle delay from writing to read data out, right? So if I want the data that was just written on the next clock cycle (like in a stack) I need to use an external register and use separate read and write addresses. Correct?

rickman 22 years ago

Sounds like we are doing similar things. I am not trying to share one ram for two stacks though. In the Altera part, I can have an async read and a clocked write all with the same address (single port). So whenever I write (push) the data is available on the read output in the second half of the next clock cycle. To speed up the delay I am adding a mux and a register to hold the top of stack when the stack is written and to get the second to top on pops (new top). Since this is registered, I don't have to worry about the cascaded delays on the address setup and the RAM read times. On a return instruction it would have two RAM delays (return stack and instruction memory) and some three or four LUT delays (decode, mux).

But with the Xilinx part, the two clock cycle thing really gets in the way of implementing one clock cycle stacks. You can't even do a push followed by a pop which is not at all uncommon... "1 2 add"... two pushes followed by a pop. I can do the same muxed register trick I do with the Altera part, but I have to have two addresses and use two ports, one for read and one for write.

rickman 22 years ago

To implement a stack you don't normally need separate read and write ports since you only do one thing at a time. The Xilinx block RAMs can't do a read in less than two clock cycles, which gets in the way of a stack. So I would need to use a separate register to hold the top of stack and refresh that on POPs from the RAM using a separate read port with a separate address. In that case there is never the problem of simultaneous reads and writes to the same address because you only ever do one thing at a time.

I have not thought about my program or data memory. I may really be hosed there and have to abandon the one clock cycle instruction idea. I guess I could use a two up clock or something similar. I believe the Spartan 3 block rams are fast enough that I likely won't have a speed issue even with a 2x clock.

Peter Alfke 22 years ago

Just to clarify Rickman's "Two-clock-cycle thing": Xilinx BlockRAMs need ONE clock to perform any operation, be it a read or a write. As a bonus, the write operation also performs a read operation on the same location, showing either the old or the new data (user option). And this is all on one port. You can obviously use the other port independently from the first. The one thing you cannot do is an asynchronous read without a clock edge.

If anybody has any questions about Xilinx BlockRAMs, I am more than happy to explain. Peter Alfke, Xilinx Applications

rickman 22 years ago

Perhaps I didn't understand the documentation. I think I got mixed up in the description of the read port latches. Sometimes I forget the distinction between latches and registers.

First, let me say that I am designing a stack using a single block ram. My understanding is that I can use the RAM as either a single port ram with a single address bus, a write data bus and a read data bus, or a dual port ram with two independent interfaces like the single port interface.

Using the single port interface it appears to me that the address and control signals are registered. Looking at the timing diagram for the WRITE_FIRST option, I see that the data output changes with one clock delay. So can I consider the register to be on the input side (address, control) with the read data output using no register? I believe that will work for a stack. When data is being pushed, the incremented address is set up and the write is clocked in, while the data output is steady until the clock edge (old top of stack). Following the clock edge, the data written will be presented on the output (new top of stack). To pop the stack, the address is decremented and a read is done with the new data available following the clock edge (new top of stack). A write (pop and push) is done by not changing the address and registering a new write with the read data changing after the clock edge.

Will the single port WRITE_FIRST ram mode work this way?

I also need program and data memories and the register delay may interfere with full speed operation on these. I might be able to clock the data and instruction memory from "not clock" to allow the read data to be available during the second half of the current clock cycle. This may result in a bit slower clock cycle, but it should be better than a two clock cycle.

John_H 22 years ago

[snip]

The "write (pop and push)" is a little confusing, you may need to elaborate that for my own edification.

For WRITE_FIRST mode, when you push a value to the top of stack, that value - the top of stack - will be sitting on the output after the one clock edge, ready to be used *immediately* for a POP value in the new cycle. With the POP command that uses the top of stack value which is waiting on the read port, the address needs be decremented such that the *next* cycle will have the *new* top of stack value ready for a new POP command. If you have a PUSH before the POP, the address is incremented for the write during the PUSH cycle such that the clock edge will have the new top of stack ready for a next-cycle POP. It's because the WRITE_FIRST makes the most-recently written value available on the read port that the stack can work well.

It's the address that needs to be manipulated combinatorially before the clock edge for the PUSH or POP to have the value ready for POP access whenever the POP comes up. The setup and routing for the address is small enough that the combinatorial delay before the BlockRAM still gives excellent timing.

rickman 22 years ago

Push - write to location at incremented stack pointer, update register to new data.

Pop - read location at decremented stack pointer, update register to data read.

Write - write to location at stack pointer, update register to new data. Write is used when an instruction modifies the top of stack without popping.

I understand about the address. I was not certain about the read timing. The data sheet talks about output latches, but now I realize they mean transparent latches and the registers are all on the input side.

Peter Alfke 22 years ago

If the BlockRAM explanation in the Xilinx data book is not clear, I consider that my problem. Let me fix this here:

The BlockRAM is a synchronous device, nothing happens without a clock edge. Let's look at just one port.

Read operation: You have to apply the address and control inputs a (very short) set-up time before the active (optional polarity) clock edge. (DI data input lines are not used). The active clock edge stores the information, decodes the address, reads the data content at that location and puts it onto the DO output lines. There is a very short set-up time, but a relatively long "clock-to-out" read time, since it includes address decode and read and write strobes.

Write operation: You have to apply the address, Data and control inputs a (very short) set-up time before the active (optional polarity) clock edge. The active clock edge stores the information, decodes the address, creates a read and a write pulse, writes the DI data into the addressed location, and also reads the data content at that location and puts it onto the DO output lines.

The user has control of the relative timing of the write and read sequence:

Either WRITE_FIRST, "write before read", forcing the written data onto the DO outputs (of marginal interest).

Or READ_FIRST, "read before write", forcing the "old" data onto the DO outputs and keeping them there until the next operation.

Or NO_CHANGE, don't change the DO output, causing it to maintain its data until the next read operation.

These options are new to Virtex-II (and Spartan3). Virtex and Spartan2 always did write before read.

Dual-Port operation: The two ports are independent, except for special rules of validity when one port writes into a location that the other port is reading from (I posted the gory details a while ago).

In your case, you perform a synchronous write to the Top-of-Stack address, while (for free) simultaneously also reading this new data on DO. You then can pop the stack synchronously with the decremented address.

I hope this clarifies things.

Peter Alfke

rickman 22 years ago

I like to see diagrams of the functional elements to show how circuits work... "a picture is worth a thousand words"...

The app note is more clear now that I see my mistake. But a block diagram showing the input registers and the output *latch* might help to make the circuit operation more clear. I don't recall if there is also an optional output register; if so, that should be added to the illustration as well. I seem to recall that the operation of the CLB RAM in the 4000E series was illustrated very well in this regard. It showed all the possible modes via registers, muxes and the write pulse generator. Something like that would be useful if added to Xapp 463.

Symon 22 years ago

Rick, Try this instead. (The POPs are different) Put the BRAM in WRITE_FIRST mode.

PUSH - Write to location at incremented stack pointer, new output is new data.

POP - Read output data, decrement stack pointer so new output is new top of stack.

WRITE - Write new data to top of stack, read old top of stack.

Sounds ideal for a Xilinx BRAM to me, all happens on a single clock edge. The BRAM always presents the top of stack at its output so it's available right away. Anyway, that's what I did..
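The PUSH/POP/WRITE scheme above can be walked through with a small behavioral model of a single-port WRITE_FIRST RAM. This is only a sketch in Python under stated assumptions: the class name `BramStack` and its methods are invented for the illustration, each method call stands for one clock edge, the memory starts as zeros, and there is no overflow or underflow checking.

```python
class BramStack:
    """Stack on a single-port WRITE_FIRST RAM; DO always shows the top of stack."""

    def __init__(self, depth):
        self.mem = [0] * depth   # assumption: memory starts out as all zeros
        self.sp = 0              # address of the current top of stack
        self.do = 0              # RAM output as seen between clock edges

    def push(self, value):
        """PUSH: write at the incremented pointer; DO becomes the new data."""
        self.sp += 1
        self.mem[self.sp] = value
        self.do = value          # WRITE_FIRST puts the written data on DO

    def pop(self):
        """POP: use DO as it stands, then read at the decremented pointer."""
        value = self.do          # top of stack is already waiting on DO
        self.sp -= 1
        self.do = self.mem[self.sp]   # new top of stack after the edge
        return value

    def write(self, value):
        """WRITE: replace the top of stack in place; returns the old top."""
        old = self.do
        self.mem[self.sp] = value
        self.do = value
        return old


# "1 2 add": two pushes, then a pop combined with an in-place write.
s = BramStack(16)
s.push(1)
s.push(2)
a = s.pop()          # a is 2, and DO now shows 1
s.write(a + s.do)    # top of stack becomes 3
print(s.do)
```

Every operation takes a single call, which corresponds to the claim that a push followed immediately by a pop needs no extra cycle in this arrangement.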

cheers, Syms.
