Separate enable on address for ram blocks

I am reworking a design I did a couple of years ago to fit a newer part. The original design made use of async ram blocks since they fit the application better. Now I am forced to use registered synchronous block rams. This will create an extra clock delay on reads if I don't think of a way around it. But the clever guy I am, I have come up with a couple of alternatives.

The source of the address is a register that will often be updated just before the cycle that needs to do the memory access. Call the old memory access cycle 1 and the cycle that calculates the address cycle 0. The write enable is not valid until clock cycle 1, but during clock cycle 0 the address is already on the input to the address register; call it "next address". If I run "next address" to the address inputs on the memory, I can start the memory read on clock cycle 0 and the timing works out. I can't do the same trick with the read or write enable since they depend on decoding that will not take place until cycle 1, where the memory access was happening. Since the enable signal is not available early, the read will have to take place on every clock cycle, wasting some power whether I need to do a read or not. So it looks like I will have to do a read on every cycle using the "next address" and a write only when I need it using the "current address".
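
To make the cycle numbering concrete, here is a minimal cycle-level sketch in Python (a behavioral model, not HDL and not any vendor's primitive). The function name `simulate`, the reset address of 0, and the sample data are assumptions made for illustration. It compares the old async RAM behind the address register, a sync RAM fed from "next address", and a sync RAM naively fed from the address register.

```python
def simulate(mem, next_addrs):
    """Model three read schemes, one list entry per clock cycle.

    next_addrs[c] is the value sitting on the address register's D input
    ("next address") during cycle c. All names here are illustrative.
    """
    addr_reg = 0        # the design's address register (assumed reset value)
    sync_q = None       # output register of a sync RAM fed from "next address"
    naive_q = None      # output register of a sync RAM fed from addr_reg
    async_data, sync_data, naive_data = [], [], []

    for nxt in next_addrs:
        # Values visible to the logic during this cycle.
        async_data.append(mem[addr_reg])   # async RAM: combinational read
        sync_data.append(sync_q)
        naive_data.append(naive_q)

        # Clock edge at the end of the cycle: every register samples its input.
        naive_q = mem[addr_reg]   # sees the old address, so data is a cycle late
        sync_q = mem[nxt]         # sees "next address", so data is on time
        addr_reg = nxt

    return async_data, sync_data, naive_data
```

From the second cycle on, the "next address" scheme delivers the same data in the same cycle as the async RAM did, while the naive hookup trails it by one cycle. The price is the one already noted: the read happens on every clock, because the enable is not known yet.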

I can use a dual port memory and connect one port for the read and one for the write. This can even be done on the same port if the address has a separate enable. Then I can use the address input to the block ram as the address register. The address is updated on every clock cycle and a read performed, except when the logic signals a write; then the write enable is asserted on cycle 1 and the address enable is removed so as to keep the same address that was latched on cycle 0. I see the Altera Cyclone 2 parts have an address enable that will let me hold the last address. I don't see an address enable on the Xilinx Spartan 3 parts and I am not sure about the Lattice ECP2 parts as I don't have the full data sheet.
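
The single-port variant with an address enable can be sketched the same way. This is a hedged behavioral model in Python: the class `AddrEnableRam`, its `clock` method and the write-through output are assumptions for illustration, not the interface of any real Altera, Xilinx or Lattice block.

```python
class AddrEnableRam:
    """Hypothetical single-port sync RAM whose address register has an enable."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.addr = 0   # address register inside the RAM block
        self.q = 0      # registered read data

    def clock(self, next_addr, addr_en, we, din=0):
        """Apply one clock edge and return the new registered output."""
        if addr_en:
            self.addr = next_addr      # normal case: follow "next address"
        # With addr_en low, the address latched on the previous edge is held.
        if we:
            self.mem[self.addr] = din  # write goes to the held address
        self.q = self.mem[self.addr]   # assumed write-through style output
        return self.q


ram = AddrEnableRam(16)
ram.mem[5] = 42
ram.clock(next_addr=5, addr_en=True, we=False)          # edge ending cycle 0: read
ram.clock(next_addr=7, addr_en=False, we=True, din=99)  # edge ending cycle 1: write
```

In the second call the "next address" bus has already moved on to 7, but the write still lands at address 5 because the address enable was dropped for that edge.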

The only down side to this "trick" is that it adds a bit of time to the logic path that updates the address register. But the actual ram setup time seems to be pretty small and I expect the routing can be kept pretty minimal as well. So this timing impact may not make the address setup the critical path. But I expect the overall timing to change significantly since the instruction fetch and other internal memory access will be greatly improved using the sync block ram. So the address update may end up as the critical path.

Am I missing something about how to best use a block ram? Any other ideas on how to do the read without adding a clock cycle?

Reply to
rickman

Are you *using* the Altera parts? Xilinx? Lattice? You remain politically neutral in your discussion but it doesn't help with architecture-specific implementations.

Look at the Xilinx WRITE_MODE attribute where NO_CHANGE or WRITE_FIRST *might* give you an extra trick up your sleeve for that architecture. If you don't need the size of those blocks, perhaps the Altera M512 blocks have slightly different performance characteristics that could be leveraged. I'm still a big fan of distributed memory as in Xilinx and Lattice devices giving that old async kind of feeling.

Do you ever want to read the address you just wrote? Write during cycle 1, read from the same registered address on cycle 2? (More precisely, the same combinatorial address also on cycle 1)
Reply to
John_H

Hi Rick,

And about time too! :-)

If you have the timing budget (I guess you have if this is coming from an old design), use your idea of a dual port ram with one port for write and one for read, but clock the read half on the falling edge. HTH, Syms.
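
A rough Python sketch of the falling-edge idea follows. The names `run_opposite_edge` and `half_period_budget_ns` are made up for this illustration, and the model ignores real setup, clock-to-out and routing delays. It only shows that the read data comes back within one full clock of the address register update, and that each leg of the path gets roughly half a period.

```python
def run_opposite_edge(mem, next_addrs):
    """Address register on the rising edge, RAM read port on the falling edge.

    Returns what the rising-edge logic captures as read data on each rising edge.
    """
    addr_reg = 0     # design's address register (assumed reset value)
    ram_q = None     # RAM output register, updated on falling edges
    captured = []
    for nxt in next_addrs:
        # Rising edge: downstream logic captures the RAM output,
        # and the address register takes its next value.
        captured.append(ram_q)
        addr_reg = nxt
        # Falling edge, half a period later: the RAM registers the
        # address and presents the read data.
        ram_q = mem[addr_reg]
    return captured


def half_period_budget_ns(clock_mhz):
    """Time available to each half of the path, ignoring duty-cycle error."""
    return 1000.0 / clock_mhz / 2.0
```

At an assumed 100 MHz clock that leaves about 5 ns for address register to RAM and another 5 ns for RAM output to the capturing register, which is why this only works if there is timing budget to spare.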

p.s. DISCLAIMER. I hate using the opposite edge. It's usually bad, costing you in the long ruin. (Nice typo, I'll leave that in!)

Reply to
Symon

I'm looking for the best architecture to implement this design. I want to use a low cost device so the choices are Spartan 3, Cyclone 2 or ECP2. I would like to optimize for the best architecture and have the option of porting to other choices.

I am familiar with the write mode features. I will be using the write through mode or I think Xilinx calls it "WRITE_FIRST". Distributed memory is too small for this application.
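
For readers less familiar with the write modes, here is a small Python sketch of what the port's output latch does on a write in each of the three Xilinx WRITE_MODE settings. The function `port_clock` and its argument order are assumptions for illustration; only the mode names come from the Xilinx attribute.

```python
def port_clock(mem, q, addr, we, din, mode):
    """Apply one clock edge to an enabled port; return the new output value.

    mem  : list used as the memory array (modified in place on a write)
    q    : current value of the output latch
    mode : "WRITE_FIRST", "READ_FIRST" or "NO_CHANGE"
    """
    if not we:
        return mem[addr]          # plain read: output follows the memory
    old = mem[addr]
    mem[addr] = din
    if mode == "WRITE_FIRST":
        return din                # write-through: new data appears on the output
    if mode == "READ_FIRST":
        return old                # previous contents appear on the output
    if mode == "NO_CHANGE":
        return q                  # output latch holds its earlier value
    raise ValueError("unknown write mode: " + mode)
```

WRITE_FIRST is the "write through" behaviour mentioned above: the value just written is immediately visible on the data output.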

The reads and writes in this case will be separate. I don't actually need the WRITE_FIRST mode on this memory, but will use it on other blocks where it implements a stack. That works great with the sync block ram. The main memory works better async. But I think this may work out pretty well and it has the potential of speeding up things overall.

Reply to
rickman

Some general tricks, then:

I've sometimes doubled the write-side bandwidth in the Xilinx BlockRAM and used the "NO_CHANGE" WRITE_MODE to do a read at the clkX1 rising edge and a write at the clkX1 falling edge. I get the full cycle for the read value to get to its destination and a full cycle for the read address to get to the BlockRAM (albeit through a final read/write address mux stage), effectively making my port look like a read-only port working at clkX1 rather than a rd/wr port at clkX2. The cost of adding the multi-cycle constraints in the right places is minor compared to getting the "additional" write port.
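
The time-multiplexed port can be sketched as a behavioral Python model. The function name `run_time_multiplexed` and the tuple format are hypothetical, and timing constraints are not modelled. Each clkX1 cycle is split into a read slot and a write slot on the same port, and the write slot leaves the output alone, as NO_CHANGE does.

```python
def run_time_multiplexed(mem, ops):
    """One port shared between a read slot and a write slot per clkX1 cycle.

    ops holds one (read_addr, write) pair per clkX1 cycle, where write is
    None or an (addr, data) tuple. Returns the read value that the clkX1
    logic can capture at the end of each cycle.
    """
    q = None
    seen = []
    for read_addr, write in ops:
        # clkX2 edge aligned with the clkX1 rising edge: read slot.
        q = mem[read_addr]
        # clkX2 edge aligned with the clkX1 falling edge: write slot.
        if write is not None:
            waddr, wdata = write
            mem[waddr] = wdata    # NO_CHANGE: q is deliberately not updated
        seen.append(q)            # read data stays stable for the whole cycle
    return seen
```

Because the write slot does not disturb the output, the read side behaves like a read-only port at clkX1 even when a write to the same address happens in the second half of the cycle.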

While looking into what it'd take to sort 1M 64-bit numbers, I figured the more numerous, smaller memories in the Altera Cyclone-II might give an edge but found a lack of support for true dual-port operation for greater than 18 bits. In your application, the pseudo dual-port may be enough to get by with but it's up to you. The Lattice ECP2 family has 18 kbit memories that are also limited to 18 bits in true dual port, similar to the 4.5 kbit Altera memories.

I think we've all experienced delays getting data back out of memory. I was surprised that the 3rd-party IP core I evaluated on the Spartan3 was having severe timing troubles in the Altera device a coworker was using; he eventually worked out the timing but a bit late for the project. Just a caveat for designs that are pushing speed: look over real implementations if you need high memory performance.

- John_H

Reply to
John_H

Thanks for the idea. I had a glimmer of that idea, but I didn't take the time to follow it through. Since the registered address and data do not need to go through any logic on the way to the RAM and the data output from the RAM only goes through a minimal amount of logic (well, not tons anyway, just a 14 way multiplexor), I might be able to get away with a single port clocked on the opposite edge of the clock from the rest of the design. Then I can use the enable on the reads and save the bit of power, plus it will not affect the path to the address register and should run faster if the opposite clocking works ok on the data out path.

I love simple solutions!

Reply to
rickman
