Quad-Port BlockRAM in Virtex

I think I need a quad-port blockRAM in a Xilinx V7. Having multiple read ports is no problem, but I need two read ports and two write ports. The two write ports are the problem. I can't double the clock speed. To be clear, I need to be able to do two reads and two writes per cycle. (Not writes to the same address.)

The only idea I could come up with is to have four dual-port BRAMs and a semaphore array. Let's call the BRAMs AC, AD, BC, and BD. Writer A writes the same value to address x in AC and AD and simultaneously sets the semaphore of address x to point to 'A'. Now when reader C wants to read address x, it reads AC and BC and the semaphore, sees that the semaphore points toward the A side, and uses the value from AC and discards BC. If writer B writes to address x, it writes the value to both BC and BD and sets semaphore x to point to side B. Reader D reads AD and BD and picks one based on the semaphore bit.
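
In behavioral Verilog the data side looks something like this (a rough sketch with my own port names -- note the lvt array as written needs two write ports, which is exactly what the semaphore trick below is meant to avoid):

    module quad_port_2w2r #(parameter DW = 64, AW = 11) (
      input           clk,
      input           wr_a_en,               // write port A
      input  [AW-1:0] wr_a_addr,
      input  [DW-1:0] wr_a_data,
      input           wr_b_en,               // write port B
      input  [AW-1:0] wr_b_addr,
      input  [DW-1:0] wr_b_data,
      input  [AW-1:0] rd_c_addr,             // read port C
      output [DW-1:0] rd_c_data,
      input  [AW-1:0] rd_d_addr,             // read port D
      output [DW-1:0] rd_d_data
    );
      reg [DW-1:0] ac [0:(1<<AW)-1];         // written by A, read by C
      reg [DW-1:0] ad [0:(1<<AW)-1];         // written by A, read by D
      reg [DW-1:0] bc [0:(1<<AW)-1];         // written by B, read by C
      reg [DW-1:0] bd [0:(1<<AW)-1];         // written by B, read by D
      reg          lvt [0:(1<<AW)-1];        // 0 = side A wrote last

      always @(posedge clk) begin
        if (wr_a_en) begin                   // A writes both of its copies
          ac[wr_a_addr]  <= wr_a_data;
          ad[wr_a_addr]  <= wr_a_data;
          lvt[wr_a_addr] <= 1'b0;
        end
        if (wr_b_en) begin                   // B writes both of its copies
          bc[wr_b_addr]  <= wr_b_data;
          bd[wr_b_addr]  <= wr_b_data;
          lvt[wr_b_addr] <= 1'b1;
        end
      end

      // Registered reads; each reader muxes between its A- and B-side copy.
      reg [DW-1:0] c_a, c_b, d_a, d_b;
      reg          sel_c, sel_d;
      always @(posedge clk) begin
        c_a <= ac[rd_c_addr];  c_b <= bc[rd_c_addr];  sel_c <= lvt[rd_c_addr];
        d_a <= ad[rd_d_addr];  d_b <= bd[rd_d_addr];  sel_d <= lvt[rd_d_addr];
      end
      assign rd_c_data = sel_c ? c_b : c_a;
      assign rd_d_data = sel_d ? d_b : d_a;
    endmodule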

The semaphore itself is complicated. I think it would consist of 2 quad-port RAMs, one bit wide and the depth of AC, each one having 1 write and 3 read ports. This could be distributed RAM. Writer A would read the side-B semaphore bit and set its own to the same, and writer B would read the side-A bit and set its own to the opposite. Now when reader C or D reads its two copies (A/B) of the semaphore bits using its read ports, it checks if they are the same (use side A) or opposite (use side B).
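
Just the semaphore piece, as a fragment that would replace the lvt array above (each one-bit RAM has a single writer, so it maps to 1W/3R distributed RAM: sem_a is written by A and read by writer B and both readers; likewise sem_b):

    reg sem_a [0:(1<<AW)-1];   // written only by writer A
    reg sem_b [0:(1<<AW)-1];   // written only by writer B

    always @(posedge clk) begin
      // A copies B's bit, so after A's write the two bits at x are EQUAL.
      if (wr_a_en) sem_a[wr_a_addr] <= sem_b[wr_a_addr];
      // B inverts A's bit, so after B's write the two bits at x DIFFER.
      if (wr_b_en) sem_b[wr_b_addr] <= ~sem_a[wr_b_addr];
    end

    // Readers XOR their two copies: 0 = equal = side A wrote last,
    // 1 = different = side B wrote last.
    wire sel_c = sem_a[rd_c_addr] ^ sem_b[rd_c_addr];
    wire sel_d = sem_a[rd_d_addr] ^ sem_b[rd_d_addr];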

It's a big mess and uses 4x the BRAMs of a dual-port. Maybe I need a different solution.

Reply to
Kevin Neilson

Update: I found a solution in the "Altera Synthesis Cookbook" and it seems to be the scheme I described above, but implementing the semaphore bits as FFs instead of distributed RAM. I'd need about 2048 semaphore bits, so implementing that in a distributed RAM would probably be advantageous. You can do a 64-bit quad port (1 wr, 3 rd) in a 4-LUT slice, so I'd need 2048/64 * 4 * 2 = 256 LUTs to do 2 2048-bit quad-port distributed RAMs (2048/64 = 32 slices per RAM, 4 LUTs per slice, times 2 RAMs). (Add in ~10 slices for 32->1 muxes.)
Reply to
Kevin Neilson


Update 2: I came up with a better solution than the Altera Cookbook. The semaphore bits are stored partly in a separate blockRAM and partly in the main data blockRAMs. Then there is very little logic out in the fabric--just the muxes for the two read ports. Too bad there isn't an app note on this.

Reply to
Kevin Neilson

Again, why do you need four BRAMs? Perhaps I'm stupid, but I don't see what can be achieved with four BRAMs that cannot be achieved with two, if it's correct that "[h]aving multiple read ports is no problem". Or is it just how you solve the problem of having multiple read ports?

Like, you have two BRAMs A and B, and a semaphore array. The writer A writes to A and points the semaphore of address x to A. The writer B does the same for B. You read simultaneously A and B and the semaphore for address x.

Gene

Reply to
Evgeny Filatov

I need 4 ports (2 wr, 2 rd). Your 2-BRAM solution allows for 2 wr ports, but only 1 rd port. In your solution you read A and B and the semaphore, then mux either A or B to your read data output based on the semaphore. But I need a second read port, so I have to have a second copy of the system you describe.

I drew up a nice diagram with a good solution for doing the semaphores, but I don't know how to post it here.

Reply to
Kevin Neilson

Thanks for explaining the rationale for using 4 BRAMs.

Your solution would surely be interesting to look at. To post an image, you can just upload it to any image-hosting website like

formatting link

and post the link to your image here.

My best idea to remove logic from the design would be to append a timestamp to each write operation (instead of switching a semaphore). During the read operation, the data word with the newest timestamp would be selected. But it would only work for a limited time, until the timestamp field overflows.
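
Something like this sketch, with illustrative names (the timestamp width and the two single-writer banks are my own assumptions):

    module ts_2w1r #(parameter DW = 64, AW = 11, TS_W = 8) (
      input               clk,
      input               wr_a_en,
      input  [AW-1:0]     wr_a_addr,
      input  [DW-1:0]     wr_a_data,
      input               wr_b_en,
      input  [AW-1:0]     wr_b_addr,
      input  [DW-1:0]     wr_b_data,
      input  [AW-1:0]     rd_addr,
      output [DW-1:0]     rd_data
    );
      reg [TS_W-1:0]    stamp = {TS_W{1'b0}};    // free-running counter
      reg [TS_W+DW-1:0] bank_a [0:(1<<AW)-1];    // written only by A
      reg [TS_W+DW-1:0] bank_b [0:(1<<AW)-1];    // written only by B

      always @(posedge clk) begin
        stamp <= stamp + 1'b1;
        if (wr_a_en) bank_a[wr_a_addr] <= {stamp, wr_a_data};
        if (wr_b_en) bank_b[wr_b_addr] <= {stamp, wr_b_data};
      end

      wire [TS_W+DW-1:0] word_a = bank_a[rd_addr];
      wire [TS_W+DW-1:0] word_b = bank_b[rd_addr];
      // Wide unsigned compare picks the newer copy -- and the scheme
      // breaks as soon as 'stamp' wraps around.
      wire b_newer = word_b[TS_W+DW-1:DW] > word_a[TS_W+DW-1:DW];
      assign rd_data = b_newer ? word_b[DW-1:0] : word_a[DW-1:0];
    endmodule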

Gene

Reply to
Evgeny Filatov


There is literature on this subject:

formatting link

Reply to
jim.brakefield

Thanks. Here's my sketch:

formatting link

The timestamp is a nice idea, but, like you said, it would overflow quickly. And you'd have a long carry chain to do the timestamp comparison.

Reply to
Kevin Neilson

Yes, I did actually find this yesterday when searching again. The design I ended up using (

formatting link

) looks like what they have in Fig. 3(a), except I implemented the "live value table" in BRAMs so it's much faster. They have a faster solution in Fig. 4(c), which uses their "XOR-based" design. However, it requires a lot more RAM because you need 6 full data storage units. I used only 4, and then two much smaller RAMs for semaphores (aka Live Value Table), and I also store semaphore copies in the 4 data RAMs.

Reply to
Kevin Neilson

I find this thread very interesting; it discusses quite a few approaches I would not have thought of in the first place...

Maybe a different view-point: As most modern FPGAs support true dual-port RAM, with double clock rate you could write to two ports in the first cycle and read from both ports in the second cycle. This would only require 1 BRAM compared to 4 BRAMs (assuming your content fits into 1 BRAM, of course...).

However, you wrote that you cannot double the clock rate (out of curiosity: which clock rates are we talking about?). But maybe you could increase it by 50%? Then you could make a 2/3 clock scheme with 2 BRAMs, with all the writes going to both BRAMs (taking two of the 3 cycles), but the reads for these two transactions (4 in total) are done in the 3rd cycle from both BRAMs. Of course this makes sense only if you can find a simple clock-domain-crossing solution on system level...
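
A rough behavioral sketch of the 2x-clock variant (my own names; a real design would use the vendor's true-dual-port template and generate 'phase' aligned to the slow clock):

    module tdp_2x #(parameter DW = 64, AW = 9) (
      input               clk2x,
      input               phase,      // 0 = write slot, 1 = read slot
      input               we_a, we_b,
      input  [AW-1:0]     wa_a, wa_b, ra_c, ra_d,
      input  [DW-1:0]     wd_a, wd_b,
      output reg [DW-1:0] rd_c, rd_d
    );
      reg [DW-1:0] mem [0:(1<<AW)-1];
      always @(posedge clk2x)
        if (!phase) begin                 // both writes in the first slot
          if (we_a) mem[wa_a] <= wd_a;    // BRAM port A
          if (we_b) mem[wa_b] <= wd_b;    // BRAM port B
        end else begin                    // both reads in the second slot
          rd_c <= mem[ra_c];
          rd_d <= mem[ra_d];
        end
    endmodule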

Regards,

Thomas

formatting link
- Home of EEBlaster and JPEG-Codec

Reply to
thomas.entner99

Great design! In terms of the referenced article, it combines the good features of both the LVT/semaphore approach (requires little memory to store semaphores), and the XOR-based approach (no need for multiport memory to store semaphores).

I would only suggest that, as discussed on pp. 6-7 of the LaForest article, it's possible to give the user the impression that there's no write delay by adding some forwarding circuitry.

Gene

Reply to
Evgeny Filatov

I realized that since I'm doing read-modify-writes, I don't even need the extra semaphore RAMs. Since I'm reading each address two cycles before writing, I can get the semaphores from the data RAMs. When I'm doing a write only, I can precede it by a dummy read to get the semaphores.

The Xilinx BRAMs operate at the same speed for write-first and read-first modes, so I probably wouldn't need the forwarding logic. (The setup time is a lot bigger for write-first mode, though.) However, I do need a short "local cache" for when I try to read-modify-write the same location on successive cycles. Because of the read latency, the second read would be of stale data, so I have to read from the local cache instead.
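
Roughly like this, with a depth of 2 shown (illustrative names; wr_*/rd_* are the port signals and bram_rd_data is the normal BRAM output; my real version keeps the in-flight writes in a dynamic SRL, so the depth is nearly free):

    reg [AW-1:0] wr_addr_p1, wr_addr_p2;       // writes still in flight
    reg [DW-1:0] wr_data_p1, wr_data_p2;
    reg          wr_en_p1,   wr_en_p2;
    reg [AW-1:0] rd_addr_p1;                   // aligns with BRAM read latency
    always @(posedge clk) begin
      wr_addr_p1 <= wr_addr;    wr_data_p1 <= wr_data;    wr_en_p1 <= wr_en;
      wr_addr_p2 <= wr_addr_p1; wr_data_p2 <= wr_data_p1; wr_en_p2 <= wr_en_p1;
      rd_addr_p1 <= rd_addr;
    end
    // If the read hits an address whose write hasn't propagated yet,
    // forward the in-flight data; the newest write wins.
    wire hit1 = wr_en_p1 && (rd_addr_p1 == wr_addr_p1);
    wire hit2 = wr_en_p2 && (rd_addr_p1 == wr_addr_p2);
    wire [DW-1:0] rd_data = hit1 ? wr_data_p1
                          : hit2 ? wr_data_p2
                          :        bram_rd_data;   // normal path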

Reply to
Kevin Neilson

There is a paper that describes your approach, published by my Ph.D. student Ameer Abdelhadi at FPGA2014. He has also extended it, at FCCM2016, to include switched ports, where some ports can dynamically switch between read and write mode.

formatting link

He has released the designs on GitHub under a permissive open source license.

formatting link

Guy

Reply to
Guy Lemieux

My Ph.D. student Ameer added forwarding paths to his version, available on GitHub. See the papers at FPGA2014 and FCCM2016.

formatting link

formatting link

Reply to
Guy Lemieux


That's a great idea. It took me a few minutes to work through this, but that seems like it would work. The clock I'm using now is 360MHz, so a 1.5x clock would be 540MHz. That's pushing the edge, but Xilinx says the BRAM will run at 543MHz in a -2 part. The clock-domain crossing shouldn't be a problem. The clocks are "periodic-synchronous" so you have a known setup time. (Assuming you use DLLs to keep them phase-locked.)

Xilinx does have an old app note (

formatting link
) on using a 2x clock to make a quad-port. In my case the 2x clock would be 720MHz.

Reply to
Kevin Neilson


I added a diagram of the simplified R-M-W quad-port to that link. http://imgur.com/a/NhNr0

Reply to
Kevin Neilson


Thanks; I enjoyed looking through the papers. The idea of dynamically switching the write ports to reads is one I might need to use at some point.

The main difference in my diagram is that I implemented part of the I-LVT in the data RAMs. For example, for a 2W/2R memory, you show the I-LVT RAMs as being 1 write, 3 reads. My I-LVTs are 1 write, 1 read, with the rest of the I-LVT done in the data RAMs. In my case, I need 69-wide BRAMs, and the BRAMs are 72 bits wide, so I have an extra 3 bits. I use one of those bits as the I-LVT ("semaphore") bit. When I do a read, I don't have to access a separate I-LVT RAM.
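
The packing itself is trivial (a sketch with my own names):

    localparam DW = 69;                        // payload width
    // Write side: 69 data bits + the I-LVT bit + 2 unused bits fill
    // the 72-bit BRAM word, so the selector rides along with the data.
    wire [71:0] wr_word = {2'b00, sem_wr_bit, wr_data};  // bit 69 = I-LVT bit
    // Read side: the selector comes back for free with every data
    // read, so no separate I-LVT read port is needed.
    wire [71:0]   rd_word;                     // 72-bit BRAM read output
    wire          sem_rd_bit = rd_word[DW];
    wire [DW-1:0] rd_data    = rd_word[DW-1:0];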

Reply to
Kevin Neilson


Kevin, the method you mentioned is actually identical to the 2W/2R I-LVT (both binary-coded and thermometer-coded) from our FPGA2014 paper, with ONE modification: you store the BRAM outputs of the LVT in the data banks. After reading the data banks, these LVT bits are also read as meta-data, and then the output selectors are extracted (the XORs in your diagram). This will indeed prevent replicating the LVT BRAMs; however, it incurs other *severe* problems:

1) Latency: an additional 2 cycles in the decision path! The longest path of our I-LVT method passes through the LVT as follows:
1- Reading the I-LVT feedbacks
2- Rewriting the I-LVT
3- Reading the I-LVT to generate (through the output extraction function) the output mux selectors.
Even with these three cycles, our I-LVT required a very complicated bypassing circuitry to deal with even simple hazards such as write-after-write. Your solution adds two cycles in the selection path: one to rewrite the data banks with the I-LVT bits, and a second to read these bits back (then extract the selectors). This requires caching to bypass the very long decision path, which will increase the BRAM overhead again.

In other words, the read mechanism of both methods is similar, but the output mux selectors in your method are read from the data banks instead of the LVT. Once a write happens, the output selectors will see the change after 5 cycles (LVT feedback read -> LVT rewrite -> LVT read -> data bank write (selectors) -> data bank read (selectors)), whereas ours requires only 3 cycles.

2) Modularity: the additional bits can't accommodate bank selectors for every number of write ports. For instance, you mentioned an extra 3 bits in each BRAM line. These 3 bits can code selectors for up to 8 write ports. For more than 8 write ports, the meta-data would have to be stored in additional BRAMs, which will further increase the BRAM consumption.

Anyhow, the I-LVT portion is minor compared to the data banks. For instance, in your diagram, you are using 140Kbits for the data banks and only 2Kbits for the LVT. Our I-LVT requires only 2Kbits more for the I-LVT (only +1.5%); however, it eliminates the need for caching (as required by your solution).

Ameer

formatting link

Reply to
Ameer Abdelhadi


BTW, our design is available online as an open source library. It's modular, parametrized, optimized for high performance and optimal resource consumption, fully bypassed, and fully tested with a run-in-batch manager for simulation and synthesis.

Just download the Verilog, add it to your project, instantiate the IP module, change to your parameters (e.g. #reads, #writes, data width, RAM depth, bypassing...), and you're ready to go!

Open source libraries:

formatting link
formatting link

BRAM-based Multi-ported RAM from FPGA'14:

formatting link
Paper:
formatting link
Slides:
formatting link

Enjoy!

Reply to
Ameer Abdelhadi


Ameer,
Thanks for the response. Yes, there may be some latency disadvantages in my approach. For the cache that I need for the bypass logic, I use a Xilinx dynamic SRL. It's the same size and speed whether the cache depth is 2 or 32, so making the cache deeper doesn't make much difference. (There is more address-comparison logic, though.)

As for the memory usage, it just depends on what BRAM width you need. If you need a 512-deep by 64-bit-wide BRAM, you have to use a Xilinx simple dual-port BRAM with a width of 72, so then you have 8 bits of each location "wasted" which you can use for I-LVT flags. But if you need a 72-bit-wide BRAM for data, then there is no advantage in trying to combine the data and the flags. In my case I just happened to need 69 and had 3 bits left over.

I finished the design that uses the quad-port and I can say it's working well, and it simplified my algorithm significantly. My clock speed is 360 MHz, which was too fast to use a 2x clock to time-slice the BRAMs, but the I-LVT design works just fine.

Kevin

Reply to
Kevin Neilson
