Quad-Port BlockRAM in Virtex

I think I need a quad-port blockRAM in a Xilinx V7.  Having multiple read ports is no problem, but I need two read ports and two write ports.  The two write ports are the problem.  I can't double the clock speed.  To be clear, I need to be able to do two reads and two writes per cycle.  (Not writes to the same address.)

The only idea I could come up with is to have four dual-port BRAMs and a semaphore array.  Let's call the BRAMs AC, AD, BC, and BD.  Writer A writes the same value to address x in AC and AD and simultaneously sets the semaphore of address x to point to 'A'.  Now when reader C wants to read address x, it reads AC, BC, and the semaphore, sees that the semaphore points toward the A side, and uses the value from AC, discarding BC.  If writer B writes to address x, it writes the value to both BC and BD and sets semaphore x to point to side B.  Reader D reads AD and BD and picks one based on the semaphore bit.

The semaphore itself is complicated.  I think it would consist of 2 quad-port RAMs, one bit wide and the depth of AC, each one having 1 write and 3 read ports.  This could be distributed RAM.  Writer A would read the side-B semaphore bit and set its own to the same value, and writer B would read the side-A bit and set its own to the opposite.  Now when reader C or D reads its two copies (A/B) of the semaphore bits using its read ports, it checks whether they are the same (use side A) or opposite (use side B).

It's a big mess and uses 4x the BRAMs of a dual-port.  Maybe I need a different solution.
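
Here's a rough behavioral sketch of what I mean (all names are made up, the read is idealized to one registered cycle, and simultaneous writes to the same address and read-during-write hazards are ignored, per the assumption above):

    module quad_port_ram #(parameter AW = 11, DW = 64) (
      input               clk,
      input               we_a,
      input  [AW-1:0]     waddr_a,
      input  [DW-1:0]     wdata_a,
      input               we_b,
      input  [AW-1:0]     waddr_b,
      input  [DW-1:0]     wdata_b,
      input  [AW-1:0]     raddr_c, raddr_d,
      output [DW-1:0]     rdata_c, rdata_d
    );
      // Four data banks: AC/AD hold writer A's copies, BC/BD writer B's.
      reg [DW-1:0] ac [0:(1<<AW)-1], ad [0:(1<<AW)-1];
      reg [DW-1:0] bc [0:(1<<AW)-1], bd [0:(1<<AW)-1];
      // One semaphore bit per address per side (1 write / 3 read ports each).
      reg sem_a [0:(1<<AW)-1], sem_b [0:(1<<AW)-1];

      // Writer A copies side B's bit; writer B writes the inverse of side
      // A's.  Equal bits mean side A wrote last; unequal, side B did.
      always @(posedge clk) begin
        if (we_a) begin
          ac[waddr_a]    <= wdata_a;
          ad[waddr_a]    <= wdata_a;
          sem_a[waddr_a] <= sem_b[waddr_a];
        end
        if (we_b) begin
          bc[waddr_b]    <= wdata_b;
          bd[waddr_b]    <= wdata_b;
          sem_b[waddr_b] <= ~sem_a[waddr_b];
        end
      end

      // Registered reads: each read port fetches both of its copies plus
      // the semaphore pair, then muxes on their XOR.
      reg [DW-1:0] ac_q, bc_q, ad_q, bd_q;
      reg sel_c, sel_d;
      always @(posedge clk) begin
        ac_q  <= ac[raddr_c];  bc_q <= bc[raddr_c];
        ad_q  <= ad[raddr_d];  bd_q <= bd[raddr_d];
        sel_c <= sem_a[raddr_c] ^ sem_b[raddr_c];  // 1 = side B is newer
        sel_d <= sem_a[raddr_d] ^ sem_b[raddr_d];
      end
      assign rdata_c = sel_c ? bc_q : ac_q;
      assign rdata_d = sel_d ? bd_q : ad_q;
    endmodule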

Re: Quad-Port BlockRAM in Virtex
Update:  I found a solution in the "Altera Synthesis Cookbook" and it seems to be the scheme I described above, but implementing the semaphore bits as FFs instead of distributed RAM.  I'd need about 2048 semaphore bits, so implementing that in distributed RAM would probably be advantageous.  You can do a 64-bit quad port (1 wr, 3 rd) in a 4-LUT slice, so I'd need 2048/64 * 4 * 2 = 256 LUTs to do two 2048-bit quad-port distributed RAMs.  (Add in ~10 slices for 32->1 muxes.)
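
For reference, the 1-write/3-read semaphore RAM can be described behaviorally and left to the tools to replicate into one LUTRAM copy per read port.  A sketch (names made up):

    module dram_1w3r #(parameter AW = 11) (
      input           clk,
      input           we,
      input  [AW-1:0] waddr,
      input           wdata,
      input  [AW-1:0] raddr0, raddr1, raddr2,
      output          rdata0, rdata1, rdata2
    );
      reg mem [0:(1<<AW)-1];
      // One write port; synthesis typically replicates the LUTRAM so that
      // each asynchronous read port gets its own copy fed by the common
      // write port.
      always @(posedge clk)
        if (we) mem[waddr] <= wdata;
      assign rdata0 = mem[raddr0];
      assign rdata1 = mem[raddr1];
      assign rdata2 = mem[raddr2];
    endmodule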

Re: Quad-Port BlockRAM in Virtex
On Friday, October 23, 2015 at 2:10:20 PM UTC-6, Kevin Neilson wrote:

Update 2:  I came up with a better solution than the Altera Cookbook.  The semaphore bits are stored partly in a separate blockRAM and partly in the main data blockRAMs.  Then there is very little logic out in the fabric--just the muxes for the two read ports.  Too bad there isn't an app note on this.

Re: Quad-Port BlockRAM in Virtex
On 01.11.2016 23:16, Kevin Neilson wrote:

Again, why do you need four BRAMs? Perhaps I'm stupid, but I don't see what can be achieved with four BRAMs that cannot be achieved with two, if it's correct that "[h]aving multiple read ports is no problem". Or is it just how you solve the problem of having multiple read ports?

Like, you have two BRAMs A and B, and a semaphore array. Writer A writes to A and points the semaphore of address x to A. Writer B does the same for B. You read A, B, and the semaphore for address x simultaneously.

Gene


Re: Quad-Port BlockRAM in Virtex

I need 4 ports (2 wr, 2 rd).  Your 2-BRAM solution allows for 2 wr ports, but only 1 rd port.  In your solution you read A and B and the semaphore, then mux either A or B to your read-data output based on the semaphore.  But I need a second read port, so I have to have a second copy of the system you describe.

I drew up a nice diagram with a good solution for doing the semaphores, but I don't know how to post it here.

Re: Quad-Port BlockRAM in Virtex
On 04.11.2016 3:35, Kevin Neilson wrote:

Thanks for explaining the rationale for using 4 BRAMs.

Your solution would surely be interesting to look at. To post an image, you can just upload it to any image-hosting website like

http://imgur.com

and post the link to your image here.

My best idea to remove logic from the design would be to append a timestamp to each write operation (instead of switching a semaphore). During a read, the data word with the newest timestamp would be selected. But it would only work for a limited time, until the timestamp field overflows.

Gene


Re: Quad-Port BlockRAM in Virtex

Thanks.  Here's my sketch:

http://imgur.com/a/NhNr0

The timestamp is a nice idea, but, like you said, it would overflow quickly.  And you'd have a long carry chain to do the timestamp comparison.

Re: Quad-Port BlockRAM in Virtex
On 05.11.2016 3:00, Kevin Neilson wrote:

Great design! In terms of the referenced article, it combines the good features of both the LVT/semaphore approach (requires little memory to store semaphores) and the XOR-based approach (no need for multiport memory to store semaphores).

I would only suggest that, as discussed on pp. 6-7 of the LaForest article, it's possible to give the user the impression that there is no write delay by adding some forwarding circuitry.

Gene


Re: Quad-Port BlockRAM in Virtex

I realized that since I'm doing read-modify-writes, I don't even need the extra semaphore RAMs.  Since I'm reading each address two cycles before writing it, I can get the semaphores from the data RAMs.  When I'm doing a write only, I can precede it by a dummy read to get the semaphores.
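
To spell out the timing (the two-cycle read-to-write spacing is the one I'm using; a sketch, not a spec):

    cycle t   : read address x from the data banks (data plus embedded semaphore bit)
    cycle t+1 : modify; derive the new semaphore bit from the one just read
    cycle t+2 : write address x back with the new data and semaphore bit

For a pure write, cycle t becomes a dummy read whose only purpose is to fetch the semaphore bit.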

The Xilinx BRAMs operate at the same speed in write-first and read-first modes, so I probably wouldn't need the forwarding logic.  (The setup time is a lot bigger in write-first mode, though.)  However, I do need a short "local cache" for when I try to read-modify-write the same location on successive cycles.  Because of the read latency, the second read would return stale data, so I have to read from the local cache instead.

Re: Quad-Port BlockRAM in Virtex

I added a diagram of the simplified R-M-W quad-port to that link: http://imgur.com/a/NhNr0

Re: Quad-Port BlockRAM in Virtex
My Ph.D. student Ameer added forwarding paths to his version, available on GitHub. See the papers at FPGA2014 and FCCM2016.

http://ece.ubc.ca/~ameer/publications.html

https://github.com/AmeerAbdelhadi/Multiported-RAM

Re: Quad-Port BlockRAM in Virtex
There is a paper that describes your approach, published by my Ph.D. student Ameer Abdelhadi at FPGA2014. He has also extended it, at FCCM2016, to include switched ports, where some ports can dynamically switch between read and write mode.

http://ece.ubc.ca/~ameer/publications.html

He has released the designs on GitHub under a permissive open source license.  

https://github.com/AmeerAbdelhadi/Switched-Multiported-RAM

Guy

Re: Quad-Port BlockRAM in Virtex
On Saturday, November 5, 2016 at 11:35:20 AM UTC-6, Guy Lemieux wrote:

Thanks; I enjoyed looking through the papers.  The idea of dynamically switching the write ports to reads is one I might need to use at some point.

The main difference in my diagram is that I implemented part of the I-LVT in the data RAMs.  For example, for a 2W/2R memory, you show the I-LVT RAMs as being 1 write, 3 reads.  My I-LVTs are 1 write, 1 read, with the rest of the I-LVT done in the data RAMs.  In my case, I need 69-bit-wide BRAMs, and the BRAMs are 72 bits wide, so I have 3 extra bits.  I use one of those bits as the I-LVT ("semaphore") bit.  When I do a read, I don't have to access a separate I-LVT RAM.
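
A little fragment to illustrate the packing (the module and signal names are made up; only the 69+3 bit layout is the real one):

    module ilvt_pack (
      input  [68:0] wdata,     // 69 data bits
      input         ilvt_bit,  // embedded semaphore bit from the I-LVT
      output [71:0] wword,     // 72-bit BRAM write word
      input  [71:0] rword_a,   // side-A read word (data + semaphore copy)
      input  [71:0] rword_b,   // side-B read word
      output [68:0] rdata
    );
      assign wword = {2'b00, ilvt_bit, wdata};  // bit 69 = semaphore
      // The mux selector falls out of the data fetch itself -- no
      // separate I-LVT read: equal copies -> side A newer, unequal -> B.
      wire sel = rword_a[69] ^ rword_b[69];
      assign rdata = sel ? rword_b[68:0] : rword_a[68:0];
    endmodule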

Re: Quad-Port BlockRAM in Virtex
On Tuesday, November 8, 2016 at 12:13:55 PM UTC-5, Kevin Neilson wrote:

Kevin, the method you mentioned is actually identical to the 2W/2R I-LVT (both binary-coded and thermometer-coded) from our FPGA2014 paper, with ONE modification: you store the BRAM outputs of the LVT in the data banks.  After the data banks are read, these LVT bits are also read as metadata, and then the output selectors are extracted (the XORs in your diagram).  This will indeed avoid replicating the LVT BRAMs; however, it incurs other *severe* problems:

1) Two additional cycles in the decision path!
The longest path of our I-LVT method passes through the LVT as follows:
1- Read the I-LVT feedbacks.
2- Rewrite the I-LVT.
3- Read the I-LVT to generate the output mux selectors (through the output extraction function).
With these three cycles, our I-LVT required very complicated bypassing circuitry to deal with even simple hazards such as write-after-write.  Your solution adds two cycles to the selection path: one to rewrite the data banks with the I-LVT bits, and a second to read these bits (and then extract the selectors).  This solution requires caching to bypass this very long decision path, which will increase the BRAM overhead again.

In other words, the read mechanism of both methods is similar, but the output mux selectors in your method are read from the data banks instead of the LVT.  Once a write happens, the output selectors see the change after 5 cycles (LVT feedback read -> LVT rewrite -> LVT read -> data-bank write (selectors) -> data-bank read (selectors)), whereas ours requires only 3 cycles.

2) Modularity:
The additional bits can't accommodate bank selectors for every number of write ports.  For instance, you mentioned 3 extra bits in each BRAM line.  These 3 bits can encode selectors for up to 8 write ports.  For more than 8 write ports, the metadata must be stored in additional BRAMs, which will further increase the BRAM consumption.

Anyhow, the I-LVT portion is minor compared to the data banks.  For instance, in your diagram, you are using 140 Kbits for the data banks and only 2 Kbits for the LVT.  Our I-LVT requires only 2 Kbits more (only +1.5%); however, it eliminates the need for caching (as required by your solution).

Ameer
http://www.ece.ubc.ca/~ameer/

Re: Quad-Port BlockRAM in Virtex
On Tuesday, December 13, 2016 at 4:37:42 PM UTC-5, Ameer Abdelhadi wrote:

BTW, our design is available online as an open-source library.  It's modular, parametrized, optimized for high performance and low resource consumption, fully bypassed, and fully tested, with a run-in-batch manager for simulation and synthesis.

Just download the Verilog, add it to your project, instantiate the IP module, set your parameters (e.g. #reads, #writes, data width, RAM depth, bypassing...), and you're ready to go!

Open source libraries:
http://www.ece.ubc.ca/~ameer/opensource.html
https://github.com/AmeerAbdelhadi/

BRAM-based Multi-ported RAM from FPGA'14:
https://github.com/AmeerAbdelhadi/Multiported-RAM
Paper: http://www.ece.ubc.ca/~ameer/publications/Abdelhadi-Conference-2014Feb-FPGA2014-MultiportedRAM.pdf
Slides: http://www.ece.ubc.ca/~ameer/publications/Abdelhadi-Talk-2014Feb-FPGA2014-MultiportedRAM.pdf

Enjoy!

Re: Quad-Port BlockRAM in Virtex

Ameer,
Thanks for the response.  Yes, there may be some latency disadvantages in my approach.  For the cache that I need for the bypass logic, I use a Xilinx dynamic SRL.  It's the same size and speed whether the cache depth is 2 or 32, so making the cache deeper doesn't make much difference.  (There is more address-comparison logic, though.)
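
Behaviorally, the bypass cache looks roughly like this (names made up; whether the data shift registers actually map to SRLs depends on coding style and tool settings):

    module bypass_cache #(parameter AW = 11, DW = 69, DEPTH = 4) (
      input               clk,
      input               we,
      input  [AW-1:0]     waddr,
      input  [DW-1:0]     wdata,
      input  [AW-1:0]     raddr,
      output reg          hit,
      output reg [DW-1:0] hdata
    );
      // Shift registers holding the last DEPTH writes (entry 0 = newest).
      reg [AW-1:0] a_sr [0:DEPTH-1];
      reg [DW-1:0] d_sr [0:DEPTH-1];
      reg          v_sr [0:DEPTH-1];
      integer i, j;

      always @(posedge clk) begin
        for (i = DEPTH-1; i > 0; i = i - 1) begin
          a_sr[i] <= a_sr[i-1];
          d_sr[i] <= d_sr[i-1];
          v_sr[i] <= v_sr[i-1];
        end
        a_sr[0] <= waddr;  d_sr[0] <= wdata;  v_sr[0] <= we;
      end

      // Priority match: iterate oldest-to-newest so the newest matching
      // write wins; this is the address-comparison logic mentioned above.
      always @* begin
        hit = 1'b0;  hdata = {DW{1'b0}};
        for (j = DEPTH-1; j >= 0; j = j - 1)
          if (v_sr[j] && a_sr[j] == raddr) begin
            hit = 1'b1;  hdata = d_sr[j];
          end
      end
    endmodule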

As for the memory usage, it just depends on what BRAM width you need.  If you need a 512-deep by 64-bit-wide BRAM, you have to use a Xilinx simple-dual-port BRAM with a width of 72, so you have 8 bits of each location "wasted" which you can use for I-LVT flags.  But if you need a 72-bit-wide BRAM for data, then there is no advantage in trying to combine the data and the flags.  In my case I just happened to need 69 bits and had 3 left over.

I finished the design that uses the quad-port, and I can say it's working well and has simplified my algorithm significantly.  My clock speed is 360 MHz, which was too fast to use a 2x clock to time-slice the BRAMs, but the I-LVT design works just fine.
Kevin

Re: Quad-Port BlockRAM in Virtex
On Friday, October 23, 2015 at 2:40:46 PM UTC-5, Kevin Neilson wrote:

There is literature on this subject:
http://fpgacpu.ca/multiport/TRETS2014-LaForest-Article.pdf

Re: Quad-Port BlockRAM in Virtex

Yes, I did actually find this yesterday when searching again.  The design I ended up using ( http://imgur.com/a/NhNr0 ) looks like what they have in Fig. 3(a), except I implemented the "live value table" in BRAMs so it's much faster.  They have a faster solution in Fig. 4(c), which uses their "XOR-based" design.  However, it requires a lot more RAM because you need 6 full data-storage units.  I used only 4, plus two much smaller RAMs for the semaphores (aka the Live Value Table), and I also store semaphore copies in the 4 data RAMs.

Re: Quad-Port BlockRAM in Virtex
I find this thread very interesting; it discusses quite a few approaches I would not have thought of in the first place...

Maybe a different viewpoint: as most modern FPGAs support true dual-port RAM, with a double clock rate you could write to two ports in the first cycle and read from both ports in the second cycle. This would require only 1 BRAM compared to 4 BRAMs (assuming your content fits into 1 BRAM, of course...).

However, you wrote that you cannot double the clock rate (out of curiosity: which clock rates are we talking about?). But maybe you could increase it by 50%? Then you could make a 2/3 clock scheme with 2 BRAMs, with all the writes going to both BRAMs (taking two of the 3 cycles), while the reads for these two transactions (4 in total) are done in the 3rd cycle from both BRAMs. Of course this only makes sense if you can find a simple clock-domain-crossing solution at the system level...
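
To illustrate, the 3-fast-cycle schedule could look like this (my sketch, assuming true dual-port BRAMs, i.e. 4 ports total per fast cycle, and 2 writes + 2 reads per slow cycle):

    fast cycle 0: write A0 to BRAM1/BRAM2 (port 0 of each), write B0 to BRAM1/BRAM2 (port 1 of each)
    fast cycle 1: write A1 and B1 the same way
    fast cycle 2: reads C0, D0, C1, D1 -- one per port

Since every write goes to both BRAMs, they always hold identical contents, so any port can serve any read.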

Regards,

Thomas
www.entner-electronics.com - Home of EEBlaster and JPEG-Codec

Re: Quad-Port BlockRAM in Virtex

That's a great idea.  It took me a few minutes to work through it, but it seems like it would work.  The clock I'm using now is 360 MHz, so a 1.5x clock would be 540 MHz.  That's pushing the edge, but Xilinx says the BRAM will run at 543 MHz in a -2 part.  The clock-domain crossing shouldn't be a problem.  The clocks are "periodic-synchronous", so you have a known setup time.  (Assuming you use DLLs to keep them phase-locked.)

Xilinx does have an old app note ( https://www.xilinx.com/support/documentation/application_notes/xapp228.pdf ) on using a 2x clock to make a quad-port.  In my case the 2x clock would be 720 MHz.
