Question about multi write ports RAM in FPGA?

Hello, is it possible to build a RAM with multiple write ports in an FPGA using distributed RAM or block RAM? It seems impossible to me, but implementing a multi-port RAM with D flip-flops costs too many resources. Are there any suggestions for implementing a multi-write-port RAM in an FPGA? Thanks a lot.

Reply to
fpga

Hello, is it possible to build a RAM with multiple write ports (e.g., a 512x32 RAM with several read ports and several write ports) in an FPGA using distributed RAM or block RAM? A RAM with multiple read ports and one write port is easy to implement in an FPGA using dual-port block RAM. But it seems impossible to implement multiple write ports using distributed or block RAM, and implementing the multi-port RAM with D flip-flops costs too many resources. Are there any suggestions for implementing a multi-write-port RAM in an FPGA? Thanks a lot.

Reply to
fpga

Xilinx has dual-ported block RAM, in which both ports (I believe) can do simultaneous writes.

If you want more ports than that then you may have to use time multiplexing.

--------------------------------------------------------------------------------------------------------------
I am an EE student looking for summer employment in the Toronto, Canada area. If you have any openings, please contact me at isaacb[AT]rogers[DOT]com.

Reply to
Isaac Bosompem

Good catch, Isaac, you beat me to it. Yes, BlockRAMs can write from both ports simultaneously, and the write enable input can change from write to read on every clock tick. Interestingly, every write also performs a read of the same location, either read-before-write or write-before-read. The former is more useful. For additional ports, use time multiplexing. Peter Alfke
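
A minimal Verilog sketch of the dual-port behavior described in this post, assuming a synthesizer that infers block RAM from this common template (whether a given tool maps two write ports onto one BlockRAM is tool-dependent, so treat this as an illustration rather than a guaranteed mapping). The module name `tdp_ram_sketch` and its parameters are made up for this example. Because each output is assigned with a non-blocking read in the same clocked block as the write, a port that writes gets back the old content of that location, i.e. read-before-write.

```verilog
// Hypothetical true dual-port RAM sketch: both ports can read or write
// on every clock. Not taken from the thread; names are illustrative.
module tdp_ram_sketch #(
    parameter AW = 9,    // 512 entries, as in the original question
    parameter DW = 32
) (
    input  wire          clk,
    input  wire          weA,
    input  wire [AW-1:0] addrA,
    input  wire [DW-1:0] dinA,
    output reg  [DW-1:0] doutA,
    input  wire          weB,
    input  wire [AW-1:0] addrB,
    input  wire [DW-1:0] dinB,
    output reg  [DW-1:0] doutB
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    // Port A: the non-blocking read samples mem before the write lands,
    // so doutA shows the previous content (read-before-write).
    always @(posedge clk) begin
        if (weA) mem[addrA] <= dinA;
        doutA <= mem[addrA];
    end

    // Port B: same behavior. Writing the same address from both ports
    // in one cycle is not handled here and should be avoided.
    always @(posedge clk) begin
        if (weB) mem[addrB] <= dinB;
        doutB <= mem[addrB];
    end
endmodule
```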

Reply to
Peter Alfke

Thank you, Peter and Isaac. Yes, I want several read ports and several write ports for the RAM. So I think the only option is time multiplexing, which will limit the highest frequency I can get.

Reply to
fpga

If multi-write means just 2, you are all set. If it means >2, besides time-based sharing you might also look at a recent (last week) thread on increasing write ports by banking multiple BRAMs and using voting logic. The difference between 1, 2, and 3+ is enormous if done in 1 clock. I usually take several to mean >2.

See also Mar 6 "How do I make dual-port RAM from single port RAM?"

Two interesting suggestions were offered there to let ASIC RAMs with 1 write port act as an effective 2-write-port RAM, at the cost of 4 RAMs. The voting logic still had to allow multiple writes, but it is only 1 bit wide. Perhaps these schemes can be extended to allow >2 writes per clock with even more voting logic. It will depend a lot on your reasons and conditions for wanting >2 writes. Typically, high write-port counts per clock are used in shallow datapath register files, while low write-port counts are used in buffers etc.

What is your write port count, and what are the actual RAM size and application?

John

Reply to
JJ

Thanks very much, John. My RAM size is 256x32 and I want it to have 4 read ports and 3 write ports. The RAM is going to be used as the local vector register file in my vector coprocessor. The coprocessor has different function cores, each with its own local vector register (LVR). So each LVR needs to provide ports to the function unit (2 read ports, 1 write port) and ports for transferring LVR data between the cores (2 read ports, 2 write ports). I chose 2 read ports and 2 write ports for data transfer because I believe it gives much better performance than a 1-read-port/1-write-port design.

Also, a multi-port RAM (I haven't decided the size and number of ports yet) will be used as the register file in a superscalar machine.

Reply to
fpga

BlockRAMs are the easiest for a dual-port write.

For a non-multiplexed multiport write using distributed RAM, a little extra logic and a bunch more distributed memories can give you what you need.

Each port in an n-port distributed RAM configuration has one write and n-1 reads, one from each of the other memories. A write stores the desired write data XORed with all the other reads. A read reads all the memories and XORs the results. As long as two ports never write to the same location at the same time, this system works great; I've used it for multi-channel flags on both sides of a synchronous interface.

As long as you have the asynchronous distributed memories and enough setup time for the write address, read, and XOR before the data is written, it all flows.

Reply to
John_H

Thank you very much.

Sorry, I didn't make my requirement clear. I know that a dual-port write RAM and a 1-write/multiple-read RAM are easy. But I need a RAM with >2 write ports and >2 read ports. Time multiplexing is one choice, but it may limit the system frequency. Using voting logic as described by JJ may be another choice.

Reply to
fpga

The method I suggested specifically works for your case. You need a total of 9 dual-port distributed CLB SelectRAM memory sets for 3 Rd/Wr addresses and 1 Rd-only address. If your 4 read addresses are unrelated to any of the 3 write addresses, you would end up with 6 dual-ports to support the 3 write ports and 3 dual-ports for each of your independent reads for a total of 18 dual-port CLB SelectRAM memory arrays.

For this to work, 1) you cannot write to the same location in more than one memory at the same time, 2) you have to XOR the input data with the data at the same location in memories related to the other write ports, and 3) your read values are the XORs of the data from each of the memories associated with the three write ports.

For a 4-bit wide memory, assume the memories associated with the three write ports at entry 12 are

MemA[12]==4'ha
MemB[12]==4'h6
MemC[12]==4'h0

Then a write to index 12 of Din==4'h7 at port B of your three-port write system would be

MemB[12]
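
The assignment above is cut off in this copy of the thread. Going only by the three rules listed earlier in the post, the write would presumably store Din XORed with the entries from the other two memories, and a later read would XOR all three entries. The simulation-only sketch below works through that arithmetic with the post's own values; the module name and the scalar stand-ins for entry 12 are assumptions made for illustration.

```verilog
// Worked example of the XOR write rule, using the values from the post.
// MemA/MemB/MemC here are scalars standing in for entry 12 of each memory.
module xor_write_example;
    reg [3:0] MemA, MemB, MemC;
    reg [3:0] Din;

    initial begin
        MemA = 4'ha;
        MemB = 4'h6;
        MemC = 4'h0;
        // Value currently stored at entry 12: a ^ 6 ^ 0 = c
        $display("read before write = %h", MemA ^ MemB ^ MemC);

        // Port B writes Din: store Din ^ (other memories' entries)
        Din  = 4'h7;
        MemB = Din ^ MemA ^ MemC;            // 7 ^ a ^ 0 = d
        $display("MemB after write  = %h", MemB);

        // Reading XORs all three memories: a ^ d ^ 0 = 7, i.e. Din
        $display("read after write  = %h", MemA ^ MemB ^ MemC);
    end
endmodule
```

The other two memories are untouched by the write, which is why each write port only needs write access to its own memory.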

Reply to
John_H

To fpga: do you have any relationship with GN,pheonix? If not, sorry for the trouble.

Reply to
vssumesh

Thank you very much for the clever solution, John. But I think the 9 dual-port RAM configuration can only give me 3 rd/wr ports. Where does the other read port come from?

I think the configuration for the 9 dual-port RAMs is like this, from RAM0 to RAM8: port A writes to RAM0,1,2 and reads from RAM0,3,6; port B writes to RAM3,4,5 and reads from RAM1,4,7; port C writes to RAM6,7,8 and reads from RAM2,5,8. Then all the ports of the dual-port RAMs have been used and we have 3 rd/wr ports A, B, and C. Where can I put the other read port? Did I miss something? Thanks a lot.

Reply to
fpga

The RTL (Verilog or VHDL) can take care of instantiating the correct number of dual-port RAMs.

if( WeA ) MemA[AddrA]
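
The snippet above is cut off in this copy of the thread. As a hedged sketch of what such RTL could look like for the 3-write/4-read case discussed here, the module below follows the XOR rules from the earlier post. The module name, port names, and parameters are assumptions, not the poster's code. It relies on asynchronous reads, so it targets distributed RAM rather than BlockRAM, and it leaves it to the synthesizer to replicate each memory once per distinct read address (2 feedback reads plus 4 external reads per memory, which lines up with the 18 dual-port arrays counted earlier in the thread).

```verilog
// Hypothetical XOR-based 3-write / 4-read memory built from three
// single-write memories. Two ports must never write the same address
// in the same cycle.
module xor_3w4r_sketch #(
    parameter AW = 8,    // 256 entries
    parameter DW = 32
) (
    input  wire          clk,
    input  wire          weA, weB, weC,
    input  wire [AW-1:0] waA, waB, waC,
    input  wire [DW-1:0] wdA, wdB, wdC,
    input  wire [AW-1:0] ra0, ra1, ra2, ra3,
    output wire [DW-1:0] rd0, rd1, rd2, rd3
);
    reg [DW-1:0] memA [0:(1<<AW)-1];
    reg [DW-1:0] memB [0:(1<<AW)-1];
    reg [DW-1:0] memC [0:(1<<AW)-1];

    // Start from all zeros so that unwritten locations read as 0.
    integer i;
    initial begin
        for (i = 0; i < (1<<AW); i = i + 1) begin
            memA[i] = {DW{1'b0}};
            memB[i] = {DW{1'b0}};
            memC[i] = {DW{1'b0}};
        end
    end

    // Each port writes only its own memory: data XOR the other two entries.
    always @(posedge clk) begin
        if (weA) memA[waA] <= wdA ^ memB[waA] ^ memC[waA];
        if (weB) memB[waB] <= wdB ^ memA[waB] ^ memC[waB];
        if (weC) memC[waC] <= wdC ^ memA[waC] ^ memB[waC];
    end

    // Every read is the XOR of all three memories at that address.
    assign rd0 = memA[ra0] ^ memB[ra0] ^ memC[ra0];
    assign rd1 = memA[ra1] ^ memB[ra1] ^ memC[ra1];
    assign rd2 = memA[ra2] ^ memB[ra2] ^ memC[ra2];
    assign rd3 = memA[ra3] ^ memB[ra3] ^ memC[ra3];
endmodule
```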

Reply to
John_H

Thank you. I will try it in the synthesizer and see the resource usage.

Reply to
fpga

fpga wrote:

For vector units you can simplify things a lot. For example, you could have a scoreboard that keeps track of which of three RAMs contains the most recent entry. For random access this is impractical because this logic needs three write ports itself. For a vector unit you can update that register, in the simplest case, in three extra clock cycles after each stride. This results in 3+N cycles, which usually is a lot better than the 2xN of a multi-cycling approach.
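
One possible reading of the scoreboard idea, sketched in Verilog under several assumptions that are not in the post: the 256 entries are split into 8 vector registers of 32 elements, each vector register is written in full by a single port before anyone reads it, and the scoreboard holds one 2-bit entry per vector register (small enough for flip-flops). For simplicity the sketch updates the scoreboard on every element write instead of a few cycles after each stride as the post suggests. All names are hypothetical.

```verilog
// Hypothetical 3-write / 1-read vector register file with a per-vector
// scoreboard that records which bank holds the newest copy.
module vec_scoreboard_rf_sketch #(
    parameter VBITS = 3,   // assumed: 8 vector registers
    parameter EBITS = 5,   // assumed: 32 elements per vector register
    parameter DW    = 32
) (
    input  wire                   clk,
    input  wire                   we0, we1, we2,
    input  wire [VBITS+EBITS-1:0] wa0, wa1, wa2,
    input  wire [DW-1:0]          wd0, wd1, wd2,
    input  wire [VBITS+EBITS-1:0] ra,
    output wire [DW-1:0]          rd
);
    localparam AW = VBITS + EBITS;

    // One single-write bank per write port.
    reg [DW-1:0] bank0 [0:(1<<AW)-1];
    reg [DW-1:0] bank1 [0:(1<<AW)-1];
    reg [DW-1:0] bank2 [0:(1<<AW)-1];

    // Scoreboard: for each vector register, the bank written most recently.
    // Entries are undefined until the vector register has been written.
    reg [1:0] owner [0:(1<<VBITS)-1];

    always @(posedge clk) begin
        if (we0) begin bank0[wa0] <= wd0; owner[wa0[AW-1:EBITS]] <= 2'd0; end
        if (we1) begin bank1[wa1] <= wd1; owner[wa1[AW-1:EBITS]] <= 2'd1; end
        if (we2) begin bank2[wa2] <= wd2; owner[wa2[AW-1:EBITS]] <= 2'd2; end
    end

    // Read: look up the owning bank for this vector register, then select.
    wire [1:0] sel = owner[ra[AW-1:EBITS]];
    assign rd = (sel == 2'd0) ? bank0[ra] :
                (sel == 2'd1) ? bank1[ra] : bank2[ra];
endmodule
```

If two ports write elements of the same vector register, only the last writer's bank is visible afterwards, which is why the single-writer-per-vector assumption matters.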

Kolja Sulimma

Reply to
Kolja Sulimma

Somehow mentioning superscalar just told me your clock is headed downhill. How many LUT levels of logic do you expect and what is your target frequency? Is this a Uni, commercial or hobby project?

There are really good reasons to consider time driven ports.

A very fast n-cycle design can run at the limit of the BRAM, or a 16b adder, or about 3 LUT levels of logic, all of which are way faster than, say, a 32b add. Compared to a true simple 1-cycle design, this will use about half the total logic and still execute near 150MHz. Less logic is much easier to floorplan too.

In my processor design I get 4 effective ports out of 1 BRAM (regRR alternates with regW+fetchI) and that runs at +300MHz using 2 clocks per register opcode in V2Pro -5. The datapath combines 2 half 16b results, and the variable-length encoded instruction set uses time-based muxing to build opcodes rather than lots of mux arrays. The datapath has no register forwarding or hazard logic since the whole thing runs 4 threads. That's a whole lot of logic not there to slow things down. With 8 clocks per thread opcode, even DRAM cycles don't look so bad provided only 1 thread does a load/store every 16 cycles or so.

This is inspired by commutating latency-hiding DSP design principles rather than the desire to match current full-custom CPUs that try (and mostly fail) to get more than 1 opcode per clock. The real problem in computing is not how fast processors might crunch data, but the memory system's ability to feed them.

An earlier design that was straight 1 cycle used 3x the logic, 2x the BRAMs and still couldn't get anywhere near 300MHz/2 with all the side control logic stacking up.

Time driven logic will always run faster than parallel complex logic, but if you are prototyping or just studying comp architecture, clock performance doesn't really matter so much.

FPGAs are good for soft CPU design for true RISC in the John Cocke sense, not the OoO SS VLIW EPIC sense that brute-force transistor design makes possible.

John Jakson Transputer guy

Reply to
JJ

Thank you very much for your detailed explanation. I really appreciated it.

Reply to
fpga

Thank you very much, John. I want to implement a system that runs at 100MHz (clk) on a Xilinx Virtex-II -5 chip, the xc2v6000, and time multiplexing should be able to satisfy my requirement. I used the DCM to get a 300MHz (clkx3) signal from the CLKFX pin and used it as the clock for the dual-port RAM, which gives me 6 ports (dual x3), and that is enough for my application. By the way, the DCM in this chip can provide up to 320MHz and the BRAM can run at about 387MHz. (I am not sure about the BRAM frequency because I calculated it myself.) Now here is the new problem: after synthesis, clkx3 can achieve 287.9MHz, which I will try to improve to above 300MHz later, but clk can only achieve 72MHz. Shouldn't they have the relationship clkx3/clk=3, in which case I would get 287.9/3=96MHz? Is anything wrong here? Besides generating the clkx3 signal, my clk is only used to synchronize the output data from the BRAM. Actually, even if the only use of clk is to generate clkx3, XST and Synplify still give me the same timing result.
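
For readers following along, here is a rough Verilog sketch of the time-multiplexing idea on the write side: one physical write port clocked at three times the system clock serves three logical write ports in turn. It is not the poster's design. It assumes `clk3x` is exactly 3x the system clock and phase-aligned with it (as a DCM output would be), that the three write requests are held stable for the whole system clock cycle, and that `rst` is released so that phase 0 lines up with the system clock edge. All names are hypothetical, and only a single read port is shown.

```verilog
// Hypothetical time-multiplexed RAM: three logical write ports share one
// physical write port that runs at 3x the system clock.
module tdm_3w_ram_sketch #(
    parameter AW = 8,    // 256 entries
    parameter DW = 32
) (
    input  wire          clk3x,   // assumed 3x system clock, phase-aligned
    input  wire          rst,     // synchronous reset of the slot counter
    input  wire [2:0]    we,      // one write enable per logical port
    input  wire [AW-1:0] wa0, wa1, wa2,
    input  wire [DW-1:0] wd0, wd1, wd2,
    input  wire [AW-1:0] ra,
    output reg  [DW-1:0] rd
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    // Slot counter: 0, 1, 2, 0, 1, 2, ...
    reg [1:0] phase;
    always @(posedge clk3x) begin
        if (rst)
            phase <= 2'd0;
        else
            phase <= (phase == 2'd2) ? 2'd0 : phase + 2'd1;
    end

    // Select the logical port that owns the current slot.
    reg          wen;
    reg [AW-1:0] waddr;
    reg [DW-1:0] wdata;
    always @* begin
        case (phase)
            2'd0:    begin wen = we[0]; waddr = wa0; wdata = wd0; end
            2'd1:    begin wen = we[1]; waddr = wa1; wdata = wd1; end
            default: begin wen = we[2]; waddr = wa2; wdata = wd2; end
        endcase
    end

    // Single physical write port plus one synchronous read port.
    always @(posedge clk3x) begin
        if (wen && !rst) mem[waddr] <= wdata;
        rd <= mem[ra];
    end
endmodule
```

If two logical ports target the same address in one system clock cycle, the later slot simply overwrites the earlier one, so port 2 wins over port 1, which wins over port 0.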

Thank you very much for your kind suggestion.

Reply to
fpga

Hello, John. My target is to implement a 100MHz (clk) system on a Virtex-II xc2v6000-5 chip. I think time multiplexing can satisfy my requirement because, for this chip, the DCM can provide up to 320MHz and the BRAM can run at about 387MHz. (I am not sure about the BRAM value because I calculated it myself.) I used the DCM to get a 300MHz (clkx3) clock from the CLKFX pin and used it to get a 6-port RAM (dual-port x3), which is enough for my application. But another problem arises: clkx3 can achieve 287.9MHz, which I will try to improve to above 300MHz later, but clk can only achieve 72MHz. These numbers were obtained after synthesis. I would expect clk to be able to achieve 287.9/3=96MHz. Is anything wrong here? Thank you very much for your help again.

Reply to
fpga

To John_H: I am a beginner with these advanced concepts. I am also facing the same problem of multiple read ports and multiple write ports, but I could not understand the concept in your design. Please explain a little more. Thanks in advance. Sumesh

Reply to
vssumesh
