ROM resource sharing

Hello

How can on-chip ROM memory resources be shared in an FPGA? I implemented a discrete cosine transform core using the parallel distributed arithmetic approach, in which hardware multipliers are replaced by precomputed MAC results stored in LUT/ROM. A single ROM instance is 64x14 bits. The problem is that the ROM must be replicated many times to enable high throughput (9 times for the first DCT stage and 11 times for the 2nd stage after transposition). This ends up as more than 25 kbits of ROM memory in the core, which is pretty big. I know there are dual-port memories with dual read port capability, but this will 'only' halve the resources needed. Any better ideas?
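
For readers unfamiliar with the technique, here is a minimal Python sketch of a distributed arithmetic lookup (not the poster's actual core). The coefficients and samples are hypothetical; six coefficients are assumed only because they give a 64-entry table, matching the ROM depth mentioned above. Each bit plane of the inputs needs one lookup, so doing all lookups in the same clock cycle requires one ROM copy (or read port) per lookup.

```python
def build_da_rom(coeffs):
    """ROM entry at address a holds the sum of coeffs[i] for every set bit i of a."""
    n = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << n)
    ]


def da_dot(rom, samples, sample_bits):
    """Evaluate sum(coeffs[i] * samples[i]) for unsigned samples
    using only table lookups, shifts and adds."""
    acc = 0
    for b in range(sample_bits):
        # Gather bit b of every sample into one ROM address.
        addr = 0
        for i, x in enumerate(samples):
            addr |= ((x >> b) & 1) << i
        acc += rom[addr] << b  # weight the partial sum by 2**b
    return acc


# Hypothetical integer coefficients; six of them give a 64-entry ROM.
coeffs = [3, -5, 7, 2, -1, 4]
rom = build_da_rom(coeffs)
samples = [10, 200, 33, 7, 255, 128]  # hypothetical unsigned 8-bit inputs

direct = sum(c * x for c, x in zip(coeffs, samples))
print(len(rom), da_dot(rom, samples, 8) == direct)
```

The sketch handles unsigned inputs only; signed inputs would need the sign bit plane subtracted instead of added.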

Michal

Reply to
mikel

You have 20 different addresses for the 20 replications, correct? Which FPGA family are you using?

Reply to
John_H

John,

Actually, the LUT/ROM is replicated twice as many times as I said before (18 times in the 1st stage, 22 times in the 2nd stage). The synthesis tool was smart enough to reduce the ROM size from 35840 bits to 25600 bits (there are a few identical values inside every ROM, so the tool added decoding logic on the input address to reduce memory size). But this is still too much.
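
One way to picture that kind of reduction is the sketch below. It is an assumption about the general idea, not a description of what any particular synthesis tool does: duplicate ROM words are stored once, and a small address map stands in for the added decoding logic. The ROM contents are hypothetical.

```python
def dedup_rom(rom):
    """Split a ROM into a table of unique words plus an address-to-index map."""
    unique = sorted(set(rom))
    index_of = {value: i for i, value in enumerate(unique)}
    addr_map = [index_of[value] for value in rom]
    return unique, addr_map


WIDTH = 14  # bits per ROM word, as in the thread

# Hypothetical 64-entry ROM containing many repeated values.
rom = [(a * a) % 23 for a in range(64)]
unique, addr_map = dedup_rom(rom)

# Reading through the map returns the original contents.
rebuilt = [unique[i] for i in addr_map]

print("original bits:", len(rom) * WIDTH)
print("stored bits:  ", len(unique) * WIDTH)
```

The address map is not free: in hardware it becomes extra logic in front of the smaller memory, so the saving only pays off when enough values repeat.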

Yes, I have a different address for every ROM access, and I need to access all ROMs in the same clock cycle for performance.

I want the design to be generic, though I ordered a Virtex-II Pro board from Digilent, so this will be my target.

Michal K

Reply to
mikel

If you have 40 different 6-bit addresses for 40 different 64x14 ROMs, I don't see how you can do better than 40 instances * 4 LUTs * 14 bits = 2240 LUTs (or 280 CLBs in your current architecture). Implementing each ROM with fewer than 4 LUTs per bit would be possible for some 6-input, 1-output functions.
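
The arithmetic behind that estimate can be checked with a few lines of Python. The sketch assumes 4-input LUTs holding 16 bits each and 8 LUTs per CLB (4 slices of 2 LUTs, as in the Virtex-II Pro family); the constant names are only illustrative.

```python
ROM_DEPTH = 64     # entries per ROM
ROM_WIDTH = 14     # bits per entry
INSTANCES = 40     # 18 first-stage + 22 second-stage copies
LUT_BITS = 16      # a 4-input LUT stores 16 bits
LUTS_PER_CLB = 8   # assumption: 4 slices x 2 LUTs per CLB

luts_per_bit = ROM_DEPTH // LUT_BITS                # LUTs per output bit
total_luts = INSTANCES * luts_per_bit * ROM_WIDTH   # LUTs for all copies
total_clbs = total_luts // LUTS_PER_CLB
total_bits = INSTANCES * ROM_DEPTH * ROM_WIDTH      # raw ROM bits before optimization

print(luts_per_bit, total_luts, total_clbs, total_bits)
```

The raw bit count agrees with the 35840 bits reported before the synthesis tool's optimization.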

Each ALM in the Stratix-II series (roughly equivalent to a Xilinx slice, but with twice the LUT size) can provide a 64x1 ROM.

You could use a BlockRAM to provide 2 ports of 14 bits each (up to 36 bits available) to displace 56 LUTs each. The 4.5 kbit Altera M4K blocks would be more "efficient" since only 64 entries are needed in your application and there are typically many more M4K blocks than BlockRAMs in equivalent Altera vs. Xilinx devices.

It's quite possible you could time-multiplex your 14-bit lookups at 2x, 3x, even 4x your main design speed since the ROM lookup time as implemented in distributed CLB SelectRAM is one LUT plus MUXF5 plus MUXF6, roughly less than 2 levels of logic in a pipelined implementation.
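
A rough Python model of the time-multiplexing idea, offered as a sketch rather than a timing analysis: it assumes every lookup reads the same ROM contents and that the ROM can really be clocked at the chosen multiple. The function names and the stand-in ROM contents are hypothetical.

```python
import math


def rom_copies(lookups, ports_per_rom=1, clock_multiple=1):
    """Physical ROMs needed when each port serves clock_multiple lookups per system clock."""
    return math.ceil(lookups / (ports_per_rom * clock_multiple))


def time_multiplexed_read(rom, addresses, clock_multiple):
    """Serve all addresses within one system clock using fast-clock time slots."""
    copies = rom_copies(len(addresses), clock_multiple=clock_multiple)
    results = [None] * len(addresses)
    for slot in range(clock_multiple):      # one pass per fast-clock cycle
        for copy in range(copies):          # all copies read in parallel
            idx = copy * clock_multiple + slot
            if idx < len(addresses):
                results[idx] = rom[addresses[idx]]
    return results, copies


rom = [(7 * a + 3) % 16384 for a in range(64)]    # stand-in 64x14 contents
addresses = [(5 * i) % 64 for i in range(40)]     # 40 different lookups

values, copies = time_multiplexed_read(rom, addresses, clock_multiple=4)
print(copies, values == [rom[a] for a in addresses])
print(rom_copies(40, ports_per_rom=2), rom_copies(40, ports_per_rom=2, clock_multiple=4))
```

The last line combines the two suggestions: dual-port memories alone halve the count, and a faster ROM clock divides it further.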

The bottom line is that you have to pull out 40 unique 14-bit values. If there is no convenient way to reduce the uniqueness, the replication has to be there.

What does help is that each LUT or LE can give you 16 bits of ROM. Each ALM can give you 64 bits of ROM.

Reply to
John_H
