ROM resource sharing

How can on-chip ROM resources be shared in an FPGA? I have implemented
a discrete cosine transform core using the parallel distributed
arithmetic approach, in which hardware multipliers are replaced by
precomputed MAC results stored in LUT/ROM. A single ROM instance is
64x14 bits. The problem is that the ROM must be replicated many times to
achieve high throughput (9 times for the first DCT stage and 11 times
for the second stage after transposition). The core ends up with more
than 25 kbits of ROM, which is pretty big. I know there are dual-port
memories with two read ports, but that will 'only' halve the resources
needed. Any better ideas?
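
For readers unfamiliar with distributed arithmetic, here is a minimal
Python sketch of the idea described above, not the poster's actual core.
The six coefficients are hypothetical; six were chosen only so that the
table has 64 entries like the 64-word ROM in the question. The names
`COEFFS`, `build_da_rom` and `da_dot` are made up for this illustration.
Each ROM entry holds the sum of the coefficients selected by the address
bits, and a dot product is formed by looking up one bit plane of the
inputs at a time.

```python
# Hypothetical coefficients (assumption): six of them give a 2**6 = 64 entry ROM.
COEFFS = [91, 75, 50, 18, -35, 64]


def build_da_rom(coeffs):
    """Entry `addr` holds the sum of coeffs[k] for every bit k set in addr."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_dot(rom, xs, width):
    """Dot product of the ROM's coefficients with signed `width`-bit inputs.

    One ROM lookup is made per bit plane. The most significant bit plane
    has negative weight because the inputs are two's complement.
    """
    acc = 0
    for b in range(width):
        # Gather bit b of every input into one ROM address.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        term = rom[addr] << b
        acc += -term if b == width - 1 else term
    return acc


rom = build_da_rom(COEFFS)
xs = [12, -7, 100, -128, 127, 0]          # signed 8-bit sample inputs
direct = sum(c * x for c, x in zip(COEFFS, xs))
print(da_dot(rom, xs, 8) == direct)
```

In this sketch the lookups happen one after another in a loop. In a
parallel hardware version the lookups for different bit planes would
presumably happen in the same clock cycle, each needing its own read
port, which is where the replication comes from.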


Re: ROM resource sharing
You have 20 different addresses for the 20 replications, correct?
Which FPGA family are you using?

Re: ROM resource sharing

Actually, the LUT/ROM is replicated twice as many times as I said before
(18 times in the first stage, 22 times in the second). The synthesis
tool was smart enough to reduce the total ROM size from 35840 bits to
25600 bits (every ROM contains a few identical values, so the tool added
decoding logic on the input address to reduce the memory size). But
this is still too much.

Yes, I have a different address for every ROM access, and I need to
access all ROMs in the same clock cycle for performance.

I want the design to be generic, though I have ordered a Virtex-II Pro
board from Digilent, so that will be my target.

Michal K

Re: ROM resource sharing
If you have 40 different 6-bit addresses for 40 different 64x14 ROMs, I
don't see how you can do better than 40 instances * 4 LUTs * 14 bits =
2240 LUTs (or 280 CLBs in your current architecture).  Implementing a ROM
with fewer than 4 LUTs per bit would be possible for some 6-input,
1-output functions.
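
The arithmetic above can be checked with a short Python sketch. The
constant names are made up for this illustration, and the figure of 8
LUTs per CLB is an assumption inferred from the 2240 LUT / 280 CLB ratio
quoted above.

```python
INSTANCES = 40      # ROM copies, one per simultaneous lookup
DEPTH = 64          # words per ROM (6-bit address)
WIDTH = 14          # bits per word
LUT_BITS = 16       # a 4-input LUT stores 16 x 1 bit
LUTS_PER_CLB = 8    # assumption: inferred from 2240 LUTs = 280 CLBs

raw_bits = INSTANCES * DEPTH * WIDTH      # total ROM bits before optimization
luts_per_output_bit = DEPTH // LUT_BITS   # 64 entries need 4 LUTs per output bit
total_luts = INSTANCES * luts_per_output_bit * WIDTH
total_clbs = total_luts // LUTS_PER_CLB

print(raw_bits, total_luts, total_clbs)   # 35840 2240 280
```

The 35840-bit total matches the unoptimized figure reported earlier in
the thread.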

Each ALM in the Stratix-II series (roughly equivalent to a Xilinx slice,
but with twice the LUT size) can provide a 64x1 ROM.

You could use a BlockRAM to provide 2 ports of 14 bits each (up to 36 bits
are available), with each port displacing 56 LUTs.  The 4.5 kbit Altera M4K
blocks would be more "efficient" since only 64 entries are needed in your
application and there are typically many more M4K blocks than BlockRAMs in
equivalent A vs X

It's quite possible that you could time-multiplex your 14-bit lookups at
2x, 3x, or even 4x your main design speed, since the ROM lookup path as
implemented in distributed CLB SelectRAM is one LUT plus MUXF5 plus MUXF6,
roughly under 2 levels of logic in a pipelined implementation.
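
As a rough illustration of what time-multiplexing would save, the Python
sketch below counts the physical ROM copies needed when each copy serves
several lookups per main clock cycle. The function name `roms_needed`
and its parameters are hypothetical, and the LUT counts ignore the extra
multiplexers and registers that sharing would require.

```python
import math


def roms_needed(lookups, mux_factor, ports=1):
    """Physical ROM copies needed to serve `lookups` reads per main clock.

    Each copy serves `ports` reads per fast clock, and the fast clock runs
    `mux_factor` times per main clock cycle.
    """
    return math.ceil(lookups / (mux_factor * ports))


LUTS_PER_ROM = 4 * 14   # 64x14 ROM built from 4-input LUTs (from the post above)

for factor in (1, 2, 3, 4):
    copies = roms_needed(40, factor)
    print(factor, copies, copies * LUTS_PER_ROM)
# 1 40 2240
# 2 20 1120
# 3 14 784
# 4 10 560
```

With a dual-port memory, `roms_needed(40, 2, ports=2)` would give 10
copies at only 2x the main clock, under the same simplifying assumptions.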

The bottom line is that you have to pull out 40 unique 14-bit values.  If
there is no convenient way to reduce the uniqueness, the replication has to
be there.

What does help is that each LUT or LE can give you 16 bits of ROM.  Each ALM
can give you 64 bits of ROM.
