How do I meet this memory I/O requirement with the least resources on an FPGA?

- G Iveco
posted
16 years ago

Sat, Nov 3, 2007 4:44 PM

Hi, there

My design needs a 16x16 matrix of 32-bit elements. The matrix must be read row by row or column by column, with each full row or column read in one clock cycle.

A direct register implementation takes a lot of resources, and routing can be difficult. Would 256 separate 32-bit RAMs, each with a single address, work in a Xilinx device?

TIA!

- KJ
posted
16 years ago

Sat, Nov 3, 2007 5:58 PM

Yes. You write the code to implement whatever logic function you require.

Start writing some code and simulating it until it's functionally working the way you want it to. As a BACKGROUND task, start running your code through the synthesis process and look at how what you've written is being implemented and what sort of clock cycle performance you can expect. If it's not to your liking, then start looking for other, more elegant ways of implementing your logic, but don't get so focused on the synthesis task that you forget the primary task, which is to get functionally correct code.

KJ

- G Iveco
posted
16 years ago

Sat, Nov 3, 2007 7:08 PM

Thank you, KJ. I understand that in large designs simulation and synthesis should proceed concurrently to make sure the design passes both steps.

But my question is about memory-based systems. For a very large memory, a register implementation takes N times more silicon than RAM.

For a small memory, RAM has overheads such as read/write control, sensing, and amplifiers, which may be equivalent to a few hundred registers in terms of silicon and power.

So, in the second case, how large is this RAM overhead compared to a 32-bit register in a Xilinx device?

If there are good comparisons, then I can skip the trouble of testing.

- Nico Coesel
posted
16 years ago

Sat, Nov 3, 2007 7:33 PM

I'd suggest using the LUT RAMs as much as possible. Look in the datasheet. AFAIK Xilinx is one of the few FPGA vendors that has RAM in the logic slices. If you use these LUT RAMs in a smart way, you can cram many times more logic into a device with LUT RAM than into a device without it.

--
Reply to nico@nctdevpuntnl (punt=.)
Find companies and shops at www.adresboekje.nl

- KJ
posted
16 years ago

Sat, Nov 3, 2007 8:25 PM

But it really doesn't matter. When you follow the proper template, your code can be synthesized to use internal RAM or LUTs. That decision will be made by the synthesis tool. So look up the form of VHDL that will infer memory, write your code in that fashion, avoid the use of wizards and such, and your code will synthesize to fit into the resources that are on the chip. It makes no difference whether the memory gets implemented in logic cells or memory arrays as long as it

- implements the intended function

- meets the performance requirements

- fits in the targeted device.

Testing which 'method' is better is pointless. Write code that can be inferred properly for the targeted part and leave the rest for the tools to implement.

KJ

- Peter Alfke
posted
16 years ago

Sun, Nov 4, 2007 3:32 AM

Here is my best-case estimate:

You obviously need 16 x 32 = 512 parallel outputs. In Virtex-5, each LUT can be used with 5 address bits and 2 outputs (32 x 2 RAM). That means you need 256 LUTs = 32 CLBs. And nothing else.

This optimized packing requires that the software is smart enough to configure the LUTs appropriately. Worst-case, that is not yet the case, and you need 64 CLBs total. And nothing else.

Even the small 'LX50 has 3600 CLBs total (but not all of them can be used as memory).

Just so you can have an educated guess. Let the software do the crunching...

Peter Alfke
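The arithmetic behind this estimate can be reproduced with a short script. This is only a sketch: the figure of 8 LUTs per Virtex-5 CLB is an assumption taken from the family architecture (2 slices of 4 LUTs each), not something stated in the post, and the function name `clbs_needed` is made up for illustration.

```python
# Back-of-the-envelope check of the Virtex-5 LUT RAM estimate.
# Assumption (not from the post): a Virtex-5 CLB holds 8 LUTs.

ROW_ELEMENTS = 16   # elements delivered per clock (one row or one column)
ELEMENT_BITS = 32   # bits per matrix element
LUTS_PER_CLB = 8    # assumed: 2 slices x 4 LUTs


def clbs_needed(outputs_per_lut: int) -> int:
    """CLBs needed when each LUT RAM drives `outputs_per_lut` output bits."""
    parallel_outputs = ROW_ELEMENTS * ELEMENT_BITS  # 16 x 32 = 512
    luts = parallel_outputs // outputs_per_lut
    return luts // LUTS_PER_CLB


best_case = clbs_needed(outputs_per_lut=2)   # LUT packed as a 32 x 2 RAM
worst_case = clbs_needed(outputs_per_lut=1)  # only one output bit per LUT
print(best_case, worst_case)
```

Under those assumptions the script gives 32 CLBs for the packed case and 64 CLBs when the tools use only one output per LUT, matching the figures quoted above.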

- G Iveco
posted
16 years ago

Sun, Nov 4, 2007 5:03 AM

Thank you nico and Alfke.

I tried coding the registers in RTL, estimated the hardware requirement, and found that my gigantic math module can fit in a Virtex 2 V3000. It's old technology, though.

In Virtex 4, will the XC4VLX40 be able to handle this? The documentation for the two families differs, so only the slice count can be used as a reference.

IOs are no issue here.

- evilkidder
posted
16 years ago

Sun, Nov 4, 2007 9:52 AM

It may be possible to implement this structure using RAMs, and thus far fewer resources, but that will depend on a couple of things.

First, does each 16x16 matrix need to be accessed using both row and column addressing? Second, how is data written into the matrix: do entire rows/columns need to be written in a single cycle as well?

The first requirement is easy to deal with. The second one, in conjunction with the first, makes life difficult.

If you can handle writing data in one element at a time, then you can construct a pair of RAM structures, one of which handles column addressing while the other handles row addressing. Both get written with the same data.
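The two-RAM arrangement described above can be sketched as a behavioral model. This is not synthesizable code and not necessarily how the poster would build it; the class name `DualViewMatrix` and the bank layout (16 narrow RAMs per structure, each 16 words deep) are assumptions chosen for illustration.

```python
N = 16                 # matrix dimension
WORD_MASK = 0xFFFFFFFF  # 32-bit elements


class DualViewMatrix:
    """Behavioral model: two RAM structures holding the same matrix."""

    def __init__(self) -> None:
        # Row structure: bank index = column, address = row.
        self.row_banks = [[0] * N for _ in range(N)]
        # Column structure: bank index = row, address = column.
        self.col_banks = [[0] * N for _ in range(N)]

    def write(self, row: int, col: int, value: int) -> None:
        """Write one element; both structures receive the same data."""
        value &= WORD_MASK
        self.row_banks[col][row] = value
        self.col_banks[row][col] = value

    def read_row(self, row: int) -> list[int]:
        """All row banks share one address, so a row arrives in one access."""
        return [self.row_banks[col][row] for col in range(N)]

    def read_col(self, col: int) -> list[int]:
        """All column banks share one address, so a column arrives in one access."""
        return [self.col_banks[row][col] for row in range(N)]


m = DualViewMatrix()
for r in range(N):
    for c in range(N):
        m.write(r, c, r * N + c)  # element value encodes its position

print(m.read_row(3))
print(m.read_col(5))
```

Each read applies a single shared address to 16 banks in parallel, which is what lets a full row or column come out at once. The cost is storing the matrix twice and filling it one element per write.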

Thanks, Andy.

- Peter Alfke
posted
16 years ago

Sun, Nov 4, 2007 4:46 PM

Easy. In Virtex-4, you again need 512 outputs, each driven by a 32-bit RAM. In Virtex-4, each 32 x 1 RAM consists of two LUTs plus a free multiplexer. Call out RAM32x1S as shown in the "CLB Overview" fig 5-6 on page 219 of the Virtex-4 Handbook.

That consumes 1024 LUTs, or 128 CLBs. The LX40 has 128 x 36 CLBs, which is 36 times more than you need.

Peter Alfke, Xilinx Applications
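For readers who want to check the Virtex-4 numbers, here is the same calculation as a short script. The figure of 8 LUTs per Virtex-4 CLB is an assumption taken from the family architecture (4 slices of 2 LUTs each); the other values come from the post.

```python
# Virtex-4 estimate: one 32 x 1 RAM (two LUTs) per output bit.
OUTPUT_BITS = 16 * 32   # 512 parallel outputs
LUTS_PER_RAM = 2        # from the post: each 32 x 1 RAM uses two LUTs
LUTS_PER_CLB = 8        # assumed: 4 slices x 2 LUTs

luts = OUTPUT_BITS * LUTS_PER_RAM   # LUTs consumed by the matrix
clbs = luts // LUTS_PER_CLB         # CLBs consumed by the matrix
lx40_clbs = 128 * 36                # CLB array size quoted for the LX40
headroom = lx40_clbs // clbs        # how many times over the LX40 covers it
print(luts, clbs, lx40_clbs, headroom)
```

This counts raw CLBs only. As noted earlier in the thread for Virtex-5, not every CLB can necessarily be used as memory, so the real margin may be smaller.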

- Alvin Andries
posted
16 years ago

Mon, Nov 5, 2007 12:53 AM