FPGA space estimate

- Roger Bourne

Thu, Apr 20, 2006 6:35 PM

Hello all,

I would like some feedback:

I am planning an FPGA design that has four cascaded 2nd-order IIR filters. The question/feedback/advice I am seeking is the following:

To what resolution can I have the input and output databuses of the IIRs? Assume there is nothing else but the IIRs in the FPGA.

P.S. The FPGA is a Spartan-3 (400k gates).

I made a rough estimate. I would need:
- ~800-1000 FFs (there is a total of 8k)
- ~14 16-bit adders (I do not know the total available)
- ~8 18x18 dedicated multipliers (there is a total of 16)
- a whole bunch of muxes; I estimate ~2000 4:1 muxes/demuxes

The above bunch of logic is for:
- 4 2nd-order IIRs
- a 16-bit input databus for each IIR
- a 16-bit output databus for each IIR
- 64-bit feedforward and feedback coefficients for each IIR
- an input DC gain of 2^12 for each IIR
- one, and only one, 96-bit adder responsible for all the sums
- one, and only one, 27x64-bit multiplier responsible for all the multiplications

The adder and the multiplier will run at a much higher frequency than the sample rate, which lets them do all the operations for all the IIRs. The sample rate is 1 MHz. I am assuming that the sample rate can be multiplied up by a factor of at least 50, which would give at LEAST 1 cycle/operation. There are 20 sums and 20 multiplications to be done per sample period.
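
The cycle budget described above can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and it assumes the shared adder and multiplier each finish one operation per clock, which the post does not state.

```python
# Cycle budget for the time-shared adder and multiplier described in the post.
SAMPLE_RATE_HZ = 1_000_000   # 1 MHz sample rate (from the post)
CLOCK_MULTIPLIER = 50        # minimum clock multiplication factor (from the post)
SUMS_PER_SAMPLE = 20         # from the post
MULTS_PER_SAMPLE = 20        # from the post

clock_hz = SAMPLE_RATE_HZ * CLOCK_MULTIPLIER
cycles_per_sample = clock_hz // SAMPLE_RATE_HZ
total_ops = SUMS_PER_SAMPLE + MULTS_PER_SAMPLE

# Worst case: sums and multiplications never overlap, so all 40 are serialized.
cycles_per_op_serial = cycles_per_sample / total_ops
# Best case (assumption): adder and multiplier work in parallel, 20 slots needed.
cycles_per_op_parallel = cycles_per_sample / max(SUMS_PER_SAMPLE, MULTS_PER_SAMPLE)

print(f"clock: {clock_hz / 1e6:.0f} MHz, {cycles_per_sample} cycles per sample")
print(f"serialized: {cycles_per_op_serial:.2f} cycles/op")
print(f"overlapped: {cycles_per_op_parallel:.2f} cycles/op")
```

Even fully serialized, the 50x clock leaves slightly more than one cycle per operation, which matches the "at LEAST 1 cycle/operation" figure.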

Hence, I arrived at the conclusion that such a digital filter design will take ~25% of the space of the FPGA. Does this sound accurate? However, I do not know how to account for routing overhead.

I would appreciate citations of previous projects and what percentage of the FPGA they occupied.

Thx in advance

-Roger

- John_H

Thu, Apr 20, 2006 7:27 PM

You could reduce your resource requirements significantly by implementing a multi-channel, multi-stage mechanism that manipulates your data and coefficients through one BlockRAM - eliminating most of the multiplexers - and pipelines some of the operations such as the multiply to use fewer resources overall.

For these kinds of things, a little pseudocode and a spreadsheet can help to visualize how to break up the problem and verify the solution.

Are you looking specifically for a tiny solution?

- JJ

Thu, Apr 20, 2006 7:36 PM

That seems reasonable in terms of HW resources, but I would throw in a guard of at least another 50% until you have done an actual synthesis with P/R. For most data paths, even hand placed, I usually see that 1/3 of the resources can't be used because of placement conflicts etc. So for N known flops used, add at least another 20% which can't be used. You will want to pipeline the really wide 96-bit adder and the 64-bit multiplier, and that adds many flops. YMMV
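
To see what that guard does to the numbers, here is a minimal Python sketch. The counts come from the original post and the 50% margin from the reply above; the function name and the choice to apply the guard uniformly are assumptions made for illustration.

```python
# Resource estimate with a guard margin applied before real synthesis numbers exist.
FFS_USED = 1000        # upper end of the ~800-1000 FF estimate (from the post)
FFS_AVAILABLE = 8000   # "8k" total FFs (from the post)
OVERALL_ESTIMATE = 0.25  # poster's ~25% of the device
GUARD = 0.50           # extra 50% suggested until synthesis with P/R is done

def utilization(used, available, guard=0.0):
    """Fraction of a resource consumed, inflated by the guard margin."""
    return used * (1.0 + guard) / available

ff_raw = utilization(FFS_USED, FFS_AVAILABLE)
ff_padded = utilization(FFS_USED, FFS_AVAILABLE, GUARD)
overall_padded = OVERALL_ESTIMATE * (1.0 + GUARD)

print(f"flip-flops: {ff_raw:.1%} raw, {ff_padded:.1%} with guard")
print(f"overall:    {OVERALL_ESTIMATE:.1%} raw, {overall_padded:.1%} with guard")
```

Pipeline registers for the wide adder and multiplier are not counted here, so the flip-flop figure is likely optimistic.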

John Jakson transputer guy

- Roger Bourne

Fri, Apr 21, 2006 3:46 PM

I am looking for a solution that fits in the FPGA. Tiny? Not really, as long as everything fits.

BlockRAM. Great idea! I checked the timing specs of the BlockRAM module, and it seems pretty fast: 1 clock cycle to write and 1 clock cycle to read, with a max frequency of ~160 MHz. No need for a complex multiplexing network. In fact, there is no need for delay elements (FFs) altogether!

However, I have never used RAM on an FPGA (that is the reason I did not initially lean towards that solution). Is there some obvious, flagrant, blatant drawback to using RAM instead of FFs? Especially since there are 36 times more RAM bits than available FFs (288K vs 8K). And in RAM, ALL the bits can be used!
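
A rough storage budget shows how little of that RAM the filters would need. This is a sketch under stated assumptions: five coefficients per 2nd-order section (consistent with the 20 multiplications per sample quoted earlier), a direct-form-I layout with four delay elements per section, and state kept at the full 96-bit adder width. None of these layout choices come from the thread.

```python
# Rough block RAM budget for coefficients and filter state.
BRAM_BITS = 288 * 1024       # "288K" bits from the post, taken here as 288 * 1024
NUM_SECTIONS = 4             # four 2nd-order IIR sections (from the post)
COEFFS_PER_SECTION = 5       # assumption: b0, b1, b2, a1, a2
COEFF_BITS = 64              # coefficient width (from the post)
DELAYS_PER_SECTION = 4       # assumption: 2 input delays + 2 output delays
STATE_BITS = 96              # assumption: state stored at the adder width

coeff_storage = NUM_SECTIONS * COEFFS_PER_SECTION * COEFF_BITS
state_storage = NUM_SECTIONS * DELAYS_PER_SECTION * STATE_BITS
total_bits = coeff_storage + state_storage

print(f"coefficients: {coeff_storage} bits")
print(f"state:        {state_storage} bits")
print(f"total:        {total_bits} bits, {total_bits / BRAM_BITS:.2%} of block RAM")
```

Under these assumptions capacity is not the constraint; the width and number of RAM ports, and how many words can be read per cycle, matter more.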

According to the timing waveform in the specs, it only requires 1 cycle for a read and 1 cycle for a write, so I do not think lost cycles between data transfers will be an issue, especially if the data rate is ~150 times slower than the fastest clock available. The module that performs the multiplication can thus be time-multiplexed.

It sounds like working on a DSP rather than an FPGA, if one forgoes the use of FFs... :-)

-Roger

- Austin Lesea

Fri, Apr 21, 2006 4:01 PM

Roger,

Depending on speed, using an FPGA can be exactly like using a DSP.

The use of the BRAM basically means you are building a custom DSP machine, which will 'execute' a fixed program (based on a FSM), manipulating the BRAM contents much like a DSP would bring operands in and out of the ALU, to and from memory.
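
The idea of a fixed program stepping operands through one memory and one MAC can be modelled in a few lines of Python. This is a toy sketch, not a hardware description: the memory map, the `PROGRAM` table and `run_sample` are hypothetical names, it handles a single 2nd-order section, and it uses small integer coefficients so the arithmetic stays exact.

```python
# Toy model of a fixed-program MAC machine working out of one memory.
# Hypothetical memory map: addresses 0-4 hold b0, b1, b2, -a1, -a2;
# addresses 5-9 hold x[n], x[n-1], x[n-2], y[n-1], y[n-2].
COEFF_BASE, DATA_BASE = 0, 5

# The "program": one (coefficient address, data address) pair per MAC cycle.
PROGRAM = [(COEFF_BASE + i, DATA_BASE + i) for i in range(5)]

def run_sample(mem, x):
    """Process one input sample through one biquad using a single MAC."""
    mem[DATA_BASE] = x                       # load the new input sample
    acc = 0                                  # clear the accumulator
    for c_addr, d_addr in PROGRAM:           # the FSM steps through the program
        acc += mem[c_addr] * mem[d_addr]     # one multiply-accumulate per step
    # Shift the delay lines by writing back to memory.
    mem[DATA_BASE + 2] = mem[DATA_BASE + 1]  # x[n-2] <- x[n-1]
    mem[DATA_BASE + 1] = mem[DATA_BASE]      # x[n-1] <- x[n]
    mem[DATA_BASE + 4] = mem[DATA_BASE + 3]  # y[n-2] <- y[n-1]
    mem[DATA_BASE + 3] = acc                 # y[n-1] <- y[n]
    return acc

# b0=1, b1=2, b2=1, -a1=1, -a2=0, all state initially zero.
mem = [1, 2, 1, 1, 0] + [0] * 5
outputs = [run_sample(mem, x) for x in [1, 0, 0, 0]]
print(outputs)  # → [1, 3, 4, 4]
```

Extending this to four sections would mean a longer program table and more memory locations, while the MAC itself stays the same.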

Personally, if something this slow would work, you might consider MicroBlaze as the processor to make your life easier, and execute both program and data from BRAMs. That way the program (which may already be in C code) could remain in C code.

Or, alternatively, use a "real" DSP processor, as (let's be honest) the FPGA may be extreme overkill for what you may be doing.

If the speed there is just not fast enough, there may be hardened FFT filter structures that are serial, rather than parallel, which still may be fast enough (faster than a DSP), and yet use fewer resources (than a full parallel one). The SRL16s are particularly good at this, as you have up to 16 FFs for the SLICEs with SRLs/LUTRAM.

Remember that a parallel multiply may not be needed, and a serial multiplier may be a lot less hungry (for resources, overall).
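
For readers unfamiliar with the idea, a serial multiplier can be sketched as a shift-and-add loop that handles one multiplier bit per clock. The Python below is an illustrative model only (unsigned operands, hypothetical function name), not a description of any particular FPGA core.

```python
def serial_multiply(a, b, width):
    """Multiply unsigned a by unsigned b, consuming one bit of b per 'clock'.

    Only an adder and shifts are used, which is why the hardware version
    needs far less logic than a parallel multiplier.
    """
    if a < 0 or not 0 <= b < (1 << width):
        raise ValueError("operands must be unsigned and b must fit in width bits")
    product = 0
    for cycle in range(width):        # one loop iteration models one clock cycle
        if (b >> cycle) & 1:          # look at one multiplier bit per cycle
            product += a << cycle     # conditionally add the shifted multiplicand
    return product

print(serial_multiply(27, 64, 8))  # → 1728
```

The cost is time: a 64-bit operand takes 64 cycles per product in this form, so at 20 multiplications per 1 MHz sample a single serial unit would likely not keep up, and several small serial units would be traded against one large parallel one.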

Many extreme audio applications (see NAB conference) use serial processing of many audio streams at once on a single FPGA for a superb cost/performance point.

Finally, if the problem can be partitioned in time into more than one piece, I have seen people calculate part 1, store results in an external SRAM, reconfigure, and then read in last part and calculate part 2, store results in external SRAM, etc...

Austin

- John_H

Fri, Apr 21, 2006 4:19 PM

If you want a dedicated port to a controller to allow on-the-fly update of coefficient values, a dual-port RAM would implement the controller on one port and the data I/O on the other. If you have a fixed configuration, you can dedicate one port for read, one for write, and your data can flow at the full 320 MHz BlockRAM rate. Dual-ports are great.

Initializing BlockRAM contents always seems a little tough with the synthesis and simulation tools never quite making it practical to get everything flowing just right. If you look into the help or app notes from the various tools, you could have pre-initialized BlockRAMs for fixed coefficients to make life simpler.

For your application, this really *is* best implemented in a DSP mindset; you can keep your resources low (1 MAC) and maintain the values in a register file with limited I/O in your algorithm. Since you have 100x+ the sample rate to do your processing, the system flows beautifully. The only question for me would be how complex the state machine or microcode would need to be to have the system work beautifully without adding a generic processor like the MicroBlaze or similar. This is where prototyping with pseudo-code and an Excel spreadsheet get me to my results with a simple implementation.

For me, these kinds of tasks are great fun.

- Roger Bourne

Fri, Apr 21, 2006 4:51 PM

Pseudo-code??? What exactly do you mean by pseudo-code?

-Roger

- John_H

Fri, Apr 21, 2006 6:15 PM

Just writing down what steps you'd take to implement the code in your data path. It's helpful to "see" the data pipeline by looking at the steps and the loops to manipulate the data.
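
As an illustration of that kind of write-down, the Python sketch below lists the per-sample steps for one shared MAC. The tap names, the extra store step per section and `build_schedule` are assumptions for the example; the 4 sections with 5 products each match the 20 multiplications per sample mentioned earlier in the thread.

```python
# Write out the per-sample steps for a single shared MAC.
NUM_SECTIONS = 4
TAPS = ["b0*x[n]", "b1*x[n-1]", "b2*x[n-2]", "-a1*y[n-1]", "-a2*y[n-2]"]

def build_schedule():
    """Return a list of (cycle, description) steps for one sample period."""
    steps, cycle = [], 0
    for s in range(NUM_SECTIONS):
        for i, tap in enumerate(TAPS):
            # The first tap of a section loads the accumulator, the rest add to it.
            op = "acc  = " if i == 0 else "acc += "
            steps.append((cycle, f"section {s}: {op}{tap}"))
            cycle += 1
        # Assumed extra step: write the result back and hand it to the next section.
        steps.append((cycle, f"section {s}: store acc as y[n], pass to next section"))
        cycle += 1
    return steps

schedule = build_schedule()
for cycle, text in schedule:
    print(f"{cycle:2d}  {text}")
print(f"{len(schedule)} cycles used")
```

Laid out this way, the loop structure and the cycle count are visible at a glance, and the same table can be pasted into a spreadsheet to track memory addresses per step.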