**posted on**

May 3, 2005, 6:04 pm

I have been tasked with implementing an FFT algorithm in an FPGA/DSP architecture. The algorithm would be an N-point FFT with 1000 frequency bins. Each frequency bin would require a multiply by the constant e^jx followed by an accumulate, once every microsecond. That works out to 1000 multiply-accumulates happening in parallel every microsecond. Does anyone have experience doing something similar in an FPGA/DSP, and can you point me in the right direction for choosing an FPGA/DSP development board? Any help would be appreciated.

Re: Multiply Accumulate FPGA/DSP

1000 MACs in parallel ... that's a lot!

Come on, just storing all the accumulators in parallel, with something like a 48-bit accumulator each, would take 48000 registers ...

1 microsecond is 1000 ns, so the way to go is to have something like 20 units in parallel, each doing one MAC every 20 ns, which sounds a lot better. Then use a block RAM for the accumulators. Each block RAM would have to "remember" 50 accumulators, which is not a problem.
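The arithmetic behind that trade-off can be sketched in a few lines of Python. This is only a back-of-the-envelope calculator, not a design tool; the function name `mac_schedule` and its parameters are made up for illustration, and it assumes the 1000 MACs split evenly across the units.

```python
def mac_schedule(total_macs, period_ns, units):
    """Return (accumulators_per_unit, ns_per_mac) for a time-multiplexed MAC array."""
    if total_macs % units:
        raise ValueError("units must divide total_macs evenly")
    slots = total_macs // units      # accumulators each unit has to service
    cycle_ns = period_ns / slots     # time budget for a single MAC in that unit
    return slots, cycle_ns


# 1000 MACs must complete within each 1000 ns sample period.
for units in (20, 10, 5):
    slots, cycle_ns = mac_schedule(1000, 1000, units)
    print(f"{units} units: {slots} accumulators each, "
          f"{cycle_ns:g} ns per MAC ({1000 / cycle_ns:g} MHz)")
```

Fewer units means less hardware but a shorter cycle time per MAC, so the choice comes down to how fast the multiplier and block RAM can realistically be clocked.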

Sylvain

Re: Multiply Accumulate FPGA/DSP

Or ten MACs in parallel every ten nanoseconds; I'm imagining a little circuit (two BRAMs, one multiplier) which reads the input, multiplies it by a constant read from one block RAM, and adds it to an accumulator in another, plus a sequencer over the block RAM locations, the whole thing replicated ten times in an XC3S1000 (dev. boards are $200 or so from www.xess.com).

Though you're using a complex multiplier, which is roughly four integer multipliers, so you might have difficulty with ten-fold replication in the 3S1000. From what I've read here, running with only five-fold replication, and so a cycle time of 5 ns, might require quite an elaborate design to reach sufficient speed; it might even be too fast for the multipliers.

Have another circuit on the other side which uses the other port on the accumulator BRAMs to read out the accumulated data when the time comes.
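As a rough behavioural model of one such lane (plain Python, not HDL, and not a tested design), something like the following captures the idea. The names `MacLane`, `process_sample` and `read_out` are hypothetical. It assumes one fixed constant per RAM address, as described above; a true DFT would need the coefficient to advance with the sample index, which is not modelled here.

```python
class MacLane:
    """One lane: a coefficient RAM, an accumulator RAM, and a single multiplier."""

    def __init__(self, coefficients):
        self.coeff_ram = list(coefficients)          # first block RAM: constants
        self.acc_ram = [0j] * len(self.coeff_ram)    # second block RAM: accumulators

    def process_sample(self, x):
        # The sequencer visits every RAM address once per input sample,
        # performing one multiply-accumulate per (modelled) clock cycle.
        for addr, c in enumerate(self.coeff_ram):
            self.acc_ram[addr] += x * c

    def read_out(self):
        # Stands in for the second BRAM port: fetch results, then clear
        # the accumulators ready for the next block of samples.
        results = list(self.acc_ram)
        self.acc_ram = [0j] * len(self.coeff_ram)
        return results


# Ten lanes of 100 accumulators each cover 1000 bins (placeholder constants).
lanes = [MacLane([1 + 0j] * 100) for _ in range(10)]
for sample in (3, 1, 2):
    for lane in lanes:
        lane.process_sample(sample)
print(sum(len(lane.read_out()) for lane in lanes))
```

In hardware the ten lanes would run concurrently; the nested loop here only mimics the result, not the timing.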

This is a back-of-an-envelope design; I'd be really happy if someone with actual FPGA experience could point out what's wrong with it.

Tom

Re: Multiply Accumulate FPGA/DSP

I just want to ask how you will enter your 1000 frequency points, and how many bits you are using to represent each one. I mean, if you have 8 bits per point then you need 8000 pins, which I think is too much for any FPGA available.

I think you mean 1000 analog inputs, which would also require some form of ADC and lead to the same problem.

Maybe you will enter them sequentially, which will take time to get them into the FPGA.

Best regards

Re: Multiply Accumulate FPGA/DSP

I need 1000 frequency "bins", where each bin is a discrete frequency. As Thomas Womack pointed out above, it is better defined as an N-point DFT with 1000 frequency bins, where N = 1024. For each sample, every microsecond, there are 24 bits of data; let's call that x(n). During that microsecond there must be 1000 MACs in parallel to calculate the N = 1024 DFT. This would happen for 1024 samples to calculate the N-point DFT. I hope that is a better description. Thanks for the input.
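To make the computation concrete, here is a minimal software sketch of that sample-by-sample DFT: each incoming x(n) is multiplied by one complex coefficient per bin and added to that bin's accumulator. The function name `streaming_dft` and the sign convention e^(-j2πkn/N) are assumptions for illustration. At full size (1000 bins, N = 1024) this is 1000 × 1024 = 1,024,000 complex MACs per transform.

```python
import cmath
import math


def streaming_dft(samples, n_points, n_bins):
    """Brute-force DFT: one MAC per bin for every incoming sample."""
    acc = [0j] * n_bins                      # one accumulator per frequency bin
    for n, x in enumerate(samples):          # one new sample per microsecond
        for k in range(n_bins):              # the MACs done "in parallel"
            acc[k] += x * cmath.exp(-2j * cmath.pi * k * n / n_points)
    return acc


# Small demonstration: an 8-point cosine at bin 2, computing only 5 bins.
N = 8
samples = [math.cos(2 * math.pi * 2 * n / N) for n in range(N)]
bins = streaming_dft(samples, n_points=N, n_bins=5)
print([round(abs(b), 6) for b in bins])
```

The energy should show up in bin 2 with the other bins near zero.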

Re: Multiply Accumulate FPGA/DSP

Bart,

Consider time / frequency as a third dimension. You have a certain job to do in a given time. Then look at the performance of your multipliers, registers, etc., and you find that they will work at several hundred MHz. Then get creative and do certain things sequentially, and other things in parallel. You have an enormous amount of creative freedom, and pipelining is essentially free in an FPGA.

Remember, any circuit that does not work close to its speed limit represents waste.

Peter Alfke

Re: Multiply Accumulate FPGA/DSP

Hi Peter,

Well, I've seen a fair share of 15-25 ns CPLD designs, filled 60% and running at 4 or 8 MHz. Sometimes applications can simply be slow. And developed, debugged and programmed in under an hour and a half. And, especially nowadays, without a smaller or slower part that is any cheaper.

But that's good, isn't it? It would be horrible if the lower end of the market couldn't take advantage of modern technology.

Best regards,

Ben

Re: Multiply Accumulate FPGA/DSP

Those of us in apps do tend to mostly see the corner cases: extreme speed, extreme size, trying to squeeze that last MHz out of the silicon while trying to shoehorn a few hundred extra lines of code into it, and so on. I'm getting the feeling that we tend to see the exceptions more than the rule.

In the last two years, with the introduction of Cyclone and (slightly less so) Spartan 3, I have seen the performance bar at the lower end of the spectrum raised considerably. The amount of performance and capacity that is available for under $10 nowadays is just amazing compared to three years ago.

It's a fun field we're working in.

Best regards,

Ben

Re: Multiply Accumulate FPGA/DSP

Peter, while this is true from a device utilization standpoint, there is also development time, life cycle costs, etc. to consider. For someone who is not well versed in the nuances, this sometimes significant cost can weigh in favor of a larger design clocked at a relatively slow clock.

--

--Ray Andraka, P.E.

President, the Andraka Consulting Group, Inc.

Re: Multiply Accumulate FPGA/DSP

Designing close to the limit is a nice idea. But unless the part has been completely and correctly characterized by the vendor, designing too close to its speed limit can be fatal. Having been burnt by speed files that changed for the worse after I'd completed a design, I now try to keep a healthy margin between my design requirements and the speed limit du jour.

Bob Perlman

Cambrian Design Works

Re: Multiply Accumulate FPGA/DSP

Maybe I misspoke. I meant to say that a circuit that runs at a fraction of its speed capability can be made to do multiple jobs sequentially. That obviously only applies when the designer runs the circuitry at half or quarter speed or less. Only then can you seriously think about time-sharing or time multiplexing.

It's good to have friends who watch over me :-)

Peter Alfke

Re: Multiply Accumulate FPGA/DSP

Bart, as others have pointed out, it sounds like you are doing a brute-force DFT. The FFT reduces the computations by exploiting symmetry present in the evenly spaced bins. Most FFTs are done with a variation of the Cooley-Tukey algorithm, which factors DFTs with a power-of-2 number of points by successively breaking the DFT into half-sized DFTs and combining the results with a phase rotation. Your post seems to indicate that you are looking instead for a 1000-point transform. You can either use a 1024-point FFT by padding the input data to fill out the size and accepting the slightly smaller bin size, or, if you need the 1000-point DFT, you can use some of the other FFT algorithms to arrive at a 1000-point transform. Either way, you'll greatly reduce the number of multiplies by using a Fast Fourier Transform instead of the DFT. The Smith and Smith book (Amazon.com product link shortened) provides pretty good coverage of the various FFT algorithms that you'd need for either approach. It is presented more from a software perspective than from hardware, but nevertheless it provides a comprehensive background to permit you to build a hardware implementation that is far more efficient than what you are proposing.
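For reference, the half-size decomposition described above can be sketched in software as a textbook recursive radix-2 decimation-in-time FFT with zero-padding to 1024 points. This is a plain illustration, not the structure of any particular FPGA core; the names `fft_radix2` and `zero_pad` are hypothetical and the e^(-j2πk/N) sign convention is an assumption. A radix-2 FFT of N = 1024 needs (N/2)·log2(N) = 5120 complex twiddle multiplies, against roughly a million for the brute-force approach.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])               # half-size DFT of even samples
    odd = fft_radix2(x[1::2])                # half-size DFT of odd samples
    out = [0j] * n
    for k in range(n // 2):
        # Phase rotation (twiddle factor) applied to the odd half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def zero_pad(x, n):
    """Append zeros so that the sequence has exactly n entries."""
    return list(x) + [0] * (n - len(x))


# Pad 1000 input samples out to 1024 before transforming.
data = zero_pad([1.0] * 1000, 1024)
spectrum = fft_radix2(data)
print(len(spectrum), round(abs(spectrum[0]), 6))
```

Padding changes the bin spacing from fs/1000 to fs/1024, which is the "slightly smaller bin size" mentioned above.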

The other point I should make is that you can use a process clock that is faster than your sample clock, which I think you said is only 1 MHz. Our FFT cores will run at over 300 MS/sec in current FPGA devices, and they don't use anywhere near the 1000 multiplies you are looking at.

--

--Ray Andraka, P.E.

President, the Andraka Consulting Group, Inc.
