Multiply Accumulate FPGA/DSP

I have been tasked with trying to implement an FFT algorithm in an FPGA/DSP architecture. The algorithm would be an N-point FFT with 1000 frequency bins. Each frequency bin would require a multiply by the constant e^jx and then an accumulate every 1 microsecond. This works out to 1000 multiply-accumulates happening in parallel every microsecond. Does anyone have experience doing something similar in an FPGA/DSP, and can they point me in the right direction as far as choosing an FPGA/DSP development board? Any help would be appreciated.
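For reference, the aggregate rate works out like this (quick Python arithmetic, nothing FPGA-specific):

# Back-of-the-envelope check of the aggregate MAC rate described above:
# 1000 bins, each needing one complex multiply-accumulate per 1-microsecond sample.
bins = 1000
sample_period_s = 1e-6

macs_per_second = bins / sample_period_s        # complex MACs per second
real_mults_per_second = macs_per_second * 4     # a complex multiply is roughly 4 real multiplies

print(f"{macs_per_second:.0e} complex MACs/s")              # 1e+09
print(f"{real_mults_per_second:.0e} real multiplies/s")     # 4e+09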

Reply to
bart

1000 MACs in parallel ... that's a lot! Come on, just to store all the accumulators in parallel, even with just 48-bit accumulators, that would be 48,000 registers ...

1 microsecond is 1000 ns, so the way to go is to have something like 20 units in parallel and do the job every 20 ns, which sounds a lot better. Then use a block RAM. Each block RAM would only have to "remember" 50 accumulators, which is not a problem.
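The partitioning arithmetic, if it helps (Python, just numbers; the unit counts are only examples):

# Split 1000 bins across a few MAC units; each unit cycles through its share
# of accumulators once per sample period.
BINS = 1000
SAMPLE_PERIOD_NS = 1000

for units in (10, 20, 50):
    bins_per_unit = BINS // units                  # accumulators held in each block RAM
    cycle_ns = SAMPLE_PERIOD_NS / bins_per_unit    # time budget per MAC in one unit
    clock_mhz = 1000 / cycle_ns
    print(f"{units:3d} units: {bins_per_unit:3d} accumulators/unit, "
          f"{cycle_ns:.0f} ns per MAC ({clock_mhz:.0f} MHz)")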

Sylvain

Reply to
Sylvain Munaut

Or ten MACs in parallel every ten nanoseconds; I'm imagining a little circuit (two BRAMs, one multiplier) which reads the input, multiplies it by a constant read from one block RAM, and adds it to an accumulator in another, plus a sequencer over the block RAM locations, the whole thing replicated ten times in an XC3S1000 (dev. boards are $200 or so from

formatting link )

Though you're using a complex multiplier, which is roughly four integer multipliers, so you might have difficulty with ten-fold replication in the 3S1000; and from what I've read here, running with only five-fold replication, i.e. a cycle time of 5 ns, might require quite an elaborate design to get sufficient speed; it might even be too fast for the multipliers.
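One possible mitigation, for what it's worth: the standard trick of doing a complex multiply with three real multiplies and five adds, trading a multiplier for adders. A quick Python check of the identity, in case I've mis-stated it:

# (a + jb) * (c + jd) with three real multiplies instead of four.
def complex_mult_3mul(a, b, c, d):
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2     # (real, imag)

# Check against the direct 4-multiply form.
a, b, c, d = 3, -2, 5, 7
assert complex_mult_3mul(a, b, c, d) == (a*c - b*d, a*d + b*c)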

Have another circuit the other side which uses the other port on the accumulator BRAMs to read out the accumulated data when the time comes.

This is a back-of-an-envelope design, I'd be really happy if someone with actual FPGA experience could point out what's wrong with it.
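Here's a quick behavioral model of the idea in Python (software only, not HDL), just to check the dataflow. It assumes the constant block RAM holds the N complex roots of unity addressed by (k*n) mod N, which is one way to do it but not the only one:

import cmath
import numpy as np

N = 1024                    # transform length
BINS = 1000                 # bins actually accumulated
UNITS = 10                  # replication factor suggested above
PER_UNIT = BINS // UNITS    # bins (accumulators) per unit

# Shared "constant" block RAM: the N roots of unity, addressed by (k*n) mod N.
roots = [cmath.exp(-2j * cmath.pi * m / N) for m in range(N)]
# Per-unit "accumulator" block RAMs.
acc = [[0j] * PER_UNIT for _ in range(UNITS)]

def process_sample(x, n):
    """One sample period: every unit steps through its PER_UNIT bins, one MAC each."""
    for u in range(UNITS):              # spatial replication in hardware
        for i in range(PER_UNIT):       # the per-unit sequencer
            k = u * PER_UNIT + i
            acc[u][i] += x * roots[(k * n) % N]

# Check the dataflow against numpy's FFT over one full block of N samples.
x = np.random.randn(N) + 1j * np.random.randn(N)
for n, xn in enumerate(x):
    process_sample(xn, n)
flat = np.array([acc[u][i] for u in range(UNITS) for i in range(PER_UNIT)])
assert np.allclose(flat, np.fft.fft(x)[:BINS])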

Tom

Reply to
Thomas Womack

I'd not call that an FFT; I'd call it a calculation of a thousand points of a DFT. It may well be possible to do it with less than one complex gigaMAC per second by using an FFT, but regrettably I'm not awake enough to remember how to do that filter transformation.

Tom

Reply to
Thomas Womack

Reply to
Symon

I just want to ask how you will enter your 1000 frequency pins, and how many bits you are using to represent your frequency points. I mean, if you have 8 bits per point then you need 8000 pins, which I think is too much for any FPGA available.

I think you mean 1000 analog inputs, which also requires some form of ADC, which leads to the same problem.

Maybe you will enter them sequentially, which will take time to get them into the FPGA.

Best regards

Reply to
ahosyney

bin, not pin.

--
 [100~Plax]sb16i0A2172656B63616820636420726568746F6E61207473754A[dZ1!=b]salax
Reply to
Tobias Weingartner

I need 1000 frequency "bins", where each bin is a discrete frequency. As Thomas Womack pointed out above, it is better described as an N-point DFT with 1000 frequency bins, where N = 1024. Each sample, arriving every microsecond, is 24 bits of data; let's call that x(n). During that microsecond there must be 1000 MACs in parallel to work toward the N = 1024 DFT. This would happen for 1024 samples to calculate the N-point DFT. I hope that is a better description. Thanks for the input.
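For what it's worth, here is the rough accumulator-sizing arithmetic I'm working from (Python; the 18-bit twiddle width is just my assumption, matching typical FPGA multiplier blocks):

import math

data_bits = 24      # sample width as described above
coeff_bits = 18     # assumed twiddle width
N = 1024            # samples accumulated per bin

product_bits = data_bits + coeff_bits      # full-precision product: 42 bits
growth_bits = math.ceil(math.log2(N))      # worst-case growth over N sums: 10 bits
acc_bits = product_bits + growth_bits      # 52 bits before any rounding

print(f"product: {product_bits} bits, accumulator: {acc_bits} bits")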

Reply to
bart

Bart, consider time/frequency as a third dimension. You have a certain job to do in a given time. Then look at the performance of your multipliers, registers, etc., and you will find that they work at several hundred MHz. Then get creative and do certain things sequentially and other things in parallel. You have an enormous amount of creative freedom, and pipelining is essentially free in an FPGA. Remember, any circuit that does not work close to its speed limit represents waste. Peter Alfke

Reply to
Peter Alfke

Hi Peter,

Well, I've seen my fair share of 15-25 ns CPLD designs, 60% full and running at 4 or 8 MHz. Sometimes applications can simply be slow. And developed, debugged and programmed in under an hour and a half. And, especially nowadays, without a smaller or slower part being any cheaper.

But, that's good, isn't it? It would be horrible if the lower end of the market couldn't take advantage of modern technology.

Best regards,

Ben

Reply to
Ben Twijnstra

Ben, that's the problem with glaring generalizations: there always are exceptions. Peter Alfke

Reply to
Peter Alfke

Hi Peter,

Those of us in apps do tend to see mostly the corner cases - extreme speed, extreme size, trying to squeeze that last MHz out of the silicon while trying to shoehorn a few hundred extra lines of code into it, etc. I'm getting the feeling that we see the exceptions more than the rule.

In the last two years, with the introduction of Cyclone and (slightly less so) Spartan-3, the performance bar at the lower end of the spectrum has been raised considerably. The amount of performance and capacity available for under $10 nowadays is just amazing compared to three years ago.

It's a fun field we're working in.

Best regards,

Ben

Reply to
Ben Twijnstra

Peter, while this is true from a device utilization standpoint, there are also development time, life-cycle costs, etc. to consider. For someone who is not well versed in the nuances, this sometimes significant cost can weigh in favor of a larger design running at a relatively slow clock.

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
Reply to
Ray Andraka

Designing close to the limit is a nice idea. But unless the part has been completely and correctly characterized by the vendor, designing too close to its speed limit can be fatal. Having been burnt by speed files that changed for the worse after I'd completed a design, I now try to keep a healthy margin between my design requirements and the speed limit du jour.

Bob Perlman Cambrian Design Works

Reply to
Bob Perlman

Maybe I misspoke. I meant to say that a circuit that runs at a fraction of its speed capability can be made to do multiple jobs sequentially. That obviously only applies when the designer runs the circuitry at half or quarter speed or less. Only then can you seriously think about time-sharing or time multiplexing.

It's good to have friends who watch over me :-) Peter Alfke

Reply to
Peter Alfke

Bart, as others have pointed out, it sounds like you are doing a brute-force DFT. The FFT reduces the computations by exploiting the symmetry present in the evenly spaced bins. Most FFTs are done with a variation of the Cooley-Tukey algorithm, which factors DFTs with a power-of-2 number of points by successively breaking the DFT into half-sized DFTs and combining the results with a phase rotation. Your post seems to indicate that you are looking instead for a 1000-point transform. You can either use a 1024-point FFT, padding the input data to fill out the size and accepting the slightly smaller bin size, or, if you need the 1000-point DFT, you can use some of the other FFT algorithms to arrive at a 1000-point transform. Either way, you'll greatly reduce the number of multiplies by using a Fast Fourier Transform instead of the DFT. The Smith and Smith book (
formatting link
) provides pretty good coverage of the various FFT algorithms you'd need for either approach. It is presented more from a software perspective than a hardware one, but it nevertheless provides a comprehensive background to let you build a hardware implementation that is far more efficient than what you are proposing.
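To put rough numbers on it (Python; the FFT figure is the usual (N/2)*log2(N) textbook estimate, not an exact count for any particular implementation):

import math
import numpy as np

N_dft = 1024
direct_mults = N_dft * 1000                         # ~1,024,000 complex multiplies per block
fft_mults = (N_dft // 2) * int(math.log2(N_dft))    # ~5,120 complex multiplies per block
print(f"direct DFT: {direct_mults} multiplies, radix-2 FFT: {fft_mults} multiplies")

# A 1000-point transform is also handled directly by mixed-radix FFTs
# (1000 = 2^3 * 5^3); numpy's FFT matches a brute-force DFT of that length.
x = np.random.randn(1000)
n = np.arange(1000)
dft = np.array([np.sum(x * np.exp(-2j * np.pi * k * n / 1000)) for k in range(1000)])
assert np.allclose(dft, np.fft.fft(x))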

The other point I should make is that you can use a processing clock that is faster than your sample clock, which I think you said is only 1 MHz. Our FFT cores run at over 300 MS/s in current FPGA devices, and they don't use anywhere near the 1000 multipliers you are looking at.
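As a rough illustration of that headroom (the 200 MHz processing clock below is just an assumed figure, not a spec for any particular device or core):

import math

sample_rate_hz = 1e6
fft_size = 1024
proc_clock_hz = 200e6

block_time_s = fft_size / sample_rate_hz                        # 1.024 ms per input block
cycles_per_block = proc_clock_hz * block_time_s                 # ~205,000 clock cycles available
fft_complex_mults = (fft_size // 2) * int(math.log2(fft_size))  # ~5,120 for a radix-2 FFT

print(f"{cycles_per_block:.0f} cycles available vs ~{fft_complex_mults} complex multiplies needed")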

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
Reply to
Ray Andraka
