FFT Speeds

I'm not building anything, I'm just curious.  I worked on a machine in the 80's called an attached array processor capable of 100 MFLOPS, the ST-100.  It could do a 1K complex FFT in 0.86 msec.  What sort of device would be required to do this today?  A desktop would far exceed this performance.  Low-end single-chip processors like an ARM CM3 would probably not reach this performance level.  But I bet the processor in an rPi would exceed it.

Can anyone nail this down a bit more?  Are there ARM CM4 processors capable of a sub-ms 1K complex FFT in 32-bit floating point?  Looking at the Cypress PSoC 61, it says "with single-cycle multiply (Floating Point and Memory Protection Unit)" but I think those are three separate things: single-cycle integer multiply, floating point, and MPU.

Anyone know how fast a CPU like this can crank a 1k FFT?  

Of course the rack-cabinet machine I'm talking about was fully capable of moving data around to keep up with the FFT, so the computations were the bottleneck.  I expect in many applications on MCUs the FFT time is not the bottleneck, at least not at the 1K size; rather, getting the data and moving it around is nearly as slow, and the rest of the application is running on the same CPU.  The ST-100 was the equivalent of a floating-point coprocessor, so it only handled the math.
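As a back-of-envelope check, using the generic ~5*N*log2(N) flop estimate for a radix-2 complex FFT (an assumption, not necessarily the ST-100's actual algorithm), that 0.86 ms works out to roughly 60 MFLOPS sustained:

```python
import math

N = 1024
flops = 5 * N * math.log2(N)   # ~51200 real operations, textbook estimate
t = 0.86e-3                    # ST-100 time for a 1K complex FFT
mflops = flops / t / 1e6
print(f"{flops:.0f} flops in {t * 1e3:.2f} ms -> {mflops:.0f} MFLOPS sustained")
```

About 60% of the machine's 100 MFLOPS peak, which is a believable sustained fraction for an FFT.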

--  

  Rick C.

  - Get 1,000 miles of free Supercharging
Re: FFT Speeds


A beefed-up microcontroller should be enough:

http://www.ti.com/lit/an/spry288a/spry288a.pdf

The C28x family runs up to 200 MHz and does a 1k complex FFT in 53,219 cycles,
i.e. 0.27 ms. Of course there are caveats: you must run everything from RAM
and keep it very tight.

Some models are also dual core.

--  
Fletto i muscoli e sono nel vuoto.

Re: FFT Speeds
On 1/4/20 1:14 am, dalai lamah wrote:

All other things being equal, doubling the length of an FFT adds one complex multiply per sample. So performance is O(log2(N)).

Clifford Heath

Re: FFT Speeds

I think you mean N log N.  Usually we think of complex multiplication as
4 real multiplications plus 2 additions.  I wonder if (on machines with
1 cycle MAC) that makes complex FFT 4x slower than real FFT, or if they
have some interesting optimizations, parallelism, etc.
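A quick check of the counts (assuming the textbook radix-2 decomposition, one complex multiply per butterfly, trivial twiddles not optimized away):

```python
def complex_mults(n):
    """Complex multiplies in a radix-2 FFT of length n: two half-size
    FFTs plus n/2 twiddle multiplies, giving (n/2)*log2(n) total."""
    return 0 if n <= 1 else 2 * complex_mults(n // 2) + n // 2

# Total grows as N*log2(N); under the 4-mult/2-add rule each complex
# multiply costs 4 real multiplies, hence the ~4x cost over a real FFT.
print(complex_mults(1024))   # 5120 = (1024/2) * 10
```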

Re: FFT Speeds
On 1/4/20 12:41 pm, Paul Rubin wrote:

Ahh yes, sorry. Mis-translated by omitting "per sample" :)


Memory/cache behaviour is often the limiting factor. Bailey's approach
to ordering the computation can radically assist, though:

<https://www.davidhbailey.com/dhbpapers/fftq.pdf>
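For the curious, the four-step ordering can be sketched in a few lines, with naive DFTs standing in for the cache-sized sub-transforms (an illustration of the index shuffling only, nothing like Bailey's tuned code):

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, standing in for the cache-sized sub-FFTs."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def four_step_fft(x, n1, n2):
    """Bailey four-step FFT of length n1*n2 (ordering sketch only)."""
    n = n1 * n2
    # Step 1: view x as an n1 x n2 row-major matrix; DFT each column.
    cols = [dft([x[t1 * n2 + t2] for t1 in range(n1)]) for t2 in range(n2)]
    # Step 2: pointwise twiddle multiply, exp(-2j*pi*k1*t2/n).
    tw = [[cols[t2][k1] * cmath.exp(-2j * cmath.pi * k1 * t2 / n)
           for t2 in range(n2)] for k1 in range(n1)]
    # Steps 3-4: DFT each row, then read the result out transposed.
    rows = [dft(r) for r in tw]
    return [rows[k1][k2] for k2 in range(n2) for k1 in range(n1)]

# Agrees with a direct DFT:
x = [complex(i % 5, (3 * i) % 7) for i in range(16)]
assert all(abs(a - b) < 1e-9 for a, b in zip(four_step_fft(x, 4, 4), dft(x)))
```

Each FFT pass works on short rows that fit in cache; a real implementation adds explicit transposes so every pass also runs at unit stride.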

Ultimately I want an RPi-like module with a GPU that supports CUDA...
for a lower price than the Jetson Nano :). It's a shame to have to jump
to an FPGA or funky DSP chip when a GPU is perfect for the job.

Clifford Heath

Re: FFT Speeds
On Tuesday, March 31, 2020 at 10:40:48 PM UTC-4, Clifford Heath wrote:

I assume CUDA is a development language for GPUs?  The rPi has a not fully documented GPU, so it may or may not support CUDA.  But an FPGA is easy enough to find on a daughtercard for the rPi.  Or are you saying you just don't want to mess with an FPGA at all and would rather use software on a GPU?

Personally I much prefer coding FPGAs in HDL to mucking about with the complexities of trying to get a sequentially executing CPU to appear to be processing in parallel.  A GPU might be a bit of fun though.  I've worked with Transputers and programmed an attached array processor (4 ALUs in parallel).  I expect a GPU would be a step up from either of those.

Too bad the rPi GPU is not accessible.  That could make for an interesting target.

BTW, thanks to those who pointed out FFT benchmarks.  Looks like a single-chip CPU does come pretty close to two rack cabinets from 30+ years ago.  It doesn't warm the room as well though.

--  

  Rick C.

  + Get 1,000 miles of free Supercharging
Re: FFT Speeds
On 1/4/20 2:53 pm, Rick C wrote:

Yes.


Correct. Same for ODroid.


Yes, that. I want the array processor already defined for me, I just  
want to tell it what to do.

I would too, if I had already learnt to do it.


Yes, exactly.


In 1986 I was at HP, and we took delivery of one of their 3D Turbo SRX graphics rendering boxes... with hardware radiosity for rendering surfaces, etc... minimally configured, as we only needed it for configuration testing. But it was my desktop computer... That thing would still dim the lights when you turned it on!

The performance was dwarfed by single-chip PC graphics cards in less  
than ten years.

Clifford Heath

Re: FFT Speeds
On 01/04/2020 04:53, Rick C wrote:

<snipped>

Probably a dumb question, I'm no FPGA expert, but for a fixed-point FFT, could one use multiple shift-and-add multipliers instead of one fast hardware multiplier?  Roughly how many shift-add multipliers could you fit in the same space as a fast multiplier?

I use a 128-point FFT for a comms system, and a dsPIC handles this quickly enough with its two hardware multipliers and ancillary logic, but I've wondered if an FPGA with lots of parallel shift-add multipliers might be quicker.

In my application, I can't use FPGAs with inbuilt processors or multipliers.

--  
Cheers
Clive

Re: FFT Speeds
On Wednesday, April 1, 2020 at 12:32:27 PM UTC-4, Clive Arthur wrote:

Ok, sometimes I don't completely follow what people have posted.  By shift-and-add multiplier I assume you mean a single adder that is used sequentially to generate a multiply in N clocks?
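To pin down terms, here's a word-level model of that serial scheme, assuming unsigned operands: one adder reused for `width` clocks, one multiplier bit examined per clock:

```python
def shift_add_mult(a, b, width=16):
    """Serial shift-add multiply: one adder reused for `width` clocks.
    Each clock, the multiplier LSB decides add / no-add, then both
    registers shift. Unsigned operands only."""
    acc = 0
    for _ in range(width):
        if b & 1:
            acc += a          # the single shared adder
        a <<= 1               # multiplicand shifts up one place...
        b >>= 1               # ...multiplier shifts down, exposing next bit
    return acc

assert shift_add_mult(300, 77) == 300 * 77
```

In an FPGA that's one adder, two shift registers and a small counter, with a throughput of one product per `width` clocks unless several are instantiated in parallel.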

I can't think why you would use such a multiplier if you need many of them to do the job.  I'm not an expert on multiplier design, but I believe a modified Booth's algorithm is not only faster but more compact.  The logic is a bit involved; someone once explained to me how it uses fewer computations, but I completely forget the details.

It's almost certain that multiple multipliers of any type could be faster... for some situations.  A 128-point (complex/real? xx-bit? fixed/float?) FFT is not very large, so not a huge amount of computation.  If you split the work, you need to consider the time to offload the data and return the results.  A friend many years ago told me about evaluating an attached array processor on his PDP-11/70 in grad school.  The calcs were fast, but the I/O was slow enough that it was faster on the CPU.

I expect this is real-time work, so you might want to stream the data into the FPGA and do more processing there.  I would look at the problem as the work being done on the FPGA by default, offloading to a CPU the parts that are harder to do in an FPGA, like network connectivity.  Many of my designs have no CPU.

Take a look at the iCE5LP series from Lattice: 2 or 4 DSP blocks with 1100, 2048 or 3520 logic cells (LUT4s + FFs) and 8 or 16 kB of block RAM in a 0.5 mm pitch, 7 x 7 mm, 48-pin QFN.  A nice part for many small designs.

Can you explain why you can't use the built-in multipliers?  I get not wanting to use a CPU, but a multiplier is a multiplier whether you construct it or it's already in the FPGA.

--  

  Rick C.

  -- Get 1,000 miles of free Supercharging
Re: FFT Speeds
On 01/04/2020 20:20, Rick C wrote:

That's because it's not my field and I don't know the correct form of words.


Yes.

The last time I used a hardware multiplier outside a CPU, it was made with TTL chips and wire-wrapped.  A Wallace tree IIRC, taken from a big orange Texas databook.  My experience is somewhat limited.

There is a possible future requirement to run at high temperatures and in a very high-vibration, space-constrained environment.  This limits the size of the FPGA and the package type, in practice, to say something with 20k gates - the actual part numbers known to work are confidential, but they won't be the latest and greatest.

It had occurred to me that, as the FFT needs lots of complex multiplications, these could be done using lots of smaller simple shift-add multipliers instead of one large 'normal' multiplier - and for all I know, that's the way an expert would do it.  But only if the speed/size trade-off worked.

--  
Cheers
Clive

Re: FFT Speeds
On Thursday, April 2, 2020 at 4:40:42 AM UTC-4, Clive Arthur wrote:

No, I think your terminology is good, I just don't recall exactly what things mean from time to time.  I've reached a point where I often know a topic (well, the parts I remember) but not the terminology.


I've heard of Wallace trees, but never used them.  I think they are more useful with the sort of function blocks you have in TTL/ECL chips.  When designing with gates I believe the modified Booth's algorithm is faster, but I'm not sure.  It certainly fits in an FPGA better, because they all have fast adder chains in the fabric logic, so the Wallace tree logic would be less effective.



Ok, that makes sense.  The qualified part doesn't have multiplier blocks.


A shift-add multiplier is one layer of a pipelined multiplier.  So have N shift-add multipliers of M bits, or N/M Booth's multipliers.  The amount of logic will be about the same, other than whatever optimization factor the Booth's offers.  I'm not sure it is less logic, but faster for sure.  Anyway, you should be able to put a lot of them into 20 kLUTs.  No, wait, you said "gates".

I think that tells me which FPGA family you are working with.  No one uses "gates" anymore, and these guys do a lot of "qualified" business.  That explains the lack of multipliers.  20 kgates isn't really much logic either.  I'm guessing around 3,000 logic elements, and these can be either logic or FF but not both?

I wish I could remember more, but these days you can put A <= B * C in your code and you will get a library function block.  Let the tools work it out.

--  

  Rick C.

  -+ Get 1,000 miles of free Supercharging
Re: FFT Speeds
On Thu, 2 Apr 2020 09:40:34 +0100, Clive Arthur

Implementing e.g. a 32 x 32 Wallace tree multiplier with MSI chips is
nasty, due to the huge number of interconnections. The partial
products, formed by two-input AND gates, amount to 1024 lines. Then full
adders (preferably with carry look-ahead) are required to count the
number of "1" bits. This requires multiple stages, with adder
propagation delays in each stage.
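The reduction being described can be modeled word-level in a few lines: carry-save (3:2) adders squash the pile of partial products layer by layer with no carry propagation, leaving one ordinary add at the end (a sketch of the structure, not a gate-level design):

```python
def csa(a, b, c):
    """Carry-save adder over whole words: three addends in, two out,
    using the full-adder identity in each bit column."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def wallace_mult(a, b, width=32):
    """width partial products (one AND row per multiplier bit), reduced
    3-at-a-time; each layer costs one full-adder delay."""
    pp = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    layers = 0
    while len(pp) > 2:
        nxt = []
        for i in range(0, len(pp) - len(pp) % 3, 3):
            s, c = csa(pp[i], pp[i + 1], pp[i + 2])
            nxt += [s, c]
        pp = nxt + pp[len(pp) - len(pp) % 3:]
        layers += 1
    return sum(pp), layers   # final carry-propagate add finishes it

prod, layers = wallace_mult(0xDEADBEEF, 0xCAFEBABE)
assert prod == 0xDEADBEEF * 0xCAFEBABE
```

For 32 partial products this takes 8 CSA layers plus the final carry-propagate add, which is the "multiple stages" point above.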

This is not a problem in e.g. FFT calculations, since the
multiplies can be pipelined, producing one product every adder
propagation delay time, but the total time taken by a single multiply
can be 10 or more adder propagation times.


Re: FFT Speeds
On 4/1/20 12:32 PM, Clive Arthur wrote:

Going back to this basic question: I presume you are asking not about
the fairly limited number of hard multipliers that many FPGAs include,
but about implementation in the FPGA fabric, comparing one design with a
full tree implemented in parallel, which can run at a throughput of 1
multiply per cycle (it may take several cycles to get a result, but a
new multiply can start every cycle), against an implementation that uses
reduced hardware but serially iterates the answer over N cycles. My
first impression is that the additional control mechanism needed to
iterate probably takes more logic than the fully parallel solution, and
thus costs space, and thus gives a lower throughput for a given number
of logic cells dedicated to multipliers. Someone would really need to
design this to be sure. Maybe the fact that one sequencer could drive
many serial multipliers would make a difference.

Re: FFT Speeds
On Thursday, April 2, 2020 at 8:02:57 AM UTC-4, Richard Damon wrote:
<snipped>

I have designed shift/add multipliers and the control logic is not much.  There is a log2(N)-bit counter and minimal other logic.  The shift/add multiplier only works with unsigned multipliers (or is it the multiplicand?), so there needs to be a 2's complement before and after the multiplier.  I'm pretty sure the Booth's algorithm handles sign automagically, but I won't swear to it.  I did this over a decade ago.  I'm pretty sure that if you are doing a bunch of multipliers, the shift/add uses more real estate for the same throughput, but I don't recall the details of modified Booth's, so I can't say for sure.

--  

  Rick C.

  +- Get 1,000 miles of free Supercharging
Re: FFT Speeds
On 2.4.20 21:16, Rick C wrote:


Booth's algorithm does handle signed operands in two's complement notation.
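A word-level sketch of the radix-4 modified Booth recoding showing exactly that (my illustration; it leans on Python's arbitrary-precision two's complement for the bit slicing):

```python
def booth_mult(a, b, width=16):
    """Radix-4 modified Booth multiply: recode the two's-complement
    multiplier b into width/2 digits in {-2, -1, 0, +1, +2}."""
    assert width % 2 == 0
    acc = 0
    prev = 0                                 # implicit bit right of b[0]
    for i in range(0, width, 2):
        triple = ((b >> i) & 3) << 1 | prev  # bits b[i+1], b[i], b[i-1]
        digit = (0, 1, 1, 2, -2, -1, -1, 0)[triple]
        acc += (digit * a) << i              # partial product, weight 4**(i//2)
        prev = (b >> (i + 1)) & 1
    return acc

# Exhaustive check over all 4-bit signed operands: sign is handled for free.
assert all(booth_mult(a, b, 4) == a * b
           for a in range(-8, 8) for b in range(-8, 8))
```

Each digit covers two multiplier bits, so a 16-bit multiply needs 8 partial products instead of 16, and the exhaustive check confirms that signed operands need no pre/post complementing.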

--  

-TV


Re: FFT Speeds
On Thursday, April 2, 2020 at 2:33:20 PM UTC-4, Tauno Voipio wrote:

Your mention of two's complement as if there were other choices these days made me recall that in the Forth language group there is a discussion of standardizing on 2's complement... finally.

Seems someone looked, and there are literally no remaining hardware computers running anything other than 2's complement.  There is an architecture running 1's complement, but it is no longer a hardware implementation, only simulated on other machines, lol.  There must be some commercial niche for software written for that machine.

So there's no real need to specify 2's complement these days, although I suppose in an FPGA there could be anything.  Some ADCs are still signed magnitude, so that conceivably could be carried through the chip.

--  

  Rick C.

  ++ Get 1,000 miles of free Supercharging
Re: FFT Speeds

IEEE 754 floating point, relevant to this discussion, is sign-magnitude.

Re: FFT Speeds
On Thursday, April 2, 2020 at 11:33:24 PM UTC-4, Paul Rubin wrote:

That's not the main data format on any machines I know of.  No one uses floating point for logic or flags.  It was an issue in Forth because of the way people perform logic using non-flags or math using flags.  A one's complement or a signed-magnitude data format breaks any assumptions you make about mixing the two.

Some people complain that the practice makes programs potentially non-portable, so it looks like they are going to standardize on two's complement so these programs will be compliant and the griping will stop.  No one is complaining that the change will break anything.

--  

  Rick C.

  --- Get 1,000 miles of free Supercharging
Re: FFT Speeds
On Thu, 2 Apr 2020 12:39:55 -0700 (PDT), Rick C


In one's complement, a number is negated by simply inverting all bits
in a word. Thus, by inserting inverters in one data path, both the
inversion and some other operation can be done in a single instruction.

In two's complement, a one must also be added to the result, requiring
an adder, and the ripple carry can propagate through all bit positions.
Thus complementing a number needs an ADD instruction and takes as long
as an ordinary ADD. Of course these days, with carry look-ahead logic,
the ADD is nearly as fast as AND/OR instructions.

With linear ADCs the zero line can be freely selected. One could put
the zero at 1/3 FSD and hence have a larger numeric range on one side.

However, in floating-point ADCs the range just around zero is of most
interest, thus sign/magnitude is more appropriate than biased unsigned
or 2's complement. Floating-point sign/magnitude is used e.g. in
digital telephony, so that weak signals are reproduced better.



Re: FFT Speeds
On Friday, April 3, 2020 at 2:34:42 AM UTC-4, snipped-for-privacy@downunder.com wrote:

Exactly, and the 2's complement doesn't have the various issues the 1's complement does.  It only has a single value for zero.  I can't recall how they deal with that.  If you subtract 2 from 1 you would get an all-ones word, which is zero, so you have to subtract another 1 to get -1, which is all ones with the LSB zero.  So it's not really simpler, just messy in different ways.
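For the record, one's-complement adders resolve that case with an end-around carry: any carry out of the top bit wraps back into the LSB. A quick 16-bit model (my sketch, not any particular machine):

```python
MASK = 0xFFFF   # 16-bit one's-complement word

def oc_neg(a):
    return a ^ MASK              # negate = invert every bit

def oc_add(a, b):
    """One's-complement add with end-around carry."""
    s = a + b
    if s > MASK:                 # carry out of bit 15...
        s = (s & MASK) + 1       # ...wraps around into bit 0
    return s & MASK

# 1 - 2 = 1 + (~2) = 0xFFFE, the one's-complement pattern for -1.
assert oc_add(1, oc_neg(2)) == oc_neg(1)
# The two zeros: 3 + (-3) lands on 0xFFFF, the "negative zero".
assert oc_add(3, oc_neg(3)) == 0xFFFF
```

Sums that should be zero can land on negative zero unless the hardware normalizes it; hence the messiness.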

Actually the invert-and-add-1 is pretty easy to do in an ALU.  You already have the adder; just add a conditional inverter in front and you have a subtract - oh, don't forget to add the 1 through the carry-in.  Easy, and no different timing from the ALU without the inverter if it's done in the same LUT in an FPGA.  In fact, I've seen macros for add/sub blocks.
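That trick in two lines (16-bit datapath assumed): conditionally invert the b input and reuse the subtract flag as the adder's carry-in, so a - b = a + ~b + 1 falls out of the same adder:

```python
MASK = 0xFFFF   # 16-bit datapath

def addsub(a, b, sub):
    """One adder for both ops: conditional inverter on b plus the sub
    flag fed into carry-in gives a - b = a + ~b + 1."""
    b = (b ^ (MASK if sub else 0)) & MASK   # the conditional inverter
    return (a + b + sub) & MASK             # sub doubles as carry-in

assert addsub(5, 3, 1) == 2
assert addsub(3, 5, 1) == (-2) & MASK       # 0xFFFE, two's complement -2
assert addsub(5, 3, 0) == 8
```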

The same sort of trick can be used to turn an adder into a mux.  In the instruction fetch unit of a CPU I designed there are multiple sources for the address, plus there is a need for an incrementer.  This is combined in one LUT per bit by using an enable to disable one of the inputs, which then just passes the other input like a mux.  I think I use that for the return address.



You are talking about u-law/A-law compression.  Yes, very familiar with that.  Not really relevant to the native data format on CPUs though.  I know in u-law there is a bias which can muck things up if you aren't careful.  So it's not even signed magnitude either.
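The bias shows up clearly in a simplified G.711-style u-law codec (a sketch of the usual 14-bit-magnitude formulation with bias 33; real implementations differ in clipping and table details):

```python
BIAS = 33    # added to the magnitude so every segment has a leading 1

def ulaw_encode(x):
    """14-bit signed sample -> 8-bit code: sign bit, 3-bit segment,
    4-bit mantissa, all bits inverted on the wire."""
    sign = 0x80 if x < 0 else 0
    mag = min(abs(x), 8158) + BIAS
    seg = mag.bit_length() - 6            # which power-of-two chord
    mantissa = (mag >> (seg + 1)) & 0x0F
    return (sign | (seg << 4) | mantissa) ^ 0xFF

def ulaw_decode(code):
    code ^= 0xFF
    seg = (code >> 4) & 7
    mag = ((2 * (code & 0x0F) + BIAS) << seg) - BIAS
    return -mag if code & 0x80 else mag

assert ulaw_encode(0) == 0xFF             # u-law zero is the 0xFF code
assert ulaw_decode(ulaw_encode(0)) == 0
```

Round-tripping only approximates the sample (logarithmic quantization), and because of the embedded bias the raw code is neither offset binary nor plain signed magnitude.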

--  

  Rick C.

  --+ Get 1,000 miles of free Supercharging