FFT Speeds

I'm not building anything, I'm just curious.  I worked on a machine in the 80's called an attached array processor capable of 100 MFLOPS, the ST-100.  It could do a 1K complex FFT in 0.86 msec.  What sort of device would be required to do this today?  A desktop would far exceed this performance.  Low-end single-chip processors like an ARM CM3 would probably not reach this performance level.  But I bet the processor in an rPi would exceed it.

Can anyone nail this down a bit more?  Are there ARM CM4 processors capable of a sub-ms 1K complex FFT in 32-bit floating point?  Looking at the Cypress PSoC 61, it says "with single-cycle multiply (Floating Point and Memory Protection Unit)" but I think those are three separate things: single-cycle integer multiply, floating point, and MPU.

Anyone know how fast a CPU like this can crank a 1k FFT?  

Of course the rack-cabinet machine I'm talking about was fully capable of moving data around to keep up with the FFT, so the computations were the bottleneck.  I expect in many applications on MCUs the FFT time is not the bottleneck, at least not at the 1K size; rather, getting the data and moving it around is nearly as slow, and the rest of the application is running on the same CPU.  The ST-100 was the equivalent of a floating-point coprocessor, so it only handled the math.
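As a back-of-envelope check, using the generic ~5*N*log2(N) flop estimate for a radix-2 complex FFT (an assumption, not necessarily the ST-100's actual algorithm), that 0.86 ms works out to roughly 60 MFLOPS sustained:

```python
import math

N = 1024
flops = 5 * N * math.log2(N)   # ~51200 real operations, textbook estimate
t = 0.86e-3                    # ST-100 time for a 1K complex FFT
mflops = flops / t / 1e6
print(f"{flops:.0f} flops in {t * 1e3:.2f} ms -> {mflops:.0f} MFLOPS sustained")
```

About 60% of the machine's 100 MFLOPS peak, which is a believable sustained fraction for an FFT.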

--  

  Rick C.

  - Get 1,000 miles of free Supercharging
Re: FFT Speeds


A beefed-up microcontroller should be enough:

http://www.ti.com/lit/an/spry288a/spry288a.pdf

The C28x family runs up to 200 MHz and does a 1k complex FFT in 53,219 cycles,
i.e. 0.27 ms. Of course there are caveats: you must run everything from RAM
and keep it very tight.

Some models are also dual core.

--  
Fletto i muscoli e sono nel vuoto.

Re: FFT Speeds
On 1/4/20 1:14 am, dalai lamah wrote:

All other things being equal, doubling the length of an FFT adds one complex multiply per sample. So performance is O(log2(N)).

Clifford Heath

Re: FFT Speeds

I think you mean N log N.  Usually we think of complex multiplication as
4 real multiplications plus 2 additions.  I wonder if (on machines with
1 cycle MAC) that makes complex FFT 4x slower than real FFT, or if they
have some interesting optimizations, parallelism, etc.
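A quick check of the counts (assuming the textbook radix-2 decomposition, one complex multiply per butterfly, trivial twiddles not optimized away):

```python
def complex_mults(n):
    """Complex multiplies in a radix-2 FFT of length n: two half-size
    FFTs plus n/2 twiddle multiplies, giving (n/2)*log2(n) total."""
    return 0 if n <= 1 else 2 * complex_mults(n // 2) + n // 2

# Total grows as N*log2(N); under the 4-mult/2-add rule each complex
# multiply costs 4 real multiplies, hence the ~4x cost over a real FFT.
print(complex_mults(1024))   # 5120 = (1024/2) * 10
```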

Re: FFT Speeds
On 1/4/20 12:41 pm, Paul Rubin wrote:

Ahh yes, sorry. Mis-translated by omitting "per sample" :)


Memory/cache behaviour is often the limiting factor. Bailey's approach
to ordering the computation can radically assist, though:

<https://www.davidhbailey.com/dhbpapers/fftq.pdf>
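For the curious, the four-step ordering can be sketched in a few lines, with naive DFTs standing in for the cache-sized sub-transforms (an illustration of the index shuffling only, nothing like Bailey's tuned code):

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, standing in for the cache-sized sub-FFTs."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def four_step_fft(x, n1, n2):
    """Bailey four-step FFT of length n1*n2 (ordering sketch only)."""
    n = n1 * n2
    # Step 1: view x as an n1 x n2 row-major matrix; DFT each column.
    cols = [dft([x[t1 * n2 + t2] for t1 in range(n1)]) for t2 in range(n2)]
    # Step 2: pointwise twiddle multiply, exp(-2j*pi*k1*t2/n).
    tw = [[cols[t2][k1] * cmath.exp(-2j * cmath.pi * k1 * t2 / n)
           for t2 in range(n2)] for k1 in range(n1)]
    # Steps 3-4: DFT each row, then read the result out transposed.
    rows = [dft(r) for r in tw]
    return [rows[k1][k2] for k2 in range(n2) for k1 in range(n1)]

# Agrees with a direct DFT:
x = [complex(i % 5, (3 * i) % 7) for i in range(16)]
assert all(abs(a - b) < 1e-9 for a, b in zip(four_step_fft(x, 4, 4), dft(x)))
```

Each FFT pass works on short rows that fit in cache; a real implementation adds explicit transposes so every pass also runs at unit stride.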

Ultimately I want an RPi-like module with a GPU that supports CUDA...
for a lower price than the Jetson Nano :). It's a shame to have to jump
to an FPGA or funky DSP chip when a GPU is perfect for the job.

Clifford Heath

Re: FFT Speeds
On Tuesday, March 31, 2020 at 10:40:48 PM UTC-4, Clifford Heath wrote:

I assume CUDA is a development language for GPUs?  The rPi has a not fully documented GPU, so it may or may not support CUDA.  But an FPGA is easy enough to find on a daughtercard for the rPi.  Or are you saying you just don't want to mess with an FPGA at all and would rather use software on a GPU?

Personally I much prefer coding FPGAs in HDL to mucking about with the complexities of trying to get a sequentially executing CPU to appear to be processing in parallel.  A GPU might be a bit of fun though.  I've worked with Transputers and programmed an attached array processor (4 ALUs in parallel).  I expect a GPU would be a step up from either of those.

Too bad the rPi GPU is not accessible.  That could make for an interesting target.

BTW, thanks to those who pointed out FFT benchmarks.  Looks like a single-chip CPU does come pretty close to two rack cabinets from 30+ years ago.  It doesn't warm the room as well though.

--  

  Rick C.

  + Get 1,000 miles of free Supercharging
Re: FFT Speeds
On 1/4/20 2:53 pm, Rick C wrote:

Yes.


Correct. Same for ODroid.


Yes, that. I want the array processor already defined for me, I just  
want to tell it what to do.

I would too, if I had already learnt to do it.


Yes, exactly.


In 1986 I was at HP, and we took delivery of one of their 3D Turbo SRX graphics rendering boxes... with hardware radiosity for rendering surfaces, etc... minimally configured, as we only needed it for configuration testing. But it was my desktop computer... That thing would still dim the lights when you turned it on!

The performance was dwarfed by single-chip PC graphics cards in less  
than ten years.

Clifford Heath

Re: FFT Speeds
On 01/04/2020 04:53, Rick C wrote:

<snipped>

Probably a dumb question, I'm no FPGA expert, but for a fixed-point FFT, could one use multiple shift-and-add multipliers instead of one fast hardware multiplier?  Roughly how many shift-add multipliers could you fit in the same space as a fast multiplier?

I use a 128-point FFT for a comms system, and a dsPIC handles this quickly enough with its two hardware multipliers and ancillary logic, but I've wondered if an FPGA with lots of parallel shift-add multipliers might be quicker.

In my application, I can't use FPGAs with inbuilt processors or multipliers.

--  
Cheers
Clive

Re: FFT Speeds
On Wednesday, April 1, 2020 at 12:32:27 PM UTC-4, Clive Arthur wrote:

Ok, sometimes I don't completely follow what people have posted.  By shift-and-add multiplier I assume you mean a single adder that is used sequentially to generate a multiply in N clocks?
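To pin down terms, here's a word-level model of that serial scheme, assuming unsigned operands: one adder reused for `width` clocks, one multiplier bit examined per clock:

```python
def shift_add_mult(a, b, width=16):
    """Serial shift-add multiply: one adder reused for `width` clocks.
    Each clock, the multiplier LSB decides add / no-add, then both
    registers shift. Unsigned operands only."""
    acc = 0
    for _ in range(width):
        if b & 1:
            acc += a          # the single shared adder
        a <<= 1               # multiplicand shifts up one place...
        b >>= 1               # ...multiplier shifts down, exposing next bit
    return acc

assert shift_add_mult(300, 77) == 300 * 77
```

In an FPGA that's one adder, two shift registers and a small counter, with a throughput of one product per `width` clocks unless several are instantiated in parallel.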

I can't think why you would use such a multiplier if you need many of them to do the job.  I'm not an expert on multiplier design, but I believe a modified Booth's algorithm is not only faster but more compact.  The logic is a bit involved; someone once explained to me how it uses fewer computations, but I completely forget the details.

It's almost certain that multiple multipliers of any type could be faster... for some situations.  A 128-point (complex/real? xx-bit? fixed/float?) FFT is not very large, so not a huge amount of computation.  If you split the work, you need to consider the time to offload the data and return the results.  A friend many years ago told me about evaluating an attached array processor on his PDP-11/70 in grad school.  The calcs were fast, but the I/O was slow enough that it was faster on the CPU.

I expect this is real-time work, so you might want to stream the data into the FPGA and do more processing there.  I would look at the problem as the work being done on the FPGA by default, offloading to a CPU the parts that are harder to do in an FPGA, like network connectivity.  Many of my designs have no CPU.

Take a look at the iCE5LP series from Lattice: 2 or 4 DSP blocks with 1100, 2048 or 3520 logic cells (LUT4s + FFs) and 8 or 16 kB of block RAM in a 0.5 mm pitch, 7 x 7 mm, 48-pin QFN.  A nice part for many small designs.

Can you explain why you can't use the built-in multipliers?  I get not wanting to use a CPU, but a multiplier is a multiplier whether you construct it or it's already in the FPGA.

--  

  Rick C.

  -- Get 1,000 miles of free Supercharging
Re: FFT Speeds
On 01/04/2020 20:20, Rick C wrote:

That's because it's not my field and I don't know the correct form of words.


Yes.

The last time I used a hardware multiplier outside a CPU, it was made with TTL chips and wire-wrapped.  A Wallace tree IIRC, taken from a big orange Texas databook.  My experience is somewhat limited.

There is a possible future requirement to run at high temperatures and in a very high-vibration, space-constrained environment.  This limits the size of the FPGA and the package type, in practice, to say something with 20k gates - the actual part numbers known to work are confidential, but they won't be the latest and greatest.

It had occurred to me that, as the FFT needs lots of complex multiplications, these could be done using lots of smaller simple shift-add multipliers instead of one large 'normal' multiplier - and for all I know, that's the way an expert would do it.  But only if the speed/size trade-off worked.

--  
Cheers
Clive

Re: FFT Speeds
On Thursday, April 2, 2020 at 4:40:42 AM UTC-4, Clive Arthur wrote:

No, I think your terminology is good, I just don't recall exactly what things mean from time to time.  I've reached a point where I often know a topic (well, the parts I remember) but not the terminology.


I've heard of Wallace trees, but never used them.  I think they are more useful with the sort of function blocks you have in TTL/ECL chips.  When designing with gates I believe the modified Booth's algorithm is faster, but I'm not sure.  It certainly fits in an FPGA better, because they all have fast adder chains in the fabric logic, so the Wallace tree logic would be less effective.



Ok, that makes sense.  The qualified part doesn't have multiplier blocks.


A shift-add multiplier is one layer of a pipelined multiplier.  So have N shift-add multipliers of M bits, or N/M Booth's multipliers.  The amount of logic will be about the same, other than whatever optimization factor the Booth's offers.  I'm not sure it is less logic, but faster for sure.  Anyway, you should be able to put a lot of them into 20 kLUTs.  No, wait, you said "gates".

I think that tells me which FPGA family you are working with.  No one uses "gates" anymore, and these guys do a lot of "qualified" business.  That explains the lack of multipliers.  20 kgates isn't really much logic either.  I'm guessing around 3,000 logic elements, and these can be either logic or FF but not both?

I wish I could remember more, but these days you can put A <= B * C in your code and you will get a library function block.  Let the tools work it out.

--  

  Rick C.

  -+ Get 1,000 miles of free Supercharging
Re: FFT Speeds
On Thu, 2 Apr 2020 09:40:34 +0100, Clive Arthur

Implementing e.g. a 32 x 32 Wallace tree multiplier with MSI chips is
nasty, due to the huge number of interconnections. The partial
products, formed by two-input AND gates, amount to 1024 lines. Then full
adders (preferably with carry look-ahead) are required to count the
number of "1" bits. This requires multiple stages, with adder
propagation delays in each stage.
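The reduction being described can be modeled word-level in a few lines: carry-save (3:2) adders squash the pile of partial products layer by layer with no carry propagation, leaving one ordinary add at the end (a sketch of the structure, not a gate-level design):

```python
def csa(a, b, c):
    """Carry-save adder over whole words: three addends in, two out,
    using the full-adder identity in each bit column."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def wallace_mult(a, b, width=32):
    """width partial products (one AND row per multiplier bit), reduced
    3-at-a-time; each layer costs one full-adder delay."""
    pp = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    layers = 0
    while len(pp) > 2:
        nxt = []
        for i in range(0, len(pp) - len(pp) % 3, 3):
            s, c = csa(pp[i], pp[i + 1], pp[i + 2])
            nxt += [s, c]
        pp = nxt + pp[len(pp) - len(pp) % 3:]
        layers += 1
    return sum(pp), layers   # final carry-propagate add finishes it

prod, layers = wallace_mult(0xDEADBEEF, 0xCAFEBABE)
assert prod == 0xDEADBEEF * 0xCAFEBABE
```

For 32 partial products this takes 8 CSA layers plus the final carry-propagate add, which is the "multiple stages" point above.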

This is not a problem in e.g. FFT calculations, since the
multiplies can be pipelined, producing one product every adder
propagation delay time, but the total time taken by a single multiply
can be 10 or more adder propagation times.


Re: FFT Speeds
On 4/1/20 12:32 PM, Clive Arthur wrote:

Going back to this basic question: I presume you are asking not about
the fairly limited number of hard multipliers that many FPGAs include,
but about implementation in the FPGA fabric, comparing one design with a
full tree implemented in parallel, which can run at a throughput of 1
multiply per cycle (it may take several cycles to get a result, but a
new multiply can start every cycle), against an implementation that uses
reduced hardware but serially iterates the answer over N cycles. My
first impression is that the additional control mechanism needed to
iterate probably takes more logic than the fully parallel solution, and
thus costs space, and thus gives a lower throughput for a given number
of logic cells dedicated to multipliers. Someone would really need to
design this to be sure. Maybe the fact that one sequencer could drive
many serial multipliers would make a difference.

Re: FFT Speeds
On Thursday, April 2, 2020 at 8:02:57 AM UTC-4, Richard Damon wrote:
<snipped>

I have designed shift/add multipliers and the control logic is not much.  There is a log2(N)-bit counter and minimal other logic.  The shift/add multiplier only works with unsigned multipliers (or is it the multiplicand?), so there needs to be a 2's complement before and after the multiplier.  I'm pretty sure the Booth's algorithm handles sign automagically, but I won't swear to it.  I did this over a decade ago.  I'm pretty sure that if you are doing a bunch of multipliers, the shift/add uses more real estate for the same throughput, but I don't recall the details of modified Booth's, so I can't say for sure.

--  

  Rick C.

  +- Get 1,000 miles of free Supercharging
Re: FFT Speeds
On 2.4.20 21:16, Rick C wrote:


Booth's algorithm does handle signed operands in two's complement notation.
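A word-level sketch of the radix-4 modified Booth recoding showing exactly that (my illustration; it leans on Python's arbitrary-precision two's complement for the bit slicing):

```python
def booth_mult(a, b, width=16):
    """Radix-4 modified Booth multiply: recode the two's-complement
    multiplier b into width/2 digits in {-2, -1, 0, +1, +2}."""
    assert width % 2 == 0
    acc = 0
    prev = 0                                 # implicit bit right of b[0]
    for i in range(0, width, 2):
        triple = ((b >> i) & 3) << 1 | prev  # bits b[i+1], b[i], b[i-1]
        digit = (0, 1, 1, 2, -2, -1, -1, 0)[triple]
        acc += (digit * a) << i              # partial product, weight 4**(i//2)
        prev = (b >> (i + 1)) & 1
    return acc

# Exhaustive check over all 4-bit signed operands: sign is handled for free.
assert all(booth_mult(a, b, 4) == a * b
           for a in range(-8, 8) for b in range(-8, 8))
```

Each digit covers two multiplier bits, so a 16-bit multiply needs 8 partial products instead of 16, and the exhaustive check confirms that signed operands need no pre/post complementing.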

--  

-TV


Re: FFT Speeds
On Thursday, April 2, 2020 at 2:33:20 PM UTC-4, Tauno Voipio wrote:

Your mention of two's complement as if there were other choices these days made me recall that in the Forth language group there is a discussion of standardizing on 2's complement... finally.

Seems someone looked, and there are literally no remaining hardware computers running anything other than 2's complement.  There is an architecture running 1's complement, but it is no longer a hardware implementation, only simulated on other machines, lol.  There must be some commercial niche for software written for that machine.

So there's no real need to specify 2's complement these days, although I suppose in an FPGA there could be anything.  Some ADCs are still signed magnitude, so that conceivably could be carried through the chip.

--  

  Rick C.

  ++ Get 1,000 miles of free Supercharging
Re: FFT Speeds

IEEE 754 floating point, relevant to this discussion, is sign-magnitude.

Re: FFT Speeds
On Thursday, April 2, 2020 at 11:33:24 PM UTC-4, Paul Rubin wrote:

That's not the main data format on any machines I know of.  No one uses floating point for logic or flags.  It was an issue in Forth because of the way people perform logic using non-flags or math using flags.  A one's complement or a signed-magnitude data format breaks any assumptions you make about mixing the two.

Some people complain that the practice makes programs potentially non-portable, so it looks like they are going to standardize on two's complement so these programs will be compliant and the griping will stop.  No one is complaining that the change will break anything.

--  

  Rick C.

  --- Get 1,000 miles of free Supercharging
Re: FFT Speeds
On Thu, 2 Apr 2020 12:39:55 -0700 (PDT), Rick C


In one's complement, a number is negated by simply inverting all bits
in a word. Thus, by inserting inverters in one data path, both the
inversion and some other operation can be done in a single instruction.

In two's complement, a one must also be added to the result, requiring
an adder, and the ripple carry can propagate through all bit positions.
Thus complementing a number needs an ADD instruction and takes as long
as an ordinary ADD. Of course these days, with carry look-ahead logic,
the ADD is nearly as fast as AND/OR instructions.

With linear ADCs the zero line can be freely selected. One could put
the zero at 1/3 FSD and hence have a larger numeric range on one side.

However, in floating-point ADCs the range just around zero is of most
interest, thus sign/magnitude is more appropriate than biased unsigned
or 2's complement. Floating-point sign/magnitude is used e.g. in
digital telephony, so that weak signals are reproduced better.



Re: FFT Speeds
On Friday, April 3, 2020 at 2:34:42 AM UTC-4, snipped-for-privacy@downunder.com wrote:

Exactly, and the 2's complement doesn't have the various issues the 1's complement does.  It only has a single value for zero.  I can't recall how they deal with that.  If you subtract 2 from 1 you would get an all-ones word, which is zero, so you have to subtract another 1 to get -1, which is all ones with the LSB zero.  So it's not really simpler, just messy in different ways.
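For the record, one's-complement adders resolve that case with an end-around carry: any carry out of the top bit wraps back into the LSB. A quick 16-bit model (my sketch, not any particular machine):

```python
MASK = 0xFFFF   # 16-bit one's-complement word

def oc_neg(a):
    return a ^ MASK              # negate = invert every bit

def oc_add(a, b):
    """One's-complement add with end-around carry."""
    s = a + b
    if s > MASK:                 # carry out of bit 15...
        s = (s & MASK) + 1       # ...wraps around into bit 0
    return s & MASK

# 1 - 2 = 1 + (~2) = 0xFFFE, the one's-complement pattern for -1.
assert oc_add(1, oc_neg(2)) == oc_neg(1)
# The two zeros: 3 + (-3) lands on 0xFFFF, the "negative zero".
assert oc_add(3, oc_neg(3)) == 0xFFFF
```

Sums that should be zero can land on negative zero unless the hardware normalizes it; hence the messiness.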

Actually the invert-and-add-1 is pretty easy to do in an ALU.  You already have the adder; just add a conditional inverter in front and you have a subtract - oh, don't forget to add the 1 through the carry-in.  Easy, and no different timing from the ALU without the inverter if it's done in the same LUT in an FPGA.  In fact, I've seen macros for add/sub blocks.
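That trick in two lines (16-bit datapath assumed): conditionally invert the b input and reuse the subtract flag as the adder's carry-in, so a - b = a + ~b + 1 falls out of the same adder:

```python
MASK = 0xFFFF   # 16-bit datapath

def addsub(a, b, sub):
    """One adder for both ops: conditional inverter on b plus the sub
    flag fed into carry-in gives a - b = a + ~b + 1."""
    b = (b ^ (MASK if sub else 0)) & MASK   # the conditional inverter
    return (a + b + sub) & MASK             # sub doubles as carry-in

assert addsub(5, 3, 1) == 2
assert addsub(3, 5, 1) == (-2) & MASK       # 0xFFFE, two's complement -2
assert addsub(5, 3, 0) == 8
```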

The same sort of trick can be used to turn an adder into a mux.  In the instruction fetch unit of a CPU I designed there are multiple sources for the address, plus there is a need for an incrementer.  This is combined in one LUT per bit by using an enable to disable one of the inputs, which then just passes the other input like a mux.  I think I use that for the return address.



You are talking about u-law/A-law compression.  Yes, very familiar with that.  Not really relevant to the native data format on CPUs though.  I know in u-law there is a bias which can muck things up if you aren't careful.  So it's not even signed magnitude either.
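The bias shows up clearly in a simplified G.711-style u-law codec (a sketch of the usual 14-bit-magnitude formulation with bias 33; real implementations differ in clipping and table details):

```python
BIAS = 33    # added to the magnitude so every segment has a leading 1

def ulaw_encode(x):
    """14-bit signed sample -> 8-bit code: sign bit, 3-bit segment,
    4-bit mantissa, all bits inverted on the wire."""
    sign = 0x80 if x < 0 else 0
    mag = min(abs(x), 8158) + BIAS
    seg = mag.bit_length() - 6            # which power-of-two chord
    mantissa = (mag >> (seg + 1)) & 0x0F
    return (sign | (seg << 4) | mantissa) ^ 0xFF

def ulaw_decode(code):
    code ^= 0xFF
    seg = (code >> 4) & 7
    mag = ((2 * (code & 0x0F) + BIAS) << seg) - BIAS
    return -mag if code & 0x80 else mag

assert ulaw_encode(0) == 0xFF             # u-law zero is the 0xFF code
assert ulaw_decode(ulaw_encode(0)) == 0
```

Round-tripping only approximates the sample (logarithmic quantization), and because of the embedded bias the raw code is neither offset binary nor plain signed magnitude.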

--  

  Rick C.

  --+ Get 1,000 miles of free Supercharging