FFT timing

Hi,

Does anybody have an estimate of how long it might take to do a million-point FFT on a modern Intel PC? Google is no help.

Our data would be from a 16-bit ADC, but I get the impression that, on an Intel machine, a floating-point fft might be faster than fixed-point.

John

Reply to
John Larkin

If I did the test right, 15 to 20 seconds on a Pentium 200.

Reply to
linnix

On 19/04/2008 4:42, John Larkin wrote:

Have a look at

formatting link
It's one of the fastest free FFT libraries for the PC. They maintain a huge benchmark suite.

Reply to
Antonio Pasini

I use FFTW. A one-million point FFT in double precision arithmetic takes 172 ms on my 1.7 GHz Pentium.

See code and output below.

Time = 172 ms

 0, -2.8582e-011    1, -2.8871e-011    2, -2.9773e-011    3, -3.1409e-011
 4, -3.4026e-011    5, -3.8109e-011    6, -4.4659e-011    7, -5.6043e-011
 8, -7.9395e-011    9, -1.5043e-010   10,  0.5           11,  1.3611e-010
12,  6.4959e-011   13,  4.1423e-011   14,  2.9773e-011   15,  2.2866e-011
16,  1.8322e-011   17,  1.5123e-011   18,  1.276e-011    19,  1.0951e-011
Press any key to continue

#include "fftw3.h" #include #include #include

#define LEN 1000000

fftw_complex *t, *f; fftw_plan t2f;

#define PI 3.1415926535

int main() { int i;

t = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * LEN); f = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * LEN);

t2f = fftw_plan_dft_1d(LEN, t, f, FFTW_FORWARD, FFTW_ESTIMATE);

for (i=0; i

Reply to
Andrew Holme

Cool! I had no idea if it was milliseconds or minutes. We'd only have to do it every second or so, so we have lots of time.

(I'm getting ready to make a proposal/presentation next week, and I have to simulate intelligence.)

Thanks, guys.

John

Reply to
John Larkin

No contest.

Worth pointing out here that if you have the option to choose your array length then FFTs for exact powers of two are quite a bit faster on almost all architectures.

And 2^20 = 1024^2 = 16^5, which is probably the fastest power-of-two radix FFT kernel in common use.

Multiples of 5 and 10, although very fast, can't compete with the highly optimised radix 16, 8, 4, 2 kernels. Although for 10^6 datapoints you will be limited by memory bandwidth and cache considerations to around 1000 Mflops (on a 3 GHz P4 box). On shorter lengths fitting entirely in cache, a 1024 FFT is roughly 2x faster than the best 1000 FFT.

You may also be able to steal a march by transforming a pair at the same time with one complex transform and a diddle (there is a small price to pay in crosstalk for this). Or by using a real to complex conjugate symmetric version of the FFT (assuming your data is a real valued sampled time series of some sort).
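A minimal sketch of the real-to-complex route with FFTW 3, for illustration only (the array length is a placeholder and filling the input with real samples is left as a comment; this is not code from the thread):

#include <fftw3.h>

#define N 1000000

int main(void)
{
    /* real input of length N; conjugate symmetry means only N/2 + 1 output bins */
    double *in = (double *) fftw_malloc(sizeof(double) * N);
    fftw_complex *out = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * (N/2 + 1));
    fftw_plan p = fftw_plan_dft_r2c_1d(N, in, out, FFTW_ESTIMATE);

    for (int i = 0; i < N; i++)
        in[i] = 0.0;               /* replace with the real-valued time series */

    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}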

FFTW is very good for this sort of thing.

Regards, Martin Brown

Reply to
Martin Brown

This is a bit of an old wives' tale. Powers of two have some advantage in arithmetic cost over other highly composite sizes (powers of 2/3/5), but it is not overwhelming. The main advantage of powers of two is not intrinsic, but is simply because most FFT libraries are mainly optimized for power-of-two sizes. In fact, because of limited cache associativity and direct-mapped caches based on low-order address bits, power-of-two sizes (where the subtransforms access data at power-of-two strides) can sometimes have a decided *disadvantage* for large sizes.

With a library like FFTW that is heavily optimized for non-power-of- two sizes as well as for powers of two, the difference is fairly small. With FFTW, it is often not worth it to increase a power-2/3/5 transform length to the next power-of-two size unless the length increase is less than 10%. It is especially not worth it for multi- dimensional transforms.

For example, on my 2.6GHz Intel Core 2 Duo with gcc 4.1 and FFTW 3.2 in 64-bit mode for double precision with SSE2, size 800 is faster than size 1024, sizes 3600 and 3840 are faster than size 4096, and sizes 25000 and 30000 are faster than 32768. (You can try it yourself with FFTW's self-test program: tests/bench -opatient 800 1024 3600 3840 4096 25000 30000 32768)

On my abovementioned 2.6GHz machine, the difference is closer to 25% with FFTW 3.2. In double precision, size 1024 takes 7.84us, while size 1000 takes 9.92us.

(And with regards to the original poster, size 1000000 is actually sometimes faster than size 2^20.)

Regards, Steven G. Johnson

Reply to
Steven G. Johnson

Actually, you should be able to do significantly better than that with FFTW on a modern Intel machine (with SSE/SSE2 enabled). Especially if you are doing lots of million-point FFTs and can afford to wait a few minutes one time to run the planner with FFTW_PATIENT (and then save the resulting plan with fftw_export_wisdom).
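For illustration, a minimal sketch of that plan-once, save-wisdom workflow with FFTW 3 (the wisdom filename "fft.wisdom" is my own placeholder, not from this thread):

#include <fftw3.h>
#include <stdio.h>

#define LEN 1000000

int main(void)
{
    fftw_complex *t = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * LEN);
    fftw_complex *f = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * LEN);

    /* reuse previously saved plans, if any */
    FILE *w = fopen("fft.wisdom", "r");
    if (w) { fftw_import_wisdom_from_file(w); fclose(w); }

    /* slow (minutes) the first time, nearly instant once wisdom exists */
    fftw_plan p = fftw_plan_dft_1d(LEN, t, f, FFTW_FORWARD, FFTW_PATIENT);

    /* save the accumulated wisdom for the next run */
    w = fopen("fft.wisdom", "w");
    if (w) { fftw_export_wisdom_to_file(w); fclose(w); }

    /* planning may scribble on the arrays, so fill the input afterwards */
    for (int i = 0; i < LEN; i++) { t[i][0] = t[i][1] = 0.0; }  /* real data here */
    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(t);
    fftw_free(f);
    return 0;
}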

On my 2.6GHz Intel Core 2 Duo, in 64-bit mode, in double precision, with FFTW 3.2alpha3 in FFTW_PATIENT mode, a million-point FFT takes 43ms, and in single precision (which should be sufficient for the original poster's 16-bit ADC) a million-point FFT takes 22ms. Using two CPUs, the million-point single-precision FFT takes 13ms.
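A minimal sketch of the single-precision, two-thread setup with FFTW 3, assuming the libfftw3f and libfftw3f_threads libraries are available (illustrative only, not code from this thread; typically linked with -lfftw3f_threads -lfftw3f -lm plus the platform's threads library):

#include <fftw3.h>

#define LEN 1000000

int main(void)
{
    fftwf_init_threads();          /* must be called before creating any plan */
    fftwf_plan_with_nthreads(2);   /* split the transform across two CPUs */

    fftwf_complex *t = (fftwf_complex *) fftwf_malloc(sizeof(fftwf_complex) * LEN);
    fftwf_complex *f = (fftwf_complex *) fftwf_malloc(sizeof(fftwf_complex) * LEN);
    fftwf_plan p = fftwf_plan_dft_1d(LEN, t, f, FFTW_FORWARD, FFTW_MEASURE);

    for (int i = 0; i < LEN; i++) { t[i][0] = t[i][1] = 0.0f; }  /* ADC data here */
    fftwf_execute(p);

    fftwf_destroy_plan(p);
    fftwf_free(t);
    fftwf_free(f);
    fftwf_cleanup_threads();
    return 0;
}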

Another poster suggested padding to 2^20 = 1048576 (the next power of 2), but on my machine that is actually slightly slower: it takes 49ms in double precision and 26ms in single precision.

Regards, Steven G. Johnson

Reply to
Steven G. Johnson

I must have another look at FFTW then. I haven't seen the version that exploits SSE2. And I did notice in the past that certain Microsoft compilers made a poor fist of optimising the FFTW machine generated kernels. Which compiler did you use for these benchmarks?

Can't argue with these hard numbers or a famous FFT guru. Thinking about it after seeing your other post, I concluded that for cache-bandwidth-limited single transforms you could well be right, since the 5% increase in length would make for at least a 5% increase in absolute timing. I am a bit surprised though that an optimised 2^N transform cannot beat a radix 10 in terms of time per element processed - overheads in radix 5 are quite a bit higher. Does the saved wisdom provide or export a human-readable version of the actual radix kernels used and their order of application?

I thought the problems with slower large 2^N transforms were confined to a particularly unfortunate vintage IBM workstation design, but apparently not. Thanks for posting the numbers.

I confess genuine surprise that 2^N no longer reigns supreme (if only by a small margin) when you have a free choice of very long transform length. You have to concede though that 2^N is a lot simpler to code.

Regards, Martin Brown

Reply to
Martin Brown

I still believed it when I posted my reply. And I believe it is still true provided that the array fits cleanly into cache.

Interesting. That makes sense of the numbers you posted for 10^6 vs 2^20. I had not run into this problem with longer transforms (or noticed).

Agreed absolutely.

The old code I have can only do radices 2, 3 and 5 (ex radio astronomy). In 1-D benchmarks: 625, 640 and 768 are faster than 1024, and 3072 and 3125 are faster than 4096. I don't have figures to hand for anything bigger. And obviously for 2-D the space savings can make other lengths competitive.

Indeed. There must be something wrong with the old radix5/10 kernel I use.

Regards, Martin Brown

Reply to
Martin Brown

gcc 4.1.2

First, the arithmetic difference is not that great; on the order of 25%. Second, for transforms this big the time is dominated by memory access rather than by arithmetic. And, as I mentioned in my previous message, power-of-two strides have a disadvantage on modern memory architectures because of the way caches work (limited associativity based on low-order address bits).

As a fun little exercise, try writing code that performs an NxN matrix transpose (just two nested loops). Then time it as a function of N, and plot time/(N*N) vs. N. Without caches etc., the plot would be roughly constant. However, if you do this on any current general-purpose machine you will find it is anything but constant: you will see huge spikes whenever N is a power of 2, or for that matter whenever N is a multiple of a power of 2.
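In case anyone wants to try it, here is a minimal sketch of that exercise in C; the N range, repeat count and clock()-based timing are my own arbitrary choices. Plot the ns/element column against N and look for the spikes at multiples of powers of 2:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define REPS 20   /* repeat so each timing is well above clock() resolution */

static void transpose(double *a, double *b, int n)
{
    /* the naive two-nested-loop transpose described above */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            b[j * n + i] = a[i * n + j];
}

int main(void)
{
    /* sweep N through a range that includes the power of two 1024 */
    for (int n = 1000; n <= 1100; n += 4) {
        double *a = calloc((size_t)n * n, sizeof(double));
        double *b = calloc((size_t)n * n, sizeof(double));
        clock_t t0 = clock();
        for (int r = 0; r < REPS; r++)
            transpose(a, b, n);
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("%5d  %.2f ns/element\n", n, 1e9 * sec / REPS / ((double)n * n));
        free(a);
        free(b);
    }
    return 0;
}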

(Someday, perhaps, caches will use a proper hash of all the address bits and most of this wild variation will go away.)

The wisdom is not human readable. However, you can use fftw_print_plan to print a "nerd-readable" version of the plan. Note that the algorithm is not simply a choice of radices; it has many other choices besides that (see our Proc. IEEE paper on

formatting link

A lot of people are surprised; the myth of the overwhelming advantage of powers of two is deeply ingrained in DSP folklore.

2^N is only a "lot" simpler to code if you don't care about performance. If you care about performance, a textbook radix-2 implementation is worthless (5 to 40 times slower than the fastest FFTs, depending on N), you have to use larger radices anyway, and the complications involved in developing a highly optimized FFT aren't so centered around coding the radix butterfly. (For FFTW, all our butterflies and base cases are generated by a program anyway, so for us there is no difference.)

Regards, Steven G. Johnson

Reply to
Steven G. Johnson

There's probably nothing wrong with it; it's just optimized mainly for the power-of-two case. Most FFT programs are. Programmers have limited time to devote to optimization, and so it is sensible to devote most of that time to powers of two since they are the ingrained choice of most DSP practitioners.

In FFTW, the generation of the radix kernels etc. is automated, so it is somewhat unusual in that we don't have to make much of a tradeoff between optimizing powers of two and non powers of two.

Steven

Reply to
Steven G. Johnson

(I should mention that there are faster ways to perform matrix transpositions than the simplest algorithm of two nested loops, in order to improve cache-line utilization; actually, FFTW has optimized in-place and out-of-place transpose routines for both square and non-square matrices [the non-square in-place case being *much* harder, of course]. But the magnitude of the timing variation of the two-loop transposition as a function of N is surprising to most programmers who have an oversimplified mental model of what is going on inside the CPU.)

Steven

Reply to
Steven G. Johnson

You're right -- that was fun, and not at all what I would have predicted. I already knew my intuition about program performance was nearly useless, but this little exercise sure reinforces the rule: measure before optimizing.

Reply to
Bob Lied

Thanks.

I hadn't realised the hit was so big.

I remember demand-paged virtual memory. It was almost guaranteed that every year some new graduate student would transpose a (in those days big) 1024-square image with a pair of nested loops instead of using the optimised library functions. It could grind a VAX to a standstill, as almost every memory access generated a page fault.

Would an XOR of some midlevel bits be sufficient to break the degeneracy?

Thanks for the paper reference. I am curious how the wisdom works in a bit more detail.

It used to be true in the old days. Apart from some of the Winograd transforms as I recall.

Yes. Although the best optimising compilers do seem to have narrowed the differences a bit. Compact inner loop code can help. Incidentally, do you find the split radix 16 length transform worth the effort of coding?

And that is a stroke of genius.

Regards, Martin Brown

Reply to
Martin Brown

The factor of 5--40 that I quoted was based on current optimizing compilers on recent machines (the order of magnitude variation is for different N). Compiler differences have an impact, but it is usually on the order of 30%, not factors of 5 (except perhaps for a few special cases like dense-matrix operations for which compilers have been specifically tuned).

I'm not exactly sure what you mean by "split radix 16". Traditionally, "split radix" refers to a blending of radices 2 and 4 (although there have been some recent generalizations by Swamy et al. to higher radices while keeping the same operation count).

In any case, we use a variant of split radix in generating FFTW's codelets (the hard-coded transforms for the FFT base cases and the radix butterflies), but it is not used for the decomposition of general N. (We typically find large radices to be faster, e.g. radix 16--64 is typical, and for very large N > 2^19 or so we typically do one step of radix-sqrt(N).) More detail can be found in the papers on
formatting link

Steven

Reply to
Steven G. Johnson

Thank you for an interesting insight into FFTW's inner workings.

I will take a look at these papers and your most recent code (although I have a feeling the MS C++ compiler won't optimise it quite so well as GCC).

Regards, Martin Brown

Reply to
Martin Brown

On an actual Intel the FP performance lags the integer performance considerably. If you can organize it to use SSE2 instructions, the 16-bit integer performance will really rock. Well done, you could be looking at well less than 1 second.

Reply to
JosephKK

Damn, I am going to have to take a look at the code. I wonder how they get around the advantages in coefficient calculation.

Reply to
JosephKK

I had been aware of step and stride issues that were discovered with vector co-processors 30+ years ago. And I had some inkling of related cache issues. I would never have thought it could be so counter-intuitive.

Reply to
JosephKK
