PowerBasic rocks!

But one can't estimate runtime by using your advice "you need to divide that by the rate at which your system executes instructions." The curves peak and then go down, but obviously bigger FFTs don't run faster than mid-sized ones.

John

Reply to
John Larkin

You could always look at their raw timing data in the .tar.gz files - before they factor out the algorithm's known O(N log2 N) behaviour.

Sometimes you can be amazingly obtuse. The graphs they publish are the deviations from ideal maximum theoretical performance. An ideal machine with infinitely fast memory would give a straight line, independent of array size, running at roughly 4500 MFLOPS for a P4 3GHz. They deliberately factor out the N log N behaviour because it is well known to all practitioners; what is interesting is where the sweet spots for padded array lengths with optimum performance reside.
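To put numbers on that (from memory of how the benchFFT pages define it, so treat it as approximate): the reported "MFLOPS" figure is 5 N log2(N) divided by the time for one transform in microseconds for complex transforms, and 2.5 N log2(N) for the real-input ones. A flat curve is therefore ideal N log N scaling. If what you actually want is a runtime, just invert the formula, e.g. in C:

#include <math.h>

/* Sketch: recover a runtime estimate from a benchFFT-style "MFLOPS"
   figure.  5*N*log2(N) is the nominal operation count used (as I
   recall) for complex transforms; use 2.5*N*log2(N) for real ones. */
static double fft_seconds(double n, double mflops)
{
    double ops = 5.0 * n * log2(n);    /* nominal flop count */
    return ops / (mflops * 1.0e6);     /* seconds per transform */
}

So a 2^20-point complex transform at, say, 1000 MFLOPS works out to 5 * 1048576 * 20 / 1e9, roughly 0.1 seconds.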

What is particularly interesting are the shorter 1-D transform lengths that are faster than anything you might pad the data up to.

Assuming that your system originates real data make sure you use the real to complex conjugate symmetric forms of FFT. They are faster.
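For what it is worth, a minimal sketch of the real-to-complex path in FFTW (names are from fftw3.h; the plan-once, execute-many split is the part that matters for a real-time system):

#include <fftw3.h>

/* Real input of length n gives only n/2 + 1 complex output bins,
   which is where the speed advantage of the r2c transform comes from.
   Plan once at startup; note that FFTW_MEASURE scribbles on the
   arrays while it times candidate plans. */
void demo_r2c(int n)
{
    double       *in  = fftw_malloc(sizeof(double) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (n / 2 + 1));
    fftw_plan     p   = fftw_plan_dft_r2c_1d(n, in, out, FFTW_MEASURE);

    /* ... fill in[] with real samples each frame, then: */
    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
}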

For very long arrays the performance asymptotically approaches a memory-bandwidth limit of around 1000 MFLOPS on a P4 3GHz.

If you are thinking of using FFTW in a time-critical environment then you will need to let it run and develop a custom set of wisdom for the particular hardware as part of commissioning. There can be enough differences (e.g. in RAM make or size) between PCs notionally of the same batch that wisdom from one machine is not always optimal for another.

Unless FFTW is allowed to use the right amount of the right wisdom you will get lacklustre results. Some reorderings that it does are not obvious to human experts. It is just observed that the codelets running in a particular order happen to be faster. It tries all permutations with commendable patience. The cache interactions are far from intuitive on larger transforms with multiple factors.
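As a sketch of how that fits into commissioning (the file name and routine names are only for illustration, and the *_wisdom_to_filename calls need FFTW 3.3 or later if I remember rightly): let the planner run with FFTW_PATIENT once on the target box, save the wisdom, and reload it at every startup so production planning is both fast and optimal.

#include <stdio.h>
#include <fftw3.h>

/* Run once per machine during commissioning. */
void generate_wisdom(int n, double *in, fftw_complex *out)
{
    fftw_plan p = fftw_plan_dft_r2c_1d(n, in, out, FFTW_PATIENT);
    fftw_export_wisdom_to_filename("fft_wisdom.dat");
    fftw_destroy_plan(p);
}

/* Run at every startup on that same machine. */
void load_wisdom(void)
{
    if (!fftw_import_wisdom_from_filename("fft_wisdom.dat"))
        fprintf(stderr, "no wisdom file - planner falls back to measuring\n");
}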

It is no longer true that power-of-two transforms are faster on modern CPU architectures. Steve picked me up on that a while back. The explanation is not in the algorithm (which should be faster) but in a defect in the Intel associative cache implementation.

Regards, Martin Brown

Reply to
Martin Brown

I was pointing out to JKK that his simple "divide" statement was wrong, as should have been obvious to him from the shapes of the graphs. That's all.

As an engineer who wants to design a realtime system, actual runtimes are what matter to me. That can be found in the raw data files, pretty much.

John

Reply to
John Larkin

Does PowerBasic have a Generate Assembly Code Output option, similar to the -S option on gcc?

That might yield some clues for your C guy.

Michael

Reply to
mrdarrett

I don't think so. I can hang a label anywhere, get the address of the label, and peek/print the hex code I find there, or sic a debugger/disassembler on that address in the EXE file.

It also allows inline assembly, to sort of work in the other direction.

We can see the assembly version of the C programs, so that's where we tweak.

It's looking like we're running mostly memory-speed-limited, so code tweaks are not going to help a lot. Looks like we're going to run the 64M sum in maybe 0.22 seconds on the embedded system, which is good enough.

John

Reply to
John Larkin

Or half a dozen lines of assembler for the innermost loop.

I find it hard to believe that any 32-bit native-code compiler could fail to do the job in under 5s on a P4 3GHz or better. The MSC build with optimisation disabled and full debug information was about as sloppy and flaccid as it is possible to be, and it still managed to do the loop in 0.4s.
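The loop itself is nothing exotic - something like the sketch below (the 16-bit sample type and 64-bit accumulator are my guesses at what John is summing). At 64M 16-bit samples in 0.22s it streams about 128 MB / 0.22s, call it 600 MB/s, so it is the memory system being exercised rather than the ALU.

#include <stdint.h>
#include <stddef.h>

/* Trivial streaming sum - any optimising compiler should turn this
   into a loop paced by memory bandwidth rather than by the adds. */
int64_t sum_samples(const int16_t *buf, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += buf[i];
    return acc;
}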

The Intel profile directed optimiser might find something better than any of the rival compilers. I think you can evaluate it for free.

OK. So generate an "int 3" debug trap and catch it in the runtime debugger; then you can see exactly what instructions the mystical PowerBasic has produced. I expect it is a variant of one of the two forms I posted.

It would be interesting to see the code for the C version that runs so slowly and to know the nature of the CPU it is running on. The last Intel Pentium that was really tetchy about caching was the P2. You surely cannot be designing that into a new product?

Almost any decent optimising compiler should get it down to around 0.22s with DDR2 ram. There could be an advantage in using DDR3 for this app.

Regards, Martin Brown

Reply to
Martin Brown

I'm curious to know the runtimes on the same machine: PowerBasic and a simple unoptimized C version.

Free DOS/Windows Borland C++ command line compiler here, if gcc is unavailable on your PowerBasic machine:

formatting link
from
formatting link

Michael

Reply to
mrdarrett

Some do and some don't.

Pretty much. Accuracy ranges from not much better than 1/3 of a digit to, usually, the right order of magnitude. A difference of less than 50% on a benchmark normally does not mean much.

Not the good ones. The ones that are need re-education.

Reply to
JosephKK
