PowerBasic rocks!

Megaflops = millions of floating-point operations per second.

Reply to
Greg Neill

Obviously. But how do you translate that into execution time in seconds?

John

Reply to
John Larkin

If you know the algorithm (fft) you can look up the number of floating point operations for a given dataset size.

Reply to
Greg Neill

You need to know the floating-point performance of the system on which the code is run. They can't give an absolute time when they don't know whether you're running the code on a 66MHz 486 or a 3GHz Xeon. They can only tell you how many instructions it will take; you need to divide that by the rate at which your system executes instructions.

Reply to
Nobody

Obviously not. Look at this one:

formatting link

They specify exactly the CPU, clock speed, OS, and RAM available.

Now look at the shape of the curves.

John

Reply to
John Larkin

The graphs depict the performance for given benchmarks on the given architecture. The shape of the curves has a lot to do with the available cache and its performance, and how much of the data can reside in the cache. Things get even more interesting when secondary storage gets involved along with the operating system.

This is why organizations have their actual codes benchmarked on prospective vendors' machines during RFPs -- peak performance numbers (which vendors like to quote in marketing material) can fall apart on real-world production data sets with real-world codes. Everybody has excellent Linpack numbers...

Reply to
Greg Neill

So, pick a point on any graph and tell me the runtime in seconds.

John

Reply to
John Larkin

Nice. Intel clearly knows what the hell they are doing.

Reply to
Abbey Somebody

Without knowing the details of the individual algorithms (which I suppose I could look up, but I'm a lazy bastard this evening), I couldn't say; I'd need the total flop count for a given problem size and code.

The practical way to use the graphs is to get hold of the same codes and run them on your own machine. If one of the codes "looks like" an application of yours (or at least a major kernel of your application), then you can compare performances to estimate roughly how their machine compares to yours in terms of time to solution.

Reply to
Greg Neill

Do they? What do the curves mean?

John

Reply to
John Larkin

You do not know what Megaflops are or powers of 2?

Reply to
Abbey Somebody

You could always read the index page where they *define* their MFLOPS in terms of FFT execution time in us.

formatting link

"In the pages below, we plot the "mflops" of each FFT, which is a scaled version of the speed, defined by:

mflops = 5 N log2(N) / (time for one FFT in microseconds)

(divided by 2 for real-data FFTs), where N is the number of data points (the product of the FFT dimensions). See the methodology page for more detail."

FFTW is a very cache-aware algorithm implementation. You expect the curves to reflect the behaviour of the particular CPU's multilevel cache, and also to show up any peculiarities of the caching algorithm - a weakness in Intel's associative cache penalises FFTs with certain stride lengths.

MFLOPS isn't a bad measure for this benchmarking since it shows the relative performance of the CPU and algorithm. Otherwise you have a graph of execution time that climbs steeply like N log N, which is a known systematic property of the algorithm. What they want to emphasise on these graphs is what fraction of theoretical ideal performance can be realised in practice as a function of increasing array size.

Each time a cache boundary is crossed performance decreases abruptly.

Regards, Martin Brown

Reply to
Martin Brown

Why presume anything? Take it into the debugger and look at a disassembly of the generated code.

If this simplification *is* genuinely faster then the PowerBasic code generator is not as good as you think it is. All the other optimising compilers have the main loop down to 4 or 5 instructions and adding a few NOPs to the loop does not change the timing at all (since it is memory bandwidth limited). The MSC 2003 compiler had it down to:

ZOT:    movsx eax, [esi+2*edx]
        add   [edi+4*edx], eax
        inc   edx
        loop  ZOT

Indexing on ecx was no faster.

The older 97 compiler did it by adding 2 and 4 to the index registers in the loop. There isn't a lot of scope for instruction scheduling here but hand tuning can gain a small advantage the compiler cannot see.
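For reference, the C loop those four instructions implement (my reading of the disassembly, so the names below are invented) is just accumulating sign-extended 16-bit samples into 32-bit bins:

```c
#include <stdint.h>
#include <stddef.h>

/* What the 4-instruction loop does: sign-extend each 16-bit sample
   (movsx) and add it into a 32-bit accumulator (add), stepping one
   index per iteration (inc edx / loop). The compiler scales the
   index by 2 and 4 in the addressing modes. */
void accumulate(int32_t *acc, const int16_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        acc[i] += src[i];
}
```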

Let's see how your precious PowerBasic handles a cache-aware implementation. Then we can see if its optimiser is any good.

Regards, Martin Brown

Reply to
Martin Brown

Of course I do. But what I want to know is execution time of an actual FFT.

Calculate one for me.

John

Reply to
John Larkin

So all one has to do is read the mflops points off the graph, plug them all into this equation, and solve for values of T.

Which makes my original point that FFTW is annoying.

John

Reply to
John Larkin

The GCC compiler did it in 5 ops for the up-count case, with an immediate compare to check for done. I'll have my guy try it again, counting down to zero.

I haven't tried to disassemble the PowerBasic output. The bottom line is execution time anyhow.

The loop is not totally bus/ram speed limited. Computing matters some, too.

I wonder if cache alignment matters. I'll have to try that, too.

John

Reply to
John Larkin

On a sunny day (Mon, 18 May 2009 06:45:43 -0700) it happened John Larkin wrote in :

FFTW is actually quite good. I have used it. For image processing, on a Duron 950:

720 points forward   189.592896 us
720 points backward  178.331604 us
360 points forward    88.463379 us
360 points backward   83.996216 us

Not powers of 2.... And a very small cache:

model name : AMD Duron(tm) Processor
cpu MHz    : 959.482
cache size : 64 KB

How long does a 720 point FFT take in PowerBASIC?

Reply to
Jan Panteltje

Wouldn't a left shift or two be a bit faster than multiplying edx by 2 and by 4?

Regards,

Michael

Reply to
mrdarrett

It is a shift; multiplication is just the assembler syntax. Indices can't be multiplied by arbitrary scale factors, only by 1, 2, 4 or 8.

Reply to
Nobody

It isn't CPU bound. The code spends all its time waiting for memory. The unoptimised code with full debugging enabled is only 33% slower!

Even if it did a slow multiply it would not alter the timings!!!

You can add several NOPs to this loop and it will not alter execution time. The right number added might in some borderline circumstances even speed it up!

However, you will often find compilers these days generating code that uses load effective address to compute small multiplications.

LEA EAX, [EAX+2*EAX]   // 3x
LEA EAX, [EAX+4*EAX]   // 5x
etc.

The internal hardware implementation is obviously a shift, but that doesn't mean that the notation used in the assembler has to be written in cryptic micromanaged C style. Most compilers are smart enough to turn power of two multiplications into shifts and have been for a decade.

The old way of doing this loop with incremented pointers benchmarks as just marginally faster by ~1%. The optimised cache-aware algorithm is 10-15% faster than the simple loop depending on the CPU and compiler.

Regards, Martin Brown

Reply to
Martin Brown
