PowerBasic rocks!

Megaflops = millions of floating-point operations per second.

Reply to
Greg Neill

Obviously. But how do you translate that into execution time in seconds?

John

Reply to
John Larkin

If you know the algorithm (fft) you can look up the number of floating point operations for a given dataset size.

Reply to
Greg Neill

You need to know the floating-point performance of the system on which the code is run. They can't give an absolute time when they don't know whether you're running the code on a 66MHz 486 or a 3GHz Xeon. They can only tell you how many instructions it will take; you need to divide that by the rate at which your system executes instructions.

Reply to
Nobody

Obviously not. Look at this one:

formatting link

They specify exactly the CPU, clock speed, OS, and RAM available.

Now look at the shape of the curves.

John

Reply to
John Larkin

The graphs depict the performance for given benchmarks on the given architecture. The shape of the curves has a lot to do with the available cache and its performance, and how much of the data can reside in the cache. Things get even more interesting when secondary storage gets involved along with the operating system.

This is why organizations have their actual codes benchmarked on prospective vendors' machines during RFPs -- peak performance numbers (which vendors like to quote in marketing material) can fall apart on real-world production data sets with real-world codes. Everybody has excellent Linpack numbers...

Reply to
Greg Neill

So, pick a point on any graph and tell me the runtime in seconds.

John

Reply to
John Larkin

Nice. Intel clearly knows what the hell they are doing.

Reply to
Abbey Somebody

Without knowing the details of the individual algorithms (which I suppose I could look up, but I'm a lazy bastard this evening), I couldn't say; I'd need the total flop count for a given problem size and code.

The practical way to use the graphs is to get hold of the same codes and run them on your own machine. If one of the codes "looks like" an application of yours (or at least a major kernel of your application), then you can compare performances to estimate roughly how their machine compares to yours in terms of time to solution.

Reply to
Greg Neill

Do they? What do the curves mean?

John

Reply to
John Larkin

You do not know what Megaflops are or powers of 2?

Reply to
Abbey Somebody

You could always read the index page where they *define* their MFLOPS in terms of FFT execution time in us.

formatting link

"In the pages below, we plot the "mflops" of each FFT, which is a scaled version of the speed, defined by:

mflops = 5 N log2(N) / (time for one FFT in microseconds)

(divided by 2 for real-data FFTs), where N is the number of data points (the product of the FFT dimensions). See the methodology page for more detail."

FFTW is a very cache-aware algorithm implementation. You expect the curves to reflect the behaviour of the particular CPU's multilevel cache, and also to show up any peculiarities of the caching algorithm - a weakness in Intel's associative cache penalises FFTs with certain stride lengths.

MFLOPS isn't a bad measure for this benchmarking since it shows the relative performance of the CPU and algorithm. Otherwise you have a graph of execution time that climbs steeply like N log N, which is a known systematic property of the algorithm. What they want to emphasise on these graphs is what fraction of theoretical ideal performance can be realised in practice as a function of increasing array size.

Each time a cache boundary is crossed performance decreases abruptly.

Regards, Martin Brown

Reply to
Martin Brown

Why presume anything? Take it into the debugger and look at a disassembly of the generated code.

If this simplification *is* genuinely faster then the PowerBasic code generator is not as good as you think it is. All the other optimising compilers have the main loop down to 4 or 5 instructions and adding a few NOPs to the loop does not change the timing at all (since it is memory bandwidth limited). The MSC 2003 compiler had it down to:

ZOT:    movsx eax, [esi+2*edx]
        add   [edi+4*edx], eax
        inc   edx
        loop  ZOT

Indexing on ecx was no faster.

The older 97 compiler did it by adding 2 and 4 to the index registers in the loop. There isn't a lot of scope for instruction scheduling here but hand tuning can gain a small advantage the compiler cannot see.
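For reference, the C loop those four instructions implement (my reading of the disassembly, so the names below are invented) is just accumulating sign-extended 16-bit samples into 32-bit bins:

```c
#include <stdint.h>
#include <stddef.h>

/* What the 4-instruction loop does: sign-extend each 16-bit sample
   (movsx) and add it into a 32-bit accumulator (add), stepping one
   index per iteration (inc edx / loop). The compiler scales the
   index by 2 and 4 in the addressing modes. */
void accumulate(int32_t *acc, const int16_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        acc[i] += src[i];
}
```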

Let's see how your precious PowerBasic handles a cache-aware implementation. Then we can see if its optimiser is any good.

Regards, Martin Brown

Reply to
Martin Brown

Of course I do. But what I want to know is execution time of an actual FFT.

Calculate one for me.

John

Reply to
John Larkin

So all one has to do is read the mflops points off the graph, plug them all into this equation, and solve for values of T.

Which makes my original point that FFTW is annoying.

John

Reply to
John Larkin

The GCC compiler did it in 5 ops for the up-count case, with an immediate compare to check for done. I'll have my guy try it again, counting down to zero.

I haven't tried to disassemble the PowerBasic output. The bottom line is execution time anyhow.

The loop is not totally bus/ram speed limited. Computing matters some, too.

I wonder if cache alignment matters. I'll have to try that, too.

John

Reply to
John Larkin

On a sunny day (Mon, 18 May 2009 06:45:43 -0700) it happened John Larkin wrote in :

FFTW is actually quite good. I have used it. For image processing, on a Duron 950:

720 points forward   189.592896 us
720 points backward  178.331604 us
360 points forward    88.463379 us
360 points backward   83.996216 us

Not powers of 2.... And a very small cache:

model name : AMD Duron(tm) Processor
cpu MHz    : 959.482
cache size : 64 KB

How long does a 720 point FFT take in PowerBASIC?

Reply to
Jan Panteltje

Wouldn't a left shift or two be a bit faster than multiplying edx by 2 and by 4?

Regards,

Michael

Reply to
mrdarrett

It is a shift; multiplication is just the assembler syntax. Indices can't be multiplied by arbitrary scale factors, only by 1, 2, 4 or 8.

Reply to
Nobody

It isn't CPU bound. The code spends all its time waiting for memory. The unoptimised code with full debugging enabled is only 33% slower!

Even if it did a slow multiply it would not alter the timings!!!

You can add several NOPs to this loop and it will not alter execution time. The right number added might in some borderline circumstances even speed it up!

However, you will often find compilers these days generating code that uses load effective address to compute small multiplications.

LEA EAX, [EAX+2*EAX]   // 3x
LEA EAX, [EAX+4*EAX]   // 5x
etc.

The internal hardware implementation is obviously a shift, but that doesn't mean that the notation used in the assembler has to be written in cryptic micromanaged C style. Most compilers are smart enough to turn power of two multiplications into shifts and have been for a decade.

The old way of doing this loop with incremented pointers benchmarks as just marginally faster by ~1%. The optimised cache-aware algorithm is 10-15% faster than the simple loop depending on the CPU and compiler.

Regards, Martin Brown

Reply to
Martin Brown
