PowerBasic rocks!

Not literally all of the time - it computes between cache misses. Small loop improvements seem to help a little.

John

Reply to
John Larkin

The thing is almost totally stuck in molasses, waiting on slow memory accesses, given this unfavourable one-pass sequential access pattern.

You don't even know what code was generated by your mystical PowerBasic. You are stabbing in the dark. Though it may still be fast enough to be useful on current CPUs.

I will concede that the old traditional incremental-pointer version produced by optimising compilers of the MSC 97 era is marginally faster (by about 1% - bordering on unmeasurable without instrumentation) - and it is only faster after a manual reordering of the generated code.

ZOT:  movsx eax, word ptr [esi]   ; sign-extend 16-bit sample
      add   esi, 2                ; advance 16-bit source pointer
      add   [edi], eax            ; accumulate into 32-bit bin
      add   edi, 4                ; advance 32-bit destination pointer
      loop  ZOT

(The MSC 97 generated code had both additions at the end of the loop.) With the reordering, both additions are effectively free.
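
For reference, a minimal C sketch of that inner loop (illustrative names only, not anyone's actual source); any decent optimiser should boil it down to roughly the five instructions above:

#include <stddef.h>

/* Accumulate 16-bit samples into 32-bit bins, incremental-pointer style. */
void accumulate(const short *src, int *acc, size_t n)
{
    while (n--)
        *acc++ += *src++;   /* movsx + add, pointers bumped by 2 and 4 */
}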

I am still waiting to see you code a cache aware version in PowerBasic that will prove if their code generator is any good or not.

Regards, Martin Brown

Reply to
Martin Brown

It's not mystical. Its output is just about as fast as any optimized C we've run. The first time we both did it, the standard PBCC output was about 4x faster than the C, which was a message to my C guy that he needed to work on it. If I hadn't run the Basic, I might have accepted the initial C speed as realistic. He fiddled with optimizations until he got close.

I don't know if there's any way to ping the cache to make it prefetch blocks before they're needed. Even if there is, the real-time execution speed increase is likely to be small, since we're mostly thrashing DRAM.
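
If there is, it's the x86 prefetch hint. A sketch using the SSE intrinsic (hypothetical names, assuming a compiler that ships <xmmintrin.h>) would look something like this - though on a purely sequential scan the hardware prefetcher mostly does this already, which is another reason the gain would be small:

#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, available with SSE */

void accumulate_prefetch(const short *src, int *acc, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Ask for data a few cache lines ahead of the current position. */
        _mm_prefetch((const char *)(src + i) + 256, _MM_HINT_T0);
        acc[i] += src[i];
    }
}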

At any rate, the production instrument will run Linux and the signal averager will be in C. All my C guy has to do, to make things work, is write a routine that runs as well as my PowerBasic does.

I suppose I could drop some inline assembly into my Basic program, but it's not worth the trouble right now. I do that sometimes for stuff like PCI BIOS calls.

John

Reply to
John Larkin

You're just demonstrating that you are still clueless about this.

The whole point is that the interaction between the algorithm and the system architecture is too complex to make any such predictions - unless you have the exact hardware and datasets that were benchmarked. All you can do is run the FFTW optimisation code on your own hardware and use the results to make some sensible decisions about FFT size, etc. Thankfully, people who understand the architecture complexities better than you do have automated most of that process. I find it impressive and instructive; you seem to find it neither.
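
That optimisation step is just FFTW's planner. A minimal sketch with the standard FFTW3 API - FFTW_MEASURE actually runs and times candidate strategies on your machine instead of guessing:

#include <string.h>
#include <fftw3.h>

int main(void)
{
    const int N = 1 << 20;   /* a million-point complex FFT */
    fftw_complex *buf = fftw_malloc(sizeof(fftw_complex) * N);
    memset(buf, 0, sizeof(fftw_complex) * N);

    /* FFTW_MEASURE benchmarks several decompositions on this exact
       hardware and keeps the fastest; FFTW_ESTIMATE would just guess. */
    fftw_plan p = fftw_plan_dft_1d(N, buf, buf, FFTW_FORWARD, FFTW_MEASURE);

    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(buf);
    return 0;
}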

Clifford Heath.

Reply to
Clifford Heath

But the exact systems, CPUs, clock rates, OS's, compiler, compiler options, and RAM *are* specified for these graphs.

formatting link

but the speeds are still in mflops.

Did you miss that somehow?

> All you can do is run the FFTW optimisation code on your own hardware ...

No, you can take their mflops numbers and plug them into an equation - run backwards - and theoretically convert that to runtime, as others here have pointed out.
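
Something like this hypothetical helper would do it, using the flop convention they state for complex transforms (mflops = 5 N log2(N) / time in microseconds):

#include <math.h>   /* log2, C99 */

/* Rough runtime in seconds of an n-point complex FFT,
   given a published mflops figure. Illustrative only. */
double fft_seconds(double mflops, double n)
{
    return 5.0 * n * log2(n) / (mflops * 1e6);
}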

If the only test is experiment, why have the graphs?

> Thankfully, people who understand the architecture complexities better than you do have automated most of that process.

I'd prefer they just post the runtimes in seconds.

If you ask the simple question (of, say, Google) to the effect of "about how long does it take to run a million-point single-precision real-to-complex FFT on a 2 GHz Pentium-class computer," it's very hard to find any hard numbers. Why?

John

Reply to
John Larkin

A nearly meaningless definition of their own, only useful for indicating relative orders of magnitude.

Theoretically, yes - and you'll get an order-of-magnitude result.

Because it illustrates the variability, and reinforces that you should test using your own data and hardware.

I agree, but you'd still need to test things yourself. Your hardware might run 3-5x faster or slower on the same benchmark, even given the same cache and cache size, width and speed, and the same CPU speed.

Because the hard numbers that matter to you are the ones you get from installing their code and running it on your hardware. How hard is that?

Clifford Heath.

Reply to
Clifford Heath

Because they are meant to show the relative efficiency of the algorithms. If they used absolute timings, the graphs would be dominated by the n.log(n) growth, when what they're trying to show is the scale factor.

There's no point unless you are using the exact same system they used for the tests.

Because it provides a rough guide to how various FFT implementations perform relative to each other.

Probably because there isn't enough of a market to make it worth anyone's while running extensive FFT benchmarks on every single hardware platform in existence.

You aren't likely to get numbers for the exact system you're targeting, and even a "close" system is different enough that the numbers would be meaningless for your purposes.

Reply to
Nobody


Can be, doesn't have to be.

You are talking way beyond your knowledge here; worse, you could study a few datasheets and know better.

Not at all. With average random access times of 35 ns, half a billion accesses doesn't take long. But you are right about the advantages of sequential accesses; the performance jumps 10X or better.

Harvard architecture does not even matter for this issue.

Wrong again; two or three blocks may actually be slower than one. There are a lot of prefetch and cache dependencies involved. Without knowing the actual algorithms involved, both of us are just guessing.

Reply to
JosephKK

FFTW ran a lot of them, plenty enough to give people a pretty good idea of what they can accomplish. It sure looks like they actually did timings and then plugged the results into their mflops-v-time equation to convert to mflops.

That's crazy. The systems are exactly specified. If I use the same CPU at the same clock speed with the same RAM and run the same code, I can expect runtimes to be consistent to within a percent at least, allowing for the expected measurement jitter from timer interrupts and such. These are clocked digital systems, for Pete's sake.

It sometimes matters whether an FFT will take 40 milliseconds or 40 minutes.

John

Reply to
John Larkin


You're a decade or two behind the times here.

All modern DRAM can do a relatively slow setup followed by fast block streaming. Blocks of RAM are slammed into and out of cache. It's been like that for some years.

formatting link

"It is worth noting that the improvement [in random access time] over

11 years is not that large. Minimum random access time has improved from tRAC = 50 ns to tRCD + tCL = 23.5 ns, and even the premium 20 ns variety is only 2.5× better. CAS latency has improved even less, from tCAC = 13 ns to 10 ns. However, the DDR3 memory does achieve 32 times higher bandwidth; due to internal pipelining and wide data paths, it can output two words every 1.25 ns (1600 Mword/s), while the EDO DRAM can output one word per tPC = 20 ns (50 Mword/s)."

Note the 20 ns random access time but the 625 ps word rate in streaming mode - generally 64 bits wide. My BASIC add loop does, in effect, a memory access about every nanosecond.

Last time I checked, 1/2 billion times 35 ns was 17 seconds. I'm doing my add loop in 220 milliseconds. Closer to 100x than to 10.
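
A crude way to see that gap on a particular box is a toy benchmark along these lines (sizes and stride are illustrative; the strided pass fetches a fresh cache line for every add, the sequential pass gets sixteen adds per line):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)   /* 16M ints = 64 MB, far larger than cache */

int main(void)
{
    int *a = malloc((size_t)N * sizeof *a);
    long long sum = 0;
    clock_t t0;

    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = (int)i;

    t0 = clock();   /* sequential: streams whole cache lines */
    for (size_t i = 0; i < N; i++) sum += a[i];
    printf("sequential: %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    t0 = clock();   /* strided by 64 bytes: one line fetched per add */
    for (size_t j = 0; j < 16; j++)
        for (size_t i = j; i < N; i += 16) sum += a[i];
    printf("strided:    %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    return (int)(sum & 1);   /* keep the sums from being optimised away */
}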

No, I'm writing code based on my knowledge of the architecture and measuring execution times. I don't know what the hell you're doing.

John

Reply to
John Larkin

You're still clueless, and completely not getting what we're saying.

You won't be able to buy the exact same board, and *any* change in *any* chip, especially in the multi-level memory hierarchy, can have a seriously non-linear effect on the performance.

There are cliff-edges everywhere, and if you drop off one and fall into cache cycling, you lose an order of magnitude or more. That's why fftw has an adaptive optimiser that actually runs all its options to see how the times come out.

Clifford Heath.

Reply to
Clifford Heath

Well, then I suppose all software runtimes vary by "an order of magnitude or more" even in a dedicated system. So all benchmarks are worthless. I bet the RTOS guys are going to be surprised.

John

Reply to
John Larkin

For a smart guy, you're remarkably resistant to education. Try reading "What Every Programmer Should Know About Memory" (Drepper) and see if you still think things are so simple and deterministic.

Clifford Heath.

Reply to
Clifford Heath

On Tue, 19 May 2009 17:04:03 -0700, John Larkin wrote:

Actually, they do report execution times in their raw data files. This is from the Pentium4gcc file...

library    problem  size    mflops   time (s)       setup (s)
intel-mkl  srif     2       12.8     3.90625e-07    0.001793
intel-mkl  srib     2       11.907   4.1992187e-07  0.001785
intel-mkl  srif     4       33.032   6.0546875e-07  0.001794
intel-mkl  srib     4       35.556   5.625e-07      0.001798
intel-mkl  srif     8       89.302   6.71875e-07    0.001785
intel-mkl  srib     8       91.429   6.5625e-07     0.001819
intel-mkl  srif     16      254.41   6.2890625e-07  0.001811
intel-mkl  srib     16      221.41   7.2265625e-07  0.001815
intel-mkl  srif     32      512      7.8125e-07     0.00179
intel-mkl  srib     32      457.14   8.75e-07       0.001812
intel-mkl  srif     64      808.42   1.1875e-06     0.001841
intel-mkl  srib     64      890.43   1.078125e-06   0.001818
intel-mkl  srif     128     1303.3   1.71875e-06    0.001835
intel-mkl  srib     128     1280     1.75e-06       0.001832
intel-mkl  srif     256     1962.2   2.609375e-06   0.001823
intel-mkl  srib     256     1927.5   2.65625e-06    0.001835
intel-mkl  srif     512     2474.1   4.65625e-06    0.001849
intel-mkl  srib     512     2441.3   4.71875e-06    0.001856
intel-mkl  srif     1024    2786.4   9.1875e-06     0.001896
intel-mkl  srib     1024    2786.4   9.1875e-06     0.001885
intel-mkl  srif     2048    2944.8   1.9125e-05     0.001949
intel-mkl  srib     2048    2906.8   1.9375e-05     0.001967
intel-mkl  srif     4096    3171.1   3.875e-05      0.002073
intel-mkl  srib     4096    3110.9   3.95e-05       0.002096
intel-mkl  srif     8192    2991.5   8.9e-05        0.002604
intel-mkl  srib     8192    3132.2   8.5e-05        0.002376
intel-mkl  srif     16384   2654.8   0.000216       0.003488
intel-mkl  srib     16384   2594.8   0.000221       0.003499
intel-mkl  srif     32768   2354     0.000522       0.005221
intel-mkl  srib     32768   2140.8   0.000574       0.005222
intel-mkl  srif     65536   1710     0.001533       0.011373
intel-mkl  srib     65536   1452.3   0.001805       0.011243
intel-mkl  srif     131072  1046.1   0.005325       0.013426
intel-mkl  srib     131072  875.05   0.006366       0.013497
intel-mkl  srif     262144  753.77   0.01565        0.029185
intel-mkl  srib     262144  677.92   0.017401       0.029118

I think it did the last 256K complex FFT in 17.4 milliseconds. Hooray! A number!

The trend looks like N*ln(N).

On another machine/OS, it runs in 31 milliseconds.

John

Reply to
John Larkin

Naaaah! Larkin is just resistant to admitting he might be wrong. Leaf back thru past threads to see the consistent defensive stance. Larkin is so persistently defensive that it wouldn't surprise me to see him appointed to the Obama obfuscation team ;-)

...Jim Thompson

--
| James E.Thompson, P.E.                           |    mens     |
| Analog Innovations, Inc.                         |     et      |
| Analog/Mixed-Signal ASIC's and Discrete Systems  |    manus    |
| Phoenix, Arizona  85048    Skype: Contacts Only  |             |
| Voice:(480)460-2350  Fax: Available upon request |  Brass Rat  |
| E-mail Icon at http://www.analog-innovations.com |    1962     |

Stormy on the East Coast today... due to Bush's failed policies.
Reply to
Jim Thompson


I browsed through quite a few. After considering a bit, it all looks quite normal to me.

Reply to
JosephKK


Thank you, Martin. Much clearer than what I was considering saying.

Reply to
JosephKK

FWIW, the page says that the tests were done by IBM.

Sure; because the resulting graphs are more useful, as they remove the n.log(n) and just show the scale factor.

But so few people will be using *exactly* the same system that the absolute timings aren't useful.

Aren't you looking at this for an embedded system?

So time it.

Reply to
Nobody

Well, duh. That's the asymptotic complexity of an FFT, which is why they plot:

mflops = 5 N log2(N) / (time for one FFT in microseconds)
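
Run backwards, that gives the runtime: time in microseconds = 5 N log2(N) / mflops. One caveat, judging from the table above: for the "sr" (single-precision real) transforms, fftw appears to count 2.5 N log2(N) instead - half the complex figure. With that, the numbers check out:

  time = 2.5 * 262144 * log2(262144) / 677.92
       = 2.5 * 262144 * 18 / 677.92
       ≈ 17,400 microseconds ≈ 17.4 ms

which matches the 0.017401 s in the srib 262144 row.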

Reply to
Nobody

Yup. It's a controller for a line of megabuck+ instruments.

There's a meta-problem above the mere electronics. We get this sort of business by writing proposals. A proposal offers a firm price and a promise of performance. If we think the bleeding edge of performance at the price point is X amount of throughput, we can't offer the customer X - that's too risky - but we can't offer him X/10 either - that's guaranteed to not produce a purchase order. And we can't build a prototype to see what actual performance will be.

So all you can do is research, experiment maybe a little for a week or so if you can, think, take a deep breath, and start typing. Since throughput matters on a system that's this expensive to own, it helps enormously to see what actual ZZ-GHz Pentiums and Spartan-XIV FPGAs can do in real life, so as to make suitably calibrated promises. We have several X-factors that seem to be hovering around 1.6. And they all meet up at the front-side bus.

Unfortunately, not a lot of stuff is available on execution times of software, or on the actual throughput capability of, say, a PCI Express bridge chip or a striped disk array in a real system environment. We are committed to signal averaging, and it looks like we have enough horsepower. We might offer to do FFTs on this system too, and it's looking like a million-point FFT might be squeezed in every second or two.

The next-higher meta-level is that you have to over-book on proposals, sort of like over-booking on an airline. If 0% of your proposals produce orders, you're in big trouble. If 100% produce orders, you're in big trouble.

We will, when we have time and the real hardware and OS humming along. We just got the *&^%$#!! PLX drivers to work. But I figure that if the fftw people can do something in 17 milliseconds, we should be able to get close (2:1 maybe?) with other stuff going on. But if their time was 40 minutes, we wouldn't bother. Their posted times are, incidentally, the *best* of many runs, dancing around interrupts and such.

John

Reply to
John Larkin
