Processor question

Sure it does. Would you like an MS-DOS executable that runs standalone? I have many. It makes 8086 code, since the compiler is copyright 1985... I'm sure the compiler is awfully naive, though, just putting pieces together.

Well, to be completely specific, I looked into it, and it seems to run a general loop, holding the long (32-bit) integer in a memory location and making a far call (pushing values onto the stack) to compare the variable to the constant. Now, if far calls don't cost much, I would expect this to run maybe 20 times slower than the most optimized loop I can conceive of, but we're talking several orders of magnitude here.

Indeed. But as you may recall, optimization wasn't my question; it was much more general, which is why I asked here.

Tim

-- Deep Fryer: A very philosophical monk. Website @

formatting link

Reply to
Tim Williams

And therein lies the problem. Count exactly how many instructions it has to execute to get around the loop once. That will give you a rough idea.

RDTSC will give you a better measurement of timing.

The most optimised loop I can think of is a single LOOP instruction with the 32-bit register ECX containing the loop variable. Some optimising compilers will generate that on a good day.

Old x86 code has to compute 32-bit operations as pairs of 16-bit native ops, so it will be slower. That is a big overhead.

Cache structure really matters when you are handling bulk data that is large compared to the cache size(s) of the processor.

A modern CPU will fetch instructions in cache lines of typically 16 bytes, which means that small loops fit into the instruction cache on their first execution and stay there for the duration of the loop.

Regards, Martin Brown

Reply to
Martin Brown

Alright, well, I count 23 in the loop I observed. So naively I might assume the code runs about 20 times slower than the most optimal code, or even 40 or 80 times slower counting memory writes and such. But it seems to be a lot slower than that. A simple FOR i& = 1 TO 1000000: NEXT, interpreted, takes 7 seconds, evidently 4000 times slower than the assembly code I used (which was itself 4 opcodes).

And something else that's weird: the time taken seems nonlinear. A million took 7 seconds, but as I said in my original post, a billion took "over a minute", which is a whole lot less than a thousand times longer. But I don't see how the processor might be optimizing after a few dozen iterations, let alone a few million... huh, maybe load sharing in Windows at work? I may have to test this in DOS mode for total concentration...

Indeed. It still runs at nearly 1 opcode per clock cycle, so short jumps aren't a problem. I'm thinking long jumps are what really trash performance, but if, as you suggest, they may still be served from cache, then I don't know what would be costing so many orders of magnitude.

Tim


Reply to
Tim Williams

Also, main memory access is incredibly slow compared with cache. If your executable is writing through to main memory every time it stores the variable, that will take a while.

Cheers,

Phil Hobbs

Reply to
Phil Hobbs

Very true. Even with recent 1000 MB/s and faster memory interfaces, writes are typically 3 to 10 times slower than reads.

Reply to
JosephKK
