Processor question

I read that, most of the time, jumps make the processor flush the pipeline, or the cache, or the other cache, I suppose in order of severity (short / near / far jump). What kind of preprocessing and caching strategies are people mucking around with these days?

I thought of this question while contemplating QBasic. You see, even with over 1.8GHz available today (I say 1.8 because that's only what I'm sitting in front of now :) ), a simple "sit in the corner and count to a billion" takes QBasic over a minute! In contrast, a "somewhat" more optimized loop written in assembly does 2^32 in just seven seconds.

Tim

-- Deep Fryer: A very philosophical monk. Website @

formatting link

Reply to
Tim Williams

If I recall correctly, QBASIC is an interpreted language wherein the code is translated to an intermediate form that is then interpreted by a runtime environment. That would make it far slower than truly compiled code.

I doubt very much that your example could be running into any cache issues due to jumps (er, BASIC goto's).

There are versions of BASIC out there which can create compiled machine language, and they'll run rings around QBASIC.

Reply to
Greg Neill

It is often interpreted, but I don't think it's as simple (or as slow!) as a plain runtime interpreter. I don't know just what Microsoft put into the interpreter, but given the similar speed of interpreted vs. compiled programs, I would imagine the two are doing much the same thing.

You mean like, QBASIC? -- I'm using QuickBasic 4.5, which includes a compiler, and it makes bare naked .EXE files. So there. ;-) Executables usually run a little faster than in the IDE, but not by the orders of magnitude I've been seeing.

Tim

-- Deep Fryer: A very philosophical monk. Website @

formatting link

Reply to
Tim Williams

My HP, running PowerBasic (the PBCC version) takes 0.83 seconds for long integers, 9.8 seconds for double floats. That's respectively 830 ps and 9.8 ns per loop!

#COMPILE EXE

FUNCTION PBMAIN () AS LONG
    DEFDBL A-Z
    B = 1E9
    T# = TIMER
    FOR A = 1 TO B
    NEXT
    PRINT TIMER-T#
    INPUT A$
END FUNCTION

John

Reply to
John Larkin

It isn't being compiled to anything even remotely close to x86 native code then - almost certainly some kind of tokenised interpreted BASIC.

A real compiler to native code will get essentially the same performance as assembler for a simple empty loop (*). Possibly slightly better these days, as the best compilers will pick an instruction ordering that best fills the pipeline(s) for parallel execution and will account for branch prediction and branch target alignment. A loop running entirely inside one instruction cache line should stall only once, when the loop exits.

(*) a few are known to spot empty loops and optimise them away entirely.

Regards, Martin Brown

Reply to
Martin Brown

I've found that QBASIC, the version that shipped with DOS, seems to have a somewhat edgy relationship with Windows' idea of how to run a DOS program... it could be that the IDE is being slowed by its poor interaction with Windows in a way that a compiled program (presumably with little to no user interface) would not be.

Reply to
cs_posting

Could be. If all the modules are installed, then that would explain how a "Hello World!" can take up 30kB...

I debugged the bare FOR loop, and discovered it uses a FAR call for comparing the DWORD variable (which is moved out of memory to increment, back into memory, then pushed onto and off of the stack for the call). My question is, is the far call the fatal operation that slows it down, and why, in terms of today's hardware?

Tim

-- Deep Fryer: A very philosophical monk. Website @

formatting link

Reply to
Tim Williams

Simplest thing to do is step through the executing code to follow how it does it. An empty FOR loop should not be that hard to disassemble.

I have a sneaky feeling it will be along the lines of a chain of "stuff things on the stack, call a far routine, put things back in memory" for every atomic operation in your code, whereas a true compiler will put the variable into an appropriate register and generate a very tight loop.
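
To make the contrast concrete, here is a minimal C sketch of the two execution styles. The rt_* routine names are invented purely for illustration (the real QuickBasic runtime isn't public), but the shape matches the far-call pattern described above.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical "runtime routines", one far call per atomic operation,
   roughly analogous to what the disassembly shows. */
static void rt_load (int32_t *acc, int32_t *var) { *acc = *var; }
static void rt_inc  (int32_t *acc)               { (*acc)++; }
static void rt_store(int32_t *acc, int32_t *var) { *var = *acc; }
static int  rt_cmple(int32_t *acc, int32_t lim)  { return *acc <= lim; }

int main(void)
{
    int32_t a = 0, acc = 0;
    const int32_t limit = 1000000;     /* scaled down for the example */

    /* Interpreter style: every step is a call plus a memory round trip. */
    do {
        rt_load(&acc, &a);
        rt_inc(&acc);
        rt_store(&acc, &a);
    } while (rt_cmple(&acc, limit));

    /* Native-compiler style: the counter stays in a register and the
       whole loop compiles to a handful of instructions. */
    int64_t sum = 0;
    for (int32_t b = 0; b <= limit; b++)
        sum += b;

    printf("%d %lld\n", (int)a, (long long)sum);
    return 0;
}

The first loop pays a call, a stack frame and a memory round trip per atomic step; the second is the tight loop a real compiler would emit.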

Regards, Martin Brown

Reply to
Martin Brown

The historical execution time cost ratio between interpreted and compiled is centered at about 10:1 in favor of compiled. More recently that seems to have decreased to about 5:1 due to interpretive build environments and execution environments that use a fractional / partial JIT compiler. Java development / runtime environments are a fine example of this.

Reply to
JosephKK

So having established a bit about QuickBasic's behavior, what kind of preprocessing and caching strategies are people mucking around with these days?

Tim

-- Deep Fryer: A very philosophical monk. Website @

formatting link

Reply to
Tim Williams

Use PowerBasic.

John

Reply to
John Larkin

Hardware question != software answer...?

Maybe I phrased the OP too problematically...it was a curiosity... maybe I should get D from BC to write for me...

Tim

-- Deep Fryer: A very philosophical monk. Website @

formatting link

Reply to
Tim Williams

To solve what sort of problem? Cache-aware algorithms (and even self-tuning code) have been around for a while. FFTW (discussed in an FFT timing thread here recently) incorporates some state-of-the-art tricks.
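
As one concrete example of the genre, here is a hedged C sketch of loop blocking (cache tiling) applied to a matrix transpose; the matrix and tile sizes are arbitrary guesses that would normally be tuned to the actual cache, which is exactly the sort of thing FFTW's self-tuning automates.

#include <stddef.h>

#define N    1024
#define TILE 32      /* assumed tile size; tune to the real L1/line size */

/* Naive transpose: the writes to dst stride through memory N elements
   apart, so nearly every store touches a fresh cache line. */
void transpose_naive(double dst[N][N], const double src[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            dst[j][i] = src[i][j];
}

/* Blocked (cache-aware) transpose: work in TILE x TILE sub-blocks that
   fit in cache, so each line is reused before it gets evicted. */
void transpose_blocked(double dst[N][N], const double src[N][N])
{
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t jj = 0; jj < N; jj += TILE)
            for (size_t i = ii; i < ii + TILE; i++)
                for (size_t j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}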

You need to enable Read Time Stamp Counter (RDTSC) instructions (and/or detailed instruction profiling statistics if you are serious). Intel advises against relying on RDTSC counts, but unless you are inside a VM it is generally good enough timing for most practical purposes. Intro at:

formatting link
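
As a hedged illustration (not taken from that intro), timing with the GCC/Clang __rdtsc() intrinsic might look like the following; serialising instructions, core pinning and the other caveats Intel lists are all ignored here.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang, x86 only */

int main(void)
{
    volatile uint32_t j = 0;          /* volatile so the loop survives */

    uint64_t start = __rdtsc();
    for (uint32_t i = 0; i < 1000000u; i++)
        j = 3;
    uint64_t end = __rdtsc();

    /* Only meaningful if the TSC is invariant and the thread stays on
       one core; treat the result as a rough estimate. */
    printf("%llu cycles total, ~%.2f cycles/iteration (j=%u)\n",
           (unsigned long long)(end - start),
           (double)(end - start) / 1000000.0,
           (unsigned)j);
    return 0;
}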

Full performance monitoring statistics require a Ring-0 access driver; Cornell University has written one called ia32 (use at your own risk!).

Regards, Martin Brown

Reply to
Martin Brown

I mean hardware, like, what's on the chip, what flushes the cache, etc. ... and maybe anything that controls that (BIOS?!).

Tim

-- Deep Fryer: A very philosophical monk. Website @

formatting link

Reply to
Tim Williams

In most cases the cache controls itself with an LRU (least recently used) replacement algorithm.
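
For intuition only, here is a small C model of LRU replacement within one set of a 4-way set-associative cache. Real hardware usually approximates this with pseudo-LRU bits rather than per-line age counters, so treat it as a sketch of the policy, not of any particular chip.

#include <stdbool.h>
#include <stdint.h>

#define WAYS 4           /* assumed associativity for the illustration */

struct way {
    uint32_t tag;
    uint32_t age;        /* 0 = most recently used; UINT32_MAX = empty */
};

/* Access one set of a WAYS-way cache.  On a hit the matching line is
   refreshed; on a miss the oldest (least recently used) line is evicted
   and refilled.  Returns true on a hit. */
bool access_set(struct way set[WAYS], uint32_t tag)
{
    int hit = -1, lru = 0;

    for (int w = 0; w < WAYS; w++) {
        if (set[w].age != UINT32_MAX && set[w].tag == tag)
            hit = w;                   /* found the line */
        if (set[w].age > set[lru].age)
            lru = w;                   /* track the oldest (or an empty) way */
        if (set[w].age != UINT32_MAX)
            set[w].age++;              /* everything else gets older */
    }

    int use = (hit >= 0) ? hit : lru;  /* reuse the hit line, or evict */
    set[use].tag = tag;
    set[use].age = 0;                  /* now the most recently used */
    return hit >= 0;
}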

Reply to
JosephKK

The ideas behind caching haven't changed all that much, although caches have become bigger, hierarchical, and more tightly coupled to the CPU.

A slightly dated introduction to caching strategies is online at:

formatting link

Regards, Martin Brown

Reply to
Martin Brown

Of course the details of what happens depend upon processor, program structure, number of cache levels, and types of caches. In a processor with instructions and data in the same cache, jumps typically would have little effect compared to data access; or if a program jumps all over but in a continuing and well-cached cycle, icache contents could be static. If branch prediction is working (eg

formatting link

) then the pipeline won't completely flush or stall at each branch.

Thousands of different techniques for cache management, memory prefetch, branch prediction, speculative execution, superscalar instruction issue, out-of-order completion, multiprocessor coordination, and hundreds of other buzzword combinations have been studied and reported in detail during recent decades. The better studies start with instruction traces or process-lists from real systems and measure how the mix performs on a test machine, ie, a computer or program set up to simulate whatever discipline is being tested. Typically, it's easy to invent something that works great with one program or another, but difficult to come up with something that helps in all cases.
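
One cheap experiment along those lines, if you want to see branch prediction (rather than the cache) dominate a measurement: the same filtering loop over the same data runs noticeably faster once the data is sorted and the branch becomes predictable. A hedged C sketch:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N       1000000
#define REPEATS 100

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Sum only the elements >= 128.  The branch is essentially random for
   random data but almost perfectly predictable once the data is sorted,
   even though the cache footprint is identical in both cases. */
static long long filtered_sum(const int *v)
{
    long long sum = 0;
    for (int i = 0; i < N; i++)
        if (v[i] >= 128)
            sum += v[i];
    return sum;
}

int main(void)
{
    static int data[N];
    long long s1 = 0, s2 = 0;

    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    clock_t t0 = clock();
    for (int r = 0; r < REPEATS; r++)
        s1 += filtered_sum(data);          /* unpredictable branches */
    clock_t t1 = clock();

    qsort(data, N, sizeof data[0], cmp_int);

    clock_t t2 = clock();
    for (int r = 0; r < REPEATS; r++)
        s2 += filtered_sum(data);          /* predictable branches */
    clock_t t3 = clock();

    printf("unsorted %.3f s, sorted %.3f s (sums %lld %lld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t3 - t2) / CLOCKS_PER_SEC, s1, s2);
    return 0;
}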

Sometimes what appears to be a straightforward problem really is complicated. For example, main memory is slow compared to CPU ops; in some systems, thousands of CPU cycles go by while the CPU waits for a cache line to fill. One approach is use of memory barrier ops (eg

formatting link

) to decouple CPU and memory. Simple with one processor, hard to even understand with more than one, although see the refs of that article.
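
For a feel of what the barrier buys you, here is a minimal C11 sketch of the usual publish/consume pattern (assuming a C11 <threads.h> implementation; pthreads would do equally well). Without the release/acquire ordering, the consumer could legally observe the flag before the data.

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static int        payload;           /* plain data being published */
static atomic_int ready = 0;

static int producer(void *arg)
{
    (void)arg;
    payload = 42;                                   /* write the data...  */
    atomic_store_explicit(&ready, 1,
                          memory_order_release);    /* ...then raise flag */
    return 0;
}

static int consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                           /* spin until flagged */
    /* The release/acquire pair acts as the barrier: payload is guaranteed
       to be visible here.  With relaxed ordering it would not be. */
    printf("payload = %d\n", payload);
    return 0;
}

int main(void)
{
    thrd_t p, c;
    thrd_create(&c, consumer, NULL);
    thrd_create(&p, producer, NULL);
    thrd_join(p, NULL);
    thrd_join(c, NULL);
    return 0;
}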

-jiw

Reply to
James Waldby

Hmm, interesting... so by the sound of it, my experience may in fact be rather specific to the processors I've used?

Tim

-- Deep Fryer: A very philosophical monk. Website @

formatting link

Reply to
Tim Williams

Of course; that's how experience works :)

Actually, I think of what you mentioned (an empty QBasic loop taking a minute to count to a billion) as more an optimization issue than a processor or language issue. As John Larkin mentioned, PBCC PowerBasic takes less than a second for an empty billion-pass loop (on some unspecified machine). On my 2GHz system, a billion-pass empty C loop takes about 3 nanoseconds if optimized at all; or, if I declare "volatile int j" and put "j=3" inside the loop to keep it from disappearing during optimization, it takes 4 seconds if not optimized, else 1.2 seconds.
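
The loop being described is presumably something along these lines (a reconstruction from the description above, not the exact code):

#include <stdio.h>
#include <time.h>

int main(void)
{
    volatile int j;                  /* volatile keeps the store, and hence
                                        the loop, from being optimised away */
    clock_t t0 = clock();
    for (long i = 0; i < 1000000000L; i++)
        j = 3;
    clock_t t1 = clock();

    printf("%.2f s for 1e9 iterations (j=%d)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, j);
    return 0;
}

Without the volatile store the optimiser is entitled to delete the loop entirely, which is why the fully optimised empty loop measures in nanoseconds.
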
Reply to
James Waldby

Not really the CPU's fault. Blame the compiler implementation.

You have chosen a poor implementation of a language that does not compile to native code. Its slowness is down to the very long winded way the code is executing. The cache is probably almost completely irrelevant here. You are not striding through large chunks of memory to execute an empty for loop.

You are looking in entirely the wrong place; if you want to make your program faster, find a different native-code compiler, preferably one with a decent optimiser.

Regards, Martin Brown

Reply to
Martin Brown
