Adjusting PC Hyperthreading for Spice Simulation

So you have heard of pipeline bubbling. The pipelines are not that deep, about 7 stages at most, since complexity grows quickly with depth. Current and recent processors (roughly the last 5 years for x86, longer for SPARC and others) support speculative execution and out-of-order execution to reduce this problem.

Reply to
JosephKK

Depends on the processor. The G5 and P4 were significantly deeper than that (more like 20 stages). The entire pipe is flushed on a mispredicted branch or context switch. If the target isn't in the cache it has to be reloaded from main memory.

It doesn't reduce the problem, rather makes it occur less often (when the planets line up). The "100 cycles" is still there. Memory with a 100ns access and a 1GHz CPU kinda makes access 100x clock.
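
Spelled out, the arithmetic behind that ratio looks like this (a back-of-the-envelope sketch in Python, using only the figures quoted above):

# Back-of-the-envelope sketch using the figures quoted above.
cpu_clock_hz = 1e9            # 1 GHz CPU
mem_access_s = 100e-9         # 100 ns DRAM access

cycle_time_s = 1.0 / cpu_clock_hz            # 1 ns per core clock
stall_cycles = mem_access_s / cycle_time_s   # clocks spent waiting on one miss

print(stall_cycles)           # 100.0 -- one miss costs ~100 clock periods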
Reply to
krw

Indeed. But while that mitigates data cache misses, it doesn't do anything for a code cache miss.

Especially on x86, where instructions are variable length and so can't even be decoded if prior instructions are missing, let alone executed.

Even on a RISC architecture, it makes such techniques much less efficient. If the missing instructions include a branch which will usually be taken, speculatively executing subsequent instructions is a waste of cycles. Similarly for an instruction which uses the contents of a register which is modified by a missing prior instruction.

If anything, it is the reliance upon caching and instruction re-ordering which makes code cache misses such a performance killer, as they mitigated the problems with slow RAM to such an extent that there was little incentive to increase RAM speed.

Although such techniques worked well for "classical" procedural code, they often worked rather less well for e.g. object-oriented code making heavy use of virtual functions, or interpreted languages where a substantial portion of the interpreter can be required for even the simplest functions.

Reply to
Nobody

Not so on mispredicted branches. Moreover, speculative execution of both sides almost eliminates the issue. Also, that may have been the total depth, but less than 3% of instructions (and much less than 1% of execution) need all of it, mostly things like pusha and popa, which move multiple registers onto and off of the stack.

Wow, the last time I saw RAM with 100 ns access times was back in the 386 days. Even then you could get 70 ns and 60 ns premium parts. Current stuff is like 12 ns to 15 ns access and 60 ns to 85 ns cycle times, with multiple consecutive addresses available at 5 ns intervals. The only place you get killed is on cache write-back blocking; that does have 100 ns plus of lag before reading the new data (but that does not apply to instruction caches).
Reply to
JosephKK

That does not seem to follow.

Kindly explain how you get past the previous instruction to begin decoding the current instruction without decoding the previous instruction.

What? If you speculatively execute both sides of a conditional branch, the pipeline stays full and you just drop the unused results in the bit bucket. The result: conditional branches without significant pipeline bubbles.

Ah, you misunderstand the methods and practice of out of order execution. See the Tomasulo algorithm, and register scoreboarding (which requires "phantom registers").

That is non-factual and barely coherent.

In terms of good instruction cache locality, interpreted languages win hugely. They also tend to win on data locality as well. Think about how they really work. I haven't seen many interpreted OO languages for some reason, maybe there the virtual functions do cause problems.

Reply to
JosephKK

If the branch target misses the cache and a new DRAM page has to be opened, yes it does. Branches don't do PUSHA/POPA. Memory access is still 100x CPU clock.

Try measuring apples to apples (access a closed page). You'll find that shiny new memory isn't all that much faster than that of twenty years ago. Current processors aren't 1GHz, either. The ratio is still ~100:1.

Huh? I'm missing your point here. Cache castouts aren't in the performance path.

Reply to
krw

A code cache miss only occurs on a mispredicted branch. Branch prediction becomes the key.

Variable instruction length doesn't come into it. One instruction can't be fetched until all prior instructions have been fetched.

If it's a branch usually taken, those instructions *will* be in cache and speculatively executed. Only branches usually not-taken cause pipeline flushes. That's the whole point of tracking not/taken.
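
For reference, the textbook way of tracking taken/not-taken is a 2-bit saturating counter per branch. A minimal sketch in Python (the generic scheme, not any particular CPU's predictor):

# Minimal 2-bit saturating-counter sketch (the generic textbook scheme,
# not modelled on any specific CPU's predictor).
class TwoBitPredictor:
    def __init__(self):
        self.state = 2                    # 0/1 predict not-taken, 2/3 taken

    def predict(self):
        return self.state >= 2            # True = predict taken

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True] * 9 + [False]           # a loop branch: 9 taken, then exit
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "of", len(outcomes))          # 9 of 10: only the loop exit misses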

How can you have a missing prior instruction? How did you get here without the prior instruction already having been fetched? Instruction fetch is an in-order operation.

Increasing RAM speed is *expensive*. Size is far more important because missing RAM is far worse. How many processor cycles does it take to swap pages?

They still work, though obviously the working set is larger for an interpreted language.

Reply to
krw

Well, that's my point.

If the CPU tries to execute the next instruction, but the data isn't immediately available, it can just move on to the following instructions.

OTOH, if it tries to execute the next instruction, but the *instruction* isn't available, it's stuck. It can't do anything until the instruction is available.

If the CPU can normally execute several instructions per clock (regardless of whether their operands are immediately available), a code cache miss means that it has to wait for the existing transfer to complete, then start a new transfer and wait for that to start producing instructions, meaning that it's going to end up many instructions behind where it would have been compared to a cache hit.

Taking both branches is inefficient if one is far more likely than the other; that's the purpose of branch prediction. E.g. with a loop, the branch which executes the next iteration is usually far more common than the branch which exits the loop, so you're better off just speculatively executing the next iteration than hedging your bets.
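
A rough sketch of why, under these conditions, predicting the common side beats eagerly fetching both sides (the 95% and width-4 figures are assumptions for illustration only):

# Illustrative comparison only; p and W are assumed figures, not measurements.
p = 0.95      # probability the loop branch is taken (continues the loop)
W = 4         # instructions the front end can fetch/issue per cycle

# Eagerly fetching both sides: half the slots always go to the path that
# gets thrown away.
useful_eager = 0.5 * W                    # 2.0 useful slots per cycle

# Predict-taken: full width on the predicted path; work is lost only on
# the (1 - p) mispredictions (flush penalty ignored, which would lower
# this figure somewhat but not change the comparison for large p).
useful_predicted = p * W                  # 3.8 useful slots per cycle

print(useful_eager, useful_predicted)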

Only works if you know what the previous instruction is.

If you have:

mov r3,[something]
add r0,r1,r2

but [something] isn't available, you can still commence the add immediately.

OTOH, if you have:

mov r2,[something]
add r0,r1,r2

the addition can be commenced but can't proceed until r2 holds a concrete value.

And if you have:

add r0,r1,r2

you can't even commence the add with an abstract value for r2, because you have no idea where the value will eventually come from. r2 may even hold a concrete value, but you don't know that until you've seen the missing instruction.

Out-of-order and speculative execution avoid CPU stalls due to data cache misses, but they either don't work or are significantly less efficient for a code cache miss.
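
The dependence check underlying the three cases above can be sketched crudely like this (register names and the helper are invented for illustration; real schedulers use renaming and reservation stations, not a Python set):

# Toy illustration of the issue decision for the examples above.
# An instruction can start out of order only if none of its source
# registers is still waiting on an earlier, unfinished instruction.
pending_writes = {"r2"}                 # case 2 above: r2 is being loaded

def can_issue(source_regs):
    return not (set(source_regs) & pending_writes)

print(can_issue(["r1", "r3"]))          # True: an add using r1,r3 can start now
print(can_issue(["r1", "r2"]))          # False: add r0,r1,r2 must wait for r2
# If the prior *instruction* never arrived (code cache miss), there is
# nothing to feed into this check at all -- which is the point above.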

Which part of it is unclear?

That depends heavily upon the language complexity. For a simple language like BASIC, with few types and few primitive operations, you'll probably get good locality.

For a complex language with many variants on the basic types, something as simple as adding a list of numbers can end up calling a dozen different addition functions (int+float, int+double, arbitrary-precision-int+double, ...), and it may have to go through several steps just to determine the correct function for each value (e.g. Python will first check whether the LHS has an __add__ method, if not then whether the RHS has an __radd__ (reverse add) method, then if either side has type-cast methods, ...).
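
A minimal sketch of that dispatch order, written out in Python itself (simplified: the real rules also handle subclasses and a few more fallbacks, and the Metres class here is made up for illustration):

# Sketch of the binary-add protocol described above.  This mirrors what
# CPython does for "a + b", simplified.
def python_add(a, b):
    add = getattr(type(a), "__add__", None)
    if add is not None:
        result = add(a, b)
        if result is not NotImplemented:
            return result
    radd = getattr(type(b), "__radd__", None)
    if radd is not None:
        result = radd(b, a)
        if result is not NotImplemented:
            return result
    raise TypeError("unsupported operand types")

class Metres:                             # made-up class for illustration
    def __init__(self, v): self.v = v
    def __radd__(self, other): return Metres(other + self.v)

print(python_add(2, Metres(3)).v)         # int.__add__ declines, __radd__ runs

Every "+" in a hot loop goes through some version of this, which is part of why the interpreter touches so much code for so little arithmetic.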

Interpreted languages are more likely to have values dynamically allocated and referenced through pointers. E.g. if the value is a 3-tuple, you get a pointer to the tuple which holds 3 pointers to the individual values, which themselves may contain additional levels of indirection, and the various pointers are pointing all over the heap.

Compared to C, where most values are either stored at a small offset from the frame pointer, or are one level of indirection away (i.e. a struct pointer stored at a small offset from the frame pointer).
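
To make the indirection concrete, in CPython even a small tuple is several separate heap objects (the exact addresses below are whatever id() happens to return on a given run):

# A 3-tuple of floats in CPython is four heap objects: the tuple plus one
# boxed float per element.  id() just shows they live at distinct addresses.
point = (1.0, 2.5, -3.0)
print(id(point))
print([id(x) for x in point])

# The rough C counterpart, "struct { double x, y, z; }", is a single
# contiguous 24-byte block, typically at a fixed offset from the frame
# pointer -- no pointer chasing at all.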

JavaScript, Python and C# are all interpreted languages with a strong OO bias.

I can't speak to implementation issues for JavaScript or C#, but Python is extremely dynamic.

Everything is an object. Retrieving a member value from an object involves first checking whether the object has __getattr__ or __getattribute__ methods; if it does, the method is called with the name of the field to retrieve the value. This is also done for methods, which are just members which happen to be functions.

None of this can be done a priori due to dynamic typing. Functions don't require their argument values to belong to a specific class, just that they contain the members which the function uses. E.g. a function which expects a file argument might only care that the object has a read() method (which has to be retrieved by name each time).
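
A small sketch of that lookup behaviour (the Logged and FakeFile classes are invented; __getattribute__ and duck typing on read() are the real mechanisms being illustrated):

# Illustration of the attribute machinery described above.
class Logged:
    def __getattribute__(self, name):
        print("looking up", name)                  # runs on *every* access
        return object.__getattribute__(self, name)
    def read(self):
        return "data"

class FakeFile:
    def read(self):
        return "not really a file"

def consume(f):
    return f.read()       # only requirement: the object has a read() method

print(consume(Logged()))      # lookup of "read" goes through __getattribute__
print(consume(FakeFile()))    # works too -- no common base class needed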

From what I know of JavaScript, it isn't much different. Its primitive operations are more primitive, but being template-based rather than class-based means that you still can't optimise based upon the expected type of an object, as the code has to work with any object providing the correct interface, with no knowledge of its underlying implementation.

Reply to
Nobody


I am amazed at how badly you misread this.

Here is a typical manufacturer's website discussing DDR2 memory. Please note the availability of rather comprehensive timing diagrams for speeds up to DDR2-800. If you understand this manufacturer's literature correctly, the total read latency is on the order of 20 ns (worst case). For the processor to be 100x faster, the cycle time would have to be less than 200 ps, which is equivalent to about a 5 GHz clock. Current parts are about 1.5 to 3 GHz clocks. Best possible speed ratio in the CPU's favor: 20 to 1.
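
The arithmetic in that paragraph, spelled out (using only the 20 ns figure quoted above; nothing here is measured):

# Spelling out the arithmetic above, using the post's own 20 ns figure.
read_latency_s = 20e-9                 # ~20 ns worst-case DDR2 read latency

# Cycle time needed for the CPU clock to be 100x faster than that latency:
cycle_needed_s = read_latency_s / 100  # 2e-10 s = 200 ps
clock_needed_hz = 1 / cycle_needed_s   # 5e9 Hz, i.e. about a 5 GHz clock

print(cycle_needed_s, clock_needed_hz)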

What do you mean by a closed page? And current memory is not 60 ns any more either, average effective access when used correctly is more like 6 ns.

The issue is dirty cache page write-back (data segments) before a new page can be loaded.

Reply to
JosephKK

For modern devices and even several older ones (Pentium, SPARC, PA-RISC, i860, i960, S/360 and S/370, DEC VAX, DEC Alpha, and more) you do not have a point. Out-of-order execution requires decoding of each instruction to determine data dependencies. If the next instruction (or several) has no data dependencies, continuing with execution does no harm. It will not work for a 6502 or an 8080.

Do you think for some reason that there is only one block of instructions mapped in the cache? After the first time through a loop both sequences get mapped, if they are not already.

Only partially true. More aggressive cache preloading will have both sequences available for when the loop gets near termination.

Correct as far as it goes. But what if the next instruction is div r4,r5,r6; it can be executed because it has no unsatisfied data dependencies. See how it works?

This case cannot occur except at boot time when there are exactly 0 previously executed instructions.

Au contraire. Because there are still instructions in execution, it reduces the size of the pipeline bubble.

Code cache misses are not the killer you pretend, because the pipeline and TLB start the memory access well in advance of the actual need for the code to be present. Moreover, it causes a burst read, which has a better transfer rate and hugely reduces latency for the next few instructions.

Now you are grasping at hypothetical straws.

You are not making your case here. Go into deeper detail to support it. Without the deeper facts you are not all that credible. Try to emulate Larry Wall.

Reply to
JosephKK


Perhaps. I'll look at this again later.

Page? The LINE isn't cast out until the read of the new line is complete (and the memory bus idle). The castout isn't in the critical performance path. The read is.

Reply to
krw

That's the second time in a row that you've claimed disagreement then made the same point I was making.

No-one is disputing that these techniques mitigate data cache misses; the point under discussion is whether they do anything for *code* cache misses.

Are you saying that code cache misses don't happen? Even once data is read into the cache, it doesn't stay there forever.

Easier said than done. Caching instructions which are executed after the loop terminates should be a lower priority than caching anything which is executed during the loop (including any subroutines called from the loop). Also, if you pre-load too early, the data risks being discarded before it is ever used.

Yes, I know how out-of-order and speculative execution work. The point, which you keep overlooking, is that it relies upon knowing the instructions. A code cache miss means that there aren't any instructions to speculatively execute, i.e. the CPU is at a complete halt until it gets those instructions.

When you have a CPU which can do 100 clocks in the time it takes to initiate a transfer, and can do 3 instructions per clock, that's a significant delay.

Did you actually interpret the above as meaning that there were no instructions prior to the add, or are you acting dumb for the sake of argument?

The number of "active" (commenced but not completed) instructions is likely to be in single digits. That isn't going to make much of a dent in a 300-cycle stall (unless they're all double-precision divisions).

You're assuming either very few branches (or indirect jumps), or very accurate prediction. That may be true if you're writing Fortran to evaluate an algebraic formula, but it's not true in general.

No, I'm explaining what modern interpreted languages are really like. If you're going to write in BASIC, you may as well compile it. The main reason for using interpreted languages is the flexibility provided by dynamic dispatch.

You may not understand the case being made, and making exactly the same points in an attempt to refute it suggests that is the case.

Reply to
Nobody


Oops, forgot to add the link:

formatting link
The very link dates it to May 2007.

Moreover check this, even though it is Wikipedia:

formatting link

Take a look at those cycle times and peak transfer rates. Sustained rates are maybe 20% of peak but still quite a bit of data moving.

Reply to
JosephKK


Improbable. If the cache line/page selected to be replaced is "dirty" it must be written back before reading the new data or the changes will be lost. Thus the dirty line/page overhead is present in some fraction (usually small) of cache read attempts with misses. Balancing allocated write bandwidth with read performance and cache miss rate is the design issue.
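
A toy sketch of the mechanism in question, just to pin down the ordering being argued about. Whether the dirty victim goes straight to DRAM before the fill, or into a write/store buffer that drains later, is exactly the point in dispute; this sketch (a made-up direct-mapped cache, not any real controller) shows the buffered variant:

# Toy direct-mapped write-back cache, invented purely to illustrate the
# ordering question.  Real controllers differ; this variant parks the
# dirty victim in a write buffer so the demand fill is not delayed by it.
class Line:
    def __init__(self):
        self.tag, self.data, self.dirty = None, None, False

class Cache:
    def __init__(self, nlines=4):
        self.lines = [Line() for _ in range(nlines)]
        self.write_buffer = []            # drained to DRAM when the bus is idle

    def read(self, addr, memory):
        line = self.lines[addr % len(self.lines)]
        if line.tag != addr:              # miss
            if line.dirty:
                self.write_buffer.append((line.tag, line.data))
            line.tag, line.data, line.dirty = addr, memory[addr], False
        return line.data

    def write(self, addr, value, memory):
        self.read(addr, memory)           # allocate on write miss
        line = self.lines[addr % len(self.lines)]
        line.data, line.dirty = value, True

mem = {i: i * 10 for i in range(16)}
c = Cache()
c.write(3, 99, mem)                       # line 3 is now dirty
print(c.read(7, mem), c.write_buffer)     # 70, and (3, 99) sits in the buffer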

Reply to
JosephKK

See below for the question of the cost of a code cache miss.

You keep insisting that memory read is far slower than it actually is. Using recent single-core, superscalar Pentium processors (the most popular and easiest to get data for), look up the number of execution units, the maximum number of simultaneously executing instructions, the number of clock cycles needed to execute the instructions (by profile; it does vary), etc., for a processor with, say, a 2.4 GHz clock, connected to, say, 1 GB of normal DDR2-6400 and other normal bandwidth loads on the busses. For reasonable access and cycle times (misnamed on the page) use this Wikipedia page:

formatting link

How many clocks are needed to get a code space read (for a cache miss) from memory for this configuration? How big of a bubble does that put in the pipeline?
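
For what it's worth, one hedged stab at that number, with the DRAM figures treated purely as assumptions (generic closed-page DDR2-800 data-sheet-style timings, not the specific module or chipset in question):

# Rough, assumption-laden answer to the question above.
cpu_hz = 2.4e9                        # 2.4 GHz core clock
cycle_s = 1 / cpu_hz                  # ~0.417 ns per core clock

# Assumed closed-page access: precharge + row activate + CAS latency,
# plus some memory-controller and bus overhead on either side.
dram_to_first_word_s = 55e-9          # assumed ~55 ns to the critical word

stall_cycles = dram_to_first_word_s / cycle_s
print(round(stall_cycles))            # ~132 core clocks of bubble

# At a peak issue rate of ~3 instructions per clock:
print(round(stall_cycles) * 3, "issue slots lost")   # ~396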

No, you were acting dumb about the missing instruction. I explained the only case it could occur and you misunderstood. Even with branches, instruction fetch is fundamentally sequential: you cannot fetch instruction x+1 until you have fetched instruction x.

From the profiler; about 3% to 10% branches, with about 75% successful prediction. See above for discussion of clocks for memory access time. I am NOT discussing using swap disk with its millisecond class overhead.

Have you ever pounded your way through the lowest level code of an interpreter? They emulate a nonexistent virtual machine. The issues related to dynamic dispatch in compiled code do not obtain in interpreted code. The process is that different.

Reply to
JosephKK

It doesn't even have a psychiatrist. Emacs has a built-in psychiatrist.

--
Paul Hovnanian     mailto:Paul@Hovnanian.com
------------------------------------------------------------------
Time's fun when you're having flies. -- Kermit the Frog
Reply to
Paul Hovnanian P.E.


Nope. It's written to the store queue where it can be written back to memory at the processor's leisure.

Write bandwidth has nothing to do with it, except in the pathological case where you miss on every fetch.

Reply to
krw

Let's back up a couple of days to the point at which I entered this thread:

So, the 350 cycles figure would have been from the era of PC-133 or the first generation of DDR. That would have been the last time that I wrote code which involved timing diagrams.

You can displace instruction x from the cache without displacing instruction x+1.

75% isn't "very accurate"; and for getting two consecutive branches correct, that's only 56%. For code with a lot of conditionals, you can't assume that prefetching is going to ensure that the right instructions are cached.

At the lowest level, it is examining each instruction and invoking the corresponding primitive. But this isn't like emulating a real CPU where you have maybe a couple of dozen common instructions; the Python core is around 2500 primitives in 750KB of code. When the VM implementation is hopping all over that much code, you aren't going to have it all in the cache.
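
To make "examining each instruction and invoking the corresponding primitive" concrete, here is a toy dispatch loop. The three opcodes are made up; real CPython has a couple of hundred bytecode opcodes, each with a far larger handler, which is where the locality problem comes from:

# Toy bytecode interpreter: the dispatch pattern described above, with a
# made-up three-opcode instruction set.
def run(code, stack=None):
    stack = [] if stack is None else stack
    pc = 0
    while pc < len(code):
        op, arg = code[pc]
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)           # in CPython this one step hides the
                                          # whole __add__/__radd__ dance above
        elif op == "PRINT":
            print(stack.pop())
        pc += 1

run([("PUSH", 2), ("PUSH", 3), ("ADD", None), ("PRINT", None)])   # prints 5

Running import dis; dis.dis(lambda a, b: a + b) on a real CPython shows the same shape: a couple of loads, a binary-add opcode, a return -- and that one binary-add handler hides the whole dispatch described earlier.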

Reply to
Nobody


Not so much a pathological case, but a low frequency case where all cache lines/pages are dirty at the time of a cache miss.

Reply to
JosephKK

This will set up some time line referents to work with:

formatting link

Taking 1999 as a useful base year, let's look at processors:

formatting link

So in 1999 Intel (IA-32) CPU speed was about 500 MHz and memory speed was about 133 MHz. That is not 300:1; it is not even 10:1, and it took multiple clocks to execute most instructions (early Pentium, not even P2).

Possible in some cache schemes, not in most.

Changes very little. Back to back branches are uncommon to rare.

Code locality is a function of the most-used primitives as well; thus the less common primitives cause most of the cache misses, and the commonest primitives are almost always in cache.

Same thing with data locality. I trust you learned about that as well.

Reply to
JosephKK
