Adjusting PC Hyperthreading for Spice Simulation

No. You're incorrectly assuming that a cast-out cache line has to be written from the CPU into main memory immediately. This is *NOT* the case. The write pipe isn't direct: there is a store queue wrapped around the read pipe, and the store happens after the read in all cases except when the store queue is already full. That requires the pathological case where there aren't any free memory slots to drain the writes.
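The ordering described above can be sketched as a toy model (purely illustrative — the queue depth and interface are hypothetical, not any real CPU's microarchitecture): on a miss that casts out a dirty line, the demand read is serviced first and the write-back is parked in a store queue, which drains later unless it is already full.

```python
from collections import deque

STORE_QUEUE_DEPTH = 4  # assumed depth, purely illustrative


class ToyCache:
    """Toy model: cast-out writes are queued, demand reads go first."""

    def __init__(self):
        self.store_queue = deque()
        self.memory_ops = []  # order in which main memory sees operations

    def miss(self, read_addr, dirty_victim=None):
        # The write-back is NOT sent to memory immediately...
        if dirty_victim is not None:
            if len(self.store_queue) >= STORE_QUEUE_DEPTH:
                # pathological case: queue full, one write must drain first
                self.memory_ops.append(("write", self.store_queue.popleft()))
            self.store_queue.append(dirty_victim)
        # ...the demand read is serviced first.
        self.memory_ops.append(("read", read_addr))

    def drain(self):
        # Queued writes drain when free memory slots exist.
        while self.store_queue:
            self.memory_ops.append(("write", self.store_queue.popleft()))


c = ToyCache()
c.miss(read_addr=0x100, dirty_victim=0x200)
c.drain()
print(c.memory_ops)  # the read reaches memory before the cast-out write
```

In the common case the read always wins; only a full store queue forces a write ahead of it, which is exactly the pathological case described above.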

Reply to
krw

Pentium 3.

Pentium Pro only went up to 200MHz, Pentium 2 up to 450MHz. Pentium 3 came out in 1999, at 450/500MHz.

At the other end of the scale, my P3/800 used PC-133, and there were P3s up to 1100MHz with a 100MHz FSB and 1400MHz with a 133MHz FSB. That's the kind of system where 300 clocks is feasible for a code cache miss.
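As a back-of-envelope check of that 300-clock figure (the SDRAM timings below are assumed, illustrative values, not datasheet numbers): an 800 MHz core on a 133 MHz bus runs about 6 CPU clocks per bus clock, so waiting out an in-progress burst plus a row open and a full line fetch lands in the right neighbourhood.

```python
# Back-of-envelope for the P3/800 + PC-133 case (illustrative figures).
cpu_mhz, bus_mhz = 800, 133
ratio = cpu_mhz / bus_mhz              # ~6 CPU clocks per bus clock

# Assumed worst-case bus clocks: finish an in-progress burst, then
# precharge + activate + CAS latency + burst the cache line in.
# These SDRAM-ish timings are hypothetical round numbers.
bus_clocks = 8 + 3 + 3 + 3 + 8 * 4     # = 49 bus clocks

print(round(bus_clocks * ratio))       # ~295 CPU clocks
```

Under those assumptions a single code-cache miss costs on the order of 300 CPU clocks, which is the figure being defended above.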

After that, DDR appeared, and memory finally started to catch up with the CPU. But prior to that, you had CPU clocks running many times faster than the memory bus.

Possible in any cache scheme. If instruction x+1 is a branch target, it can be both more recently used and more frequently used than instruction x.

That depends upon the type of code you're writing. Obviously, branches which are "exactly" back-to-back are rare, but test,branch,test,branch isn't that uncommon; an extreme case is code which embodies a domain of knowledge, classifying its input then applying the corresponding rules (IOW, something akin to a Lisp "cond" statement, except that you would normally try to use a hierarchical decision tree rather than performing the tests sequentially).
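A minimal sketch of that test-branch-test-branch pattern (the classification rules here are hypothetical): a sequential, Lisp-cond-style chain of tests versus the hierarchical decision tree the post recommends, which embodies the same domain knowledge in fewer tests per input.

```python
def classify_sequential(x):
    # Lisp "cond" style: one test-and-branch per rule, applied in order.
    if x < 0:
        return "negative"
    elif x == 0:
        return "zero"
    elif x < 10:
        return "small"
    elif x < 1000:
        return "medium"
    else:
        return "large"


def classify_tree(x):
    # Hierarchical decision tree: same rules, but each branch discards
    # roughly half the remaining cases, so fewer tests on average.
    if x < 10:
        if x < 0:
            return "negative"
        return "zero" if x == 0 else "small"
    return "medium" if x < 1000 else "large"


# Both embody the same domain of knowledge:
assert all(classify_sequential(v) == classify_tree(v)
           for v in (-5, 0, 3, 500, 10**6))
```

Either way the generated machine code is a dense run of compare-and-branch instructions — exactly the branchy, test-heavy stream being discussed.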

That's fine if you have a handful of common primitives and the rest are rare, but glancing over Python's primitives, I'd say that fully half of them are common. The kind of code which would only use a handful of primitives is the kind of code which you would write in C.

Reply to
Nobody

By the way, why did you snip away my references?

This will set up some time line referents to work with:

formatting link

Taking 1999 as a useful base year, let's look at processors:

formatting link

Just the same, even with your unsupported values:

Let's see: even a roughly 10-to-1 clock-speed difference cannot translate into over a 100-to-1 time cost.

Reply to
JosephKK

Because they weren't necessary to comprehend my reply. I'm used to fora where quoting entire messages is frowned upon.

Clock speed alone tells you nothing. How many clocks is the worst-case latency, assuming an existing burst is in progress on a different row?

Reply to
Nobody

I snip from time to time.

You are taking an interesting position, because your own references completely undercut your premises.

If you want to include clocks-to-complete, the typically higher CPU clocks per instruction (about 7 to 12 for 90% of the instruction stream on x86, ignoring pipelining) compared to clocks per memory access (typically 3 to 5 without a burst, 5 to 11 with a burst in progress) still comes out against you.
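Spelling out that comparison with the figures as quoted (these are the disputed numbers from the thread, not measurements): at a 1:1 clock ratio a worst-case access costs about one or two instructions' worth of time, and the whole disagreement is over what clock ratio to multiply in.

```python
# Figures as quoted in the post above (disputed, illustrative only).
cpi_range = (7, 12)    # CPU clocks per instruction, 90% of x86 stream
mem_range = (3, 11)    # memory clocks per access: no burst .. burst in progress

# If CPU and memory ran at the same clock, a worst-case access (11 clocks)
# costs only about 1-2 instructions' worth of time:
print(mem_range[1] / cpi_range[0])     # ~1.6 instruction-times

# But the same access costs r * 11 CPU clocks at a CPU:memory clock
# ratio of r -- which is the factor under dispute in this thread:
for r in (1, 5, 10):
    print(r, r * mem_range[1] / cpi_range[0])
```

On these numbers the conclusion hinges entirely on r, which is exactly where the two posters part company.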

Reply to
JosephKK

If so, worst case would be 11 memory clocks, with 10 CPU clocks per memory clock and 3 instructions (or 3 cycles' worth of instructions) per CPU cycle = 330 cycles.

Or is there some reason why that cannot happen? Remember, we're talking worst case, not average case (average case is a cache hit). And worst-case isn't always some obscure theoretical concept. It's not hard to write code which is memory-bound (so there will usually be a burst in progress) and has poor cache coherence (so cache misses are common), and an instruction fetch will typically be for a different row than a data fetch.
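The arithmetic above, spelled out (using the figures as stated in the post, not measured values):

```python
mem_clocks = 11    # worst case: burst in progress on a different row
cpu_per_mem = 10   # CPU clocks per memory clock (e.g. 1 GHz core, 100 MHz FSB)
issue_width = 3    # instructions (or cycles' worth of work) per CPU cycle

stall_cpu_clocks = mem_clocks * cpu_per_mem          # 110 CPU clocks stalled
lost_instruction_slots = stall_cpu_clocks * issue_width
print(lost_instruction_slots)                        # 330
```

That is, the 330 figure is not clocks per memory clock but instruction slots lost while the miss is serviced.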

Reply to
Nobody

Where in Finnegan's fictional fantasies did you get this weird arithmetic? Where did the 3 instructions come from? Where do you get 10 CPU clocks per memory clock? Neither one is supported by the facts. Look again at the references:

This will set up some time line referents to work with:

formatting link

Taking 1999 as a useful base year, let's look at processors:

formatting link

The clock ratios you claim just are not there.

While it is possible to write pathological code in assembler, higher-level languages will generally prevent it. It may be possible to brute-force it in C, but it will be readily recognizable as pathological.
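For what it's worth, memory-bound code needs no assembler and doesn't have to look pathological; a hypothetical example is a column-order walk over a large array, which in a language with flat row-major storage (C, or NumPy) strides across a cache line's worth of memory on every access while the row-order version stays cache-friendly.

```python
# Hypothetical example. In a flat row-major array (as laid out in C),
# the column-order traversal jumps N elements between consecutive
# accesses -- a new cache line (and often a new DRAM row) every time --
# while the row-order traversal walks memory sequentially. Both compute
# the same sum, and neither looks obviously "pathological" in source.
N = 1024
grid = [[0] * N for _ in range(N)]


def sum_rows(g):
    # row order: accesses follow the memory layout
    return sum(v for row in g for v in row)


def sum_cols(g):
    # column order: large stride between consecutive accesses
    return sum(g[i][j] for j in range(len(g[0])) for i in range(len(g)))


assert sum_rows(grid) == sum_cols(grid)
```

(Python's list-of-lists layout only approximates this; the point is that the cache-hostile version is an entirely ordinary-looking loop-order choice, not hand-tuned assembler.)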

Reply to
JosephKK

Okay, so we're arguing over terminology again. Intel considers pipelining to be what the original Pentium had, calling PPro upwards superscalar. Even without multiple ALUs, PPro upwards can execute multiple load/store operations concurrently, alongside one integer and one FP operation.

I did:

DDR 333 may have existed at this point, but so did CPUs with 100/133 FSB.

Oh; so it's the programmers' fault for not writing megabytes of code in hand-tuned assembler?

Real-world code doesn't look anything like the Fortran or Pascal examples you may have learned in college, or the kind of code you would write for a microcontroller.

So you can dismiss that too (along with anything else which contradicts your claims) as a "pathological" case.

But there's no point in citing specific examples. Just download any substantial software package for which source code is available (especially anything written in C++).

If you're programming x86 (i.e. PCs/servers), software where 99% of the CPU cycles are spent in a few KiB of code is the exception rather than the rule.

Reply to
Nobody

Off the point quite a bit.

Many in this ng write mucho real-world assembler for microcontrollers, for real-world products, and create new prosperity by doing so. How could that not be, and not "look like", real-world code?

Wild; while I know of servers that do not spend all their time in the "idle loop", my workplace desktop and my home desktops spend about 99% of their time waiting for something to do. I think I would like to have some SPICE circuits large enough to take enough time to be worth profiling usage while they run.

Reply to
JosephKK

I should have said "real world PC code", as it's specifically the x86 which started this sub-thread.

The type of code you would write for a microcontroller wouldn't have cache coherence issues if it was run on a PC; even a Celeron's cache is larger than the entire combined RAM + flash of many microcontrollers.

And microcontrollers typically don't run at speeds where everything has to be engineered around memory latency.

But that's not the kind of software Intel/AMD CPUs are designed for. And game code is hardly "inefficient" by PC standards; it's one of the few areas where performance is actually considered important (contrast with e.g. Windows itself, or MS-Office, or similar "bloatware").

Reply to
Nobody

Fine. I will go my own way now.

Reply to
JosephKK
