Benchmarking a toy example on SH-4

Hello everyone,

I have a 266 MHz, dual-issue SH-4 CPU with a 5-stage integer pipeline.

I've written a small piece of assembly to make sure I understand what is going on in the trivial case.

The code:

r4 = loop iteration count
r5 = address of the time-stamp counter (1037109 Hz)

.text
.little
.global _noploop
.align  5

_noploop:
        mov.l   @r5,r0          /*** READ TIMESTAMP COUNTER ***/
        nop
.L1:
        dt      r4
        nop
        nop
        nop
        nop
        nop
        nop
        bf      .L1             /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
        mov.l   @r5,r1          /*** READ TIMESTAMP COUNTER ***/
        rts
        sub     r1,r0           /*** DELAY SLOT ***/
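A minimal C harness for calling it might look like the sketch below; the prototype just follows the register usage above, and the TSC address is a placeholder since the real one is platform-specific.

#include <stdio.h>

/* Prototype matching the register usage above: r4 = iteration count,
 * r5 = address of the TSC, tick delta returned in r0. */
extern unsigned int noploop(unsigned int iters, volatile unsigned int *tsc);

#define TSC_HZ   1037109u
#define TSC_ADDR ((volatile unsigned int *)0x12345678)  /* placeholder address */

int main(void)
{
    unsigned int ticks = noploop(1000000000u, TSC_ADDR);
    printf("%u ticks = %.2f s\n", ticks, (double)ticks / TSC_HZ);
    return 0;
}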

1) the loop kernel consists of 8 instructions
2) nop can execute in parallel with nop, with dt, and with bf
3) the dependency dt -> bf does not induce a pipeline stall
4) bf taken induces a one-cycle pipeline stall

Therefore, an iteration of the loop runs in 5 cycles. Is this correct, so far?

I called noploop with an iteration count of 10^9. It runs in 19623860 ticks = 18.92 seconds.

5e9 cycles in 18.92 seconds = 264.2 MHz (close enough to 266.67 MHz).
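Spelling the arithmetic out as a sanity check (a throwaway sketch; the 5 cycles/iteration figure is the pairing estimate from above):

#include <stdio.h>

int main(void)
{
    const double tsc_hz = 1037109.0;   /* timestamp counter rate (Hz)    */
    const double ticks  = 19623860.0;  /* measured for 1e9 iterations    */
    const double cycles = 5e9;         /* 5 cycles/iteration * 1e9 iters */

    double seconds = ticks / tsc_hz;          /* ~18.92 s */
    double mhz     = cycles / seconds / 1e6;  /* ~264 MHz */

    printf("%.2f s, implied clock %.1f MHz\n", seconds, mhz);
    return 0;
}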

Can I safely conclude that this CPU does, indeed, run at the advertised frequency?

The system supports DDR1 SDRAM.

Given that the CPU is running very close to peak performance in my toy example, can I conclude that the instruction cache is active? Or is it possible to reach this performance level running straight from RAM?

AFAIU, our system comes with DDR-200. I would have expected DDR-266, wouldn't that make more sense?

Thanks for reading this far :-)

Regards.

Reply to
Noob

Seems very reasonable, quite similar to the original Pentium pipeline.

Or at least very close to it, your crystal might be slightly off spec.

It depends: Does that cpu have any kind of prefetch buffer where small loops can run out of the buffer, like some mainframes used to have?

Terje

--
- 
"almost all programming can be viewed as an exercise in caching"
Reply to
Terje Mathisen

Yes, the pairing rules did bring back Pentium memories.

NB: one cannot pair two arithmetic instructions. I find this a severe limitation.

One percent seems like a large offset, wouldn't you say? (However, the exact frequency of the platform's TSC might not be very important.)

How is a prefetch buffer different from an Icache? They sound conceptually similar.

The documentation explicitly mentions software prefetching for data, and seems to allude to hardware prefetching for instructions, e.g.

"If code is located in the final bytes of a memory area, as defined above, instruction prefetching may initiate a bus access for an address outside the memory area."

Regards.

Reply to
Noob

I then set out to prove that the memory manager returns non-cached memory.

I wrote a trivial load loop.

r4 = loop iteration count
r5 = address of the time-stamp counter (1037109 Hz)
r6 = address of one word

_loadloop:
        mov.l   @r5,r0          /*** READ TIMESTAMP COUNTER ***/
        nop
.L2:
        dt      r4
        mov.l   @r6,r1
        mov.l   @r6,r1
        mov.l   @r6,r1
        mov.l   @r6,r1
        mov.l   @r6,r1
        mov.l   @r6,r1
        bf      .L2             /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
        mov.l   @r5,r1          /*** READ TIMESTAMP COUNTER ***/
        rts
        sub     r1,r0           /*** DELAY SLOT ***/

A load can be paired with dt and with bf, but not with another load. Thus, when r6 points to cached memory, I expect 7 cycles per iteration. If I allocate the word on the stack, or via malloc, all is well.

1e9 iterations in 7e9 cycles => OK

If I allocate the word via the "AVMEM memory manager", not so well.

1e9 iterations in 31.6e9 cycles.

I think it is safe to conclude that the latter memory is not cached, right?
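(For reference, the cycles/iteration figures are derived from the raw tick counts with a helper along these lines; the CPU clock is the ~266 MHz measured above.)

/* ticks -> CPU cycles per loop iteration.  cpu_hz is the ~266.67e6 (or
 * the 264.2e6 actually measured) figure, tsc_hz is 1037109. */
static double cycles_per_iter(unsigned int ticks, double iters,
                              double cpu_hz, double tsc_hz)
{
    return (double)ticks * (cpu_hz / tsc_hz) / iters;
}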

TODO: look at store performance, then write my own memcpy.

Regards.

Reply to
Noob

I've never seen a crystal which is spot on. E.g. my current laptop has a 2.2 GHz CPU, "Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz", while the speed I measure by comparing RDTSC with OS clock time, with ntpd running to tune the system time, is 2.195 GHz.

They are, but a prefetch buffer, like on the 8088->486 CPUs, did not snoop any bus activity, so self-modifying code would not be picked up for instructions already prefetched.

This makes such a buffer simpler than a real cache.

This is the normal pipeline: it always tries to read the next few instruction bytes, even if the last instruction was a branch and memory ends just past that branch.

Anyway, with your AVMEM uncached memory regions you would get a huge speedup by moving as much of the processing as possible into normal RAM and only moving the final results into frame-buffer space.

Terje

--
- 
"almost all programming can be viewed as an exercise in caching"
Reply to
Terje Mathisen

.global _storeloop
.align  5

_storeloop:
        mov.l   @r5,r0          /*** READ TIMESTAMP COUNTER ***/
        nop
.L3:
        dt      r4
        mov.l   r0,@r6
        mov.l   r0,@r6
        mov.l   r0,@r6
        mov.l   r0,@r6
        mov.l   r0,@r6
        mov.l   r0,@r6
        bf      .L3             /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
        mov.l   @r5,r1          /*** READ TIMESTAMP COUNTER ***/
        rts
        sub     r1,r0           /*** DELAY SLOT ***/

SUMMARY

storeloop takes
  7.0 cycles/iteration when STORING a cached word
 37.4 cycles/iteration when STORING a non-cached word

I didn't expect non-cached stores to be 20% slower than non-cached loads, while cached stores and cached loads run at the same speed. What might explain that?

Regards.

Reply to
Noob

The reads benefit from DRAM page hits.

--
Mvh./Regards,    Niels Jørgen Kruse,    Vanløse, Denmark
Reply to
Niels Jørgen Kruse

Next, a pointer chasing loop.

_chaseptr:
        mov.l   @r5,r0          /*** READ TIMESTAMP COUNTER ***/
        nop
.L4:
        dt      r4
        mov.l   @r6,r6
        mov.l   @r6,r6
        mov.l   @r6,r6
        mov.l   @r6,r6
        mov.l   @r6,r6
        mov.l   @r6,r6
        bf      .L4             /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
        mov.l   @r5,r1          /*** READ TIMESTAMP COUNTER ***/
        rts
        sub     r1,r0           /*** DELAY SLOT ***/
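The simplest setup for this, and the one sketched below, is a word that contains its own address, so each load's address depends on the result of the previous load while always hitting the same location:

/* One word containing its own address: "mov.l @r6,r6" keeps landing on
 * the same cell, and every load depends on the previous one. */
static unsigned long cell;

unsigned long *make_chase_cell(void)
{
    cell = (unsigned long)&cell;
    return &cell;               /* pass this address as r6 */
}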

chaseptr takes

12.0 cycles/iteration when working with cached memory.
31.6 cycles/iteration when working with non-cached memory.

6 loads per iteration; 2-cycle latency on a cache hit, thus 12 cycles per iteration. (dt and bf basically come "for free".)

chaseptr is (marginally) faster than loadloop (31.56 vs 31.58) when working with non-cached memory, which is slightly counter-intuitive. (The difference might be insignificant, but it is systematic.)

Regards.

Reply to
Noob

Epic fail. I got the numbers for non-cached loads wrong by an order of magnitude.

loadloop takes 7 cycles/iteration when LOADING a cached word

316 cycles/iteration when LOADING a non-cached word ^^^

storeloop takes 7.0 cycles/iteration when STORING a cached word

37.4 cycles/iteration when STORING a non-cached word

chaseptr takes 12 cycles/iteration when working with cached memory

316 cycles/iteration when working with non-cached memory ^^^

I'm now trying to understand why reading from non-cached memory is so much slower than writing.

Is the CPU optimizing some (most) of my writes away because I keep writing to the same address?

In my test, cached read bandwidth is 906 MB/s, while non-cached read bandwidth is 20 MB/s.

20 MB/s seems very low for DDR1 SDRAM, wouldn't you agree? Perhaps DRAM is not optimized for my artificial access pattern? (Always hitting the same word.)
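For reference, the bandwidth numbers fall straight out of the per-iteration counts (sketch below, using the 264.2 MHz measured with noploop):

#include <stdio.h>

int main(void)
{
    const double cpu_hz   = 264.2e6;  /* clock measured with noploop    */
    const double bytes_it = 6 * 4;    /* six 4-byte loads per iteration */

    printf("cached:   %.0f MB/s\n", bytes_it * cpu_hz /   7.0 / 1e6);  /* ~906 */
    printf("uncached: %.1f MB/s\n", bytes_it * cpu_hz / 316.0 / 1e6);  /* ~20  */
    return 0;
}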

Regards.

Reply to
Noob

There is no difference between writing to contiguous words and writing to the same word over and over again.

arrstore takes
  7.0 cycles/iteration when STORING to cached memory
 37.4 cycles/iteration when STORING to non-cached memory

_arrstore:
        mov.l   @r5,r0          /*** READ TIMESTAMP COUNTER ***/
        nop
.L5:
        dt      r4
        mov.l   r0,@( 0,r6)
        mov.l   r0,@( 4,r6)
        mov.l   r0,@( 8,r6)
        mov.l   r0,@(12,r6)
        mov.l   r0,@(16,r6)
        mov.l   r0,@(20,r6)
        bf      .L5             /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
        mov.l   @r5,r1          /*** READ TIMESTAMP COUNTER ***/
        rts
        sub     r1,r0           /*** DELAY SLOT ***/

I am perplexed.

Reply to
Noob

That's possible, but not so likely: most systems keep uncached accesses separate and do not combine them, because the uncached accesses may be to memory-mapped I/O devices that have side effects.

(I have designed systems that have two different types of uncached memory, a UC-MMIO type that permits no optimizations, and a UC-Ordinary type that permits optimizations. But I am not aware of anyone shipping such a system. HPC guys often ask for it.)

More likely, every time you do an uncached read it looks something like this:

Processor sends out address.
Wait many cycles while address percolates through processor, across bus, to DRAM.
Wait a few cycles while DRAM responds.
Wait many cycles while data percolates back.
Wait a few cycles while processor handles data.
Start next load.

Whereas with stores, it is:

Processor sends out address and data.
Store is buffered or pipelined.
Follow-up store follows close behind.

Oh, and the long latency of the uncached load may be long enough that the DRAM controller closes the active page, whereas the back to back stores probably score page hits.

You may be able to design microbenchmarks to distinguish store pipelining from store buffering. E.g. if you have a store buffer of 8-10 entries, you might see what you observe.
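Something along these lines, perhaps (just a sketch; read_tsc() stands in for whatever timer you have, and the drain pause is a guess):

/* Time bursts of n back-to-back uncached stores, with a pause between
 * bursts so any store buffer can drain.  If ~8-10 stores are absorbed
 * by a buffer, the cost per store should jump once n exceeds the
 * buffer depth. */
extern unsigned int read_tsc(void);

void probe_store_buffer(volatile unsigned int *uncached,
                        unsigned int ticks_out[], int max_n)
{
    enum { REPS = 100000 };   /* enough repetitions for a ~1 MHz counter */

    for (int n = 1; n <= max_n; n++) {
        unsigned int t0 = read_tsc();
        for (int r = 0; r < REPS; r++) {
            for (int i = 0; i < n; i++)
                *uncached = i;                /* burst of n stores */
            for (volatile int d = 0; d < 64; d++)
                ;                             /* let the buffer drain */
        }
        ticks_out[n - 1] = read_tsc() - t0;
    }
}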

DRAM is *NOT* optimized for uncached accesses.

With modern DRAM the only way you can approach peak bandwidth is to use burst accesses - typically cache line fills, but also possibly reads of 512b/64B vectors, load-multiple-register instructions, etc.

I.e. you must either get a burst transfer implicitly, via a cache line fill or prefetch, or explicitly, by instructions that load more than 4 bytes at a time.

Reply to
Andy "Krazy" Glew

The CPU halts and waits for the read; the write goes to the memory controller to handle, and the CPU goes its merry way. Until the memory controller fills its buffer and stalls, forcing the CPU to halt and wait.

Uncached memory generally means memory-mapped serial port registers, etc. You do not want your CPU optimizing away those writes. There are generally several types of uncached memory, with different rules and different performance: from hard-volatile for hardware registers, to write-only display lists where you do want the CPU to optimize memory writes.

You didn't tell us which type(s) you were using or were given to use. ;)

Brett

Reply to
Brett Davis

What you're seeing is simply that a) DRAM is really slow these days when used as a random-access memory; it is really a paging device to get blocks of data into and out of cache.

b) As Andy wrote, write buffers can hide significant parts of the overhead, while an uncached load on a non-OoO CPU has to wait until everything arrives.

The conclusion is simply that frame buffers like yours cannot ever be read from, only written to, and if you need to do _any_ kind of processing at all, it will be faster to double-buffer, i.e. keep the working frame buffer in normal cacheable RAM, and only copy the finished screen image to the hardware frame buffer when everything is done.

Terje

--
- 
"almost all programming can be viewed as an exercise in caching"
Reply to
Terje Mathisen

(Going off on a tangent)

I thought there were many "types" of memory accesses?

For example, the AMD64 architecture defines the following "memory types" with different properties.

Uncacheable (UC)
Cache Disable (CD)
Write-Combining (WC)
Write-Protect (WP)
Writethrough (WT)
Writeback (WB)

OK.

Lemme see what the documentation says.

"The ST40 cores all have some store buffering at their STBus interface to allow the CPU to continue executing whilst stores are written out in parallel. The degree of buffering varies between core families. In addition, the ST40 bus interface and/or the STBus interconnect may introduce re-ordering of stores or merging of multiple stores to the same quadword into one (so-called write-combining)."

Apparently, this platform provides store queues.

"The SQs are a pair of software-controlled write buffers. Software can load each buffer with

32 bytes of data, and then initiate a burst write of the buffer to memory. The CPU can continue to store data into one buffer whilst the other is being written out to memory, allowing efficient back-to-back operation for large data transfers."
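Out of curiosity, here is my (untested) reading of how a 32-byte burst store through an SQ would look; the SQ base address, the address masking and the pref-triggered flush are all just my reading of the manual, so every constant below needs checking before use:

#include <stdint.h>

/* Untested sketch: fill one store queue with 32 bytes, then flush it to
 * external memory as a single burst.  QACR0/QACR1 must already hold the
 * upper physical address bits of the destination. */
static inline void sq_burst_write32(uint32_t dst_phys, const uint32_t src[8])
{
    /* 0xE0000000-0xE3FFFFFF maps the SQs; bits [25:5] of this address
     * become bits [25:5] of the external address, and bit 5 selects
     * SQ0 vs SQ1. */
    volatile uint32_t *sq =
        (volatile uint32_t *)(0xE0000000u | (dst_phys & 0x03FFFFE0u));

    for (int i = 0; i < 8; i++)
        sq[i] = src[i];                          /* fill the 32-byte queue */

    /* A pref to an SQ address flushes the queue as one burst write. */
    __asm__ __volatile__("pref @%0" : : "r"(sq) : "memory");
}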

I'd like to have a way to burst reads, rather than writes, since that is the bottle-neck in my situation.

What is the "load" equivalent of a store queue? :-)

Regards.

Reply to
Noob

OK.

The documentation states:

"Explicit control of re-ordering and combining for writes to the STBus: None. (The ST40 bus interface and the STBus are required to preserve all critical write-order properties.)"

Reply to
Noob

Cache line reads.

Seriously, if you have a DMA or data mover device you can sometimes offload the copy on to that, and it can move the data faster because it has been optimized for such things (well, if it was done right).

Have you tried 8 back-to-back loads? With the Pentium III, the fastest way to copy uncached memory was to do the copy in a burst from uncached to cached, and then from cached to uncached (the P4 fixed this with more intelligent/aggressive prefetching).
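I.e. something like this sketch (the staging buffer size and the plain memcpy calls are just placeholders for whatever burst-friendly copy the platform provides):

#include <string.h>

/* Copy from uncached memory to uncached memory by staging through a
 * cacheable buffer, so each half of the copy can use burst transfers
 * instead of single-word uncached accesses. */
void staged_copy(void *dst_uncached, const void *src_uncached, size_t len)
{
    static unsigned char stage[1024];   /* lives in normal, cacheable RAM */

    while (len > 0) {
        size_t chunk = len < sizeof stage ? len : sizeof stage;
        memcpy(stage, src_uncached, chunk);    /* uncached -> cached   */
        memcpy(dst_uncached, stage, chunk);    /* cached   -> uncached */
        src_uncached = (const unsigned char *)src_uncached + chunk;
        dst_uncached = (unsigned char *)dst_uncached + chunk;
        len -= chunk;
    }
}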

- Tim

Reply to
Tim McCaffrey

Amusingly, I defined those types for Intel P6.

UC WP WT WB already existed, but outside the CPU. I invented the MTRRs (not one of my favorite things) to hold the memory types internally.

I invented the WC memory type, along with a number of memory types that got cut, including UC-MMIO and UC-MEMORY. Also RC, FB, ....

Hmm... I have not seen the CD memory type before. Looks like they added one when I wasn't looking. Maybe it is UC-MEMORY, and the old UC is UC-MMIO? I can only hope so.

(I actually wanted the memory type to be a bitmask, with features like speculative loads allowed, burst, cache in L1, L2, ..., writeback/writethrough, etc. Validation hated that idea.)

They don't say anything about write combining stores to the same address, but I think they are combined. Test with sequential stores vs random stores.

If sequential stores are slower, then ...

a) a load into a single cache line sized register - like a LRB 512b / 64B register

b) a load into multiple registers - there is often a "load multiple register" command

c) sometimes a special PREFETCH instruction that loads into a buffer, that you then read out of 32 or 64b at a time

d) Intel just added a godawful SSE streaming load, that does much of the above.

e) sometimes you have a DMA engine that can do a burst read from UC memory, and write to cacheable memory.

I prefer explicit software - a) or b)

Tell us if SH has any of the above.

(I think I should add that to the comp-arch.net FAQ)

Reply to
Andy "Krazy" Glew

I found the CD "memory type" in the AMD manual,

formatting link
excerpted at the bottom of this post.

I quote "memory type", because it is not really a memory type that can be stored in the MTRRs or PAT. It arises from the CR0.CD=1 control register bit setting.

AMD says that CD memory may have been cached due to earlier cacheable access, or due to virtual address aliasing.

If you think about it, however, that may also arise with regular UC memory.

*May*. Should not, if the OS has managed the caches correctly; but may, because bugs happen.

So basically CD is UC that snoops the cache. From which I assume that UC does not snoop the cache on at least some AMD systems.

Does UC snoop the cache on Intel systems? Validation people would definitely prefer that it did not. Snooping would waste power and potentially hurt performance. But it might maintain correctness. The downsides might be mitigated by directories.

My druthers would be to have a separate "snoop" bit. You might create an uncached-but-snoopable memory type, to be used for cache management - sometimes a data structure would be accessed cacheably, sometimes not. I would rather have something like a per-memory-access instruction prefix that said "don't cache the next load". Failing that, however, you can set up aliases for virtual memory, and arrange things so that you simply need to OR (ADD) in a particular base address to get the uncached version of an address.

My druthers arise from the fact that I am a computer architect who is also a performance programmer. If I am a performance programmer, I want to be able to control cacheability.

However, most computer architects are more sympathetic to validation concerns than they are to performance programmers. Validation wants to eliminate cases, even if it makes the system less regular. (I call this introduction of irregularity in the usually forlorn hope of reducing validation complexity "Pulling a Nabeel" after a validator who practiced it. IMHO it is better to learn about proper experimental design, e.g. Latin Squares, as a way of reducing validation complexity by significantly higher degrees.) Such not-very-sophisticated validation-driven computer architecture tends to want to say "The OS shall not allow aliasing of memory types, whether temporal or spatial." I.e. a given physical memory address should not be in the cache as a result of an earlier cacheable access when it is now globally uncacheable (temporal aliasing). Similarly for virtual address aliasing.

I think this is shortsighted.

a) because performance programmers really do want to be able to practice aliasing

b) because bugs in OSes happen - aliasing happens.

It is especially shortsighted if validation, or, worse, the cache designer takes advantage of this decreed but not enforced prohibition of aliasing, and does something damaging. Like, causing data to become incoherent in weird ways. Not so bad if they do something like causing a machine check if aliasing is detected, e.g. if a UC access hits in a cache. Mainly because you will quickly learn how common such issues are. But silently corrupting the system - not so good.

Glew's morals:

a) Aliasing of memory types happens, both temporal and spatial. Live with it. Better yet, take advantage of it.

b) Orthogonality is good. Consider a separate snoop bit.

But all this is water under the bridge.

--
Unfortunately, AMD's CD memory type does not seem to be a UC-MEMORY vs. UC-MMIO type.

Actually, WC is in many ways a UC-MEMORY type, although it goes further than UC-MEMORY, allowing stores to be out of order.
Reply to
Andy "Krazy" Glew

AFAIU, my system works along the lines of your latter description.

There's a 29-bit physical address space, and the "top" 3 bits in a virtual address define an "address region" (P0-P4).

if b31 == 0 then region P0 else region P(b30*2+b29+1)

Mask to 29-bit

The physical address is given by taking the virtual address and replacing bits [31:29] by 3 zero bits. This gives a physical address in the range 0x0000 0000 to 0x1FFF FFFF. Only physical addresses in the range 0x0000 0000 to 0x1BFF FFFF may be safely accessed through a virtual address that is handled by masking to 29 bits. If masking gives a physical address in the range 0x1C00 0000 to 0x1FFF FFFF, the behavior of the ST40 is undefined; the sole exception is for accesses to the operand cache RAM mode area 0x7C00 0000 to 0x7FFF FFFF when CCR.ORA=1.

The physical address range 0x1C00 0000 to 0x1FFF FFFF must only be accessed either: o through a P4 virtual address or o for the range 0x1D00 0000 to 0x1FFF FFFF, by setting MMUCR.AT=1 and using an address translation in the UTLB
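In C terms (just transcribing the rule above, nothing platform-specific beyond it):

#include <stdint.h>

/* Which region (P0-P4) a virtual address falls in, per the rule above. */
static int region(uint32_t vaddr)
{
    if ((vaddr & 0x80000000u) == 0)                          /* b31 == 0 */
        return 0;                                            /* P0       */
    return ((vaddr >> 30) & 1) * 2 + ((vaddr >> 29) & 1) + 1;
}

/* 29-bit masking: the physical address is the virtual address with bits
 * [31:29] cleared; per the manual this is only safe up to 0x1BFFFFFF. */
static uint32_t phys_from_vaddr(uint32_t vaddr)
{
    return vaddr & 0x1FFFFFFFu;
}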

Regards.

Reply to
Noob
