MPC8641D (dual e600 core) memory latency?

Does anyone know what the SDRAM memory latency is going to be for this new embedded controller chip? Latencies on past PPCs were abysmal (~280 ns for MPC7447 1.4 GHz with Marvell Discovery memory controller). I'm hoping the MPC8641D does better with its on-chip memory controller. It would be nice if the latency approached the SDRAM tRC, like modern x86s.

--
/*  jhallen@world.std.com AB1GO */                        /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
Joseph H Allen

This includes a TLB refill, surely. Strided memory latency (i.e. mostly without TLB refill) was ~105 ns on the last G4 PowerMacs.

If the MPC8641D still isn't shipping in volume, it is no wonder that Freescale ditched the Crolles2 alliance for IBM's gang.

--
Mvh./Regards,    Niels Jørgen Kruse,    Vanløse, Denmark
Niels Jørgen Kruse

Actually I forgot all about the TLB for my tests. The MMU is going to be off for my actual use case (an embedded system), but now that I think about it I'm amazed at how fast the TLB refill is for all of the other processors. Even so, I get ~56 ns on my 2 GHz Pentium-M laptop under Cygwin and ~280 ns on a 1.4 GHz Mac mini under Linux. For L2-bound accesses, the G4 is not so bad (~6 ns for an L2 hit on the G4 vs. ~4 ns for the Pentium-M).

What I really want to know is how long it takes new I/O data written to main memory to get to the CPU. So the performance of the I/O bus obviously is going to matter, but also the coherency protocol. The simple exceed-the-L2-cache-size latency measurement at least gives me an idea of this.

I'm inexperienced with this type of measurement, so I'm open to any hints. Basically I'm doing 200M byte reads from random memory locations, and adjusting the address window (number of 1s in an AND mask) to see the effects on total run time. I don't want to test the CPU itself or random(), so the random number generator is just z = (z>>28) + (z

Joseph H Allen

What benchmark are you using?

With lmbench-2.0.4 "lat_mem_rd 32 512", I get (for 30MB working set size):

 ns  System
 54  Athlon 64 3200+ (Socket 754, 2GHz), PC2700 ECC RAM, 1 DIMM
 64  Athlon 64 X2 4400+ (Socket 939, 2GHz), PC3200? ECC RAM, 2 DIMMs per channel
 73  Dual Opteron 270 (Dual-core 2GHz), PC2700?R ECC RAM, 2 DIMMs per channel
112  Dual Xeon 5160 (Dual-core), 5000P MCH, PC5300F ECC RAM, 3 DIMMs per channel
119  2.26GHz Pentium 4, i845E memory controller, PC2100 RAM, 2 DIMMs
142  iBook G4 1066MHz, soldered-in memory

The iBook should be very similar to your Mac mini. So the latency is not great, but it's not as bad in lmbench as your numbers indicate. Also, your Pentium-M numbers look unreasonably good. Maybe a stride predictor helped?

Try lmbench, especially newer versions (which are less prone to fall victim to a stride predictor). Another benchmark is , but that version produces many TLB misses IIRC (so you get worse numbers than from lat_mem_rd).

Followups to comp.arch.

- anton

--
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Anton Ertl

As you describe it, there is no dependency between loads, so this is a bandwidth test rather than a latency test. You need to set up a pointer chain and chase along it to measure latency.
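To make the distinction concrete, here is a minimal sketch of the two loop shapes. It is not the original poster's code: the names build_cycle, chase and sweep are made up for illustration, the links are 32-bit indices rather than real pointers, the random cycle is built with Sattolo's algorithm (my choice, not something from this thread), and the address generator in sweep is a stand-in LCG.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Link n slots into one random cycle (Sattolo's algorithm), so that
   following next[] visits every slot once before returning to the start.
   rand() is only a convenience here; its quality affects the shuffle,
   not the single-cycle property. */
static void build_cycle(uint32_t *next, uint32_t n, uint32_t seed)
{
    assert(n > 0);
    for (uint32_t i = 0; i < n; i++)
        next[i] = i;
    srand(seed);
    for (uint32_t i = n - 1; i > 0; i--) {
        uint32_t j = (uint32_t)rand() % i;   /* j < i keeps it one cycle */
        uint32_t t = next[i];
        next[i] = next[j];
        next[j] = t;
    }
}

/* Dependent loads: each address is the result of the previous load,
   so a miss cannot be overlapped with the next one (latency test). */
static uint32_t chase(const uint32_t *next, uint32_t start, size_t steps)
{
    uint32_t p = start;
    while (steps--)
        p = next[p];
    return p;
}

/* Independent loads: addresses come from arithmetic, not from memory,
   so the hardware is free to overlap misses (closer to a bandwidth test).
   mask must be (power of two) - 1 and smaller than the array length. */
static uint32_t sweep(const uint32_t *a, uint32_t mask, size_t steps)
{
    uint32_t z = 1, sum = 0;
    while (steps--) {
        z = z * 1664525u + 1013904223u;      /* stand-in LCG */
        sum += a[(z >> 8) & mask];
    }
    return sum;
}
```

Timing chase over an array much larger than L2 and dividing by the step count gives a per-load latency; timing sweep over the same array will usually come out lower, because nothing forces the loads to wait for each other.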

Because of the hashed page tables and the smaller L2, the 7447 will have to go to main memory for TLB refill much more than the Pentium, preventing overlap between loads. TLB refill blocks further loads from starting memory access on the 7447. You won't have this problem with MMU off (whatever you mean by that).

--
Mvh./Regards,    Niels Jørgen Kruse,    Vanløse, Denmark
Niels Jørgen Kruse

You're right, so I tried pointer chasing:

248 ns for 1.47 GHz MPC7447 (a Mac mini)
200 ns for 1.266 GHz P-III (some VIA chipset)
172 ns for 3.6 GHz Xeon (E7520, PC2700)
133 ns for 2 GHz Pentium-M (855PM MCH)

Mapping is on in all cases, unfortunately.

The test is this: there is a 2M-entry linked list, 64 B per entry, with a 4 B pointer at the beginning of each entry. Entries are allocated randomly (pick a random address, walk forward until a free slot is found) out of 4M slots (total memory range 256 MB; total bytes with data 8 MB, i.e. 2M 4 B pointers).
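For anyone who wants to reproduce this, a rough sketch of the test as described might look like the following. It is a reconstruction under assumptions, not the code that produced the numbers above: entry_t, build_list and ns_per_load are hypothetical names, the 4 B "pointer" is stored as a 32-bit slot index so the layout is the same on 64-bit hosts, entries are chained in allocation order with the last one linked back to the first, and clock() (CPU time) stands in for whatever timer was actually used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef struct {
    uint32_t next;      /* 4-byte link: index of the next entry */
    uint8_t  pad[60];   /* pad the entry out to 64 bytes */
} entry_t;

#define FREE_SLOT UINT32_MAX

/* Place `entries` list nodes at random slots of pool[0..slots-1]:
   pick a random slot, walk forward (wrapping) until a free one is found.
   Nodes are chained in allocation order and the last links to the first.
   Returns the index of the first node. */
static uint32_t build_list(entry_t *pool, uint32_t slots,
                           uint32_t entries, uint32_t seed)
{
    assert(entries > 0 && entries <= slots && slots < FREE_SLOT);
    for (uint32_t i = 0; i < slots; i++)
        pool[i].next = FREE_SLOT;
    srand(seed);
    uint32_t first = 0, prev = 0;
    for (uint32_t k = 0; k < entries; k++) {
        /* combine two rand() calls in case RAND_MAX is only 32767 */
        uint32_t s = (((uint32_t)rand() << 15) ^ (uint32_t)rand()) % slots;
        while (pool[s].next != FREE_SLOT)
            s = (s + 1) % slots;
        pool[s].next = s;               /* mark used; relinked below */
        if (k == 0)
            first = s;
        else
            pool[prev].next = s;
        prev = s;
    }
    pool[prev].next = first;            /* close the loop */
    return first;
}

/* Chase the list for `loads` dependent loads; returns average ns per load.
   The final index is written to *end so the loop cannot be optimized away. */
static double ns_per_load(const entry_t *pool, uint32_t start,
                          size_t loads, uint32_t *end)
{
    uint32_t p = start;
    clock_t t0 = clock();
    for (size_t i = 0; i < loads; i++)
        p = pool[p].next;
    clock_t t1 = clock();
    *end = p;
    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)loads;
}
```

With slots = 4M and entries = 2M the pool is the 256 MB range from the description. Smaller values are handy for checking that the list really closes on itself before running the full-size test.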

Unfortunately, 128 TLB entries x 4K page size is only 512K (on Pentium-M anyway), which fits in the L2 cache. I don't think there's an easy way to discount the TLB miss effects.
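The arithmetic behind that, as a small sketch (tlb_reach and pages_spanned are made-up helper names; the 128-entry and 4K figures are the ones quoted above, nothing here is measured):

```c
#include <stdint.h>

/* Bytes of address space the TLB can map at once. */
static uint64_t tlb_reach(uint64_t tlb_entries, uint64_t page_bytes)
{
    return tlb_entries * page_bytes;
}

/* Number of pages needed to cover a memory range (rounded up). */
static uint64_t pages_spanned(uint64_t range_bytes, uint64_t page_bytes)
{
    return (range_bytes + page_bytes - 1) / page_bytes;
}
```

tlb_reach(128, 4096) is 524288 bytes (512K), while the 256 MB range of the list spans pages_spanned(256 << 20, 4096) = 65536 pages, so with random placement nearly every load in the chase lands on a page that is not in the TLB.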

--
/*  jhallen@world.std.com AB1GO */                        /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
Joseph H Allen

Read and the thread containing it for a discussion of how to take out the stride predictor separately from the other effects.

This resulted in bplat, and in and the thread surrounding it you can find a discussion of the differences between various lmbench/lat_mem_rd latencies, and bplat, and some shortcomings of the parameters hardwired (but still relatively easy to change) into the current bplat.

Followups to comp.arch

- anton

--
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Anton Ertl
