ARM9 memory throughput

Hi folks.

I'm working on an ARM9 system (the DaVinci). Yesterday a workmate and I ran some tests to measure the memory speed of the system, and we're puzzled by the numbers we got.

In short the system during our tests looked like this:

  • RAM is DDR2 at roughly 160 MHz
  • The CPU is an ARM9 and runs at roughly 300 MHz
  • We disabled all components that may access the RAM (e.g. video out, DSP etc.).

In our tests we cleared a 3 MB chunk of memory using 32-bit writes and measured a throughput of 200 MB/s. We tried everything we could think of to improve the speed, e.g. unrolling the loop, using store-multiple instructions etc. We always get the 200 MB/s. We can even put up to four NOPs between the writes and the numbers don't change.
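For concreteness, here's a minimal sketch (in C rather than the actual assembly) of the kind of loop we timed; read_timer_us() is a stand-in for whatever platform timer is available:

    #include <stdint.h>
    #include <stddef.h>

    /* Stand-in: returns a microsecond timestamp from some platform timer. */
    extern uint32_t read_timer_us(void);

    /* Clear 'bytes' bytes at 'dst' with plain 32-bit stores; returns MB/s. */
    double clear32_mbps(uint32_t *dst, size_t bytes)
    {
        size_t   words = bytes / 4;
        uint32_t t0 = read_timer_us();
        for (size_t i = 0; i < words; i++)
            dst[i] = 0;
        uint32_t t1 = read_timer_us();
        return (double)bytes / (double)(t1 - t0);  /* bytes/us == MB/s */
    }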

Doing the same using DMA I get numbers around 1.3 GB/s on the same system.

I know that I never get the full theoretical memory throughput, but 200 MB/s is a lot less than we expected. Now I want to understand why this happens. Unfortunately I know s**t about memory interfaces, memory latencies and all the other stuff.

Could someone please explain to me what the memory and the CPU do between the writes?

Thanks Nils

Reply to
Nils

The ARM9 has a write buffer, which you need to enable by marking the memory as bufferable/cacheable. After that the writes go straight into the DDR2 command queue. The write buffer won't merge writes, and I don't think the DDR2 controller will do so either, so you need to use STM with an even number of registers to get the maximum bandwidth.
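A minimal sketch of what that can look like, assuming GCC-style ARM inline assembly (the function name is illustrative); the 8-register STM gives an even register count and stores 32 bytes per instruction:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: clear 'bytes' bytes at 'dst' using 8-register STMs.
     * Assumes dst is 32-byte aligned and bytes is a non-zero multiple of 32. */
    static void clear_stm8(uint32_t *dst, size_t bytes)
    {
        uint32_t *end = dst + bytes / 4;
        __asm__ volatile(
            "mov   r2, #0        \n\t"
            "mov   r3, #0        \n\t"
            "mov   r4, #0        \n\t"
            "mov   r5, #0        \n\t"
            "mov   r6, #0        \n\t"
            "mov   r7, #0        \n\t"
            "mov   r8, #0        \n\t"
            "mov   r9, #0        \n\t"
            "1:                  \n\t"
            "stmia %0!, {r2-r9}  \n\t"  /* store 8 words, post-increment */
            "cmp   %0, %1        \n\t"
            "blo   1b            \n\t"
            : "+r"(dst)
            : "r"(end)
            : "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9", "memory", "cc");
    }

With the region marked bufferable, each STM should be able to drain to the controller as one burst.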

You could try loading each cacheline before overwriting it, if everything is configured correctly you should get a similar result as the DMA.
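As a rough sketch of that idea (assuming a 32-byte line size, the usual figure for an ARM926, and a read-allocate write-back data cache over line-aligned memory):

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: touch each 32-byte line with a load before clearing it, so the
     * eight word stores hit the data cache and the dirty lines go back to
     * DRAM as full-line burst writebacks. Assumes dst is 32-byte aligned
     * and bytes is a multiple of 32. */
    static void clear_preload(uint32_t *dst, size_t bytes)
    {
        volatile uint32_t *p = dst;
        for (size_t line = 0; line < bytes / 32; line++) {
            (void)p[0];            /* load: allocate the line in the cache */
            for (int w = 0; w < 8; w++)
                p[w] = 0;          /* stores now hit the cached line */
            p += 8;                /* advance one 32-byte line */
        }
    }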

Wilco

Reply to
Wilco Dijkstra

... snip ...

Let's assume a simple testing mechanism. The actual assembly code will be something like:

        call  recordtime
        mov   r1, #I          ; number of tests to apply
        mov   r2, A1          ; starting address to use
        mov   r3, #0          ; initialize counter
    lp:                       ; start of loop
        mov   r4, r2+r3       ; where to write
        mov   (r4), #0        ; what we are measuring!!!
        inc   r3
        cmp   r3, r1
        jnz   lp              ; do it again
                              ; end of loop
        call  recordtime
        call  computeanddisplay

Now look at the work done within the loop compared to the writes. Each instruction requires a memory read just to fetch it, and there are 5 of them inside the loop. Even if the CPU required no time at all to execute them, that is already a 6:1 reduction in writing speed from the raw memory access speed: six memory accesses (five instruction fetches plus one data write) per word written.

Smart use of caches etc. can improve this ratio. It will never become 1. And any such improvement costs money.

--
 [mail]: Chuck F (cbfalconer at maineline dot net) 
Reply to
CBFalconer

That ignores the store-multiple instruction and, probably, the fact that the ARM9 has an instruction cache and a Harvard architecture (internally), so instruction fetches don't have to compete with the data writes. It should be entirely possible to saturate the memory bus with writes.

--

John Devereux
Reply to
John Devereux

I don't know ARM, but on some PPC implementations one can be taken by surprise at the beginning: if the write is a cache miss, the processor will read the entire cache line and then write to it. To avoid first reading all the memory one only wants to write to, there is a specific opcode (dcbz) that sets the entire cache line to 0 and marks it valid so that it will take the writes.
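For illustration, a sketch of how dcbz is typically used for this (GCC PowerPC inline assembly; the 32-byte line size is an assumption and varies between implementations):

    #include <stddef.h>

    #define CACHE_LINE 32  /* assumed line size; check your implementation */

    /* Sketch: zero 'bytes' bytes at 'dst' one cache line at a time. dcbz
     * establishes each line in the cache as all zeroes without reading it
     * from memory first. Assumes dst is line-aligned and bytes is a
     * multiple of CACHE_LINE. */
    static void clear_dcbz(void *dst, size_t bytes)
    {
        char *p = (char *)dst;
        for (size_t off = 0; off < bytes; off += CACHE_LINE)
            __asm__ volatile("dcbz 0,%0" : : "r"(p + off) : "memory");
    }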

Didi

------------------------------------------------------
Dimiter Popoff
Transgalactic Instruments
------------------------------------------------------

Reply to
Didi

Hi folks.

First off sorry for the long delay. Thanks for all your answers!

Anyway, I ran some more tests and got some more facts.

I did my tests with cached memory. The original tests ran on MontaVista Linux, and I doubt the guys who did the port were bold enough not to enable the caches. Also, the instruction cache seems to be working quite nicely.

I double-checked on the "automotive market" system running QNX, and I get comparable performance figures.

After doing some additional tests (e.g. using the store-multiple instruction) I got a 30% speed-up. It was still far below the theoretical throughput. And now the fun starts: as soon as I disabled the data cache (only possible on QNX; Linux does not let me allocate such memory) I got the same speed as the DMA.

Since it's a TI chip, my workmate and I contacted our TI guys to find out what's going on. I see the same effect if I access the memory from the DSP (it's a DSP/ARM system). On the DSP the effect is much more dramatic, though: its L2 cache has 128-byte cache lines and is write-allocate, so you get a massive stall (a full 128-byte line fill) as soon as you write to a location that's not cached.

My current guess is that the memory interface is far from clever and does some incredibly stupid things. I've found out that, even if I only get a couple of megabytes per second by memsetting stuff, I saturate the whole bandwidth of the system and force all the other cores/peripherals onto their knees.

If I get news I'll let you know.

Btw, I'm anything but an expert in this area, but why do they produce CPUs with several hundred megahertz and put less cache on the die than a 486 running at a fraction of the frequency had? All the ARM/MIPS embedded devices I've worked with so far have wasted most of their processing-power potential through wimpy caches and long memory stalls.

Reply to
Nils
