Coldfire MCF5475 performance question

- David Hearn

Wed, Jun 7, 2006 9:13 AM

We're trying to use an MCF5475 for some high-speed data logging, so we ran a quick benchmark using a simple app and an oscilloscope. We're getting performance significantly below what we were expecting for a 266 MHz / 410 MIPS processor.

As a basic test I wrote a simple application which basically was just a simple loop (important bits detailed below):

    typedef struct {
        uint32 value;
        unsigned char status;
    } test_struct;

    test_struct source;
    test_struct destination;

    while (1) {
        for (temp_loop = 0; temp_loop < 1000; temp_loop++) {
            memcpy(&destination, &source, sizeof(test_struct));
            memcpy(&source, &source, sizeof(test_struct));
        }

        // Set output to match (high)
        MCF_GPIO_PODR_DSPI |= MCF_GPIO_PODR_DSPI_PODR_DSPI2;

        // Set output to match (low)
        MCF_GPIO_PODR_DSPI &= ~MCF_GPIO_PODR_DSPI_PODR_DSPI2;
    }

Basically we're looping 1000 times, each time copying about 5 bytes of memory using memcpy (provided from Freescale sample code). At the end of that loop we set some GPIO pins high and then low again and repeat the loop. We then use the oscilloscope to measure the time it takes between each GPIO toggle.

We're seeing that it's taking:

a.) 1.5ms to do the whole process if we don't have any memcpy in the loop (just an empty for loop).
b.) 15ms to do the whole process if we have 1 memcpy in there.
c.) 28ms to do the whole process with 2 memcpys in there.

Using a debugger, it appears that one cycle of the loop with a single memcpy takes about 60 instructions.

The difference between the empty loop and the 1 memcpy loop is about 13.5ms (for 1000 iterations). So that's 13.5us for 60 instructions, which works out to about 4,440,000 instructions per second, or roughly 4.4 MIPS.
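
The arithmetic in this estimate can be checked with a small helper. This is only a sketch: the function name `instructions_per_second` is made up for illustration, and the inputs are the figures quoted in the post (13.5 ms of extra time for 1000 iterations, about 60 instructions per iteration).

```c
/* Sketch: effective instruction rate derived from a scope measurement.
 * instructions_per_second() is a hypothetical helper for illustration,
 * not part of any Freescale sample code. */
double instructions_per_second(double elapsed_seconds,
                               unsigned long iterations,
                               unsigned long instructions_per_iteration)
{
    /* Total instructions executed during the measured interval. */
    double total = (double)iterations * (double)instructions_per_iteration;

    return total / elapsed_seconds;
}
```

Calling `instructions_per_second(13.5e-3, 1000, 60)` gives roughly 4.44 million instructions per second, which matches the figure above.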

Any idea what explains the factor of 100 difference between the value in the specs (410 MIPS) and our example? I realise that each benchmark is different - but a factor of 100?

Thanks

- David Brown

Wed, Jun 7, 2006 9:18 AM

I think you should first forget about the memcpy calls, and ask why a thousand empty loops should take 1.5 ms. Look at the assembly code generated there, and try to figure out what is happening.

- David Hearn

Wed, Jun 7, 2006 9:32 AM

The empty loop's instructions are:

    MOVEQ  #63,D0
    CMP.L  (FFCE,A6),D0
    BLT.B  0C007C          (this is the exit jump)
    MOVEQ  #01,D1
    ADD.L  D1,(FFCE,A6)
    BRA.B  0C006C          (this is the first instruction above - ie. jump back to beginning of loop)

This takes 1.5ms for 1000 iterations of these 6 instructions.

D

- David Brown

Wed, Jun 7, 2006 9:50 AM

I can't see how that could give 1000 iterations, unless the local variable is initialised with -937 instead of 0. It's also very poor code - are you compiling it with all optimisation off? I find it is normally much easier to see what is happening at the assembly level with basic optimisation enabled.

Are you running this from external memory with all caching and the like disabled?

Have you checked your clock, to see if you are running at 266 MHz?

- John Devereux

Wed, Jun 7, 2006 9:51 AM

Have you enabled the cache?

--

John Devereux

- David Hearn

Wed, Jun 7, 2006 2:14 PM

Optimisation is off at present, as recommended by P&E for debugging. I also wasn't sure whether the compiler would optimise the loop out entirely, since it has no instructions within it.

The actual code which generated this empty loop was:

for (temp_loop = 0; temp_loop < 1000; temp_loop++) { }

Running from SDRAM.

As for cache - didn't realise I had to turn it on! Looked at some startup code and found an assignment to CACR (cache control register) that invalidates the data, branch and instruction caches but does not turn them on. I've since set the 3 bits in this register to turn each of the caches on, and the empty loop now takes 106us, a factor of 14 improvement.
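
As a rough sketch of the two CACR values involved (invalidate only, versus enable), something like the following could be used. The bit positions are assumptions and should be checked against the MCF547x reference manual before use; the macro and function names are made up for this example. Actually writing CACR needs a privileged MOVEC instruction, which is toolchain-specific and not shown here.

```c
#include <stdint.h>

/* ASSUMPTION: CACR bit positions for a ColdFire V4e core. Verify each
 * one against the MCF547x reference manual; the names are illustrative. */
#define CACR_DEC    UINT32_C(0x80000000) /* data cache enable             */
#define CACR_DCINVA UINT32_C(0x01000000) /* data cache invalidate all     */
#define CACR_BEC    UINT32_C(0x00080000) /* branch cache enable           */
#define CACR_BCINVA UINT32_C(0x00040000) /* branch cache invalidate all   */
#define CACR_IEC    UINT32_C(0x00008000) /* instruction cache enable      */
#define CACR_ICINVA UINT32_C(0x00000100) /* instruction cache invalidate  */

/* Value that only invalidates the three caches (what the startup code did). */
uint32_t cacr_invalidate_all(void)
{
    return CACR_DCINVA | CACR_BCINVA | CACR_ICINVA;
}

/* Value with the three enable bits set (the change described above). */
uint32_t cacr_enable_all(void)
{
    return CACR_DEC | CACR_BEC | CACR_IEC;
}
```

Under these assumptions the enable value is 0x80088000. Cache modes and the access control registers also affect what actually gets cached, and they are outside the scope of this sketch.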

Thanks for the advice on that!

As for the clock: this is something I thought of, but I wasn't sure where to check it. It's a standard evaluation board with little or no configuration available, and the board only comes in one model, so there's no chance of a mistake over the purchase.

I'll now go back and look at the other suggestions and see whether there's any more tweaking I can do - but for now, it appears that at least for an empty loop, performance has increased.

Thanks again

D

- David Hearn

Wed, Jun 7, 2006 2:15 PM

I wasn't aware I needed to! Having looked into it, the sample code I was using only set the invalidate flag on the data, branch and instruction caches. I've now also enabled these caches and the time for the empty loop has gone from 1.5ms for 1000 iterations, down to 106us for 1000, a factor of 14!

Thanks for the advice!

D

- David Brown

Wed, Jun 7, 2006 2:28 PM

Too much optimisation can make debugging hard - it becomes hard to follow what's happening as the compiler re-arranges everything. But too little optimisation can also make it difficult, since the compiler puts data on the stack and uses unnecessarily long, slow code sequences. It depends on your compiler and debugger, but I find (with gcc and gdb) that -O gives a reasonable compromise.

You have to watch out for a few things, however - an empty loop like this can be removed entirely. The correct way to deal with this is to declare "temp_loop" to be volatile (a slightly different alternative would be to add an assembly "nop" inside the loop).
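
A minimal sketch of the volatile approach is shown here. The function name `spin` and its parameter are made up for the example; only `temp_loop` comes from the thread.

```c
/* Sketch: an empty timing loop that the optimiser has to keep.
 * spin() is a hypothetical wrapper used only for illustration. */
unsigned long spin(unsigned long count)
{
    /* volatile: every read and write of the counter must really happen,
     * so the compiler cannot delete the loop even though the body is empty. */
    volatile unsigned long temp_loop;

    for (temp_loop = 0; temp_loop < count; temp_loop++) {
        /* intentionally empty */
    }

    return temp_loop; /* equals count once the loop has finished */
}
```

With a volatile counter the variable typically lives in memory rather than in a register, so the timing is closer to the unoptimised code than to a register-only loop.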

The easiest way to check your clock rate is if you have a decent scope, look at the clock output pin (driving the clock to the sdram, for example). You'll have to check what the bus division ratio is for your chip - it's likely to be divide by 3.

The clock rate for most micros (with configurable clocks) is very conservative to start with - the 150 MHz MCF5234 I'm using at the moment comes out of reset at 37.5 MHz (using a 25 MHz reference).

- 42Bastian Schick

Thu, Jun 8, 2006 6:14 AM

This can't be the code. Please check your output again, or try to post all the relevant stuff. BTW: an asm(" nop;nop;"); before and after the code makes it easier to find.

--
42Bastian
Do not email to bastian42@yahoo.com, it's a spam-only account :-)
Use @monlynx.de instead !

- 42Bastian Schick

Thu, Jun 8, 2006 6:15 AM

But still too much. It should be around 22us@266MHz.
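
The 22us figure can be reproduced with a short calculation, assuming one instruction per clock cycle and the 6-instruction loop posted earlier in the thread. The helper name `ideal_loop_time_us` is made up for this sketch.

```c
/* Sketch: best-case loop time in microseconds for a given clock rate.
 * ideal_loop_time_us() is a hypothetical helper; the instructions-per-cycle
 * figure is an assumption, not a measured value. */
double ideal_loop_time_us(unsigned long iterations,
                          unsigned long instructions_per_iteration,
                          double clock_hz,
                          double instructions_per_cycle)
{
    double instructions = (double)iterations * (double)instructions_per_iteration;
    double seconds = instructions / (clock_hz * instructions_per_cycle);

    return seconds * 1e6; /* convert seconds to microseconds */
}
```

`ideal_loop_time_us(1000, 6, 266e6, 1.0)` gives about 22.6us, so the measured 106us is still roughly 4.7 times the ideal figure.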

--
42Bastian

- 42Bastian Schick

Thu, Jun 8, 2006 6:26 AM

I doubt the 410 MIPS that Freescale advertises (1.5 instructions per cycle!), and I doubt even more that you can achieve that (if ever) with memcpy().

--
42Bastian

- David Brown

Thu, Jun 8, 2006 7:05 AM

Why is that hard to believe? After all, the figure of 410 is for peak Dhrystone MIPS, which is not at all the same thing as claiming the processor can execute 410M random instructions per second. Since Dhrystones are based on a particular ISA (was it a VAX?), a processor with a more powerful or more efficient instruction set is going to get better scores per MHz. Also, the Coldfire v4 does its branch prediction in the instruction prefetch-decode pipeline, so correctly predicted branches take 0 cycles, and it can do limited superscalar execution (register moves, amongst other simple instructions, are done in parallel). So for a small loop where everything is in the caches and branch target buffers, you'll get more than one real instruction per clock cycle.

Of course, for a memcpy, you're going to be constrained by the memory bandwidth more than anything else.

- 42Bastian Schick

Thu, Jun 8, 2006 8:07 AM

*Arg*, overlooked the "Dhrystone" :(

--
42Bastian

- David Hearn

Thu, Jun 8, 2006 8:30 AM

Well, in the debugger I was stepping through from the start of the 'for' loop in C, and then stepped into the assembly. Those 6 instructions were repeated, jumping back to the first one each time.

D