Math computing time statistics for ARM7TDMI and MSP430

Hello,

For an estimate of the required computing time, I would like to know roughly how long current controllers need for math operations (addition/subtraction, multiplication, division, and also logarithm) in single and/or double precision floating point format (assuming common compilers).

The MCUs in question are ARM7TDMI of NXP/Atmel flavour (LPC2000 or SAM7), and Texas MSP430.

Can anyone provide a link to some statistics?

Thanks, Tilmann

--
http://www.autometer.de - Elektronik nach Maß.
Reply to
Tilmann Reh

Might be hard to find...

Look at

formatting link
Recently Philips/NXP made some noise about their core being 37 to 51 percent better than other ARM7 cores, because of their wider memory paths.

Generally, the ASM opcodes will give some indication. Some uC lack division, others have it in HW, and that will make a huge difference to that corner of the benchmark.

E.g. recently we needed extended scaling, and we found the Zilog ZNEO has a 64/32=32 divide and a 32*32=64 multiply. To access them we had to use in-line ASM, but once we did, the result was maybe 1000x faster than a library call to shift/subtract SW division on a uC without divide opcodes.
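
In plain C, that kind of extended scaling is just a widening multiply followed by a narrowing divide; whether it compiles to single opcodes or to library calls depends entirely on the target and toolchain. A minimal sketch (function name is mine):

    #include <stdint.h>

    /* y = x * num / den, with a 64-bit intermediate so the product
       cannot overflow. On a part with 32*32=64 multiply and 64/32=32
       divide opcodes this can become a handful of instructions;
       without them the divide turns into a shift/subtract library call. */
    uint32_t scale_u32(uint32_t x, uint32_t num, uint32_t den)
    {
        uint64_t wide = (uint64_t)x * num;   /* widening multiply */
        return (uint32_t)(wide / den);       /* narrowing divide */
    }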

Also likely to be well-maths-resourced are DSP uC like the TMS320F2802

-jg

Reply to
Jim Granville

I've encountered something similar on an ARM7TDMI. We needed the 32*32=64 multiply, but could not find a way to get the compiler to emit the smlal (IIRC) instruction, so we also ended up doing this in asm. Has anybody found a way to get the compiler to do this (ADS or GCC)?
--
Stef    (remove caps, dashes and .invalid from e-mail address to reply by mail)
Reply to
Stef

Just tried it, and this works for me (gcc 4.1.1):

long long c = (long long) a * (long long) b;

long long d = (long long) c + (long long) a * (long long) b;

The second statement emits an smlal.
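
For reference, a self-contained version of the same cast pattern (function names are mine; the key point is casting both operands before the multiply):

    #include <stdint.h>

    /* Widening multiply: GCC for ARM can map this to a single smull. */
    int64_t mul32x32_64(int32_t a, int32_t b)
    {
        return (int64_t)a * (int64_t)b;
    }

    /* Multiply-accumulate: with the casts in place this should map
       to smlal, as in the snippet above. */
    int64_t mac32x32_64(int64_t acc, int32_t a, int32_t b)
    {
        return acc + (int64_t)a * (int64_t)b;
    }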

Regards,

Dominic

Reply to
Dominic

Just as a general point: If you're considering software DSP applications, unless they're _INHERENTLY_ constrained and will never need to be scalable, ARM is strongly suggested IMHO. MSP430's address space is architecturally limited. Targeting ARM from the get-go will leave the door open for more complex algorithms, larger sample buffers, etc.

Reply to
larwe

Hey, that works, thanks! I tried this small FIR filter example:

long long try_smlal(long *a, long *b)
{
    long long rv = 0;
    rv += (long long) *(a++) * (long long) *(b++);
    rv += (long long) *(a++) * (long long) *(b++);
    rv += (long long) *(a++) * (long long) *(b++);
    rv += (long long) *(a++) * (long long) *(b++);
    return rv;
}

The result with GCC 3.2.1 is:

020002b8 <try_smlal>:
 20002b8:  e92d4030   stmdb   sp!, {r4, r5, lr}
 20002bc:  e1a03000   mov     r3, r0
 20002c0:  e1a02001   mov     r2, r1
 20002c4:  e493e004   ldr     lr, [r3], #4
 20002c8:  e492c004   ldr     ip, [r2], #4
 20002cc:  e4930004   ldr     r0, [r3], #4
 20002d0:  e4921004   ldr     r1, [r2], #4
 20002d4:  e0c54190   smull   r4, r5, r0, r1
 20002d8:  e1a01005   mov     r1, r5
 20002dc:  e1a00004   mov     r0, r4
 20002e0:  e0e10e9c   smlal   r0, r1, ip, lr
 20002e4:  e493e004   ldr     lr, [r3], #4
 20002e8:  e492c004   ldr     ip, [r2], #4
 20002ec:  e0e10e9c   smlal   r0, r1, ip, lr
 20002f0:  e593c000   ldr     ip, [r3]
 20002f4:  e5923000   ldr     r3, [r2]
 20002f8:  e0e10c93   smlal   r0, r1, r3, ip
 20002fc:  e8bd8030   ldmia   sp!, {r4, r5, pc}

It looks almost optimal; the only thing I don't understand is why the smull result is placed in r4/r5 and then moved to r0/r1, but on a 20-tap filter that wouldn't really be significant.

The last time I tried was years ago with an ADS compiler, and then I couldn't get it to work. It may have been the wrong casts or just the old compiler.

The other optimization we did with this was to run the function out of the AT91's internal SRAM, by copying it from flash on startup and pointing a function pointer at the SRAM. We got the function address OK, but the length was (IIRC) fixed in code. Any tips on copying an entire function at run time using GCC? Or on how to get the length argument in this call:

memcpy(sram_loc, try_smlal, try_smlal_length);

--
Stef    (remove caps, dashes and .invalid from e-mail address to reply by mail)
Reply to
Stef

larwe schrieb:

Thanks for the note - I'm aware of that. However, in this application there is neither much data nor much code. It's just a task that needs a certain amount of math, and I will have to trade power consumption against calculation time... I also tend towards using ARM, but I would like to see some figures as well.

Thanks, Tilmann

--
http://www.autometer.de - Elektronik nach Maß.
Reply to
Tilmann Reh

Jim Granville schrieb:

Thanks for the link - it will probably provide at least some general figures (will have a closer look soon).

I fear that I will need double precision floating point math, for which assembler won't be much better than the RTL of a common compiler, I assume. (I'm well used to programming in assembler, so that wouldn't hurt if it really made sense.)

Too much power consumption for this application, I think.

Tilmann

--
http://www.autometer.de - Elektronik nach Maß.
Reply to
Tilmann Reh

[snip]

Well let me see what I do...

I define this in a header:

    #define IRAM_CODE __attribute__((long_call,section(".icode")))

then

IRAM_CODE void foo(void) { ... }

The linker script puts the section .icode into flash just like initialized data, something like this:

    __icode_rom__ = ADDR(.gcc_except_table) + SIZEOF(.gcc_except_table);
    .icode : AT(__icode_rom__)
    {
        __icode_start__ = . ;
        *(.icode);
        *(.idata);
        . = ALIGN(4);
    } > iram
    __data_rom__ = __icode_rom__ + SIZEOF(.icode);

Then the crt0.s init code copies it out, something like this:

    /* Copy data from ICODE to IRAM */
        ldr   r2, =__icode_start__
        ldr   r3, =__icode_rom__
        ldr   r4, =__data_rom__
        b     2f
    1:  ldmia r3!, {r0,r1}
        stmia r2!, {r0,r1}
    2:  cmp   r3, r4
        blt   1b

Note that this assumes the code will stay permanently in RAM rather than being overlaid and loaded dynamically. A more dynamic version could be done by having multiple sections and then memcpy()ing the one you're interested in.
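
For the memcpy() variant, a minimal C sketch using the same linker-script symbols could look like this (it also answers the length question earlier in the thread: the size is just the difference between the two flash symbols):

    #include <string.h>

    extern char __icode_start__[];  /* run address in internal RAM */
    extern char __icode_rom__[];    /* load address in flash       */
    extern char __data_rom__[];     /* end of the .icode image     */

    void copy_icode_to_iram(void)
    {
        memcpy(__icode_start__, __icode_rom__,
               (size_t)(__data_rom__ - __icode_rom__));
    }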

A second note: GCC has a problem if you call an IRAM_CODE function from a non-IRAM_CODE function *in* *the* *same* *file* (it seems to lose the long_call attribute and uses a relative call that is typically out of range). So the best idea is to put the IRAM_CODE functions in a separate file.

hope that helps. Peter

Reply to
Peter Dickerson

If you are considering the MSP430 based on power consumption, be aware that the ARM parts are not hugely different once the clock rate is taken into account. I don't have good numbers for the MSP430, but it appears to be around 350 uA at 1 MHz. I don't know exactly how that varies with clock rate, but I'll assume it scales linearly from a zero offset. The Atmel SAM7S parts are pretty much linear with nearly no offset other than the bias for the internal LDO; the slope is about 650 uA per MHz. So between the MSP430 and the SAM7S it is roughly a 2 to 1 difference in current. I can't say whether the processing power of the 32-bit device makes up for any of this or not.
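
A back-of-envelope way to compare energy per task from those figures (a sketch only; it assumes linear scaling with clock and ignores supply-voltage differences):

    /* uA per MHz is numerically the same as pC per clock cycle. */
    static const double msp_pC_per_cycle = 350.0;  /* MSP430: 350 uA/MHz */
    static const double arm_pC_per_cycle = 650.0;  /* SAM7S:  650 uA/MHz */

    /* For a given task the ARM draws less total charge whenever its
       cycle count is below 350/650, i.e. about 54%, of the MSP430's
       cycle count (supply voltages assumed equal). */
    static double breakeven_cycle_ratio(void)
    {
        return msp_pC_per_cycle / arm_pC_per_cycle;
    }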

I have several eval boards from Atmel and Philips and would like to run some benchmarks to see how the power and speed compare. If anyone would like to provide test code, I would be willing to run it in the next few weeks and make the results public.

Reply to
rickman

None of these support floating point in hardware, so it depends on the libraries you use. On ARM there exist highly optimised FP libraries; the one I wrote takes about 25 cycles for fadd/fsub, 40 for fmul and 70 for fdiv. Double precision takes almost twice as long. You would get 500 KFlops quite easily on a 50 MHz ARM7TDMI. Of course this is highly compiler/library specific; many libraries are much slower, and 5-6x slower for an unoptimised implementation is fairly typical.
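
The 500 KFlops figure follows from the cycle counts (a rough check, assuming roughly 100 cycles per FP operation on average once loads, stores and call overhead are included):

    /* 50 MHz ARM7TDMI at roughly 100 cycles per FP operation: */
    static const double flops = 50e6 / 100.0;   /* = 500e3, ~500 KFlops */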

Doing floating point on the MSP, especially double precision, seems like a bad idea...

Wilco

Reply to
Wilco Dijkstra

Sounds like a badly written library. If the instruction was available the library should have used it in the first place. Even so, making the shift&subtract variant more than 10x slower requires you to really work hard to make it as slow as possible...

Later versions of ADS supported inlined S/UMULL; U/SMLAL was added in RVCT, IIRC.

Wilco

Reply to
Wilco Dijkstra

Wilco Dijkstra schrieb:

I was (maybe erroneously) assuming that the RTLs of common compiler packages have about equal performance...

Thanks, this is at least a rough figure I can start with.

Not all the math needs double precision - and hey, we've done DP floating point math on a Z80 as well. :-) I know that it will be much more work for the MSP than for an ARM. But looking at the overall application, it seems reasonable to me to also take the MSP into consideration.

Tilmann

--
http://www.autometer.de - Elektronik nach Maß.
Reply to
Tilmann Reh

[something far fancier :-)]

Hey, that does all the work, we did it all by hand (memcpy, function pointer..). I have saved your article and will refer to it next time I need something like this, thanks.

-- Stef (remove caps, dashes and .invalid from e-mail address to reply by mail)

Reply to
Stef

rickman schrieb:

Of course this is true. Even when a given set of calculations has to be done, the consumed /energy/ may be about the same (ARM faster with more current, MSP with lower current but taking longer) - however, it's not /only/ math that has to be done here. The overall current consumption, especially at those times when there's no math to do, is also relevant. To me it seems that these aspects are easier to take care of with the MSP, and that's why I am also considering it.

That sounds interesting. But as Wilco mentioned, math performance can be expected to depend on the libraries used - so you'd have to take care with those. Also, I can't provide test code yet. For the time being, I will look at the benchmarks that Jim pointed to and consider the numbers given by Wilco (though I am really interested in how long a logarithm takes in a "good" [tm] library... :-) ).

Tilmann

--
http://www.autometer.de - Elektronik nach Maß.
Reply to
Tilmann Reh

Cycles

MSP430, 32-bit floats, ImageCraft compiler, typical cycles:
  add 158   sub 184   mul 332   div 620

ARM, Keil compiler, 32-bit floats, typical cycles:
  add 53    sub 53    mul 48    div 224   sqrt 439    log 435

ARM, GNU compiler, 32-bit floats, typical cycles:
  add 472   sub 478   mul 439   div 652   sqrt 2387   log 13,523

8051, Keil compiler, 32-bit floats, typical cycles:
  add 199   sub 201   mul 219   div 895   sqrt 1117   log 2006

Max cycles can be up to 2x typical.

Reply to
steve

This must be with an old GCC. In GCC 3.4, the generic floating-point code was rewritten in ARM assembler.

formatting link

Clocks for gcc-3.3.1, clocks for gcc-3.4.3, speedup (32-bit float):

__addsf3:  514   73   7.0x
__subsf3:  511   74   6.9x
__mulsf3:  428   49   8.7x
__divsf3:  634  142   4.5x

Some further speedup should have happened in GCC 4.0.

Karl Olsen

Reply to
Karl Olsen

formatting link

GNUARM is up to 4.1.1.

Reply to
rickman

formatting link

Yes, it was version 3.3.1 - nice speedup in 3.4!

Reply to
steve

steve schrieb:

[some cycle data]

Thanks very much - that's perfect. (Including thanks to Karl for the update on GCC cycles.)

Tilmann

--
http://www.autometer.de - Elektronik nach Maß.
Reply to
Tilmann Reh
