Math computing time statistics for ARM7TDMI and MSP430

Here are some AVR figures: IAR, full optimization for speed, run in the AVR Studio simulator. Obviously, you can't compare exactly without using the same source. Figures were measured in a subroutine after values had been loaded into registers. I only tested one set of data, so I can't say whether it is typical or not.

add  =  173 cycles
sub  =  176 cycles
mul  =  175 cycles
div  =  694 cycles
sqrt = 2586 cycles
log  = 3255 cycles

It looks like the MSP430 and the AVR are about the same speed at the same clock frequency. The IAR sqrt/log libraries seem a little bit on the slow side.

--
Best Regards,
Ulf Samuelsson
ulf@a-t-m-e-l.com
This message is intended to be my own personal view and it
may or may not be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson

Thanks, that's some more great data. All this should be put on a web page somewhere. I always thought floating point subroutines were a good test of a processor.

Reply to
steve

Or maybe of library writers. Of course, a performance test should probably include a correctness test, so log doesn't cheat and always return 1.0.

Robert

Reply to
Robert Adsett

Back in 2004, I wanted to play with writing a floating point routine for the MSP430. It accepts IEEE format 32-bit floats. The 32-bit by 32-bit with 32-bit result floating point divide takes roughly 400-435 cycles on the MSP430. This is substantially less than the 620 cycles mentioned above.
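For reference, the front end of such a routine -- unpacking the IEEE single format Jon describes -- might look like this in C. This is a sketch of the format handling only (the struct and names are mine), not Jon's actual MSP430 assembly:

```c
#include <stdint.h>
#include <string.h>

/* Sketch: unpack an IEEE 754 single -- 1 sign bit, 8-bit biased
   exponent, 23-bit fraction with a hidden leading 1 -- into the
   fields a software divide works on. */
typedef struct { uint32_t sign; int32_t exp; uint32_t mant; } unpacked_t;

static unpacked_t unpack(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);                 /* type-pun safely  */
    unpacked_t u;
    u.sign = bits >> 31;
    u.exp  = (int32_t)((bits >> 23) & 0xFF) - 127;  /* remove bias      */
    u.mant = (bits & 0x007FFFFF) | 0x00800000;      /* restore hidden 1 */
    return u;
}
```

For a divide, the result sign is the XOR of the two sign fields and the result exponent is the difference of the two unbiased exponents; the mantissa division itself is where the cycles go.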

(I think a lot has to do with how much time various compiler folks decide to invest in their libraries. It can be a serious time sink for a compiler vendor to optimize them for a single processor. In my case, I only invested time in writing one routine, the 32fp divide, so it was fun. I didn't have to produce all of the routines with the various combinations of data types which would probably have turned it into 'real work.')

Jon

Reply to
Jonathan Kirwan

Probably because you didn't support all the IEEE 754 exception/rounding modes that compilers support.

Reply to
steve

Well, if library writers have a tough time writing a fast floating point algorithm for a specific processor, I probably will too!

Reply to
steve

Steve, do you _know for certain_ that the library tested above from ImageCraft does support all of them? It's been my own experience that floating point libraries don't completely support all types and exceptions. Are you sure this is the case here?

In the example I was testing out, I was examining just one compiler library routine to mimic its behavior. I think I captured all the elements there, but it's probable that the compiler itself has advanced in the two intervening years, and it wasn't ImageCraft's anyway, so your point may remain a good one to keep in mind.

I believe I wouldn't need another 200 cycles, though, to achieve the extra work done in compiler libraries. I'd be very interested in finishing it up, though, so as to exactly match the features of the ImageCraft routine you tested with, if provided with a complete implementation of their 32-bit fp divide for the MSP430 so that I could personally guarantee that I've met the goal. Not that this would prove much, except that more time given to informed effort is better than less time. Still, I'd do it for the fun of trying.

Jon

Reply to
Jonathan Kirwan

I was suggesting that library performance may rely as heavily on the writer of the library as it does on the micro. All things being equal it may reveal micro performance. Seldom are all things equal.

Robert

Reply to
Robert Adsett

No, I am not certain. ImageCraft claims IEEE floating point, which means it should be compatible with IEEE 754 and run identically to IEEE 754-compatible FPUs.

Maybe your MSP430 had the HW multiply?

Reply to
steve

There are myriad floating point formats in the world, some of which may be easier to implement with integer-only hardware. Some that use the IEEE-754 bit layout for sign, exponent and mantissa might not support denormalized (extremely small) values, might not handle NaNs, etc.
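Paul's distinction can be made concrete: a library that only borrows the IEEE-754 bit layout may skip the special encodings entirely. A sketch of the cases a fully conforming implementation must distinguish, classified by the exponent field (the enum names are mine):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: classify an IEEE 754 single by its exponent field. A library
   that only borrows the bit layout may treat the DENORMAL/INF/NAN
   cases as ordinary numbers, or not at all. */
enum fclass { FC_ZERO, FC_DENORMAL, FC_NORMAL, FC_INF, FC_NAN };

static enum fclass classify(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint32_t e = (bits >> 23) & 0xFF;   /* biased exponent field */
    uint32_t m = bits & 0x007FFFFF;     /* fraction field        */
    if (e == 0)    return m ? FC_DENORMAL : FC_ZERO;  /* no hidden 1 */
    if (e == 0xFF) return m ? FC_NAN : FC_INF;        /* specials    */
    return FC_NORMAL;
}
```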

Starting with C99, you have to implement the IEEE/IEC floating point formats to the letter if the compiler defines __STDC_IEC_559__.
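The check is a simple predefined-macro test (the standard C99 refers to is IEC 60559, i.e. IEEE 754); a small sketch:

```c
/* C99: a compiler claiming full IEC 60559 (IEEE 754) conformance
   defines __STDC_IEC_559__; code can branch on it at compile time.
   Whether it is defined depends entirely on the implementation. */
static int iec559_claimed(void) {
#ifdef __STDC_IEC_559__
    return 1;   /* formats and semantics follow Annex F to the letter */
#else
    return 0;   /* no conformance claimed; formats may deviate        */
#endif
}
```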

Paul

Reply to
Paul Keinanen

Well, this is the crux of your earlier point to me, isn't it? Can you find out what exactly it _does_ do? This could completely break your point.

It wouldn't matter if it did. I wrote the assembly code myself and I didn't use HW multiplies to aid the generalized input/output floating point division routine -- and I'm not entirely sure just now how I might. Can you suggest a reason why this question may be germane?

Jon

Reply to
Jonathan Kirwan

Jonathan, the guy who wrote the code says it takes ~550 cycles on average. Does your stuff do the guard bit, etc.? The code is MSP430-specific and not stale code from other CPUs...

Reply to
Richard

Yes. I keep more bits for rounding, if that's what you mean. I suppose I can send the code to you, if you are interested in playing with it -- I've no proprietary interest in it. It handles the standard floating point codes found on the IBM PC: sign, signed exponent, and hidden-bit notation. It does NOT handle denormals (not hard to add) or special codes such as infinities or not-a-numbers. Rounding follows only the usual method; I didn't access a static status bit of any kind to control rounding.

When I last played with this in April 2004, I shaved about 100 cycles off of a compiler's version, mostly, I think, because the compiler's library code used a division loop with a count of 32 when it only really needed 24 for a 48/24 divide producing a 24r24 result (the remainder is used for the rounding). The central division part of the code, the part that actually takes the unpacked values and divides them, takes from 264 to 312 cycles, with 302 being typical.
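In C, the shape of the divide step Jon describes might look like the following. This is a sketch of the idea, not his MSP430 assembly: a restoring shift-subtract loop over normalized 24-bit mantissas, one iteration per quotient bit, keeping the exact remainder for rounding. The struct and names are mine.

```c
#include <stdint.h>

/* Sketch: divide normalized 24-bit mantissas (2^23 <= n, d < 2^24),
   producing a normalized 24-bit quotient plus the exact remainder
   ("24r24"). One restoring shift-subtract step per quotient bit. */
typedef struct { uint32_t quot; uint32_t rem; int expadj; } divres_t;

static divres_t mant_div24(uint32_t n, uint32_t d) {
    divres_t out = { 0, 0, 0 };
    uint32_t q = 0, r = n;
    int steps = 24;
    if (n >= d) {            /* quotient would need 25 bits: peel off */
        q = 1;               /* the leading 1 and bump the exponent   */
        r = n - d;
        out.expadj = 1;
        steps = 23;
    }
    for (int i = 0; i < steps; i++) {
        r <<= 1;             /* invariant: r < d before each shift */
        q <<= 1;
        if (r >= d) { r -= d; q |= 1; }
    }
    out.quot = q;            /* floor(n * 2^steps / d), 24 bits    */
    out.rem  = r;            /* exact remainder, drives rounding   */
    return out;
}
```

The loop count of 24 (or 23 after peeling the leading bit) rather than 32 is exactly the saving Jon describes over the compiler's generic loop.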

Jon

Reply to
Jonathan Kirwan

I should add that I can shave still more time off of this. There are at least two reasons that come to mind:

(1) I had just been playing with a fast division method I'd devised, and was mostly playing with that idea back in 2004 without really thinking about how it would fit into a floating point routine designed entirely for speed. In my general purpose divide routine, I produce far more bits than I actually need for the result. I produce a perfect remainder, which allows an exact determination of rounding. However, an exact remainder isn't needed, as it contains more information than rounding strictly requires. If I modified the division routine so it didn't have to produce an accurate remainder, the loops could take fewer cycles -- multiplied by 24, this can account for some real time.
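The rounding decision Jon alludes to needs only a comparison, not the remainder's full precision. Given an exact remainder r for a division by d with quotient q, round-to-nearest-even can be sketched like this (my own rendering of the idea):

```c
#include <stdint.h>

/* Sketch: round-to-nearest-even given quotient q and exact remainder r
   of a division by d. The discarded fraction is r/d, so compare 2*r
   against d: above half rounds up, exactly half rounds to even. */
static uint32_t round_nearest_even(uint32_t q, uint32_t r, uint32_t d) {
    uint64_t twice = (uint64_t)r * 2;   /* widen to avoid overflow */
    if (twice > d || (twice == d && (q & 1)))
        q++;                            /* round up, or break tie to even */
    return q;
}
```

Since only the three-way comparison of 2*r against d matters, a routine that tracks just "below half / exactly half / above half" can skip producing the exact remainder, which is Jon's point (1).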

(2) I did NOT use some of my non-restoring methods (what amounts to loop unrolling, of sorts) for division, which would yield about an 18% savings over the code I did apply back then. The reason I didn't is that I was focused on just 'getting it right' and not so much on adding longer stretches of unwound code -- easier to debug that way, if need be.

So even the number I originally posted is not by any stretch the best I can do here, on the MSP430 -- if seriously bent to the task. I'd bet I could get perilously close to 300 cycles for the entire thing, even including denormals. Not that anyone would care that much. But it's probably doable.

Jon

Reply to
Jonathan Kirwan

Ulf, is that figure for div on the IAR compiler right??? On the ARM7/Thumb parts I've looked at before, there was no direct integer DIV instruction, unlike the case with multiply, so it makes sense that division takes a while longer than multiplication does. But this long?? Since Steve was writing about 32-bit floats, I assume that is what you were doing too, but is it possible that they were promoted to doubles?

On the MSP430, I was stuck doing pairs of registers/instructions to handle shifts (a triplet in one step of the loop), and this has _got_ to be better on a 32-bit register chip. Also, the MSP430 can't get much of anything done in a single cycle. Luckily, the core division can be done in registers with single-cycle operations, but there are conditional jumps in there.

Something sounds wrong to get cycle counts like that. I believe you, it just bugs me. The 32-bit register advantage would have pulled another 48 cycles (24*2) off of the computation in the MSP430, for sure, and in cases where a restore was needed, another 24 cycles -- so the mean (average) would be above 48 and below 72 cycles pulled off -- probably right in the center of that at 60 cycles. That puts it at a projected 340 cycles on the ARM7 right now and I already have ways coded up and fully tested to improve that another 60 cycles, anyway. Which would bring such a thing to the area of 280 cycles on the ARM, as a rough guess -- including overheads.

Jon

P.S. By the way, I just retested a new floating point divide routine on the MSP430, that handles division by zero by returning #INF and deals with #INF on either input parameter, and it runs in an average of 350 cycles right now. This includes the call overhead (5 cycles for the call, 3 cycles for the return.) It correctly rounds without glitch, as it produces a complete fractional remainder that is exactly known.

Reply to
Jonathan Kirwan

Ulf posted AVR 32-bit floating point cycles; the AVR has 8-bit registers...

Reply to
steve

Boy! Was my mind out of touch!! I read "AVR" and thought "ARM". Thanks for that, Steve. Sometimes, I need a kick in the head.

Jon

Reply to
Jonathan Kirwan

It is a nice compliment that people find the AVR so fast that their spine believes it is a 32 bitter.

Reply to
Ulf Samuelsson


The figures were for the 8-bit AVR, not ARM. Note also that IAR's FP libs are unoptimized C (at least on ARM; I assume the same code is used on AVR). Optimized FP libraries are usually handcrafted assembler.

280??? I do it in about 70. It uses a tiny lookup table to create an 8-bit reciprocal estimate, which is then used in 3 long division steps. This turned out to be simpler and faster than Newton-Raphson, as it only uses 32-bit multiplies (which are faster than 64-bit multiplies on most ARMs).
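As an illustration of the multiplicative idea, here is my own simplified sketch -- explicitly NOT Wilco's routine: it computes the 8-bit reciprocal at runtime instead of from a table, and uses four 8-bit steps rather than three wider ones, but the estimate-multiply-correct structure is the same:

```c
#include <stdint.h>

/* Sketch of estimate-multiply-correct division: an 8-bit reciprocal
   r ~ 2^40/D (an underestimate by construction) yields ~8 quotient
   bits per step; a short fix-up loop absorbs the estimation error.
   Computes floor(((uint64_t)n << 32) / D) for n < D, with D
   normalized (top bit set). */
static uint32_t recip_div(uint32_t n, uint32_t D) {
    uint32_t dh = D >> 24;                /* top 8 bits: 128..255     */
    uint32_t r  = (1u << 16) / (dh + 1);  /* underestimate of 2^40/D  */
    uint64_t R  = n;                      /* running remainder, R < D */
    uint32_t q  = 0;
    for (int step = 0; step < 4; step++) {
        uint64_t R8 = R << 8;                      /* next 8 result bits */
        uint32_t qe = (uint32_t)((R8 * r) >> 40);  /* never too big      */
        R = R8 - (uint64_t)qe * D;
        while (R >= D) { R -= D; qe++; }           /* bounded fix-up     */
        q = (q << 8) | qe;
    }
    return q;
}
```

A real implementation would replace the runtime divide that computes r with a 128-entry table lookup and pull more bits per step; rounding down by (dh + 1) is what guarantees the estimate never overshoots, so the fix-up loop always terminates after a few rounds.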

The result is that FP emulation on ARM is extremely fast -- in fact, hardware FP (e.g. ARM11) is only around 5-6 times faster in FP benchmarks, even though it can issue 1 FP operation per cycle...

Note that integer division and floating point division are completely different things -- a standard 3-cycles-per-bit integer division takes about 120 cycles in the worst case when unrolled (although it takes just 30 on average). When you have to produce a certain minimum number of result bits, methods that produce many result bits in a single step become faster. Proving them correct is a little more involved due to the many approximation steps, though :-)
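The loop behind those worst-case/average numbers can be sketched generically in C (real ARM code would use clz to skip the leading-zero iterations in one step, which is where the good average case comes from):

```c
#include <stdint.h>

/* Sketch: shift-subtract integer division, iterating only over as many
   bits as the quotient actually needs. Worst case 32 iterations, but
   far fewer on typical operands -- the average-case win described
   above. Requires d != 0. */
static uint32_t udiv32(uint32_t n, uint32_t d) {
    uint64_t dd = d;     /* 64-bit so normalizing can't overflow */
    int bits = 0;
    while (dd <= n && bits < 32) { dd <<= 1; bits++; }
    uint32_t q = 0;
    while (bits-- > 0) {
        dd >>= 1;
        q <<= 1;
        if (n >= dd) { n -= (uint32_t)dd; q |= 1; }
    }
    return q;
}
```

Small operands exit the normalizing loop early, so dividing, say, 100 by 7 costs only four subtract-shift steps instead of 32.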

Wilco

Reply to
Wilco Dijkstra

hehe. No, it was entirely my own idiocy. I wouldn't draw too much else from that. ;)

Jon

Reply to
Jonathan Kirwan
