Math computing time statistics for ARM7TDMI and MSP430

Yes, Steve was quick to nail me on this! Good thing, too.

That's good to hear. In fact, it's a bit of information that may help me someday.

When I get an ARM board here, I really must get a chance to play with this myself. I love simple challenges.

Interesting. Thanks.

Like I wouldn't know this... But at the core of the code I wrote is, in fact, an integer divide process. In the case of 32-bit floats, this is a 48-bit / 24-bit integer divide that produces a 24-bit quotient and a 24-bit remainder.

Do you have a code snippet to see?

As always.

Thanks, Jon

Reply to
Jonathan Kirwan

That would explain the strange ratio between mul and log.

Both in log and sqrt the exponent and mantissa are usually handled separately (sqrt may need a single bit mantissa shift to make exponent even).

To get single precision results a 3rd or 4th order polynomial is typically sufficient. Since the mantissa is (almost) normalised, the polynomial could be calculated using 32-bit (or perhaps even 24-bit) integer arithmetic. log will require one additional multiplication to convert from log2 to log10. With optimised assembler code, the log/FP-mul ratio should not be that big.
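
As a rough illustration of the scheme (my sketch, not Paul's code; the coefficients are just the truncated Taylor series of log2(1+t), accurate only near t = 0, so real single-precision results would need a properly fitted 3rd/4th order minimax polynomial as Paul describes):

#include <math.h>

/* Sketch: split into exponent and mantissa, approximate log2 of the
   normalised mantissa with a low-order polynomial, scale by log10(2). */
float log10_sketch(float x)
{
    int e;
    float m = frexpf(x, &e);          /* x = m * 2^e, m in [0.5, 1) */
    float t = 2.0f * m - 1.0f;        /* renormalise: mantissa in [1,2), t in [0,1) */
    /* log2(1+t), illustrative Taylor coefficients only */
    float log2m = t * (1.4426950f + t * (-0.7213475f + t * 0.4808983f));
    return ((e - 1) + log2m) * 0.3010300f;   /* log10(x) = log2(x) * log10(2) */
}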

Paul

Reply to
Paul Keinanen

That's pretty impressive. I would have expected checking for NaNs, Infs and denormals to take some time.

I'm using a 51.6 MHz ARM7TDMI and getting much slower results than that, but I am using Thumb from a very slow external bus. It's not a problem for me for now, but it's nice to know what I can expect if the math workload becomes an issue.

Reply to
Peter Dickerson

You don't need to do it as a wide integer division. It is basically a fixed point division, so you can use 24 / 24 bits, shifting zeroes into the numerator (it can never become wider than 24 bits). It also doesn't need any preshifting, as both values are normalized (assuming denormals are handled specially).

How many cycles does your version take per bit? If it is a lot then multiplicative methods typically win.

RSBS  tmp, den, num, LSR #N   ; trial subtract, sets carry if OK
SUBCS num, num, den, LSL #N   ; do the actual subtract if OK
ADC   quot, quot, quot        ; shift the carry into the quotient

This does a trial subtract to see whether the denominator * 2^N could fit in the numerator. To avoid overflow you need to shift the numerator right rather than the denominator left. Then conditionally do the subtract and add the carry flag into the division result. The overflow avoidance feature means you don't need to precalculate exactly how many bits need to be done, but can make a quick guess instead.

My state-of-the-art implementation checks whether there are 4 or fewer result bits and, if so, jumps to code that does 4 bits; otherwise, if there are 8 or fewer bits, it jumps to code that does 8 bits, and so on. This way common divisions are really fast: 8 bits are typically done in around 30 cycles. This kind of optimization means it takes serious hardware effort to beat a good software implementation. The Cortex-M3 hardware divider does 4 bits per cycle just to be sure :-)
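
In C, the result-width dispatch might look something like this (my sketch of the idea only, not Wilco's routine; a real version would jump to unrolled blocks instead of looping):

#include <stdint.h>

/* Sketch: count how many quotient bits are actually needed (CLZ on
   ARMv5+; ARM7TDMI would synthesise it), then run only that many
   restoring-division steps.  The shifted-right comparison mirrors the
   RSBS overflow-avoidance trick above.  den != 0 assumed. */
uint32_t udiv_sketch(uint32_t num, uint32_t den)
{
    if (num < den)
        return 0;
    int bits = __builtin_clz(den) - __builtin_clz(num) + 1;  /* quick guess */
    uint32_t quot = 0;
    for (int i = bits - 1; i >= 0; i--) {
        if ((num >> i) >= den) {       /* trial subtract, numerator shifted right */
            num -= den << i;           /* safe: den << i <= num here */
            quot |= 1u << i;
        }
    }
    return quot;                       /* 'num' now holds the remainder */
}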

Wilco

Reply to
Wilco Dijkstra

It doesn't take much time as it is pretty trivial to check for them on ARM due to the powerful instructions. However in many cases you don't need any checks. I aggressively optimise for the common path (at the cost of making uncommon cases slower) and deal with the special cases only when absolutely required. For example there is no need to deal with NaN/Inf in addition, just do the add as if they didn't exist, and if the result overflowed simply return NaN if the input was a NaN, otherwise Inf. Similarly you don't need any denormal handling.
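
A sketch of that late fix-up on the raw single-precision bit patterns (my illustration of the idea only; Wilco's actual code is ARM assembler, and the names here are mine):

#include <stdint.h>

/* Called only on the rare path where the result exponent overflowed.
   'a' and 'b' are the raw IEEE754 operand bit patterns, 'sign' the sign
   the main path computed for the overflowed result. */
static uint32_t fadd_overflow_fixup(uint32_t a, uint32_t b, uint32_t sign)
{
    int a_special = ((a >> 23) & 0xFF) == 0xFF;   /* exponent all-ones */
    int b_special = ((b >> 23) & 0xFF) == 0xFF;
    if ((a_special && (a << 9)) || (b_special && (b << 9)))
        return 0x7FC00000;                /* an input was NaN: quiet NaN */
    if (a_special && b_special && ((a ^ b) >> 31))
        return 0x7FC00000;                /* Inf + (-Inf) is also NaN */
    return (sign << 31) | 0x7F800000;     /* otherwise a signed infinity */
}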

My numbers assume ARM code running from single cycle memory, so if you're using a 16-bit bus with waitstates things will be slow... Unless you're doing many multicycle operations (multiplies, load multiple) it may be better to run at a lower frequency with no wait states. RISC pretty much assumes the memory system can provide 1 instruction per cycle.

Wilco

Reply to
Wilco Dijkstra

Yes. Which is exactly what I do. But it is exactly the same code that does a 48/24. This is territory I know like the back of my own hand. And my 48/24 code is usually _faster_ than most other people's code at this task. Period.

I know just what you are saying. But there are such things as denormals and float unpacking. And in addition, there are minor single shifts required to precheck for overflow of the integer divide. I can show you what I mean with explicit code, if you like. I'd _love_ to hear your opinion on the details.

On which CPU? It really does depend.

I was thinking more about the overall code. It's from that I can get context. But I'll pop out some ARM reference and look at this, anyway. I think I'd like to apprehend it well.

Your explanation here is why I'd like to see the larger context. I'm perfectly willing to completely expose my own hand. Are you?

I think this makes some sense. I did NOT add this to my own code, but would in cases where I could demonstrate the value. In cases where I unroll the loop once (for example, using a loop count of 12 instead of 24 in the 48/24 case), this may pose some interesting alternatives to consider. But it's a good point to make.

I'm interested. Let me know if you are willing to expose your code. I will return that offer in kind.

Jon

Reply to
Jonathan Kirwan

It should be less code than 48/24, as 24 of those 48 bits are zeroes. There is a trick where you can reuse the quotient register to hold the extra numerator bits and shift them into the numerator at the same time as you do the quotient update. You were talking about a 16-bit (or was it 8-bit?) CPU, so I believe the extra 24 bits are not for free. But maybe you could show me what you mean.
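
Something like this in C, if I read the trick correctly (my reconstruction, not Wilco's code):

#include <stdint.h>

/* The low half of the numerator and the growing quotient share one
   register: each step shifts the pair left one bit, feeding a numerator
   bit up into the working remainder while the new quotient bit drops
   into the slot just freed at the bottom.  numerator = hi:lo, 24 bits
   each, with hi < den so the quotient fits in 24 bits.  For the float
   case, lo starts as zero, which is why the 24/24 version needs no
   extra storage at all. */
void div48by24(uint32_t hi, uint32_t lo, uint32_t den,
               uint32_t *quot, uint32_t *rem)
{
    for (int i = 0; i < 24; i++) {
        hi = (hi << 1) | (lo >> 23);       /* next numerator bit moves up...   */
        lo = (lo << 1) & 0x00FFFFFF;       /* ...freeing bit 0 of the low half */
        if (hi >= den) {                   /* restoring division step          */
            hi -= den;
            lo |= 1;                       /* quotient bit into the freed slot */
        }
    }
    *quot = lo;                            /* 24-bit quotient  */
    *rem  = hi;                            /* 24-bit remainder */
}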

If you can, that would be great. Denormals would be treated specially for division. In most embedded code denormals are flushed to zero, but for full IEEE754 you can normalise them and then jump into the main code. That works really well for div and mul, as a result there are far fewer special cases to worry about.
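
The pre-normalisation might look like this in C (a sketch with my own names, assuming the usual single-precision field layout and a nonzero mantissa):

#include <stdint.h>

/* Turn a denormal's raw 23-bit mantissa into a normalised 24-bit one
   (bit 23 set) plus an adjusted exponent, so the ordinary mul/div path
   can be entered as if the input had been normal all along. */
static uint32_t normalise_denormal(uint32_t mant, int *exp)
{
    int s = __builtin_clz(mant) - 8;   /* shifts needed to bring the top bit to bit 23 */
    *exp = 1 - s;                      /* exponent goes below the normal minimum */
    return mant << s;
}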

You can get rid of the extra overflow check if you do the check in a similar way as I do.

On the CPU you were talking about, the MSP430 I thought.

I don't have the source code - see below. I can explain the tricks but nothing more than that. There isn't any point in recreating the code just for this discussion.

The principle works just as well if you use a loop. The published version described below uses an 8-way unrolled loop and obviously avoids doing unnecessary iterations. It also jumps into the middle of the loop to avoid doing extra work.

I'm afraid copyright of the code is with my previous employer. However, an older version of my division was published in "ARM System Developer's Guide - Designing and Optimizing System Software". They even mention me. You can get an evaluation version (RealView or Keil) from

formatting link
These tools include the compiler I worked on for 10 years and all my highly optimised library routines, including memcpy/memmove, strcmp, FP libraries etc.

Wilco

Reply to
Wilco Dijkstra

I will send a version to your email address.

I'd like to see how that works. Perhaps, if you look over my example code you can point out where/how this might be done. I'd be very interested. (No, not because of a professional interest. My last full FP library writing took place 30 years ago and it's not my business. But I do have a hobbyist interest, of course.)

Ah. Execution time of the entire code is 340-350 cycles, or so. The core division routine, the one that does the central work here, takes about 240-250 cycles of that. So the division itself is about 10 MSP430 cycles per bit. I'd very much enjoy seeing a better version to learn from.

Well, then I probably won't spend a lot of time thinking about it right now.

Thanks. I'll see if I can get a look. Is it easy to gain access to? Or ... well, let me comment later when I see what you wrote there.

Things are as they are. Okay.

Could you please simplify how I might directly access the routine? I'm not interested in some long investigation/installation, traipsing around through obscurity just to get the code itself. Can you help make this any simpler? If it is published, would you be willing to just copy it into an email and send it to me? I'd appreciate that.

Jon

Reply to
Jonathan Kirwan

I should point out that this produces both a 24-bit quotient and a 24-bit remainder. Just in case that wasn't clear. I use the remainder to perform full-knowledge rounding, as the fraction is exactly known.
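
For what it's worth, with an exact remainder the round-to-nearest-even decision reduces to a couple of compares (a sketch, using the quotient/remainder/divisor names from the division above):

#include <stdint.h>

/* Exact rounding from the remainder: the discarded fraction is rem/den,
   so compare 2*rem against den to see if it is below, at, or above 1/2. */
static uint32_t round_nearest_even(uint32_t quot, uint32_t rem, uint32_t den)
{
    uint32_t twice = rem << 1;              /* no overflow: rem < den < 2^24 */
    if (twice > den || (twice == den && (quot & 1)))
        quot++;                             /* round up; ties go to even */
    return quot;
}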

Jon

Reply to
Jonathan Kirwan

I'll do that. There are various techniques to do this.

I'll have a look at the inner loop and see whether it can be improved. At 10 cycles per bit multiplicative methods should be faster. On ARM I went from 2 cycles per bit for fixed point division to 1 cycle per bit for long division. If you can get a decent 8x24-bit multiply on the MSP it should be possible to get it down to 150 cycles.
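
For the curious, a multiplicative method along those lines might look like this (purely my sketch, with the textbook 48/17 - (32/17)*d seed; Wilco's production code would differ):

#include <stdint.h>

/* Divisor mantissa d24 is Q0.24 with bit 23 set, i.e. d = d24/2^24 in
   [0.5, 1).  The seed has max error 1/17; each Newton-Raphson step
   r' = r*(2 - d*r) roughly doubles the good bits, so three steps are
   plenty for 24.  64-bit intermediates are used for clarity; a real
   MSP430 version would build them from 16-bit multiplies. */
static uint32_t recip_q30(uint32_t d24)
{
    uint64_t d = (uint64_t)d24 << 6;                       /* widen to Q0.30 */
    uint64_t r = 0xB4B4B4B5u - ((0x78787878u * d) >> 30);  /* seed in Q2.30  */
    for (int i = 0; i < 3; i++) {
        uint64_t e = (1ULL << 31) - ((d * r) >> 30);       /* 2 - d*r, Q2.30 */
        r = (r * e) >> 30;                                 /* refine         */
    }
    return (uint32_t)r;   /* ~1/d in Q2.30; multiply by the other mantissa,
                             then fix the last bit with a remainder check  */
}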

I've sent you some disassembly. The source code, even if published, is still under copyright. Luckily you can disassemble anything you like to learn from it. I did that a lot with competitors' compilers and libraries when I was doing compiler work.

Wilco

Reply to
Wilco Dijkstra

Thanks. It is on its way to you.

Only _some_ of the MSP430s include a multiplier, so you can't count on it. And there is a penalty. You write to peripheral locations, and the sequence must not be interrupted. There is no way to observe and restore the state of it. So if you use it, you need to protect against pre-emption by another process or routine that may also attempt to use it. Not exactly a good thing.
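
For illustration, guarded use of the peripheral looks roughly like this (register and intrinsic names as in TI's msp430.h headers for parts that do have the multiplier; treat it as a sketch):

#include <msp430.h>

/* The multiplier is a memory-mapped peripheral with state that cannot
   be saved and restored, so the write sequence must not be pre-empted
   by anything else that might also use it. */
unsigned long mul16x16(unsigned a, unsigned b)
{
    unsigned short sr = __get_SR_register();   /* remember interrupt state */
    __disable_interrupt();
    MPY = a;                                   /* unsigned multiply mode */
    OP2 = b;                                   /* writing OP2 starts it  */
    unsigned long result = ((unsigned long)RESHI << 16) | RESLO;
    if (sr & GIE)
        __enable_interrupt();                  /* restore GIE if it was set */
    return result;
}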

Thanks, I'll see about looking it over when I've a moment to do so. I've two things to do, then. One is perusing the ARM docs at the same time as reading the code. Will be fun. But anything for a chance to learn something new. ;)

Jon

Reply to
Jonathan Kirwan

???? That doesn't make any sense. A 1-cycle hardware FP operation is only 5-6 times faster than a 70-280 cycle emulated FP operation? I suspect those benchmarks had a few math operations surrounded by tons of logic. Run an FFT on an ARM11 or ARM9 (with hardware FP) and the same floating-point FFT on an ARM7 and look at the timing differences: probably 100 to 1.

steve

Reply to
steve

It's 70 cycles max, and that's for division. Operations like addition and multiplication are much faster. Float addition can finish in as little as 21 cycles, and special cases such as x+0 are even quicker.

If there is a lot of integer logic, there is no speedup at all. I'm talking about pure floating point code, including FFTs. Such code has a large number of load and store operations. Additionally there are dependencies between FP operations resulting in interlocks (FP ops have high latencies).

With no loads/stores and no interlocks, the maximum theoretical speedup is 25x for fadd/fsub, 50x for fmac and 5x for division. Actual speedups are 5-6x on ARM11. Amdahl's law applies...
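
(To put rough numbers on that, purely as an illustration: if 80% of the cycles are FP operations sped up 25x and the remaining 20% are loads, stores and integer work, the overall speedup is 1/(0.2 + 0.8/25), which is about 4.3x. That is already in 5-6x territory.)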

Wilco

Reply to
Wilco Dijkstra

Thanks to Wilco's kicking my butt about a couple of issues in my code, the code I've written up is now faster on the MSP430 -- a mean of 280 cycles (with no material variation) for 32-bit float division. It supports INF for divide by zero and INF propagation but lacks denormal support.

However, I just did up a short project in TI/IAR's kickstart, and the _Div32f function called there appears to produce its results in a mean of about 362 cycles, not the 620 mentioned by Steve. To compute that, I did 81 float divisions (which reaches the 4k limit) and removed the register load overhead so that only the call itself is counted.

This made me curious, and it turns out that the library there uses a restoring division, but one with a tight inner loop that skips the trial subtraction unless there is a reason to do one (a leading bit in the remainder; the divisor is normalized, so its leading bit is known to be 1). It also doesn't use a loop counter, instead watching for a '1' to be shifted out of the quotient (which must occur in almost every case; the rare other cases are tested for separately). This brings their cycle count in the low-level division part of the code down to 8 cycles per bit while streams of quotient '0's are being produced, but more when '1's are being produced, which is what brings up their average. The code I use in the low-level division takes almost exactly 7 cycles per bit (just under) and is consistent regardless of whether a '1' or a '0' is being produced in the quotient.
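
The counter-free termination can be sketched like this in C (my reconstruction of the trick as I read it, not IAR/TI's code; operands are normalized 24-bit mantissas with num < den arranged beforehand):

#include <stdint.h>

/* Seed the quotient register with a lone marker bit; after exactly 24
   shift-and-insert steps the marker reaches the top and ends the loop,
   so no separate loop counter is needed.  In assembler the test is
   simply the carry out of the quotient shift. */
static uint32_t div24_marker(uint32_t num, uint32_t den)
{
    uint32_t quot = 1u << 8;            /* marker: 24 shifts away from falling out */
    for (;;) {
        num <<= 1;                      /* fractional restoring division */
        uint32_t bit = (num >= den);
        if (bit)
            num -= den;
        int done = (quot >> 31);        /* marker about to shift out? */
        quot = (quot << 1) | bit;
        if (done)
            return quot & 0x00FFFFFF;   /* low 24 bits are the quotient */
    }
}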

There are some interesting approaches to know about.

Jon

Reply to
Jonathan Kirwan

Sounds clever.

Did you see the notes here ? :

formatting link
formatting link

These are for the MAXQ, which claims a 16x16 multiply, but the maths numbers seem broadly similar to the MSP430 and AVR (within a speed-grade iteration).

Maxim's MHz and mA/MHz numbers are also all over the place, almost as if different teams are working on these devices?

I think they give sources, so you could compare your Div32f with the ones at Maxim's?

-jg

Reply to
Jim Granville

I enjoy these things for reasons I'm not entirely certain of. ;)

Perhaps some earlier versions. I read so much, and I remember going through a few things similar to this on the MAXQ a year or so ago?

I did a _quick_ look to see if it was easy to find the sources, and didn't see them easily.

But the first page you mention above describes TI's approach of trying to focus on chip performance and to ignore C compiler issues as a "flawed" approach. Frankly, that's what I'm interested in because I usually mix assembly with C in applications I write and therefore it _IS_ valuable to me to know what the chip can do regardless of the C compiler. When I need to drop into assembly, that's what I want to know about. And if a C compiler isn't entirely all that great for a chip, well that only means I may use more assembly than I otherwise might. But probably not that much more. So I really find that kind of argument ... not quite disingenuous, because I can see that it has a valid facet to it ... but not useful for my case, at least.

So I'm wondering if the benchmarks are going to be in C, and whether they disclose the actual assembly at all. It may be that they didn't care to, given their argument against assembly.

If you know different, let me know and I'll go look.

In any case, the MAXQ isn't getting designed into anything I care about right now.

Jon

Reply to
Jonathan Kirwan

Bottom of this page, there are 2 Compiler MSP430 examples + Sources.

formatting link

-jg

Reply to
Jim Granville

Yeah. I think it's just C code. They depend on the libraries. I've already looked at three of them, from three different C compiler vendors for the MSP430. So nothing new to be found there. For the MAXQ, I'd just have to go to the compilers and see if I could dig out their source code for the library routines (by debugger, if not freely available in source.) But I'm not going to take the time for that. I also note that they wrap their floating point with function calls to functions that just do the one operation and return values. So this adds another layer to the benchmark. On the MSP430, function calls alone (without inlining) cost you 8 cycles (5 in, 3 out.)

Did you imagine I'd learn any new detailed techniques from this?

/*******************************************************************************
*
*  Name    : Floating-point Math
*  Purpose : Benchmark floating-point math functions.
*
*******************************************************************************/

float add(float a, float b) { return (a + b); }
float mul(float a, float b) { return (a * b); }
float div(float a, float b) { return (a / b); }

void main(void)
{
    volatile float result[4];

    result[0] = 54.567;
    result[1] = 14346.67;
    result[2] = add(result[0], result[1]);
    result[1] = mul(result[0], result[2]);
    result[3] = div(result[1], result[2]);
    return;
}

Jon

Reply to
Jonathan Kirwan

:) - No, but I did glance into the MAP files, and they seem to identify the sizes of the ADD, MUL, DIV, SUB blocks; those numbers I figured you would be interested in. Plus you can clone the code above to call your libs and then compare the cycles and bytes.

-jg

Reply to
Jim Granville

Nope. Not really. All I care about is learning. And I can't do much of that from reading byte counts.

Yeah, like that would be useful -- not. ;) I can write that kind of stuff stone dead, let alone asleep. And the whole idea of loading down a series of toolsets for a processor I won't be using isn't something I want to engage in. If the code is there, I'd look at it. Otherwise, I need to wait for motivation.

Jon

Reply to
Jonathan Kirwan
