Hello everyone, I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52 is an ARM Cortex-M4. My development system is a Linux box running GCC. My embe dded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, as recommended by N ordic. I have written a fair amount of code for this device, but now I have to exp lore a bit deeper. I have an algorithm that may require me to calculate a lot of square roots, and quickly. However, the numbers in question are all small integers. Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get involved in this square root calculation. Optimized algorithms for various math fu nctions exist. A nice discussion of the optimization of the integer square root algorithm, written by Dr. Jack Crenshaw, is found here: ger-Square-Roots I am interested in understanding exactly what code will be compiled if I #i nclude in my program, and then call sqrt() with an int32 argument. It stands to reason that optimized code, like that shown in the blog abov e, is what should be used. How can I determine this? I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for clues. T he only math-related file that I can find there is called tgmath.h, and it just #defines aliases which point back to math.h. I am looking for source code. Thanks for any advice you can share.

sqrt() takes a double-precision floating point number, so your int32 will be promoted to double before sqrt is called, and you'll get a double back.

How would my SoC calculate an integer square root?

D

David Brown 9 years ago

Integer maths is optimised with "-O1" or higher. It does not need separate flags. (There are other flags that can affect the way integer overflow is handled, but the default is to follow the C standards, and that is usually what gives the fastest results anyway.)

A 10000 number lookup table will give you the result in a single operation. Do you have 20K of flash (or ram) space to spare? I don't know your particular chip, but many Cortex devices actually have quite a lot of flash space, and while 20K for a square root table might seem excessive, it could work. And it would be the fastest possible method.

You can also do smart things with a smaller table - say, a 157 entry table for the square roots of (x >> 6), shifted left by 3 bits. Your algorithm is then to take x, shift right 6 bits, look it up in the table as y. The integer square root of x is then between y and y + 7. A simple algorithm to check these would take 8 checks, each of which is simpler than binary search rounds. (And the table takes less flash.) A smarter algorithm could probably jump to the correct y faster, based on (x >> 6) and (x & 63).

That would still be slower than using VSQRT, but is a possibility if you want to avoid floating point.

Vote

U

upsidedown 9 years ago

The time used to discuss what compiler flags to use could have better used coding such simple function in assembler :-).

Vote

U

upsidedown 9 years ago

If those cycles are the same as the actual clock oscillator cycles, the 32 MHz oscillator would allow more than 2 million VSQRT instructions each second.

The OP estimated a thousand square roots each second, which would consume less than 0.1 % of the clock cycles. No big deal and not a reason for optimization.

Not following the IEEE usually means it doesn't handle (extremely small) denormalized values, might not generate proper +/- infinity values f(for huge values) or properly handling NaN (Not a number) cases.

Floating point values are nearly always approximations anyway. Only some special cases (small integers) are exact, but you can't represent exactly values like 0.1 or 0.01.

To calculate the sqrt on a hardware with basic floating point instruction but not hardware SQRT instruction, I have used the following approach:

normalize the value between 1..4 so that the exponent is even
divide the exponent by two to get the result exponent
using a polynomical with the mantissa bits to calculate the result mantissa
4th order polynomicals give sufficient results for float and 8th order for double.
I have seen implementations (such as VAX) with 3rd order for single and 6th for double degree.

Vote

D

David Brown 9 years ago

No, it could not - it is better to learn how to use your tools (such as the compiler) properly, so that you can re-use the information to improve more code later, rather than to learn a single inconvenient work-around.

Understanding the point of -ffast-math and optimisation flags, and about float vs. double, will improve /all/ floating point code written by /everyone/ who reads this thread and learned from it. A simple inline assembly function (which is, in fact, slightly fiddly if you want to get the best from it) would merely help one person in one single situation.

Give a man a fish, and all that.

Vote

D

David Brown 9 years ago

He also made it clear that he wants to do other things during the time than just square roots, as well as wanting to minimise the time in order to reduce power consumption.

Mostly true. It is also not uncommon for rounding to be done in a different way, or to lose a LSB here or there compared to IEEE references. Usually these details don't matter. But if you compile without -ffast-math (or the relevant individual flags), the compiler will do its best to get IEEE conformance /exactly/ right - even if that means a library subroutine to handle the final details.

Polynomials like this are sometimes used. Usually the "best" choice of algorithm here depends on the relative speeds of division and multiplication. Early floating point units (such as the VAX) were notoriously slow at division, so an algorithm that uses polynomials (addition and multiplication) is preferred.

Vote

D

Dimiter_Popoff 9 years ago

So calculating the square root of a 14 bit number is fine for you; you will get a 7 bit result. Using my algorithm this will cost you 7 multiply instructions, 7 compares, 7 bset/bclr or some equivalent (e.g. shift + or sort of thing), 7 conditional branches and 7 decrement and branch opcodes.

35, say I forgot something, make that 70 cycles per square root.

70000 cycles per second out of a total of 32M cycles per second makes.... 0.22% of the computing power you have.

Does not sound like something worth all that head banging.

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI

formatting link

------------------------------------------------------

formatting link

Vote

U

upsidedown 9 years ago

I did not refer to in-line assembly in particular, but also to separate compiled assembler code. Of course, you need to know the parameter passing mechanism to separate compiled/assembled units and for functions how to return results. In most languages, parameters are passed on the stack, some push the first parameter first, while some push the last parameter last.

Understanding the parameter passing mechanism in general is a useful skill.

Vote

U

upsidedown 9 years ago

He just has 99.9 % of the total CPU cycles available for the "other" tasks.

Vote

J

John Devereux 9 years ago

It looks like you did not read to the end of his sentence. It will be consume 100% of the cycles if it only wakes up to do that one thing.

Modern MCUs can achieve very low power operation by sleeping most of the time. Power consumption can then be nearly proportional to the wake duty cycle. So going from "0.1%" to "0.2%" could be the difference between a

2 year and a 4 year life when powered by a coin cell say.

John Devereux

Vote

T

Tim Wescott 9 years ago

Integer-Square-Roots

I notice, in our arguing about what you said, that some folks are thinking you want to get an integer _out_ of your algorithm -- is this true?

I'm not sure if it has much of a bearing on things, because I suspect that using the processor's floating point math is going to be the quickest, even with conversion to and from float. But I'm curious.

Tim Wescott Wescott Design Services http://www.wescottdesign.com I'm looking for work -- see my website!

Vote

J

jladasky 9 years ago

That's correct. Rounding or even truncating is OK. These square roots are being fed to an algorithm which needs the rank order of the results to be preserved, but it's OK if the exact distribution is a little distorted.

After reading everyone's input, I'm starting to think that the code which would be produced by -ffast-math will be a pretty good start.

I am giving serious consideration to David Brown's suggestion of a 10,000 entry lookup table. I have 512K of flash, and I only need to store int16's.

Vote

L

Les Cargill 9 years ago

I'd measure the floating point version before piling off into a bespoke ( custom ) square root routine.

Les Cargill

Vote

W

Wouter van Ooijen 9 years ago

The sqrt of 10000 is only 100, so you'd need to store only unsigned chars.

Wouter "Objects? No Thanks!" van Ooijen

Vote

D

Dimiter_Popoff 9 years ago

At 512k flash this is a no brainer of course.

Dimiter

Vote

M

Mike Perkins 9 years ago

ARM documentation suggests this takes 14/15 clocks cycles. There is then the conversion from integer to floating point.

Mike Perkins Video Solutions Ltd www.videosolutions.ltd.uk

Vote

M

Mike Perkins 9 years ago

For me the lookup table is the best and simplest, especially on a device with 512kB of ROM. Is your program that big? Given the answer is always going to be of byte length, for the square root of a number up to 10,000 you would only need 10kB of table.

An alternative is a variation of:

formatting link

except you would work to base 256 and have a lookup table of just 256 bytes to find the sqrt of 0 to 255.

Mike Perkins Video Solutions Ltd www.videosolutions.ltd.uk

Vote

R

Robert Wessel 9 years ago

Of course he'd only need a 10000 byte table, all the results of isqrt(n) for n

Vote

D

David Brown 9 years ago

Separately compiled assembly usually involves quite a bit more messing around than inline assembly. You have to handle a variety of assembler directives to get it to work. You need to understand more, in order to follow the calling convention properly (including preserving registers if and only if necessary). And the result is a lot less efficient for short sections of code, because of the function call overhead. Inline assembly is shorter, simpler, and can work together with the optimiser - though you still need to know what you are doing.

The calling convention varies a lot between platforms. On targets with plenty of registers, parameters are mostly passed in registers, with the stack used when there are too many registers or for things like structs.

It /can/ be a useful skill, yes. But there is no such thing as "parameter passing mechanism in general" as it varies between platforms. And the skill is not nearly as helpful as the skill of understanding how to use your compiler. Learning some of the basics of how compiler flags work and how compiler optimisation works is /vastly/ more useful than trying to write your own assembly functions. It is also much easier, much safer (if you get it wrong you are likely to get slower code, but still equally correct code), much more portable, and leads to writing clearer and more maintainable code instead of cryptic assembly.

(I strongly advocate learning to /read/ assembly on the platforms you work with, in order to understand the results of your compilation. But for most developers, if you are writing more assembly than a very occasional single-line inline assembly function, you should probably re-consider your development strategy.)

Vote

D

David Brown 9 years ago

True.

Vote

D

David Brown 9 years ago

I is /almost/ a no-brainer. There might be other considerations, such as a slow over-the-air firmware update.

And even if this sounds like a simple answer, it's good that he is giving the different ideas "serious consideration". Some of the ideas and points raised in this thread could be useful learning experience for the future.

Vote

How would my SoC calculate an integer square root?

Join the Discussion

Didn't find your answer?