I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52 is an ARM Cortex-M4. My development system is a Linux box running GCC. My embe dded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, as recommended by N ordic.

I have written a fair amount of code for this device, but now I have to exp lore a bit deeper. I have an algorithm that may require me to calculate a lot of square roots, and quickly. However, the numbers in question are all small integers.

Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get involved in this square root calculation. Optimized algorithms for various math fu nctions exist. A nice discussion of the optimization of the integer square root algorithm, written by Dr. Jack Crenshaw, is found here:

formatting link

ger-Square-Roots

I am interested in understanding exactly what code will be compiled if I #i nclude in my program, and then call sqrt() with an int32 argument. It stands to reason that optimized code, like that shown in the blog abov e, is what should be used. How can I determine this?

I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for clues. T he only math-related file that I can find there is called tgmath.h, and it just #defines aliases which point back to math.h. I am looking for source code.

Thanks for your reply, Clifford. That's actually not an encouraging answer though! If that is the case, I think I will write the square root functio n that was shown in Listing 4 of Jack Crenshaw's blog. The floating-point function will spend time calculating digits of precision that I do not need for my application.

No, it does not "stand to reason" at all. The C standard library header goes along with an implementation of the C standard library function for sqrt() - this will do the calculations in double-precision floating point, as required by the C standards, and giving the answer to full IEEE accuracy.

There are a lot of ways you can calculate square roots, with various algorithms, lookup tables, etc., providing different balances between accuracy, run-time, and code space. There is no one "perfect" answer.

And don't dismiss the floating point unit in the M4 too quickly - for some sorts of calculations, it may be faster than using integer arithmetic, even though you are using integer operands at the start.

er though! If that is the case, I think I will write the square root funct ion that was shown in Listing 4 of Jack Crenshaw's blog. The floating-poin t function will spend time calculating digits of precision that I do not ne ed for my application.

the fpu instruction VSQRT.F32 takes 14 cycles if floats are enough that is probably hard to beat

Actually, this method is good if you do not have hardware multiplication, or possibly if code size is more important than speed. However, otherwise method based on table lookup for initial approximation and Newton iteration is likely to be faster.

(replace /mnt/m1/pom/usr/bin/arm-none-eabi-gcc by path to your gcc and use your comiler options). In file 'your_file.s' you will find resulting assembly code. Using gcc-5.3 I get:

AFAICS this is: convertion from integer to double, moving argument to flating point register, calling sqrt from C library, moving argument back to integer registers, convertion from double to integer.

What sqrt from library will do depends on C library, I do not use any...

'sqrt' is in C library. Some frequently used library functions are buit into gcc, but apparenty 'sqrt' is not one of them. So do not look inside gcc. Instad look for C library. It seems that newlib uses implementation which converts back to integers and is doing most calculations on integers. Seem highly suboptimal.

One of the rules of optimization is: Only do it after you have numbers to compare. You are jumping to conclusions without any data. First benchmark and measure, then optimize, measure again and compare.

The FPU hardware is almost certainly faster than anything you could do in software. Unless the largest-value integer you need to take the square root of is small enough that you can just implement a look-up table.

As long as "-ffast-math" and at least -O1 are enabled, as well as appropriate cpu flags, the "sqrtf" function should generate a VSQRT.F32 automatically. Note that "sqrtf" is not the same as "sqrt".

Rules about optimisation are often misunderstood - perhaps because "optimisation" is such a general term, and is used for everything between enabling a compiler switch to algorithmic changes to fine-tuning hand-written assembly routines.

The first rule of optimising is from Knuth - "Premature optimisation is the root of all evil." The emphasis is on /premature/. That encompasses not putting effort into optimisation that is not needed - but equally, it allows for optimisation when you know that it /is/ needed.

Measurement is useful - invaluable, even - when you are comparing different solutions, finding bottlenecks, or identifying which changes made which effects. But it is not necessary when you can get a good enough estimation of the scales involved. If you want to plan a route across a country, you might want to calculate carefully the times for driving, trains, or taking a plane - but you can already rule out walking from the start as a poor "algorithm", and also rule out rockets and submarines for various reasons. You don't need any sort of measurements to start with.

The OP here says specifically that he needs to do lots of square roots, and needs them fast, with small integers. Until I know otherwise, I assume he has a reasonable idea of what he is doing, and can have a reasonable estimate that a software-only double-precision IEEE accuracy full range square root function is going to be too slow. It makes sense to think about this from the start, not after benchmarking and measurement - it could easily be a make-or-break issue for the algorithm and his whole project. I assume that the OP /does/ have useful data, such as the number of square roots per second and the clock speed of the device, even if he does not have /all/ the relevant data.

There are also plenty of optimisations that are very "cheap" to do - and in general, should /always/ be in place. Basic compiler optimisation - at least "-O1", but usually "-O2" or "-Os", should be standard. It costs nothing, gives better code (assembly code with "-O1" is usually a lot easier to understand than code from "-O0"), and enables better static warning and error checking.

If you are using floating point, "-ffast-math" should also be standard for most uses - unless you /really/ need full IEEE support.

You don't need to measure anything before using these - just as you don't need to time your car journeys before going out of first gear.

And there is also the common misconception that "optimised code" is difficult to understand, difficult to write, and difficult to debug. Sometimes that is the case - often it is not. The OP needs the square roots of small integers - the first idea that comes to my mind is a lookup table, which is pretty simple.

My information about the performance of FPU's is clearly a few years behind the times. That's actually a very impressive result, and you're right, it would be quite hard to beat.

. The only math-related file that I can find there is called tgmath.h, and it just #defines aliases which point back to math.h. I am looking for sou rce code.

Thanks Waldek, I haven't looked at the .s files and that's definitely helpf ul. If I compile using the "fast math" flag that was suggested by David Br own, that "sqrt" MIGHT compile to a single FPU instruction, VSQRT.F32, whic h apparently only requires 14 clock cycles. While the FPU result may not a chieve IEEE precision, I need speed.

My nRF52 has a 32 MHz CPU clock. The system will need to calculate about 1 ,000 square roots per second. The largest number I expect to see is about

10,000. If this task was all that concerned me, I wouldn't sweat. However : my system also has several other real-time tasks to perform. It's also b attery-powered. I am thinking about optimization because I need to conserv e both time and power.

That's helpful. Now you have me wondering whether there are compiler flags for optimizing integer math as well.

As you can see, a 100-number lookup table should address my needs. A bit l ong, perhaps. A binary search through the table could require 7 iterations .

Keep in mind, as David pointed out, that you want to make sure it's a single-precision floating point: the FPU on the M4 doesn't do double- precision math, so that would still go the horridly slow software route (although a smart, FPU-aware algorithm might get an initial guess from the FPU).

The basic binary digit-by-digit algorithm for number of that scale ought to be at least a bit faster than a binary search of a table that size, and be less code (and no data).

formatting link

The inner loop of that is about a dozen instructions. And if you're really limited to 10,000, you can initialize "bit" to 1

If I understand correctly, the OP wants a square root algorithm accepting an integer and returning an integer (not a float).

That algorithm will give precise values for sqrt of 1, 4, 9, 16, 25 and so on. The sqrt for other values contains a more or less large error.

The discussion about single vs double precision floats is not relevant in this environment.

The OP said that the argument is a small integer and as long as the argument is 6 decimal digits or less, it can be represented exactly in single precision float. Thus the result will have up to 4 decimal digits in integer part and a few decimal digits in the fractional part. Just convert the single precision value to integer to get away the fractional part, if truly an integer result is required.

It should be noted that conversion from float to integer can take a long time on some architectures (if no barrel shifter available), so it may be a good idea to keep integer and float calculations separate and avoid mixing them in the same calculations.

Not all FPU's give accurate square root results from their hardware instructions. The ARM documentation does not give details, so (unless someone else knows better) we can assume that it /does/ give accurate values, but not necessarily following IEEE specifications precisely. I have seen other FPU's equivalent to the M4F's where the square root instruction is explicitly an approximation. It would be good enough for many uses, but for a highly accurate result you would use it as the estimate and then run a couple of rounds of Newton-Raphson.

A smart FPU-aware algorithm /might/ do that, but it is very unlikely in practice. It would not actually save many cycles - you can already get a good initial estimate by simply taking the exponent from the floating point number and halving it. By the time you have handled the conversions between double and floating point and vice versa, and considering the cost of the rest of the double-precision software floating point operations needed in the square root, I can't see you saving more than a couple of percent off the total time. It is not worth building a special version of the library just for this one case - other double-precision maths cannot, in general, get any benefit from using the single-precision hardware.

Yes. The OP has not said if he wants round-down, round-to-nearest, or does not care.

It is very relevant. An easy way to implement his requirements (ignoring range checks) is:

int sqrt_int(int x) { return sqrtf(x); }

But that requires the compiler flags that have been discussed, and "sqrtf" instead of "sqrt", in order to compile to a VSQRT instruction and a couple of conversion instructions.

It should work fine - as long as doubles are avoided (to avoid the slowdown).

The M4F has float to integer conversion instructions (as do all cpus with floating point hardware that I have ever seen). Again, flags like

-ffast-math and -O1 are required to be sure that these are used without library calls to handle the possibility of "weird" floats.

Your point is important if doubles are involved, or on an architecture without hardware floating point. Conversions back and forth between integers and floating point formats can be relevant for the cost of the operations. However, they are not /that/ expensive - they are typically not worse than general arithmetic like addition and multiplication, and cheaper than division. They may also give the compiler more optimisation opportunities than floating point values (especially if you have forgotten -ffast-math).

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here.
All logos and trade names are the property of their respective owners.