How would my SoC calculate an integer square root?

- J
- jladasky
  
  Contact options for registered users
posted
7 years ago

Wed, Jan 4, 2017 11:27 PM

Hello everyone,

I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52 is an ARM Cortex-M4. My development system is a Linux box running GCC. My embe dded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, as recommended by N ordic.

I have written a fair amount of code for this device, but now I have to exp lore a bit deeper. I have an algorithm that may require me to calculate a lot of square roots, and quickly. However, the numbers in question are all small integers.

Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get involved in this square root calculation. Optimized algorithms for various math fu nctions exist. A nice discussion of the optimization of the integer square root algorithm, written by Dr. Jack Crenshaw, is found here:

formatting link

ger-Square-Roots

I am interested in understanding exactly what code will be compiled if I #i nclude in my program, and then call sqrt() with an int32 argument. It stands to reason that optimized code, like that shown in the blog abov e, is what should be used. How can I determine this?

I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for clues. T he only math-related file that I can find there is called tgmath.h, and it just #defines aliases which point back to math.h. I am looking for source code.

Thanks for any advice you can share.

- C
- Clifford Heath
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Wed, Jan 4, 2017 11:46 PM

sqrt() takes a double-precision floating point number, so your int32 will be promoted to double before sqrt is called, and you'll get a double back.

- J
- jladasky
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Thu, Jan 5, 2017 12:17 AM

I #include in my program,

Thanks for your reply, Clifford. That's actually not an encouraging answer though! If that is the case, I think I will write the square root functio n that was shown in Listing 4 of Jack Crenshaw's blog. The floating-point function will spend time calculating digits of precision that I do not need for my application.

- D
- David Brown
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Thu, Jan 5, 2017 12:23 AM

No, it does not "stand to reason" at all. The C standard library header goes along with an implementation of the C standard library function for sqrt() - this will do the calculations in double-precision floating point, as required by the C standards, and giving the answer to full IEEE accuracy.

There are a lot of ways you can calculate square roots, with various algorithms, lookup tables, etc., providing different balances between accuracy, run-time, and code space. There is no one "perfect" answer.

And don't dismiss the floating point unit in the M4 too quickly - for some sorts of calculations, it may be faster than using integer arithmetic, even though you are using integer operands at the start.

- L
- lasselangwadtchristensen
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Thu, Jan 5, 2017 12:28 AM

f I #include in my program,

er though! If that is the case, I think I will write the square root funct ion that was shown in Listing 4 of Jack Crenshaw's blog. The floating-poin t function will spend time calculating digits of precision that I do not ne ed for my application.

the fpu instruction VSQRT.F32 takes 14 cycles if floats are enough that is probably hard to beat

- A
- antispam
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Thu, Jan 5, 2017 12:54 AM

Actually, this method is good if you do not have hardware multiplication, or possibly if code size is more important than speed. However, otherwise method based on table lookup for initial approximation and Newton iteration is likely to be faster.

Use something like:

/mnt/m1/pom/usr/bin/arm-none-eabi-gcc -Wall -mthumb -mcpu=cortex-m4 -march=armv7e-m -mfloat-abi=hard -mfpu=fpv4-sp-d16 -O3 -S your_file.c

(replace /mnt/m1/pom/usr/bin/arm-none-eabi-gcc by path to your gcc and use your comiler options). In file 'your_file.s' you will find resulting assembly code. Using gcc-5.3 I get:

bl __aeabi_i2d vmov d0, r0, r1 bl sqrt vmov r0, r1, d0 bl __aeabi_d2iz

AFAICS this is: convertion from integer to double, moving argument to flating point register, calling sqrt from C library, moving argument back to integer registers, convertion from double to integer.

What sqrt from library will do depends on C library, I do not use any...

'sqrt' is in C library. Some frequently used library functions are buit into gcc, but apparenty 'sqrt' is not one of them. So do not look inside gcc. Instad look for C library. It seems that newlib uses implementation which converts back to integers and is doing most calculations on integers. Seem highly suboptimal.

--
                              Waldek Hebisch

- R
- Reinhardt Behm
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Thu, Jan 5, 2017 1:22 AM

One of the rules of optimization is: Only do it after you have numbers to compare. You are jumping to conclusions without any data. First benchmark and measure, then optimize, measure again and compare.

--
Reinhardt

- R
- rickman
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Thu, Jan 5, 2017 1:32 AM

The first rule of optimizing is, "don't" until you find you clearly need it.

--

Rick C

- T
- Tim Wescott
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Thu, Jan 5, 2017 2:56 AM

Integer-Square-Roots

The FPU hardware is almost certainly faster than anything you could do in software. Unless the largest-value integer you need to take the square root of is small enough that you can just implement a look-up table.

Benchmark, and see.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com 

I'm looking for work -- see my website!

- D
- David Brown
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Thu, Jan 5, 2017 8:38 AM

As long as "-ffast-math" and at least -O1 are enabled, as well as appropriate cpu flags, the "sqrtf" function should generate a VSQRT.F32 automatically. Note that "sqrtf" is not the same as "sqrt".

- D
- David Brown
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Thu, Jan 5, 2017 8:43 AM

Rules about optimisation are often misunderstood - perhaps because "optimisation" is such a general term, and is used for everything between enabling a compiler switch to algorithmic changes to fine-tuning hand-written assembly routines.

The first rule of optimising is from Knuth - "Premature optimisation is the root of all evil." The emphasis is on /premature/. That encompasses not putting effort into optimisation that is not needed - but equally, it allows for optimisation when you know that it /is/ needed.

Measurement is useful - invaluable, even - when you are comparing different solutions, finding bottlenecks, or identifying which changes made which effects. But it is not necessary when you can get a good enough estimation of the scales involved. If you want to plan a route across a country, you might want to calculate carefully the times for driving, trains, or taking a plane - but you can already rule out walking from the start as a poor "algorithm", and also rule out rockets and submarines for various reasons. You don't need any sort of measurements to start with.

The OP here says specifically that he needs to do lots of square roots, and needs them fast, with small integers. Until I know otherwise, I assume he has a reasonable idea of what he is doing, and can have a reasonable estimate that a software-only double-precision IEEE accuracy full range square root function is going to be too slow. It makes sense to think about this from the start, not after benchmarking and measurement - it could easily be a make-or-break issue for the algorithm and his whole project. I assume that the OP /does/ have useful data, such as the number of square roots per second and the clock speed of the device, even if he does not have /all/ the relevant data.

There are also plenty of optimisations that are very "cheap" to do - and in general, should /always/ be in place. Basic compiler optimisation - at least "-O1", but usually "-O2" or "-Os", should be standard. It costs nothing, gives better code (assembly code with "-O1" is usually a lot easier to understand than code from "-O0"), and enables better static warning and error checking.

If you are using floating point, "-ffast-math" should also be standard for most uses - unless you /really/ need full IEEE support.

You don't need to measure anything before using these - just as you don't need to time your car journeys before going out of first gear.

And there is also the common misconception that "optimised code" is difficult to understand, difficult to write, and difficult to debug. Sometimes that is the case - often it is not. The OP needs the square roots of small integers - the first idea that comes to my mind is a lookup table, which is pretty simple.

- D
- Dimiter_Popoff
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Thu, Jan 5, 2017 10:19 AM

Here are my two replies to a similar query some 9 years ago:

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI

formatting link

------------------------------------------------------

formatting link

- J
- jladasky
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Thu, Jan 5, 2017 11:13 PM

My information about the performance of FPU's is clearly a few years behind the times. That's actually a very impressive result, and you're right, it would be quite hard to beat.

- J
- jladasky
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Thu, Jan 5, 2017 11:22 PM

rch=armv7e-m -mfloat-abi=hard -mfpu=fpv4-sp-d16 -O3 -S your_file.c

. The only math-related file that I can find there is called tgmath.h, and it just #defines aliases which point back to math.h. I am looking for sou rce code.

Thanks Waldek, I haven't looked at the .s files and that's definitely helpf ul. If I compile using the "fast math" flag that was suggested by David Br own, that "sqrt" MIGHT compile to a single FPU instruction, VSQRT.F32, whic h apparently only requires 14 clock cycles. While the FPU result may not a chieve IEEE precision, I need speed.

- J
- jladasky
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Thu, Jan 5, 2017 11:38 PM

My nRF52 has a 32 MHz CPU clock. The system will need to calculate about 1 ,000 square roots per second. The largest number I expect to see is about

10,000. If this task was all that concerned me, I wouldn't sweat. However : my system also has several other real-time tasks to perform. It's also b attery-powered. I am thinking about optimization because I need to conserv e both time and power.

That's helpful. Now you have me wondering whether there are compiler flags for optimizing integer math as well.

As you can see, a 100-number lookup table should address my needs. A bit l ong, perhaps. A binary search through the table could require 7 iterations .

- T
- Tim Wescott
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Thu, Jan 5, 2017 11:44 PM

Keep in mind, as David pointed out, that you want to make sure it's a single-precision floating point: the FPU on the M4 doesn't do double- precision math, so that would still go the horridly slow software route (although a smart, FPU-aware algorithm might get an initial guess from the FPU).

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com 

I'm looking for work -- see my website!

- R
- Robert Wessel
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Fri, Jan 6, 2017 1:54 AM

The basic binary digit-by-digit algorithm for number of that scale ought to be at least a bit faster than a binary search of a table that size, and be less code (and no data).

formatting link

The inner loop of that is about a dozen instructions. And if you're really limited to 10,000, you can initialize "bit" to 1

- U
- upsidedown
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Fri, Jan 6, 2017 5:54 AM

If I understand correctly, the OP wants a square root algorithm accepting an integer and returning an integer (not a float).

That algorithm will give precise values for sqrt of 1, 4, 9, 16, 25 and so on. The sqrt for other values contains a more or less large error.

The discussion about single vs double precision floats is not relevant in this environment.

The OP said that the argument is a small integer and as long as the argument is 6 decimal digits or less, it can be represented exactly in single precision float. Thus the result will have up to 4 decimal digits in integer part and a few decimal digits in the fractional part. Just convert the single precision value to integer to get away the fractional part, if truly an integer result is required.

It should be noted that conversion from float to integer can take a long time on some architectures (if no barrel shifter available), so it may be a good idea to keep integer and float calculations separate and avoid mixing them in the same calculations.

- D
- David Brown
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Fri, Jan 6, 2017 7:46 AM

Not all FPU's give accurate square root results from their hardware instructions. The ARM documentation does not give details, so (unless someone else knows better) we can assume that it /does/ give accurate values, but not necessarily following IEEE specifications precisely. I have seen other FPU's equivalent to the M4F's where the square root instruction is explicitly an approximation. It would be good enough for many uses, but for a highly accurate result you would use it as the estimate and then run a couple of rounds of Newton-Raphson.

A smart FPU-aware algorithm /might/ do that, but it is very unlikely in practice. It would not actually save many cycles - you can already get a good initial estimate by simply taking the exponent from the floating point number and halving it. By the time you have handled the conversions between double and floating point and vice versa, and considering the cost of the rest of the double-precision software floating point operations needed in the square root, I can't see you saving more than a couple of percent off the total time. It is not worth building a special version of the library just for this one case - other double-precision maths cannot, in general, get any benefit from using the single-precision hardware.

- D
- David Brown
  
  Contact options for registered users
Vote on answer
posted
7 years ago

Fri, Jan 6, 2017 8:12 AM

Yes.

Yes. The OP has not said if he wants round-down, round-to-nearest, or does not care.

It is very relevant. An easy way to implement his requirements (ignoring range checks) is:

int sqrt_int(int x) { return sqrtf(x); }

But that requires the compiler flags that have been discussed, and "sqrtf" instead of "sqrt", in order to compile to a VSQRT instruction and a couple of conversion instructions.

It should work fine - as long as doubles are avoided (to avoid the slowdown).

The M4F has float to integer conversion instructions (as do all cpus with floating point hardware that I have ever seen). Again, flags like

-ffast-math and -O1 are required to be sure that these are used without library calls to handle the possibility of "weird" floats.

Your point is important if doubles are involved, or on an architecture without hardware floating point. Conversions back and forth between integers and floating point formats can be relevant for the cost of the operations. However, they are not /that/ expensive - they are typically not worse than general arithmetic like addition and multiplication, and cheaper than division. They may also give the compiler more optimisation opportunities than floating point values (especially if you have forgotten -ffast-math).

(And the ARM architecture has a barrel shifter.)