- jladasky
January 4, 2017, 11:27 pm
Hello everyone,
I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52 is an
ARM Cortex-M4. My development system is a Linux box running GCC. My
embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, as recommended
by Nordic.
I have written a fair amount of code for this device, but now I have to
explore a bit deeper. I have an algorithm that may require me to calculate
a lot of square roots, and quickly. However, the numbers in question are
all small integers.
Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get involved
in this square root calculation. Optimized algorithms for various math
functions exist. A nice discussion of the optimization of the integer
square root algorithm, written by Dr. Jack Crenshaw, is found here:
http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
I am interested in understanding exactly what code will be compiled if I
#include <math.h> in my program and then call sqrt() with an int32
argument. It stands to reason that optimized code, like that shown in the
blog above, is what should be used. How can I determine this?
I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for clues.
The only math-related file that I can find there is called tgmath.h, and it
just #defines aliases which point back to math.h. I am looking for source
code.
Thanks for any advice you can share.
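For reference, the binary digit-by-digit method that Crenshaw's article discusses can be sketched in C roughly as follows; this is a common formulation of the algorithm, not necessarily his exact Listing 4:

```c
#include <stdint.h>

/* Binary digit-by-digit integer square root: returns floor(sqrt(x)).
 * Works one result bit at a time using only shifts, adds and compares,
 * so no FPU is involved. */
uint32_t isqrt(uint32_t x) {
    uint32_t res = 0;
    uint32_t bit = 1u << 30;        /* highest power of 4 that fits */

    while (bit > x)                 /* skip leading zero bit pairs */
        bit >>= 2;

    while (bit != 0) {
        if (x >= res + bit) {       /* this result bit is a 1 */
            x   -= res + bit;
            res  = (res >> 1) + bit;
        } else {
            res >>= 1;              /* this result bit is a 0 */
        }
        bit >>= 2;
    }
    return res;
}
```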
Re: How would my SoC calculate an integer square root?
On Wednesday, January 4, 2017 at 3:46:43 PM UTC-8, Clifford Heath wrote:
I #include <math.h> in my program,
Thanks for your reply, Clifford. That's actually not an encouraging
answer, though! If that is the case, I think I will write the square root
function that was shown in Listing 4 of Jack Crenshaw's blog. The
floating-point function will spend time calculating digits of precision
that I do not need for my application.
Re: How would my SoC calculate an integer square root?
The FPU instruction VSQRT.F32 takes 14 cycles. If floats are enough, that
is probably hard to beat.
Re: How would my SoC calculate an integer square root?
On 05/01/17 01:28, snipped-for-privacy@gmail.com wrote:
As long as "-ffast-math" and at least -O1 are enabled, as well as
appropriate cpu flags, the "sqrtf" function should generate a VSQRT.F32
automatically. Note that "sqrtf" is not the same as "sqrt".
Re: How would my SoC calculate an integer square root?
On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8, snipped-for-privacy@gmail.com wrote:
My information about the performance of FPUs is clearly a few years behind the times. That's actually a very impressive result, and you're right, it would be quite hard to beat.
Re: How would my SoC calculate an integer square root?
Keep in mind, as David pointed out, that you want to make sure it's a
single-precision floating point: the FPU on the M4 doesn't do double-
precision math, so that would still go the horridly slow software route
(although a smart, FPU-aware algorithm might get an initial guess from
the FPU).
--
Tim Wescott
Wescott Design Services
Re: How would my SoC calculate an integer square root?
On Thu, 05 Jan 2017 17:44:36 -0600, Tim Wescott wrote:
If I understand correctly, the OP wants a square root algorithm
accepting an integer and returning an integer (not a float).
That algorithm will give precise values for sqrt of 1, 4, 9, 16, 25
and so on. The sqrt for other values contains a more or less large
error.
The discussion about single vs double precision floats is not relevant
in this environment.
The OP said that the argument is a small integer, and as long as the
argument is 6 decimal digits or less, it can be represented exactly in a
single precision float. Thus the result will have up to 4 decimal digits
in the integer part and a few decimal digits in the fractional part. Just
convert the single precision value to an integer to drop the fractional
part, if an integer result is truly required.
It should be noted that conversion from float to integer can take a
long time on some architectures (if no barrel shifter available), so
it may be a good idea to keep integer and float calculations separate
and avoid mixing them in the same calculations.
Re: How would my SoC calculate an integer square root?
On 06/01/17 06:54, snipped-for-privacy@downunder.com wrote:
Yes.
Yes. The OP has not said if he wants round-down, round-to-nearest, or
does not care.
It is very relevant. An easy way to implement his requirements
(ignoring range checks) is:
int sqrt_int(int x) {
    return sqrtf(x);
}
But that requires the compiler flags that have been discussed, and
"sqrtf" instead of "sqrt", in order to compile to a VSQRT instruction
and a couple of conversion instructions.
It should work fine - as long as doubles are avoided (to avoid the
slowdown).
The M4F has float to integer conversion instructions (as do all cpus
with floating point hardware that I have ever seen). Again, flags like
-ffast-math and -O1 are required to be sure that these are used without
library calls to handle the possibility of "weird" floats.
Your point is important if doubles are involved, or on an architecture
without hardware floating point. Conversions back and forth between
integers and floating point formats can be relevant for the cost of the
operations. However, they are not /that/ expensive - they are typically
not worse than general arithmetic like addition and multiplication, and
cheaper than division. They may also give the compiler more
optimisation opportunities than floating point values (especially if you
have forgotten -ffast-math).
(And the ARM architecture has a barrel shifter.)
Re: How would my SoC calculate an integer square root?
On 06/01/17 11:21, snipped-for-privacy@downunder.com wrote:
No, it could not - it is better to learn how to use your tools (such as
the compiler) properly, so that you can re-use the information to
improve more code later, rather than to learn a single inconvenient
work-around.
Understanding the point of -ffast-math and optimisation flags, and about
float vs. double, will improve /all/ floating point code written by
/everyone/ who reads this thread and learned from it. A simple inline
assembly function (which is, in fact, slightly fiddly if you want to get
the best from it) would merely help one person in one single situation.
Give a man a fish, and all that.
Re: How would my SoC calculate an integer square root?
I was not referring to inline assembly in particular, but also to
separately compiled assembler code. Of course, you need to know the
parameter passing mechanism for separately compiled/assembled units, and
how functions return results. In most languages, parameters are passed on
the stack; some conventions push the first parameter first, while others
push it last.
Understanding the parameter passing mechanism in general is a useful
skill.
Re: How would my SoC calculate an integer square root?
On 06/01/17 18:08, snipped-for-privacy@downunder.com wrote:
Separately compiled assembly usually involves quite a bit more messing
around than inline assembly. You have to handle a variety of assembler
directives to get it to work. You need to understand more, in order to
follow the calling convention properly (including preserving registers
if and only if necessary). And the result is a lot less efficient for
short sections of code, because of the function call overhead. Inline
assembly is shorter, simpler, and can work together with the optimiser -
though you still need to know what you are doing.
The calling convention varies a lot between platforms. On targets with
plenty of registers, parameters are mostly passed in registers, with the
stack used when there are too many parameters or for things like structs.
It /can/ be a useful skill, yes. But there is no such thing as
"parameter passing mechanism in general" as it varies between platforms.
And the skill is not nearly as helpful as the skill of understanding
how to use your compiler. Learning some of the basics of how compiler
flags work and how compiler optimisation works is /vastly/ more useful
than trying to write your own assembly functions. It is also much
easier, much safer (if you get it wrong you are likely to get slower
code, but still equally correct code), much more portable, and leads to
writing clearer and more maintainable code instead of cryptic assembly.
(I strongly advocate learning to /read/ assembly on the platforms you
work with, in order to understand the results of your compilation. But
for most developers, if you are writing more assembly than a very
occasional single-line inline assembly function, you should probably
re-consider your development strategy.)
Re: How would my SoC calculate an integer square root?
On 06/01/17 00:44, Tim Wescott wrote:
Not all FPUs give accurate square root results from their hardware
instructions. The ARM documentation does not give details, so (unless
someone else knows better) we can assume that it /does/ give accurate
values, but not necessarily following IEEE specifications precisely. I
have seen other FPUs equivalent to the M4F's where the square root
instruction is explicitly an approximation. It would be good enough for
many uses, but for a highly accurate result you would use it as the
estimate and then run a couple of rounds of Newton-Raphson.
A smart FPU-aware algorithm /might/ do that, but it is very unlikely in
practice. It would not actually save many cycles - you can already get
a good initial estimate by simply taking the exponent from the floating
point number and halving it. By the time you have handled the
conversions between double and single precision and back, and
considering the cost of the rest of the double-precision software
floating point operations needed in the square root, I can't see you
saving more than a couple of percent off the total time. It is not
worth building a special version of the library just for this one case -
other double-precision maths cannot, in general, get any benefit from
using the single-precision hardware.
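The Newton-Raphson refinement mentioned above can be sketched as follows (the step count is illustrative; in practice you would fix it based on the quality of the seed):

```c
#include <math.h>

/* Newton-Raphson iteration for square roots: y' = (y + x/y) / 2.
 * Each step roughly doubles the number of correct digits, so a couple
 * of steps turn a coarse hardware estimate into a precise root. */
double refine_sqrt(double x, double y, int steps) {
    while (steps-- > 0)
        y = 0.5 * (y + x / y);
    return y;
}
```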
Re: How would my SoC calculate an integer square root?
On Fri, 06 Jan 2017 08:46:11 +0100, David Brown wrote:
If those cycles are the same as the actual clock oscillator cycles,
the 32 MHz oscillator would allow more than 2 million VSQRT
instructions each second.
The OP estimated a thousand square roots each second, which would
consume less than 0.1 % of the clock cycles. No big deal and not a
reason for optimization.
Not following IEEE usually means it doesn't handle (extremely small)
denormalized values, might not generate proper +/- infinity values (for
huge values), or might not properly handle NaN (Not a Number) cases.
Floating point values are nearly always approximations anyway. Only
some special cases (small integers) are exact, but you can't represent
exactly values like 0.1 or 0.01.
To calculate the sqrt on hardware with basic floating point instructions
but no hardware SQRT instruction, I have used the following approach:
* normalize the value to between 1..4 so that the exponent is even
* divide the exponent by two to get the result exponent
* use a polynomial on the mantissa bits to calculate the result mantissa
A 4th-order polynomial gives sufficient results for float and an 8th-order
polynomial for double. I have seen implementations (such as on the VAX)
with 3rd order for single and 6th order for double.
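A rough sketch of this normalization scheme, substituting a crude linear seed plus two Newton steps for a fitted polynomial (the seed coefficients are the classic Hart starting approximation, not from the original post):

```c
#include <math.h>

/* Software sqrt by exponent halving: split off the binary exponent,
 * make it even, approximate sqrt of the reduced mantissa, then glue
 * the halved exponent back on. */
float sqrt_norm(float x) {
    if (x <= 0.0f)
        return 0.0f;                   /* domain guard for the sketch */
    int e;
    float m = frexpf(x, &e);           /* x = m * 2^e, m in [0.5, 1)  */
    if (e & 1) { m *= 2.0f; e--; }     /* make e even, m in [0.5, 2)  */
    float y = 0.41731f + 0.59016f * m; /* crude seed for sqrt(m)      */
    y = 0.5f * (y + m / y);            /* Newton step 1               */
    y = 0.5f * (y + m / y);            /* Newton step 2               */
    return ldexpf(y, e / 2);           /* sqrt(m)*2^(e/2)             */
}
```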
Re: How would my SoC calculate an integer square root?
On 06/01/17 11:50, snipped-for-privacy@downunder.com wrote:
He also made it clear that he wants to do other things during the time
than just square roots, as well as wanting to minimise the time in order
to reduce power consumption.
Mostly true. It is also not uncommon for rounding to be done in a
different way, or to lose a LSB here or there compared to IEEE
references. Usually these details don't matter. But if you compile
without -ffast-math (or the relevant individual flags), the compiler
will do its best to get IEEE conformance /exactly/ right - even if that
means a library subroutine to handle the final details.
Polynomials like this are sometimes used. Usually the "best" choice of
algorithm here depends on the relative speeds of division and
multiplication. Early floating point units (such as the VAX) were
notoriously slow at division, so an algorithm that uses polynomials
(addition and multiplication) is preferred.
Re: How would my SoC calculate an integer square root?
snipped-for-privacy@downunder.com writes:
It looks like you did not read to the end of his sentence. It will
consume 100% of the cycles if the device only wakes up to do that one thing.
Modern MCUs can achieve very low power operation by sleeping most of the
time. Power consumption can then be nearly proportional to the wake duty
cycle. So going from "0.1%" to "0.2%" could be the difference between a
2 year and a 4 year life when powered by a coin cell say.
--
John Devereux
Re: How would my SoC calculate an integer square root?
On Thursday 05 January 2017 08:17, snipped-for-privacy@itu.edu wrote:
One of the rules of optimization is: Only do it after you have numbers to
compare. You are jumping to conclusions without any data. First benchmark
and measure, then optimize, measure again and compare.
--
Reinhardt