Integer Square Root Algorithm for Processor with MUL, DIV

D

David T. Ashley 18 years ago

What is the best integer square root algorithm for a processor with MUL and DIV (integer multiplication and division) built-in?

For one without MUL and DIV (or where the instructions operate on shorter operands than would be required to form the square in question)?

Thanks for all ideas and thoughts.

I will check TOACP, of course.

The application is a low-cost microcontroller with a 3-axis accelerometer. To get the magnitude of the acceleration, we'd want to use sqrt(x^2+y^2+z^2). The squaring operations are cheap (MUL). The addition is cheap (ADD, ADC). The square root extraction is what I'm not sure how best to do.

The algorithm that comes to mind is just to do a binary search by squaring potential solutions and comparing them to the quantity whose square root is to be extracted.

For example, if the sum of the squares is 16 bits or less, and since the processor has a 8 x 8 = 16 MUL instruction, it should be possible to iterate to the solution in no more than 16 square/compare cycles.

But maybe there is a better way?

And it isn't clear what to do if the sum of the squares is a bit larger.

Thanks for all.

David T. Ashley (dta@e3ft.com) http://www.e3ft.com (Consulting Home Page) http://www.dtashley.com (Personal Home Page) http://gpl.e3ft.com (GPL Publications and Projects)

Vote

H

Hans-Bernhard Bröker 18 years ago

There's never such a thing as "the best" way to do something that's been described so vaguely. Different circumstances render different aspects important, and thus change what "good" actually means.

There's one exception to the above rule: not doing something at all, if you can get away, is always the best way of doing it. So: do you really

*need* that sqrt() taken on your microcontroller?

Failing that you can still use use good old Newton-Raphson iteration to locate the zero of (x^2 - input).

x |--> x - (x^2 - input) / (2*x) = (x + input/x) / 2

Vote

V

Vladimir Vassilevsky 18 years ago

Here is the simplest (but not the very fastest) sqrt algorithm for the CPU with cheap multiplication: //------------------------- s16 sqrt_s16(s32 x) { s32 tmp_result; s16 root, tmp_root, tmp_add;

root = 0; tmp_add = 0x4000;

while(tmp_add) { tmp_root = root + tmp_add; tmp_result = ((s32)tmp_root)*tmp_root; if(tmp_result >= 1; }

return root; } //-----------------------

Vladimir Vassilevsky

DSP and Mixed Signal Design Consultant

formatting link

Vote

D

Dann Corbit 18 years ago

Take a look here:

formatting link

A zillion monkeys can't all be wrong.

** Posted from

formatting link

**

Vote

P

Pubkeybreaker 18 years ago

d

tion

w

s

rate

com=A0 =A0 =A0 =A0 =A0(Consulting Home Page)

formatting link

=A0 = =A0(Personal Home Page)

formatting link

=A0 =A0 =A0 =A0(GPL Publicatio= ns and Projects)

Approximate the square root by the lowest half of the bit representation of x^2 + y^2 (or any other easy, convenient approximation), and then use Newton-Raphson.

Vote

R

Rich Webb 18 years ago

Jack Crenshaw had a series of columns on numerical methods in "Embedded Systems" a few years ago. Has a good book on the subject, as well.

formatting link

(The past tense isn't meant to imply that his current columns aren't great too... ;-)

As far as the columns, the one on integer square roots is at

Rich Webb Norfolk, VA

Vote

V

Vladimir Vassilevsky 18 years ago

How accurate do you want to be?

let X = min(a,b)/max(a,b)

Then you can use a polynomial approximation:

sqrt(a^2 + b^2) ~ max(a,b)*(1 + P(X))

The simplest approximation of the 1st order (somewhat 10% accurate):

sqrt(a^2 + b^2) ~ max(a,b) + 0.318 * min(a,b)

In the 3-dimensional case, you have to do this procedure twice:

sqrt(x^2 + y^2 + z^2) = sqrt(max(x,y,z)^2 + sqrt(mid(x,y,z)^2 + min(x,y,z)^2))^2

Vladimir Vassilevsky DSP and Mixed Signal Design Consultant

formatting link

Vote

W

Walter Banks 18 years ago

For no MUL and DIV search for Dann Corbit's postings. Dann posted one a few years ago with N/2 interations. If you can't find it contact me off line.

The following square root routine was originally done for the 6805. It is probably not the fastest

The basic loop spits out one bit of result per pass. This routine takes a 16 bit argument and returns an eight bit integer and 8 bit fraction.

Walter Banks

formatting link

============================================================ #pragma option v; #pragma has MUL;

long SQRT (long l) /* Square root V1.0.

Hayward/Banks.

This routine generates a square root to eight bit resolution taking a 16 bit argument normalizing the argument and returning a 16 bit real number that consists of 8 bits integer and 8 bit fractional parts. The MS8bits are the integer part.

i i i i i i i i . f f f f f f f f | | | | | | | | | | | | | | | | 128 |32 | 8 4 2 1 .5 | | | | | | .00390625 64 16 .25 | | | | --.0078125 .125- | | ----.015625 .0625-- ------.03125

These routines may be adapted for any purpose. No warranty is implied or given as to their usability for any purpose.

(c) Byte Craft Limited Waterloo, Ontario Canada N2J 4E4 (519) 888-6911

Walter Banks/Gordon Hayward */

{ unsigned int n,j,k; unsigned long t; n = 0; while( (l & 0xc000) == 0) { n++; l =1) { j = j | k; t = j * j; if (t > l) j = j & (~k); } t = j; return (t > 8; frac_part = result & 0xff; }

Vote

D

Didi 18 years ago

Actually you can do this in 8 multiplies. This is how I usually implement square root - I have probably reinvented it some 15+ years ago for myself. You just start with $80, square and compare to the number you want the root of, and do successive approximation down to bit 0 (i.e. second time with $40+result so far, then $20 etc.).

Dimiter

------------------------------------------------------ Dimiter Popoff Transgalactic Instruments

formatting link

------------------------------------------------------

formatting link

Original message:

formatting link

Vote

A

Arlet 18 years ago

If the values of x, y, z don't change too much from sample to sample you could use the previous value as the starting point for Newton- Raphson iteration.

For a different application (not involving acceleration or square roots), I performed only a single Newton-Raphson iteration per sample. For slow moving inputs, this gave perfect results, and for fast moving inputs, I got a free low-pass filter :)

Vote

C

CBFalconer 18 years ago

Depends on needs. Probably your best bet is Newton approximation:

Rn+1 = 0.5 * (Rn + N/Rn)

and repeat until the error is negligible. Start with R == 1.0.

[mail]: Chuck F (cbfalconer at maineline dot net) [page]: Try the download section. ** Posted from http://www.teranews.com **

Vote

U

user923005 18 years ago

d

tion

w

s

rate

(I posted this also from TerraNews, but sometimes those posts take a few days to show up):

Take a look here:

formatting link

A zillion monkeys can't all be wrong.

Vote

P

Paul E. Bennett 18 years ago

See:-

Nice and simple method that works quite well.

******************************************************************** Paul E. Bennett............... Forth based HIDECS Consultancy Mob: +44 (0)7811-639972 Tel: +44 (0)1235-811095 Going Forth Safely ..... EBA. www.electric-boat-association.org.uk.. ********************************************************************

Vote

D

Didi 18 years ago

Well my algorithm takes less lines and is a lot easier to read than those in HLL which were suggested, but I hope it counts in spite of that :-). Here is it excerpted from a post I have made many years ago, calculates a 16 bit squre root of a 32 bit number. The example is in CPU32 (or CF) assembly.

*

calc. the square root of d1.l; return it in d1.l
destroys d2,d3,d4
sqrtd1 moveq #15,d2 counter clr.l d3 colect result here set loop,* bset d2,d3 try this move.w d3,d4 extra copy result so far mulu.w d4,d4 square cmp.l d1,d4 higher? bls keepbit branch not bclr d2,d3 else bit back down keepbit dbra d2,loop process all bits move.l d3,d1 update in d1 rts

*

Dimiter

------------------------------------------------------ Dimiter Popoff Transgalactic Instruments

formatting link

------------------------------------------------------

formatting link

Original message:

formatting link

Vote

P

Peter 18 years ago

One easy way is to add up the number of odd integers which add up to the number (or just below it).

For example, using this algorithm, the square root of 20 is

1+3+5+7

i.e. 4 numbers

answer=4

Not so efficient for large integers of course!

"David T. Ashley" wrote

Vote

U

user923005 18 years ago

nd

.

ition

ow

g

is

erate

I have Mr. Crenshaw's book "Math Toolkit for REAL-TIME Programming" "Square Root" runs from pages 43-87 and he discusses a boatload of methods. Unfortunately, I have misplaced the CD. Looking for it now.

Vote

D

David Brown 18 years ago

First make sure you really need to find the square root. For many uses, you could just use the calculated square of the magnitude.

Vote

U

user923005 18 years ago

and

er

er.

ddition

how

ing

t is

e

iterate

A benchmark shows this to be the fastest on modern hardware, but I do not know about embedded CPUs. I guess they should be tested.

unsigned long isqrt26(unsigned long input) { unsigned long s =3D 0; unsigned long s21 =3D 1073741824;

if (s21

Vote

U

user923005 18 years ago

UL and

rter

eter.

addition

re how

aring

oot is

the

o iterate

#include

// This one is REALLY bad for large inputs. unsigned long isqrt00(unsigned long x) { unsigned long s =3D 1; unsigned long remainder; if (x =3D s) { remainder -=3D s; s +=3D 2; } return (s - 1) >> 1; } // Integer Square Root function // Uses bisection to find square root

unsigned long isqrt01(unsigned long x) { unsigned long r, s, t; r =3D 0; for (t =3D 0x40000000; t; t >>=3D 2) { s =3D r + t; if (s >=3D 1; } return (r); }

#define LSIZE (sizeof(unsigned long)*CHAR_BIT)

/* Jos came up with this one... */ unsigned long isqrt06(unsigned long n) {

unsigned long l =3D 0, c =3D 0, y, k;

for (k =3D LSIZE >> 1; k--; n =3D 0; i--) if (y < xsq) { oldx =3D x; x -=3D 1 > 1; } } }

// Integer Square Root function // Uses bisection to find square root unsigned long isqrt05(unsigned long n) { unsigned long h, p =3D 0, q =3D 1, r =3D n;

while (q >=3D 1;

if (r >=3D h) { p +=3D q; r -=3D h; } }

return p; }

// Binomial Theorem -- O( 1/2 log2 N ) unsigned long isqrt07(unsigned long N) { unsigned long l2, u, v, u2, v2, uv2, n; if (2 > N) return (N); u =3D N; l2 =3D 0; while (u >>=3D 1) l2++; l2 >>=3D 1; u =3D 1L 24] > 22] =3D 0x4000000) return (sqq_table[n >> 20] > 18] =3D 0x100000) if (n >=3D 0x400000) return (sqq_table[n >> 16] > 14] =3D 0x40000) return (sqq_table[n >> 12] > 10] =3D 0x100) if (n >=3D 0x1000) if (n >=3D 0x4000) return (sqq_table[n >> 8]); else return (sqq_table[n >> 6] >> 1); else if (n >=3D 0x400) return (sqq_table[n >> 4] >> 2); else return (sqq_table[n >> 2] >> 3); else if (n >=3D 0x10) if (n >=3D 0x40) return (sqq_table[n] >> 4); else return (sqq_table[n > 4] > 6] =3D 0x10000) if (n >=3D 0x10000000) if (n >=3D 0x40000000) result =3D sq_table[n >> 16]; else result =3D sq_table[n >> 14] >> 1; else if (n >=3D 0x1000000) result =3D sq_table[n >> 12] >> 2; else result =3D sq_table[n >> 8] >> 4;

else return (sq_table[n] >> 8);

if (((result + 1) * (result + 1)) 2^16 // can be removed to get almost 2 times speed up, but accuracy will suffer. // however worst case will be out by 1.

return (result);

}

/* Fast integer square root */ unsigned long isqrt10(unsigned long x) { unsigned long squaredbit, remainder, root;

if (x =3D=3D 0) return 0;

/* Load the binary constant 01 00 00 ... 00, where the number of zero * bits to the right of the single one bit is even, and the one bit is as * far left as is consistant with that condition.) */ squaredbit =3D (long) ((((unsigned long) ~0L) >> 1) & ~(((unsigned long) ~0L) >> 2)); /* This portable load replaces the loop that used to be here, and was * donated by snipped-for-privacy@xmission.com */

/* Form bits of the answer. */ remainder =3D x; root =3D 0; while (squaredbit > 0) { if (remainder >=3D (squaredbit | root)) { remainder -=3D (squaredbit | root); root >>=3D 1; root |=3D squaredbit; } else { root >>=3D 1; } squaredbit >>=3D 2; } return root; }

#define Ns ((sizeof(unsigned long)) ((Ns - i) r) * r++; */

unsigned long isqrt02(unsigned long x) { unsigned long r, nr, m;

r =3D 0; m =3D 0x40000000; do { nr =3D r + m; if (nr >=3D 1; m >>=3D 2; } while (m !=3D 0);

if (x > r) r++; return (r); }

unsigned long isqrt14(unsigned long input) { unsigned long s; unsigned long s21; long j; s =3D 0; for (j =3D 15; j >=3D 0; --j) { s21 =3D (s =3D 0x40000000) { if (x >=3D 65535UL * 65535UL) return 65535; xn =3D sqq_table[x >> 24] > 22] =3D 0x4000000) xn =3D sqq_table[x >> 20] > 18] =3D 0x100000) if (x >=3D 0x400000) xn =3D sqq_table[x >> 16] > 14] =3D 0x40000) xn =3D sqq_table[x >> 12] > 10] =3D 0x100) { if (x >=3D 0x1000) if (x >=3D 0x4000) xn =3D (sqq_table[x >> 8] >> 0) + 1; else xn =3D (sqq_table[x >> 6] >> 1) + 1; else if (x >=3D 0x400) xn =3D (sqq_table[x >> 4] >> 2) + 1; else xn =3D (sqq_table[x >> 2] >> 3) + 1;

goto adj; } else return sqq_table[x] >> 4;

/* Run two iterations of the standard convergence formula */

xn =3D (xn + 1 + x / xn) / 2; nr1: xn =3D (xn + 1 + x / xn) / 2; adj:

if (xn * xn > x) /* Correct rounding if necessary */ xn--;

return xn; }

/* From:

formatting link

*/ /* Let us examine an implementation of the second method. By Jim Ulery

*/ unsigned short isqrt36(unsigned long val) { unsigned long temp, g =3D 0, b =3D 0x8000, bshft =3D 15; do { if (val >=3D (temp =3D (((g =3D 1); return g; }

/* From:

formatting link

*/ /* Here the division step is replaced with a 1-bit approximation * of that division. Jim Ulery's complete derivation is given here. * But there are still 15 branches here that will not be predicted * very well, as well as loop overhead. So unrolling is the first * improvement we can make. By Mark Crowne */ unsigned short isqrt37(unsigned long val) { unsigned int temp, g =3D 0;

if (val >=3D 0x40000000) { g =3D 0x8000; val -=3D 0x40000000; } #define INNER_ISQRT(s) \ temp =3D (g

Vote

M

Mark Borgerson 18 years ago

I think I would rewrite that if I were using a processor without an FPU! ;-)

Mark Borgerson

Vote

Integer Square Root Algorithm for Processor with MUL, DIV

Join the Discussion

Didn't find your answer?