What is the best integer square root algorithm for a processor with MUL and DIV (integer multiplication and division) built-in?

For one without MUL and DIV (or where the instructions operate on shorter operands than would be required to form the square in question)?

Thanks for all ideas and thoughts.

I will check TOACP, of course.

The application is a low-cost microcontroller with a 3-axis accelerometer. To get the magnitude of the acceleration, we'd want to use sqrt(x^2+y^2+z^2). The squaring operations are cheap (MUL). The addition is cheap (ADD, ADC). The square root extraction is what I'm not sure how best to do.

The algorithm that comes to mind is just to do a binary search by squaring potential solutions and comparing them to the quantity whose square root is to be extracted.

For example, if the sum of the squares is 16 bits or less, and since the processor has a 8 x 8 = 16 MUL instruction, it should be possible to iterate to the solution in no more than 16 square/compare cycles.

But maybe there is a better way?

And it isn't clear what to do if the sum of the squares is a bit larger.

Thanks for all.

--
David T. Ashley (dta@e3ft.com)
http://www.e3ft.com (Consulting Home Page)

There's never such a thing as "the best" way to do something that's been described so vaguely. Different circumstances render different aspects important, and thus change what "good" actually means.

There's one exception to the above rule: not doing something at all, if you can get away, is always the best way of doing it. So: do you really

*need* that sqrt() taken on your microcontroller?

Failing that you can still use use good old Newton-Raphson iteration to locate the zero of (x^2 - input).

x |--> x - (x^2 - input) / (2*x) = (x + input/x) / 2

Here is the simplest (but not the very fastest) sqrt algorithm for the CPU with cheap multiplication: //------------------------- s16 sqrt_s16(s32 x) { s32 tmp_result; s16 root, tmp_root, tmp_add;

Approximate the square root by the lowest half of the bit representation of x^2 + y^2 (or any other easy, convenient approximation), and then use Newton-Raphson.

For no MUL and DIV search for Dann Corbit's postings. Dann posted one a few years ago with N/2 interations. If you can't find it contact me off line.

The following square root routine was originally done for the 6805. It is probably not the fastest

The basic loop spits out one bit of result per pass. This routine takes a 16 bit argument and returns an eight bit integer and 8 bit fraction.

Walter Banks

formatting link

============================================================ #pragma option v; #pragma has MUL;

long SQRT (long l) /* Square root V1.0.

Hayward/Banks.

This routine generates a square root to eight bit resolution taking a 16 bit argument normalizing the argument and returning a 16 bit real number that consists of 8 bits integer and 8 bit fractional parts. The MS8bits are the integer part.

i i i i i i i i . f f f f f f f f | | | | | | | | | | | | | | | | 128 |32 | 8 4 2 1 .5 | | | | | | .00390625 64 16 .25 | | | | --.0078125 .125- | | ----.015625 .0625-- ------.03125

These routines may be adapted for any purpose. No warranty is implied or given as to their usability for any purpose.

Actually you can do this in 8 multiplies. This is how I usually implement square root - I have probably reinvented it some 15+ years ago for myself. You just start with $80, square and compare to the number you want the root of, and do successive approximation down to bit 0 (i.e. second time with $40+result so far, then $20 etc.).

If the values of x, y, z don't change too much from sample to sample you could use the previous value as the starting point for Newton- Raphson iteration.

For a different application (not involving acceleration or square roots), I performed only a single Newton-Raphson iteration per sample. For slow moving inputs, this gave perfect results, and for fast moving inputs, I got a free low-pass filter :)

Well my algorithm takes less lines and is a lot easier to read than those in HLL which were suggested, but I hope it counts in spite of that :-). Here is it excerpted from a post I have made many years ago, calculates a 16 bit squre root of a 32 bit number. The example is in CPU32 (or CF) assembly.

*

calc. the square root of d1.l; return it in d1.l

destroys d2,d3,d4

sqrtd1 moveq #15,d2 counter clr.l d3 colect result here set loop,* bset d2,d3 try this move.w d3,d4 extra copy result so far mulu.w d4,d4 square cmp.l d1,d4 higher? bls keepbit branch not bclr d2,d3 else bit back down keepbit dbra d2,loop process all bits move.l d3,d1 update in d1 rts

I have Mr. Crenshaw's book "Math Toolkit for REAL-TIME Programming" "Square Root" runs from pages 43-87 and he discusses a boatload of methods. Unfortunately, I have misplaced the CD. Looking for it now.

// This one is REALLY bad for large inputs. unsigned long isqrt00(unsigned long x) { unsigned long s =3D 1; unsigned long remainder; if (x =3D s) { remainder -=3D s; s +=3D 2; } return (s - 1) >> 1; } // Integer Square Root function // Uses bisection to find square root

unsigned long isqrt01(unsigned long x) { unsigned long r, s, t; r =3D 0; for (t =3D 0x40000000; t; t >>=3D 2) { s =3D r + t; if (s >=3D 1; } return (r); }

#define LSIZE (sizeof(unsigned long)*CHAR_BIT)

/* Jos came up with this one... */ unsigned long isqrt06(unsigned long n) {

unsigned long l =3D 0, c =3D 0, y, k;

for (k =3D LSIZE >> 1; k--; n =3D 0; i--) if (y < xsq) { oldx =3D x; x -=3D 1 > 1; } } }

// Integer Square Root function // Uses bisection to find square root unsigned long isqrt05(unsigned long n) { unsigned long h, p =3D 0, q =3D 1, r =3D n;

while (q >=3D 1;

if (r >=3D h) { p +=3D q; r -=3D h; } }

return p; }

// Binomial Theorem -- O( 1/2 log2 N ) unsigned long isqrt07(unsigned long N) { unsigned long l2, u, v, u2, v2, uv2, n; if (2 > N) return (N); u =3D N; l2 =3D 0; while (u >>=3D 1) l2++; l2 >>=3D 1; u =3D 1L 24] > 22] =3D 0x4000000) return (sqq_table[n >> 20] > 18] =3D 0x100000) if (n >=3D 0x400000) return (sqq_table[n >> 16] > 14] =3D 0x40000) return (sqq_table[n >> 12] > 10] =3D 0x100) if (n >=3D 0x1000) if (n >=3D 0x4000) return (sqq_table[n >> 8]); else return (sqq_table[n >> 6] >> 1); else if (n >=3D 0x400) return (sqq_table[n >> 4] >> 2); else return (sqq_table[n >> 2] >> 3); else if (n >=3D 0x10) if (n >=3D 0x40) return (sqq_table[n] >> 4); else return (sqq_table[n > 4] > 6] =3D 0x10000) if (n >=3D 0x10000000) if (n >=3D 0x40000000) result =3D sq_table[n >> 16]; else result =3D sq_table[n >> 14] >> 1; else if (n >=3D 0x1000000) result =3D sq_table[n >> 12] >> 2; else result =3D sq_table[n >> 8] >> 4;

else return (sq_table[n] >> 8);

if (((result + 1) * (result + 1)) 2^16 // can be removed to get almost 2 times speed up, but accuracy will suffer. // however worst case will be out by 1.

return (result);

}

/* Fast integer square root */ unsigned long isqrt10(unsigned long x) { unsigned long squaredbit, remainder, root;

if (x =3D=3D 0) return 0;

/* Load the binary constant 01 00 00 ... 00, where the number of zero * bits to the right of the single one bit is even, and the one bit is as * far left as is consistant with that condition.) */ squaredbit =3D (long) ((((unsigned long) ~0L) >> 1) & ~(((unsigned long) ~0L) >> 2)); /* This portable load replaces the loop that used to be here, and was * donated by snipped-for-privacy@xmission.com */

/* Form bits of the answer. */ remainder =3D x; root =3D 0; while (squaredbit > 0) { if (remainder >=3D (squaredbit | root)) { remainder -=3D (squaredbit | root); root >>=3D 1; root |=3D squaredbit; } else { root >>=3D 1; } squaredbit >>=3D 2; } return root; }

if (xn * xn > x) /* Correct rounding if necessary */ xn--;

return xn; }

/* From:

formatting link

*/ /* Let us examine an implementation of the second method. By Jim Ulery
*/ unsigned short isqrt36(unsigned long val) { unsigned long temp, g =3D 0, b =3D 0x8000, bshft =3D 15; do { if (val >=3D (temp =3D (((g =3D 1); return g; }

/* From:

formatting link

*/ /* Here the division step is replaced with a 1-bit approximation * of that division. Jim Ulery's complete derivation is given here. * But there are still 15 branches here that will not be predicted * very well, as well as loop overhead. So unrolling is the first * improvement we can make. By Mark Crowne */ unsigned short isqrt37(unsigned long val) { unsigned int temp, g =3D 0;

if (val >=3D 0x40000000) { g =3D 0x8000; val -=3D 0x40000000; } #define INNER_ISQRT(s) \ temp =3D (g

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here.
All logos and trade names are the property of their respective owners.