Division Algorithms

Do you have a question? Post it now! No Registration Necessary

Translate This Thread From English to

Threaded View
I am looking for an algorithm to calculate a floating point divide. There a
re a number of options, but the one that is most clear to me and easiest to
 implement on the hardware I am designing seems to be the Newton?Ra
phson iterative method. I'm trying to understand how many iterations it wil
l take. The formula for that in the Wikipedia article assumes an initial es
timate for the reciprocal of 48/17 - D * 32/17. This would require 2 iterat
ions to gain 16 bits of accuracy or 3 iterations for IEEE 32 bit floating p
oint numbers. My needs are for something between that range, but would like
 to minimize the number of iterations... ideally one.

I believe most implementations use a table lookup for the seed value of the
 reciprocal of the divisor. In the FPGA I am using, a convenient size would
 be 2k entries (11 bit address) of 18 bits. I'm wondering if this would be  
sufficiently more accurate than the above estimate so that one iteration wo
uld be sufficient. I'm not clear on how to calculate this. I know in genera
l the formula converges rapidly with the error being squared, so I'm thinki
ng an initial value that is good to some 11 bits would produce more than 20
 bits for one turn of the Newton-Raphson crank (20 bits being a level of ac
curacy called "good enough" for this application). But I'd like to have som
ething that verifies this rather than using a rule of thumb. At some point  
I will probably have to justify the correctness of the calculations.

Because of the hardware facilities available the calculations have intermed
iate mantissas of 18 or 36 bits with 28 bits stored when in memory/stack ot
her than ToS which is the register in the ALU (>36 bits). I don't think 11  
bits is quite as good as required, but I'm thinking one crank of the NR mac
hine and I'm "good".  However, I need to be able to show how good. Anyone k
now of good documentation on this matter?

Oh, I guess I should mention later on I might be repeating this process for
 a square root calculation. A measurement using a sensor that is at least t
emporarily deprecated requires a square root. I managed to show the sensor  
was not capable of providing an accuracy at the low end that would produce  
meaningful results. But if they find a better sensor they may want to add t
hat sensor as an option and the square root will be back. No! The HORROR!

Whatever... It's just an algorithm...

--  

Rick C.

- Get 1,000 miles of free Supercharging
We've slightly trimmed the long signature. Click to see the full one.
Re: Division Algorithms
On Saturday, January 16, 2021 at 2:31:36 PM UTC-7, snipped-for-privacy@gmail.com  
wrote:
Quoted text here. Click to load it
 are a number of options, but the one that is most clear to me and easiest  
to implement on the hardware I am designing seems to be the Newton?
Raphson iterative method. I'm trying to understand how many iterations it w
ill take. The formula for that in the Wikipedia article assumes an initial  
estimate for the reciprocal of 48/17 - D * 32/17. This would require 2 iter
ations to gain 16 bits of accuracy or 3 iterations for IEEE 32 bit floating
 point numbers. My needs are for something between that range, but would li
ke to minimize the number of iterations... ideally one.  
Quoted text here. Click to load it
he reciprocal of the divisor. In the FPGA I am using, a convenient size wou
ld be 2k entries (11 bit address) of 18 bits. I'm wondering if this would b
e sufficiently more accurate than the above estimate so that one iteration  
would be sufficient. I'm not clear on how to calculate this. I know in gene
ral the formula converges rapidly with the error being squared, so I'm thin
king an initial value that is good to some 11 bits would produce more than  
20 bits for one turn of the Newton-Raphson crank (20 bits being a level of  
accuracy called "good enough" for this application). But I'd like to have s
omething that verifies this rather than using a rule of thumb. At some poin
t I will probably have to justify the correctness of the calculations.  
Quoted text here. Click to load it
ediate mantissas of 18 or 36 bits with 28 bits stored when in memory/stack  
other than ToS which is the register in the ALU (>36 bits). I don't think 1
1 bits is quite as good as required, but I'm thinking one crank of the NR m
achine and I'm "good". However, I need to be able to show how good. Anyone  
know of good documentation on this matter?  
Quoted text here. Click to load it
or a square root calculation. A measurement using a sensor that is at least
 temporarily deprecated requires a square root. I managed to show the senso
r was not capable of providing an accuracy at the low end that would produc
e meaningful results. But if they find a better sensor they may want to add
 that sensor as an option and the square root will be back. No! The HORROR!
  
Quoted text here. Click to load it

You don't mention throughput or latency requirements, just that you want so
mething non-iterative.  A resource I've used before is the Altera Advanced  
Synthesis Cookbook.  You can find it online and it has algorithms/source fo
r both "approximate" floating-point divider and root modules.  These both s
acrifice a bit of precision for low gate count.  The divider is pipelined a
nd has a latency of about 5 cycles.  Maybe it could be helpful.

Re: Division Algorithms
On Monday, January 18, 2021 at 8:13:02 PM UTC-5, Kevin Neilson wrote:
Quoted text here. Click to load it
m wrote:  
Quoted text here. Click to load it
re are a number of options, but the one that is most clear to me and easies
t to implement on the hardware I am designing seems to be the Newton?
?Raphson iterative method. I'm trying to understand how many iterations i
t will take. The formula for that in the Wikipedia article assumes an initi
al estimate for the reciprocal of 48/17 - D * 32/17. This would require 2 i
terations to gain 16 bits of accuracy or 3 iterations for IEEE 32 bit float
ing point numbers. My needs are for something between that range, but would
 like to minimize the number of iterations... ideally one.  
Quoted text here. Click to load it
 the reciprocal of the divisor. In the FPGA I am using, a convenient size w
ould be 2k entries (11 bit address) of 18 bits. I'm wondering if this would
 be sufficiently more accurate than the above estimate so that one iteratio
n would be sufficient. I'm not clear on how to calculate this. I know in ge
neral the formula converges rapidly with the error being squared, so I'm th
inking an initial value that is good to some 11 bits would produce more tha
n 20 bits for one turn of the Newton-Raphson crank (20 bits being a level o
f accuracy called "good enough" for this application). But I'd like to have
 something that verifies this rather than using a rule of thumb. At some po
int I will probably have to justify the correctness of the calculations.  
Quoted text here. Click to load it
rmediate mantissas of 18 or 36 bits with 28 bits stored when in memory/stac
k other than ToS which is the register in the ALU (>36 bits). I don't think
 11 bits is quite as good as required, but I'm thinking one crank of the NR
 machine and I'm "good". However, I need to be able to show how good. Anyon
e know of good documentation on this matter?  
Quoted text here. Click to load it
 for a square root calculation. A measurement using a sensor that is at lea
st temporarily deprecated requires a square root. I managed to show the sen
sor was not capable of providing an accuracy at the low end that would prod
uce meaningful results. But if they find a better sensor they may want to a
dd that sensor as an option and the square root will be back. No! The HORRO
R!  
Quoted text here. Click to load it
something non-iterative. A resource I've used before is the Altera Advanced
 Synthesis Cookbook. You can find it online and it has algorithms/source fo
r both "approximate" floating-point divider and root modules. These both sa
crifice a bit of precision for low gate count. The divider is pipelined and
 has a latency of about 5 cycles. Maybe it could be helpful.

I'm not looking for pipelined solutions because speed is not an issue.  I d
on't want to bother with shift and subtract not because it is slow, but bec
ause each step in this design would be controlled by a ROM and while I don'
t expect to worry about the ROM size, if it uses 18 locations for each divi
de in the process I might need to focus on that.  

The floating point NR approach seems to work ok.  After considering the loo
k up table in detail it seems like it will work fine with one pass of the N
R algorithm, about 5 steps in all.  Not bad.  

--  

Rick C.

+ Get 1,000 miles of free Supercharging
We've slightly trimmed the long signature. Click to see the full one.

Site Timeline