Floating Point Divide

Do you have a question? Post it now! No Registration Necessary

Translate This Thread From English to

Threaded View
I am looking for an algorithm to calculate a floating point divide for an F
PGA.  I don't think I need help with the details of working in floating poi
nt.  At one time I worked on an array processor that did everything in floa
ting point and was programmed in microcode.  So I am at least conversant in
 the topic even if  that was some 40 years ago.  

There are a number of choices for doing a divide, but the one that is most  
clear to me and easiest to implement on the hardware I am designing seems t
o be the Newton?Raphson iterative method. I'm trying to understand  
how many iterations it will take. The formula for that in the Wikipedia art
icle assumes an initial estimate for the reciprocal of the divisor; 48/17 -
 D * 32/17. This would require 2 iterations to gain 16 bits of accuracy or  
3 iterations for IEEE 32 bit floating point numbers. My needs are for somet
hing between that range, but I would like to minimize the number of iterati
ons... ideally one.

I believe most implementations use a table lookup for the seed value of the
 reciprocal of the divisor. In the FPGA I am using, a convenient size would
 be 2k entries (11 bit address) of 18 bits. I'm wondering if this would be  
sufficiently more accurate than the above estimate so that one iteration wo
uld be sufficient. I'm not clear on how to calculate this. I know in genera
l the formula converges rapidly with the error being squared, so I'm thinki
ng an initial value that is good to some 11 bits would produce more than 20
 bits for one turn of the Newton-Raphson crank (20 bits being a level of ac
curacy called "good enough" for this application). But I'd like to have som
ething that verifies this rather than using a rule of thumb. At some point  
I will probably have to justify the correctness of the calculations.

Because of the hardware facilities available the calculations have intermed
iate mantissas of 18 or 36 bits with 28 bits stored when in memory/stack ot
her than ToS which is the register in the ALU (>36 bits). I don't think 11  
bits is quite as good as required, but I'm thinking one crank of the NR mac
hine and I'm "good".  However, I need to be able to show how good. Anyone k
now of good documentation on this matter?

Oh, I guess I should mention later on I might be repeating this process for
 a square root calculation. A measurement using a sensor that is at least t
emporarily deprecated requires a square root. I managed to show the sensor  
was not capable of providing an accuracy at the low end that would produce  
meaningful results. But if they find a better sensor they may want to add t
hat crappy sensor as an option and the square root will be back. No! The HO
RROR!

Whatever... It's just an algorithm...

--  

Rick C.

- Get 1,000 miles of free Supercharging
We've slightly trimmed the long signature. Click to see the full one.
Re: Floating Point Divide
On 16/01/2021 21:42, Rick C wrote:
Quoted text here. Click to load it

If you are using Intel/Altera use the ALTFP_DIV Megafunction.

Re: Floating Point Divide
On 17/01/2021 10:32, Andy Bennet wrote:
Quoted text here. Click to load it




Quoted text here. Click to load it

Quoted text here. Click to load it

Forgot to add:-
https://www.intel.co.jp/content/dam/altera-www/global/ja_JP/pdfs/literature/ug/ug_altfp_div.pdf

Re: Floating Point Divide
Rick C wrote:

Quoted text here. Click to load it

How fast does it need to be? If the input comes from a sensor, then  
perhaps a SW implementation would be a lot easier and use far fewer  
resources.

Quoted text here. Click to load it

IEEE754 and 854 has full support for denormals. It would add a lot of  
complexity, so make sure you need this.

Quoted text here. Click to load it

The wizards at Intel and AMD can't do that in a single cycle and they  
are not resource limited. The reciprocal alone estimation takes 3  
cycles. If you can do better than that, patent that.

Quoted text here. Click to load it

The simplest approach is to bit-invert the divisor. Good to 3 bits,  
which is plenty, given the simplicity.

Quoted text here. Click to load it

Use SW if you can. BRAMs are cheap, you can use one or two to make a  
simple CPU, which you already know.

    Best regards, Piotr

Re: Floating Point Divide
On Sunday, January 17, 2021 at 6:31:38 AM UTC-5, Piotr Wyderski wrote:
Quoted text here. Click to load it
an FPGA.

Trust me, I would not do this if it was not the way we need to do it.  


Quoted text here. Click to load it
ost clear to me and easiest to implement on the hardware I am designing see
ms to be the Newton?Raphson iterative method. I'm trying to underst
and how many iterations it will take. The formula for that in the Wikipedia
 article assumes an initial estimate for the reciprocal of the divisor; 48/
17 - D * 32/17. This would require 2 iterations to gain 16 bits of accuracy
 or 3 iterations for IEEE 32 bit floating point numbers.
Quoted text here. Click to load it

Now sure what you mean.  I don't plan to implement IEEE floating point, jus
t floating point.  It is easier than trying to track the scale factor throu
gh fixed point calculations... I think.  


Quoted text here. Click to load it
mize the number of iterations... ideally one.
Quoted text here. Click to load it

Hardly.  The obvious initial estimate is a table lookup which I believe Int
el is using.  Wasn't that the bug in the Pentium floating point issue many  
years ago?  That is what keeps turning up in my searches, more than useful  
information.  

At this point I found another resource that talked about the table lookup w
hich indicates an 11 bit input provides at least 9 bits out.  Run one itera
tion of NR and it gives 18 bits of accuracy, so I think I'm good.  


Quoted text here. Click to load it
 the reciprocal of the divisor.
Quoted text here. Click to load it

Why is a table lookup not simple.  Block RAMs are plentiful and easy to use
.  By "invert" do you mean a two's complement?  I'm not sure how that gets  
anything useful.  Or you you mean a bit wise inversion of the unsigned mant
issa?   Even so, that would require three passes of NR iteration.   9 bit a
ccurate table lookup with one NR pass seems to be the way to go.  Very litt
le extra hardware.  


Quoted text here. Click to load it
 for a square root calculation.
Quoted text here. Click to load it

The goal is to make this NOT a CPU and no "software".  That's the requireme
nt.  It's not something I care to discuss.  It's not like hardware is funda
mentally difficult.  It's the same algorithm in software or hardware.  I'm  
just trying to learn the algorithm and its uses and limitations.  

If I'm doing it right, I've got it down to 6 clock cycles.  With one extra  
data path it would be 5 clock cycles.  Speed is not an issue, so I most lik
ely will go with 6 clock cycles.  I'm trying to make the hardware and USE a
s simple as possible.  

--  

Rick C.

+ Get 1,000 miles of free Supercharging
We've slightly trimmed the long signature. Click to see the full one.

Site Timeline