Floating Point Divide
Posted on January 16, 2021, 9:42 pm
I am looking for an algorithm to calculate a floating point divide for an FPGA. I don't think I need help with the details of working in floating point. At one time I worked on an array processor that did everything in floating point and was programmed in microcode, so I am at least conversant in the topic even if that was some 40 years ago.
There are a number of choices for doing a divide, but the one that is most clear to me and easiest to implement on the hardware I am designing seems to be the Newton-Raphson iterative method. I'm trying to understand how many iterations it will take. The formula in the Wikipedia article assumes an initial estimate for the reciprocal of the divisor of 48/17 - D * 32/17. This would require 2 iterations to gain 16 bits of accuracy, or 3 iterations for IEEE 32-bit floating point numbers. My needs fall between that range, but I would like to minimize the number of iterations... ideally one.
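For reference, the iteration in question is only a couple of multiplies per pass; a minimal sketch (Python used only to count correct bits; the 48/17 - 32/17*D seed assumes the divisor is scaled into [0.5, 1], per the Wikipedia article, and the sample divisor is arbitrary):

```python
import math

def nr_reciprocal(d, iterations):
    """Newton-Raphson reciprocal of d, for d scaled into [0.5, 1]
    (the range the 48/17 - 32/17*D seed is designed for)."""
    x = 48.0 / 17.0 - 32.0 / 17.0 * d
    for _ in range(iterations):
        x = x * (2.0 - d * x)   # relative error roughly squares each pass
    return x

# How many correct bits does each iteration buy for one sample divisor?
d = 0.6
for n in range(4):
    err = abs(nr_reciprocal(d, n) - 1.0 / d)
    print(f"{n} iterations: ~{-math.log2(err):.1f} correct bits")
```

The bit count roughly doubles per pass, which is the quadratic convergence the iteration-count formula relies on.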
I believe most implementations use a table lookup for the seed value of the reciprocal of the divisor. In the FPGA I am using, a convenient size would be 2k entries (11-bit address) of 18 bits. I'm wondering if this would be sufficiently more accurate than the above estimate that one iteration would be sufficient. I'm not clear on how to calculate this. I know in general the formula converges rapidly, with the error being squared, so I'm thinking an initial value that is good to some 11 bits would produce more than 20 bits for one turn of the Newton-Raphson crank (20 bits being a level of accuracy called "good enough" for this application). But I'd like to have something that verifies this rather than using a rule of thumb. At some point I will probably have to justify the correctness of the calculations.
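One way to get past the rule of thumb is to exhaustively sweep the seed table and record the worst-case error before and after a single pass. A sketch under stated assumptions: the table organization here (11 address bits indexing the top mantissa bits of a divisor normalized to [1, 2), entries holding the reciprocal of the bin midpoint rounded to 18 bits) is my guess at the setup, not the poster's actual design.

```python
import math

ADDR_BITS = 11           # 2k-entry seed table
ENTRY_BITS = 18          # stored seed width
N = 1 << ADDR_BITS

worst_seed_err = 0.0     # worst relative error of the raw table seed
worst_post_err = 0.0     # worst relative error after one NR pass
for i in range(N):
    lo = 1.0 + i / N                 # divisor range covered by this bin
    hi = 1.0 + (i + 1) / N
    mid = (lo + hi) / 2.0
    # Seed: reciprocal of the bin midpoint, rounded to ENTRY_BITS fraction bits.
    seed = round((1.0 / mid) * (1 << ENTRY_BITS)) / (1 << ENTRY_BITS)
    for d in (lo, hi):               # worst case is at a bin endpoint
        worst_seed_err = max(worst_seed_err, abs(seed - 1.0 / d) * d)
        x1 = seed * (2.0 - d * seed)                 # one NR iteration
        worst_post_err = max(worst_post_err, abs(x1 - 1.0 / d) * d)

print(f"seed good to ~{-math.log2(worst_seed_err):.1f} bits")
print(f"after 1 NR pass: ~{-math.log2(worst_post_err):.1f} bits")
```

With these assumptions the sweep shows roughly 12 good bits from the table and better than 20 bits after one pass; the same exhaustive-sweep approach would also serve as the justification evidence, since 2k entries is small enough to test every bin at full precision.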
Because of the hardware facilities available, the calculations have intermediate mantissas of 18 or 36 bits, with 28 bits stored when in memory/stack other than ToS, which is the register in the ALU (>36 bits). I don't think 11 bits is quite as good as required, but I'm thinking one crank of the NR machine and I'm "good". However, I need to be able to show how good. Anyone know of good documentation on this matter?
Oh, I guess I should mention that later on I might be repeating this process for a square root calculation. A measurement using a sensor that is at least temporarily deprecated requires a square root. I managed to show the sensor was not capable of providing an accuracy at the low end that would produce meaningful results. But if they find a better sensor they may want to add that crappy sensor as an option, and the square root will be back. No! The HORROR!

Whatever... It's just an algorithm...
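If the square root does come back, essentially the same machinery applies: Newton-Raphson for the reciprocal square root needs only multiplies, and one final multiply by the input yields the root with no divide. A sketch (the crude constant seed here is just for illustration; a real design would presumably reuse the table-lookup idea):

```python
def nr_rsqrt(d, x, iterations):
    """Newton-Raphson for 1/sqrt(d): x' = x * (3 - d*x*x) / 2.
    Multiply the result by d to get sqrt(d) without a divide."""
    for _ in range(iterations):
        x = x * (3.0 - d * x * x) / 2.0
    return x

d = 1.69
root = nr_rsqrt(d, 0.8, 4) * d   # crude seed 0.8 for d in [1, 2)
print(root)                       # converges to sqrt(1.69) = 1.3
```

The convergence is quadratic here too, so the same seed-accuracy/iteration-count tradeoff analysis carries over.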
--
Rick C.
- Get 1,000 miles of free Supercharging
Re: Floating Point Divide
How fast does it need to be? If the input comes from a sensor, then
perhaps a SW implementation would be a lot easier and use far fewer
resources.
IEEE 754 and 854 have full support for denormals. That would add a lot of complexity, so make sure you need it.
The wizards at Intel and AMD can't do this in a single cycle, and they are not resource limited. The reciprocal estimation alone takes 3 cycles. If you can do better than that, patent it.
The simplest approach is to bit-invert the divisor. Good to 3 bits,
which is plenty, given the simplicity.
Use SW if you can. BRAMs are cheap, you can use one or two to make a
simple CPU, which you already know.
Best regards, Piotr
Re: Floating Point Divide
On Sunday, January 17, 2021 at 6:31:38 AM UTC-5, Piotr Wyderski wrote:
> ... an FPGA.

Trust me, I would not do this if it was not the way we need to do it.
> ... most clear to me and easiest to implement on the hardware I am designing seems to be the Newton-Raphson iterative method. I'm trying to understand how many iterations it will take. The formula for that in the Wikipedia article assumes an initial estimate for the reciprocal of the divisor; 48/17 - D * 32/17. This would require 2 iterations to gain 16 bits of accuracy or 3 iterations for IEEE 32 bit floating point numbers.

Not sure what you mean. I don't plan to implement IEEE floating point, just floating point. It is easier than trying to track the scale factor through fixed point calculations... I think.
> ... minimize the number of iterations... ideally one.

Hardly. The obvious initial estimate is a table lookup, which I believe Intel is using. Wasn't that the bug in the Pentium floating point issue many years ago? That is what keeps turning up in my searches, more than useful information.

At this point I found another resource that talked about the table lookup, which indicates an 11-bit input provides at least 9 bits out. Run one iteration of NR and it gives 18 bits of accuracy, so I think I'm good.
> ... the reciprocal of the divisor.

Why is a table lookup not simple? Block RAMs are plentiful and easy to use. By "invert" do you mean a two's complement? I'm not sure how that gets anything useful. Or do you mean a bit-wise inversion of the unsigned mantissa? Even so, that would require three passes of NR iteration. A 9-bit accurate table lookup with one NR pass seems to be the way to go. Very little extra hardware.
> ... for a square root calculation.

The goal is to make this NOT a CPU, with no "software". That's the requirement. It's not something I care to discuss. It's not like hardware is fundamentally difficult; it's the same algorithm in software or hardware. I'm just trying to learn the algorithm and its uses and limitations.

If I'm doing it right, I've got it down to 6 clock cycles. With one extra data path it would be 5 clock cycles. Speed is not an issue, so I most likely will go with 6 clock cycles. I'm trying to make the hardware and its USE as simple as possible.
--
Rick C.
+ Get 1,000 miles of free Supercharging