Floating Point Divide

I am looking for an algorithm to calculate a floating point divide for an FPGA. I don't think I need help with the details of working in floating point. At one time I worked on an array processor that did everything in floating point and was programmed in microcode, so I am at least conversant in the topic even if that was some 40 years ago.

There are a number of choices for doing a divide, but the one that is most clear to me and easiest to implement on the hardware I am designing seems to be the Newton-Raphson iterative method. I'm trying to understand how many iterations it will take. The formula in the Wikipedia article assumes an initial estimate for the reciprocal of the divisor of X0 = 48/17 - 32/17 * D. From that estimate it would take 2 iterations to gain 16 bits of accuracy, or 3 iterations for IEEE 32 bit floating point numbers. My needs fall somewhere in that range, but I would like to minimize the number of iterations... ideally one.
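To make the iteration concrete, here is a quick floating point model of it (just a sketch of the math, not the FPGA datapath). It scales the divisor mantissa into [0.5, 1) as the article does, seeds with 48/17 - 32/17 * D, and reports how many correct bits each pass of X(n+1) = X(n) * (2 - D * X(n)) delivers:

import math

def nr_reciprocal_bits(d, iterations=3):
    # Model the Newton-Raphson reciprocal refinement for a divisor d scaled
    # into [0.5, 1). Seed: x0 = 48/17 - 32/17 * d. Each pass computes
    # x = x * (2 - d * x); the relative error roughly squares per pass,
    # so the number of correct bits roughly doubles.
    x = 48.0 / 17.0 - 32.0 / 17.0 * d
    bits = []
    for _ in range(iterations):
        x = x * (2.0 - d * x)
        err = abs(x - 1.0 / d) * d        # relative error after this pass
        bits.append(-math.log2(err) if err else float('inf'))
    return bits

# Worst case for the linear seed is at the ends of [0.5, 1):
print(nr_reciprocal_bits(0.5))       # roughly 8, 16, 33 correct bits
print(nr_reciprocal_bits(0.9999))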

I believe most implementations use a table lookup for the seed value of the reciprocal of the divisor. In the FPGA I am using, a convenient size would be 2k entries (11 bit address) of 18 bits. I'm wondering if this would be sufficiently more accurate than the above estimate so that one iteration would be sufficient. I'm not clear on how to calculate this. I know in general the formula converges rapidly with the error being squared, so I'm thinking an initial value that is good to some 11 bits would produce more than 20 bits for one turn of the Newton-Raphson crank (20 bits being a level of accuracy called "good enough" for this application). But I'd like to have something that verifies this rather than using a rule of thumb. At some point I will probably have to justify the correctness of the calculations.
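One way to verify it rather than lean on the rule of thumb is to sweep every table entry and bound the worst case. The sketch below assumes one plausible arrangement (divisor mantissa normalized to [1, 2), table indexed by its top 11 fraction bits, each entry holding the reciprocal of the interval midpoint rounded to 18 bits); it is not necessarily the exact table I would build:

import math

def table_seed_bits(addr_bits=11, entry_bits=18):
    # Exhaustively bound the relative error of the table seed and of the
    # result after one Newton-Raphson pass x1 = x0 * (2 - d * x0).
    worst_seed = 0.0
    worst_refined = 0.0
    for i in range(1 << addr_bits):
        lo = 1.0 + i / (1 << addr_bits)          # interval of mantissas
        hi = 1.0 + (i + 1) / (1 << addr_bits)    #   hitting this entry
        mid = 0.5 * (lo + hi)
        seed = round((1.0 / mid) * (1 << entry_bits)) / (1 << entry_bits)
        for d in (lo, hi):                       # error is worst at the ends
            e0 = abs(1.0 - d * seed)             # relative seed error
            x1 = seed * (2.0 - d * seed)         # one NR pass
            worst_seed = max(worst_seed, e0)
            worst_refined = max(worst_refined, abs(1.0 - d * x1))
    return -math.log2(worst_seed), -math.log2(worst_refined)

print(table_seed_bits())   # about 12 bits from the table, about 24 after one pass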

Because of the hardware facilities available the calculations have intermediate mantissas of 18 or 36 bits, with 28 bits stored when in memory/stack other than ToS, which is the register in the ALU (>36 bits). I don't think 11 bits is quite as good as required, but I'm thinking one crank of the NR machine and I'm "good". However, I need to be able to show how good. Anyone know of good documentation on this matter?
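For what it's worth, the 18x18 multipliers look wide enough if the correction term is handled as a small quantity rather than recomputed at full width. This is only a model under assumed fixed-point formats (Q1.17 operands, 36 bit products, top bits of the residual fed back), not the actual datapath:

import math

def nr_step_18bit(d, seed):
    # One NR pass built from 18x18 -> 36 bit multiplies (assumed formats):
    #   d    : divisor mantissa in [1, 2), held as Q1.17
    #   seed : reciprocal estimate in (0.5, 1], held as Q1.17
    d_fix = round(d * (1 << 17))       # Q1.17 divisor
    x_fix = round(seed * (1 << 17))    # Q1.17 seed
    p = d_fix * x_fix                  # d*x0 as a 36 bit Q2.34 product
    e = (1 << 34) - p                  # residual 1 - d*x0 (about 2^22 for a 12 bit seed)
    e18 = e >> 6                       # keep its upper bits for the next 18x18 multiply
    corr = x_fix * e18                 # x0*e, scaled by 2^45
    x1 = (x_fix << 28) + corr          # x0 + x0*e, scaled by 2^45
    err = abs(x1 / (1 << 45) - 1.0 / d) * d
    return -math.log2(err) if err else float('inf')

# A seed in error by about 2^-12 (roughly what an 11 bit table delivers)
print(nr_step_18bit(1.5, 1.0 / 1.5 + 2**-12))    # comes out around 23 bits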

Oh, I guess I should mention later on I might be repeating this process for a square root calculation. A measurement using a sensor that is at least temporarily deprecated requires a square root. I managed to show the sensor was not capable of providing an accuracy at the low end that would produce meaningful results. But if they find a better sensor they may want to add that crappy sensor as an option and the square root will be back. No! The HORROR!
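If it does come back, the same machinery should cover it. One common approach (an assumption on my part, not a settled plan) is Newton-Raphson on the reciprocal square root, y(n+1) = y(n) * (3 - x * y(n)^2) / 2, again seeded from a table, with sqrt(x) = x * (1/sqrt(x)) as a final multiply so no divide is needed:

import math

def nr_sqrt(x, seed, iterations=1):
    # Newton-Raphson on the reciprocal square root; the relative error
    # roughly squares per pass, just like the reciprocal iteration.
    y = seed
    for _ in range(iterations):
        y = y * (3.0 - x * y * y) * 0.5
    return x * y                       # sqrt(x) = x * (1/sqrt(x))

# A seed good to roughly 12 bits comes out well past 20 bits after one pass
x = 1.7
approx = nr_sqrt(x, 1.0 / math.sqrt(x) + 2**-12)
print(approx, math.sqrt(x), -math.log2(abs(approx / math.sqrt(x) - 1.0)))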

Whatever... It's just an algorithm...

--

Rick C. 

- Get 1,000 miles of free Supercharging 
- Tesla referral code - https://ts.la/richard11209
Reply to
Rick C

If you are using Intel/Altera, use the ALTFP_DIV Megafunction.

Reply to
Andy Bennet

Forgot to add:-

formatting link

Reply to
Andy Bennet

How fast does it need to be? If the input comes from a sensor, then perhaps a SW implementation would be a lot easier and use far fewer resources.

IEEE 754 and 854 have full support for denormals. That would add a lot of complexity, so make sure you need it.

The wizards at Intel and AMD can't do that in a single cycle, and they are not resource limited. The reciprocal estimation alone takes 3 cycles. If you can do better than that, patent it.

The simplest approach is to bit-invert the divisor. It is good to about 3 bits, which is plenty given the simplicity.
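To spell out the trick (the exact representation here is my own assumption): complement the fraction bits of the normalized mantissa 1.f and halve the result, which approximates 1/(1 + f) to about 3 bits in the worst case:

def bit_invert_seed(frac_bits, frac_width=17):
    # Ones-complement of the fraction field of a mantissa 1.f in [1, 2),
    # re-read as 0.5 * (1 + ~f): approximates 1/(1 + f) with a worst-case
    # relative error of f * (1 - f) / 2, i.e. about 2^-3 near f = 0.5.
    mask = (1 << frac_width) - 1
    inv = (~frac_bits) & mask
    return 0.5 * (1.0 + inv / (1 << frac_width))

d = 1.5                                     # worst case is mid-range
frac = round((d - 1.0) * (1 << 17))
seed = bit_invert_seed(frac)
print(seed, 1.0 / d, abs(seed * d - 1.0))   # relative error about 0.125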

Use SW if you can. BRAMs are cheap; you can use one or two to make a simple CPU, which you already know.

Best regards, Piotr

Reply to
Piotr Wyderski


Trust me, I would not do this if it was not the way we need to do it.


Not sure what you mean. I don't plan to implement IEEE floating point, just floating point. It is easier than trying to track the scale factor through fixed point calculations... I think.


Hardly. The obvious initial estimate is a table lookup, which I believe Intel is using. Wasn't that the source of the bug in the Pentium floating point issue many years ago? That is what keeps turning up in my searches, more than useful information.

At this point I found another resource that talked about the table lookup, which indicates an 11 bit input provides at least 9 bits out. Run one iteration of NR and it gives 18 bits of accuracy, so I think I'm good.


Why is a table lookup not simple? Block RAMs are plentiful and easy to use. By "invert" do you mean a two's complement? I'm not sure how that gets anything useful. Or do you mean a bitwise inversion of the unsigned mantissa? Even so, that would require three passes of NR iteration. A 9 bit accurate table lookup with one NR pass seems to be the way to go. Very little extra hardware.


The goal is to make this NOT a CPU, with no "software". That's the requirement. It's not something I care to discuss. It's not like hardware is fundamentally difficult. It's the same algorithm in software or hardware. I'm just trying to learn the algorithm and its uses and limitations.

If I'm doing it right, I've got it down to 6 clock cycles. With one extra data path it would be 5 clock cycles. Speed is not an issue, so I most likely will go with 6 clock cycles. I'm trying to make the hardware and its use as simple as possible.

--

Rick C. 

+ Get 1,000 miles of free Supercharging 
+ Tesla referral code - https://ts.la/richard11209
Reply to
Rick C
