# Extending floating point precision

• posted

My project team was having a discussion about doing math in assembly language, and an interesting side-issue question was raised that nobody could answer.

We all know how to extend all sorts of math operations in fixed point with more bits than are native to the processor, and we all know how to plug in a floating point library or use a floating point co-processor if there is one. But is it possible to extend floating point math operations to have more bits than are native to the co-processor?

For example, let's say that I have a math co-processor that handles ANSI/IEEE Std 754-1985 floating point in single-precision 32 bit and double-precision 64 bit:

Single-precision 32 bit:

sign: 1 bit, exponent: 8 bits, mantissa: 23 bits
• posted

Yes, if by "clever" you mean largely obvious. Just use a vector of floats instead of a scalar, require "exact rounding", and follow a few simple rules about the operations. It can be coded portably, I believe. There must be a lot of web pages on the subject, I'd imagine. I believe I read about this some 15 years ago, and I'm sure it was old even then.

Jon

• posted

Why?

• posted

(1) Because I don't care to have my posts in any way under the control of anyone else -- even if that control remains unexercised -- where that's possible. (2) And perhaps more especially, under the specific circumstances with which I was presented in this case.

...

Regardless, do you see the obvious? Or would you like some easy examples to help illustrate, more precisely? I think I can remember enough to provide some concrete examples.

Jon

• posted

I understand. If you ever wish to post there without any chance of outside control, I will be happy to make you a moderator.

Maybe I am just being dumb today, but I don't see the obvious, and neither did the other two engineers with whom I was having lunch. That being said, just knowing that it's possible is probably enough for me to get it, so let me work on it tomorrow when I am fresh. Thanks!

• posted

My life is filled enough as it is....

That's the right spirit!

A clue: you implied that you want to be able to use the existing low level operators to combine values, yet to do so it must be done without any rational chance of exceeding the final precision in the result. This result is, of course, usually of the same size as the values supplied to the binary operators. Looking at multiplication should stress this point more clearly than addition, by the way. The way to achieve the extension should be clearer once you stare that problem squarely in the face and figure out how to deal with it. Just focus on multiplication of two values.

But I'm also sure that an almost trivial search of the web will pop up more than a few somethings on this subject. I'm sure it's just too important to have somehow escaped getting some moderate level of attention there.

Jon

• posted

Yes, it certainly is possible, and there are several well-known math packages like Maple, Mathematica, (maybe Matlab as well) that will do arbitrary-precision arithmetic in both fixed point and floating point. Depending on which platform you are using, these programs will either use the co-processor, do the floating point calculations entirely in software, or use a combination of both. About 10 years ago, you could buy an ISA card called the "Dubner Cruncher" that would do arithmetic on huge numbers (2000 digits or more) entirely in hardware. If you want source code for doing arbitrary precision fixed point and floating point math, I believe YACAS is fully open source and it can do both.

--Tom.

• posted

... snip ...

Are you talking about the usual practice of expressing a as

a = a0 + a1 * 2^N

where the ai are in the precision we have, the a is in the precision we want, and N quantifies the precision we have. We can then do multiplications etc. and combine the portions. The problem in the floating-point world is that we can't control the normalization and rounding done. That action on the more significant portion will mask anything from the less significant portions. At least as I see it.

What we can do is to build an arbitrary precision set of integer operations, and then build a single floating point mechanism upon that. To me that means there is no point in having a hardware floating point processor at all, if all we need is the extended float.

```--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
Available for consulting/temporary embedded and systems.```
• posted

perhaps?

Paul Burke

• posted

No, that's not what I'm talking about. And yes, rounding is very important to consider -- exact rounding.

Jon

• posted

I don't know. I'll have to look.

Jon

• posted

[...]

Have you looked in the Forth Scientific Library code? It sounds like the sort of thing that someone like Julian Noble might have accomplished at some point in his work. Posting this question to clf might yield answers for you as well.

Not needing to use FP for much of my control work, it is not a question I have considered myself.

```--
********************************************************************
Paul E. Bennett ....................```
• posted

Alright, what are you talking about? Rounding is inherently inexact. I see no way of using the usual FP system to extend its own precision. Range, yes. Precision, no.

• posted

My Windows box says that the .zip at

is corrupt. I will look at the .tar.gz when I reboot into Linux. I suspect that I will find that it's built out of ordinary fixed point instructions; restricting myself to floating point instructions is rather a lot like restricting my poetry to iambic pentameter. :)

• posted

That would certainly accomplish the task, but it wouldn't be the learning experience I was seeking. I already have libraries with source and have written my own routines using fixed point instructions; what I am trying to do is to fix a hole in my knowledge: how to do it within the constraints of using the operations in an FPU as my starting point. It's something that I should already know, but I have a tendency to just use the FPU as if it were a fast library call.

I am still working on building up 128-bit floating point routines out of 64-bit or 80-bit floating point routines, as opposed to integer math. Right now I am struggling with the fact that the most-significant part mucks about with the lowest bit to make it normal or to round it, and that hoses any hope of adding on a bunch of precision with the least-significant part. I have just started on it, though.

• posted

Ah. Yes, I see how to do that - build an arbitrary precision set of integer operations, and then build a single floating point mechanism upon that. That's probably the most efficient way to do it as well, under normal circumstances. Then again, some of the newer processors can pipeline FPU instructions without slowing down the rest of the CPU, so building extended precision math out of existing floating point instructions would be essentially free. The way I write code, the FPU has a *lot* of time on its hands...

• posted

Guy Macon wrote:

Hello,

if we use a different organisation for quadruple precision like this:

sign: 1 bit, exponent: 11 bits, mantissa: 104 bits (+2)

we can use all the given double precision operations to calculate like this. Every quad number consists of two double numbers, the most and the least significant part:

a = a1 + a2

We can add:

a + b = (a1 + b1) + (a2 + b2)

and multiply:

a * b = (a1 + a2) * (b1 + b2) = a1*b1 + a2*b1 + a1*b2 + a2*b2

The subtraction is done like the addition. All these operations will use the double precision floating point unit.

But the problem is the division a/b = (a1+a2)/(b1+b2)

Calculating the carries is also a problem.

It seems to be easier not to use double fp operations, and instead to use an exponent of 32 bits and a mantissa of three or four words of 32 bits each. All operations on these quad fp numbers can then be done with integer operations.

Bye

• posted

[...]

I think that's exactly it. As soon as you lose a quantum of precision on a boundary, everything to the right of that boundary is useless. It may appear OK, but that would be a dangerous assumption; after many dozens or thousands of additional operations, the less significant portion of the numbers would be unrecognizable as well as being wrong.

I think that's probably the only way.

• posted

Hi, some available options are:

1. The GNU Multiple Precision Arithmetic Library:
2. Some packages present on:
   for example the Quad-Double (QD) and/or MPFUN
3. The "doubledouble" package: