# Extending floating point precision

#### Do you have a question? Post it now! No Registration Necessary

Translate This Thread From English to

•  Subject
• Author
• Posted on

My project team was having a discussion about doing math in
assembly language, and an interesting side-issue question was

We all know how to extend all sorts of math operations in fixed
point with more bits than are native to the processor. and we
all know how to plug in a floating point library or to use a
floating point co-processor if there is one, but is it possible
to extend floating point math operations to have more bits than
are native to the co-processor?

For example, let's say that I have a math co-processor that handles
ANSI/IEEE Std 754-1985 floating point in single-precision 32 bit
and double-precision 64 bit:

Single-precision 32 bit

1    8       23     <-- Width
+--+-----+----------+
|Si| Exp | Mantissa |
+--+-----+----------+
|31|30 23|22       0| <-- Bit #

length    32 bits
sign      1 bit
exponent  8 bits
mantissa  23 bits(+1)
bias      127

Double-precision 64 bit

1     11              52           <-- Width
+--+---------+--------------------+
|Si|   Exp   |      Mantissa      |
+--+---------+--------------------+
|63|62     52|51                 0| <-- Bit #

length    64 bits
sign      1 bit
exponent  11 bits
mantissa  52 bits(+1)
bias      1,023

Is it possible, through some fiendishly clever scheme
of co-processor operations, carries, multiplies, and
black magic, to write a routine to do math in "Quadruple
precision" like this?

1        22                       105
+---+-------------+-----------------------------------+
|Si |     Exp     |              Mantissa             |
+---+-------------+-----------------------------------+
|127|126       105|104                               0|

length    128 bits
sign      1 bit
exponent  22 bits
mantissa  105 bits(+1)
bias      2,097,151

(Or maybe exponent23% bits and mantissa10%4 bits(+1)..)

To answer the obvious questions, no, I don't have a use for
such precision (nobody has asked me to measure the width of
the universe with a precision of the width of a neutron
lately), and yes, I do know how to do this with a FP routine
written in integer math.  What I am curious about is whether
I can do it through a clever scheme of carries, multiplies,
etc. on the results from 64-bit floating point hardware.

--
Guy Macon <http://www.guymacon.com/

Re: Extending floating point precision
(restricted to CAE):

On 30 Jun 2005 22:50:08 GMT, Guy Macon

Yes, if by "clever" you mean largely obvious.  Just using a vector of
floats instead of a scalar and requiring "exact rounding" and a few
simple rules about the operations.  Can be coded portably, I believe.
Must be a lot of web pages on the subject, I'd imagine.  I believe I

Jon

Re: Extending floating point precision

Jonathan Kirwan wrote:

Why?

Re: Extending floating point precision
On Fri, 01 Jul 2005 02:16:57 +0000, Guy Macon

(1)  Because I don't care to have my posts able to be in any way under
the control of anyone else -- even if that control remains unexercised
-- where that's possible.  (2) And perhaps more especially, when under
the specific circumstances with which I was presented in this case.

...

Regardless, do you see the obvious?  Or would you like some easy
examples to help illustrate, more precisely?  I think I can remember
enough to provide some concrete examples.

Jon

Re: Extending floating point precision

Jonathan Kirwan wrote:

I understand.  If you ever wish to post there without any chance
of outside control, I will be happy to make you a moderator.

Maybe I am just being dumb today, but I don't see the obvious, and
neither did the other two engineers who I was having lunch with.
That being said, just knowing that it's possible is probably enough
for me to get it, so let me work on it tomorrow when I am fresh.
Thanks!

Re: Extending floating point precision
On Fri, 01 Jul 2005 03:47:12 +0000, Guy Macon

My life is filled enough as it is....

That's the right spirit!

A clue:  you implied that you want to be able to use the existing low
level operators to combine values, yet to do so it must be done
without any rational chance of exceeding the final precision in the
result.  This result is, of course, usually of the same size as the
values supplied to the binary operators.  Looking at multiplication
should stress this point more clearly than addition, by the way.  The
way to achieve the extension should be clearer, once you stare at that
problem, squarely in the face of it, and figure out how to deal with
it.  Just focus on multiplication of two values.

But I'm also sure that an almost trivial search of the web will pop up
more than a few somethings on this subject.  I'm sure it's just too
important to have somehow escaped getting some moderate level of
attention there.

Jon

Re: Extending floating point precision

... snip ...

Are you are talking about the usual practice of expressing A as

a = a0 + a1 * 2^N

where the ai are in the precision we have, and the a in the
precision we want, and N quantifies the precision we have.  We can
then do multiplications etc. and combine the portions.  The problem
in the floating world is that we can't control the normalization
and rounding done.  That action on the more significant portion
will mask anything from the less significant portions.  At least as
I see it.

What we can do is to build an arbitrary precision set of integer
operations, and then build a single floating point mechanism upon
that.  To me that means there is no point in having a hardware
floating point processor at all, if all we need is the extended
float.

--
Chuck F ( snipped-for-privacy@yahoo.com) ( snipped-for-privacy@worldnet.att.net)
Available for consulting/temporary embedded and systems.
We've slightly trimmed the long signature. Click to see the full one.
Re: Extending floating point precision

No, that's not what I'm talking about.  And yes, rounding is very
important to consider -- exact rounding.

Jon

Re: Extending floating point precision

Alright, what are you talking about.  Rounding is inherently
inexact.  I see no way of using the usual FP system to extend its
own precision.  Range, yes.  Precision, no.

--
Chuck F ( snipped-for-privacy@yahoo.com) ( snipped-for-privacy@worldnet.att.net)
Available for consulting/temporary embedded and systems.
We've slightly trimmed the long signature. Click to see the full one.
Re: Extending floating point precision

Hmm.  If you reduce the precision of the numbers you store, though a
technique of splitting up the precision, then calculations on them
will have enough bits to have absolutely precise results and you will
be able to _precisely_ round them back to your split precision.  Does
this help?

Jon

Re: Extending floating point precision

No.  I maintain it doesn't work.  If you round one segment you have
to know precisely what effect that had on all the lesser
insignificant digits, and you have no idea of that because they
were never calculated.

--
Chuck F ( snipped-for-privacy@yahoo.com) ( snipped-for-privacy@worldnet.att.net)
Available for consulting/temporary embedded and systems.
We've slightly trimmed the long signature. Click to see the full one.
Re: Extending floating point precision

This first routine will split up a double value into two values that
retain only enough mantissa bits so that a multiplication between two
double values can be exactly rounded.  (I dug it up out of some old
QBASIC code I wrote, years back.)

SUB split (a AS DOUBLE, hi AS DOUBLE, lo AS DOUBLE)

DIM x AS DOUBLE, y AS DOUBLE

LET x = a * 134217729#
LET y = a - x
LET hi = y + x
LET lo = a - hi

END SUB

Once split up, a value is carried as the pair (hi, lo), where the
original value is simply hi+lo.  To multiply two of these values
together, say P and Q, you would first split P into (Phi, Plo) and Q
into (Qhi, Qlo) and then perform the obvious computation:

FUNCTION mult# (p AS DOUBLE, q AS DOUBLE)

DIM phi AS DOUBLE, plo AS DOUBLE
DIM qhi AS DOUBLE, qlo AS DOUBLE
DIM r AS DOUBLE, dr AS DOUBLE

split p, phi, plo
split q, qhi, qlo

LET r = p * q

LET dr = phi * qhi - r
LET dr = dr + phi * qlo + plo * qhi
LET dr = dr + plo * qlo

LET mult = r + dr

END FUNCTION

This combination will produce EXACT rounding, if I got it right back
when I wrote it.

Consequently, I don't find your argument persuasive since the above
code demonstrates the very case of exact rounding you say cannot be
achieved.

Jon

Re: Extending floating point precision

I suspect somewhere we are talking about different things.  You
can't do that split in the first place.  Floats are not neatly
broken into maskable components.  There is no such thing as an
exact round - any round says "this is the nearest representation to
the actual value" and drops the other information, for the simple
reason that there is no place to hold that extra information.
Remember that a float consists of the following components,
represented in some form or other:

An exponent unit, usually (but not restricted to) 2.
A significand, representing all values below exp. unit to 1.
An exponent, the integral power for the exp. unit that forms
a multiplier for the significand.

The whole thing represents a value, and has an intrinsic error.
Knuth has a good discussion of it.  The above neglects the gradual
underflow technique.

You have no control over the representation, apart from the usual
additive and multiplicative operations (that includes subtract and
division).  Other things are formed from those basics.

That is why you have to build the FP system up from the integral
system, upon whose precision the precision of the FP depends.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article.  Click on
We've slightly trimmed the long signature. Click to see the full one.
Re: Extending floating point precision

Certainly, that could be true.

Why?

I didn't mask anything.  The number used depends only on the fact that
the representation is binary, which the IEEE formats are.  Actually,
that method I've shown goes back at least to 1971.  Long before IEEE's
standard arrived on the scene.

You are repeating yourself, but not demonstrating anything here.
Speaking louder or more frequently doesn't help.  Facts do.

Did you test the code?  It actually does break a double into two less
precise portions, quite accurately and well, which can be re-added to
provide the original value.  Further, once the lower mantissa bits
have been set to zero (as happens in that code), multiplications can
be exact (in that, if you multiply two numbers with, say, 26 bits of
valid mantissa and the remainder all zero bits, then the result will
be something with potentially up to 52 bits of valid mantissa, which
just happens to fit into a double with two guard bits to spare.)

The result is that the rounding does take place correctly under those
circumstances... and exactly, given the two original numbers being
multiplied are considered exact to begin with.

If you look at 'r' and 'dr' in the routine before they are added, they
make up a pair of doubles with combined mantissas that yield all of
the required precision for a full multiply of the two original
numbers.

At least, I believe that is true given the article I'd read many years
ago.

I am more than a little aware of this, having written a least some
modest floating point packages of my own.  Both recently and many
years ago.  However, it was for practical need and not as some
theoretical dissertation on the subject, so I'm no expert on the
subject.  But I am intimately familiar with the mechanisms to make
such a system work with practical effect.

But the question was put on the table by the OP...

How so?  I didn't include some snippets of additional code that deal
with large exponents (those greater than something like 10^292 and
less than the limit of around 10^308, but I thought that would only
confuse matters more.)  But I would like to see a specific case (or a
carefully crafted general argument) that illustrates your point
better, as I don't follow you just now.

By the way, I've since discovered that it was Dekker, in 1971, who
wrote the article I'd read so many years back.  I'd forgotten who it
was until I did a search.  He has dealt with these issues using
carefully crafted arguments and I relied upon these for what I'd
tested, before.  So I really would like to see what you are seeing
here.

Different methods would apply, should you have a decimal based system.
Of course.  But I think the routine mentioned above is rather portable
across systems using binary exponents applied to their binary mantissa
bits.  Which pretty much is everything, these days.

My preference, too.  But the question by the OP remains and I've tried
to deal with it, squarely.

But we could still be talking at cross-purposes.  And I could, of
course, be entirely mis-remembering the implications of Dekker's
article.  So I'll leave those doors open.

Jon

Re: Extending floating point precision

... snip ...

No, I didn't really look at the code, so I may be missing
something.  I have saved your message for possible future
examination.  I never use Basic, so translating it to C or Pascal
will be a strain.  You can find one of my floating point packages
in back issues of DDJ - March and April 1979.  All 8080 assembly
code.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article.  Click on
We've slightly trimmed the long signature. Click to see the full one.
Re: Extending floating point precision

A nice fellow sent me some information by email that apples to this:

|
| The Berkley folks have cracked this one with there double double quad
| package.
| http://crd.lbl.gov/~dhbailey/mpdist /
|
| There's an early paper/web page from about '95-'97 over in the UK.
| http://members.lycos.co.uk/keithmbriggs/doubledouble.html
|
| Rumor has it this is how the latest nvidia GPUs do 128b floats.
|

-------------------------------------------------------------------

For those who missed it, here is Jonathan Kirwan's code:

: This first routine will split up a double value into two values that
: retain only enough mantissa bits so that a multiplication between two
: double values can be exactly rounded.  (I dug it up out of some old
: QBASIC code I wrote, years back.)
:
:   SUB split (a AS DOUBLE, hi AS DOUBLE, lo AS DOUBLE)
:
:       DIM x AS DOUBLE, y AS DOUBLE
:
:           LET x = a * 134217729#
:           LET y = a - x
:           LET hi = y + x
:           LET lo = a - hi
:
:   END SUB
:
: Once split up, a value is carried as the pair (hi, lo), where the
: original value is simply hi+lo.  To multiply two of these values
: together, say P and Q, you would first split P into (Phi, Plo) and Q
: into (Qhi, Qlo) and then perform the obvious computation:
:
:   FUNCTION mult# (p AS DOUBLE, q AS DOUBLE)
:
:       DIM phi AS DOUBLE, plo AS DOUBLE
:       DIM qhi AS DOUBLE, qlo AS DOUBLE
:       DIM r AS DOUBLE, dr AS DOUBLE
:
:           split p, phi, plo
:           split q, qhi, qlo
:
:           LET r = p * q
:
:           LET dr = phi * qhi - r
:           LET dr = dr + phi * qlo + plo * qhi
:           LET dr = dr + plo * qlo
:
:           LET mult = r + dr
:
:   END FUNCTION
:
: This combination will produce EXACT rounding, if I got it right back
: when I wrote it.
:
: Consequently, I don't find your argument persuasive since the above
: code demonstrates the very case of exact rounding you say cannot be
: achieved.

-------------------------------------------------------------------

[Meta] Just in case one of you changes his mind about posting to
misc.business.product-dev, I have set the modbot to autoapprove
anything by Jonathan Kirwan or CBFalconer.  You can verify this
with a test post; it will be approved in less than a second - far
too short for a human to change it or interfere with it in any way.

that his moderation software does not allow the moderator to make any
changes to a post (headers or body) other than adding a clearly labeled
moderators comment at the bottom or rejecting the post, and that once
a poster is put on the whitelist, even those actions are impossible.

If there is anything else that I can do to assure you that your posts
will not be in any way under the control of anyone else, let me know.

Re: Extending floating point precision

... snip ...

Before I go any further, this is my interpretation of the (ugh)
Basic code, which makes no sense to me.  The magic number is
8000001 in hex by my count, needing 28 unsigned bits to express.
It will probable be expressed exactly in some doubles.  This means
the product may be the result of adding a value and the value
shifted left by 28 places, in a binary system.
1xxxx...xxxxx1
1xxxx...xxxxx1
============================
^          ^--- about here is end of reg.
only possible carry comes from here.  This may result in:
1000000000000001111111111000
^           ^-- truncated, no register room
1 + 1 yields 0 with carry here

and the truncation might cause the 0 sum bit to become 1 due to
rounding.

which the FP will immediately shift right in normalizing, assuming
all the significand bits marked x were set.  I don't see that all
this does anything for anybody.

void split(double flt, double *hi, double *lo)
{
double x, y;

x = flt * 134217729;
y = flt - x;
*hi = y + x;
*lo = flt - *hi;
} /* split */

double mult(double p, double q)
{
double phi, plo;
double qhi, qlo;
double r,   dr;

split(p, &phi, &plo);
split(q, &qhi, &qlo);
r = p * q;
dr = phi * qhi - r;
dr = dr + (phi * qlo) + (plo * qhi);
dr = dr + (plo * qlo);
return (r + dr);
} /* mult */

... snip ...

What is this about?  Is the cross-post moderated and delaying?  I
am in no rush.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article.  Click on
We've slightly trimmed the long signature. Click to see the full one.
Re: Extending floating point precision

CBFalconer wrote:

I am the moderator of misc.business.product-dev.  Jonathan Kirwan
is concerned about having his posts be subject to moderation, so
he snips misc.business.product-dev when he replies. (All of this
is perfectly acceptable, of course; his post, his choice).  This
causes replies from those who don't have a problem with moderation
to also not go to misc.business.product-dev.

(empasis on the "persuade"; it's his choice to make), I have whitelisted
him and those who reply to him so that anything he or they post goes
right to the newsgroup without any possibility of moderation.  I also
offered to make him a moderator if that would help.

Re: Extending floating point precision

CBFalconer wrote:

Ah.  Yes, I see how to do that - build an arbitrary precision set
of integer operations, and then build a single floating point
mechanism upon that.  That's probably the most efficient way to
do it as well, under normal circumstances.  Then again, some of
the newer processors can pipeline FPU instructions without slowing
down the rest of the CPU, so building extended precision math out
of existing floating point instructions would be essentially free.
The way I write code, the FPU has a *lot* of time on it's hands...

Re: Extending floating point precision

[...]

I think that's exactly it. As soon as you lose a quantum of precision
on a boundary, everything to the right of that boundary is useless.
It may appear OK, and that would be a dangerous assumption, because
after many dozens or thousands of additional operations, the less
significant portion of the numbers would be unrecognizable as well
as being wrong.

I think that's probably the only way.