Floating point issues

I wonder if you could help me with this.

Because of system restrictions, I have to convert input data (floating
point, 32 bit, IEEE format) to a 16 bit format (for example: 1 sign
bit, 5 exponent bits, 10 mantissa) and, after processing, back to IEEE
32 bits.

What issues can I expect in terms of dynamic range, clipping, etc.?
Also, what would be the most efficient way to convert between the
two formats?

Many thanks.


Re: Floating point issues

If you can't work this out by yourself, you should consider a different
career path.

Re: Floating point issues

Come on Tom, the guy's probably a Computing 101 student asking for advice.

Assuming only positive values ... You've assumed that you have to
convert fp32 to fp16.

One commonly used approach is to convert the FP data to unsigned 16 bit
integers: multiply the fp number by, say, 1000, do the math processing
on the integers, then convert back to fp by dividing by 1000.

Since an unsigned 16 bit integer has range 0 - 65535, a factor of 1000
gives you an effective fp range of 0 - 65. If you use a factor of 100,
you get an effective fp range of 0 - 655.  Get the picture?
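
A minimal sketch of this, assuming positive-only input and a factor of
1000 as above (the SCALE constant and the clipping policy are choices
you would tune, not anything mandated by the problem):

#include <stdint.h>

#define SCALE 1000.0f   /* hypothetical scale factor */

/* Encode: clip to the representable range, scale, round to nearest. */
static uint16_t fp_to_u16(float x)
{
    float scaled = x * SCALE + 0.5f;
    if (scaled < 0.0f)     return 0;        /* clip negative input */
    if (scaled > 65535.0f) return 65535;    /* clip overflow       */
    return (uint16_t)scaled;
}

/* Decode: divide the scale factor back out. */
static float u16_to_fp(uint16_t n)
{
    return (float)n / SCALE;
}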

Try thinking outside the box a bit: are you limited to ALWAYS having
just one 16 bit integer for each fp32 value? There's nothing wrong with
an encoding scheme that uses an occasional extra 16 bit value to encode
something useful. For example, what about scanning each line (or buffer
full) to find the minimum value per line, then storing each value as a
delta from that minimum, ie base_value (int16), delta_values (int16)?
This will increase your effective dynamic range considerably, and make
very little difference to the "system restrictions" you mention. Your
algorithm sets the line or buffer size.

Example: the base value is 3000, and all buffer values are the deltas
* 1000 (as described above).

This scheme could be modified to set the base value to the data mean,
with signed integer delta values. A rough sketch of the basic scheme
follows.
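
Here is that sketch, assuming positive values and the same hypothetical
SCALE factor as in the previous sketch (the clipping checks shown there
are omitted here for brevity):

#include <stddef.h>
#include <stdint.h>

#define SCALE 1000.0f

/* Encode one line/buffer: out[0] is the scaled minimum (the base),
 * out[1..n] are the scaled deltas from that base, so out needs n+1
 * elements. */
static void encode_buffer(const float *in, uint16_t *out, size_t n)
{
    float base = in[0];
    for (size_t i = 1; i < n; i++)
        if (in[i] < base)
            base = in[i];

    out[0] = (uint16_t)(base * SCALE + 0.5f);

    /* Use the quantised base so encode and decode agree exactly. */
    float qbase = (float)out[0] / SCALE;
    for (size_t i = 0; i < n; i++)
        out[i + 1] = (uint16_t)((in[i] - qbase) * SCALE + 0.5f);
}

/* Decode: add the base back onto each delta. */
static void decode_buffer(const uint16_t *in, float *out, size_t n)
{
    float base = (float)in[0] / SCALE;
    for (size_t i = 0; i < n; i++)
        out[i] = base + (float)in[i + 1] / SCALE;
}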

You should be able to work out dynamic range and clipping issues from this.
--
regards,
Stewart DIBBS
Re: Floating point issues

Very likely, but it doesn't require more than high school math.

Re: Floating point issues

And confidence in a reasonable understanding of the issues involved.
--
Engineering is the art of making what you want from things you can get.

Re: Floating point issues

We've had at least one long thread about FP formats here recently.
The math(s) may be trivial, but an appreciation of the trade-offs
seems to be distinctly non-trivial. If I were choosing or designing
a 16-bit format, I'd want some reassurance that others who had been
down the same road had made similar decisions.

Re: Floating point issues

Well, just to enumerate the primary decisions:

1.  Do negative values exist?  If so, one bit goes to the sign.
2.  Range required.
    2a.  Exponent base.  Affects resolution.
3.  Resolution required.  Affects range.

For most purposes a system with binary exponents and significands
is likely to be optimum.  Other bases do not allow the use of the
implied leading one bit, which in effect gets you the sign bit for
free, or can be considered to give one extra bit of resolution.

--
 "A man who is right every time is not likely to do very much."
                           -- Francis Crick, co-discoverer of DNA
Re: Floating point issues

While with "hidden bit" normalisation you can get one extra bit of
resolution, this applies only to binary exponent representations.
With other representations (such as base 16 or even base 10) you do
not get such an advantage.
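
To make the hidden bit concrete, here is a tiny demo using the 1-5-10
half-float layout from the original question (the value 0x3E00 is just
an arbitrary example):

#include <stdint.h>
#include <stdio.h>

/* Decode a normal half-float significand: the 10 stored mantissa
 * bits get an implied leading 1 prepended, giving 11 effective
 * bits of resolution. */
int main(void)
{
    uint16_t h = 0x3E00;  /* exponent 15 (unbiased 0), mantissa 1000000000 */
    uint32_t sig = 0x0400u | (h & 0x03FFu);    /* prepend the hidden 1 */
    printf("significand = %u/1024 = %f\n", sig, sig / 1024.0);
    return 0;             /* prints: significand = 1536/1024 = 1.500000 */
}

With a base-16 exponent the leading digit of a normalised significand
can be anything from 1 to 15, so it has to be stored.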

Paul


Re: Floating point issues

(snip, someone wrote)

But with a higher base you get more range with fewer exponent bits.
Depending on the format it may or may not make up for the loss of
the hidden bit.

-- glen


Re: Floating point issues

I thought I just said that.

--
 "A man who is right every time is not likely to do very much."
                           -- Francis Crick, co-discoverer of DNA
Re: Floating point issues

Is this homework?

You can discuss this only with some context and, even then, probably
only by comparison -- as in, x is more efficient than y in this
particular context.

Why don't you disclose more of the situation?

Jon

Re: Floating point issues

Google for "half float format". It is used by ATI and nVidia to save
memory and gain speed while rendering with floating point. Also, the
OpenEXR file format supports half floats and is well documented,
including discussions of half float advantages over fixed point and
32-bit float. The OpenEXR libraries are open source and include half
float conversion and, IIRC, math.
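
For the conversion itself, a minimal sketch of the bit manipulation is
below. This is not OpenEXR's code; it truncates instead of rounding to
nearest and flushes subnormals to zero, both of which a production
converter would handle properly.

#include <stdint.h>
#include <string.h>

/* float -> half (1 sign, 5 exponent, 10 mantissa), round toward zero.
 * Values too large for half clip to +/-Inf; values too small flush
 * to (signed) zero. */
static uint16_t f32_to_f16(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);                 /* raw IEEE 754 bits */

    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
    int32_t  exp  = (int32_t)((x >> 23) & 0xFFu) - 127 + 15;  /* rebias */
    uint32_t man  = x & 0x007FFFFFu;

    if (((x >> 23) & 0xFFu) == 0xFFu)         /* Inf or NaN */
        return sign | 0x7C00u | (man ? 0x0200u : 0);
    if (exp >= 0x1F)                          /* overflow -> Inf */
        return sign | 0x7C00u;
    if (exp <= 0)                             /* underflow -> zero */
        return sign;
    return sign | (uint16_t)(exp << 10) | (uint16_t)(man >> 13);
}

/* half -> float; subnormal halves decode to zero in this sketch. */
static float f16_to_f32(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x03FFu;
    uint32_t x;

    if (exp == 0x1F)                          /* Inf or NaN */
        x = sign | 0x7F800000u | (man << 13);
    else if (exp == 0)                        /* zero (subnormals dropped) */
        x = sign;
    else
        x = sign | ((exp - 15 + 127) << 23) | (man << 13);

    float f;
    memcpy(&f, &x, sizeof f);
    return f;
}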



Re: Floating point issues

Thank you Matthias, that is what I was looking for.




Re: Floating point issues

no-spame-matt> Google for "half float format". It is used by ATI
no-spame-matt> and nVidia to save memory and gain speed [ ... ]
no-spame-matt> [ ... ] including discussions on half float
no-spame-matt> advantages over fixed point and 32 bit float. [
no-spame-matt> ... ]

Quite interesting. But for small numbers of bits I was quite
impressed by FOCUS and other logarithm-based ''floating
point'' formats:

  http://DBLP.Uni-Trier.DE/rec/bibtex/journals/cacm/EdgarL79

    "Communications of the ACM", v22 n3, March 1979: "FOCUS
    Microcomputer Number System"; Albert Edgar, Samuel Lee.

    FOCUS is a number system and supporting computational
    algorithms especially useful for microcomputer control
    and other signal processing applications.

    FOCUS has the wide-ranging character of floating-point
    numbers with a uniformity of state distributions that
    give FOCUS better than a twofold accuracy advantage over
    an equal word length floating-point system.

    FOCUS computations are typically five times faster than
    single precision fixed-point or integer arithmetic for a
    mixture of operations, comparable in speed with hardware
    arithmetic for many applications. Algorithms for 8-bit and
    16-bit implementations of FOCUS are included.

They require a different programming attitude though, probably
the reason why GPUs use them.

IIRC some UK academic started a company a few years ago for
hardware-accelerated logarithm-based ''floating point'', but it does
not seem to have achieved world domination yet, which may be a pity.

Re: Floating point issues
Peter Grandi wrote:

Sounds really interesting.

Is there any more detailed information about the principle available (online and
for free)?

The only link I found so far is
<http://portal.acm.org/citation.cfm?id=359080.359085>
which is a commercial portal - absolutely no information unless you pay... :-(

--
Dipl.-Ing. Tilmann Reh
http://www.autometer.de - Elektronik nach Maß.

Re: Floating point issues
The generic term is probably LNS, "logarithmic number system";
there are many variants. Apart from software implementations,
hardware/VLSI implementations are common too.

The basic simple variant: the numbers are stored as fixed-point
logarithms, so one has strength reduction -- multiplication and
division become integer addition and subtraction, which are easy.
The most straightforward way to get from integer to logarithmic
format and back is tables. Tables can be appropriate for
low-resolution applications; I know of a flight simulator for very
early x86 PCs that used them. A toy sketch of the idea follows.
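
This is only a toy illustration of the strength reduction, not FOCUS
itself: the 16-bit layout and the 8 fractional bits are arbitrary
choices, and a real implementation would use lookup tables rather
than log2f/exp2f, plus extra machinery for addition and subtraction.

#include <math.h>
#include <stdint.h>

/* Toy 16-bit LNS: store round(log2(v) * 256) as a signed integer,
 * i.e. 8 fractional bits of the base-2 logarithm. */
#define LNS_FRAC 256.0f

static int16_t to_lns(float v)          /* v must be > 0 */
{
    return (int16_t)lrintf(log2f(v) * LNS_FRAC);
}

static float from_lns(int16_t l)
{
    return exp2f((float)l / LNS_FRAC);
}

/* The payoff: multiply and divide become integer add/sub
 * (overflow checks omitted in this toy). */
static int16_t lns_mul(int16_t a, int16_t b) { return (int16_t)(a + b); }
static int16_t lns_div(int16_t a, int16_t b) { return (int16_t)(a - b); }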

But for high resolution the tables get out of hand, so from time to
time someone comes up with a new fix for that problem, often with a
two-word format similar to float.

Log Point Technologies had a version in 1997 that claimed tables of
17k - 55k bytes, for 8-bit microprocessors and DSPs with 32x32
integer multiply.

An old Ph.D. thesis:
Stouraitis, "Logarithmic Number System, Theory, Analysis and Design",
University of Florida.
Searching Google for "stouraitis logarithmic" gives about 250 hits
one can examine.

Another starting point would be a Google Groups search on
comp.arch.arithmetic for "logarithmic".

That's the oldest (?) and certainly the most often quoted. I think
there is a description of FOCUS as a chapter in a book too.

Regards,  JRD

Re: Floating point issues

pg_nh> Quite interesting. But for small number of bits I was quite
pg_nh> impressed by FOCUS and other logarithm based ''floating point''
pg_nh> formats: [ ... ] They require a different programming attitude
pg_nh> though, probably the reason why GPUs use them. [ ... ]

Oops, that should have read "the reason why CPUs *don't* use them".

Re: Floating point issues

When using smaller floating-point formats, you first have to define
the required dynamic range: what is the biggest number, what is the
smallest number? Then take as many bits as needed to represent this
range in the exponent, and the rest is left for the mantissa. A
rough sizing sketch follows.
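
Here is that sketch as a hypothetical helper; it ignores the exponent
bias, subnormals, and the reserved Inf/NaN exponent codes, so treat
the answer as a lower bound.

#include <math.h>

/* Rough sizing: how many exponent bits does a given dynamic range
 * need?  The exponent must span about log2(max) - log2(min) distinct
 * values, so we need ceil(log2(...)) bits for it; the remaining bits
 * go to the mantissa. */
static int exponent_bits_needed(double max_val, double min_val)
{
    double span = log2(max_val) - log2(min_val);
    return (int)ceil(log2(span));
}
/* e.g. max 65504, min 6.1e-5 spans ~30 exponent values -> 5 bits,
 * matching the 5-bit exponent of the half-float format above. */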

Rene
--
Ing.Buero R.Tschaggelar - http://www.ibrtses.com
& commercial newsgroups - http://www.talkto.net
