# Floating point issues

• posted

If you can't work this out by yourself, you should consider a different career path.

• posted

I wonder if you could help me with this.

Because of system restrictions, I have to convert input data (floating point, 32 bit, IEEE format) to a 16 bit format (for example: 1 sign bit, 5 exponent bits, 10 mantissa bits) and, after processing, back to IEEE 32 bits.

What issues can I expect in terms of dynamic range, clipping etc? Also, what would be the most efficient way to convert between the two formats?

Many thanks.

• posted

Come on Tom, the guy's probably a computing 101 student, asking for advice.

Assuming only positive values ... You've assumed that you have to convert fp32 to fp16.

One commonly used approach is to convert the FP data into unsigned 16 bit integer, that is, multiply the fp number by, say, 1000, do the math processing, then convert the integer back to fp by dividing by 1000.

Since an unsigned 16 bit integer has the range 0 - 65535, a factor of 1000 gives you an effective fp range of 0 - 65.535 in steps of 0.001. If you use a factor of 100, you get an effective fp range of 0 - 655.35 in steps of 0.01. Get the picture?
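A minimal Python sketch of that scale-and-round idea, for what it's worth. The names `to_u16`, `from_u16` and `SCALE` are made up for illustration, and saturating out-of-range values (instead of letting them wrap) is an assumption, not something stated above.

```python
SCALE = 1000      # assumed scale factor, as in the post
U16_MAX = 65535   # largest unsigned 16 bit value


def to_u16(x, scale=SCALE):
    """Quantize a float to unsigned 16 bit fixed point, saturating at both ends."""
    q = round(x * scale)
    return max(0, min(U16_MAX, q))


def from_u16(q, scale=SCALE):
    """Convert the stored integer back to a float."""
    return q / scale


if __name__ == "__main__":
    for value in (3.14159, 0.0004, 70.0):
        stored = to_u16(value)
        print(value, "->", stored, "->", from_u16(stored))
```

With a factor of 1000 anything above 65.535 clips to 65.535, and anything smaller than half a step (0.0005) comes back as zero.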

Try thinking outside the box a bit: are you limited to ALWAYS having just one 16 bit integer for each fp32 value? There's nothing wrong with an encoding scheme that might use an occasional extra 16 bit value to define something useful. For example, what about scanning each line (or buffer full) to find the minimum value per line, then storing each integer value as the delta from it, i.e. base_value (int16), delta_values (int16)? This will increase your effective dynamic range considerably, and make very little difference to the "system restrictions" you mention. Your algorithm sets the line or buffer size.

Example: the base value is 3000, and every buffer value is stored as (value - base) * 1000, as described above.

This scheme could be modified to set the base value to the data mean, then have signed integer delta values.
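Here is a rough Python sketch of the minimum-as-base variant described above. `encode_block` and `decode_block` are hypothetical names, and rounding the base down to a whole number so that it fits in an int16 is an assumption made for the illustration; the mean-based signed variant would follow the same pattern.

```python
import math

SCALE = 1000      # assumed scale factor for the deltas
U16_MAX = 65535   # largest unsigned 16 bit delta


def encode_block(values, scale=SCALE):
    """Return (base, deltas): one int16 base plus one uint16 delta per value."""
    base = math.floor(min(values))          # whole-number base at or below the minimum
    if not -32768 <= base <= 32767:
        raise ValueError("base does not fit in int16")
    # Deltas are never negative because base <= min(values); large ones saturate.
    deltas = [min(U16_MAX, round((v - base) * scale)) for v in values]
    return base, deltas


def decode_block(base, deltas, scale=SCALE):
    """Rebuild approximate floats from the base and the scaled deltas."""
    return [base + d / scale for d in deltas]


if __name__ == "__main__":
    base, deltas = encode_block([3000.25, 3010.5, 3065.0])
    print(base, deltas)
    print(decode_block(base, deltas))
```

Within one block the usable span is still 65.535 at a factor of 1000, but the block as a whole can sit anywhere the int16 base can reach.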

You should be able to work out dynamic range and clipping issues from this.

```
--
regards,
Stewart DIBBS
```

• posted

Is this homework?

You can discuss this only with some context and, even then, probably only by comparison -- as in, x is more efficient than y in this particular context.

Why don't you disclose more of the situation?

Jon

• posted

Google for "half float format". It is used by ATI and nVidia to save memory and gain speed while rendering floating point. Also, the file format OpenEXR supports half floats and is well documented, including discussions on half float advantages over fixed point and 32 bit float. The OpenEXR libraries are open source and include half float conversion and, IIRC, math.
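To get a feel for the range and clipping behaviour of the 1/5/10 layout, here is a small Python sketch. It relies on the standard `struct` module's `e` format code, which packs IEEE 754 binary16; the helper names are made up, and clamping to the largest finite half value (65504) instead of producing infinity is an assumption about what "clipping" should mean here. This is not how OpenEXR does it, only a way to experiment.

```python
import math
import struct

HALF_MAX = 65504.0  # largest finite binary16 value


def float_to_half_bits(x):
    """Round x to binary16 and return the 16 bit pattern, saturating finite values."""
    if math.isfinite(x):
        x = max(-HALF_MAX, min(HALF_MAX, x))   # assumed policy: clip, do not overflow
    return struct.unpack("<H", struct.pack("<e", x))[0]


def half_bits_to_float(bits):
    """Expand a 16 bit pattern back to a Python float (exactly representable)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def roundtrip(x):
    return half_bits_to_float(float_to_half_bits(x))


if __name__ == "__main__":
    for value in (1.0, 0.1, 1e-8, 100000.0):
        print(value, "->", hex(float_to_half_bits(value)), "->", roundtrip(value))
```

Normal half values keep about 11 significant bits, magnitudes below roughly 3e-8 flush to zero, and anything above 65504 clips under the policy assumed here.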

• posted

Very likely, but it doesn't require more than high school math.

• posted

Thank you Matthias, that is what I was looking for.

Matthias Melcher wrote:

• posted

When using smaller floating points, you first have to define the required dynamic range: what is the biggest number, and what is the smallest number? Then you take as many bits as are needed to represent this range in the exponent, and the rest is left for the mantissa.
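That bit budgeting can be sketched in a few lines of Python. `split_bits` is a hypothetical helper, it assumes a binary exponent, and it ignores any exponent codes a real format would reserve for zero, subnormals, infinity or NaN, so treat the result as a lower bound on the exponent width.

```python
import math


def split_bits(smallest, largest, total_bits=16, signed=True):
    """Return (exponent_bits, mantissa_bits) for magnitudes in [smallest, largest]."""
    e_min = math.floor(math.log2(smallest))   # binary exponent of the smallest magnitude
    e_max = math.floor(math.log2(largest))    # binary exponent of the largest magnitude
    n_exponents = e_max - e_min + 1           # distinct exponent values needed
    exp_bits = max(1, (n_exponents - 1).bit_length())
    mant_bits = total_bits - exp_bits - (1 if signed else 0)
    return exp_bits, mant_bits


if __name__ == "__main__":
    print(split_bits(2 ** -14, 65504.0))   # the normal range of the 1/5/10 format
    print(split_bits(1.0, 200.0))          # a much narrower range
```

For a narrow range such as 1 to 200 only 3 exponent bits are needed, which leaves 12 bits for the mantissa in a signed 16 bit word.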

Rene

```
--
Ing.Buero R.Tschaggelar - http://www.ibrtses.com
& commercial newsgroups - http://www.talkto.net
```

• posted

no-spame-matt> Google for "half float format". It is used by ATI and nVidia to save memory and gain speed [ ... ] including discussions on half float advantages over fixed point and 32 bit float. [ ... ]

Quite interesting. But for a small number of bits I was quite impressed by FOCUS and other logarithm-based "floating point" formats:

"Communications of the ACM", v22 n3, March 1979: "FOCUS Microcomputer Number System"; Albert Edgar, Samuel Lee.

«FOCUS is a number system and supporting computational algorithms especially useful for microcomputer control and other signal processing applications.

FOCUS has the wide-ranging character of floating-point numbers with a uniformity of state distributions that give FOCUS better than a twofold accuracy advantage over an equal word length floating-point system.

FOCUS computations are typically five times faster than single precision fixed-point or integer arithmetic for a mixture of operations, comparable in speed with hardware arithmetic for many applications. Algorithms for 8-bit and 16-bit implementations of FOCUS are included.»

They require a different programming attitude though, probably the reason why GPUs use them.

IIRC, some UK academic started a company a few years ago for hardware-accelerated logarithm-based "floating point", but it does not seem to have achieved world domination yet, which may be a pity.

• posted

Peter Grandi wrote:

Sounds really interesting.

The only link I found so far is

which is a commercial portal - absolutely no information unless you pay... :-(

```
--
Dipl.-Ing. Tilmann Reh
http://www.autometer.de - Elektronik nach Maß.
```

• posted

The generic term is probably LNS ("logarithmic number system"); there are many variants. Apart from software, implementations in hardware/VLSI are common too.

The basic simple variant is: addition and subtraction are easy with integers. The most straightforward way to get from integer to logarithmic format and back is tables. There one has strength reduction: multiplication and division become addition and subtraction again. Tables can be appropriate for low-resolution applications; I know of a flight simulator for very early x86 PCs.
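A toy Python sketch of the strength reduction, under loud assumptions: positive values only, the logarithm stored as a fixed-point integer with `FRAC_BITS` fractional bits (a number picked for illustration), and `math.log2` standing in for the lookup tables a real low-resolution implementation would use. The function names are hypothetical.

```python
import math

FRAC_BITS = 8              # assumed fractional bits of the stored log2 value
ONE = 1 << FRAC_BITS       # fixed-point representation of log2 = 1.0


def to_lns(x):
    """Positive float -> integer holding log2(x) in fixed point (table stand-in)."""
    return round(math.log2(x) * ONE)


def from_lns(code):
    """Integer log code -> float (the antilog table stand-in)."""
    return 2.0 ** (code / ONE)


def lns_mul(a, b):
    """Multiplying two values is adding their log codes."""
    return a + b


def lns_div(a, b):
    """Dividing two values is subtracting their log codes."""
    return a - b


if __name__ == "__main__":
    product = from_lns(lns_mul(to_lns(3.0), to_lns(5.0)))
    quotient = from_lns(lns_div(to_lns(15.0), to_lns(4.0)))
    print(product, quotient)
```

With 8 fractional bits each conversion is off by at most half a step in log2, so a single multiply stays within about 0.3 % of the true product. Addition and subtraction of the values themselves are the hard part in the log domain, which is why the post keeps them in integer form.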

But for high resolution the tables get out of hand, so from time to time someone comes up with a new fix for that problem, often with a two-word format similar to float.

Log Point Technologies had a version in 1997 that claimed tables from 17k - 55k bytes, for 8 bit microprocessors and DSPs with 32x32 integer multiply.

An old Ph.D. thesis: Stouraitis, "Logarithmic Number System, Theory, Analysis and Design", University of Florida. Searching Google for "stouraitis logarithmic" gives about 250 hits one can examine.

Another starting point would be groups.google on comp.arch.arithmetic with "logarithmic"

That's the oldest (?) and certainly the most often quoted. I think there is a description of FOCUS as a chapter in a book too.

Regards, JRD

• posted

And confidence in a reasonable understanding of the issues involved.

```--
Engineering is the art of making what you want from things you can get.
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯```
• posted

We've had at least one long thread about FP formats here recently. The math(s) may be trivial, but an appreciation of the trade-offs seems to be distinctly non-trivial. If I were choosing or designing a 16-bit format, I'd want some reassurance that others who had been down the same road had made similar decisions.

• posted

pg_nh> Quite interesting. But for small number of bits I was quite impressed by FOCUS and other logarithm based "floating point" formats: [ ... ] They require a different programming attitude though, probably the reason why GPUs use them. [ ... ]

Oops, that should have read "the reason why CPUs *don't* use them".

• posted

Well, just to enumerate the primary decisions:

1. Do negative values exist? If so, 1 bit used.
2. Range required.
   2a. Exponent base. Affects resolution.
3. Resolution required. Affects range.

For most purposes a system with binary exponents and significands is likely to be optimum. Other bases do not allow the use of the implied leading one bit, which in effect gets the sign bit for free, or can be considered to give one extra bit of resolution.

```
--
"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discoverer of DNA
```

• posted

With "hidden bit" normalisation you can get one extra bit of resolution, but this applies only to binary exponent representations. With other representations (such as base 16 or even base 10) you do not get that advantage.

Paul

• posted

(snip, someone wrote)

But with a higher base you get more range with fewer exponent bits. Depending on the format it may or may not make up for the loss of the hidden bit.
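The trade-off can be put into numbers with a short Python sketch. `format_stats` is a hypothetical helper; it only handles power-of-two bases (given as bits per exponent step: 1 for base 2, 4 for base 16), assumes normalisation means a non-zero leading digit, and ignores reserved exponent codes, so the figures are approximate.

```python
def format_stats(exp_bits, bits_per_step, total_bits=16, signed=True):
    """Return (range_bits, worst_precision, best_precision) for a small float format.

    range_bits is roughly log2(largest / smallest) for normalised magnitudes.
    """
    stored = total_bits - exp_bits - (1 if signed else 0)   # stored significand bits
    range_bits = (2 ** exp_bits) * bits_per_step
    if bits_per_step == 1:
        # Base 2: the leading bit is always 1, so it can be hidden.
        best = worst = stored + 1
    else:
        # Larger base: no hidden bit, and the leading digit may start with zeros.
        best = stored
        worst = stored - (bits_per_step - 1)
    return range_bits, worst, best


if __name__ == "__main__":
    print("base 2,  5 exponent bits:", format_stats(5, 1))
    print("base 16, 3 exponent bits:", format_stats(3, 4))
```

For a 16 bit word, base 16 with 3 exponent bits covers about the same range as base 2 with 5, and stores 12 significand bits instead of 10, but its worst-case precision is 9 bits against a constant 11 for base 2 with the hidden bit.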

-- glen

• posted

I thought I just said that.

```
--
"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discoverer of DNA
```
