Fixed point vs. floating point

Hi All,

I am trying to understand how fixed point and floating point are implemented in hardware. This is what I understand so far:

1) Fixed point means the decimal point is fixed. Floating point has some bits set aside for the mantissa and some for the exponent, i.e. the decimal point "floats".
2) Floating point has more dynamic range for a given number of bits as compared to fixed point.
3) Floating point is harder to implement in hardware than fixed point, but is closer to representing real-world values.
4) Fixed point is better when your application has power consumption requirements but doesn't care as much about precision.

All of this is good. But I am trying to visualize fixed and floating point in hardware. What I am trying to ask is: in fixed point, is there a separate datapath for the fraction and for the integer part? I wouldn't think so, because at the end of the day it's just bits and it's up to us how we interpret them. I am sure, though, that fixed point and floating point hardware are architecturally different.

So one good question to ask is: how is a fixed-point multiplier implemented, and how is a floating-point multiplier implemented? Can someone give a real-world processor example of each? It would be much appreciated.

Reply to
skilambi

I don't know how it is implemented in hardware, but I know how you can implement it in software. Below is an implementation I wrote some time ago for MIDP 1.0 (Java on mobile phones, because MIDP 1.0 doesn't have float). The algorithms are described by Knuth in The Art of Computer Programming.

I guess it can be implemented in hardware like this, too. There are also some tricks for fast hardware multipliers (try Google) which can be used for fixed point as well.

public static int mul(int u, int v) {
    int fu = u & 0x7FFFFF;
    int eu = ((u >> 23) & 0xff) - 127;

    int fv = v & 0x7FFFFF;
    int ev = ((v >> 23) & 0xff) - 127;

    if (u < 0) fu = -fu;
    if (v < 0) fv = -fv;

    return normalize(((long) fu) * ((long) fv), eu + ev);
}

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply to
Frank Buss

Let's forget signed numbers and other complications for the moment. A fixed-point, unsigned, binary multiplier is implemented an awful lot like how you learned to multiply multi-digit numbers back in grade school, except that the digits are binary.

For a 4x4 multiply, if you were to tell the processor to multiply 5 * 13, that would be 0101 * 1101. Doing this out long hand you'd get (view in a fixed font):

      0101
    * 1101
    ------
      0101
     0000
    0101
   0101
  --------
  01000001

(5 * 13 = 65)

So the processor had to do three shifts, four decisions whether to add or not, and up to four additions.
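
For what it's worth, here is that shift-and-add scheme as a small Java sketch (my own illustration, not any particular processor's implementation); hardware does the same job with a row of AND gates per partial product and an adder:

// Shift-and-add multiply of two unsigned 4-bit values, exactly the
// long-hand scheme above: one partial product per multiplier bit.
static int mul4x4(int multiplicand, int multiplier) {
    int product = 0;
    for (int i = 0; i < 4; i++) {
        if (((multiplier >> i) & 1) != 0) {   // decide whether to add...
            product += multiplicand << i;     // ...the shifted partial product
        }
    }
    return product;                           // up to 8 bits wide
}

mul4x4(5, 13) returns 65, matching the long-hand result above.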

Any time you are multiplying two fixed-point numbers with the same radix point position, this is how it's done; the answer may be interpreted differently, but the bit patterns will always be the same -- there are no separate paths for the integer and fractional parts.

A floating point multiply treats the mantissas as fixed-point numbers; it multiplies them together, checks the leading bit to see if it is zero and adjusts the mantissa and exponent accordingly, then adds the two exponents.

Where floating point _really_ starts to slow down is for subtraction, and for exceptions.

When two F.P. mantissas must be subtracted from one another the answer can get markedly smaller than either argument (or go to zero), so after a subtraction the algorithm must adjust the mantissa (and the exponent) to match the new answer.
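
As a rough sketch of that adjustment (my own illustration, assuming a 24-bit mantissa whose leading "1." sits at bit 23; sign, rounding and underflow handling are omitted):

// After subtracting two aligned mantissas the result can have many leading
// zeros.  Shift left until the leading bit is back at bit 23, decrementing
// the exponent once per shift.  Assumes the difference is nonzero and
// already fits in 24 bits, which is the subtraction case discussed above.
static int[] renormalize(int mantissa, int exponent) {
    while ((mantissa & 0x800000) == 0) {
        mantissa <<= 1;      // one shift per lost leading bit...
        exponent--;          // ...and one exponent step to match
    }
    return new int[] { mantissa, exponent };
}

Every pass through that loop is a clock (or an instruction), which is where the time goes.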

If the hardware or software is implementing IEEE floating point, then there are a number of error checks that must be done, and there are very specific actions that must be carried out in the case of overflow, underflow, divide by zero, etc. These exceptions are generally ignored or given a once over lightly with fixed-point math.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" gives you just what it says.
See details at http://www.wescottdesign.com/actfes/actfes.html
Reply to
Tim Wescott

-- snip --

Fixed point math can yield results that are every bit as precise as floating point math, and can often do so with significantly fewer processor resources and software bugs.

If you have an application where the inputs are of known range and precision (such as you get when you're taking input from some sort of analog to digital conversion process), then you can almost always map out the entire data path of your algorithm, and know the exact range of magnitudes and precision for each step. Knowing this, the task of fitting fixed-point math to the algorithm is simple, if sometimes tedious.

In this case floating point's exponent is a useless waste of space, and maintaining it is a useless waste of clock ticks.

Moreover, if you _really_ want to make sure that your application is going to work, there's no savings to be had using floating point -- you still have to analyze your algorithm from one end to the other to make sure that the floating point type that you have chosen is still 'big' enough to work. For many practical problems where the hardware uses 16-bit ADCs, a 24-bit mantissa isn't enough -- and double precision floating point _always_ uses scads of processor resources.
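
A two-line way to see that 24-bit ceiling (my own toy example, not tied to any particular control loop): once a float accumulator reaches 2^24, adding 1 does nothing, while a plain 32-bit integer keeps counting.

// The float mantissa is 24 bits, so at 16777216 the spacing between
// representable values is already 2.0 and the +1 is silently dropped.
static void mantissaDemo() {
    float fAcc = 1 << 24;
    fAcc += 1.0f;
    System.out.println(fAcc);   // 1.6777216E7 -- unchanged

    int iAcc = 1 << 24;         // the same accumulator as a 32-bit integer
    iAcc += 1;
    System.out.println(iAcc);   // 16777217
}
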
--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" gives you just what it says.
See details at http://www.wescottdesign.com/actfes/actfes.html
Reply to
Tim Wescott

I don't see why subtraction is slow; e.g. in my Java implementation (and in Knuth's description) it is implemented via add, which works with shifts and integer add and sub only:

public static int add(int u, int v) {
    int fu = u & 0x7FFFFF;
    int eu = ((u >> 23) & 0xff) - 127;

    int fv = v & 0x7FFFFF;
    int ev = ((v >> 23) & 0xff) - 127;

    if (fv == 0) return u;
    if (fu == 0) return v;

    if (u < 0) fu = -fu;
    if (v < 0) fv = -fv;

    if (eu < ev) {
        if (ev - eu > 24) return v;
        fu >>= ev - eu;
        return normalize(fu + fv, ev);
    } else {
        if (eu - ev > 24) return u;
        fv >>= eu - ev;
        return normalize(fu + fv, eu);
    }
}

public static int sub(int u, int v) {
    return add(u, v ^ 0x80000000);
}

Just for completeness, the normalize method, including the TODO :-)

private static int pack(int f, int e) {
    boolean s = f < 0;
    if (s) f = -f;
    int z = f | ((e + 127) << 23);
    if (s) z |= 0x80000000;
    return z;
}

private static int normalize(long f, int e) {
    boolean s = f < 0;
    if (s) f = -f;
    int c = 0;
    while (f > 0x7fffff) {
        f /= 2;
        e++;
        if (c++ > 64) break;
    }
    // TODO: rounding
    if (s) f = -f;
    return pack((int) f, e);
}

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply to
Frank Buss

Hi Tim, Thanks for the quick response. I totally understand what you have explained. But let's take the same example you have given. Let us assume we have a microprocessor that does only 4-bit multiplies. Let's say the numbers are now 11.01 and 01.01. For the hardware, the binary point is not really a concern, so really it's doing 1101 * 0101. Now the answer is 0100.0001. This is where I lose my wheels. What comes out of the microprocessor (assuming we have enough data word bits to obtain the result) is 11010101. How do I know where the binary point is? That number can be interpreted in a myriad of ways.

Can it be that we first scale the numbers to a known place for the binary point, so in this case .1101 and .0101, and now WE KNOW that the answer is .01000001? Is this how the hardware design is done?

Regards, Sai.


Reply to
skilambi

Where did you get that last number? By pasting 1101 to the left side of 0101?

     1101
   x 0101
   ------
     1101
   1101
   ------
  1000001

Since you happen to know, a priori, that there are two bits to the right of the radix point on the multiplier and multiplicand, you then also know, a priori, that there are four (sum of 2 and 2) bits to the right of the new number's radix point.

So, start out assuming the radix point at the far right, as though they truly were simple integers: 1000001., and then move it leftwards four places: 100.0001

This much is grade school math, by the way. 6th or 7th, if I recall.
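
The same bookkeeping in code (a small sketch of the idea; the Q2.2 name and the final division are mine, just for display):

// Multiply two Q2.2 values (2 integer bits, 2 fraction bits).  The hardware
// does an ordinary 4x4 integer multiply; only the interpretation changes.
static double q22Multiply(int a, int b) {
    int raw = a * b;        // 0b1101 * 0b0101 = 0b1000001 = 65
    return raw / 16.0;      // 2 + 2 = 4 fraction bits: 65 / 2^4 = 4.0625
}

q22Multiply(0b1101, 0b0101) gives 4.0625, i.e. 100.0001 with the radix point moved four places to the left.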

Not so.

The only complexity at all for fixed-point multiplication is handling signed values. And even that isn't complicated. You might look up Booth's algorithm, memory serving. There should be discussions covering both the unsigned and signed cases. It's not complex.

Also read:

formatting link

Which may help you in understanding some of the ways fixed point notation is written.

Floating point can be complicated if you are looking for a purely combinatorial approach. The only folks I know of who, I think, did it for floating point division were Bipolar Integrated Technologies in Beaverton, Oregon. But I could be wrong about it.

Jon

Reply to
Jonathan Kirwan

This is basic mathematics. Assuming the same number of fraction bits n, it looks like this:

a*2^n * b*2^n = a*b*2^(2*n) = (a*b*2^n) * 2^n

So you have to shift the result right by n bits after the multiplication.

If you have only n bits, but want to multiply bigger numbers, it's again basic mathematics. Assuming b and d are the lower bits and a and c are the higher bits, you can calculate it like this (where multiplying by 2^n simply means shifting):

(a*2^n+b)*(c*2^n+d)=2^(2*n)*a*c+2^n*b*c+2^n*a*d+b*d
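
A software sketch of the same identity with n = 16 (my own illustration; the shifts are exactly the *2^n factors):

// 32x32 -> 64-bit unsigned multiply built from 16x16 partial products,
// following (a*2^n + b)*(c*2^n + d) with n = 16.  The low 32 bits of u and
// v are the operands; the result is the full product, read as unsigned.
static long mul32(long u, long v) {
    long a = (u >>> 16) & 0xFFFF, b = u & 0xFFFF;
    long c = (v >>> 16) & 0xFFFF, d = v & 0xFFFF;
    return ((a * c) << 32)              // 2^(2n) * a*c
         + ((a * d + b * c) << 16)      // 2^n * (a*d + b*c)
         + b * d;                       // b*d
}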

PS: Please quote these newsgroup postings when answering your homework :-)

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply to
Frank Buss

Suppose the hardware has registers and operators (add, multiply) for 16-bit data. You can elect, in your head, to interpret integer 1101 as 11.01 volts, and integer 206 as the gain of an amplifier = 2.06.

Multiply to get the volts times the gain, integer result being 226806. Mentally shift that 4 places to the right and get 22.6806, the actual voltage. So every multiply means you pick up two decimal digits. Add/sub just work.

One easier way to think of things is to use fractional notation: define the full-scale value of a variable, and associate it with integer 65535. If a 16 bit unipolar ADC has a full-scale output of 10 volts, we'd say...

Vmax = 9.99985 volts == integer 65535, 0xFFFF == 0.999985 fractional

So 2.5 volts is represented by integer value 16384, 0x4000.

Now a 2:1 voltage divider can be expressed as

Gain = 0.5 == 32768

Volts * divider = 8192 == 1.25 volts

So all 16-bit integers are treated as fractions below 1.000. When you multiply the fractions, do a 16x16 integer multiply and throw away the low 16 bits of the result. Now you have a consistent notation and avoid a lot of ad-hoc shifting; a fractional times a fractional is always a fractional. The Xilinx FPGA software allows you to make signed or unsigned, regular or saturating add/sub/mul fractional operations, and I think it may even do divides, with the obvious gotchas. Fractionals are great for things like calibration factors.
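
The fractional multiply described above boils down to a one-liner (a sketch of the idea, not John's actual library):

// Q0.16 fractional multiply: multiply to 32 bits, throw away the low 16.
static int fracMul(int a, int b) {
    return (int) (((long) a * b) >>> 16);
}

fracMul(16384, 32768) -- 2.5 volts times a gain of 0.5 -- returns 8192, the 1.25 volts in the example above.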

Two other ways to handle real-world math using only integers, both of which I've used on recent projects:

  1. Express everything as the engineering units times 1000 (like, 2.5 volts is 25000) and remember to do three mental shifts after any multiply. I often use 32 or 64-bit signed integers here. You can only do a few multiplies before you run out of bits. You can fix *that* by doing binary right shifts, but then all the scalings change. That gets tiresome quick.

  2. Use 32.32 format: every variable is a signed 32-bit integer followed by a 32-bit fraction, 64 bits total.

5.5 volts becomes 0x 0000 0005 . 8000 0000

which is enough range to handle most any real-world problem. The add and subtract operations are easy, since no normalizations are needed. I have a little math library to do add/sub/mul/div, signed, saturating, in this format.
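
On anything with 64-bit integers the same 32.32 idea fits in one machine word; here is a minimal sketch of the arithmetic (my own, not John's library; Math.multiplyHigh needs Java 9 or later):

// Signed Q32.32: a value x is stored as the 64-bit integer round(x * 2^32).
static final long ONE = 1L << 32;                    // 1.0 in Q32.32

static long q32FromDouble(double x) { return (long) (x * ONE); }

static long q32Add(long a, long b) { return a + b; } // no normalization needed

static long q32Mul(long a, long b) {
    // Form the 128-bit product, then drop the low 32 bits to return to Q32.32.
    long hi = Math.multiplyHigh(a, b);               // top 64 bits of the product
    long lo = a * b;                                 // bottom 64 bits (wrapping is fine)
    return (hi << 32) | (lo >>> 32);
}

q32FromDouble(5.5) is 0x0000000580000000, the same bit pattern as the 5.5 volt example above.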

Some famous person said: "If you absolutely have to use floating point, you don't understand the problem."

Here's a horrible example of case (1), from an edit window I just happen to have open now. The CPU is the 68332, with 32-bit registers.

.SBTTL . F2R : SCALE ENG UNITS FREQUENCY TO RAW DDS FORMAT

; D0:D1 IS FREQUENCY IN mHz, 64 BITS, SIGNED, 40 MHz MAX MAYBE
; D1 IS OUR OUTPUT, IN DDS UNITS, 0.014901 HZ/LSB
;
; MAXH:MAXL, IN THE CAL TABLE, HOLD THE MAX ALLOWED INPUT FREQUENCY, IN mHz

; WE "JUST" NEED TO MULTIPLY BY 2^32 / 64E9 == 0.067109 == 2^17 / 1953125

F2R:    MOVEM.L D2 D4 D5, -(SP)
        MOVE.L  D0, D6          ; REMEMBER INPUT SIGN
        BPL.S   F2EX            ; AND SKIP FOR POS
        NEG.L   D1              ; IF NEGATIVE,
        NEGX.L  D0              ; FLIP FOR CONVENIENCE

F2EX:   MOVE.L  MAXH.W, D4      ; FETCH MS PART OF MAX FREQUENCY
        MOVE.L  MAXL.W, D5      ; AND LS, FROM CAL TABLE

        CMP.L   D4, D0          ; POS: CHECK MS LONGWORD
        BHI.S   FPMAX           ; TOO HIGH? CLIP TO MAX
        BLO.S   F2GO            ; LOWER? THAT'S SAFE
        CMP.L   D5, D1          ; IF MS IS TIE, CHECK LS
        BLS.S   F2GO            ; IF OK, USE HIS VALUE

FPMAX:  MOVE.L  D4, D0          ; NAUGHTY BOY! FORCE MAX FREQ
        MOVE.L  D5, D1

F2GO:   MOVE.W  #17-1, D5       ; SET UP FOR 17-BIT LEFT SHIFT
F217:   LSL.L   #1, D1          ; THIS COULD BE DONE A LOT
        ROXL.L  #1, D0          ; FASTER, BUT IT DOESN'T
        DBF     D5, F217        ; MUCH MATTER HERE.

        MOVE.L  #1953125, D2    ; SCALE DOWN,
        DIVU.Q  D2, D0:D1       ; RESULT IN D1

F2NOR:  TST.L   D6              ; RECOVER ORIGINAL SIGN
        BPL.S   F2Z             ; AND FLIP RESULT
        NEG.L   D1              ; IF NEEDED

F2Z:    MOVEM.L (SP)+, D2 D4 D5
        RTS                     ; D1 IS THE ANSWER.

John

Reply to
John Larkin

Even addition is nasty, what with the normalizations.

Addition and subtraction are the same, really; just flip the sign of the second thingie.

FP bugs are common, in hardware or software, especially if people are trying to push speed. Fixed-point ALUs rarely have bugs.

John

Reply to
John Larkin

(code snipped)

If you're implementing it in Java, everything is slow.

It's the normalizations that slow it down, so any time you end up subtracting the mantissas (i.e. any like-signed subtraction, or differently-signed add), then the mantissa has to do the rumba (shift left, cha cha cha) while the exponent counts the beat.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" gives you just what it says.
See details at http://www.wescottdesign.com/actfes/actfes.html
Reply to
Tim Wescott

The "decimal" point isn't necessarily fixed in "fixed point" arithmetic. There is no exponent though.

Because of the exponent.

Neither are "hard", given an unlimited transistor budget. ;-) In fact the fixed point unit I worked on was every bit as complicated as the floating point unit.

Not at all clear.

No, there aren't two paths. As you note, bits are bits.

A floating point unit must treat the mantissa and exponent differently. A multiply would add the exponent and multiply the mantissa. What sort of "example" do you want?

--
Keith
Reply to
krw

There are fast algorithms to calculate the log2:

formatting link
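
(For comparison, the software version can be a one-liner -- my sketch, not necessarily what the linked page shows; Java's intrinsic maps onto a count-leading-zeros instruction where the CPU has one.)

// floor(log2(x)) for x > 0.
static int log2(int x) {
    return 31 - Integer.numberOfLeadingZeros(x);
}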

Of course, in hardware it is easy to implement it in parallel:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity test is
    port(
        log2:     out integer range 0 to 3;
        argument: in  unsigned(7 downto 0)
    );
end entity test;

architecture rtl of test is
begin

    -- priority encoder; the result saturates at 3 to fit the declared range
    log2 <= 3 when argument(7 downto 3) /= 0 else
            2 when argument(2) = '1' else
            1 when argument(1) = '1' else
            0;

end architecture rtl;

Reply to
Frank Buss

(HDL code snipped)

I think your fast is my slow, and your small is my big.

On a fixed point machine that doesn't have hardware acceleration for normalization, 32-bit IEEE floating point hovers around 50x slower than integer math -- and that's a machine that does multi-cycle integer multiplies.

That can be the difference between a processor not being a contender for a product (or a product not being a contender on the market) and a processor/product being just right for the job.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" gives you just what it says.
See details at http://www.wescottdesign.com/actfes/actfes.html
Reply to
Tim Wescott

If you use a very wasteful base-256 exponent, you can get the speed up quite a bit. It may seem awful, but when you have the extra RAM it is a lot better than doing IEEE floats.
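
If I read that right, the win is that normalization only ever moves whole bytes, which a small CPU can do with byte moves instead of bit-at-a-time shifts. A rough sketch of the idea (my own illustration; the format details are invented):

// Toy base-256 float: value = mantissa * 256^exponent, with the mantissa
// kept "byte-normalized" (top byte of the 32-bit field nonzero).  Up to 7
// mantissa bits are wasted, but renormalizing moves whole bytes only.
static long[] normalize256(long mantissa, int exponent) {
    while (mantissa != 0 && (mantissa >>> 24) == 0) {
        mantissa <<= 8;          // slide up one whole byte
        exponent--;
    }
    while ((mantissa >>> 32) != 0) {
        mantissa >>>= 8;         // slide down one whole byte
        exponent++;
    }
    return new long[] { mantissa, exponent };
}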

Reply to
MooseFET

Obviously a FORTH bigot. ;)

This is right in instruments, but not e.g. in a math program, where part of the point is that you don't want to limit what the user can do. The dynamic range of any instrument is far less than that of an IEEE single-precision float.

Cheers,

Phil Hobbs

Reply to
Phil Hobbs

Not a DDS synthesizer. Or a digital delay generator. Some of the DDGs I've done wouldn't even work in double floats.

John

Reply to
John Larkin

But less than you often need in integrators in control systems, where 24 or 25 bits doesn't quite cut it (but 32 almost always does).

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" gives you just what it says.
See details at http://www.wescottdesign.com/actfes/actfes.html
Reply to
Tim Wescott

Fair enough in that case--time and frequency is a teensy little bit more precise than ADCs, it's true. ;)

Still, though, you know in advance the step size you'll be using, which helps a lot in designing the arithmetic.

IEEE 754 floating point is great in theory, but there's no hardware support on any CPU I know of for some of its features--especially denormalized numbers.(*) The unimplemented features have to be emulated by the runtime handling the EM_DENORMAL exception in software on _every_ _single_ _operation_, which takes *forever*. On my big EM simulator, I have to fill the entire volume with noise of a few times FLT_MIN before starting, to avoid it running like molasses for several hundred iterations until all the denormals get flushed out. Now *that's* stupid and inefficient. (Some compilers support turning off denormals (it's usually called "flush to zero"), but gcc doesn't.)

IEEE 754 assumes that there'll be hardware acceleration, which there is on higher-end CPUs--they've got all those transistors available, so they might as well do something useful with them. POWER 6 CPUs dispatch 2 floating point and one integer operation per clock, iirc.

If the guy designing the instrument is also designing the arithmetic, all is copacetic--though laborious. Otherwise, give me hardware FP any day.

So I think that maxim of yours goes in the 'curmudgeonly half-truth' category, along with my favourite one of Rutherford's: "If your experiment needs statistics, you ought to have done a better experiment."

Cheers,

Phil Hobbs

(*) Denormals are numbers whose exponents are more negative than can be represented in the IEEE format--so the exponent stays pegged low and digits shift off to the right as the number shrinks. It's a theoretically cute way of reducing rounding error due to underflow when summing terms of widely different sizes. In practice it's a big nuisance.
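
(A quick way to see one, in Java terms -- my example:)

static void denormalDemo() {
    System.out.println(Float.MIN_NORMAL);       // smallest normalized float, about 1.18e-38
    System.out.println(Float.MIN_NORMAL / 8);   // smaller still: exponent pegged, now a denormal
    System.out.println(Float.MIN_VALUE);        // about 1.4e-45, only the last mantissa bit left
}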

Reply to
Phil Hobbs

Okay, so between you and John and me, I think we agree that the resolution of a float is usually but not always enough, but the dynamic range (2**128:1) is always enough for instruments.

The PC real-time clock needs 32 bits too (and will be needing 33 bits starting in 2038).

But getting general-purpose FP arithmetic right is hard (any volunteers for hand-coding signed FP divisions?) and getting it efficient as well as right is really hard. Knuth Vol 2 is mostly about arithmetic, and it's an enlightening read.

For special-purpose use on small CPUs, I'd probably use fixed point as well. (Of course, as far as embedded stuff goes, I probably belong in a twelve-step program--"Hi, my name is Phil, and I program microcontrollers in C.")

Cheers,

Phil Hobbs

Reply to
Phil Hobbs
