Fixed vs Float ?

Hello all,

Concerning digital filters, particularly IIR filters, is there a
preferred approach to implementation? Are fixed-point calculations
preferred over floating-point? I would be tempted to say yes, but my
Google search results leave me baffled, for it seems that floating-point
computations can be just as fast as fixed-point.
Furthermore, assuming that fixed point IS the preferred choice, the
following question crops up:
If the input to the digital filter is 8 bits wide and the coefficients
are 16 bits wide, then it stands to reason that the products between
the coefficients and the filter's intermediate data values will be 24
bits wide. However, when this 24-bit value goes back into the delay
element network (which is only 8 bits wide), some (understatement)
resolution will be lost. How is this resolution loss dealt with, so
that it doesn't lead to an erroneous filter?


Re: Fixed vs Float ?

This is a simple question with a long answer.

Floating point calculations are always easier to code than fixed-point,
if for no other reason than you don't have to scale your results to fit
the format.

On a Pentium in 'normal' mode floating point is just about as fast as
fixed point math; with the overhead of scaling floating point is
probably faster -- but I suspect that fixed point is faster in MMX mode
(someone will have to tell me).  On a 'floating point' DSP chip you can
also expect floating point to be as fast as fixed.

On many, many cost effective processors -- including CISC, RISC, and
fixed-point DSP chips -- fixed point math is significantly faster than
floating point.  If you don't have a ton of money and/or if your system
needs to be small or power-efficient fixed point is mandatory.

In addition to cost constraints, floating point representations use up a
significant number of bits for the exponent.  For most filtering
applications these are wasted bits.  For many calculations using 16-bit
input data the difference between 32 significant bits and 25 significant
bits is the difference between meeting specifications and not.
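That exponent overhead is easy to demonstrate. Here is a minimal sketch (my own illustration, not from the post), round-tripping a value through IEEE single precision with Python's struct module to show that only 24 significant bits survive:

```python
import struct

def to_f32(x):
    """Round-trip a Python float through IEEE single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

# 2^30 + 1 needs 31 significant bits; a float32 mantissa carries only 24,
# so the low-order 1 is rounded away. A 32-bit integer holds it exactly.
big = float(1 << 30)
print(to_f32(big + 1.0) == big)   # True: the +1 is lost
```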

For _any_ digital filtering application you should know how the data
path size affects the calculation.  Even though I've been doing this for
a long time I don't trust to my intuition -- I always do the analysis,
and sometimes I'm still surprised.

In general for an IIR filter you _must_ use significantly more bits for
the intermediate data than for the incoming data.  Just how much depends
on the filtering you're trying to do -- for a 1st-order filter your
precision usually needs to be better than the fraction of the sampling
rate you're trying to filter at; for a 2nd-order filter you need to go
down to that fraction squared*.  So if you're trying to implement a
1st-order low-pass filter with a cutoff at 1/16th of the sample rate you
need to carry more than four extra bits; if you wanted to use a
2nd-order filter you'd need to carry more than 8 extra bits.
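To make that 1/16th-of-the-sample-rate example concrete, here is a sketch (my own illustration, not Tim's code) of a 1st-order low-pass, y += (x - y)/16, done with integer shifts. With the state held at the input width, the right shift dead-bands and a step response stalls short of its target; carrying extra fractional bits in the state lets it settle:

```python
def narrow_state_filter(xs):
    """State has no extra fractional bits: the >> 4 update dead-bands."""
    y = 0
    out = []
    for x in xs:
        y += (x - y) >> 4              # floor shift discards the remainder
        out.append(y)
    return out

def wide_state_filter(xs, frac_bits=8):
    """State carries frac_bits extra LSBs; output is the truncated top."""
    acc = 0                            # wider-than-input state register
    out = []
    for x in xs:
        acc += ((x << frac_bits) - acc) >> 4
        out.append(acc >> frac_bits)
    return out

step = [100] * 200                     # step input of amplitude 100
print(narrow_state_filter(step)[-1])   # stalls well short of 100
print(wide_state_filter(step)[-1])     # settles within one LSB of 100
```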

Usually my knee-jerk reaction to filtering is to either use
double-precision floating point or to use 32-bit fixed point in 1r31
format.  There are some less critical applications where one can use
single-precision floating point or 16-bit fractional numbers to
advantage, but they are rare.

* There are some special filter topologies that avoid this, but if
you're going to use a direct-form filter out of a book you need fraction^2.


Tim Wescott
Wescott Design Services
Re: Fixed vs Float ?

Oops -- thought I was responding on the dsp newsgroup.

Everything I said is valid, but if you're contemplating doing this on an
FPGA the impact of floating point vs. fixed is in logic area and speed
(which is why fast floating point chips are big, hot and expensive).
Implementing an IEEE compliant floating point engine takes a heck of a
lot of logic, mostly to handle the exceptions.  Even if you're willing
to give up compliance for the sake of speed you still have some
significant extra steps you need to take with the data to deal with that
pesky exponent.  I'm sure there are various forms of floating point IP
out there that you could try on for size to get a comparison with
fixed-point math.


Tim Wescott
Wescott Design Services
Re: Fixed vs Float ?
I don't take issue with anything Tim stated, but I will add a few
comments.

I think that the added complexity of floating point in an FPGA will
probably be enough to rule it out.

Fixed point implementations are often better than floating point
implementations. This comparison tends to hold when the result of a
multiplication is twice the width of the inputs in the fixed point case
and when a floating point result is the same size as its inputs. This is
usually the case in a DSP processor. This also assumes that you use a
filter structure that takes advantage of the long result.

Most IIR filters are constructed as cascaded biquads (and sometimes one
first order section). The choice of the biquad structure has a
significant impact on performance. If we restrict our choices to one of
the direct forms, then usually the direct form I (DF I) structure is best
for fixed point implementations. This assumes that we have a double wide
accumulator. If this is not the case, the DF I is not a particularly good
structure. Floating point implementations are usually implemented as DF
II or the slightly better transposed DF II.
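A minimal fixed-point DF I sketch (mine; the 2r14 coefficient format and 16-bit data are illustrative assumptions, not from the post). The five products are summed in a double-wide accumulator and the result is quantized only once per output sample:

```python
COEF_FRAC = 14   # 2r14 coefficients: 2 integer bits, 14 fractional bits

def df1_biquad(xs, b, a):
    """Direct form I biquad: b = (b0, b1, b2), a = (a1, a2) as 2r14
    integers, xs as 16-bit integer samples."""
    x1 = x2 = y1 = y2 = 0
    out = []
    for x in xs:
        # every product lands in the wide accumulator untruncated
        acc = (b[0] * x + b[1] * x1 + b[2] * x2
               - a[0] * y1 - a[1] * y2)
        y = acc >> COEF_FRAC           # the single quantization point
        x2, x1, y2, y1 = x1, x, y1, y
        out.append(y)
    return out

# sanity check: b0 = 1.0 in 2r14, all else zero, is a passthrough
print(df1_biquad([100, -50, 3], (1 << 14, 0, 0), (0, 0)))   # [100, -50, 3]
```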

You can also improve the performance of a fixed point DF I by adding
error shaping. This is relatively cheap from a resource point of view in
this structure.
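Error shaping here means feeding the error made by the word-shortening step back into the next sample's sum. A sketch on a 1st-order low-pass (my own illustration, not Al's code): the plain truncating version of this filter dead-bands near DC, while the error-fed-back version settles to the exact DC value:

```python
def lowpass_error_shaped(xs, shift=4):
    """y += (x - y) >> shift, with the bits lost in the shift carried
    into the next sample's sum (first-order error feedback)."""
    y = 0
    err = 0                            # remainder discarded by the last shift
    out = []
    for x in xs:
        s = (x - y) + err
        y += s >> shift                # quantized update
        err = s & ((1 << shift) - 1)   # low bits fed back next sample
        out.append(y)
    return out

step = [100] * 400
print(lowpass_error_shaped(step)[-1])  # settles at exactly 100
```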

As Tim pointed out, you have to pay attention to scaling with fixed
point.

Like every design problem, you need to examine the performance
requirements carefully. I would look at the pole-zero placement on the
unit circle. If you need a high-Q filter at some low frequency compared
to the sampling rate, the math precision is going to be critical. The
poles might not be on the unit circle, but they will be very close. If
the precision is poor, the filter is likely to blow up. In other
situations, just about anything will work.

Here is a good link describing biquad structures:

Al Clark
Danville Signal Processing, Inc.
Re: Fixed vs Float ?

(newbie) Question:
At the intermediate nodes between biquad structures in a cascaded-biquad
IIR filter design (employing the fixed-point approach), the resolution
of the (extended) accumulator output must be scaled down to the width of
the data bus. Right?

(Internal musing: that would require a fixed-point divider; I wonder
how many cycles division takes?)

The scaling-down to the original data bus width is required because the
next biquad filter in the cascaded structure is expecting an input of n
bits, n being the number of bits in the data bus and m being the number
of bits in the coefficients. Correct?

Would/can that cause problems (that perhaps are not obvious to me
right now)?
Are there any tools (freeware) that permit cascading structures?

Re: Fixed vs Float ?


Yes, this is usually the case.


Why a division? For example, if I have a number in 1.63 format (1 sign
bit, 63 fractional bits), I can either round or truncate the result to
1.31 format.
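In code that shortening is just a shift, no divider in sight (a sketch of my own, assuming two's-complement data; Python's >> is an arithmetic floor shift, which matches truncation of a fractional word):

```python
def truncate_1_63_to_1_31(acc):
    """Keep the top 32 bits of a 1.63 value: arithmetic shift right 32."""
    return acc >> 32

def round_1_63_to_1_31(acc):
    """Add half an output LSB before discarding the low 32 bits."""
    return (acc + (1 << 31)) >> 32

v = 3 << 31                        # a value of 1.5 output LSBs
print(truncate_1_63_to_1_31(v), round_1_63_to_1_31(v))   # 1 2
```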



The quantizer (the process that shortens the fixed-point word) is going
to have a very small effect.


Almost all filter programs assume cascade structures. I don't know much
about the free ones. I use QEDesign 1000, which is very good. One of
the advantages of QED is that the programmer is a very good DSP guy.
This is not generally the case.

Matlab is also very popular for filter design.

Al Clark
Danville Signal Processing, Inc.
Re: Fixed vs Float ?

I compared an [8th-order Chebyshev lowpass filter, 16-bit fixed point]
with a [2nd-order Chebyshev lowpass filter, 16-bit fixed point, whose
frequency response I applied four times over so as to emulate a 4x2nd-
order cascade].
I used WinFilter freeware.
(I do not yet know whether all 2nd-order IIR filters can be called
biquads. Have to look into that...)

Anyway, based on the attenuation evaluated from the frequency responses
of both filters (8th-order and 4x2nd), the 8th-order filter was clearly
the better filter. Its attenuation was stronger and faster (greater
rolloff rate). The 4x2nd structure did eventually outperform the
attenuation of the 8th-order filter, but only because the 8th-order
filter had reached its 16-bit attenuation floor.
The 4x2nd-order filter structure's frequency response (attenuation) was
most definitely NOT sharp!

Stability? Based on the pole placements of the 2nd-order filter, the
2nd-order filter is FAR more stable than the 8th-order filter. Its
poles are nowhere near the unit circle's circumference. On the other
hand, the 8th-order filter's poles are located nearer the unit circle's
circumference (than the 2nd-order filter's poles), but I would not say
that the poles are hugging the unit circle's circumference, except for
2 of the 8 poles, which are located near +j and -j. Nonetheless, the
poles were evaluated using 16-bit limited precision and consequently
were displaced (at least I assume they were) from their
infinite-precision theoretical locations. Thus, since the poles were
ALL found inside the unit circle, the IIR filter should be stable.
(I have a feeling I am leaving myself wide open for a finger-wagging
here.)

Thus, my question is:
Why are cascaded-biquad structures preferred over non-composite higher-
order filters, if the attenuation pays such a high price? For IIR
filters, of course.

Thx in advance

Re: Fixed vs Float ?

You misunderstood what was said.

If you want to implement that 8th-order Chebychev filter you can choose
several methods.  You might think that the most sensible thing to do
would be to implement it as an 8th-order direct form filter.  If you did
you would be wrong.  Why?  Because the pole locations of a filter are
sensitive to the accuracy of the coefficients, and this sensitivity
increases sharply as the filter order goes up.

For a 1st-order filter the pole sensitivity is roughly equal to the
precision of the coefficient, so a 1r15 coefficient will give you a pole
that is no more than 2^-15 off from target.  For a 2nd-order filter the
pole sensitivity is roughly equal to the square root of the precision of
the coefficient, so a 1r15 coefficient will give you a pole that could
be off by as much as 0.006.  Note that in some systems this amount of
variation could make or break the system performance.  Extend this to an
8th-order system and your 1r15 coefficient gives you poles that will
wander by as much as 0.27 -- that's going to be a pretty useless filter!
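That square-root sensitivity is easy to check numerically. A sketch (mine; the repeated pole at z = 0.99 and the 2^-14 quantization grid are illustrative assumptions): quantizing the denominator coefficients of a biquad moves a double pole by a few thousandths, even though each coefficient only moved by a few hundred-thousandths:

```python
import cmath

def poles(a1, a2):
    """Roots of z^2 + a1*z + a2, the denominator of a biquad."""
    d = cmath.sqrt(a1 * a1 - 4 * a2)
    return ((-a1 + d) / 2, (-a1 - d) / 2)

def quantize(x, frac_bits=14):
    """Round to the nearest point on a 2**-frac_bits fixed-point grid."""
    step = 2.0 ** -frac_bits
    return round(x / step) * step

a1, a2 = -1.98, 0.9801             # double pole at z = 0.99
exact = poles(a1, a2)
quant = poles(quantize(a1), quantize(a2))
shift = max(abs(e - q) for e, q in zip(exact, quant))
print(shift)   # on the order of sqrt(2**-14), far above 2**-14 itself
```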

For pretty much the same reasons, the accuracy requirements of your
arithmetic go up with filter order.

So what you do is you take your filter and you break it into sections of
no more than 2nd-order each.  You implement each one of these
individually, and cascade them.  The transfer function of the cascade is
the product of the individual transfer functions so you get the response
that you need, but the accuracy requirements are no more than for
2nd-order sections, so you don't need to use an infinite number of bits
to do your work.

I have a pair of suggestions for you:

First, hie thee down to a bookstore and get a copy of "Understanding
Digital Signal Processing" by Richard G. Lyons.  It's a good book, and
it's written for people who need to know the stuff without experiencing
a lot of pain.

Second, think of posting (or cross-posting) questions like this to
comp.dsp.  Al and I both frequent that group; there are others there
(including Rick Lyons) who may have useful input.


Tim Wescott
Wescott Design Services
Re: Fixed vs Float ?


Right.  Or, since this _is_ an FPGA group, the data bus must be scaled
up to match the resolution of the accumulator.

Lopping off bits doesn't necessarily mean scaling the numbers down
numerically.  If you view your numbers as integers then throwing away
the least significant 16 and keeping the most significant 24 could be
seen as a divide operation -- but if you view your numbers as fractional
it's just disregarding some bits.

In either case it doesn't require a divider -- you're simply wiring up
the most significant bits, which may or may not involve changing the
apparent amplitude of the result by integer factors of two.

Should you wish to hold the input of the next filter to n bits then yes,
you have to do something with that extra-wide data bus coming out of
preceding filters.
It can/will cause problems with precision, but if you can analyze what
happens inside a filter section, you can analyze what happens in
between.


Tim Wescott
Wescott Design Services
Re: Fixed vs Float ?


Together with the decision between float and fixed, a question that is
at least as important is how many bits you need. The float format takes
care of the exponent adjustment, that is, the shifting to the left or
right; the number of bits in the mantissa, or in the fixed-point word,
determines the dynamic range of your result.

When there is pressure to save some macrocells, you should have a
closer look at what happens when you omit a given number of bits at a
given point in the computation.

Ing.Buero R.Tschaggelar
Re: Fixed vs Float ?
The length of the mantissa determines the resolution or precision.
The exponent determines the dynamic range.
Peter Alfke,
