Fixed vs Float?

Hello all,

Concerning digital filters, particularly IIR filters, is there a preferred approach to implementation? Are fixed-point calculations preferred over floating-point ones? I would be tempted to say yes. But my Google search results leave me baffled, for it seems that floating-point computations can be just as fast as fixed-point. Furthermore, assuming that fixed-point IS the preferred choice, the following question crops up: if the input to the digital filter is 8 bits wide and the coefficients are 16 bits wide, then it stands to reason that the products of the coefficients and the filter's intermediate data values will be 24 bits wide. However, when this 24-bit value is fed back into the delay element network (which is only 8 bits wide), some (understatement) resolution will be lost. How is this resolution loss dealt with? Will it lead to an erroneous filter?

-Roger


This is a simple question with a long answer.

Floating point calculations are always easier to code than fixed-point, if for no other reason than you don't have to scale your results to fit the format.

On a Pentium in 'normal' mode floating point is just about as fast as fixed point math; with the overhead of scaling floating point is probably faster -- but I suspect that fixed point is faster in MMX mode (someone will have to tell me). On a 'floating point' DSP chip you can also expect floating point to be as fast as fixed.

On many, many cost effective processors -- including CISC, RISC, and fixed-point DSP chips -- fixed point math is significantly faster than floating point. If you don't have a ton of money and/or if your system needs to be small or power-efficient fixed point is mandatory.

In addition to cost constraints, floating point representations use up a significant number of bits for the exponent. For most filtering applications these are wasted bits. For many calculations using 16-bit input data the difference between 32 significant bits and 25 significant bits is the difference between meeting specifications and not.
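To make the exponent-bits point concrete, here is a small Python sketch. The helper name `to_float32` and the sample value are made up for illustration. It pushes a 32-bit fixed-point value through IEEE single precision and shows that the lowest bits do not survive, because a 32-bit float spends 8 of its bits on the exponent:

```python
import struct

def to_float32(value):
    """Round a Python float (a double) to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

# A Q1.31 sample equal to 0.5 + 2**-31, held as a 32-bit integer.
q31_sample = (1 << 30) + 1

as_single = to_float32(float(q31_sample))
lost_lsbs = q31_sample - int(as_single)   # how much the conversion dropped

print(q31_sample, int(as_single), lost_lsbs)
```

The same 32 bits used as a fixed-point word keep that last LSB, which is the difference being described above.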

For _any_ digital filtering application you should know how the data path size affects the calculation. Even though I've been doing this for a long time I don't trust to my intuition -- I always do the analysis, and sometimes I'm still surprised.

In general for an IIR filter you _must_ use significantly more bits for the intermediate data than the incoming data. Just how much depends on the filtering you're trying to do -- for a 1st-order filter you usually need to do better than the fraction of the sampling rate you're trying to filter; for a 2nd-order filter you need to go down to that fraction squared*. So if you're trying to implement a 1st-order low-pass filter with a cutoff at 1/16th of the sample rate you need to carry more than four extra bits; if you wanted to use a 2nd-order filter you'd need to carry more than 8 extra bits.
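A rough Python sketch of that rule of thumb follows. The function names are invented here, and the one-pole filter with a feedback coefficient of 1/16 is only an assumed stand-in for the 1/16 example. It runs a small step through the filter with 0, 4 and 8 extra state bits so the effect of too few bits can be seen directly:

```python
import math

def extra_bits_rule_of_thumb(cutoff_fraction, order):
    """Extra state bits suggested by the rule of thumb: order * log2(1 / fraction)."""
    return order * math.log2(1.0 / cutoff_fraction)

def one_pole_lowpass(samples, shift, extra_bits):
    """Integer low-pass y += (x - y) / 2**shift with `extra_bits` extra LSBs of state."""
    half = (1 << extra_bits) >> 1            # rounding offset for the output
    state = 0
    out = []
    for x in samples:
        # Python's >> on a negative int truncates toward minus infinity.
        state += ((x << extra_bits) - state) >> shift
        out.append((state + half) >> extra_bits)
    return out

print(extra_bits_rule_of_thumb(1 / 16, 1), extra_bits_rule_of_thumb(1 / 16, 2))

step = [10] * 200                            # a small step on an 8-bit input
for bits in (0, 4, 8):
    print(bits, one_pole_lowpass(step, shift=4, extra_bits=bits)[-1])
```

With no extra bits the update term `(x - y) >> 4` is zero for this step, so the output never moves. With exactly four extra bits it stalls just short of the input, and with eight it settles on the input value.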

Usually my knee-jerk reaction to filtering is to either use double-precision floating point or to use 32-bit fixed point in 1r31 format. There are some less critical applications where one can use single-precision floating point or 16-bit fractional numbers to advantage, but they are rare.

* There are some special filter topologies that avoid this, but if you're going to use a direct-form filter out of a book you need fraction^2.
--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Posting from Google?  See http://cfaj.freeshell.org/google/

Oops -- thought I was responding on the dsp newsgroup.

Everything I said is valid, but if you're contemplating doing this on an FPGA the impact of floating point vs. fixed is in logic area and speed (which is why fast floating point chips are big, hot and expensive). Implementing an IEEE compliant floating point engine takes a heck of a lot of logic, mostly to handle the exceptions. Even if you're willing to give up compliance for the sake of speed you still have some significant extra steps you need to take with the data to deal with that pesky exponent. I'm sure there are various forms of floating point IP out there that you could try on for size to get a comparison with fixed-point math.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com


I don't take issue with anything Tim stated, but I will add a few comments.

I think that the added complexity of floating point in an FPGA will probably be enough to rule it out.

Fixed point implementations are often better than floating point implementations. This comparison tends to be true when the result of a multiplication is twice the width of the inputs in the fixed point case and when a floating point result is the same size as its inputs. This is usually the case in a DSP processor. This also assumes that you use a filter structure that takes advantage of the long result.

Most IIR filters are constructed as cascaded biquads (and sometimes one first order section). The choice of the biquad structure has a significant impact on performance. If we restrict our choices to one of the direct forms, then usually the direct form I (DF I) structure is best for fixed point implementations. This assumes that we have a double wide accumulator. If this is not the case, the DF I is not a particularly good structure. Floating point implementations are usually implemented as DF II or the slightly better transposed DF II.

You can also improve the performance of a fixed point DF I by adding error shaping. This is relatively cheap from a resource point of view in this structure.
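As an illustration only, here is a Python sketch of a fixed-point DF I biquad with a wide accumulator and optional first-order error feedback, which is one common form of error shaping. The Q2.14 coefficient format, the function name and the test filter are assumptions made for this example, and saturation is left out for brevity:

```python
FRAC = 14  # coefficients in Q2.14 (assumed; biquad coefficients can reach +/-2)

def df1_biquad(samples, b, a, error_shaping=False):
    """Direct form I biquad on integer samples.

    b = (b0, b1, b2) and a = (a1, a2) are integer Q2.14 coefficients.
    All five products are summed at full width and requantized once.
    """
    x1 = x2 = y1 = y2 = 0
    err = 0                        # low bits discarded by the previous output
    out = []
    for x0 in samples:
        acc = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        if error_shaping:
            acc += err             # feed the last truncation error back in
        y0 = acc >> FRAC           # truncate back to the data width
        err = acc - (y0 << FRAC)   # what the truncation threw away
        x2, x1 = x1, x0
        y2, y1 = y1, y0
        out.append(y0)
    return out

# Unity-DC-gain low-pass with a double pole at z = 0.5, chosen so that every
# coefficient is exact in Q2.14: b = 1/16, 1/8, 1/16 and a1 = -1, a2 = 1/4.
b = (1024, 2048, 1024)
a = (-16384, 4096)
step = [1000] * 100
plain = df1_biquad(step, b, a)
shaped = df1_biquad(step, b, a, error_shaping=True)
print(plain[-1], shaped[-1])
```

For this step input the truncating version settles slightly below the input level, while the error-feedback version reaches it. The extra cost is one register and one adder on the accumulator, which is why it fits the DF I structure so cheaply.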

As Tim pointed out, you have to pay attention to scaling with fixed point implementations.

Like every design problem, you need to examine the performance requirements carefully. I would look at the pole-zero placement on the unit circle. If you need a high-Q filter at some low frequency as compared to the sampling rate, the math precision is going to be critical. The poles might not be on the unit circle, but they will be very close. If the precision is poor, the filter is likely to blow up. In other situations, just about anything will work.
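The following Python sketch shows one way to check this before committing to a word length. It places a complex pole pair close to the unit circle at a low frequency, rounds the direct-form denominator coefficients to a given number of fractional bits, and reports where the pole ends up. The function name, the pole radius of 0.9995 and the frequency of 0.001 times the sample rate are all assumed values picked for illustration:

```python
import math

def quantized_pole(freq_fraction, radius, frac_bits):
    """Pole angle and radius after rounding direct-form a1, a2 to `frac_bits`.

    Assumes the quantized pole pair is still complex; otherwise acos() fails.
    """
    theta = 2.0 * math.pi * freq_fraction
    a1 = -2.0 * radius * math.cos(theta)
    a2 = radius * radius
    scale = 1 << frac_bits
    a1_q = round(a1 * scale) / scale
    a2_q = round(a2 * scale) / scale
    radius_q = math.sqrt(a2_q)
    theta_q = math.acos(-a1_q / (2.0 * radius_q))
    return theta_q, radius_q

theta = 2.0 * math.pi * 0.001
for bits in (14, 30):
    theta_q, radius_q = quantized_pole(0.001, 0.9995, bits)
    print(bits, radius_q, (theta_q - theta) / theta)
```

With 14 fractional bits the pole frequency lands far from the target, while with 30 the error is negligible. The same check on a pole at a quarter of the sample rate would show almost no sensitivity, matching the remark that in other situations just about anything will work.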

Here is a good link describing biquad structures:

formatting link

--
Al Clark
Danville Signal Processing, Inc.
--------------------------------------------------------------------
Purveyors of Fine DSP Hardware and other Cool Stuff
Available at http://www.danvillesignal.com


Hello,

(newbie) Question: At the intermediate nodes between biquads in a cascaded-biquad IIR filter design (using the fixed-point approach), the resolution of the (extended) output accumulator must be scaled down to the width of the data bus, right?

(Internal musing: that would require a fixed-point divider. I wonder how many cycles division takes?)

The scaling-down to the original data bus width is required because the next biquad filter in the cascaded structure is expecting an input of n bits, n being the number of bits in the data bus and m being the number of bits of the coefficients. Correct?

Would/can that cause problems (that perhaps are not obvious to me right now)? Are there any tools (freeware) that allow cascading structures?

Roger Bourne

"Roger Bourne" wrote in news: snipped-for-privacy@v46g2000cwv.googlegroups.com:

Yes, this is usually the case

Why a division? For example, if I have a number in 1.63 format (1 sign bit, 63 fractional bits), I can either round or truncate the result to 1.31 format.

Yes

The quantizer (the process that shortens the fixed-point word) is going to have a very small effect.

Almost all filter programs assume cascade structures. I don't know much about the free ones. I use QEDesign 1000, which is very good. One of the advantages of QED is that the programmer is a very good DSP guy. This is not generally the case.

Matlab is also very popular for filter design.

--
Al Clark
Danville Signal Processing, Inc.
--------------------------------------------------------------------
Purveyors of Fine DSP Hardware and other Cool Stuff
Available at http://www.danvillesignal.com

Right. Or, since this _is_ an FPGA group, the data bus must be scaled up to match the resolution of the accumulator.

Lopping off bits doesn't necessarily mean scaling the numbers down numerically. If you view your numbers as integers then throwing away the least significant 16 and keeping the most significant 24 could be seen as a divide operation -- but if you view your numbers as fractional it's just disregarding some bits.

In either case it doesn't require a divider -- you're simply wiring up the most significant bits which may or may not involve changing the apparent amplitude of the result by integer factors of two.
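In software the same idea is a shift, optionally preceded by an add for rounding. A minimal Python sketch, with a made-up function name and Q1.15 operands assumed for the example:

```python
def requantize(acc, drop_bits, rounding=True):
    """Drop `drop_bits` LSBs from an integer accumulator with an add and a shift."""
    if rounding and drop_bits > 0:
        acc += 1 << (drop_bits - 1)    # add half an output LSB, then truncate
    return acc >> drop_bits

half_q15 = 1 << 14                      # 0.5 in Q1.15
product_q30 = half_q15 * half_q15       # 0.25 in Q2.30, a double-width result
result_q15 = requantize(product_q30, 15)
print(result_q15, result_q15 / (1 << 15))
```

In an FPGA the shift costs nothing because it is only a choice of which accumulator bits are wired to the next stage. Rounding adds one adder, and plain truncation adds none.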

Should you wish to hold the input of the next filter to n bits then yes, you have to do something with that extra-wide data bus coming out of preceding filters.

It can/will cause problems with precision, but if you can analyze what happens inside a filter section you can analyze what happens in between sections.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com


I compared an [8th-order Chebychev lowpass filter, 16-bit fixed point] with a [2nd-order Chebychev lowpass filter, 16-bit fixed point, whose frequency response I multiplied by 4 so as to emulate a cascade of four 2nd-order sections]. I used the WinFilter freeware. (I do not yet know if all 2nd-order IIR filters can be called biquads. Have to look into that...)

Anyway, based on the attenuation evaluated from the frequency response of both filters (8th and 4x2nd), the 8th-order filter was clearly the better filter. Its attenuation was stronger and faster (greater rolloff rate). The 4x2nd structure did eventually outperform the attenuation of the 8th-order filter, but only because the 8th-order filter had reached its 16-bit attenuation floor. The frequency response (attenuation) of the 4x2nd-order structure was most definitely NOT sharp!

Stability? Based on the pole placements of the 2nd-order filter, the 2nd-order filter is FAR more stable than the 8th-order filter. Its poles are nowhere near the unit circle's circumference. On the other hand, the 8th-order filter's poles are located nearer the unit circle's circumference (than the 2nd-order filter's poles), but I would not say that the poles are shadowing the unit circle's circumference, except for 2 of the 8 poles, which are located near +j and -j. Nonetheless, the poles were evaluated using 16-bit limited precision and consequently were displaced (at least I assume they were) from their infinite-precision theoretical locations. Thus, since ALL the poles were found inside the unit circle, the IIR filter should be stable. (I have a feeling I am leaving myself wide open for a finger-waggling session)

Thus, my question is: for IIR filters, why are cascaded-biquad structures preferred over non-composite higher-order filters, since the attenuation pays such a high price?

Thx in advance

-Roger


You misunderstood what was said.

If you want to implement that 8th-order Chebychev filter you can choose several methods. You might think that the most sensible thing to do would be to implement it as an 8th-order direct form filter. If you did you would be wrong. Why? Because the pole locations of a filter are sensitive to the accuracy of the coefficients, and this sensitivity increases sharply as the filter order goes up.

For a 1st-order filter the pole sensitivity is roughly equal to the precision of the coefficient, so a 1r15 coefficient will give you a pole that is no more than 2^-15 off from target. For a 2nd-order filter the pole sensitivity is roughly equal to the square root of the precision of the coefficient, so a 1r15 coefficient will give you a pole that could be off by as much as 0.006. Note that in some systems this amount of variation could make or break the system performance. Extend this to an 8th-order system and your 1r15 coefficient gives you poles that will wander by as much as 0.27 -- that's going to be a pretty useless filter!
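One way to see where figures of this size come from is the worst case of N coincident poles: if the denominator is (z - p)^N and a coefficient error adds a small constant e to it, the roots move to p plus the N-th roots of e. The coincident-pole model and the names below are assumptions made for this sketch, not something stated in the post:

```python
import cmath

def worst_case_pole_shift(coef_error, order):
    """Root movement for `order` coincident poles: the order-th root of the error."""
    return coef_error ** (1.0 / order)

coef_error = 2.0 ** -15              # one LSB of a 1r15 coefficient
for order in (1, 2, 8):
    print(order, worst_case_pole_shift(coef_error, order))

# Check the order-8 case directly: (z - p)**8 = coef_error is solved by
# z = p + shift * exp(2j*pi*k/8), i.e. eight poles on a circle of radius `shift`.
p = 0.5
shift = worst_case_pole_shift(coef_error, 8)
roots = [p + shift * cmath.exp(2j * cmath.pi * k / 8) for k in range(8)]
residual = max(abs((z - p) ** 8 - coef_error) for z in roots)
print(residual)
```

The printed shifts are about 3e-5, 0.0055 and 0.27 for orders 1, 2 and 8, in line with the numbers quoted above.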

For pretty much the same reasons the accuracy requirements of your arithmetic go up with filter order.

So what you do is you take your filter and you break it into sections of no more than 2nd-order each. You implement each one of these individually, and cascade them. The transfer function of the cascade is the product of the individual transfer functions so you get the response that you need, but the accuracy requirements are no more than for 2nd-order sections, so you don't need to use an infinite number of bits to do your work.
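The product property can be checked numerically. The Python sketch below, using floating point and two arbitrary stable sections invented for the example, runs an impulse through a cascade of two 2nd-order sections and through the single 4th-order direct form obtained by multiplying the section polynomials together:

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists in powers of z**-1."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def direct_form(b, a, x):
    """y[n] = sum_k b[k] x[n-k] - sum_{k>=1} a[k] y[n-k], with a[0] == 1."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y

def cascade(sections, x):
    """Feed x through each (b, a) section in turn."""
    for b, a in sections:
        x = direct_form(b, a, x)
    return x

sections = [([0.2, 0.4, 0.2], [1.0, -0.5, 0.3]),
            ([0.1, 0.2, 0.1], [1.0, -1.2, 0.5])]
b_all = poly_mul(sections[0][0], sections[1][0])
a_all = poly_mul(sections[0][1], sections[1][1])

impulse = [1.0] + [0.0] * 49
h_cascade = cascade(sections, impulse)
h_direct = direct_form(b_all, a_all, impulse)
worst = max(abs(u - v) for u, v in zip(h_cascade, h_direct))
print(worst)
```

In exact arithmetic the two responses are identical. The practical difference only appears once coefficients and state are quantized, where each section of the cascade has 2nd-order sensitivity while the combined form has the sensitivity of the full order.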

I have a pair of suggestions for you:

First, hie thee down to a bookstore and get a copy of "Understanding Digital Signal Processing" by Richard G Lyons. It's a good book, and it's written for people who need to know the stuff without experiencing a lot of pain. This link will get you a copy:

formatting link

Second, think of posting (or cross-posting) questions like this to comp.dsp. Al and I both frequent that group; there are others there (including Rick Lyons) who may have useful input.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com


Together with the decision between float and fixed, the next question, which is at least as important, is how many bits you need. While the floating-point part takes care of the exponent adjustment, that is, the shifting to the left or right, the number of bits in the mantissa (or in the fixed-point word) determines the dynamic range of your result.

When there is pressure to save some macrocells, you should take a closer look at what happens when you omit how many bits at which operation.

Rene

--
Ing.Buero R.Tschaggelar - http://www.ibrtses.com
& commercial newsgroups - http://www.talkto.net

The length of the mantissa determines the resolution or precision. The exponent determines the dynamic range.

Peter Alfke

