Floating point vs fixed arithmetics (signed 64-bit)

kishor · 2012-03-26T09:22:21+00:00

Hi friends, I am working on stellaris LM3s6965 (cortex-m3) & Keil 4.20 for dataacquisition. ADC is signed 24-bit. To perform software Gain calibration I have two options, 1. 64-bit fixed width arithmetic uint16_t Gain; // 0x8000 means gain is 1 int32_t ADC_Reading; // It contains 24-bit signed integer ADCreading ADC_Reading = ((int64_t)ADC_Reading * Gain) / 0x8000; //Gain calibration // As multiplication of signed 24-bit & unsigned 16-bit will not fitinto 32-bit variable // I typecast it to int64_t. 2. Single precision Float float Gain; int32_t ADC_Reading; // It contains 24-bit signed integer ADCreading ADC_Reading = ADC_Reading * Gain; // Gaincalibration Which is better for performance wise. Thanks, Kishore.

D

David T. Ashley 14 years ago

Additional note to the OP: comp.lang.c will point you in the right direction as far as what is portable and what is not.

From memory, probably wrong ... I should look it up, but too lazy.

For unsigneds, no issues shifting in either direction. Works as intuitively expected.

For signeds ...

Signed left shifts work as expected. 0 is always propagated into the LSB.

Signed right shifts are, from memory, I believe, implementation dependent. It isn't guaranteed how the MSB will be populated.

Again, this is from memory and possibly wrong.

The suggestion of separating out the sign certainly prudent.

DTA

Vote

U

upsidedown 14 years ago

The x86 family is a bit strange case. The number of cycles required by trivial integer operations (adds, shifts) compared to more complex instructions like integer mul/div is nearly 1:1 and the floating point variants are not much worse. Even some complex cases such as floating point sin/cos are handled quite quickly.

One might even argue that the relative performance for primitive operations like shifts and adds are quite poor on x86 processors, compared to computationally intensive operations like sin/cos (requiring 3rd-8th order polynomial).

Vote

T

Tim Wescott 14 years ago

Be careful of 32-bit floating point. It is insufficient for a number of real-world tasks for which 32-bit fixed point is well suited. IEEE single-precision floating point gives you (effectively) a 25- or 26-bit mantissa (I can't remember how many bits it is, plus sign, plus implied

1). When integrator gains get low, that's not enough, where the extra factor of 128 or 64 available from well-scaled fixed point will save the day.

Be _very_ careful of 32-bit floating point in an Extended Kalman filter. Particularly if you're not using a square-root algorithm for the evolution of the variance matrix. You can run out of precision astonishingly quickly.

My liberal friends think I'm a conservative kook. My conservative friends think I'm a liberal kook. Why am I not happy that they have found common ground? Tim Wescott, Communications, Control, Circuits & Software http://www.wescottdesign.com

Vote

M

Mark Borgerson 14 years ago

IIRC, IEEE-854 is 8 bits exponent (offset by 128 ), one bit sign and

23-bit mantissa with an implied 1 bit as the 24th bit.

That's probably OK for FIR filters working on the results of 16-bit ADCs as long as the number of terms is reasonable ( evolution of the variance matrix. You can run out of precision

Thanks for the notes. I looked up the last time I ported someone else's code to a StrongArm processor. They did use doubles (64-bit FP). The chip didn't have an FPU and was running Linux. The standard FP library implementation did all the floating point calculations with software interrupts and performance truly sucked. We ended up revising all the code to use a special library that didn't use SWIs. It was still not as fast as we wanted. I'm not sure how much a 32-bit FPU will help with 64-bit FP calculations. One of these days I'll take a closer look at the IAR and STM signal processing libraries.

Mark Borgerson

Vote

T

Tim Wescott 14 years ago

It gets to be an issue when you're implementing IIR filters or PID controllers where the bandwidth of the filter or loop is much smaller than the sampling rate: in those circumstances, the difference between the maximum size of an accumulator and the size of an increment that needs to affect it can get to be a healthy portion of -- or more than --

2^25, and then you're screwed.

If I needed to implement a Kalman filter on a processor that would take a significant speed hit going to 64-bit floating point I'd take a close look at the square root algorithms. The basic idea is that you have to do more computation to carry the square root of the variance, but because it's a square root you pretty much cut your needed precision in half.

On a PC I rather suspect that using a square root algorithm would be a stupid waste of time -- but if brand B can do 32-bit floating point 50 times faster than 64-bit, the square root algorithm would probably win hands down.

Tim Wescott Control system and signal processing consulting www.wescottdesign.com

Vote

C

Clifford Heath 14 years ago

I think I recall that transition point occurring around 1994.

I was writing a scalable vector graphics subsystem, and carefully using integer (sometimes fixed-point) math wherever possible, only to find that, when I changed the basic type of the coordinate to float (or double, I can't recall) the system actually rendered *faster*.

The integer unit was busy computing addresses and array offsets, and being interrupted with *coordinate* math, while the FPU lay idle.

This was still in the Pentium days, before even the 686 and PII.

On a modern note, has anyone tried to use the TI OMAP ARM CPUs? I haven't looked at the DSP instruction set, but the hardware FP is sweet.

Clifford Heath.

Vote

M

Mark Borgerson 14 years ago

the OP may need to consider and that is granularity.

lower precision than a 32*32:64/32 because the float uses 23+1 bits to store the number. The other bits are exponent, and give dynamic range, but NOT precision.

would need to watch it very carefully.

Have you actually found and used a 32-bit ADC? For and ADC with a 5V range, that would mean just a few nanovolts per LSB!!!

libraries for Gain/Scale type calibrates, that use a 64 bit result in the intermediate steps.

My experience is that I'm lucky to get 20 noise-free bits on any system actually connected to an MPU (for a single conversion). Still, that would push the limits on FP with only 24 bits in the mantissa if I were to do any significant oversampling. I remember professors in chemistry and physics warning me that the uncertainty in my final result should have error limits corresponding the the precision of my inputs. Still, roundoff errors could eventually degrade the result past the limits of the input for some calculations.

The reality of the oceanographic sensors I work with is that 16 bits gets you right into the noise level of the real world for most experiments.

However, if you are doing long-term integrations of variable inputs, roundoff error could come back to haunt you.

Mark Borgerson

Vote

J

John Devereux 14 years ago

the OP may need to consider and that is granularity.

lower precision than a 32*32:64/32 because the float uses 23+1 bits to store the number. The other bits are exponent, and give dynamic range, but NOT precision.

would need to watch it very carefully.

Only actual chip I have heard of is a sigma-delta from TI. Of course

8-10 of these bit are marketing. I would look it up for you but the flash selection tool is still "initializing" for me on their site...

The best ADC I have seen is a HP 3458A meter, the equivalent of a 28 bit chip ADC.

It might just be possible to make a 32 bit ADC using a josephson junction array, if you have a liquid helium supply handy :)

[...]

John Devereux

Vote

A

Anders.Montonen 14 years ago

Off-topic, but as far as I can tell TI are not using Flash in any of their selection tools, only HTML5. Unfortunately their backend sometimes glitches out, usually when you need to look up one of their components.

Anyway, their ADS1281/1282 advertise a 31 bit resolution. The ADS1282-HT high-temperature variant is even available in DIP packaging for the low, low price of $218.75 ea.

-a

Vote

J

John Devereux 14 years ago

Oh really? Good for them. I apologise to TI, I admit I was using quite an old browser.

In fact it seems to work very well in a slightly more modern one. It is one of the few such manufacturer "selection tools" that uses the whole width of the browser window. Most are crippled to uselessness by some stupid marketeers desire to exactly control appearance.

John Devereux

Vote

T

Tim Wescott 14 years ago

If you do any filtering at all, the 25 bits of precision often matter with a _16_ bit ADC, when they aren't a show-stopper altogether. It wouldn't be sensible to even _think_ about filtering the output of a 24- bit ADC with single-precision floating point data paths unless the ADC had been exceedingly poorly chosen or applied, and had essentially useless content in the last several bits.

My liberal friends think I'm a conservative kook. My conservative friends think I'm a liberal kook. Why am I not happy that they have found common ground? Tim Wescott, Communications, Control, Circuits & Software http://www.wescottdesign.com

Vote

P

Paul 14 years ago

Because the marketeer or developer believe everyone has the same system and screen sie as them. Then it looks right when printed out on a piece of paper and handed to the board to look at. Don't even get me on fonts specified in pixels :)

Paul Carpenter | paul@pcserviceselectronics.co.uk PC Services Timing Diagram Font GNU H8 - compiler & Renesas H8/H8S/H8 Tiny For those web sites you hate

Vote

M

Mark Borgerson 14 years ago

I agree with your point about filtering with 16-bit ADCs. I generally implement FIRs with about 20 taps---which is easiy done with a 16 x 16 -> 32-bit MAC. There's no real advantage to floating point there, and with 16-bit data inputs, dynamic range is not a problem.

I've usually found that getting the full 24 bits from a 24-bit ADC is next to impossible. The CS5534 that I've used comes with a table that lists the effective number of bits vs cycle time. IIRC, need to go to

7-1/2 conversions per second to get over 20 bits. At 30 or 60 conversions per second, you're down in the 18 bits range. However, the built-in 60Hz rejection is quite helpful for some applications.

Floating point does have it's uses though--where dynamic range is high and some of the numbers start out very large----as in chemistry calculations where you may start with constants like 6.02245x10^23.

32-bit floating point may not be suitable for exactly counting the hydrogen ions in a beaker of analyte, but it can give you reasonable results within the limits of chemical sensors you might use (Such as pH meter with a 4-digit display.)

Mark Borgerson

Vote

J

John Devereux 14 years ago

I find it can be nice for generating the final "result" when a complicated formula is involved. Or even if not that complicated but there is some horrible mixture of units involved, Convert everything to floating point SI unit and just do the calculation, instead of carefully scaling everything and checking for loss of precision and overflows at every sub-step.

John Devereux

Vote

Floating point vs fixed arithmetics (signed 64-bit)

Join the Discussion

Didn't find your answer?