Damned floating point

On 26.02.2016 at 19:58, Tim Wescott wrote:

Sounds like you're using gcc. Read up on "-ffloat-store". That option is required to make gcc a fully-conforming C compiler.

With optimisation enabled, gcc will try to keep variables in floating-point registers. The problem is that the x87 floating-point registers are 80 bits wide, whereas a regular 'double' variable has just 64. So if you end up comparing a variable that stayed in a register the whole time with one that was spilled to memory, you'll get a mismatch.

Strictly speaking, this is a violation of the C language standard, but gcc can probably appeal to long-established practice here.
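Something along these lines shows the effect (this is not the original code; the values, compiler flags and the exact place where the truncation happens are assumptions for illustration):

/* Minimal sketch of the mismatch described above. Built as 32-bit x87
 * code, e.g. "gcc -O2 -m32 -mfpmath=387 demo.c", one copy of the sum may
 * stay in an 80-bit register while the other is forced out to a 64-bit
 * double in memory, so the comparison can fail. Adding -ffloat-store,
 * or using -mfpmath=sse -msse2 instead, normally makes them agree. */
#include <stdio.h>

static double sum(double a, double b, double c)
{
    return a + b + c;                       /* may be evaluated in 80 bits */
}

int main(void)
{
    double a = 0.1, b = 0.2, c = 0.3;

    volatile double stored = sum(a, b, c);  /* forced out to 64-bit memory */
    double kept = sum(a, b, c);             /* may remain in an x87 register */

    printf("equal: %d\n", stored == kept);  /* can print 0 with x87 maths */
    return 0;
}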

(Context switches are not an issue. These save all 80 bits if your operating system is remotely sensible. x86 is not THAT bad.)

Stefan

Reply to
Stefan Reuther

Both true, but not related to either guard bits or context switches.

The problem is that the irregular existence of the extra precision causes numerical instability.

The lack of 80-bit long double support is mostly limited to compilers targeting Windows. Most *nix x86 implementations do support the longer format.

No, they're not - they exist only during the course of an operation (e.g. *during* an FADD). The three extra bits are the key to implementing the IEEE accuracy rule (roughly: compute to infinite precision, then round) with a sane amount of hardware (basically those three extra bits). Once the result is back in the register, the three extra bits are gone.

The presence of rename registers does not change the timing of the needed saves. If the OS elects to do lazy saves of the FP registers (and that applies to x87, MMX, SSE and AVX state equally), it uses the task-switched (TS) bit in CR0. If that bit is set, any attempt to access the non-integer registers generates a #NM exception, at which point the OS can save the old state and restore the new one. Alternatively the OS can save/restore the FP registers immediately upon the task switch. The advantage of doing it lazily is that you might be able to avoid the work entirely (for example, an interrupt handler might run and return without ever touching the FP registers), you might be able to finish the task switch faster, and long FP operations can complete while other code executes. On the flip side, when lazy switches do happen they require handling an exception, which is never particularly cheap.
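For what it's worth, a toy user-space model of that lazy-save idea looks something like this (this is not kernel code: "struct task", "fpu_owner" and the function names are invented for illustration, and the real mechanism uses CR0.TS, a #NM handler and FXSAVE/FXRSTOR or XSAVE):

/* Toy model of lazy FPU switching: a task switch only marks the FPU as
 * "not to be touched"; the save/restore happens on first FP use, and is
 * skipped entirely if the previous owner's state is still loaded. */
#include <stdbool.h>
#include <stdio.h>

struct task {
    char fpu_state[512];     /* stand-in for an FXSAVE area */
    const char *name;
};

static struct task *current;     /* task currently running */
static struct task *fpu_owner;   /* task whose data is still in the FPU */
static bool fpu_disabled;        /* models the CR0.TS bit */

static void switch_to(struct task *next)
{
    current = next;
    fpu_disabled = true;         /* don't touch the FPU yet, just set "TS" */
}

static void on_fp_instruction(void)   /* models the #NM handler */
{
    if (fpu_disabled) {
        if (fpu_owner && fpu_owner != current)
            printf("saving FPU state of %s\n", fpu_owner->name);    /* FXSAVE */
        if (fpu_owner != current)
            printf("restoring FPU state of %s\n", current->name);   /* FXRSTOR */
        fpu_owner = current;
        fpu_disabled = false;    /* clear "TS" and retry the instruction */
    }
    /* the FP instruction now runs normally */
}

int main(void)
{
    struct task a = { .name = "A" }, b = { .name = "B" };

    switch_to(&a);
    on_fp_instruction();   /* first FP use by A: restore only */
    switch_to(&b);         /* B never uses the FPU: no FP work at all */
    switch_to(&a);
    on_fp_instruction();   /* A's state is still loaded: nothing to do */
    return 0;
}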

There's no way to allow "old" state to remain in any architecturally visible way.

Now this sort-of happens with SMT (Hyperthreading), but in that case there's an architectural state for each thread, even though they share all the (rename) registers.

Now a CPU might implement more contexts than it can actually execute simultaneously, and effectively store the suspended contexts in the rename registers, but that would just appear to the OS as more threads, each with a full architectural state. No x86, AFAIK, has implemented such a scheme.

Again, no. On a context switch the registers have to be saved and reloaded with the new context. In the case of the FP registers, that can be delayed until they're actually accessed in the new context (which may allow you to avoid saving them at all if the new context never uses them), but once the new context needs them, they need to be swapped.

If the context switches are purely cooperative, the OS can decide not to save most of the registers (at the least, on x86 you're going to have to save the xPC and xSP), and leave any other saving and restoring up to the caller. But interrupt-driven context switches don't have that luxury, since the application cannot control when they happen.

Reply to
Robert Wessel

Maybe score1 is computed as (a+b)+c, score2 as a+(b+c) (or even (a+c)+b). I don't know if the compiler is free to rearrange the terms.
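For example (made-up values, nothing to do with the actual data), re-associating a sum of doubles can change the result:

/* Made-up values - just showing that (a + b) + c and a + (b + c)
 * need not be equal for doubles. */
#include <stdio.h>

int main(void)
{
    double a = 1e16, b = -1e16, c = 1.0;

    double score1 = (a + b) + c;   /* 0.0 + 1.0 -> 1.0 */
    double score2 = a + (b + c);   /* the 1.0 is lost inside b + c -> 0.0 */

    printf("score1 = %g, score2 = %g, equal: %d\n",
           score1, score2, score1 == score2);
    return 0;
}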

So you are doing other floating point calculations in between? Do they involve a,b,c as terms or results?

How much do score1 and score2 differ? How much do a,b,c differ?

Regards Martin

Reply to
mblume

Taking a punt here but things I would wonder about:

Has something changed the operating set-up of the FP unit? Could an interrupt be coming in, saving and restoring an FP result?

Reply to
Bill Davy

An ISR is commonly where a thread context switch occurs. Any thread is allowed to use the FPU.

Reply to
Clifford Heath

You're right -- when I wrote "guard bits" I meant the extra bits that get discarded when an 80-bit FPU register gets stored into a 64-bit register or memory location. I should have referred to them as extra precision bits or something like that.

One would hope that saving and restoring the FPU state would preserve all 80 bits of each value, but I've seen compiler and kernel writers do some pretty stupid things over the years...

--
Grant
Reply to
Grant Edwards

I'm not certain an interrupt would cause this problem. 80-bit floating point is a valid size for calculations. The interrupt code won't know how the floating point registers are being used, so it will have to save and restore the full state of the FP unit rather than truncate the values before saving.

--

Rick
Reply to
rickman

It is very hard for me to understand why someone would need to use floating point instructions in an ISR, so there should be no need to save and restore the FP stack.

Usually there is something seriously wrong with the program architecture if you _have_to_ use FP in ISR.

Reply to
upsidedown

I agree it is neither normal nor desirable to use FP in an ISR, but (a) someone else may have written the ISR, and (b) s**t happens, and (c) when you have discounted everything else ...

Reply to
Bill Davy

I'm not going to judge code I haven't seen or requirements I don't know about. Code is code and floating point is just another form of data.

--

Rick
Reply to
rickman

I have to wonder what x86 model this is running on and what compiler, since it occurs to me that x86 floating point computation is typically done these days with XMM instructions rather than the x87 (stack-based) instruction set. I don't know what that does to the 80-bit intermediate results. I guess Tim has deadlines to meet and is happy to have put the issue behind him with a workaround, but it would be interesting to see the assembly code that came out of the compiler from his code.

Reply to
Paul Rubin

Floating point instructions have various kinds of traps or faults (whatever you call them), such as division by zero. A trap service routine is more or less similar to an ISR, and in general you do not want to trigger another ISR from within an ISR.

There are priority issues: should the original ISR be interrupted by the trap, or should the trap be deferred until the ISR returns (but then the ISR can't use the result of the trap)?

Neither do you want to have, e.g., a segmentation fault within the ISR, which usually results in a more or less well-handled OS crash.

Reply to
upsidedown

No, floating point is typically done with x87 instructions when you are compiling as "32-bit x86 code that works on anything" - which is typically the default setting for 32-bit x86 compilers. XMM registers and instructions will only be used if you set the minimum target requirements a little higher, using appropriate compiler flags. Obviously if you are interested in performance you will do that - in most cases, you can rely on the user having something more recent than an early Pentium. But you have to make an active decision to do it.

I can only make guesses about how Tim set up his build, but my guess is that he only bothered with simple flags like "-O2 -Wall -Wextra", rather than figuring out what architecture flags were appropriate for a good balance between speed and portability, because code efficiency was not particularly important.

Reply to
David Brown

That makes no sense whatsoever. You use the features you need, inside an ISR or not. Obviously you /usually/ want to keep ISRs' runtimes short, but there are exceptions - a task switcher will often be done as part of an ISR, and you may have an arrangement with nested ISRs where the outer ones can be long running.

And why would you think that floating point is slow or otherwise unsuitable for an ISR? Software floating point on an 8-bit AVR is slow, but on an x86 the basic FP operations are no slower than basic integer operations. On an ARM Cortex M4F and many other embedded microcontrollers with floating point hardware, as long as you are careful to stick to single-precision and use "-ffast-math", your basic floating point operations will be single cycle. Certainly "real" floating point code will be smaller and faster than using scaled integers and shifting to emulate it.

Reply to
David Brown

Obviously a task-switch exception handler using FP would be a frequent case, but I think he did not mention it in this context because it is an obvious exception to the rule. As a rule, you practically don't use FP in an ISR - apart from the case mentioned. Then again, even during a task switch the OS may sometimes not use the FPU (e.g. in DPS each task can turn FPU usage on/off in order to avoid saving all the FP regs - 32 64-bit ones; in fact this is often done dynamically within a task).

There are many reasons why; I think he already listed some in another reply to someone else. It makes life more complex - FP exceptions caused by data variations are no rarity, and these will then be raised from within your IRQ handler. Not the greatest idea. Then the IRQ handler writer has to be aware of all the subtleties of saving/restoring the FPU context, mainly the FPSCR and other registers with all their interdependencies etc., which may well vary between implementations of the same architecture; whereas a plain integer-unit IRQ handler takes a lot less knowledge to write.

That being said, using the FPU in an ISR is certainly possible; however, I would like to see some real examples of when it might do you any good.

Have you _actually_ written an IRQ handler which does use FP opcodes? (Other than the obvious case of a task switch of course).

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI

Reply to
Dimiter_Popoff

Fair enough.

Obviously you don't use FP in an ISR if you don't need it - and ISRs typically do not need FP. That would be the main reason for not using FP in ISRs!

If you need to stack and restore a lot of FP registers, then that can take time (although on some embedded processors, like the MPC5674, FP is done using normal GP registers rather than an additional register set). If your ISRs are short and avoid calling other functions (at least, those that are not static or inline or available through LTO) then no registers, FP or GP, need to be stacked unless they are used.

He mentioned traps or faults, such as from division by 0. There are a few points here. First, don't divide by zero - that would be a bug in the program. Second, use FP modes, instructions and compiler settings that assume the FP operations are done on reasonable operands, and disable such traps or faults (-ffast-math on gcc). Third, on many embedded processors (like the Cortex M4F), the FP instructions don't have any traps or faults. Fourth, if you worry about FP divide by zero traps, you should worry equally about integer divide by zero traps - FP is not special in that regard.
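As a rough illustration of the second and fourth points (standard C99 <fenv.h> only; on glibc, link with -lm): with the usual masked-exception defaults an FP division by zero does not trap at all, it just yields +inf and sets a sticky status flag, whereas an integer division by zero really does fault.

/* Masked FP exceptions: 1.0/0.0 produces +inf and raises a sticky flag
 * that can be queried afterwards - no trap, no signal. */
#include <fenv.h>
#include <stdio.h>

int main(void)
{
    volatile double zero = 0.0;     /* volatile: prevent compile-time folding */

    feclearexcept(FE_ALL_EXCEPT);
    double x = 1.0 / zero;          /* masked: no trap, just +inf */

    printf("x = %g, FE_DIVBYZERO flagged: %d\n",
           x, fetestexcept(FE_DIVBYZERO) != 0);
    return 0;
}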

If this is something that your compiler and/or library and/or IDE project "wizard" does not handle for you, then I agree it is an extra complication.

Ah, that's a different matter! I don't have any code to hand that uses floating point at all, never mind inside an interrupt. (The last project I had with floating point used software floating point, and it was not in an ISR.) Floating point just doesn't turn up often in the kind of systems I work with.

Personally, no. But I have a good customer who has a motor control system on an MPC561 that uses hardware floating point both in the main loop and in an interrupt that runs every 600 us (IIRC). We did not write the software itself, but I made the HAL, startup code and some library functions. So I handled the details of preserving the FPSCR and FP registers (it was a good few years ago, so I don't remember details of the process).

I don't disagree that it is rare to want to do FP in an ISR. I just disagree that it is something to actively avoid beyond the usual aim of keeping your ISRs short and simple, or that it indicates "something seriously wrong in the program architecture".

Reply to
David Brown

It's not the individual instructions, it's the cost of saving and restoring the entire FPU state (bearing in mind that each 80-bit x87 register is 2.5x the size of a 32-bit integer register).

For this reason, x86 has a task-switched flag which lets the OS defer the FPU save/restore until the new context actually touches the FPU, so that the state doesn't have to be saved/restored if that would constitute a no-op.

Actually using the FPU in an ISR would require the state to be restored upon exit, which may be significant compared to the rest of the ISR.

I don't know whether it's still the case, but it used to be a matter of policy that the Linux kernel doesn't use floating point at all (not just for ISRs).

Reply to
Nobody
