Floating point vs. fixed-point arithmetic (signed 64-bit)

Floating point MUL/DIV instructions are trivial: just multiply/divide the mantissas and add/subtract the exponents.

With FP add/sub you have to denormalize one operand and then normalize the result, which can be quite time consuming without sufficient HW support.

This can be really time consuming if the HW is designed by an idiot.
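As a rough illustration (a toy normalized format of my own, not IEEE 754; no rounding, no zero/infinity handling), FP multiply really is just "multiply the significands, add the exponents, renormalize by at most one bit":

#include <stdint.h>

/* Toy normalized float: value = sign * mant * 2^exp, mant in [2^31, 2^32). */
typedef struct {
    uint32_t mant;   /* significand, MSB always set */
    int16_t  exp;    /* unbiased exponent */
    int8_t   sign;   /* +1 or -1 */
} toyfp;

/* Multiply: multiply the significands, add the exponents, renormalize. */
static toyfp toyfp_mul(toyfp a, toyfp b)
{
    toyfp r;
    uint64_t p = (uint64_t)a.mant * b.mant;   /* product is in [2^62, 2^64) */
    r.sign = a.sign * b.sign;
    r.exp  = a.exp + b.exp;
    if (p & (1ULL << 63)) {      /* already normalized: keep the top 32 bits */
        r.mant = (uint32_t)(p >> 32);
        r.exp += 32;
    } else {                     /* one-bit renormalization */
        r.mant = (uint32_t)(p >> 31);
        r.exp += 31;
    }
    return r;                    /* truncated, i.e. no rounding */
}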

Reply to
upsidedown

Your observations are valid. But I have yet to see a practical example of something that can be done faster and with equal accuracy in floating point vs. using integer operations.

I concur with your observations. After reading your first paragraph ... yeah, floating-point multiplication is pretty simple so long as the floating point format is sane.

Before reading your post, my mental model was that floating-point operations might be 20 times as slow as integer operations. Now I'm thinking maybe 2-3 times.

DTA.

Reply to
David T. Ashley

I did a fixed point support package for our 8 bit embedded systems compilers and one interesting metric came out of the project.

Given the same number of bits in a number and similar error checking, fixed and float took very similar amounts of execution time and code size in applications.

For example, 32-bit float and 32-bit fixed point. They are not exactly equal, but they are close. In the end, much to my surprise, the choice comes down to dynamic range versus resolution.

There are other factors: IEEE 754 has potentially much more error checking, but not all libraries are written to support it, and not all applications need it.

Regards,

w..

-- Walter Banks Byte Craft Limited

formatting link

Reply to
Walter Banks

That's interesting, because in my experience fixed-point fractional arithmetic (i.e., 0x7fffffff = 1 - 2^-31, 0x80000001 = -1 + 2^-31), with saturation-on-add, is significantly faster (3x to 10x) than floating point on all the machines I've tried it on, except those with floating-point hardware.

I have a portable version that works on just about anything that's ANSI-C compatible, and when I really need speed I rewrite the arithmetic routines in assembly for about a 2x increase.

The only processor that came close to matching it was the TMS320F2812, where the ANSI-C compatible version was just about matched by the floating-point package that came with the tool set (and I _know_ that TI cut corners with that floating-point package). That's the _only_ processor in my experience where floating point could keep up with the ANSI-C version, and I would expect that had I written an assembly version it would have been faster still.
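For anyone who hasn't used that style of arithmetic, the core primitives look something like the sketch below in portable C (names and details are mine, not the actual package; real libraries differ in rounding and in exactly where they clamp):

#include <stdint.h>

/* Q31 fractional value, symmetric range:
 * 0x7fffffff = 1 - 2^-31, 0x80000001 = -(1 - 2^-31). */
typedef int32_t q31_t;

/* Addition with saturation: clamp at full scale instead of wrapping. */
static q31_t q31_add_sat(q31_t a, q31_t b)
{
    int64_t s = (int64_t)a + b;
    if (s >  INT32_MAX) return  INT32_MAX;
    if (s < -INT32_MAX) return -INT32_MAX;
    return (q31_t)s;
}

/* Fractional multiply: widen, multiply, shift back down by 31 bits. */
static q31_t q31_mul(q31_t a, q31_t b)
{
    return (q31_t)(((int64_t)a * b) >> 31);
}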

--
My liberal friends think I'm a conservative kook.
My conservative friends think I'm a liberal kook.
Why am I not happy that they have found common ground?

Tim Wescott, Communications, Control, Circuits & Software
http://www.wescottdesign.com
Reply to
Tim Wescott

What you saw was what I was expecting. My point in the post was to be careful in assuming that fixed is going to be dramatically better. At least for 8 bits, the variable size in bits is a significant factor when all math is multiprecision.

One of the keys in our metrics was that the target was 8-bit processors; there was a trade-off between precision and dynamic range, but the bit sizes remained the same.

Real applications are probably dominated by scaling and precision considerations, which reduce the number of bits fixed point needs for the same application.

It didn't make sense until I realized that these were 8-bit processors using software multiplies and divides, and that 32-bit floating point for the most part uses 24-bit multiplies and divides plus a few adds/subtracts for the exponents.

32-bit fixed point uses 32-bit multiplies/divides, adding to the cycle count.

My experience with 32 bit processors is similar to yours although I don't have metrics to back it up.

Walter..

Reply to
Walter Banks

Ah. I see your point. 9 multiplies and some shifting during addition vs. 16 multiplies might well turn out to be a wash.
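(The 9 vs. 16 comes straight from the schoolbook multiply: on an 8-bit core with an 8x8=16 multiply, an n-byte operand pair needs n*n partial products, so 3x3 for a 24-bit significand against 4x4 for a 32-bit integer. A generic sketch, not any particular compiler's runtime:)

#include <stdint.h>

/* Schoolbook multiprecision multiply, n bytes x n bytes -> 2n bytes,
 * little-endian byte order.  The inner body runs n*n times: 9 partial
 * products for 24-bit operands, 16 for 32-bit operands. */
static void mp_mul(const uint8_t *a, const uint8_t *b, uint8_t *prod, unsigned n)
{
    unsigned i, j;
    for (i = 0; i < 2 * n; i++)
        prod[i] = 0;
    for (i = 0; i < n; i++) {
        uint16_t carry = 0;
        for (j = 0; j < n; j++) {
            uint16_t p = (uint16_t)a[i] * b[j] + prod[i + j] + carry;
            prod[i + j] = (uint8_t)p;
            carry = p >> 8;
        }
        prod[i + n] = (uint8_t)carry;
    }
}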

The first serious control loop I did was quite starved for clock cycles, and used a 24-bit accumulator, but with an 8 x 16 (or 8 x 8) multiply, and had 16-bit data paths other than that.

--
Tim Wescott
Control system and signal processing consulting
www.wescottdesign.com
Reply to
Tim Wescott

That's not a big surprise - with floating point, the actual arithmetic is 24-bit, which will be quite a lot faster than 32-bit on a small 8-bit machine (especially if it doesn't have enough registers or data pointers).

Reply to
David Brown

It depends on the chip, the type of floating point hardware it has, the operations you need, the compiler, and the code quality. For a lot of heavy calculations done with integer arithmetic, you need a number of "extra" instructions as well as the basic add, subtract, multiply and divide. You might need shifts for scaling, mask operations, extra code to get the signs right, etc. And the paths for these are likely to be highly serialised, with each step depending directly on the result of the previous operation, which slows down pipelining. With hardware floating point, you have a much simpler instruction stream, which can result in faster throughput even if the actual latency for the calculations is the same.

This effect increases with the size and complexity of the processor. Obviously it is dependent on the processor having floating point hardware for the precision needed (single or double), but once you have any sort of hardware floating point you should re-check all your assumptions about speed differences. You could be wrong in either direction.
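To make the "extra instructions" point concrete, here is a scaled-integer calculation next to its floating-point equivalent (a sketch with a made-up Q16.16 helper; every step in the integer version depends on the previous one):

#include <stdint.h>

/* Q16.16 fixed point: stored value = real value * 65536. */
typedef int32_t q16_t;

/* y = a * b / c in Q16.16: widen, multiply, divide - a serial chain,
 * plus the scaling the programmer has to keep track of. */
static q16_t q16_mul_div(q16_t a, q16_t b, q16_t c)
{
    int64_t t = (int64_t)a * b;   /* 2^32-scaled intermediate, needs 64 bits */
    return (q16_t)(t / c);        /* dividing by a 2^16-scaled value restores Q16.16 */
}

/* The floating-point version is just the arithmetic, with no explicit scaling. */
static float f_mul_div(float a, float b, float c)
{
    return a * b / c;
}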

Reply to
David Brown

The key point is "it is dependent on the processor having floating point hardware for the precision needed". And, I might add, on other things -- see Walter Banks's comments in another sub-thread about 32-bit floating point vs. 32-bit integer math.

In my experience with signal processing and control loops, having a library that implements fixed-point, fractional arithmetic with saturation on addition and shift-up is often faster than floating point _or_ "pure" integer math, and sidesteps a number of problems with both. It's at the cost of a learning curve for anyone using the package, but it works well.

On all the processors I've tried it on, except for x86 processors, there's been a 3-20x speedup once I've hand-written the assembly code to do the computation (and that's without understanding or trying to accommodate any pipelines that may exist).

But on the x86 -- which is the _only_ processor that I've tried it that had floating point -- 32-bit fractional arithmetic is slower than 64-bit floating point.

So, yes -- whether integer (or fixed point) arithmetic is going to be faster than floating point depends _a lot_ on the processor. So instead of automatically deciding to do everything "the hard way" and feeling clever and virtuous thereby, you should _benchmark_ the performance of a code sample with floating point vs. whatever fixed-point poison you choose.
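A crude harness for that comparison can be as simple as the sketch below (standard clock() here as a stand-in; on a bare-metal target substitute a hardware timer or cycle counter, and substitute your real inner loop for the toy low-pass step):

#include <stdio.h>
#include <time.h>

#define N 1000000L

int main(void)
{
    volatile float yf = 0.0f;   /* volatile so the loops aren't optimized away */
    volatile short yq = 0;
    clock_t t0, t1, t2;
    long i;

    t0 = clock();
    for (i = 0; i < N; i++)                       /* float first-order low-pass step */
        yf = yf + 0.125f * (1.0f - yf);
    t1 = clock();
    for (i = 0; i < N; i++)                       /* the same step in Q15 */
        yq = yq + (short)((0x1000 * (0x7FFF - yq)) >> 15);
    t2 = clock();

    printf("float: %ld ticks, fixed: %ld ticks\n",
           (long)(t1 - t0), (long)(t2 - t1));
    return 0;
}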

Then, even if fixed point is significantly faster, you should look at the time consumed by floating point and ask if it's really necessary to save that time: even cheapo 8-bit processors run pretty fast these days, and can implement fairly complex control laws at 10 or even 100Hz using double-precision floating point arithmetic. If floating point will do, fixed point is a waste of effort. And if floating point is _faster_, fixed point is just plain stupid.

So, benchmark, think, make an informed decision, and then that virtuous glow that surrounds you after you make your decision will be earned.

--
My liberal friends think I'm a conservative kook.
My conservative friends think I'm a liberal kook.
Why am I not happy that they have found common ground?

Tim Wescott, Communications, Control, Circuits & Software
http://www.wescottdesign.com
Reply to
Tim Wescott

Assuming we are doing 64-bit double-precision mul/div on an 8-bit processor, the mantissa is 48-56 bits, and hence a single-cycle 8x8=16-bit multiply instruction helps a lot. In addition, the lowest part of the mantissa result (96-112 bits) is interesting only to see whether it will generate a carry into the most significant 48-56 bits.

The denormalization of the smaller value can be done quite effectively if the hardware supports shift right by N bits in a single instruction. In fact it makes sense to first perform the right shift by a multiple of 8 bits with byte copies and then do the remaining 1..7-bit right shift with shift instructions.

Unfortunately, the normalization after FP add/sub gets ugly. While you can do the multiple-of-8 shift with byte tests and byte copying, you still have to do the final left shift with a loop of 1-7 iterations, shifting into carry and branching if carry is set.

Again, if the hardware supports something like a FindFirstBitSet instruction in a single cycle, this helps the normalization a lot.
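For example, where a count-leading-zeros operation is available (a FindFirstBitSet-style instruction, or the __builtin_clz intrinsic on GCC/Clang), the normalization collapses to one count and one shift; the loop version below is the fallback:

#include <stdint.h>

/* Normalize a non-zero 32-bit significand so its MSB is set, adjusting exp.
 * One count plus one shift when a fast clz exists. */
static uint32_t normalize_clz(uint32_t mant, int *exp)
{
    int shift = __builtin_clz(mant);   /* undefined for mant == 0; caller must check */
    *exp -= shift;
    return mant << shift;
}

/* Fallback without clz: shift whole bytes first, then single bits (1..7 steps). */
static uint32_t normalize_loop(uint32_t mant, int *exp)
{
    while ((mant & 0xFF000000u) == 0) { mant <<= 8; *exp -= 8; }
    while ((mant & 0x80000000u) == 0) { mant <<= 1; *exp -= 1; }
    return mant;
}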

In the old days, I saw a lot of designs in which the design was driven by the available gates, not by the required functionality.

Reply to
upsidedown

Weren't you the one who said that your (tuned) ARM C code was generally only a factor of 1.2 worse than the best hand-tweaked assembly code? Maybe not, but I've seen it said in these parts. Certainly, my experience is that that is quite a good rule of thumb, and it is very difficult to get more than a factor of two between assembler and C unless the platform in question has a very poor C compiler or the assembly code is actually implementing a different algorithm (which is sometimes possible, but much rarer in these days of well-supplied intrinsic function libraries).

One thing that gives float a particular edge on the x86(32) (but which can also apply to other processors) is that using floating point means that you don't have to use the precious integer register set for data: it can be used for pointers, counters and other control paraphernalia, leaving the working "data state" in the FPU registers. Modern SIMD units can do integer operations as well as floating point, so the "extra state" argument might seem weaker, but I've never seen a compiler use SIMD registers for integer calculations (unless forced to with intrinsic functions).

Fast isn't always the only consideration, though. Floating point is *always* going to be more power-hungry than fixed point, simply because it is doing a bunch of extra work at run-time that fixed-point forces you to hoist to compile-time.

The advice to benchmark is excellent, of course. Particularly because the results won't necessarily be what you expect.

Cheers,

--
Andrew
Reply to
Andrew Reilly

When the compiler can figure out what I mean, yes, it is usually at least almost as good as I can do, and sometimes better (I don't carry around all the instruction reordering rules in my head: the compiler does).

With fixed-point arithmetic stuff, though, the compiler never seems to "get it".

So, the next time I try this on x86 I should use the SIMD registers.

Actually, if you know you're going to be doing things like vector dot products, then you could probably get some significant speed-up by doing a spot of assembly here and there. I haven't had occasion to try this on an x86, though.

It'll be power hungry twice over if you select a chip that has floating-point hardware. I never seem to have the budget -- either dollars or watts -- to use such processors.

Yes. Even when I expect anti-intuitive results, I can still be astonished by benchmarks.

--
My liberal friends think I'm a conservative kook.
My conservative friends think I'm a liberal kook.
Why am I not happy that they have found common ground?

Tim Wescott, Communications, Control, Circuits & Software
http://www.wescottdesign.com
Reply to
Tim Wescott

The main problem with trying to write _low_level_ math routines in C is that you do not have access to the carry bit or to any rotate instruction. The C compiler would have to be very clever to convert a sequence of C statements into a single rotate instruction, or into a shift of multiple bits across two registers.
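The portable spellings of those operations look like this (a sketch; whether a given compiler turns them into single instructions varies with target and version):

#include <stdint.h>

/* Rotate left, written as a shift pair plus OR.  Some compilers map this to a
 * single rotate instruction; others leave it as three operations. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Right-shift a 64-bit value held as two 32-bit halves by one bit: the bit
 * falling off the high half has to be carried into the low half by hand,
 * where assembly would simply use the carry flag. */
static void shr64_by1(uint32_t *hi, uint32_t *lo)
{
    *lo = (*lo >> 1) | (*hi << 31);
    *hi >>= 1;
}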

Reply to
upsidedown

Yes (see my reply on that thread).

When you add things like saturation into the mix, it gets more complicated. That is going to be much less overhead for integer arithmetic than for floating point (unless you have a processor that has hardware support for floating point saturated instructions).

But yes, a well-written library is normally going to be better than poorly written "direct" code, as well as saving you from having to get all the little details correct (you shouldn't worry about your code being fast until you are sure it is correct!). A lot of ready-made libraries are not well written, however, or have other disadvantages. I've seen libraries that were compiled without optimisation - and were thus far slower than necessary. And many libraries are full of hand-made assembly that is out of date, yet remains there for historic reasons even when it now does more harm than good.

Like everything in this field, there are no simple answers.

While x86 typically means "desktop" rather than "embedded", there are steadily more powerful CPUs making their way into the embedded space. I've been using some PowerPC cores recently, and see that there are a large number of factors that affect the real-world speed of the code. Often floating point (when supported by hardware) will be faster than scaled integer code, and C code will often be much faster than hand-written assembly (because it is hard for the assembly programmer to track pipelines or to make full use of the core's weirder instructions).

Absolutely.

It's always tempting to worry too much about speed, and work hard to get the fastest solution. But if you've got code that works correctly, is high quality (clear, reliable, maintainable, etc.), and runs fast enough for the job - then you are finished. It doesn't matter if you could run faster by switching to floating point or fixed point - good enough is good enough.

Yes.

Reply to
David Brown

This is one of the reasons why it is best to use a modern compiler for big processors - it is hard to keep up with them when working by hand. On small devices, you can learn all you need to know about the cpu - but for modern x86 devices, it is just too much effort. And if you are trying to generate the fastest possible code, it varies significantly between different x86 models - your fine-tuned hand-coded assembly may run optimally on the cpu you have on your machine today, but poorly on another machine.

For particularly complex vector work, hand-coding the SIMD instructions is essential for optimal speed. But compilers are getting surprisingly good at generating some of this stuff semi-automatically - it is worth trying the compiler's SIMD support before doing it by hand. The other option is libraries - Intel in particular provides optimised libraries for this sort of stuff.
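As an illustration of what the hand-coded route looks like on x86, a plain-SSE dot product is sketched below (my own example, assuming n is a multiple of 4 and using unaligned loads); it is worth comparing against what the compiler auto-vectorises from the plain C loop before keeping it:

#include <immintrin.h>

/* Dot product of two float arrays with SSE intrinsics.
 * Assumes n is a multiple of 4; uses unaligned loads for simplicity. */
static float dot_sse(const float *a, const float *b, int n)
{
    __m128 acc = _mm_setzero_ps();
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    /* Horizontal sum of the four partial sums. */
    __m128 hi = _mm_movehl_ps(acc, acc);          /* elements 2,3 down to 0,1 */
    __m128 s  = _mm_add_ps(acc, hi);
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));   /* add element 1 into element 0 */
    return _mm_cvtss_f32(s);
}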

That is a wildly inaccurate generalisation. For small processors, the power consumption is going to depend on the speed of the calculations - these cores are all-or-nothing in their power usage, so doing the work faster means you can go to sleep sooner. So faster is lower power. For larger processors, there may be dynamic clock enabling of different parts - if the hardware floating point unit is not used, it can be powered-down. Then there is a trade-off - do you spend extra time in the integer units, or do you do the job faster with the power-hungry floating point unit? The answer will vary there too, but typically faster means less energy overall.

It is obviously correct that the more work that is done at compile time the better - it is only run-time that takes power (on the target). But I can think of no justification for claiming that fixed-point algorithms will do more at compile-time than floating-point algorithms - I would expect the floating-point code to do far more compile-time optimisation and pre-calculation (since the compiler has a better understanding of the code in question).

Reply to
David Brown

It's a funny old world. I've seen several compilers recognise the pair of shifts and an OR as a rotate, and emit that instruction. I've also replaced carefully asm-"optimised" maths routines (on x86) that used the carry flag with "vanilla" C equivalents, and the overall effect was a fairly dramatic performance improvement. Not sure whether it was a side effect of the assembly code pinning registers that could otherwise have been reassigned, or some subtle consequence of reduced dependency, but the result was clear. Guessing performance on massively superscalar, out-of-order processors like modern x86-64 is very difficult, IMO.

Intrinsic functions (to get access to things like clz and similar) also help a lot.

Benchmarking is important.

Mileage will definitely vary with target and toolchain...

Cheers,

--
Andrew
Reply to
Andrew Reilly

C compilers have been gaining performance in part because compiler designers are working with both a target and a subset of applications in mind.

Most compiler developers are benchmarking "real" applications, which tends to direct the compiler toward optimizing those applications. The result is that compilers used in the embedded systems market can often do some very low-level optimization very well that would not be available, or even considered, in compilers used in other applications.

For embedded systems specifically, most if not all commercial compilers have some mechanism to access the processor condition codes. Most embedded-system compilers do well at using the whole processor instruction set.

Walter..

-- Walter Banks Byte Craft Limited

formatting link

Reply to
Walter Banks

Nothing wakes me up faster than last night's benchmark results, not even strong coffee. Benchmarking code fragments is important, but benchmarking applications can be a real eye-opener.

There is nothing more humbling than adding a clever optimization to a compiler and discovering that 75% of the regression applications just got slower and larger as a result.

Walter..

Reply to
Walter Banks

Cortex-M4 chips, like the STM32F405, have lowered the bar quite a bit for FPU availability. The STM32F405 is about $11.50 in qty 1 at DigiKey. The STM32F205 Cortex-M3 is about the same price.

I've got one of the chips, and it's compatible with the F205 board I designed, so I'll be trying it out soon. More RAM, more Flash, faster clock -- everything we look forward to in a new generation of chips. (Since I'm not using an OS or big USB or ethernet stacks, I'll have LOTS of flash left over for things like lookup tables, etc.)

Right now, I'm just happy to read an SD card and send bit-banged data to an FT232H at about 6MB/second. I can even use the same drivers and host I use with the FT245 chips, which do the same thing at about 200KB/s. The 4-bit SD interface on the STM chips can do multi-block reads at upwards of 10MB/s. Hard to match that with SPI mode!

I think the FPU availability will greatly simplify coding of things like Extended Kalman Filters and digital signal processing apps. You can write and test code on a PC while specifying 32-bit floats and port pretty easily to the MPU system.
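One detail to watch when doing that (my note, not specific to any toolchain): the M4 FPU is single precision only, so the PC-side code should stick to float literals and the float variants of the math functions, or the compiler will quietly promote to double and fall back to software routines on the target. For instance:

#include <math.h>

/* Keeping everything single precision: 'f' suffixes on constants and the
 * float libm variants (sqrtf, sinf, ...) avoid silent promotion to double. */
float rms_update(float acc, float sample)
{
    return sqrtf(0.99f * acc * acc + 0.01f * sample * sample);
}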

Mark Borgerson

Reply to
Mark Borgerson

I think it is a bit of an exaggeration to say this applies to "most" commercial compilers - and it is certainly not all. I think it applies to a /few/ commercial compilers targeted at particularly small processors.

For larger processors, you don't get access to the condition codes from C - it would mess up the compiler's code generation patterns too much. For big processors, a compiler needs to track condition codes over a series of instructions - if the programmer could fiddle with condition codes in the middle of an instruction stream, the compiler would lose track.

Also for larger processors, there are often many instruction codes (or addressing modes) that are never generated by the compiler. Some instructions are just too weird to map properly to C code, others cannot be expressed in C at all. As a programmer, you access these using library calls, inline assembly, or "intrinsic" functions (which are simply ready-made inline assembly functions).

You write compilers targeted for small and limited processors, and have very fine-tuned optimisations and extensions to let developers squeeze maximum performance from such devices. But don't judge "most if not all" commercial compilers by your own standards - most do not measure up in those areas.

Reply to
David Brown
