Puzzling power results STM32F4 FPU test

In a recent thread Jon Kirwan and I were discussing FPUs and power consumption. I decided to try some real-world tests on an STM32F4 Discovery board. After a few tests in the ChibiOS RTOS, where I discovered that you can save a lot of power by doing floating-point math with the FPU and shutting off the CPU clock in the idle process, I decided to try to measure the power using software and hardware floating point without the RTOS. I initialized the CPU clock to 168 MHz and ran this code:

// ChibiOS calls commented out to run without OS
static msg_t ThreadMath(void *arg) {
    float sinetable[360], fval;
    int i, j;
    systime_t start, end;
    msg_t mathmsg;
    long mathloop = 0;
    (void)arg;
    // chRegSetThreadName("Math");
    while (TRUE) {
        // mathmsg = chBSemWait(&MathSemaphore);
        // start = chTimeNow();
        for (j = 0; j

Reply to
Mark Borgerson

start = chTimeNow();

sinetable[i] = sinf(fval);

Did you check the current as a function of time (i.e. with a current probe and 'scope)? The obvious reason is that the FPU does the job faster, so you spend more than enough time in sleep to make up for the higher current consumed by the FPU. BTW - does the FPU get completely turned off when you are not going to use it?

If you don't have a current probe, you could set up your system to exercise floating point calculations continuously {either using FPU or CPU/software}. That should get you the results you expect.

Reply to
Frank Miles

The function that fills in the table of sine values runs continuously-- the CPU should never go to sleep.

I get the expected reduction in power when using an RTOS where the sine function is intermittent and the CPU sleeps between activations.

That's what I did with the code above.

Mark Borgerson

Reply to
Mark Borgerson

Wild guess: FPU instructions take more time to execute, so the CPU is probably executing a smaller number of instructions per unit time when using the FPU. In other words, the CPU may be spending a lot of cycles stalled waiting on the FPU. Less work in the integer part of the CPU may give a power saving.

--
                              Waldek Hebisch 
hebisch@math.uni.wroc.pl
Reply to
Waldek Hebisch

Pure speculation, but since many of the floating-point instructions take multiple cycles to complete, the CPU pipeline may spend more time stalled, which in turn means the flash interface is activated less often.

Also, just to avoid mistakes I've made myself, I trust you have verified that the compiler emits floating-point instructions, and that the hardware-float version of the math library is linked?
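One cheap way to catch the compiler side at build time -- a sketch, assuming a GCC/Clang arm-none-eabi toolchain (IAR has a corresponding predefine):

/* Build-time sanity check.  With -mfpu=fpv4-sp-d16 and -mfloat-abi=hard
 * (or softfp) GCC/Clang predefine __ARM_FP; IAR defines __ARMVFP__ when
 * an FPU is selected.  A pure soft-float build then fails loudly here
 * instead of silently linking the emulation routines. */
#if !defined(__ARM_FP) && !defined(__ARMVFP__)
#error "No hardware floating point in this build -- check the FPU / float-abi settings"
#endif

Checking the disassembly for vmul.f32/vdiv.f32 rather than calls to __aeabi_fmul and friends covers the library side.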

-a

Reply to
Anders.Montonen

Ah, could you hold on a moment while I find some hole to crawl into, preferably one with a remedial reading class?

Sorry, guess I'm clueless today.

To repeat one point - are you sure that the FPU is completely turned off when you're not going to be using it? Hopefully there's some way to be sure this is happening.

Reply to
Frank Miles

I guess that's a possibility. While an FP multiply is just one cycle, an FP divide is 12. The sine function and the loop code do use divide instructions.

Yes, I checked the instruction codes in the assembly display of the C-SPY debugger. It does use the FPU---and the 8X faster performance on other test code supports that.

One other hypothesis that I've come up with is that the software FP pushes and pops more stuff on and off the stack doing the same work that the hardware FP does with a single transfer to the FPU registers. Perhaps moving all those registers to and from RAM uses more energy.
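To make that concrete, here's a toy function (purely illustrative, not from the test code) with roughly what each build turns it into:

float scale(float x, float k)
{
    /* Soft-float ABI: this compiles to a call to the EABI helper
     * __aeabi_fmul -- the operands pass through r0/r1, lr gets saved,
     * and the helper shuffles the unpacked IEEE-754 fields through
     * core registers and often the stack.
     * Hard-float ABI: a single vmul.f32 s0, s0, s1 on the FPU
     * registers, with no memory traffic at all. */
    return x * k;
}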

Mark Borgerson

Reply to
Mark Borgerson

I'm not sure of the status of the FPU when I compile the code for software FP. There are a couple of FPU enable bits that aren't set when using software FP, but I'm not sure if they turn off the FPU clock or if they just cause a fault on writes to the FPU registers.
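For reference, a quick runtime check of those bits would be something like this (a sketch; the CPACR field layout is from the Cortex-M4 documentation, and as noted above, clearing the fields only makes FP instructions fault -- whether the FPU clock is actually gated is a separate question):

#include <stdint.h>

/* CPACR is at 0xE000ED88; bits 20-23 are the CP10/CP11 access fields.
 * 0x0 = access denied (FP instructions fault), 0x3 = full access. */
#define CPACR (*(volatile uint32_t *)0xE000ED88u)

static int fpu_access_enabled(void)
{
    return (CPACR & (0xFu << 20)) == (0xFu << 20);  /* CP10 + CP11 full access */
}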

Mark Borgerson

Reply to
Mark Borgerson

My guess would be that it's due to a couple of factors:

1) The FPU is quicker, so the CPU will be spending more time in the IDLE state, where the power consumption is a lot less.

2) The FPU is probably a lot more efficient in the number of electrons needed to do the operation than the software emulation. On a per microsecond basis, the FPU may use more power when it is running than the integer ALU, but it may well need less energy to do the full computation.

Reply to
Richard Damon

I'm not sure this matches the test conditions. Both with and without the FPU, the software was in a continuous loop that computed and stored sine values. There was no idle state.

Both loops ran continuously, but the FPU version does complete more passes per second.

Mark Borgerson

Reply to
Mark Borgerson

Because the FPU is specifically designed for floating point, it performs those operations more efficiently -- in terms of fetched instructions, changed register bits, etc. -- than a software implementation can.

Boo2

Reply to
Boo

In the initial post, the OP said

So there appears to be a fixed amount of computation to be done per unit time, with the processor being put to sleep in between. Under this condition, it makes sense that the FPU will save power, as it is more efficient at doing the calculation, being designed for it.

If the choice is between doing more calculations per unit time with the FPU versus not, then the power per unit time likely goes up, but the energy used per unit of calculation should still be lower. Since there normally IS a fixed amount of processing to do in an embedded system, using the FPU can be a power saving (as long as you use it enough that its "idle" power doesn't eat up the savings when you are not using it).
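As a back-of-the-envelope illustration, with made-up numbers purely to show the shape of it (energy per unit of work = volts x amps x seconds):

    FPU build:     3.0 V x 0.050 A x 0.001 s =  150 uJ per pass of the sine table
    soft-FP build: 3.0 V x 0.045 A x 0.008 s = 1080 uJ per pass

Higher instantaneous power while the FPU is running, but far less energy for the same amount of work.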

Reply to
Richard Damon

That was the initial test. In the later test, with the code shown in the post, there was no RTOS active, just a continuous loop computing and storing the sine values.

I agree with this---and it was demonstrated in the initial test using the RTOS where the power went down by about 50% with the CPU idle between calculation loops.

The mystery is why the power is lower using the FPU when the calculations are in an infinite loop with no idle state between loops.

Mark Borgerson

Reply to
Mark Borgerson

If the chip you're using allows it, you could try rearranging the test to run entirely from RAM.
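Something along these lines, as a sketch -- section names and the startup copy depend on the toolchain and linker script, and sinf() itself still executes from flash unless libm is relocated too:

#include <math.h>

/* IAR provides the __ramfunc keyword; with GCC the usual trick is a
 * section attribute plus a matching output section that the startup
 * code copies from flash to SRAM (the stock STM32 GCC linker scripts
 * often call it .RamFunc). */
#if defined(__ICCARM__)
#define RAMFUNC __ramfunc
#else
#define RAMFUNC __attribute__((section(".RamFunc"), noinline))
#endif

RAMFUNC void sine_loop_from_ram(float *table, int n)
{
    for (int i = 0; i < n; i++)
        table[i] = sinf((float)i * 3.14159265f / 180.0f);
}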

-a

Reply to
Anders.Montonen

I had a look in the data sheet for the STM32F405xx/407xx, and the current consumption characteristics on pages 77-78 give the following figures for running at 168 MHz with all peripherals disabled:

  • With flash accelerator OFF: 46 mA typ, 61 mA max
  • With flash accelerator ON: 40 mA typ, 54 mA max

This would support the idea that flash memory accesses at least play a part in the overall power consumption. The ARM Data Watchpoint and Trace (DWT) unit has a performance counter specifically for measuring multi-cycle instruction and instruction fetch stalls, which you could use to test whether the hardware FP code actually stalls significantly more than the emulated code.
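A sketch of reading those counters with raw register addresses (CMSIS exposes them as DWT->CPICNT and so on); note the profiling counters are only 8 bits wide:

#include <stdint.h>

/* CPICNT counts extra cycles from multi-cycle instructions and fetch
 * stalls, LSUCNT counts extra load/store cycles; both wrap at 256, so
 * they need frequent sampling (or counting of their overflow events)
 * for longer runs.  CYCCNT is a full 32 bits. */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)  /* debug exception & monitor control */
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)
#define DWT_CPICNT (*(volatile uint32_t *)0xE0001008u)
#define DWT_LSUCNT (*(volatile uint32_t *)0xE0001014u)

void dwt_profile_start(void)
{
    DEMCR     |= (1u << 24);        /* TRCENA: enable the DWT/ITM block */
    DWT_CYCCNT = 0;
    DWT_CPICNT = 0;
    DWT_LSUCNT = 0;
    DWT_CTRL  |= (1u << 0)          /* CYCCNTENA */
               | (1u << 17)         /* CPIEVTENA */
               | (1u << 20);        /* LSUEVTENA */
}

Start it just before the sine loop, read the three counters afterwards, and compare the hard-float and soft-float builds.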

-a

Reply to
Anders.Montonen

I have been following this thread with interest because it is similar in approach to power experiments we have done characterizing instructions.

Your results are puzzling and you might want to contact ST privately and see what they have to say. The Discovery board has been used quite a bit recently in the ST promotional seminars, and one of the demos shows the time and size difference between FPU-compiled code and the same source using floating-point libraries.

As others have suggested, the only explanation for the difference that I can see is wait states for the FPU. When the total energy for a given task is factored in (volts * current * time), a different picture will likely emerge.

Walter Banks..

Reply to
Walter Banks

When I get time, I'll clean up my test code and do a couple of variants that pare things down to the minimum set of operations. I'll do variants that concentrate on floating point multiply and divide. Since divide is multi-cycle, it should cause more pipeline stalls and may show a different power result than multiply.

During the paring-down process, I'll make sure that compiler optimizations don't eliminate the math functions! ;-)
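Something like this is what I have in mind for the pared-down kernel -- a sketch with made-up names, where the volatile source and sink keep the optimizer from folding or deleting the math, and the serial dependency keeps exactly one FP operation in flight per iteration:

#include <stdint.h>

volatile float fp_src = 1.0f;   /* 1.0f so the value stays finite however long it runs */
volatile float fp_sink;

void fp_mul_burn(uint32_t iterations)
{
    float x = fp_src;           /* volatile reads: values unknown at compile time */
    float k = fp_src;
    while (iterations--)
        x = x * k;              /* change '*' to '/' for the divide variant */
    fp_sink = x;                /* volatile write: result can't be discarded */
}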

When I used the RTOS with an idle task that shut off the CPU clock, power was definitely lower when using the FPU.

Mark Borgerson

Reply to
Mark Borgerson

Basically this is work to be done in assembly. What counts is not just the multiply but also the data dependencies; these can affect multiply/add performance several times over, basically by as many times as there are pipeline stages involved in the opcode under test. I would expect this to influence the power consumption (but I have never measured it, as opposed to the data dependencies, which I had to eliminate on a Power core to get all of its power out :-) ).
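A small illustration of what I mean (sketch only; on the Cortex-M4's single-cycle multiplier the difference will be small, but with divides, or on a deeper pipelined FP unit, it shows up clearly in the cycle counts and presumably in the current):

/* Both functions do the same number of FP multiplies, but the first is
 * one long dependency chain (each multiply waits for the previous
 * result), while the second keeps four independent accumulators so a
 * pipelined FP unit can stay busy. */
float chain_dependent(const float *v, int n)
{
    float p = 1.0f;
    for (int i = 0; i < n; i++)
        p *= v[i];                    /* serial: result feeds the next multiply */
    return p;
}

float chain_independent(const float *v, int n)
{
    float p0 = 1.0f, p1 = 1.0f, p2 = 1.0f, p3 = 1.0f;
    int i;
    for (i = 0; i + 3 < n; i += 4) {  /* four interleaved dependency chains */
        p0 *= v[i];
        p1 *= v[i + 1];
        p2 *= v[i + 2];
        p3 *= v[i + 3];
    }
    for (; i < n; i++)                /* leftover elements */
        p0 *= v[i];
    return (p0 * p1) * (p2 * p3);
}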

Dimiter

------------------------------------------------------
Dimiter Popoff
Transgalactic Instruments
------------------------------------------------------


Reply to
dp
