Cortex M4 Floating Point Size

Tim Wescott 12 years ago

I am, apparently, incompetent at reading data sheets.

At least when they get up to several hundred pages.

Do Cortex M4 parts deal with 64-bit floating point in hardware, or just 32-bit?

Thanks...

Tim Wescott Wescott Design Services http://www.wescottdesign.com

Roberto Waltman 12 years ago

32, I believe.

From the Cortex-M4 reference manual (DDI0439D_cortex_m4_processor_r0p1_trm.pdf):

"2.1 About the functions

Optional Floating Point Unit (FPU) providing:

32-bit instructions for single-precision (C float) data-processing operations.
Combined Multiply and Accumulate instructions for increased precision (Fused MAC).
Hardware support for conversion, addition, subtraction, multiplication with optional accumulate, division, and square-root.
Hardware support for denormals and all IEEE rounding modes.
32 dedicated 32-bit single precision registers, also addressable as 16 double-word registers.
Decoupled three stage pipeline."

"7.1 - About the FPU The Cortex-M4 FPU is an implementation of the single precision variant of the ARMv7-M Floating-Point Extension (FPv4-SP). It provides floating-point computation functionality that is compliant with the ANSI/IEEE Std 754-2008, IEEE Standard for Binary Floating-Point Arithmetic, referred to as the IEEE 754 standard. The FPU supports all single-precision data-processing instructions and data types described in the ARM®v7-M Architecture Reference Manual"

And from infocenter.arm.com: "ARMv7-M Architecture Reference Manual ... This document is only available ... to registered ARM customers."

Roberto Waltman [ Please reply to the group, return address is invalid ]

Tim Wescott 12 years ago

Crud.

Thanks.

I guess I test my algorithm with 32-bit arithmetic and see how it flies, then.

Tim Wescott Wescott Design Services http://www.wescottdesign.com

Anders.Montonen 12 years ago

In addition to the information Roberto posted, it may be worth keeping in mind that the parts with the FPU are "Cortex-M4F", and the parts without are plain "Cortex-M4". At least some of Freescale's Kinetis parts are of the latter kind.

-a

Jim Stewart 12 years ago

Just out of idle curiosity, what kind of an application might require 64 bit floating point?

Tim Wescott 12 years ago

Most control loops that need any precision won't work quite right with 32 bit floating point. You need more than the 25 bits worth of mantissa that comes with single-precision floating point (32 bit fixed-point often works quite well, however). If you're just spinning a motor then you can get by, but if you've got a PID loop with 16-bit or better inputs and a high sampling rate to bandwidth ratio, then you need integrators with more than 25 bits worth of precision.
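
To make the integrator point concrete, here is a minimal sketch in C. The function names are made up for illustration and do not come from any post in this thread. Single precision carries a 24-bit significand (25 bits if you count the sign), so once an accumulator holds 1.0, an increment smaller than 2^-24 is rounded away on every addition, while a double accumulator keeps it.

```c
#include <stdint.h>

/* Hypothetical helpers: add `step` to an accumulator `n` times.
 * `volatile` forces each partial sum to be stored in the declared
 * type rather than kept in a wider temporary. */
float integrate_f32(float start, float step, uint32_t n)
{
    volatile float acc = start;
    for (uint32_t i = 0; i < n; ++i) {
        acc += step;
    }
    return acc;
}

double integrate_f64(double start, double step, uint32_t n)
{
    volatile double acc = start;
    for (uint32_t i = 0; i < n; ++i) {
        acc += step;
    }
    return acc;
}
```

With a start value of 1.0 and a step of 0x1p-25 (that is, 2^-25), integrate_f32 hands back exactly 1.0 no matter how many steps are taken, while integrate_f64 accumulates every step. A slow integrator fed by a small error term behaves the same way: it simply stops moving.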

In this case it's a Kalman filter application. It may work with 32 bits, but I haven't tested it against the data that I have, and it'll be tight. So either I'll need to rearrange the algorithm (Kalman filters can use a "square root" algorithm that basically halves the required precision in the most sensitive areas, in return for a whole bunch of extra, and extra-weird, math) or re-think my processor choice.

Sigh...

Tim Wescott Wescott Design Services http://www.wescottdesign.com

info 12 years ago

The single-precision FPU of Cortex-M4F needs to be enabled before it is used (the FPU is disabled out of reset). Typically the FPU is enabled in the startup code, but you need to check to be sure.
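
For reference, enabling the FPU means granting full access to coprocessors CP10 and CP11 in the Coprocessor Access Control Register (CPACR, bits 20 to 23). The sketch below only computes the register value, so it can be checked on a host; the helper and macro names are made up for illustration. On a target with CMSIS headers, startup code would typically apply the same mask to SCB->CPACR and follow it with DSB and ISB barriers.

```c
#include <stdint.h>

/* CPACR access fields: 2 bits per coprocessor, 0b11 = full access. */
#define CPACR_CP10_FULL ((uint32_t)3 << 20)
#define CPACR_CP11_FULL ((uint32_t)3 << 22)

/* Hypothetical helper: returns the CPACR value with the FPU enabled.
 * On hardware the equivalent would be along the lines of:
 *   SCB->CPACR |= CPACR_CP10_FULL | CPACR_CP11_FULL;
 *   __DSB(); __ISB();
 */
uint32_t cpacr_with_fpu_enabled(uint32_t cpacr)
{
    return cpacr | CPACR_CP10_FULL | CPACR_CP11_FULL;
}
```

Executing a floating-point instruction while these bits are still clear raises a fault, which is why it is worth checking what the vendor startup code actually does.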

Also, the FPU in Cortex-M4F comes with its own register bank, which needs to be saved/restored if the FPU can be used in the ISRs or in tasks of a preemptive RTOS. The need for saving/restoring this context is a huge penalty for using the FPU in such circumstances. To reduce this (unacceptable really) overhead, ARM has introduced the feature called "lazy stacking", described in an ARM App Note. Lazy stacking of FPU registers is enabled by default.

Miro Samek state-machine.com/arm

FreeRTOS info 12 years ago

We seem to have gone off the topic of the OP, but...

[hardware] lazy stacking breaks down when using a true multi-threaded OS, requiring the FPU registers to be saved on a task context switch. The reason being, the lazy stacking algorithm [obviously] cannot be aware of the kernel's radical stack pointer manipulation - it can only be aware of predictable stack pointer increments and decrements.

Regards, Richard.

Designed for microcontrollers. More than 103000 downloads in 2012.
Trace, safety certification, FAT FS, TCP/IP, training, and more...

Paul Rubin 12 years ago

In Tim's application, I wonder whether the FPU can be exclusively used by a single task, so nothing else touches the registers. Is that a reasonable approach?

upsidedown 12 years ago

Floating point instructions in ISRs? I have never encountered such ISRs.

Why not apply the same principle to some of the highest priority tasks, and perform the FP-register save/restore only below a certain priority level? At the low levels, the full save/restore cost is not that significant, since these tasks typically execute for quite a long time at once. Of course, this requires some hooks into the task scheduler, but it should not be too hard to implement.

Paul Rubin 12 years ago

Well I've heard of applications whose main loop consisted of a halt instruction repeated endlessly. All the functionality happened at interrupt level. No idea if they used floating point. :)

FreeRTOS info 12 years ago

Where in this thread does it say that the OP is using multitasking or a task scheduler?

If multithreading is not being used then the Cortex-M4F will handle everything for you by only saving the floating point registers when it is absolutely necessary (the save being triggered by a floating point instruction being executed - if you turn this functionality on).

If multithreading is being used then there are several different ways of doing it...the best of which can only be determined when you know how the application is using the FPU (from how many tasks, how often, etc.).

However, as per my previous post, I think this is quite off topic from a question of "is it 32-bits or 64-bits" so probably not a helpful discussion to the OP.

Regards, Richard.

hamilton 12 years ago

I did that years ago (1985) on the i286 w/floating point co-processor (i287).

3-axis vertical mill: at each 8 ms interrupt, a new position for one of the axes was run.

A simple mutex handled the FPU.

There was no RTOS involved, just a simple round robin of each axis. All code was written with Turbo C.

Also did the same with a Z80 and an AM9511a co-processor before that. This one used Microsoft BASIC.

hamilton

Tim Wescott 12 years ago

It would. I've thought of that. At the moment the whole application is small enough that I'm planning on using a home-rolled cooperative multitasker that dodges the whole context-switch thing at the expense of weighing down the developer with the need to chop low-priority computations up into bits that are small enough that they don't bog down important tasks. So the whole "can't RTOS" thing is moot for me at the moment.

As far as the "only one task gets the math processor", I've actually already been there, done that (sorta), with the ADSP 2101 using an RTOS. The ADSP 2101 has some hardware context associated with its DSP functionality that is simply not accessible via software (except by "push" and "pop" into very shallow hardware stacks). It's not even a matter of "slow" -- it's "you can't, sucker". So if you want to use its DSP features in an RTOS you're limited to doing it in one task. (Well, one task and one ISR, thanks to those shallow stacks).

All the "regular processor" stuff can be context-switched just fine, however. So we used the thing exactly that way: we had one task for the heavy lifting (running a spinning-wheel gyroscope that had to be in closed loop) with a bunch of tasks to make it play nice with the balance of the system. That one magic control task was the _only_ task that got its fingers onto the MAC and associated instructions; everything else was kept away.

The board, by the way, worked great.

It would be harder to do this with the M4F. Ironically, it's because the tools support floating point -- in the case of that 2101, the tools didn't know what to do with a MAC instruction and never generated one. So it was easy to tuck all the "DSP" stuff away in assembly language code that was only called from one C file.

I suppose it might be possible to compile just one or two magic files using the M4F switch, and compile the rest using the M4 switch (or whatever the GNU compiler supports -- that's my next task!!!). If so, and if it works without weird namespace or other collisions, then I'd get software-synthesized math for most of the thing, and hardware math for the important stuff.

Tim Wescott Wescott Design Services http://www.wescottdesign.com

Tim Wescott 12 years ago

Totally off topic, yes. But still interesting, and useful in that I may get 32 bit to work for me, and the selection of a multithreaded OS isn't entirely off the table. This side discussion has certainly put a pretty high bar on any multitasking OS that I do select, so it's useful in that regard.

As I mentioned elsewhere, I'm currently planning on using a cooperative multitasker because (a) I have it lying around, and (b) I'm the only author on this software, so I don't have to worry about some dip**** trying to compute pi to 100 decimal places in the lowest-priority task without yielding.

Tim Wescott Wescott Design Services http://www.wescottdesign.com

info 12 years ago

Tim, Richard: To be strictly on topic, the whole discussion can be closed with just one number: 32, so all of the posts that go beyond this number are OT.

But I still believe that the mention of the "lazy stacking" feature of the Cortex-M4F FPU _is_ relevant, even in the absence of a preemptive RTOS or ISRs that use the FPU. I think it's good to know about "lazy stacking", because it is enabled by default (when you enable the FPU), so if you don't know about it, it can hit you with unexpectedly high stack usage. "Lazy stacking" always allocates the space for the FPU registers on the stack, but the actual saving/restoring of the registers does not happen until the FPU is used. This also has interesting implications for real-time behavior, because if an ISR uses the FPU, its timing will carry the penalty of stacking the FPU registers.
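
To put rough numbers on the stack usage, here is a small sketch in C with made-up helper names. It assumes the ARMv7-M exception frame layout as I understand it: a basic frame of 8 words, and an extended frame that adds 18 words (S0-S15, FPSCR and one reserved word) when the interrupted code has an active floating-point context. It ignores the optional 4-byte padding word used for 8-byte stack alignment.

```c
#include <stdint.h>

enum {
    BASIC_FRAME_WORDS  = 8,   /* R0-R3, R12, LR, PC, xPSR */
    FP_EXTENSION_WORDS = 18   /* S0-S15, FPSCR, one reserved word */
};

/* Hypothetical helper: bytes reserved on the stack at exception entry. */
uint32_t exception_frame_bytes(int fp_context_active)
{
    uint32_t words = BASIC_FRAME_WORDS;
    if (fp_context_active) {
        words += FP_EXTENSION_WORDS;
    }
    return words * 4u;
}

/* Frames for `depth` nested exceptions, all with or all without FP context. */
uint32_t nested_frames_bytes(uint32_t depth, int fp_context_active)
{
    return depth * exception_frame_bytes(fp_context_active);
}
```

Under these assumptions each exception entry costs 104 bytes instead of 32 once floating-point code is involved, so a few levels of interrupt nesting can add a couple of hundred bytes that a stack budget made without the FPU in mind would miss.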

Miro Samek state-machine.com

info 12 years ago

Indeed, a traditional RTOS kernel that can block in multiple places in a task body probably cannot take advantage of the "lazy stacking" feature.

But a simpler class of run-to-completion preemptive kernels _can_ take advantage of "lazy stacking" and, in fact, this feature integrates very seamlessly with this type of kernel. The use of the Cortex-M4F FPU with a preemptive QK kernel is described in Section 4.2 of the AppNote.

Miro Samek state-machine.com/arm

info 12 years ago

Yes, this is the most efficient use of the FPU. In this case, you can disable "lazy stacking" to save stack space. The CMSIS-compliant code for disabling "lazy stacking" is:

FPU->FPCCR &= ~((1U
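
The code line above is cut off in this archive. A plausible completion, offered as an assumption rather than a recovered original, clears the ASPEN (automatic state preservation) and LSPEN (lazy state preservation) bits of the FPCCR register. The sketch below models that on a host with a made-up helper name; the two bit positions match the values in the CMSIS core headers and are redefined here only to keep the example self-contained.

```c
#include <stdint.h>

/* Bit positions as defined in the CMSIS core headers. */
#define FPU_FPCCR_ASPEN_Pos 31U  /* automatic state preservation enable */
#define FPU_FPCCR_LSPEN_Pos 30U  /* lazy state preservation enable */

/* Hypothetical host-side helper: returns FPCCR with both bits cleared.
 * On a target with CMSIS headers the likely equivalent is:
 *   FPU->FPCCR &= ~((1U << FPU_FPCCR_ASPEN_Pos) | (1U << FPU_FPCCR_LSPEN_Pos));
 */
uint32_t fpccr_without_fp_stacking(uint32_t fpccr)
{
    uint32_t mask = ((uint32_t)1 << FPU_FPCCR_ASPEN_Pos)
                  | ((uint32_t)1 << FPU_FPCCR_LSPEN_Pos);
    return fpccr & ~mask;
}
```

With both bits cleared the hardware no longer reserves or saves any FPU context on exception entry, which is only safe when a single context ever touches the FPU registers.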

dp 12 years ago

Oh, more than those which can use 32 bits, for sure. For example, if you will be DSP-ing (that is, doing lots of MACs), 32-bit FP is just useless; the 24-bit mantissa begins to lose data before you know it. 32-bit FP can be useful of course, but not a lot if the FPU is constrained to 32-bit only. If it has both 32 and 64, one tends to use both; well, at least I tend to do so.

Dimiter

Dimiter Popoff, Transgalactic Instruments

dp 12 years ago

> Indeed, a traditional RTOS kernel that can block in multiple places in a task body probably cannot take advantage of the "lazy stacking" feature.
>
> But a simpler class of run-to-completion preemptive kernels _can_ take advantage of "lazy stacking" and, in fact, this feature integrates very seamlessly with this type of kernel. The use of the Cortex-M4F FPU with a preemptive QK kernel is described in Section 4.2 of the AppNote.

Or, if an OS is well written, it allows tasks to switch FPU saving on and off as needed, like I do under DPS all the time: when you need the FPU, call "fpuon$", which returns the former state of "fpu" for that task. On return from the function, if the former state was off, switch it off again; otherwise leave it on. So FPU registers are saved during a task switch only when necessary. This is not applicable to IRQ handlers, of course, but I can think of no IRQ handler I have written in nearly 30 years that needs or uses the FPU.
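
The save-and-restore pattern described here can be sketched in C as follows. This is not DPS code; the task structure and every function name are hypothetical, and a real kernel would consult the flag from its context-switch routine.

```c
#include <stdbool.h>

/* Hypothetical task control block; only the FPU flag is modeled. */
typedef struct {
    bool fpu_in_use;   /* true: scheduler must save/restore FPU registers */
} task_t;

/* Turn FPU context saving on for `task` and return the former state. */
bool task_fpu_on(task_t *task)
{
    bool former = task->fpu_in_use;
    task->fpu_in_use = true;
    return former;
}

/* Put back the state that task_fpu_on() returned. */
void task_fpu_restore(task_t *task, bool former)
{
    task->fpu_in_use = former;
}

/* Scheduler hook: is an FPU save needed when switching away from `task`? */
bool task_needs_fpu_save(const task_t *task)
{
    return task->fpu_in_use;
}
```

A function that needs the FPU would call task_fpu_on() on entry, keep the returned value, and pass it to task_fpu_restore() on exit. Because the former state is restored rather than unconditionally cleared, nested callers compose correctly: an inner function leaves the flag set if an outer one had already turned it on.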

Dimiter
