ARM Cortex Mx vs the rest of the gang - Page 2

Re: ARM Cortex Mx vs the rest of the gang
On 12.6.2017 17:21, StateMachineCOM wrote:
Quoted text here. Click to load it


They have to be saved during every context switch if the OS is broken.
What stops it from saving the FPU context only for tasks which use
the FPU? It is a single bit of information in the task descriptor.
And what stops the programmer from declaring whether the FPU is in use
within a task? Normally what is done for this purpose is:
- enable the FPU for the task, saving the previous state of the FPU_in_use
bit,
- do the FPU work,
- restore the previous state of the FPU_in_use bit.
Obviously for tasks which use the FPU intensively this need not
be done; one just leaves FPU_in_use on.
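In C the pattern might look roughly like this; the task structure, the
fpu_in_use field and the current_task() hook are all invented here purely
for illustration:

    #include <stdbool.h>

    /* Hypothetical task descriptor: one flag records whether this task's
       FPU context must be preserved across context switches. */
    struct task {
        bool fpu_in_use;
        /* ... integer context, FPU save area, etc. ... */
    };

    /* Stand-in for however the RTOS exposes the running task. */
    static struct task this_task;
    static struct task *current_task(void) { return &this_task; }

    /* Scoped FPU use inside a task that is not permanently marked as an
       FPU user: remember the old flag, set it, do the work, put it back. */
    double scale_sample(double x)
    {
        struct task *t     = current_task();
        bool         saved = t->fpu_in_use;  /* previous FPU_in_use state */

        t->fpu_in_use = true;         /* switches now preserve FPU context */
        double y = x * 0.5 + 1.0;     /* the actual floating point work    */
        t->fpu_in_use = saved;        /* restore the previous state        */

        return y;
    }

A task that uses the FPU throughout would simply set the flag once at
creation and never touch it again.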

Quoted text here. Click to load it

I once had a look at that only to confirm my suspicion that it is just
piling problems over existing problems. Things like this are done
by software, or rather software is written such that there is no
need for that sort of thing.

Quoted text here. Click to load it


Not having a separate FPU register set is a disadvantage, not an
advantage (now, if the FPU is a useless 32-bit one, having it at all
is a disadvantage, but this is another matter :-).
Nothing prevents software from using what it wants and from saving
only what has to be saved; often having an entire FPU and saving its
entire context in addition to that of the integer unit makes things more
efficient, sometimes a lot more efficient (e.g. implementing a FIR
on a plain FPU, where data dependencies would otherwise limit you to
speed/(pipeline length), you just need many registers).
The e500 cores from Freescale were never popular with me exactly
because they had the integer and FP register sets in one, just 32
registers in total. Not a good idea on a load/store machine.

Dimiter


======================================================
Dimiter Popoff, TGI             http://www.tgi-sci.com
======================================================
http://www.flickr.com/photos/didi_tgi/



Re: ARM Cortex Mx vs the rest of the gang
On 6/12/2017 7:57 AM, Dimiter_Popoff wrote:
Quoted text here. Click to load it

Exactly.


Exactly.  As memory is now the bottleneck in most designs, any program
state (e.g., floating point values) that has to be saved OUTSIDE the
CPU (even in normal operation, ignoring context switches) takes a hit
in terms of performance.  Especially when you're dealing with "big" data
types (64, 80, 128b).

[Guttag clearly missed the boat on that call wrt the 99K.  Cute/clever
idea but he failed to see the growing disparity in the CPU/memory
"impedance mismatch".  But, context switches were a piece of cake!  :>]

Quoted text here. Click to load it

Ideally, register-rich processors would implement a set of internal flags,
one of which would be set each time a register was loaded.  And, a "save
state" opcode that acted similarly to the 6809's PUSH "vector", saving only
those registers that have been "altered".  Then, let the programmer
clear this "flag register" as it reloads the preserved state for
the new task.

[An even more flexible/conditional scheme would allow traps for each
register but the overhead of servicing the trap FOR EACH REGISTER
would far outweigh the cost of unconditionally restoring ALL state]
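A rough software model of that scheme, with invented names (real hardware
would of course maintain the dirty bits itself, and the "save state" opcode
would consume them):

    #include <stdint.h>

    #define NUM_REGS 16

    /* Model of the proposed hardware: one "altered" flag per register. */
    struct cpu_model {
        uint32_t reg[NUM_REGS];
        uint16_t dirty;                  /* bit r set => reg[r] was written */
    };

    static void write_reg(struct cpu_model *c, int r, uint32_t v)
    {
        c->reg[r] = v;
        c->dirty |= (uint16_t)(1u << r); /* hardware sets this on any load */
    }

    /* Model of the hypothetical "save state" / "PUSH <vector>" opcode:
       store only the altered registers into the task's save area, then
       let the context-switch code clear the flags for the next task. */
    static void save_state(struct cpu_model *c, uint32_t *save_area)
    {
        for (int r = 0; r < NUM_REGS; r++)
            if (c->dirty & (1u << r))
                save_area[r] = c->reg[r];  /* each register has a fixed slot */
        c->dirty = 0;
    }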

Quoted text here. Click to load it


Re: ARM Cortex Mx vs the rest of the gang
On 12.6.2017 20:10, Don Y wrote:
Quoted text here. Click to load it

Hi Don,

I am not so sure how useful this would be; basically software knows
which registers have been used and which not, and it is up to it to save
just what needs to be saved and restore it only when needed. It could be
of some help, but not enough to justify the extra silicon & complexity,
I believe.

Dimiter




Re: ARM Cortex Mx vs the rest of the gang
Hi Dimiter,

On 6/12/2017 12:06 PM, Dimiter_Popoff wrote:
Quoted text here. Click to load it

You're not thinking with HLL's in mind -- where a *tool* creates the software
(how does the tool tell you, concisely, which registers it used?)

And, even if you know which registers were used, you don't know which were
used SINCE THE LAST CONTEXT SWITCH!

Quoted text here. Click to load it

The silicon is trivial:  each load of a register (or register in a register
file) forces a corresponding bit to be set in a collection of flags.

Then, a new "PUSH <vector>" opcode simply uses that "collection of flags" as
the <vector>.

If you had to "manually" examine the flags (bits) in that vector and
conditionally save/restore registers, the overhead of doing so would
outweigh whatever it saved over just unconditionally performing the
save/restore.

In essence, this is what I do with my handling of the FPU context (see other
post).  I assume the FPU registers are NOT used and let the processor
(in the NS32k example) tell me when a floating point instruction is invoked
(the FPU is an optional component in the early NS32k systems; if it is NOT
present, the opcodes are implemented by traps to user-supplied emulation
functions) by invoking a TRAP handler.

Of course, I can use that notification (with or without a hardware FPU)
to alert the OS to the fact that the additional state is being referenced
and save/restore it, as appropriate.

[This only needs to happen at most once for each context switch]
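
A minimal C sketch of that arrangement, assuming a hypothetical kernel (the
names, structures and hardware shims below are invented for illustration,
not the actual NS32k code):

    /* Whatever the FPU's full register/state image happens to be. */
    struct fpu_context { unsigned long regs[32]; unsigned long status; };

    struct tcb {
        struct fpu_context fpu;   /* FPU save area in the task control block */
        /* ... integer state, stack pointer, etc. ... */
    };

    static struct tcb *fpu_owner; /* task whose state currently lives in the FPU */
    extern struct tcb *current;                     /* assumed scheduler variable */

    extern void fpu_save(struct fpu_context *dst);  /* assumed hardware shims     */
    extern void fpu_restore(const struct fpu_context *src);
    extern void fpu_trap_enable(void);              /* make FP opcodes trap       */
    extern void fpu_trap_disable(void);

    /* Context-switch path: don't touch the FPU at all, just arm the trap
       so the first FP instruction tells us the new task really needs it. */
    void on_context_switch(void)
    {
        fpu_trap_enable();
    }

    /* Trap handler for the first FP instruction after a switch.  At most
       one FPU save/restore happens per context switch, and none at all if
       the task never executes a floating point opcode. */
    void fpu_trap_handler(void)
    {
        if (fpu_owner != current) {
            if (fpu_owner)
                fpu_save(&fpu_owner->fpu); /* old owner may not be the previous task */
            fpu_restore(&current->fpu);
            fpu_owner = current;
        }
        fpu_trap_disable();                /* let the trapped instruction restart */
    }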

Re: ARM Cortex Mx vs the rest of the gang
On 12.6.2017 22:19, Don Y wrote:
Quoted text here. Click to load it

Of course not, although HLL-s should be pretty good at knowing what to
push/pull. But I certainly consider a language broken if it creates a
need for hardware not needed when using other languages.

Quoted text here. Click to load it

Well, and if they are not changed, what then? You have to mark stack frames
somehow to know what exactly you saved so you can restore it, etc.;
then even if you switch tasks once per ms the time to save/restore
all registers is negligible (32 longwords get written to cache on a
400 MHz 32-bit processor within 80 ns...).
OTOH if it is just an interrupt handler it will know what registers
it uses and will save just them - and it will modify them, so there is
no need to know whether they were changed or not.

Quoted text here. Click to load it

This sounds simple enough indeed: push the registers along with the
list descriptor, then use it to restore them. But you will
still have to calculate the length allocated to save the list,
and this variable length might complicate the scheduler enough to
cancel the benefits, if not worse... hard to say by just hypothesizing.


Quoted text here. Click to load it

I am even more cheeky than that on power for DPS. A task _must_ have
declared it will use the FPU if it is going to use it; this means it gets
its FPU context preserved, entirely -- all 32 FP regs + fpscr etc.
If not, the task just won't know what the state of the FPU is; I can
make it trap (a bit in the MCR) or leave it unknown (not sure which
I do; trying to use the FPU when not explicitly enabled is a programming
error). Quite often, when I need the FPU for just a function and do not
know whether the calling task will have the FPU enabled or not, at
the beginning the function saves the FPU on/off state, switches
it "on", does its job, then restores the former state.

Dimiter

======================================================
Dimiter Popoff, TGI             http://www.tgi-sci.com
======================================================
http://www.flickr.com/photos/didi_tgi/



Re: ARM Cortex Mx vs the rest of the gang
On 6/12/2017 1:04 PM, Dimiter_Popoff wrote:
Quoted text here. Click to load it

Yes, but the language doesn't know when a context switch is going to be
triggered that swaps out *all* of the processor state!

context switch

           restore state of task M
----
           do something

           alter register Q
task M
           do something

           alter register P
----
           preserve state of task M

context switch

           restore state of task N
----
           use register Y

           alter register Q
task N
           ...


When preserving the state of task M, *only* the saved state of registers
Q and P needs to be updated as they are the only registers that have been
altered in that portion of task M's execution.  The other 873 registers
haven't been altered so the "preserved copies" of their state needn't
be updated.

The problem with traditional architectures is that there is no easy
way of the "context switch" routine knowing what has been altered
since the state for *that* task was most recently preserved.

The hack I proposed would be "reset" as the last step in the context switch
so that any alterations of the register file's contents would be individually
flagged.  I.e., at the time of the task switch at the end of task M, the
CPU will have "noticed" that ONLY registers Q & P have been updated
(update can be as simple as noting ANY write to the register *or* as clever
as noting any write that alters the specific contents of *that* register
so overwriting a value of '27' with another '27' would NOT flag the register
as "altered").

So, when the context switch went to "preserve state of task M", it would
execute this magical "PUSH <vector>" command and move only those altered
register contents back into the TCB.

[It's not really a "PUSH" as it needs to put each register in a specific
place relative to some "frame" -- i.e., the TCB -- which could be  carefully
arranged to be ToS relative]

Quoted text here. Click to load it

Yes, but it's obvious that the number of cores and the number (and size)
of registers in those cores will only keep increasing as the memory
interface becomes more of a "performance issue".  Cache tries to handwave
around this problem -- more buffering doesn't always come without costs
(e.g., having to flush the cache).

N cores means N times the hammering on the memory interface due to
task switches.

Quoted text here. Click to load it

Yes -- in that case, the developer knows what he's "touched" in the
ISR (if written in ASM) and would only bother preserving and restoring
those things that he was about to "taint".

E.g., there are ARM's that have a "FIRQ (Fast IRQ)" capability that essentially
preserves minimal state for extremely low latency IRQ's.

Quoted text here. Click to load it

The CPU vendor knows how many registers he has in the core.
So, if he knows that registers 1, 2 and 12 need to be saved
(by this magical "PUSH <vector>" instruction, he could save
r1 to WORKSPACE+1, r2 to WORKSPACE+2 and r12 to WORKSPACE+12
where WORKSPACE is a particular register/address.

I.e., the 99k implemented all registers in main memory.  So, when
you accessed r1, you were really accessing the contents of memory
at WORKSPACE_POINTER+(1*register_size).  Shifting r1 left resulted
in a read of that memory location into the CPU, a left shift in the
ALU and a write *back* to that location of the updated datum.

Imagine, instead, caching all of those operations in an internal
register file (gee, what a novel idea!  :> ) and, only flushing
the contents of ALTERED registers back into main memory (at locations
relative to the WORKSPACE_POINTER) when a task switch was needed.

[I.e., on the 99k, to do a context switch, you just changed one
register -- the workspace pointer -- as the context was already *in*
memory!]
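
A toy C model of that idea (not TI's actual microarchitecture; the
word-addressed memory and the 16-register workspace are just stand-ins):

    #include <stdint.h>

    #define WS_REGS 16                   /* workspace registers R0..R15 */

    static uint16_t memory[32768];       /* word-addressed model of main memory */
    static uint16_t wp;                  /* Workspace Pointer: word index of R0 */

    /* On such a machine a "register" access is a memory access at WP+r. */
    static uint16_t read_reg(int r)              { return memory[wp + r]; }
    static void     write_reg(int r, uint16_t v) { memory[wp + r] = v; }

    /* The cached variant described above: keep an on-chip copy plus dirty
       flags and, on a task switch, flush only the altered registers back
       to memory before pointing WP at the new task's workspace. */
    static uint16_t reg_cache[WS_REGS];
    static uint16_t dirty_mask;

    static void context_switch(uint16_t new_wp)
    {
        for (int r = 0; r < WS_REGS; r++)
            if (dirty_mask & (1u << r))
                memory[wp + r] = reg_cache[r]; /* write back only what changed */
        dirty_mask = 0;
        wp = new_wp;                           /* the rest of the context is
                                                  already *in* memory */
    }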

Quoted text here. Click to load it

I did this in my first implementation.  But, that meant having a handler
for those cases where someone screwed up the configuration of the task
and forgot to indicate that it needed the FPU.  At runtime, the task might
not exercise that portion of its code, so you might not see a "crash and
burn" (until thorough testing).

So, if you have to handle the case where the task hasn't been configured
to use the FPU and it *does*, then why not let that handler just "handle
the FPU's usage" without forcing the developer to make that configuration
choice?

You can create "FCB's" (FPU Control Blocks) to store FPU state and reference
them *from* each task's TCB (so they are a part of that task's "state").
This also lets the FPU-handler keep a pointer to the FCB into which the
current FPU's hardware should be (eventually) saved -- *if* that need arises.
If the current task executes an FP opcode before the FPU state has been
preserved, then the old task's FPU state can be saved; the TCB for the
current task lets you chase down that task's FCB so its
previous FPU state can be restored before the floating point operation is
allowed to continue (restart).
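
Sketched as data structures (all names hypothetical), that arrangement
might look like:

    /* Per-task FPU state lives in its own block, referenced from the TCB,
       so tasks that never touch the FPU never pay for the space or the copy. */
    struct fcb {                      /* "FPU Control Block"               */
        double   s[32];               /* image of the FPU register file    */
        unsigned status;              /* FPU status/control word image     */
    };

    struct tcb {                      /* task control block                */
        /* ... integer registers, stack pointer, scheduling state ... */
        struct fcb *fcb;              /* NULL until the task first uses FP */
    };

    /* The FP trap handler remembers which FCB the live hardware state
       belongs to; it is written back there only when a *different* task
       executes a floating point opcode. */
    static struct fcb *live_fpu_state;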

It's damn near impossible to come up with a winning strategy without an
understanding of the application and the deployment environment.  Where
are the *effective* resource shortages?

Re: ARM Cortex Mx vs the rest of the gang
On 13.6.2017 00:34, Don Y wrote:
Quoted text here. Click to load it

Yes, I see no problem with that but not much gain either. Generally I
don't care much about mistakes other people will make, I make enough of
my own to care for. So if something is a programming error the best I
can do is to make it as easily detectable as possible. The way you
have done it is OK of course, it is no longer a programming error, but
I am not sure it would save me much time. Nor would it waste me much
though, so why not.

Dimiter



Re: ARM Cortex Mx vs the rest of the gang
On 6/12/2017 4:44 PM, Dimiter_Popoff wrote:
Quoted text here. Click to load it

The gain comes from eliminating the need to configure the "uses_FPU"
flag for each task, as well as the problem of a developer
failing to *correctly* define that flag (i.e., if he changes the
switches used with the compiler so that FP opcodes appear in the generated
code, he won't shoot himself in the foot).

This is especially true when the task can execute code that the
developer didn't write/compile.  Does he know under what circumstances
FP opcodes are called into play?

I'm finding working in "resource richer" environments results in
very different approaches to software/system design!  :<

Re: ARM Cortex Mx vs the rest of the gang

Quoted text here. Click to load it
...because it adds tons of overhead and a lot of headache for the
system-level software.

Quoted text here. Click to load it
...with a big context of 32 32-bit registers (S0-S31). These registers need
to be saved and restored as part of every context switch, just like the CPU
registers. ARM has come up with some hardware optimizations called "lazy
stacking and context switching" (see ARM AppNote 298 at
http://infocenter.arm.com/help/topic/com.arm.doc.dai0298a/DAI0298A_cortex_m4f_lazy_stacking_and_context_switching.pdf ).
But as you will see in the AppNote, the scheme is quite involved and still
requires much more stack RAM than a context switch without the VFP. The
overhead of the ARM VFP in a multitasking system is so big, in fact, that
often it outweighs the benefits of having a hardware FPU in the first place.
Often, a better solution would be to use the FPU in one task only, and
forbid using it anywhere else. In this case, preserving the FPU context
would be unnecessary. (But it is difficult to reliably forbid using the FPU
in other parts of the same code, so it opens the door for race conditions
around the FPU if the rule is violated.)

Quoted text here. Click to load it
The Renesas RX CPU also comes with a single-precision FPU, which is much
better integrated with the CPU and does not have its own register context.
Compared to the ARM VFP it is a pleasure to work with.

Quoted text here. Click to load it

Thanks for the explanation.

Bye Jack

Re: ARM Cortex Mx vs the rest of the gang
On 6/12/2017 7:21 AM, StateMachineCOM wrote:
Quoted text here. Click to load it

Actually, this is fairly old news.  I've been designing MTOS's & RTOS's
with this in mind for at least 30 years -- though the actual mechanisms
involved vary.

E.g., the FPU for the NS32k is a true coprocessor; it executes in parallel
with the CPU.  So, you don't *want* to save its state at a context switch
cuz it might be busy "doing something".

Instead, you enable the trap on the FPU opcodes so that if the "new" task
attempts to use the FPU, you first swap out the FPU's state -- having
remembered which task it belongs to (which may not be the task that executed
immediately prior to the current task!).  Having done so, you restore the
saved FPU state for *this* task, disable the trap and let the instruction
complete *in* the FPU.  All the while, knowing that it may not complete
before the current task loses control of the processor.

With this framework, you can configure individual tasks to use the FPU -- or
not.  *AND*, detect a task that "accidentally" uses the FPU when it has been
configured not to.  The last bit is important because you can build different
flavor TCB's -- one that holds just the basic registers and another that
holds the basic *plus* the FPU state.

If your tools give you finer-grained control over which *parts* of the FPU
are used, then you can similarly refine the parts that you save/restore.

[E.g., the Zx80 has an "alternate register set" that a compiler will rarely
make use of.  But, is handy for ASM coders.  Saving and restoring it
unconditionally is wasteful as it almost doubles the process state.  *But*,
conditionally doing this (synchronous with a regular task switch in the
case of the Zx80's) can offer significant reward.]

Quoted text here. Click to load it

Why "difficult"?  Turn on FP emulation in the code compiled for the
"nonFPU tasks".  Then, *if* an occasional floating point instruction
is invoked, it just executes "slowly".

Quoted text here. Click to load it

That's like complaining that the 6 course meal is much inferior to just
"grabbing a burger" at a fast-food joint...

Re: ARM Cortex Mx vs the rest of the gang
On 06/12/17 16:54, Don Y wrote:

Quoted text here. Click to load it

That sounds too complicated. If the fpu is busy, why not just put the
requesting task back on the ready queue, then try again
next time round ?. Set priorities accordingly.

Other solutions might include encapsulating the fpu within its own
task, with or without an input queue, then use messaging to talk to it ?

Anyway, isn't this just a bit academic ?. Modern cpus are orders of
magnitude faster than early designs, and most designs are no longer
limited by cpu throughput.  Just take the simplest approach, save all
registers to start, then profile the code to see where the bottlenecks
are. It's not economic, nor sound engineering, to fine tune everything
just for the sake of it...

Chris


Re: ARM Cortex Mx vs the rest of the gang
On 7/10/2017 6:30 AM, Chris wrote:
Quoted text here. Click to load it

Using the FPU *is* complicated in a multithreaded world!  :>

Quoted text here. Click to load it

So, some event has occurred which forces a reschedule() operation.
The system has decided that TASK_X (hand-wave away the task/process/thread
finer points) is deserving of the processor (or, *a* processor core).
But, you want to defer the execution of this "task" because it's
inconvenient, at this time, and try for "second best".  What if the
second choice also requires the FPU's services?  Third choice?  etc.

You want to artificially LOWER the timeliness constraints of TASK_A
because the FPU is busy -- even if the VERY FIRST opcode that TASK_A
fetches (after it RESUMES execution) might not be a floating point
instruction?

How do you model this in your system design?  Do you profile the
frequency of floating point operations in each task and try to
predict the likelihood of one thread (task) starting a floating point
operation in the instant before a reschedule() event to be followed
by another thread (task) that happens to need to execute a floating
point operation AT SOME POINT (possibly hours from now)?  Does the
deferred task (thread) ever regain its DESERVED priority (timeliness)?
Or, once "demoted", does it remain that way -- hoping its peers
similarly get demoted (by pure chance) so that its RELATIVE priority
is reclaimed?

You're making a coarse-grained scheduling decision whereas the
trap approach just has the appearance of an opcode "taking longer"
to execute "in user space".

Quoted text here. Click to load it

You can treat the FPU as a "device" -- like a UART, disk drive,
NIC, etc. -- and impose sharing (through locks/mutexes) on it.
But, this promotes the sharing to a very visible level and forces
the developer (and all tasks) to consider the extent of FPU usage
in each case where the device is "open()-ed" -- so you can push
commands/messages at it.

Quoted text here. Click to load it

*Memory* is the bottleneck.  Saving and restoring FPU state WHEN NOT
NEEDED (by the task surrendering the CPU/FPU *or* the task acquiring it)
generates lots of unnecessary memory activity.  E.g., in the M4, the
FPU state is ~100+ bytes that you're moving in and out, possibly
needlessly.

Quoted text here. Click to load it

Imagine if every ISR preserved and restored the ENTIRE processor state
"just to make things simple".  Would you consider THAT to be "sound
engineering"?  :>


Re: ARM Cortex Mx vs the rest of the gang
On 07/10/17 23:03, Don Y wrote:

 >
 > Using the FPU *is* complicated in a multithreaded world! :>

Yes, so you have to tightly define how it is accessed for best
results.

 >
 > You want to artificially LOWER the timeliness constraints of TASK_A
 > because the FPU is busy -- even if the VERY FIRST opcode that TASK_A
 > fetches (after it RESUMES execution) might not be a floating point
 > instruction?

If you have contention for a resource, someone has to wait.
Who waits depends on task priorities; fpu state may have
to be saved, but so what ?. There are various ways to provide
fair access, but if the design is so critically constrained by timing
issues, then the design is wrong and needs more resources. Ok, we
have all had to deal with that, but it shouldn't happen these days.

 >
 > How do you model this in your system design? Do you profile the
 > frequency of floating point operations in each task and try to
 > predict the likelihood of one thread (task) starting a floating point
 > operation in the instant before a reschedule() event to be followed
 > by another thread (task) that happens to need to execute a floating
 > point operation AT SOME POINT (possibly hours from now)? Does the
 > deferred task (thread) ever regain its DESERVED priority (timeliness)?
 > Or, once "demoted", does it remain that way -- hoping its peers
 > similarly get demoted (by pure chance) so that its RELATIVE priority
 > is reclaimed?
 >
 > You're making a coarse-grained scheduling decision whereas the
 > trap approach just has the appearance of an opcode "taking longer"
 > to execute "in user space".

Sorry, that doesn't make sense.

 >
 >> Other solutions might include encapsulating the fpu within it's own
 >> task, with or without input queue, then use messaging to talk to it ?
 >
 > You can treat the FPU as a "device" -- like a UART, disk drive,
 > NIC, etc. -- and impose sharing (through locks/mutexes) on it.
 > But, this promotes the sharing to a very visible level and forces
 > the developer (and all tasks) to consider the extent of FPU usage
 > in each case where the device is "open()-ed" -- so you can push
 > commands/messages at it.

If you use messaging ipc, the fpu is always ready for data, assuming
the queue is properly sized. If you want to make it priority aware,
just include that with the request, along with the pid of the
requester. All the fpu internal complexity is hidden from the requester,
which doesn't need to know. While this might not be ideal for an fpu,
a task based model is a great way to encapsulate complexity.

 >
 > Imagine if every ISR preserved and restored the ENTIRE processor state
 > "just to make things simple". Would you consider THAT to be "sound
 > engineering"? :>
 >

 From a practical, keep-the-code-simple point of view, and
assuming there are no performance issues, that's the way I might do
it, but even the ancient 68000 had selective register save
instructions. Have used them at times in interrupt handlers, but it
requires poking around in the entrails and asm macros if you
typically write interrupt handlers in C.  Modern cpus are orders of
magnitude faster, a king's ransom of riches in terms of throughput
and hardware options, which allows a much more high-level view of
the overall design. Ok, for a few apps like image processing and
compression etc, more care might be needed, but they are the
exceptions for embedded work, afaics.

We are thankfully, past the stage where it was necessary to hand
optimise all the low level stuff to make systems work and the added
complexity is not good for reliability, nor maintenance. It's rarely
properly documented, so the poor soul who replaces you will have no idea
why the design decisions were made. Good design is not just about the
hardware and code, but the whole project infrastructure and the needs
surrounding it.

This is a bit tl;dr isn't it ?, but you do cover a lot of ground at
once :-)...

Chris

Re: ARM Cortex Mx vs the rest of the gang
On 7/11/2017 9:24 AM, Chris wrote:
Quoted text here. Click to load it

Or, design a strategy that will *adapt* to the needs of the application
without having to make that decision /a priori/.

Quoted text here. Click to load it

Of course.  But the "when" becomes a driving factor.  You design
the system based on the intrinsic priorities of the "actors"
competing for those resources.  You don't decide that "its hard"
to give an actor his just due and, thus, rejiggle the "priorities"
to fit something more convenient.

Quoted text here. Click to load it

But it may *not* "have to be saved".  You're assuming the task to be
assuming control of the processor *does* need the FPU and WILL need it
"presently" -- so save the state NOW instead of deferring the
act until it PROVES to be necessary (said "proof" being indicated
by the execution of a floating point operation).

The state of the FPU is *big* in most processors.  With multicore
chips, that's multiplied by the number of cores.  THE MEMORY BANDWIDTH
IS FIXED (and shared by *all* cores).  Why move temporally distant
data through that pipe if you don't NEED to do so?

Quoted text here. Click to load it

Why do compilers worry so much about optimization?  We *surely*
shouldn't NEED the effective resource gain that these options
provide, right?

Unconditionally saving and restoring the FPU's state is akin to
unconditionally saving the entire state of the CPU for each
interrupt -- why invent things like FIRQ (which costs real silicon)
if these constraints "shouldn't happen these days"?

Why optimize away:
        foo += number;
        foo -= number;
SURELY we can afford a pair of integer (?) operations!  :>

Quoted text here. Click to load it

You are using the current state of the FPU (busy) to effectively
make a scheduling decision -- without knowledge of whether or not
the task that SHOULD be executing, next (based on the scheduling
criteria selected AT DESIGN TIME) will actually need the FPU *or*
will need it "in this next period of execution" (avoiding the term
"time slice" because preemptive schedulers tend not to be driven
strictly by "time").  *OR*, even "shortly".

When will the *deferred* highest priority task get his next opportunity
to run?  If you jigger with the priorities, then he's no longer
the most eligible to run (you may, in fact, have introduced a deadlock
that the *design* had implicitly avoided in its assignment of "priority").

[I assume you understand that "priority" in the sense used in scheduling
is NOT a "small integer used to artificially impose order on competing
actors"]

Quoted text here. Click to load it

If you want finer-grained access to the FPU, then you have to be willing
to save and restore the contexts of the individual clients on a transactional
basis.  I.e., either load the FPU context of the IPC being serviced *now*,
run the opcode and then save the context as you're passing the results
to the client.  Or, leave the most recently loaded context *in* the FPU
until you examine the next incoming IPC to determine *if* there is a
need to swap out the context currently residing therein.

Your other arguments advocate unconditionally loading and saving the
*current* client's FPU context on each IPC -- regardless of recent past
history of that resource.  My argument is to leave whatever context
happens to be *in* the FPU there -- in the hope that the next request
MIGHT be from the same client; only swap contexts when you KNOW the
new client is a different entity than the last and, therefore, avoid
the overhead of a save-restore PER IPC.
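
As a rough sketch (the message format, queue calls and FPU shims below are
assumptions, not any particular OS's API), such a server loop might be:

    #include <stddef.h>

    struct fpu_ctx { double s[32]; unsigned status; };

    struct request {
        struct fpu_ctx *client_ctx;   /* where this client's FPU image lives */
        int             op;           /* requested operation, operands, ...  */
    };

    extern struct request *wait_for_request(void);  /* assumed IPC receive */
    extern void reply(struct request *r);           /* assumed IPC reply   */
    extern void hw_fpu_save(struct fpu_ctx *dst);   /* assumed FPU shims   */
    extern void hw_fpu_restore(const struct fpu_ctx *src);
    extern void hw_fpu_execute(int op);

    void fpu_server(void)
    {
        struct fpu_ctx *loaded = NULL; /* context currently sitting in the FPU */

        for (;;) {
            struct request *r = wait_for_request();

            /* Swap contexts only when the requester changes; back-to-back
               requests from the same client reuse the state already loaded. */
            if (r->client_ctx != loaded) {
                if (loaded)
                    hw_fpu_save(loaded);      /* put the old client's state away */
                hw_fpu_restore(r->client_ctx);
                loaded = r->client_ctx;
            }

            hw_fpu_execute(r->op);            /* perform the requested operation */
            reply(r);
        }
    }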

Quoted text here. Click to load it

Ask yourself:  why did the vendor include these instructions in the
processor's design?  Why did they complicate the silicon, and the
programming model.  Surely, the developer has adequate resources
to blindly save the entire state; why provide provisions to save
only part of it?  "It shouldn't happen these days"  :>

Why would an MCU vendor add silicon and programming complexity
to a design to support this sort of treatment of the FPU?
Why waste an engineer's time documenting it:
<http://infocenter.arm.com/help/topic/com.arm.doc.dai0298a/DAI0298A_cortex_m4f_lazy_stacking_and_context_switching.pdf>
Surely, the developer shouldn't need to tune an application
(OS) to this extent, these days! (?)

Quoted text here. Click to load it

Modern APPLICATIONS are orders of magnitude more complex!   And,
you don't always use features to gain performance but, also, to
gain reliability, etc.

If I, the developer, KNOW that a particular task/process/thread
doesn't use the FPU, why wouldn't I want to take advantage of a mechanism
that tells me *if* an attempt is made to use the FPU (by THAT task)?
And, if so notified, wouldn't I want to *do* something about it?

If I, the developer, KNOW that my task's memory references are
constrained to the region [LOW,HIGH] -- because some other task
accesses the adjoining memory above/below that region -- wouldn't
I want to take advantage of a mechanism that tells me *if* an
attempt is made to access memory outside that region?

If I, the developer, KNOW that my task should NEVER be accessing a
particular file, device, etc. wouldn't I want to take advantage
of a mechanism that tells me *if* it tries to do so?

Or, tries to WRITE to program memory (CODE)?

Or, tries to grow the stack beyond the limits determined at design time?

Or, tries to "hog" the CPU?

etc.

Quoted text here. Click to load it

The opposite is true.  Why do we see increasingly complex OS's in use?
Ans:  because you can design the mechanisms to detect and protect
against UNRELIABLE program operation *once* and leverage that across
applications and application domains.

Why do we see HLL's in use?  Ans:  it makes it easier for developers to
code larger programs *reliably*.  (Why "larger"?  Because applications
are getting orders of magnitude more complex).

Quoted text here. Click to load it

If the poor soul is competent to design an operating system, then he
SHOULD be skilled enough in his art to understand the ideas that are
frequently exploited in operating system designs.  If not, he shouldn't
be tinkering with the OS's implementation.

(You wouldn't want someone who doesn't have a deep understanding of
floating point issues to be writing a floating point emulation library,
would you?)

The developer (writing the *application*) need not be concerned about
the minutiae of how context switches are performed.  Do you have to
understand how a multilevel page table is implemented (and traversed
at runtime) in order to use demand-paged virtual memory?  OTOH,
you *would* if you were charged with maintaining that part of the
codebase!

Quoted text here. Click to load it

Good design is fitting the design *to* the application "most effectively"
(which are squishy words that the developer defines).  If every project
could be handled with a PIC and 2KB of RAM, there'd be no need for
MMU's, FPU's, RTOS's, HLL's, SMP, IPC/RPC, etc.

Thankfully, (cuz that would be a world of pretty BORING applications)
that's not the case.  And, as applications ("projects") get larger,
they quickly grow to a point where they are "complex" (complex:  too
large to fit in one mind) and have to rely on the efforts of many.
Anything that can be done to be a productivity/reliability/performance
multiplier "by purchasing a few more gates on a die" almost always
has a net positive return.

Imagine if the authors of every application running on your PC had
to cooperate to ensure they were all LINKED at non-competing memory
addresses (because there was no relative addressing mode, segments,
virtual memory, etc.).  Instead, the silicon -- and then the OS -- can
assume the burden of providing these mechanisms so the developers
need not be concerned with them.

[I'd wager most PC developers are clueless as to what happens under
the hood when their application is launched.  And, I suspect there
is a boatload of documentation available for them *if* they decided
they had a genuine need to know -- at whatever level of detail they
deemed appropriate!]

Quoted text here. Click to load it

IME, most non-trivial engineering decisions are hard to summarize in
a page (or ten :> ) or less.

Time to take advantage of 12 hours of rain to do some digging...

Re: ARM Cortex Mx vs the rest of the gang
On 07/11/17 19:15, Don Y wrote:

 > Of course. But the "when" becomes a driving factor. You design
 > the system based on the intrinsic priorities of the "actors"
 > competing for those resources. You don't decide that "its hard"
 > to give an actor his just due and, thus, rejiggle the "priorities"
 > to fit something more convenient.

Make a rough estimate during development, then fine tune to fix
edge cases, or where a bit more headroom is needed for individual
tasks. No design is fixed in stone from the start.

 >
 > The state of the FPU is *big* in most processors. With multicore
 > chips, that's multiplied by the number of cores. THE MEMORY BANDWIDTH
 > IS FIXED (and shared by *all* cores). Why move temporally distant
 > data through that pipe if you don't NEED to do so?

You seem to be assuming high end systems running at the ragged
edge, which isn't the sort of work done here. Leave that to the mobile,
tablet and workstation / graphics people. You can't even be fluent in
all aspects of computing, let alone the electronics that enables it.

 > Why do compilers worry so much about optimization? We *surely*
 > shouldn't NEED the effective resource gain that these options
 > provide, right?.

Not sure. From a performance point of view, perhaps, but optimisation
can reduce memory footprint, critical for some embedded work.

 >
 > Unconditionally saving and restoring the FPU's state is akin to
 > unconditionally saving the entire state of the CPU for each
 > interrupt -- why invent things like FIRQ (which costs real silicon)
 > if these constraints "shouldn't happen these days"?

I guess you are talking arm?. FIRQ is a leftover from early arm, fwir.
Have you seen the amount of tortuous code needed to get interrupts
working properly with the ARM7TDMI, for example?. About 2 pages of dense
assembler, from memory. I rejected early arm almost on that basis
alone, but there were other idiosyncrasies. They fixed it eventually
with a proper (68K-style) vector table, but it took them a long time
:-). Cortex was when Arm finally came of age.

 > You are using the current state of the FPU (busy) to effectively
 > make a scheduling decision -- without knowledge of whether or not
 > the task that SHOULD be executing, next (based on the scheduling
 > criteria selected AT DESIGN TIME) will actually need the FPU *or*
 > will need it "in this next period of execution" (avoiding the term
 > "time slice" because preemptive schedulers tend not to be driven
 > strictly by "time"). *OR*, even "shortly".


It's a case of organising system design, task allocation etc, so
that you get a result that meets spec. Think systems engineering.
If you have limited resources, something has to give, but I would
prefer a situation where an fpu operation always runs to completion.
It's the simplest solution and has the fewest variables in terms
of estimating performance.

Interrupting and saving fpu state could be done, but only if all
other avenues have been explored. It's a whole can of worms, best
avoided if possible, and dependent on the actual fpu in use. It needs
memory to save context, added management code and maybe complex
synchronisation. Even if you make it work, it may turn out to
be less efficient than run to completion.

Anyway, all kinds of events affect scheduling decisions, even if
indirectly. To make a waiting process ready, for example. Don't see what
the problem is. Perhaps that's the issue: Some always look for
issues, while others assume everything is going to work.

 >
 > If you want finer-grained access to the FPU, then you have to be willing
 > to save and restore the contexts of the individual clients on a
 > transactional
 > basis.

I don't want fine grained access if possible. I want a black box to
feed data and get a result. Not really interested what happens
under the hood, so long as it meets requirements and is predictable.

 >
 > Ask yourself: why did the vendor include these instructions in the
 > processor's design? Why did they complicate the silicon, and the
 > programming model. Surely, the developer has adequate resources
 > to blindly save the entire state; why provide provisions to save
 > only part of it? "It shouldn't happen these days" :>

Simple, both memory and processors were slow in those days and needed
all the help they could get. Modern processors arguably don't need them
for most applications. Do commercial tool chains make use of them ?.
Last time I checked, gcc still didn't know about interrupts, though
some vendors do add extensions.

 >
 > Why would an MCU vendor add silicon and programming complexity
 > to a design to support this sort of treatment of the FPU?
 > Why waste an engineer's time documenting it:
 > <http://infocenter.arm.com/help/topic/com.arm.doc.dai0298a/DAI0298A
    _cortex_m4f_lazy_stacking_and_context_switching.pdf>

Competitive market perhaps ?. Featureitis between vendors to cater for
the widest application range and market share. With most work these days,
you only use a fraction of the internal architecture and throughput. I see
that as good, as there's more freedom to think systems engineering, rather
than detail.

 >
 >> Have used them at times in interrupt handlers, but it
 >> requires poking around in the entrails and asm macros if you
 >> typically write interrupt handlers in C. Modern cpu's are orders of
 >> magnitude faster, a king's ransome of riches in terms of throughput
 >> and hardware options, which allows a much more high level view of
 >> the overall design. Ok, for a few apps like image processing and
 >> compression etc, more care might be needed but they are the
 >> exceptions for embedded work, afaics.
 >
 > Modern APPLICATIONS are orders of magnitude more complex! And,
 > you don't always use features to gain performance but, also, to
 > gain reliability, etc.

Perhaps many modern apps don't need it, but I don't write apps, so what
do I know ?. It's not only Windows and Linux that suffer from bloat
these days.

 >
 > The opposite is true. Why do we see increasingly complex OS's in use?
 > Ans: because you can design the mechanisms to detect and protect
 > against UNRELIABLE program operation *once* and leverage that across
 > applications and application domains.
 >

Are we talking about vanilla embedded work here, or big system design ?.

 >
 > Good design is fitting the design *to* the application "most effectively"
 > (which are squishy words that the developer defines). If every project
 > could be handled with a PIC and 2KB of RAM, there'd be no need for
 > MMU's, FPU's, RTOS's, HLL's, SMP, IPC/RPC, etc.
 >

Agreed, but much embedded work is not big systems stuff, but at simple
state driven loop or rtos level. Ok, phones etc are all some
flavor of unix, Linux, whatever, but not typical embedded.

 >
 > Imagine if the authors of every application running on your PC had
 > to cooperate to ensure they were all LINKED at non-competing memory
 > addresses (because there was no relative addressing mode, segments,
 > virtual memory, etc.). Instead, the silicon -- and then the OS -- can
 > assume the burden of providing these mechanisms so the developers
 > need not be concerned with them.

That's why mainstream os's have loaders and memory management, because
you want maximum flexibility, whereas embedded is usually locked down
to a particular need.

I don't get into pc stuff, it's just a tool and I assume that it works,
which it generally does. Same for Linux, but FreeBSD gets more and more
interesting and is rock solid on X86 and Sparc here. After systemd and
other bloat issues, Linux becomes less and less attractive.

 >
 > IME, most non-trivial engineering decisions are hard to summarize in
 > a page (or ten :> ) or less.
 >
 > Time to take advantage of 12 hours of rain to do some digging...

Been chucking it down all day here today in Oxford, but that's uk
summer weather and as you say, an excuse to catch up with the groups
and get into some back burner ideas. Too many interests and not
enough time, as usual :-)...

Chris


Re: ARM Cortex Mx vs the rest of the gang
On 7/11/2017 4:16 PM, Chris wrote:
Quoted text here. Click to load it

No, I'm seeing an opportunity for an optimization that can be largely
transparent to *any* application (assuming the application makes use
of floating point operations -- with or without hardware assist)
WITHOUT burdening the developer with the details of its implementation.

E.g., 30+ years ago, I'd build floating point "subroutines" (ASM) with a
preamble that resembled:

      if (!flag) {
           save_floating_point_context(previous_owner)
           restore_floating_point_context(new_owner)
           flag = TRUE
      }
      ...  // body of actual "subroutine"

This allowed the "task switcher" (scheduler) to simply clear
"flag" as part of the normal context switch and DEFER handling
the "floating point unit" (which was a bunch of subroutines and
a large shared section of memory) to a time when the "new_owner"
actually NEEDED it -- as indicated by his CALLing any of the
floating point subroutines (ALL of which had the above preamble).

This allowed the "FPU" to be implemented in a time-efficient
manner (e.g., potentially leaving the "floating point accumulator"
in denormalized form instead of normalizing after every operation!)

It's an obvious step from there to hooking the "helper routines"
used by many (esp *early*) compilers in the same way.

And, from there, to hooking the (early) *hardware* FPU's (e.g., Am9511)
that were costly to embrace in a multithreaded environment without
such deferred optimization.

Finally, the more modern FPU's with better mechanisms to detect these
things IN HARDWARE (i.e., no need for that explicit "if (flag)...")

Quoted text here. Click to load it

The point of all of these optimizations is they can be done, reliably, without
requiring effort on the part of the developer.  The folks responsible for
designing/implementing your OS deal with this issue.  Just like the compiler
writers deal with the schemes/machinations to make your code smaller, faster,
etc.

Quoted text here. Click to load it

The thread is about ARMs (Cortex M4).  FIRQ is still available in
most (all?) ARM cores.

Quoted text here. Click to load it

Do you turn the cache OFF in your designs -- because it makes it easier to
estimate performance?

Quoted text here. Click to load it

Modern hardware FPU's tend to treat all opcodes as atomic.
The difference is software emulations -- you'd not want to let
the emulation of FSIN run to completion when it can be interrupted
at any of the hundreds of opcode fetches spanning its duration.

[But, then you need to be able to preserve ALL of the emulator's
state, not just the state that visibly mirrors the hardware FPU!]

Quoted text here. Click to load it

Designing reliable products means thinking about everything that *can*
go wrong and either ensuring it can't *or* being prepared to handle the
case *when* it does.

Quoted text here. Click to load it

An FPU is essentially another CPU.  As much (or more!) "internal state"
as the CPU itself.  If you want to share that resource, then you
need a way of ensuring that task_A's FPU register contents aren't
used (or exposed!) to task_B's operations.  So, you either swap them
in/out based on the identity of the (IPC) client making the *new*
request *or* examine the request and selectively decide which
portions of the FPU state are "safe" from interference based on
the nature of the FPU request (e.g., if it is an attempt to FADD
S0 and S1, then S2-S31 can be left in place -- only the previous
contents of S0 & S1 need to be preserved and the new client's
contents of S0 & S1 restored prior to servicing the request.)

[Think about the consequences of that sort of implementation:
now you have to track which *portions* of the FPU state are
associated with which tasks.  *Or*, let the FPU emulation
operate on FPU state *in* each client's TCB]

Quoted text here. Click to load it

How do you KNOW that?  As memory becomes increasingly the bottleneck,
the number of registers inside the processor (CPU, FPU, MMU, etc.)
increases in an attempt to cut down on memory traffic.  E.g., the
99K placed the bulk of the processor's registers *in* memory
and just kept a pointer to them (the Workspace Pointer) inside the
CPU.

As the amount of state inside the CPU increases, the cost of
context switches goes up -- the memory accesses that have been
"avoided" by incorporating a register file eventually end up
appearing "deferred" (you pay the piper when the context switch
comes along)

Quoted text here. Click to load it

It still uses PUSH and POP -- even for a register-at-a-time.

Quoted text here. Click to load it

Have you seen how many products use a Linux kernel when they don't really
*need* that level of functionality?  How much does *it* draw into
the mix that the application itself doesn't intrinsically need?

Returning to my earlier comment, applications have become increasingly
complex.  Some of this is natural progression.  Some is a design tradeoff
("Let's use floating point instead of hassling with Q12.19...").  Some
is marketing hype.

I don't want "users" (Ma & Pa) to have to understand the consequences of
particular numeric data types.  So, I use a BigRational form for the
"numbers" that users manipulate in their scripts.  I can elect to
give them 200 digits of precision (or, let them opt for that themselves)
rather than explaining to them why you want to reorder:

    REALLY_BIG_NUMBER * REALLY_BIG_NUMBER * REALLY_BIG_NUMBER
---------------------------------------------------------------
(REALLY_BIG_NUMBER * REALLY_BIG_NUMBER * REALLY_BIG_NUMBER) + 1

That comes at a cost:  I "waste" some of the system's resources
to enable them to NOT need to think about this level of detail.

Similarly, I "waste" system resources to ensure program A can't
stomp on program B's code/data.  Or, access a resource to which
it should have no need ("why is the MP3 player trying to access
the NIC?")

All of these added complexities make the resulting system more
robust and easier to design within.  (Easier just to *hide*
a resource from someone who shouldn't be needing it than it is to
try to concoct a set of ACL's that allow those who *should* have
access to do so while preventing those who shouldn't!)

Quoted text here. Click to load it

*Systems* are more complex.  In the past, products were isolated little
islands.  Your mouse had no idea that it was sitting alongside a keyboard.
There was no interaction between them.

Now, that is increasingly NOT the case.  It's now COMMON for applications
to have network connectivity (with all the complexity -- and risk -- that
a network stack brings to the design).

When the Unisite was released (80's?), it was "odd" in that it didn't have
a user interface:  just two idiot lights, a power switch and a "null modem"
switch on the back.  It *relied* on an external display (glass TTY) to
act as its user interface.  Previous product offerings had crippled
little keypads and one-line displays that tried to provide the same sort
of information in a klunkier manner ("Use Mode 27 for this...")

Now, its common for a device to have no specific user interface
and rely on a richer interface provided by some external agency.
No need for DIP switches to configure a device:  just set up a
BOOTP server and let the device *fetch* its configuration from
a set of text files that the user can prepare with more capable
tools (than a kludgey keypad interface).

Quoted text here. Click to load it

You're assuming embedded is NOT "big system design".

The cash registers at every store I visit are PC (or iPad) based.
What do you call *them*?  Does a cash register need "DirectX" capabilities?
Or, the ability to read FAT12/16 filesystems?

My current system is distributed.  I "waste" an entire core on each node
just servicing communications and RPC.  <shrug>  I'll *take* every optimization
that I can get "for free" to pay for these more costly capabilities (that can't
easily be optimized).

Quoted text here. Click to load it

This is a THERMOSTAT:
<https://www.ifixit.com/Teardown/Nest+Learning+Thermostat+2nd+Generation+Teardown/13818>

Conceptually, it just implements:

    case mode {
        HEAT =>
            if (temperature < setpoint)
                furnace(on)
        COOL =>
            if (temperature > setpoint)
                ACbrrr(on)
    }

As I said, applications are getting increasingly complex!

Do you *need* that sort of capability in a thermostat?
Questionable.

OTOH, if it can reduce your heating/cooling costs, then
it's potentially "free".  A "dumb" thermostat can be
MORE expensive!

Quoted text here. Click to load it

"Usually" is a representation of The Past.  Looked at the capabilities of
"smart TV's" lately?

My current system is "deeply embedded".  But, provides a *richer*
execution environment than a typical desktop PC -- because it aims to
be more durable, extensible and reliable.  You can replace a PC every
few years; you wouldn't want to replace ALL the automation in a
particular business every few years!
     "Is there something WRONG with the existing irrigation system?
     Burglar alarm?  HVAC controls?  Energy management system??
     I.e., WHY should we be replacing/uprading it?"

Quoted text here. Click to load it

The top end of the "embedded" domain keeps nibbling at the underbelly
of the "desktop/mainframe" domain.  40 years ago, I could pilot a boat
with a few KB of code and an actuator for the rudder.  Nowadays,
cars park and drive themselves -- undoubtedly with far more than
a few KB and a fractional MIPS of resources!

An embedded designer who isn't aware of the technologies that are
becoming increasingly "affordable" is doomed to designing 2 button
mice for the rest of his days.

Quoted text here. Click to load it

Time gets scarcer and interests (for anyone with an imagination)
multiply.  The only solution I've found is to reduce the time spent
asleep!  :<

Re: ARM Cortex Mx vs the rest of the gang
On Wed, 12 Jul 2017 11:44:38 -0700, Don Y

Quoted text here. Click to load it

FWIW: on modern x86 ["modern" meaning since Pentium Pro, circa 1995],
PUSH and POP instructions are internally converted to MOV instructions
referencing the appropriate [stack offset] addresses.  A sequence of
PUSHes or POPs may be executed simultaneously and/or out of order.

x86 compilers still emit the PUSH and POP instructions because they
are more representative of the logical model expected by the programmer
who examines the generated code.


Quoted text here. Click to load it

A *really* minimal configuration provides little more than chipset
support, tasking, and memory management with MMU isolation.  Depending
on the kernel version that could be as little as ~80 KB of code.  You
can run a [tight] kernel+application image in as little as 1 MB.

You actually *can* run Linux sans MMU, but it is difficult because so
many existing drivers and software stacks assume the MMU is present
and enabled.  You have to be willing/able to roll your own system
software.

George

Re: ARM Cortex Mx vs the rest of the gang
On 7/12/2017 2:46 PM, George Neuner wrote:
Quoted text here. Click to load it

On machines with more orthogonal instruction sets, auto-pre/post-inc/decrement
addressing modes could effectively implement *a* stack using any register.
So, a PUSH/POP (PULL) was just a shorthand for a "well decorated" opcode

     MOV  (R6)+, R0

Even the '8 had a mechanism for doing this using particular "memory indirect"
addressing modes via a small set (16?) of specific memory addresses

[IIRC, the Nova's could conceptually keep indirecting through "random"
memory locations indefinitely... "never" coming up with a final effective
address!]

With processors that didn't have the same sort of orthogonality
in addressing modes available, PUSH/POP/PULL could *imply* the
auto inc/decrement register indirect mode on a *special* register
(SP).

Quoted text here. Click to load it

My point is that folks don't bother to trim that DEAD CODE from their
products.  Either they figure it's not worth the effort (CODE memory is
cheap?) *or* they are fearful of their lack of DETAILED knowledge of
the kernel's internals and don't want to risk "breaking something".

How many devices support a web interface that, conceptually, should
only be accessed by a single client at any given time -- but don't
expressly PREVENT two or more simultaneous connections?  Just drop
the cobbled code into the application and coax it to do what you
want -- and hope the "extra" code is never accidentally activated
(exploited!)

You don't, for example, think I'm going to elide the code from
PostgreSQL that supports the UUID type because I don't need/use it?
<grin>  Rather, I'll *rationalize* that someone MIGHT make use of it
in the future and use that to justify leaving it in the codebase
(despite it being, effectively, dead code!)

Quoted text here. Click to load it

Re: ARM Cortex Mx vs the rest of the gang
Quoted text here. Click to load it


ARMv7-M did away with most modes, leaving only thread and handler modes  
(corresponding to the old "usr" and "svc" modes). ARMv8-M added secure  
variants of both. There are no equivalents of the "fiq", "irq", "abt",  
"sys" and "und" modes. Another difference is that R13 (stack pointer) is  
the only banked register in the base architecture, plus secure state  
versions of some control registers in ARMv8-M.

-a

Re: ARM Cortex Mx vs the rest of the gang
On 7/13/2017 4:42 AM, snipped-for-privacy@kapsi.spam.stop.fi.invalid wrote:
Quoted text here. Click to load it

I spoke about "ARM cores", not *just* M-series cores.  E.g., the A cores still
support these modes (you can find cores that don't support FPUs, big.LITTLE,
NEON, etc.;  that doesn't mean the FIRQ feature -- or the rationale behind
it -- is "obsolescent").
