Cortex-M: share an int between two tasks

Reply to
David Brown

ARM has an application note (DAI 0321A) explaining the barriers.

--

-TV
Reply to
Tauno Voipio

I was excited to open your response, then disappointed. Forget to hit Save?

CH

Reply to
Clifford Heath

No, I just haven't had the time to do a decent reply yet. It is still an open draft on my desktop. I haven't forgotten about it - I've just had too many other things to do.

Reply to
David Brown

No worries. I look forward to it.

CH

Reply to
Clifford Heath

Barriers are very difficult for a programmer to get right. In my opinion, it is also a CPU architectural mistake to need barriers for user-level code. (It's OK for system software to need some barriers like ISB.) Many people disagree with me on this point, and we don't need to argue it here.

The basic problem is that there's a mismatch between what the programmer is thinking about and what the compiler/CPU require. This is one reason why multithreading is more difficult than it should be.

ARM lets you choose whether the barrier is a read barrier, a write barrier, or both. I suggest you always do "both", which ARM calls SY (full system), and which is the default if you just say "DMB" or "DSB". I suspect there's almost no performance difference, and it's one less thing you have to worry about. It makes sense for the Linux barrier macros to be more aggressive, since they need to show off, they care more about performance, and they have the time to test and debug their logic across a variety of systems. This stuff is easy to make a mistake on, and very hard to debug.

For application programming, you should generally only need DMB for ordering of volatile accesses.
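
As a concrete illustration (just a sketch using the CMSIS-Core __DMB() intrinsic; the buffer/ready names are made up), the classic "write the data, then raise a flag" pattern looks like this:

    #include <stdint.h>
    /* __DMB() comes from CMSIS-Core, normally via the device header */

    volatile uint32_t buffer;
    volatile uint32_t ready;

    void publish(uint32_t value)
    {
        buffer = value;   /* write the data first                  */
        __DMB();          /* make the data visible before...       */
        ready  = 1u;      /* ...the flag that announces it         */
    }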

DSB is for ordering other traffic with data accesses--things like cache invalidates, icache fetches, TLB shoot-downs, etc. If you have anything like that which needs to be ordered, you want DSB. Again, ordering accesses to variables in normal memory doesn't need DSB, but you can use it if you want much lower performance (DSB is a superset of DMB, so you can use DSB anyplace you would use DMB). Generally, user code doesn't need DSB unless you're doing self-modifying code.

ISB is for ordering special system register accesses, or ordering data accesses with other CPU actions. If you want to read the TLB using an AT instruction, you must do an ISB before reading the PAR register. I cannot think of a user-level code sequence that needs ISB off the top of my head. ISB does nothing to order data operations, so by itself it's not what you want for that. There's hidden magic in the combination "DSB; ISB". This waits for most (but not quite all) previous bus traffic to complete before executing any new instructions, including instruction fetches, data fetches, etc. User code generally never needs to do this, but OS code sometimes does.
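
For flavour, one system-code place where the "DSB; ISB" pair commonly turns up (a sketch only - SCB->VTOR and the intrinsics are CMSIS-Core, ram_vectors is a made-up name) is relocating the vector table:

    extern uint32_t ram_vectors[];   /* suitably aligned copy of the table */

    void relocate_vectors(void)
    {
        SCB->VTOR = (uint32_t)ram_vectors;
        __DSB();   /* let the write complete before continuing              */
        __ISB();   /* flush the pipeline so later fetches use the new table */
    }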

In terms of "heaviness", DMB is the lightest--it will slow the pipeline a few clocks to get the data accesses right. DSB is actually the slowest--it generally causes a bus transaction, which is sent to other agents, and a response is sent back (CPU optimizations can avoid this traffic sometimes, especially if another DSB was recently done). You don't want to use DSB unless you really need it. And ISB is in the middle--also a few cycles, but likely a little more than DMB. And "DSB; ISB" basically brings the CPU to temporary halt--waits for (almost) every current fetch to finish, then restarts.

To make things worse, the way to actually insert the barrier is another level of complexity, which sadly seems to be compiler dependent.
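
For example (illustrative only - which spelling works depends on your toolchain), the same full barrier can be written as:

    /* CMSIS-Core intrinsic, available for GCC, Clang, armclang, IAR, ... */
    static inline void barrier_a(void) { __DMB(); }

    /* GCC / Clang inline assembly; the "memory" clobber also stops the
       compiler itself from reordering memory accesses across it */
    static inline void barrier_b(void) { __asm volatile ("dmb sy" ::: "memory"); }

    /* Clang builtin; 0xF is the SY (full system) option */
    static inline void barrier_c(void) { __builtin_arm_dmb(0xFu); }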

It's almost as if this whole area is a big giant mess.

Kent

Reply to
Kent Dickey

Many thanks Kent, that's very useful and to-the-point.

Clifford Heath

Reply to
Clifford Heath

I agree on the principle. And usually it can be done in practice too, but it can come at a cost. For most embedded systems, the way to avoid needing barrier instructions is to set up memory areas with different characteristics such as cacheable, bufferable, etc. Typically memory mapped peripherals will be in an area where all accesses are strictly ordered and uncacheable, and then no barrier instructions are needed. For small microcontroller cores, this has no cost since you don't have caches or write buffers anyway, but on bigger processors it can be significant when you have larger blocks of data to transfer. This can be a measurable hit on things like Ethernet performance or data in DMA buffers.
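
A sketch of what that setup can look like with the CMSIS-Core mpu_armv7.h helpers (region number, base address and size are only illustrative - and on most devices the 0x40000000 peripheral space is already Device memory in the default map):

    #include "mpu_armv7.h"   /* normally pulled in via the device header */

    void mpu_setup(void)
    {
        /* Region 0: peripheral space as Device memory - non-cacheable,
           bufferable, execute-never - so register accesses stay ordered
           without explicit barrier instructions. */
        ARM_MPU_SetRegion(
            ARM_MPU_RBAR(0u, 0x40000000u),
            ARM_MPU_RASR(1u,                    /* execute never        */
                         ARM_MPU_AP_FULL,       /* full access          */
                         0u,                    /* TEX = 0              */
                         1u,                    /* shareable            */
                         0u,                    /* not cacheable        */
                         1u,                    /* bufferable           */
                         0x00u,                 /* all subregions used  */
                         ARM_MPU_REGION_SIZE_512MB));

        ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk); /* default map as background */
    }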

The most important thing is always that the code should be correct. It is better to be slower and correct than faster and incorrect!

Thus you usually have such memory setups to cover the normal cases, and put any required cache or barrier instructions in system code. If you are going to need some cache flush and data ordering instructions before starting a DMA transfer, then those should be in the "start_dma_transfer" function - written by a programmer who /does/ know how these things work.

Another kind of barrier is the compiler memory barrier. Again, it can be hard for users to get these right - and they should be put in system code for things like interrupt disable functions so that users don't have to worry about them.
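
Something like this, for instance (a sketch using GCC/Clang inline assembly - the "memory" clobber is the compiler barrier, so the user never has to think about it):

    static inline void irq_disable(void)
    {
        __asm volatile ("cpsid i" ::: "memory");
    }

    static inline void irq_enable(void)
    {
        __asm volatile ("cpsie i" ::: "memory");
    }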

Agreed. C11 and C++11 can help a bit with atomics and fences, but relatively few people understand these well. I am a fan of message passing and queues as a way of inter-thread communication, as it is a lot easier to understand and get right than using locks or critical sections. It is also much easier to scale with SMP or AMP. You don't need to worry about whether data is written to memory before the lock is taken, or whether you want a compiler memory barrier, a processor barrier instruction, volatile accesses - just put the message you want on the queue and off it goes. (Just don't pass pointers to data on the local stack...)
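
As a rough illustration of why queues are easier to reason about (a minimal single-producer, single-consumer sketch in C11 - all the names are made up, and a real RTOS queue would add blocking):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define SLOTS 16u                       /* power of two */

    struct msg { uint32_t id; uint32_t payload; };

    struct spsc_queue {
        struct msg slot[SLOTS];
        _Atomic uint32_t head;              /* written only by the producer */
        _Atomic uint32_t tail;              /* written only by the consumer */
    };

    /* Producer: copy the message in, then publish it with a release store. */
    static bool queue_put(struct spsc_queue *q, const struct msg *m)
    {
        uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head - tail == SLOTS)
            return false;                   /* full */
        q->slot[head % SLOTS] = *m;
        atomic_store_explicit(&q->head, head + 1u, memory_order_release);
        return true;
    }

    /* Consumer: the acquire load of head makes the copied message visible. */
    static bool queue_get(struct spsc_queue *q, struct msg *m)
    {
        uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);
        if (head == tail)
            return false;                   /* empty */
        *m = q->slot[tail % SLOTS];
        atomic_store_explicit(&q->tail, tail + 1u, memory_order_release);
        return true;
    }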

(Write or read/write barrier - there is AFAIK no read barrier.)

For smaller microcontrollers, there will be no noticeable difference. By the time you have external dynamic memory connected via a quad SPI bus, the latency on reads can be much more dramatic. Writes can be buffered further down the chain (such as in the QSPI or SDRAM controller), but you don't want to wait for reads if you don't have to.

Still, it is always better to be safe than fast, and use "both" if you are not sure.

Agreed.

Generally you don't need that either. The volatile accesses will be ordered by the compiler (as long as the programmer doesn't make the mistake of thinking that volatile accesses also order with non-volatile accesses). If the memory setup is done right, then when writing to peripherals the CPU will enforce the order without the need for DMB. And you don't need DMB for purely CPU-local actions, such as interaction between interrupt routines or threads on the same processor (volatile and compiler barriers are sufficient).

The point where you typically need DMB is for data that is in main memory and shared between bus masters, like other processors, DMA, or Ethernet controllers. Then you might need a DMB before informing the other masters that data is ready. You may need cache control instructions too. (You need this sort of thing for reads as well as writes.)
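
For example, handing a buffer to a DMA engine on a cached core might look like this (a sketch for a Cortex-M7 with the CMSIS cache helpers; dma_start_tx() stands in for whatever register write actually starts the transfer):

    extern void dma_start_tx(const void *buf, uint32_t len);   /* hypothetical */

    void send_buffer(uint8_t *buf, uint32_t len)
    {
        /* Push the dirty cache lines to main memory so the DMA master
           sees the data we just wrote. */
        SCB_CleanDCache_by_Addr((uint32_t *)buf, (int32_t)len);

        /* Make sure the clean and any buffered writes have completed
           before the DMA is told the buffer is ready. */
        __DSB();

        dma_start_tx(buf, len);
    }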

Yes, and also changes to the MPU mappings are a common case.

Just say "no" to self-modifying code! Firmware updates are an exception, of course. And your DSB is likely to be combined with data cache flushes (to make sure the changes are written to memory), instruction cache flushes (to make sure you don't have stale data there) and ISB.

Sometimes this sort of thing can be recommended for entering "sleep" modes - often in combination with chip errata on early versions of devices.

Note that the cost of these instructions varies significantly from system to system. On an M0, all three barrier instructions will likely be no more expensive than a NOP. On an M7 with cache and outstanding transactions to external memory, they can cost a lot.

That is partly true - ARM has made a reasonable attempt at headers that can be used with a variety of compilers for at least some of this stuff. But there are always complications when you are dealing with features that simply cannot be described in languages like C.

Well, it's all a big compromise. You can design a processor system that doesn't need barriers of any kind, but it won't scale for higher speeds and certainly won't work with multiple processors. (And once you get to multiple processors, you have another layer with the memory models - you can have programmer-friendly "strong" models like the x86, or far simpler and more efficient "weak" models like most RISC processors, requiring more effort from the programmer.)

Reply to
David Brown

And, despite opinions to the contrary, that is just as true in Java[1].

The major advantage that Java has (probably, belatedly, "had") is a memory model that took account of caches and multicore processors. But even that model had to be "adjusted" after a few years out in the wild; humility is beneficial :)

[1] Or assembler for that matter!
Reply to
Tom Gardner
