How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on

Even though I have been writing software for embedded systems for more
than 10 years, there's a topic that from time to time makes me think
for hours and leaves me with many doubts.
Consider a simple embedded system based on an MCU (AVR8 or Cortex-Mx).
The software is bare metal, without any OS. The main pattern is the
well-known mainloop (background code) that is interrupted by ISRs.
Interrupts are used mainly for timing and for low-level drivers. For
example, the UART reception ISR moves the last received char into a
FIFO buffer, while the mainloop code pops new data from the FIFO.
static struct {
    unsigned char buf[RXBUF_SIZE];
    uint8_t in;
    uint8_t out;
} rxfifo;

/* ISR */
void uart_rx_isr(void) {
    unsigned char c = UART->DATA;
    rxfifo.buf[rxfifo.in % RXBUF_SIZE] = c;
    rxfifo.in++;
    // Reset interrupt flag
}

/* Called regularly from mainloop code */
int uart_task(void) {
    int c = -1;
    if (rxfifo.out != rxfifo.in) {
        c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
        rxfifo.out++;
    }
    return c;
}
According to a 20-year-old article[1] by Nigel Jones, this seems a
situation where volatile must be used for rxfifo.in, because it is
modified by an ISR and read in the mainloop code.
I don't think so: rxfifo.in is read from memory only once in
uart_task(), so there isn't the risk that the compiler optimizes it
badly. Even if the ISR fires immediately after the if statement, this
doesn't lead to a dangerous state: the just-received data will simply
be processed at the next call to uart_task().
So IMHO volatile isn't necessary here. And critical sections (i.e.
disabling interrupts) aren't useful either.
However I'm thinking about memory barriers. Suppose the compiler
reorders the instructions in uart_task() as follows:
c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
if (rxfifo.out != rxfifo.in) {
    rxfifo.out++;
    return c;
} else {
    return -1;
}
Here there's a big problem, because the compiler decided to read
rxfifo.buf[] first and only then test in and out for equality. If the
ISR fires immediately after the data has been moved to c (most probably
an internal register), the condition in the if statement will be true
and the register value is returned. However, the register holds
whatever was in the buffer slot before the ISR wrote the new char, so
the returned value isn't correct.
I don't think any modern C compiler reorders uart_task() in this way,
but we can't be sure. From the compiler's point of view the observable
result doesn't change, so it is allowed to do this kind of thing.
How can I fix this issue if I want to be absolutely sure the compiler
will not reorder this way? Applying volatile to rxfifo.in shouldn't
help here, because the compiler is still allowed to reorder accesses to
non-volatile variables[2].
One solution is adding a memory barrier in this way:
int uart_task(void) {
    int c = -1;
    if (rxfifo.out != rxfifo.in) {
        memory_barrier();
        c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
        rxfifo.out++;
    }
    return c;
}
However this approach appears dangerous to me. You have to check and
double-check if, when and where memory barriers are necessary, and
it's easy to miss a barrier where it's needed or to add one where it
isn't needed.
So I'm thinking that a sub-optimal (regarding efficiency) but reliable
(regarding the risk of missing a barrier where it is needed) approach
could be to enter a critical section (disabling interrupts) anyway,
even where it isn't strictly needed.
int uart_task(void) {
    int c = -1;
    ENTER_CRITICAL_SECTION();
    if (rxfifo.out != rxfifo.in) {
        c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
        rxfifo.out++;
    }
    EXIT_CRITICAL_SECTION();
    return c;
}
Another solution could be to apply the volatile keyword to rxfifo.in
*AND* rxfifo.buf too, so the compiler can't reorder accesses to them.
Do you have other suggestions?
[1]
formatting link

[2]
formatting link

Reply to
pozz
This is a good introduction to how Linux makes this possible for its horde of device-driver authors:

Clifford Heath
Reply to
Clifford Heath
It's nice to see a thread like this here - the group needs such discussions!
Unless you are sure that RXBUF_SIZE is a power of two, this is going to be quite slow on an AVR. Modulo means division, and while division by a constant is usually optimised to a multiplication by the compiler, you still have a multiply, a shift, and some compensation for it all being done as signed integer arithmetic.
It's also wrong, for non-power of two sizes, since the wrapping of your increment and your modulo RXBUF_SIZE get out of sync.
The usual choice is to track "head" and "tail", and use something like:
void uart_rx_isr(void) {
    unsigned char c = UART->DATA;
    // Reset interrupt flag
    uint8_t next = rxfifo.tail;
    rxfifo.buf[next] = c;
    next++;
    if (next >= RXBUF_SIZE) next -= RXBUF_SIZE;
    rxfifo.tail = next;
}

int uart_task(void) {
    int c = -1;
    uint8_t next = rxfifo.head;
    if (next != rxfifo.tail) {
        c = rxfifo.buf[next];
        next++;
        if (next >= RXBUF_SIZE) next -= RXBUF_SIZE;
        rxfifo.head = next;
    }
    return c;
}
These don't track buffer overflow at all - you need to call uart_task() often enough to avoid that.
(I'm skipping volatiles so we don't get ahead of your point.)
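As a host-side sketch of the head/tail variant (the UART register is
replaced by a function parameter, and the names fifo_put/fifo_get are
made up for illustration), note that this scheme works for any buffer
size, not just powers of two:

```c
#include <stdint.h>

#define RXBUF_SIZE 8 /* any size works with the head/tail scheme */

static struct {
    unsigned char buf[RXBUF_SIZE];
    uint8_t head; /* consumer index, always in [0, RXBUF_SIZE) */
    uint8_t tail; /* producer index, always in [0, RXBUF_SIZE) */
} rxfifo;

/* Producer: in real code c would come from UART->DATA in the ISR */
void fifo_put(unsigned char c) {
    uint8_t next = rxfifo.tail;
    rxfifo.buf[next] = c;
    next++;
    if (next >= RXBUF_SIZE) next -= RXBUF_SIZE;
    rxfifo.tail = next;
}

/* Consumer: returns -1 when the buffer is empty */
int fifo_get(void) {
    int c = -1;
    uint8_t next = rxfifo.head;
    if (next != rxfifo.tail) {
        c = rxfifo.buf[next];
        next++;
        if (next >= RXBUF_SIZE) next -= RXBUF_SIZE;
        rxfifo.head = next;
    }
    return c;
}
```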
Certainly whenever data is shared between ISR's and mainloop code, or different threads, then you need to think about how to make sure data is synchronised and exchanged. "volatile" is one method, atomics are another, and memory barriers can be used.
That is incorrect in two ways. One - barring compiler bugs (which do occur, but they are very rare compared to user bugs), there is no such thing as "optimising badly". If optimising changes the behaviour of the code, other than its size and speed, the code is wrong. Two - it is a very bad idea to imagine that having code inside a function somehow "protects" it from re-ordering or other optimisation.
Functions can be inlined, outlined, cloned, and shuffled about. Link-time optimisation, code in headers, C++ modules, and other cross-unit optimisations are becoming more and more common. So while it might be true /today/ that the compiler has no alternative but to read rxfifo.in once per call to uart_task(), you cannot assume that will be the case with later compilers or with more advanced optimisation techniques enabled. It is safer, more portable, and more future-proof to avoid such assumptions.
You are absolutely correct.
It is not an unreasonable re-arrangement. On processors with out-of-order execution (which does not apply to the AVR or Cortex-M), compilers will often push loads as early as they can in the instruction stream so that they start the cache loading process as quickly as possible. (But note that on such "big" processors, much of this discussion on volatile and memory barriers is not sufficient, especially if there is more than one core. You need atomics and fences, but that's a story for another day.)
The important thing about "volatile" is that it is /accesses/ that are volatile, not objects. A volatile object is nothing more than an object for which all accesses are volatile by default. But you can use volatile accesses on non-volatile objects. This macro is your friend:
#define volatileAccess(v) *((volatile typeof((v)) *) &(v))
(Linux has the same macro, called ACCESS_ONCE. It uses a gcc extension - if you are using other compilers then you can make an uglier equivalent using _Generic. However, if you are using a C compiler that supports C11, it is probably gcc or clang, and you can use the "typeof" extension.)
That macro will let you make a volatile read or write to an object without requiring that /all/ accesses to it are volatile.
Note that you are forcing the compiler to read "out" twice here, as it can't keep the value of "out" in a register across the memory barrier. (And as I mentioned before, the compiler might be able to do larger scale optimisation across compilation units or functions, and in that way keep values across multiple calls to uart_task.)
Memory barriers are certainly useful, but they are a shotgun approach - they affect /everything/ involving reads and writes to memory. (But remember they don't affect ordering of calculations.)
Critical sections for something like this are /way/ overkill. And a critical section with a division in the middle? Not a good idea.
Marking "in" and "buf" as volatile is /far/ better than using a critical section, and likely to be more efficient than a memory barrier. You can also use volatileAccess rather than making buf volatile, and it is often slightly more efficient to cache volatile variables in a local variable while working with them.
Reply to
David Brown
Yes, RXBUF_SIZE is a power of two.
This isn't the point of this thread, anyway... You insist that tail is always in the range [0...RXBUF_SIZE - 1]. My approach is different.
RXBUF_SIZE is a power of two, usually.
Sure, with a good value for RXBUF_SIZE, buffer overflow should never happen. Anyway, if it happens, the higher-level layers (protocol) should detect a corrupted packet.
Yes, of course, but I don't think the absence of volatile for rxfifo.in, even if it can change in the ISR, could be a *real* problem with *modern and current* compilers.
The volatile attribute is needed to avoid compiler optimizations (which would be a bad thing, because of the volatile nature of the variable), but in that code it's difficult to think of an optimization, caused by the absence of volatile, that changes the behaviour erroneously... except reordering.
I didn't say this; on the contrary, I was thinking exactly of reordering issues.
Ok, you are talking of future scenarios. I don't think this could actually be a real problem today. Anyway, your observation makes sense.
This is a good point. The code in the ISR can't be interrupted, so there's no need for volatile accesses inside the ISR.
Yes, you're right. A small penalty to avoid the problem of reordering.
Yes, I think so too. Lately I have read many experts say volatile is often a bad thing, so I'm re-thinking its use compared with other approaches.
Reply to
pozz
Disable interrupts while accessing the FIFO. You really have to. Alternatively, you'll often get away with not using a FIFO at all, unless you're blocking for a long while in some part of the code.
Reply to
Johann Klammer
Il 23/10/2021 18:09, David Brown ha scritto: [...]
I think I got your point, but I'm wondering why there are plenty of examples of ring-buffer implementations that don't use volatile at all, even if the author explicitly refers to interrupts and multithreading.
Just an example[1] by Quantum Leaps. It promises to be a *lock-free* (I think thread-safe) ring-buffer implementation in the single-producer/single-consumer scenario (which is my scenario too).
In the source code there's no use of volatile. I could call RingBuf_put() in my rx uart ISR and call RingBuf_get() in my mainloop code.
From what I learned from you, this code usually works, but the standard doesn't guarantee it will work with every old, current, and future compiler.
[1]
formatting link
Reply to
pozz
You don't have to use "volatile". You can make correct code here using critical sections - it's just a lot less efficient. (If you have a queue where more than one context can be reading it or writing it, then you /do/ need some kind of locking mechanism.)
You can also use memory barriers instead of volatile, but it is likely to be slightly less efficient.
You can also use atomics instead of volatiles, but it is also quite likely to be slightly less efficient. If you have an SMP system, on the other hand, then you need something more than volatile and compiler memory barriers - atomics are quite possibly the most efficient solution in that case.
And sometimes you can make code that doesn't need any special treatment at all, because you know the way it is being called. If the two ends of your buffer are handled by tasks in a cooperative multi-tasking scenario, then there is no problem - you don't need to worry about volatile or any alternatives. If you know your interrupt can't occur while the other end of the buffer is being handled, that can reduce your need for volatile. (In particular, that can also avoid complications if you have counter variables that are bigger than the processor can handle atomically - usually not a problem for a 32-bit Cortex-M, but often important on an 8-bit AVR.)
If you know, for a fact, that the code will be compiled by a weak compiler or with weak optimisation, or that the "get" and "put" implementations will always be in a separately compiled unit from the code calling these functions and you'll never use any kind of cross-unit optimisation, then you can often get away without using volatile.
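The atomics alternative mentioned above can be sketched with C11
<stdatomic.h>. This is a hypothetical single-producer/single-consumer
FIFO, not code from the thread: the release store on the producer's
index and the acquire load on the consumer's side order the data write
before the index update, with no critical section:

```c
#include <stdatomic.h>
#include <stdint.h>

#define RXBUF_SIZE 16 /* power of two */

static unsigned char buf[RXBUF_SIZE];
static _Atomic uint8_t a_in;  /* written only by the producer */
static _Atomic uint8_t a_out; /* written only by the consumer */

/* Producer side (the ISR in the real driver) */
void fifo_put(unsigned char c) {
    uint8_t i = atomic_load_explicit(&a_in, memory_order_relaxed);
    buf[i % RXBUF_SIZE] = c;
    /* release: the data store above becomes visible first */
    atomic_store_explicit(&a_in, (uint8_t)(i + 1), memory_order_release);
}

/* Consumer side (the mainloop); returns -1 when empty */
int fifo_get(void) {
    uint8_t o = atomic_load_explicit(&a_out, memory_order_relaxed);
    /* acquire: pairs with the producer's release store */
    if (o == atomic_load_explicit(&a_in, memory_order_acquire))
        return -1;
    int c = buf[o % RXBUF_SIZE];
    atomic_store_explicit(&a_out, (uint8_t)(o + 1), memory_order_relaxed);
    return c;
}
```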
It's lock-free, but not safe in the face of modern optimisation (gcc has had LTO for many years, and a lot of high-end commercial embedded compilers have used such techniques for decades). And I'd want to study it in detail and think a lot before accepting that it is safe to use its 16-bit counters on an 8-bit AVR. That could be fixed by just changing the definition of the RingBufCtr type, which is a nice feature in the code.
You don't want to call functions from an ISR if you can avoid it, unless the functions are defined in the same unit and can be inlined. On many processors (less so on the Cortex-M) calling an external function from an ISR means a lot of overhead to save and restore the so-called "volatile" registers (no relation to the C keyword "volatile"), usually completely unnecessarily.
Yes, that's a fair summary.
It might be good enough for some purposes. But since "volatile" will cost nothing in code efficiency but greatly increase the portability and safety of the code, I'd recommend using it. And I am certainly in favour of thinking carefully about these things - as you did in the first place, which is why we have this thread.
Reply to
David Brown
If you read carefully what he wrote you would know that he does. The trick he uses is that his indices may point outside the buffer: empty is equal indices, full is a difference equal to the buffer size. Of course his approach has its own limitations, like the buffer size being a power of 2, and with 8-bit indices the maximal buffer size is 128.
--
                              Waldek Hebisch
Reply to
antispam
The issue with not using 'volatile' (or some similar memory barrier) is that without it, the implementation is allowed to delay the actual write of the results into the variable.
If optimization is limited to just within a single translation unit, you can force it to work by having the execution path leave the translation unit, but with whole program optimization, it is theoretically possible that the implementation sees that the thread of execution NEVER needs it to be spilled out of the registers to memory, so the ISR will never see the change.
Reply to
Richard Damon
AFAIK the OP considers this not a problem in his application. Of course, if such changes were a problem he would need to add a test preventing writing to a full buffer (he already has a test preventing reading from an empty buffer).
Empty buffer.
Does not matter.
The same as has been stored. The point is that received is always bigger than or equal to removed and does not exceed removed by more than 128. So you can exactly recover the difference between received and removed.
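The recovery of that difference can be sketched like this (the helper
name is hypothetical): unsigned subtraction of two free-running 8-bit
indices is exact modulo 256, and the result is unambiguous as long as
the buffer never holds more than the buffer size, which with a
power-of-two size and 8-bit indices caps the size at 128 (a size of
256 would make "full" and "empty" both wrap to a difference of 0):

```c
#include <stdint.h>

/* Number of elements currently in the FIFO, given free-running
   8-bit producer (in) and consumer (out) indices. The cast makes
   the wrap-around arithmetic explicit: the result is (in - out)
   modulo 256. */
uint8_t fifo_count(uint8_t in, uint8_t out) {
    return (uint8_t)(in - out);
}
```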
Well, personally I would avoid storing to a full buffer. And even on a small MCU it is not clear to me if his "savings" are worth it. But his core design is sound.
Concerning other developers, I always work on the assumption that code is "as is" and any claims about what it does are of limited value unless there is a convincing argument (a proof or an outline of a proof) of what it does. The fact that code worked well in past system(s) is rather unconvincing. I have seen small (few-line) pieces of code that contained multiple bugs. And that code was in "production" use for several years and passed its tests.
Certainly code like FIFOs, where there are multiple trade-offs and the actual code tends to be relatively small, deserves examination before re-use.
--
                              Waldek Hebisch
Reply to
antispam