How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on

Even though I have been writing software for embedded systems for more than
10 years, there's a topic that from time to time makes me think for hours
and leaves me with many doubts.

Consider a simple embedded system based on an MCU (AVR8 or Cortex-Mx).
The software is bare metal, without any OS. The main pattern is the well
known mainloop (background code) that is interrupted by ISRs.

Interrupts are used mainly for timing and for low-level drivers. For
example, the UART reception ISR moves the last received char into a FIFO
buffer, while the mainloop code pops new data from the FIFO.


static struct {
   unsigned char buf[RXBUF_SIZE];
   uint8_t in;
   uint8_t out;
} rxfifo;

/* ISR */
void uart_rx_isr(void) {
   unsigned char c = UART->DATA;
   rxfifo.buf[rxfifo.in % RXBUF_SIZE] = c;
   rxfifo.in++;
   // Reset interrupt flag
}

/* Called regularly from mainloop code */
int uart_task(void) {
   int c = -1;
   if (rxfifo.out != rxfifo.in) {
     c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
     rxfifo.out++;
   }
   return -1;
}


From a 20-year-old article[1] by Nigel Jones, this seems a situation
where volatile must be used for rxfifo.in, because it is modified by an ISR
and used in the mainloop code.

I don't think so: rxfifo.in is read from memory only once in
uart_task(), so there isn't a risk that the compiler optimizes badly.
Even if the ISR fires immediately after the if statement, this doesn't
lead to a dangerous state: the just-received data will be processed at
the next call to uart_task().

So IMHO volatile isn't necessary here. And critical sections (i.e.
disabling interrupts) aren't useful either.

However I'm thinking about memory barriers. Suppose the compiler reorders
the instructions in uart_task() as follows:


   c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
   if (rxfifo.out != rxfifo.in) {
     rxfifo.out++;
     return c;
   } else {
     return -1;
   }


Here there's a big problem, because the compiler decided to read
rxfifo.buf[] first and test in/out equality afterwards. If the ISR fires
immediately after the data is moved to c (most probably an internal
register), the condition in the if statement will be true and the register
value is returned. However the register value isn't correct: the load
happened before the ISR stored the new character.

I don't think any modern C compiler reorders uart_task() in this way, but
we can't be sure. The observable result (for a single thread) doesn't
change, so the compiler is allowed to do this kind of thing.

How can I fix this issue if I want to be extremely sure the compiler will
not reorder this way? Applying volatile to rxfifo.in alone shouldn't help,
because the compiler is still allowed to reorder accesses to non-volatile
variables around volatile accesses[2].

One solution is adding a memory barrier in this way:


int uart_task(void) {
   int c = -1;
   if (rxfifo.out != rxfifo.in) {
     memory_barrier();
     c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
     rxfifo.out++;
   }
   return -1;
}
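
Here memory_barrier() is assumed to be a pure compiler barrier. As a
sketch, assuming GCC or Clang, it is usually written as an empty asm
statement with a "memory" clobber:

/* Compiler-only barrier: forbids the compiler from moving memory
 * accesses across this point. It emits no instructions, so it does
 * not order the hardware -- fine for a single-core MCU. */
#define memory_barrier()   __asm__ __volatile__("" ::: "memory")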


However this approach appears dangerous to me. You have to check and
double-check if, when and where memory barriers are necessary, and it's
easy to skip a barrier where it's needed or add a barrier where it
isn't needed.

So I'm thinking that a sub-optimal (regarding efficiency) but reliable
(regarding the risk of skipping a barrier where it is needed) approach
could be to enter a critical section (disabling interrupts) anyway,
even where it isn't strictly needed.


int uart_task(void) {
   ENTER_CRITICAL_SECTION();
   int c = -1;
   if (rxfifo.out != rxfifo.in) {
     c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
     rxfifo.out++;
   }
   EXIT_CRITICAL_SECTION();
   return -1;
}
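
ENTER_CRITICAL_SECTION()/EXIT_CRITICAL_SECTION() are left undefined
above. A minimal sketch for a Cortex-M with CMSIS (assumptions: both
macros are used in the same scope, and nesting is not required):

/* Save PRIMASK so the section is safe even if interrupts
 * were already disabled when we entered. */
#define ENTER_CRITICAL_SECTION() \
    uint32_t primask_save = __get_PRIMASK(); \
    __disable_irq()
#define EXIT_CRITICAL_SECTION() \
    __set_PRIMASK(primask_save)  /* restore previous interrupt state */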


Another solution could be to apply the volatile keyword to rxfifo.in *AND*
rxfifo.buf too, so the compiler can't change the order of accesses to them.

Do you have other suggestions?



[1] https://barrgroup.com/embedded-systems/how-to/c-volatile-keyword
[2] https://blog.regehr.org/archives/28

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on
On 23/10/21 9:07 am, pozz wrote:

This is a good introduction to how Linux makes this possible for its  
horde of device-driver authors:

<https://www.kernel.org/doc/Documentation/memory-barriers.txt>


Clifford Heath

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on
On 10/22/2021 3:07 PM, pozz wrote:

Why?  And why a retval from uart_task -- if it is always "-1"?

This is a bug(s) waiting to happen.

How is RXBUF_SIZE defined?  How does it reflect the data rate (and,
thus, interrupt rate) as well as the maximum latency between "main
loop" accesses?  I.e., what happens when the buffer is *full* -- and,
thus, appears EMPTY?  What stops the "in" member from growing to the
maximum size of a uint8 -- and then wrapping?  How do you convey this
to the upper level code ("Hey, we just lost a whole RXBUF_SIZE of
characters so if the character stream doesn't make sense, that might
be a cause...")?  What if RXBUF_SIZE is relatively prime wrt uint8max?

When writing UART handlers, I fetch the received datum along with
the uart's flags and stuff *both* of those things in the FIFO.
If the FIFO would be full, I, instead, modify the flags of the
preceding datum to reflect this fact ("Some number of characters
have been lost AFTER this one...") and discard the current character.

I then signal an event and let a task waiting for that specific event
wake up and retrieve the contents of the FIFO (which may include more
than one character, at that time as characters can arrive after the
initial event has been signaled).

This lets me move the line discipline out of the ISR and still keep
the system "responsive".
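
(As an illustrative sketch -- the names here are made up, not from any
actual codebase -- each FIFO entry might look like:)

struct uart_entry {
    uint8_t data;    /* the received character */
    uint8_t flags;   /* hardware status (framing/parity/overrun) plus
                        a software "characters lost AFTER this one" bit */
};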

Figure out everything that you need to do before you start sorting out
how the compiler can "shaft" you...

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on
On 23/10/2021 07:09, Don Y wrote:

It was my mistake. The last instruction of uart_task() should be

   return c;

And maybe the name of uart_task() is not so good, it should be uart_rx().

Power of two.


An rx FIFO filled by interrupt is needed to face a burst (a packet?) of
incoming characters.

If the baudrate is 9600bps 8n1, an interrupt fires roughly every
10/9600 s, about 1 ms. If the maximum interval between two successive
uart_task() calls is 10 ms, a buffer of 10 bytes is sufficient, so
RXBUF_SIZE could be 16 or 32.


These are good questions, but I didn't want to discuss them here. Of
course the ISR is not complete, because before pushing a new byte we must
check if the FIFO is full. For example:

/* The difference in-out gives a correct result even after a
 * wrap-around of in only, thanks to unsigned arithmetic. */
#define RXFIFO_IS_FULL()   ((uint8_t)(rxfifo.in - rxfifo.out) >= RXBUF_SIZE)

void uart_rx_isr(void) {
    unsigned char c = UART->DATA;
    if (!RXFIFO_IS_FULL()) {
      rxfifo.buf[rxfifo.in % RXBUF_SIZE] = c;
      rxfifo.in++;
    } else {
      // FIFO is full, ignore the char
    }
    // Reset interrupt flag
}




As I wrote, this should work even after a wrap-around.


A FIFO-full event is extremely rare if I'm able to size the rx FIFO
correctly, i.e. for the worst case.
Anyway, I usually ignore incoming chars when the FIFO is full. The higher
level protocols are usually defined in such a way that the absence of
chars is detected, mostly thanks to a CRC.


Signal an event? A task waiting for a specific event? Maybe you are
thinking of a full RTOS. I was thinking of bare metal systems.


Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on
On 10/23/2021 1:12 PM, pozz wrote:

The point was its relationship to the actual code.

What GUARANTEES this in your system?  Folks often see things that "can't
happen" -- yet DID (THEY SAW IT!).  Your code/design should ensure that
"can't happen" REALLY /can't happen/.  It costs very little to explain
(commentary) WHY you don't have to check for X, Y or Z in your code.

[If the user's actions (or any outside agency) can affect operation,
then how can you guarantee that THEY "behave"?]

And, give that a very high degree of visibility so that when someone
decides they can increase the baudrate or add some sluggish task
to your "main loop" that this ASSUMPTION isn't silently violated.

My point is that you should flesh out your code before you start
thinking about what can go wrong.

E.g., if the ISR is the *only* entity to modify ".in" and always does so
with interrupts off, then it can do so without worrying about conflict
with something else -- if those other things always ensure they read it
atomically (if they read it just before or just after it has been modified
by the ISR, the value will still "work" -- they just may not realize, yet,
that there is an extra character in the buffer that they haven't yet seen).

Likewise, if the "task" is the only entity modifying ".out", then ensuring
that those modifications are atomic means the ISR can safely use any *single*
reference to it.

"Rare" and "impossible" are two entirely different scenarios.
It is extremely rare for a specific individual to win the lottery.
But, any individual *can* win it!

What if the CRC characters disappear?  Are you sure the front of one message
can't appear to match the ass end of another?

"Pozz is here."
"Don  is not here."

"Pozz is not here."

And that there is no value in knowing that one or more messages may have been
dropped?

You can implement as much or as little of an OS as you choose;
you're not stuck with "all or nothing".

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on
On 23/10/2021 00:07, pozz wrote:

It's nice to see a thread like this here - the group needs such discussions!

Unless you are sure that RXBUF_SIZE is a power of two, this is going to
be quite slow on an AVR.  Modulo means division, and while division by a
constant is usually optimised to a multiplication by the compiler, you
still have a multiply, a shift, and some compensation for it all being
done as signed integer arithmetic.

It's also wrong, for non-power of two sizes, since the wrapping of your
increment and your modulo RXBUF_SIZE get out of sync.

The usual choice is to track "head" and "tail", and use something like:

void uart_rx_isr(void) {
  unsigned char c = UART->DATA;
  // Reset interrupt flag
  uint8_t next = rxfifo.tail;
  rxfifo.buf[next] = c;
  next++;
  if (next >= RXBUF_SIZE) next -= RXBUF_SIZE;
  rxfifo.tail = next;
}


int uart_task(void) {
  int c = -1;
  uint8_t next = rxfifo.head;
  if (next != rxfifo.tail) {
      c = rxfifo.buf[next];
      next++;
      if (next >= RXBUF_SIZE) next -= RXBUF_SIZE;
      rxfifo.head = next;
  }
  return c;
}

These don't track buffer overflow at all - you need to call uart_task()
often enough to avoid that.

(I'm skipping volatiles so we don't get ahead of your point.)

Certainly whenever data is shared between ISRs and mainloop code, or
different threads, then you need to think about how to make sure data is
synchronised and exchanged.  "volatile" is one method, atomics are
another, and memory barriers can be used.
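
As an aside -- a sketch, not from the original post -- the same
single-producer/single-consumer queue can be written with C11
<stdatomic.h>, which expresses the required ordering directly:

#include <stdatomic.h>
#include <stdint.h>

static unsigned char buf[64];
static _Atomic uint8_t in, out;    /* free-running indices */

void put(unsigned char c) {        /* ISR (producer) */
    uint8_t i = atomic_load_explicit(&in, memory_order_relaxed);
    buf[i % 64] = c;
    /* release: the store to buf[] cannot move after this store */
    atomic_store_explicit(&in, (uint8_t)(i + 1), memory_order_release);
}

int get(void) {                    /* mainloop (consumer) */
    uint8_t o = atomic_load_explicit(&out, memory_order_relaxed);
    /* acquire: the read of buf[] cannot move before this load */
    if (o == atomic_load_explicit(&in, memory_order_acquire))
        return -1;
    int c = buf[o % 64];
    atomic_store_explicit(&out, (uint8_t)(o + 1), memory_order_relaxed);
    return c;
}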

That is incorrect in two ways.  One - barring compiler bugs (which do
occur, but they are very rare compared to user bugs), there is no such
thing as "optimising badly".  If optimising changes the behaviour of the
code, other than its size and speed, the code is wrong.  Two - it is a
very bad idea to imagine that having code inside a function somehow
"protects" it from re-ordering or other optimisation.

Functions can be inlined, outlined, cloned, and shuffled about.
Link-time optimisation, code in headers, C++ modules, and other
cross-unit optimisations are becoming more and more common.  So while it
might be true /today/ that the compiler has no alternative but to read
rxfifo.in once per call to uart_task(), you cannot assume that will be
the case with later compilers or with more advanced optimisation
techniques enabled.  It is safer, more portable, and more future-proof
to avoid such assumptions.

You are absolutely correct.

It is not an unreasonable re-arrangement.  On processors with
out-of-order execution (which does not apply to the AVR or Cortex-M),
compilers will often push loads as early as they can in the instruction
stream so that they start the cache loading process as quickly as
possible.  (But note that on such "big" processors, much of this
discussion on volatile and memory barriers is not sufficient, especially
if there is more than one core.  You need atomics and fences, but that's
a story for another day.)

The important thing about "volatile" is that it is /accesses/ that are
volatile, not objects.  A volatile object is nothing more than an object
for which all accesses are volatile by default.  But you can use
volatile accesses on non-volatile objects.  This macro is your friend:

#define volatileAccess(v) *((volatile typeof((v)) *) &(v))

(Linux has the same macro, called ACCESS_ONCE.  It uses a gcc extension
- if you are using other compilers then you can make an uglier
equivalent using _Generic.  However, if you are using a C compiler that
supports C11, it is probably gcc or clang, and you can use the "typeof"
extension.)

That macro will let you make a volatile read or write to an object
without requiring that /all/ accesses to it are volatile.
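
For illustration, in uart_task() one could then write:

   uint8_t in = volatileAccess(rxfifo.in);   /* forced re-read from memory */
   volatileAccess(rxfifo.out) = next;        /* forced write to memory */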


Note that you are forcing the compiler to read "out" twice here, as it
can't keep the value of "out" in a register across the memory barrier.
(And as I mentioned before, the compiler might be able to do larger
scale optimisation across compilation units or functions, and in that
way keep values across multiple calls to uart_task.)

Memory barriers are certainly useful, but they are a shotgun approach -
they affect /everything/ involving reads and writes to memory.  (But
remember they don't affect ordering of calculations.)

Critical sections for something like this are /way/ overkill.  And a
critical section with a division in the middle?  Not a good idea.

Marking "in" and "buf" as volatile is /far/ better than using a critical
section, and likely to be more efficient than a memory barrier.  You can
also use volatileAccess rather than making buf volatile, and it is often
slightly more efficient to cache volatile variables in a local variable
while working with them.

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on
On 23/10/2021 18:09, David Brown wrote:
Yes, RXBUF_SIZE is a power of two.



This isn't the point of this thread, anyway...
You insist that tail is always in the range [0...RXBUF_SIZE - 1]. My  
approach is different.

RXBUF_SIZE is a power of two, usually <256. head and tail are uint8_t
and *can* reach the maximum value of 255, even if RXBUF_SIZE is 128. All
works well.

Suppose rxfifo.in = rxfifo.out = 127; the FIFO is empty. When a new char is
received, it is saved into rxfifo.buf[127 % 128 = 127] and rxfifo.in is
increased to 128.
Now the mainloop detects the new char (in != out), reads the new char at
rxfifo.buf[127 % 128 = 127] and increases out, which becomes 128.

The next byte will be saved into rxfifo.buf[rxfifo.in % 128 = 128 % 128
= 0] and rxfifo.in will be 129. Again, the next byte will be saved to
rxfifo.buf[rxfifo.in % 128 = 129 % 128 = 1] and rxfifo.in will be 130.

When the mainloop tries to pop data from the fifo, it tests

    rxfifo.in(130) != rxfifo.out(128)

The test is true, so the code extracts chars from rxfifo.buf[out % 128],
that is rxfifo.buf[0]... and so on.

I hope that explanation is clear.
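
In short, the scheme is (volatile omitted here, as in the snippets above):

/* RXBUF_SIZE: power of two, at most 128 with uint8_t indices */
#define RXFIFO_COUNT()    ((uint8_t)(rxfifo.in - rxfifo.out))
#define RXFIFO_IS_EMPTY() (RXFIFO_COUNT() == 0)
#define RXFIFO_IS_FULL()  (RXFIFO_COUNT() == RXBUF_SIZE)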


Sure, with a good number for RXBUF_SIZE, buffer overflow should never
happen. Anyway, if it happens, the higher level layers (protocol)
should detect a corrupted packet.


Yes of course, but I don't think the absence of volatile for rxfifo.in,
even if it can change in the ISR, could be a *real* problem with *modern
and current* compilers.

The volatile attribute is needed to avoid compiler optimizations (that
would be a bad thing, because of the volatile nature of the variable), but
in that code it's difficult to think of an optimization, caused by the
absence of volatile, that changes the behaviour erroneously... except
reordering.


I didn't say this; on the contrary, I was thinking exactly of reordering
issues.


Ok, you are talking about future scenarios. I don't think this could
actually be a real problem. Anyway your observation makes sense.


This is a good point. The code in the ISR can't be interrupted, so there's
no need for volatile accesses in the ISR.


Yes, you're right. A small penalty to avoid the problem of reordering.


Yes, I think so too. Lately I've read many experts saying volatile is often
a bad thing, so I'm re-thinking its use compared with other approaches.


Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on
On 23/10/2021 18:09, David Brown wrote:
[...]

I think I got your point, but I'm wondering why there are plenty of
examples of ring-buffer implementations that don't use volatile at all,
even if the author explicitly refers to interrupts and multithreading.

Just an example[1] by Quantum Leaps. It promises to be a *lock-free* (I
think thread-safe) ring-buffer implementation in the scenario of a single
producer/single consumer (which is my scenario too).

In the source code there's no use of volatile. I could call
RingBuf_put() in my rx uart ISR and RingBuf_get() in my mainloop code.

From what I learned from you, this code usually works, but the standard
doesn't guarantee it will work with every old, current and future compiler.



[1] https://github.com/QuantumLeaps/lock-free-ring-buffer

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on
On 25/10/2021 20:15, pozz wrote:

You don't have to use "volatile".  You can make correct code here using
critical sections - it's just a lot less efficient.  (If you have a
queue where more than one context can be reading it or writing it, then
you /do/ need some kind of locking mechanism.)

You can also use memory barriers instead of volatile, but it is likely
to be slightly less efficient.

You can also use atomics instead of volatiles, but it is also quite
likely to be slightly less efficient.  If you have an SMP system, on the
other hand, then you need something more than volatile and compiler
memory barriers - atomics are quite possibly the most efficient solution
in that case.

And sometimes you can make code that doesn't need any special treatment
at all, because you know the way it is being called.  If the two ends of
your buffer are handled by tasks in a cooperative multi-tasking
scenario, then there is no problem - you don't need to worry about
volatile or any alternatives.  If you know your interrupt can't occur
while the other end of the buffer is being handled, that can reduce your
need for volatile.  (In particular, that can also avoid complications if
you have counter variables that are bigger than the processor can handle
atomically - usually not a problem for a 32-bit Cortex-M, but often
important on an 8-bit AVR.)
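
(To make the AVR point concrete -- a sketch using avr-libc, assuming
16-bit indices: a uint16_t is read in two 8-bit halves, so the mainloop
must keep the ISR from updating it mid-read:)

#include <util/atomic.h>

volatile uint16_t fifo_in;            /* updated by the ISR */

uint16_t read_fifo_in(void) {
    uint16_t copy;
    ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {   /* IRQs off for the 2-byte read */
        copy = fifo_in;
    }
    return copy;
}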

If you know, for a fact, that the code will be compiled by a weak
compiler or with weak optimisation, or that the "get" and "put"
implementations will always be in a separately compiled unit from code
calling these functions and you'll never use any kind of cross-unit
optimisations, then you can often get away without using volatile.

It's lock-free, but not safe in the face of modern optimisation (gcc has
had LTO for many years, and a lot of high-end commercial embedded
compilers have used such techniques for decades).  And I'd want to study
it in detail and think a lot before accepting that it is safe to use its
16-bit counters on an 8-bit AVR.  That could be fixed by just changing
the definition of the RingBufCtr type, which is a nice feature in the code.

You don't want to call functions from an ISR if you can avoid it, unless
the functions are defined in the same unit and can be inlined.  On many
processors (less so on the Cortex-M) calling an external function from
an ISR means a lot of overhead to save and restore the so-called
"volatile" registers (no relation to the C keyword "volatile"), usually
completely unnecessarily.

Yes, that's a fair summary.

It might be good enough for some purposes.  But since "volatile" will
cost nothing in code efficiency but greatly increase the portability and
safety of the code, I'd recommend using it.  And I am certainly in
favour of thinking carefully about these things - as you did in the
first place, which is why we have this thread.

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on
On 10/25/21 2:15 PM, pozz wrote:

The issue with not using 'volatile' (or some similar memory barrier) is
that without it, the implementation is allowed to delay the actual write
of the results into the variable.

If optimization is limited to just within a single translation unit, you
can force it to work by having the execution path leave the translation
unit, but with whole-program optimization it is theoretically possible
that the implementation sees that the thread of execution NEVER needs the
value to be spilled out of the registers to memory, so the ISR will never
see the change.
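
(The textbook illustration of this failure mode, assuming no volatile
and an optimiser that can see the whole program:)

/* Mainloop waits for the ISR to advance "in"... */
while (rxfifo.in == rxfifo.out) {
    /* ...but the compiler may read "in" once, keep it in a register,
     * and compile this into an infinite loop. */
}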

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on
On 10/23/2021 12:07 AM, pozz wrote:
Disable interrupts while accessing the fifo, you really have to.
Alternatively you'll often get away without using a fifo at all,
unless you're blocking for a long while in some part of the code.

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on

If you read carefully what he wrote you would know that he does.
The trick he uses is that his indices may point outside the buffer:
empty is equal indices, full is a difference equal to the buffer
size.  Of course his approach has its own limitations, like the
buffer size being a power of 2, and with 8-bit indices the maximal
buffer size is 128.

--  
                              Waldek Hebisch

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on

AFAIK the OP considers this not a problem in his application.
Of course, if such changes were a problem he would need to
add a test preventing writing to a full buffer (he already has
a test preventing reading from an empty buffer).

Empty buffer.


Does not matter.

The same as has been stored.  The point is that received is
always greater than or equal to removed and does not exceed
removed by more than 128.  So you can exactly recover the
difference between received and removed.
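
(A worked example of that recovery, with the unsigned wrap made
explicit -- suppose in has wrapped to 2 while out is still 250:)

uint8_t count = (uint8_t)(in - out);   /* (2 - 250) mod 256 = 8 pending */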

Well, personally I would avoid storing to a full buffer.  And
even on a small MCU it is not clear to me if his "savings"
are worth it.  But his core design is sound.

Concerning other developers, I always work on the assumption
that code is "as is" and any claim about what it is doing is of
limited value unless there is a convincing argument (proof
or outline of proof) of what it is doing.  The fact that code
worked well in past system(s) is rather unconvincing.
I have seen small (few lines) pieces of code that contained
multiple bugs.  And that code was in "production" use
for several years and passed its tests.

Certainly code like FIFOs, where there are multiple tradeoffs
and the actual code tends to be relatively small, deserves
examination before re-use.

--  
                              Waldek Hebisch

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on
On 10/26/2021 5:20 PM, snipped-for-privacy@math.uni.wroc.pl wrote:

And I don't think I have to test for division by zero -- as
*my* code is the code that is passing numerator and denominator
to that operator, right?

Can you remember all of the little assumptions you've made in
any non-trivial piece of code -- a week later?  a month later?
6 months later (when a bug manifests or a feature upgrade
is requested)?

Do not check the inputs of routines for validity -- assume everything is
correct (cuz YOU wrote it to be so, right?).

Do not handle error conditions -- because they can't exist (because
you wrote the code and feel confident that you've anticipated
every contingency -- including those for future upgrades).

Ignore compiler warnings -- surely you know better than a silly
"generic" program!

Would you hire someone who viewed your product's quality (and
your reputation) in this regard?

No, it means you can't sort out *if* there have been any characters
received, based solely on this fact (and, what other facts are there
to observe?)

Of course it does!  Something has happened that the code MIGHT have
detected in other circumstances (e.g., if uart_task had been invoked
more frequently).  The world has changed and the code doesn't know it.
Why write code that only *sometimes* works?

If it can wrap, then "some data" can look like "no data".
If "no data", then NOTHING has been received -- from the
viewpoint of the code.

Tell me what prevents 256 characters from being received
after .in (and .out) are initially 0 -- without any
indication of their presence.  What "limits" the difference
to "128"?  Do you see any conditionals in the code that
do so?  Is there some magic in the hardware that enforces
this?

This is how you end up with bugs in your code.  The sorts
of bugs that you can witness -- with your own eyes -- and
never reproduce (until the code has been released and
lots of customers' eyes witness it as well).

Ever worked on 100KLoC projects?  500KLoC?  Do you personally examine
the entire codebase before you get started?  Do you purchase source
licenses for every library that you rely upon in your design?
(or, do you just assume software vendors are infallible?)

How would you feel if a fellow worker told you "yeah, the previous
guy had a habit of cutting corners in his FIFO management code"?
Or, "the previous guy always assumed malloc would succeed and
didn't even build an infrastructure to address the possibility
of it failing"

You could, perhaps, grep(1) for "malloc" or "FIFO" and manually
examine those code fragments.  What about division operators?
Or, verifying that data types never overflow their limits?  Or...

It's not "FIFO code".  It's a UART driver.   Do you examine every piece
of code that might *contain* a FIFO?  How do you know that there *is* a FIFO
in a piece of code -- without manually inspecting it?  What if it is a
FIFO mechanism but not explicitly named as a FIFO?

One wants to be able to move towards the goal of software *components*.
You don't want to have to inspect the design of every *diode* that
you use; you want to look at its overall specifications and decide
if those fit your needs.

Unlikely that this code will describe itself as "works well enough
SOME of the time..."

And, when/if you stumble on such faults, good luck explaining to
your customer why it's going to take longer to fix and retest the
*existing* codebase before you can get on with your modifications...

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on

Well, I do not test for zero if I know that the divisor must be
nonzero.  To put it differently, having zero in such a place
is a bug and there is already enough machinery so that
such a bug will not remain undetected.  Having an extra test
adds no value.

OTOH if zero is possible, then handling it is part of the program
logic and a test is needed to take the correct action.

Well, my normal practice is that there are no "little assumptions".
To put it differently, code is structured to make things clear,
even if this requires more code than some "clever" solution.
There may be "big assumptions", that is, highly nontrivial facts
used by the code.  Some of them are considered "well known";
with proper naming in code it is easy to recall them years later.
Some deserve comments/references.

Quoted text here. Click to load it

Well, correct inputs are part of the contract.  Some things (like
array indices inside bounds) are checked, but in general you can
expect garbage if you pass incorrect input.  Most of my code is
of a sort where the called routine can not really check the validity
of the input (there are complex invariants).  Note: here I am talking
mostly about my non-embedded code (which is the majority of my coding).
In most of my coding I have a pretty comfortable situation: for
a human it is quite clear what is valid and what is invalid.
So the code makes a lot of effort to handle valid (but possibly quite
unusual) cases.  User input is normally checked to give a sensible
error message, but some things are deemed too tricky/expensive
to check.  Other routines are deemed "system level", and there
it is up to the user/caller to respect the contract.

My embedded code consists of rather small systems, and normally
there are no explicit validity checks.  To clarify: when the system
receives commands it recognizes and handles valid commands.
So there is an implicit check: anything not recognized as valid
is invalid.  OTOH frequently there is nothing to do in case
of errors: if there is no display to print an error message,
no persistent store to log the error, and shutting down is not helpful,
then what else would a potential error handler do?

I do not check if a 12-bit ADC really returns numbers in range.
My 'print_byte' routine takes an integer argument and blindly
truncates it to 8 bits without worrying about possible
spurious upper bits.  "Safety critical" folks may be worried
by such practice, but my embedded code is fairly non-critical.

Well, you do not know what the OP's code is doing.  I would prefer
my code to be robust and I feel that I am doing reasonably
well here.  OTOH, coming back to serial communication, it
is not hard to design a communication protocol such that in
normal operation there is no possibility of buffer
overflow.  It would still make sense to add a single line
to, say, drop excess characters.  But it does not make
sense to make a big story of the lack of this line.  In particular
the issue that the OP wanted to discuss is still valid.

Of course you can connect to the system and change values of variables
in a debugger, so specific values mean nothing.  I am telling
you what the protocol is.  If all parts of the system (including parts
that the OP skipped) obey the protocol, then you have the meaning above.
If something misbehaves (say a cosmic ray flipped a bit), it does
not mean that the protocol is incorrect.  Simply, _if_ the probability
of misbehaviour is too high you need to fix the system (add
radiation shielding, an appropriate seal to avoid tampering with
internals, extra checks inside, etc).  But what/if to fix
something is for the OP to decide.

All code works only sometimes.  Paraphrasing the famous answer to
Napoleon: first you need a processor.  There are a lot of
conditions for code to work as intended.  Granted, I would
not skip a needed check in real code.  But this is an obvious
thing to add.  You are somewhat painting the OP's code as "broken
beyond repair".  Well, as the discussion showed, the OP had a problem
using "volatile" and that IMHO is much more important to
fix.

That is the protocol.  How to avoid violation is a different
matter: dropping characters _may_ be a solution.  But dropping
characters means that some data is lost, and how to deal
with lost data is a different issue.  As is, the OP's code will lose
some old data.  It is the OP's problem to decide which failure
mode is more problematic and how many extra checks are
needed.

IME it is the issues that you can not predict that catch you.
The above is an obvious issue, and should not be a problem
(unless the designer is seriously incompetent and misjudged
what can happen).

Of course I do not read all code before starting.  But I accept
the risk that code may turn out to be faulty and I may be forced
to fix or abandon it.  My main project has 450K wc lines.
I know that parts are wrong and I am working on fixing that
(which will probably involve a substantial rewrite).  I worked
a little on gcc and I can tell you that the only sure thing in
such projects is that there are bugs.  Of course, despite the
bugs gcc is quite useful.  But I also met a Modula 2 compiler
that carefully checked programs for violations of language
rules, but miscompiled nested function calls.

Well, for several years I have worked exclusively with open source code.
I see a lot of defects.  While my experience with commercial code
is limited, I do not think that commercial code has fewer defects
than open source.  In fact, there are reasons to suspect
that there are more defects in commercial code.

Well, there is a lot of bad code.  Sometimes the best solution is simply
to throw it out.  In other cases (likely in your malloc scenario above)
there may be a simple workaround (replace malloc by a checking version).

Yes, that is one of the possible approaches.

I have a C parser.  In desperation I could try to search the parse
tree or transform the program.  Or, more likely, decide that the program
is broken beyond repair.

Well, one thing is to look at the structure of the program.  Code may
look complicated, but some programs are reasonably testable:
a few random inputs can give some confidence that the "main"
execution path computes correct values.  Then you look at whether
you can hit limits.  Actually, much of my coding is in
arbitrary precision, so overflow is impossible.  Instead the
program may run out of memory.  But some parts, for speed,
use fixed precision.  If I correctly computed the limits,
overflow is impossible.  But this is a big if.

Sure, I would love to see really reusable components.  But IMHO we
are quite far from that.  There are some things which are reusable
if you accept modest to severe overhead.  For example things tend
to compose nicely if you dynamically allocate everything and use
garbage collection.  But the performance cost may be substantial.
And in an embedded setting garbage collection may be unacceptable.
In some cases I have found out that I can get much better
speed joining things that could be done as a composition of library
operations into a single big routine.  In other cases I fixed
bugs by replacing a composition of library routines by a single
routine: there were interactions making the simple composition
incorrect.  The correct alternative was a single routine.

As I wrote, my embedded programs are simple and small.  But I
use almost no external libraries.  Trying some existing libraries
I have found out that some produce rather large programs, linking
in a lot of unneeded stuff.  Of course, writing from scratch
will not scale to bigger programs.  OTOH, I feel that with
proper tooling it would be possible to retain efficiency and
small code size at least for a large class of microcontroller
programs (but existing tools and libraries do not support this).

Commercial vendors like to say how good their programs are.  But
the market reality is that a program may be quite bad and still sell.

--  
                              Waldek Hebisch

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on
On 10/26/2021 10:22 PM, snipped-for-privacy@math.uni.wroc.pl wrote:

Do you use the standard libraries?  Aren't THEY components?
You rely on the compiler to decide how to divide X by Y -- instead
of writing your own division routine.  How often do you reimplement
?printf() to avoid all of the bloat that typically accompanies it?
(when was the last time you needed ALL of those format specifiers
in an application?  And modifiers?)

What you need is components with varying characteristics.
You can buy diodes with all sorts of current carrying capacities,
PIVs, package styles, etc.  But, they all still perform the
same function.  Why so many different part numbers?  Why not
just use the biggest, baddest diode in ALL your circuits?

I.e., we readily accept differences in "standard components"
in other disciplines; why not when it comes to software
modules?

Sure, but now you're tuning a solution to a specific problem.
I've designed custom chips to solve particular problems.
But, they ONLY solve those particular problems!  OTOH,
I use lots of OTC components in my designs because those have
been designed (for the most part) with an eye towards
meeting a variety of market needs.

Because they try to address a variety of solution spaces without
trying to be "optimal" for any.  You trade flexibility/capability
for speed/performance/etc.

Templates are an attempt in this direction.  Allowing a class of
problems to be solved once and then tailored to the specific
application.

But, personal experience is where you win the most.  You write
your second or third UART driver and start realizing that you
could leverage a previous design if you'd just thought it out
more fully -- instead of tailoring it to the specific needs
of the original application.

And, as you EXPECT to be reusing it in other applications (as
evidenced by the fact that it's your third time writing the same
piece of code!), you anticipate what those *might* need and
think about how to implement those features "economically".

It's rare that an application is *so* constrained that it can't
afford a couple of extra lines of code, here and there.  If
you've considered efficiency in the design of your algorithms,
then these little bits of inefficiency will be below the noise floor.

The same is true of FOSS -- despite the claim that many eyes (may)
have looked at it (suggesting that bugs would have been caught!)

From "KLEE: Unassisted and Automatic Generation of High-Coverage
Tests for Complex Systems Programs":

     KLEE finds important errors in heavily-tested code. It
     found ten fatal errors in COREUTILS (including three
     that had escaped detection for 15 years), which account
     for more crashing bugs than were reported in 2006, 2007
     and 2008 combined. It further found 24 bugs in BUSYBOX, 21
     bugs in MINIX, and a security vulnerability in HISTAR -- a
     total of 56 serious bugs.

Ooops!  I wonder how many FOSS *eyes* missed those errors?

Every time you reinvent a solution, you lose much of the benefit
of the previous TESTED solution.

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on

Yes, I use libraries when appropriate.

Well, some folks expect more from components than from
traditional libraries.  Some even claim to deliver.
However, libraries have limitations and ATM I see nothing
that fundamentally changes the situation.

Well, normally in C code I rely on compiler-provided division.
To tell the truth, my MCU code uses division sparingly, only
when I can not avoid it.  OTOH I also use languages with
multiprecision integers.  In one case I use compiler-provided
routines, but I am the provider of a modified compiler and the
modification includes a replacement of the division routine.  In
another case I override the compiler-supplied division routine with
my own (which in turn sends the real work to an external library).

I did that once (for an OS kernel where the standard library would not
work).  If needed I can reuse it.  On PCs I am not worried by
bloat due to printf.  OTOH, on MCUs I am not sure if I ever used
printf.  Rather, printing was done by specialized routines,
either library-provided or my own.

I have heard such electronic analogies many times.  But they miss an
important point: there is no way for me to make my own diode,
I am stuck with what is available on the market.  And a diode
is logically a pretty simple component, yet we need many kinds.

Well, software is _much_ more complicated than physical
engineering artifacts.  A physical thing may have 10000 joints,
but if the joints are identical, then this is the moral equivalent of
a simple loop that just iterates a fixed number of times.
At the software level the number of possible pre-composed blocks
is so large that it is infeasible to deliver all of them.
The classic trick is to parametrize.  However even if you
parametrize there are hundreds of design decisions going
into a relatively small piece of code.  If you expose all the
design decisions then the user may as well write his/her own
code, because the complexity will be similar.  So normally
parametrization is limited and there will be users who
find hardcoded design choices inadequate.

Another thing is that current tools are rather weak
at supporting parametrization.

Maybe I made a wrong impression; I think some explanation is in
place here.  I am trying to make my code reusable.  For my
problems performance is an important part of reusability: our
capability to solve problems is limited by performance and with
better performance users can solve bigger problems.  I am
re-using code where I can and I would re-use more if I could,
but there are technical obstacles.  Also, while I am
trying to make my code reusable, there are intrusive
design decisions which may interfere with your possibility
and willingness to re-use.

In a slightly different spirit: in another thread you wrote
about accessing the disc without the OS file cache.  Here I
normally depend on the OS, and OS file caching is a big thing.
It is not perfect, but the OS (OK, at least Linux) is doing
this reasonably well and I have no temptation to avoid it.
And I appreciate that with the OS cache performance is
usually much better than it would be "without cache".
OTOH, I routinely avoid stdio for I/O critical things
(so no printf in I/O critical code).

I think that this is more subtle: libraries frequently force some
way of doing things.  Which may be good if you are trying to quickly
roll a solution and are within the capabilities of the library.  But
if you need/want a different design, then the library may be too
inflexible to deliver it.

Yes, templates could help.  But they also have problems.  One
of them is that (among others) I would like to target the STM8
and I have no C++ compiler for the STM8.  My idea is to create a
custom "optimizer/generator" for (annotated) C code.
ATM it is vapourware, but I think it is feasible with
reasonable effort.

Well, I am not talking about a "couple of extra lines".  Rather
about IMO substantial fixed overhead.  As I wrote, one of my
targets is an STM8 with 8k flash, another is an MSP430 with 16k flash,
another is an STM32 with 16k flash (there are also bigger targets).
One of the libraries/frameworks for the STM32, after activating a few
features, pulled in about 16k of code; this is substantial overhead given
how few features I needed.  Other folks reported that for
trivial programs vendor-supplied frameworks pulled in close to 30k of
code.  That may be fine if you have a bigger device and need the features,
but for smaller MCUs it may be the difference between not fitting into
the device or (without the library) having plenty of free space.

When I tried it, FreeRTOS for the STM32 needed about 8k of flash.  Which
is fine if you need an RTOS.  But ATM my designs run without an RTOS.

I have found libopencm3 to have small overhead.  But its routines
do so little that direct register access may give simpler
code.

Open source folks tend to be more willing to talk about bugs.
And the above nicely shows that there are a lot of bugs, most
waiting to be discovered.

The TESTED part works for simple repeatable tasks.  But if you have
a complex task it is quite likely that you will be the first
person with a given use case.  gcc is a borderline case: if you
throw really new code at it you can expect to see bugs.
The gcc user community is large and there is a reasonable chance that
somebody wrote earlier code which is sufficiently similar to
yours to catch troubles.  But there are domains that are at
least as complicated as compilation and have much smaller
user communities.  You may find out that there is _no_ code
that could be reasonably re-used.  Were you ever in a situation
where you looked at how some "standard library" solves a tricky
problem and realized that in fact the library does not solve
the problem?

--  
                              Waldek Hebisch

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on
On 10/31/2021 3:54 PM, snipped-for-privacy@math.uni.wroc.pl wrote:

A component is something that you can use as a black box,
without having to reinvent it.  It is the epitome of reuse.

You can also create a ?printf() that you can configure at build time to
support the modifiers and specifiers that you know you will need.

Just like you can configure a UART driver to support a FIFO size defined
at configuration time, hardware handshaking, software flow control, the
high and low water marks for each of those (as they can be different),
the character to send to request the remote to stop transmitting,
the character you send to request resumption of transmission, which
character YOU will recognize as requesting your Tx channel to pause,
the character (or condition) you will recognize to resume your Tx,
whether or not you will sample the condition codes in the UART, how
you read/write the data register, how you read/write the status register,
etc.

While these sound like lots of options, they are all relatively
trivial additions to the code.
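
(A sketch of how such options might be exposed at build time -- the
names are purely illustrative:)

/* uart_config.h -- build-time options for the UART driver */
#define UART_FIFO_SIZE      64
#define UART_USE_RTS_CTS    1    /* hardware handshaking */
#define UART_USE_XON_XOFF   0    /* software flow control */
#define UART_XOFF_HIWATER   48   /* send XOFF at/above this fill level */
#define UART_XON_LOWATER    16   /* send XON at/below this fill level */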

Sure there is!  It is just not an efficient way of spending your
resources when you have so many OTS offerings available.

You can design your own processor.  Why do you "settle" for an
OTS device (ANS:  because there is so little extra added value
you will typically gain from rolling your own vs. the "inefficiency"
of using a COTS offering)

This is the argument in favor of components.  You'd much rather
read a comprehensive specification ("datasheet") for a software
component than have to read through all of the code that implements
it.  What if it was implemented in some programming language in
which you aren't expert?  What if it was a binary "BLOB" and
couldn't be inspected?

You don't have to deliver all of them.  When you wire a circuit,
you still have to *solder* connections, don't you?  The
components don't magically glue themselves together...

Look at a fleshy UART driver and think about how you would decompose
it into N different variants that could be "compile time configurable".
You'll be surprised as to how easy it is.  Even if the actual UART
hardware differs from instance to instance.

If you don't know where the design is headed, then you can't
pick the components that it will need.

I approach a design from the top (down) and bottom (up).  This
lets me gauge the types of information that I *may* have
available from the hardware -- so I can sort out how to
approach those limitations from above.  E.g., if I can't
control the data rate of a comm channel, then I either have
to ensure I can catch every (complete) message *or* design a
protocol that lets me detect when I've missed something.

There are costs to both approaches.  If I dedicate resource to
ensuring I don't miss anything, then some other aspect of the
design will bear that cost.  If I rely on detecting missed
messages, then I have to put a figure on their relative
likelihood so my device doesn't fail to provide its desired
functionality (because it is always missing one or two characters
out of EVERY message -- and, thus, sees NO messages).

My point about the cache was that it is of no value in my case;
I'm not going to revisit a file once I've seen it the first
time (so why hold onto that data?)

Use a different diode.

A "framework" is considerably more than a set of individually
selectable components.  I've designed products with 2KB of code and
128 bytes of RAM.  The "components" were ASM modules instead of
HLL modules.  Each told me how big it was, how much RAM it required,
how deep the stack penetration when invoked, how many T-states
(worst case) to execute, etc.

So, before I designed the hardware, I knew what I would need
by way of ROM/RAM (before the days of FLASH) and could commit
the hardware to foil without fear of running out of "space" or
"time".

Sure.  But a component will have a datasheet that tells you what
it provides and at what *cost*.

RTOS is a commonly misused term.  Many are more properly called
MTOSs (they provide no real timeliness guarantees, just multitasking
primitives).

IMO, the advantages of writing in a multitasking environment so
far outweigh the "costs" of an MTOS that it behooves one to consider
how to shoehorn that functionality into EVERY design.

When writing in a HLL, there are complications that impose
constraints on how the MTOS provides its services.  But, for small
projects written in ASM, you can gain the benefits of an MTOS
for very few bytes of code (and effectively zero RAM).

Part of the problem is ownership of the codebase.  You are
more likely to know where your own bugs lie -- and, more
willing to fix them ("pride of ownership").  When a piece
of code is shared, over time, there seems to be less incentive
for folks to tackle big -- often dubious -- issues as the
"reward" is minimal (i.e., you may not own the code when the bug
eventually becomes a problem)

As I said, your *personal* experience tells you where YOU will
likely benefit.  I did a stint with a company that manufactured
telecommunications kit.  We had all sorts of bizarre interface
protocols with which we had to contend (e.g., using RLSD as
a hardware "pacing" signal).  So, it was worthwhile to spend
time developing a robust UART driver (and handler, above it)
as you *knew* the next project would likely have need of it,
in some form or other.

If you're working free-lance and client A needs a BITBLTer
for his design, you have to decide how likely client B
(that you haven't yet met) will be to need the same sort
of module/component.

For example, I've never (until recently) needed to interface
to a disk controller in a product.  So, I don't have a
ready-made "component" in my bag-of-tricks.  When I look
at a new project, I "take inventory" of what I am likely to
need... and compare that to what I know I have "in stock".
If there's a lot of overlap, then my confidence in my bid
goes up.  If there's a lot of new ground that I'll have to
cover, then it goes down (and the price goes up!).

Reuse helps you better estimate new projects, especially as
projects grow in complexity.

[There's nothing worse than having to upgrade someone else's
design that didn't plan for the future.  It's as if you
have to redesign the entire product from scratch --- despite
the fact that it *seems* to work, "as is" (but, not "as desired"!)]

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on

Well, if there is a simple-to-use component that performs what
you need, then using it is fine.  However, for many tasks,
once a component is flexible enough to cover both your and
my needs, its specification may be longer and more tricky
than the code doing the task at hand.

There are many reasons why existing code can not be reused.
Concerning BLOBs, I am trying to avoid them and in a first-order
approximation I am not using them.  One (serious IMO)
problem with BLOBs is that sooner or later they will be
incompatible with other things (OS/other libraries/my code).
Very old source code can usually be run on modern systems
with modest effort.  BLOBs normally would require much
more effort.

Yes, one needs to make connections.  In fact, in programming
most work is "making connections".  So you want something
which is simple to connect.  In other words, you want all
parts of your design to play nicely together.  With code
delivered by other folks that is not always the case.

UARTs are simple.  And yet some things are tricky: in C, to have a
"compile time configurable" buffer size you need to use macros.
It works, but in a sense the UART implementation "leaks" into user code.
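
(The usual idiom -- the user overrides the default before including
the header, which is exactly the "leak" in question:)

/* uart.h */
#ifndef UART_RXBUF_SIZE
#define UART_RXBUF_SIZE 64   /* user may #define this before including */
#endif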

Well, there are routine tasks; for them it is natural to
re-use existing code.  There are new tasks that are "almost"
routine; for those one can come up with a good design at the start.
But in a sense the "interesting" tasks are the ones where at the start
you have only limited understanding.  In such cases it is hard
to know "where the design is headed", except that it is
likely to change.  Of course, the customer may be dissatisfied
if you tell him "I will look at the problem and maybe I will
find a solution".  But lack of understanding is normal
in research (at the starting point), and I think that software
houses also take on risky projects hoping that a big win on the
successful ones will cover the losses on the failures.

Well, with a UART there will be some fixed transmission rate
(with the wrong clock frequency the UART would be unable to receive
anything).  I would expect the MCU to be able to receive all
incoming characters (OK, assuming a hardware UART with a driver
using a high-priority interrupt).  So, detecting that you got too
much should not be too hard.  OTOH, sensibly handling
excess input is a different issue: if characters are coming
faster than you can process them, then either your CPU is
underpowered or there is some failure causing excess transmission.
In either case the specific application will dictate what
should be done.
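
For instance (a sketch in the style of the FIFO from the start of
the thread; the names here are invented and RXBUF_SIZE is assumed
to be a power of two), the receive ISR can count what it had to
drop instead of silently overwriting:

    volatile uint16_t rx_dropped;    /* mainloop can see we lost input */

    void uart_rx_isr(void) {
        unsigned char c = UART->DATA;
        if ((uint8_t)(rx_in - rx_out) < RXBUF_SIZE) {
            rx_buf[rx_in % RXBUF_SIZE] = c;
            rx_in++;
        } else {
            rx_dropped++;            /* FIFO full: record the loss */
        }
    }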

My thinking goes toward using relatively short messages and
a buffer big enough for two messages.  If there is a need for
high speed I would go for continuous messages and DMA
transfers (using the break interrupt to discover the end of a
message in the case of variable-length messages).  That way the
device should be able to get all messages, and in the case of
excess message traffic a whole message could be dropped (possibly
looking first for some high-priority messages).  Of course, there
may be some externally mandated message format and/or
communication protocol making DMA inappropriate.
Still, assuming interrupts, all characters should reach
the interrupt handler, causing possibly some extra CPU
load.  The only possibility of unnoticed loss of characters
would be blocking interrupts too long.  If interrupts can
be blocked for too long, then I would expect loss of whole
messages.  In such a case the protocol should have something like
"don't talk to me for the next 100 milliseconds, I will be busy"
to warn other nodes and request silence.  Now, if you
need to faithfully support silliness like Modbus RTU timeouts,
then I hope that you are adequately paid...
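
As a sketch of the "buffer big enough for two messages" idea
(ping-pong buffers; the DMA setup is MCU-specific and omitted, and
all names here are invented):

    #define MSG_SIZE 64

    static unsigned char msg_buf[2][MSG_SIZE];
    static volatile uint8_t msg_ready;   /* set at end-of-message */
    static uint8_t dma_bank;             /* bank the DMA is filling */

    void uart_break_isr(void) {          /* break = end of message */
        dma_bank ^= 1;
        /* ...restart DMA into msg_buf[dma_bank] (MCU-specific)... */
        msg_ready = 1;
    }

    void msg_task(void) {                /* called from the mainloop */
        if (msg_ready) {
            msg_ready = 0;
            process_message(msg_buf[dma_bank ^ 1]);   /* other bank */
        }
    }

If a second break arrives before msg_task() runs, the message that
gets lost is a whole one -- which is the behaviour described above.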

Well, the OS "cache" has many functions.  One of them is read-ahead;
another is scheduling of requests to minimize seek time.
And besides data there is also metadata.  OS functions need
access to metadata, and OSes are designed under the assumption
that there is a decent cache hit rate on metadata accesses.

Well, when needed I use my own library.

Nice, but I am not sure how practical this would be in modern
times.  I have C code and can reasonably estimate its resource use.
But there are changeable parameters which may enable/disable
some parts.  And size/speed/stack use depends on compiler
optimizations.  So there is variation.  And there are traps.
The linker transitively pulls in dependencies; if there are "false"
dependencies, they can pull in much more than strictly needed.
One example of a "false" dependency is (or maybe was) C++
VMTs.  Namely, any use of an object/class pulled in its VMT, which in
turn pulled in all ancestors and methods.  If unused methods
referenced other classes, that could easily cascade.  In both
cases the authors of the libraries probably thought that the provided
"goodies" justified the size (the intended targets were larger).

My 16x2 text LCD routine may pull in the I2C driver.  If I2C is not
needed anyway, this is an additional cost; otherwise the cost is shared.
The LCD routine depends also on a timer.  Both the timer and I2C affect
MCU initialization.  So even in very simple situations the total
cost is rather complex.  And the libraries that I tried presumably
were not "components" in your sense; you had to link the program
to learn the total size.  Documentation mentioned dependencies
when they affected correctness, but otherwise not.  To tell
the truth, when a library supports hundreds or thousands of different
targets (combinations of CPU core, RAM/ROM sizes, peripheral
configurations) with different compilers, then it is hard
to make exact statements.

IMO, in an ideal world, for "standard" MCU functionality we would
have a configuration tool where the user can specify the needed
functionality and the tool would generate semi-custom code
and estimate its resource use.  MCU vendor tools attempt to
offer something like this, but the reports I heard were rather
unfavourable; in particular it seems that vendors simply
deliver a thick library that supports "everything", and
linking to this library causes code bloat.

Well, FreeRTOS comes with "no warranty", but AFAICS they make
an honest effort to have good real-time behaviour.  In particular,
code paths through FreeRTOS from events to user code are of
bounded and rather short length.  User code still may be
delayed by interrupts/process priorities, but they give a reasonable
explanation.  So it is up to the user to code things in a way that
gives the needed real-time behaviour, but FreeRTOS normally will not
spoil it and may help.

Well, looking at books and articles I did not find a convincing
argument/example showing that one really needs multitasking for
small systems.  I tend to think rather in terms of a collection
of coupled finite state machines (or, if you prefer, a Petri net).
State machines transition in response to events and may generate
events.  Each finite state machine could be a task.  But it is
not clear that it should be.  Some transitions are simple and should
be fast; those I would do in interrupt handlers.  Some
others are triggered in a regular way from other machines and
are naturally handled by function calls.  Some need queues.
The whole thing fits reasonably well into the "super loop" paradigm,
as in the sketch below.
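
Something like this (all names invented; each machine is a function
dispatching on its state, called from the super loop or from an
interrupt handler):

    enum { EV_START, EV_HW_DONE, EV_M1_DONE };  /* event codes (invented) */
    typedef enum { S_IDLE, S_RUN, S_DONE } m1_state_t;

    static m1_state_t m1_state;

    void m1_step(int ev) {               /* one coupled state machine */
        switch (m1_state) {
        case S_IDLE:
            if (ev == EV_START) { start_hw(); m1_state = S_RUN; }
            break;
        case S_RUN:
            if (ev == EV_HW_DONE) {      /* emit() may wake other machines */
                m1_state = S_DONE;
                emit(EV_M1_DONE);
            }
            break;
        case S_DONE:
            break;
        }
    }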

I have found one issue that at first glance "requires"
multitasking.  Namely, when one wants to put the system into
sleep mode when there is no work, the natural "super loop"
approach looks like

        if (work_to_do) {
           do_work();
        } else {
           wait_for_interrupt();
        }

where 'work_to_do' is a flag which may be set by interrupt handlers.
But there is a nasty race condition: if an interrupt comes between
the test of 'work_to_do' and 'wait_for_interrupt', then despite
having work to do the system will go to sleep and only wake on the
next interrupt (which, depending on specific requirements, may
be harmless or a disaster).  I was unable to find simple code
that avoids this race.  With a multitasking kernel the race vanishes:
there is an idle task which does nothing but 'wait_for_interrupt',
and the OS scheduler passes control to worker tasks when there is
work to do.  But when one looks at how the multitasker avoids the race,
then it is clear that the crucial point is doing the control transfer
via return from interrupt.  More precisely, the variables are
tested with interrupts disabled, and after the decision is made,
return from interrupt transfers control.  The important point is
that if an interrupt comes after the control transfer, the interrupt
handler will re-do the test before returning to user code.  So what is
needed is a piece of low-level code that uses return from interrupt for
control transfer, and all interrupt handlers need to jump to
this code when finished.  The rest (usually the majority) of the
multitasker is not needed...
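
The crucial ingredient -- testing the flag with interrupts disabled
and letting the hardware handle the handoff -- does have a short,
architecture-specific form on some cores.  A sketch for a Cortex-M
(using CMSIS intrinsics; WFI wakes on an interrupt that becomes
pending even while PRIMASK masks it, so the race window is closed):

    for (;;) {
        __disable_irq();          /* handlers deferred, not lost      */
        if (work_to_do) {
            __enable_irq();       /* take any deferred interrupt...   */
            do_work();            /* ...then do the work              */
        } else {
            __WFI();              /* returns at once if an IRQ is     */
            __enable_irq();       /* already pending; else sleeps     */
        }
    }

On cores without such a guarantee one indeed needs the
return-from-interrupt trick described above.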

Ownership may cause problems: there is a tendency to "solve"
problems locally, that is, in the code that a given person "owns".
This is good if there is an easy local solution.  However, this
may also lead to ugly workarounds that really do not work
well, while the problem is easily solvable in a different part
("owned" by a different programmer).  I have seen such things
several times: looking at the whole codebase, after some effort
it was possible to make a simple fix, while the workarounds sat
in different ("wrong") places.  I had no contact with the
original authors, but it seems that the workarounds were due to
"ownership".

--  
                              Waldek Hebisch

Re: How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on
On 11/10/2021 9:34 PM, snipped-for-privacy@math.uni.wroc.pl wrote:

You can configure using manifest constants, conditional compilation,
or even run-time switches.  Or, by linking against different
"support" routines.  How and where the configuration "leaks"
into user code is a function of the configuration mechanisms that
you decide to employ.
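
E.g., the same UART driver might expose several of those flavors at
once (a sketch; the names are invented):

    /* manifest constant / conditional compilation */
    #ifndef UART_TX_QUEUED
    #define UART_TX_QUEUED 1        /* application may override */
    #endif

    /* run-time switch */
    void uart_set_baud(uint32_t baud);

    /* link against different "support" routines: the application
       (or board support package) supplies this, not the driver */
    extern void uart_board_init(void);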

E.g., You'd likely NOT design your network stack to be tightly integrated
with your choice of NIC (all else being equal) -- simply because you'd
want to be able to reuse the stack with some *other* NIC without having
to rewrite it.

OTOH, it's not unexpected to want to isolate the caching of ARP results
in an "application specific" manner as you'll likely know the sorts (and
number!) of clients/services with which the device in question will be
connecting.  So, that (sub)module can be replaced with something most
appropriate to the application yet with a "standardized" interface to
the stack itself (*YOU* define that standard)
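
As an illustration of that sort of decoupling (a sketch, with
invented names), the stack sees only an interface, never the NIC's
registers:

    /* the stack talks to *any* NIC through this table */
    struct nic_ops {
        int (*init)(void *ctx);
        int (*send)(void *ctx, const void *frame, size_t len);
        int (*recv)(void *ctx, void *frame, size_t maxlen);
    };

    /* each driver supplies its own table... */
    extern const struct nic_ops some_nic_ops;

    /* ...and the stack is written against the interface alone */
    int stack_transmit(const struct nic_ops *nic, void *ctx,
                       const void *frame, size_t len)
    {
        return nic->send(ctx, frame, len);
    }

The replaceable ARP cache gets the same treatment: a small table of
operations whose *interface* you standardize, with the implementation
chosen per application.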

All of these require decisions up-front; you can't expect to be able to
retrofit an existing piece of code (cheaply) to support a more
modular/configurable implementation in the future.

But, personal experience teaches you what you are likely to need
by way of flexibility/configurability.  Most folks tend to work
in a very narrow set of application domains.  Chances are, the
network stack you design for an embedded product will be considerably
different than one for a desktop OS.  If you plan to straddle
both domains, then the configurability challenge is greater!

You can also design with the intent of parsing messages before they are
complete and "reducing" them along the way.  This is particularly
important if messages can have varying length *or* there is a possibility
for the ass end of a message to get dropped (how do you know when the
message is complete?  Imagine THE USER misconfiguring your device
to expect CRLFs and the traffic only contains newlines; the terminating
CRLF never arrives!)

[At the limit case, a message reduces to a concept -- that is represented
in some application specific manner:  "Start the motor", "Clear the screen",
etc.]

Barcodes are messages (character sequences) of a sort.  I typically
process a barcode at several *concurrent* levels:
- an ISR that captures the times of transitions (black->white->black)
- a task that reduces the data captured by the ISR into "bar widths"
- a task that aggregates bar widths to form characters
- a task that parses character sequences to determine valid messages
- an application layer interpretation (or discard) of that message
This allows each layer to decide when the data on which it relies
does not represent a valid barcode and discard some (or all) of it...
without waiting for a complete message to be present.  So, the
resources that were consumed by that (partial?) message are
freed earlier.
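
The bottom of that stack might look like this (a sketch, assuming an
input-capture timer; all names are invented):

    #define EDGE_BUF 128                    /* power of two */

    static volatile uint16_t edge_time[EDGE_BUF];
    static volatile uint8_t  edge_in;       /* written by the ISR */
    static uint8_t           edge_out;      /* read by the task   */

    void edge_capture_isr(void) {           /* every b->w / w->b edge */
        edge_time[edge_in % EDGE_BUF] = TIMER->CAPTURE;
        edge_in++;
    }

    void width_task(void) {                 /* reduce edges to widths */
        while ((uint8_t)(edge_in - edge_out) >= 2) {
            uint16_t w = edge_time[(uint8_t)(edge_out + 1) % EDGE_BUF]
                       - edge_time[edge_out % EDGE_BUF];
            edge_out++;
            emit_width(w);                  /* hand off to next layer */
        }
    }

Each layer above it does the same thing with coarser data, and any
layer can discard its backlog when the data stops looking like a
barcode.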

As such, there is never a "start time" nor "end time" for a barcode
message -- because you don't want the user to have to "do something"
to tell you that he is now going to scan a barcode (otherwise, the
efficiency of using barcodes is subverted).

[Think about the sorts of applications that use barcodes; how many
require the user to tell the device "here comes a barcode, please start
your decoder algorithm NOW!"]

As users can abuse the barcode reader (there is nothing preventing them
from continuously scanning barcodes, in violation of any "protocol"
that the product may *intend*), you have to tolerate the case where
the data arrives faster than it can be consumed.  *Knowing* where
(in the event stream) you may have "lost" some data (transitions,
widths, characters or messages) lets you resync to a less pathological
event stream later (when the user starts "behaving properly")


The advantages of multitasking lie in problem decomposition.
Smaller problems are easier to "get right", in isolation.
The *challenge* of multitasking is coordinating the interactions
between these semi-concurrent actors.  Experience teaches you how
to partition a "job".

I want to blink a light at 1 Hz and check for a button to be
pressed which will start some action that may be lengthy.  I
can move the light blink into an ISR (which GENERALLY is a ridiculous
use of that "resource") to ensure the 1Hz timeliness is maintained
regardless of what the "lengthy" task may be doing, at the time.

Or, I can break the lengthy task into smaller chunks that
are executed sequentially with "peeks" at the "light timer"
between each of those segments.

    unsigned sequence1 = 0, sequence2 = 0, sequence3 = 0, sequence4 = 0;

    while (FOREVER) {
task1:
       switch (sequence1++) {
       case 0: do_task1_step0(); break;
       case 1: do_task1_step1(); break;
       case 2: do_task1_step2(); break;
       ...
       }

       do_light();

task2:
       switch (sequence2++) {
       case 0: do_task2_step0(); break;
       case 1: do_task2_step1(); break;
       case 2: do_task2_step2(); break;
       ...
       }

       do_light();

task3:
       switch (sequence3++) {
       case 0: do_task3_step0(); break;
       case 1: do_task3_step1(); break;
       case 2: do_task3_step2(); break;
       ...
       }

       do_light();

       ...
     }

When you need to do seven (or fifty) other "lengthy actions"
concurrently (each of which may introduce other "blinking
lights" or timeliness constraints), its easier (less brittle)
to put a structure in place that lets those competing actions
share the processor without requiring the developer to
micromanage at this level.

[50 tasks isn't an unusual load in a small system; video arcade
games from the early 80's -- 8 bit processors, kilobytes of
ROM+RAM -- would typically treat each object on the screen
(including bullets!) as a separate process]

The above example has low overhead for the apparent concurrency.
But, it pushes all of the work onto the developer's lap.  He has
to carefully size each "step" of each "task" to ensure the
overall system is responsive.

A nicer approach is to just let an MTOS handle the switching
between tasks.  But, this comes at a cost of additional run-time
overhead (e.g., arbitrary context switches).

I use FSMs for UIs and message parsing.  They let the structure
of the code "rise to the top" where it is more visible (to another
developer) instead of burying it in subroutines and function calls.

"Event sources" create events which are consumed by FSMs, as
needed.  So, a "power monitor" could generate POWER_FAIL, LOW_BATTERY,
POWER_RESTORED, etc. events while a "keypad decoder" could put out
ENTER, CLEAR, ALPHA_M, NUMERIC_5, etc. events.

Because there is nothing *special* about an "event", *ANY* piece of
code can generate them.  Their significance is assigned based on where
they are "placed" (in memory) and who/what can "see" them.  So,
you can use an FSM to parse a message (using "received characters"
as an ordered stream of events) and "signal" MESSAGE_COMPLETE to
another FSM that is awaiting "messages" (along with a pointer to the
completed message)
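
A sketch of that plumbing (names invented; overflow checks and the
locking concerns discussed earlier in the thread omitted for
brevity) -- an "event" is just a small record in a queue that any
code can post to:

    typedef struct { uint8_t type; void *data; } event_t;

    #define EVQ_SIZE 16                  /* power of two */
    static event_t evq[EVQ_SIZE];
    static volatile uint8_t evq_in;      /* producers: ISRs, FSMs, ... */
    static uint8_t evq_out;              /* consumer: the mainloop     */

    void event_post(uint8_t type, void *data) {
        uint8_t i = evq_in;
        evq[i % EVQ_SIZE].type = type;
        evq[i % EVQ_SIZE].data = data;
        evq_in = i + 1;
    }

    void fsm_pump(void) {                /* called from the mainloop */
        while (evq_out != evq_in) {
            event_t ev = evq[evq_out % EVQ_SIZE];
            evq_out++;
            message_fsm(&ev);            /* every FSM sees every event */
            power_fsm(&ev);
        }
    }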

You are *always* at the mercy of the code's owner.  Just as folks
are at YOUR mercy for the code that you (currently) exert ownership
over.  The best compliments you'll receive are from folks who
inherit your codebase and can appreciate its structure and
consistency.  Conversely, your worst nightmares will be inheriting
a codebase that was "hacked together", willy-nilly, by some number
of predecessors with no real concern over their "product" (code).

E.g., For FOSS projects, ownership isn't just a matter of who takes
"responsibility" for coordinating/merging diffs into the
codebase but, also, who has a compatible "vision" for the
codebase, going forward.  You'd not want a radically different
vision from one owner to the next as this leads to gyrations in
the codebase that will be seen as instability by its users
(i.e., other developers).

I use PostgreSQL in my current design.  I have no desire to
*develop* the RDBMS software -- let folks who understand that
sort of thing work their own magic on the codebase.  I can add
value *elsewhere* in my designs.

But, I eventually have to take ownership of *a* version of the
software as I can't expect the "real owners" to maintain some
version that *I* find useful, possibly years from now.  Once
I assume ownership of that chosen release, it will be my
priorities and skillset that drive how it evolves.  I can
choose to cherry pick "fixes" from the main branch and back-port
them into the version I've adopted.  Or, decide to live with
some particular set of problems/bugs/shortcomings.

If I am prudent, I will attempt to adopt the "style" of the
original developers in fitting any changes that I make to
that codebase.  I'd want my changes to "blend in" and seem
consistent with that which preceded them.

Folks following the main distribution would likely NOT be interested
in the changes that I choose to embrace as they'll likely have
different goals than I.  But that doesn't say my ownership is
"toxic", just that it doesn't suit the needs of (most) others.

---

I've got to bow out of this conversation.  I made a commitment to
release 6 designs to manufacturing before year end.  As it stands,
now, it looks like I'll only have time enough for four of them as
I got "distracted", spending the past few weeks gallivanting (but
it was wicked fun!).

OTOH, it won't be fun starting the new year two weeks "behind"...  :<

[Damn holidays eat into my work time.  And, no excuse on my part;
it's not like I didn't KNOW they were coming!!  :< ]
