Draft Background Paper on Asserts in Embedded Systems (very long)

Below is a draft background paper I'm writing on the use of asserts in embedded systems.

Please discuss, comment, chew, masticate, mangle, improve, send pointers to references and additional material.

Thanks.

John Carter

"We have more to fear from The Bungling of the Incompetent Than from the Machinations of the Wicked." (source unknown)

Asserts for Embedded Systems by John Carter
===========================================

This is a background paper intended to introduce asserts within Embedded Systems, and to provide a common understanding of, and terminology for, their use.

1.What is an assert?
====================

Definition:

An "assert" statement asserts that it's condition must be true. Undefined "Bad Things" happen if it is false.

There is a deliberate degree of fuzziness in this definition. The only thing clear is that the expression should be true.

What is not specified is...

  • Whether the expression is evaluated.
  • What exactly will happen if the expression is false.
  • Whether the control flow exits out of the assert.
  • Whether the assert is even compiled into the code at all.

Rule 0: You cannot rely on the condition being evaluated. You cannot rely on the assert to alter the control flow.

Corollary of Rule 0: No Side Effects! The assert expression should never have side effects. eg. assert( n++ < 10); // NO! NO! NO!
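To see why, consider what happens when the assert is compiled out (here using the standard NDEBUG convention of <assert.h>). A minimal sketch, with a hypothetical items_read counter:

#include <assert.h>

static int items_read;

/* Wrong: the side effect lives inside the assert expression, so it
 * silently disappears when NDEBUG is defined. */
void consume_wrong(void)
{
    assert(items_read++ < 10);
}

/* Right: perform the side effect first, then assert on the result.
 * This assert is safe to compile out. */
void consume_right(void)
{
    items_read++;
    assert(items_read <= 10);
}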

1.1.Asserts are....

-------------------

  • Assertions are Sanity checks.
  • Assertions are debug tools.
  • Assertions are endo-tests (tests from inside).
  • Assertions are a tool to aid the writing of correct software.
  • Assertions validate the software itself, not the data fed to the software.
  • Assertions document the prerequisites and effects of software in a manner that is verifiable, verified, and up to date with the software itself.

2.What's so special about Embedded Systems?

-----------------------------------------

Embedded systems have stringent requirements on the following attributes:

  • Reliability, which implies Simplicity.
  • Efficiency.
  • Rom/Ram footprint.
  • Service availability.
  • Hard and Soft Real Time response deadlines.

In particular our systems are used...

  • in remote and unattended locations. ie. No manual resets!
  • far from the factory and service facilities.
  • by people with far higher priorities than our systems, but still requiring push services from our systems. ie. No time or inclination to clear modal error dialogs.
  • in many thousands of distinct geographically distributed devices for which it is way too costly to recall and reprogram.
  • in particular, far from where they are configured.

Here are some silly examples which should never happen...

No matter how dramatic the software error, it is unacceptable for an unattended device on a remote mountain top that has been hardworking and error free for months to lock up with an Assert failure and then to require a manual reset.

As another Silly Example, a fireman busy attending a fire does not have time to clear an error message, but still needs to receive any incoming emergency calls.

As a last Silly Example: after working fine for a day, the user is now many miles from the service facility. He finds he needs a feature which triggers an assert: "Your device is improperly configured!" The device then drops into reprogramming mode.

3.Types of Exceptional Conditions - Or what's the difference between an Assert and a System Error?

----------------------------------------------------------------------

1.Invalid User Inputs. All user inputs must be validated, errors reported to the user in a readable fashion and an opportunity given to correct and retry.

2.Data corruption of packets coming in via serial ports or over the air. The transport layer should CRC the packets and invoke a retry mechanism if need be. After a limited number of retries, a failure should be reported to the user and the radio returned to idle/standby state.

3.Hard Hardware fault. (Not user correctable) Error reported to user and hard reset to attempt to reinitialize all hardware systems to a known state.

4.Soft Hardware Fault (User Correctable). Error reported to user and return to standby state.

5.Subsystem / Peer error. An autonomous subsystem or Peer system, running on a separate CPU, event tick, process or thread has failed to provide proper responses in the appropriate time. Watchdog reset fits in here. Log and Hard reset.

6.Configuration Error. ONLY AND FULLY checked at startup; if the check fails, place the radio in programming mode.

7.Protocol Error. If, for example, a command cannot be completed, the error must be reported via the appropriate protocol response and then await the next command. Since our devices can usually be driven remotely via one of several protocols, the user interface needs to be sufficiently abstracted so that all instances of "Report Error to User and await input" can be translated to "Report Error via Protocol and await next command"

8.Connection Failure - The heart beat to a remote system has timed out, or it has unexpectedly been terminated by the far side. Resources deployed in the connection must be recovered and depending on the requirements one of...

1.Display error message and return to standby.

2.Exponentially backoff (wait) and reconnect a limited number of times.

3.Report failure to master system.

9.Programming Bug. The guiding principle here is "Once a system goes Mad, there is no limit to how Mad it can get." Trying to build software error recovery mechanisms results in greater complexity and more software defects. Simply log the error and warm restart.

Rule 2: Asserts are ONLY for Programming Errors!

If you are attempting to identify ANY of the other exceptional conditions with an assert, you are not doing asserts but something strange and hazardous that will cease to work when the assert behaviour changes!

4.Reliable Collaborations

-------------------------

Portions of this from "Object Design: Roles, Responsibilities and Collaborations" by R. Wirfs-Brock & A. McKean

4.1.Trust Regions

Carve your software into regions where Trusted Communications can occur. Objects within the same Trust Region can communicate collegially.

Communications across a trust boundary need to be validated AND A MECHANISM PROVIDED FOR FEEDBACK, CORRECTION AND RETRY. A robust system should be able to cope with, correct and recover from incorrect inputs.

Communications within a Trust Region are assumed to be perfect and correct. Incorrect inputs within a trust region are the result of defective programming and are only correctable by new firmware and only recoverable by a reset.

ie. At the boundary of the trust region you check for all the exceptional conditions listed above. Within the Trust Region, you check for none of them, but you assert the sanity of the software.
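As a rough sketch (the packet handling names here are hypothetical, not from any particular stack): validation, feedback and retry happen once at the boundary, while inside the region the same condition becomes a plain assert.

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_PACKET_LEN 4u   /* hypothetical framing minimum */

static bool crc_ok(const uint8_t *p, size_t n) { (void)p; (void)n; return true; } /* stub */
static void request_retransmit(void) { /* ask the peer to resend */ }

/* Inside the trust region: the caller has already validated the packet,
 * so a bad argument here is a programming error, not bad data. */
static void dispatch_command(const uint8_t *packet, size_t len)
{
    assert(packet != NULL && len >= MIN_PACKET_LEN);
    /* ... act on the command ... */
}

/* At the trust boundary: untrusted input is validated, and a
 * feedback / correction / retry path is provided. */
bool boundary_receive(const uint8_t *packet, size_t len)
{
    if (packet == NULL || len < MIN_PACKET_LEN || !crc_ok(packet, len)) {
        request_retransmit();
        return false;
    }
    dispatch_command(packet, len);
    return true;
}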

5.Design by Contract (DbC)

-------------------------

Portions of this from "Object-Oriented Software Construction: 2nd Ed." by B. Meyer. Within a Trust Region the correctness of the software itself can be verified using DbC.

5.1.Preconditions and Postconditions

------------------------------------

Preconditions are assertions placed at the start of a function. They are boolean expressions that assert what the state of the system must be prior to executing the function.

Postconditions are assertions placed at the end of the function (just before the return). They are boolean expressions that assert what the state of the system will be after the function has executed, PROVIDED THAT THE PRECONDITION WAS TRUE.

5.1.1.REQUIRE and ENSURE

-----------------------

The following is from the EventHelix framework site. [4]

Create a REQUIRE macro to assert preconditions and a matching ENSURE macro to assert postconditions. As recommended by Meyer [2], if you must switch any asserts off, start with postconditions.
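A minimal sketch of such a pair of macros, assuming the ordinary assert underneath and an ENABLE_ENSURE switch (both names are illustrative, not taken from the EventHelix code):

#include <assert.h>

#define REQUIRE(expr)  assert(expr)   /* precondition  */

#if ENABLE_ENSURE
#define ENSURE(expr)   assert(expr)   /* postcondition */
#else
#define ENSURE(expr)   ((void)0)      /* postconditions switched off first */
#endif

/* Usage: a hypothetical integer square root. */
unsigned isqrt(unsigned x)
{
    REQUIRE(x <= 10000u);             /* caller's obligation (illustrative range) */
    unsigned r = 0;
    while ((r + 1u) * (r + 1u) <= x) {
        ++r;
    }
    ENSURE(r * r <= x && (r + 1u) * (r + 1u) > x);   /* supplier's promise */
    return r;
}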

5.1.2.Strong and Weak Conditions.

A stronger condition is one that is more difficult to satisfy. The strongest is assert(false); since it is never satisfied.

The weakest condition is assert( true); since it is always satisfied no matter what the software did!

Software that is easy to use has weak preconditions and strong postconditions.

Software that is easy to write has strong preconditions and weak postconditions.

5.2.Who to blame...

-------------------

1.A run-time assertion violation is the manifestation of a bug in the software.

2.A precondition violation is the manifestation of a bug in the client.

3.A postcondition violation is the manifestation of a bug in the supplier.

4.An invariant failure means the object itself is Barking Mad.

6.Interaction between Asserts and Architecture.

-----------------------------------------------

Software Defects, and hence Asserts, certainly do not exist independently of notions of Good Architecture. Let's look at some classic principles and see what they imply about our use of asserts. See Reference [3] from Object Mentor.

The Acyclic Dependencies Principle

The dependency structure between packages must be a Directed Acyclic Graph (DAG). That is, there must be no cycles in the dependency structure.

The Stable Dependencies Principle (SDP)

The dependencies between packages in a design should be in the direction of the stability of the packages. A package should only depend upon packages that are more stable than it is.

A stable system obeys the Stable Dependencies Principle. Thus the client code is usually more unstable, fragile, newer, buggier. Hence...

The Hierarchy of Trust Rule

We should have less trust in (and hence more asserts on) data coming from code in higher layers than on data from code in the same or lower layers.

6.1.Defendable Programming.

"Defensive Programming" is the habit of checking everything before doing anything. Symptoms of this disease is unreadable code that contains more error checks and error reports than working code!

A far better idea is to do "Defendable Programming" by Data Encapsulation and Information Hiding.

Basically there are two things all that error checking code is trying to do...

1.Stop the client from making a mess of things,

2.and check that this module itself isn't breaking things.

To stop the client from making a mess is simple. Hide all your internal data structures so your clients cannot screw them up because they simply can't even see them.

You would never give your wallet to a shopkeeper, so why do you expect your clients to manipulate your data structures directly?

By creating a narrow (and hence defendable) interface to your code, you limit what needs to be checked.

To stop your own code from making a mess of things, you need to work out what you are trying to protect, what you are trying to ensure, and what constraints you are trying to enforce.
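For example, a sketch in C of a hypothetical ring buffer module: the opaque handle is the whole of the defendable interface, and only the implementation file can touch the indices.

/* ringbuf.h -- the narrow, defendable interface.
 * Clients see only an opaque handle; they cannot corrupt the indices
 * because they cannot even name them. */
typedef struct ringbuf ringbuf_t;

ringbuf_t *ringbuf_create(unsigned capacity);
int        ringbuf_put(ringbuf_t *rb, unsigned char byte);   /* 0 on success */
int        ringbuf_get(ringbuf_t *rb, unsigned char *byte);  /* 0 on success */

/* ringbuf.c -- the internals live here, invisible to clients.
 * Only this file needs to check that head, tail and count stay sane. */
struct ringbuf {
    unsigned char *data;
    unsigned       capacity;
    unsigned       head;     /* next slot to write   */
    unsigned       tail;     /* next slot to read    */
    unsigned       count;    /* items currently held */
};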

6.2.Class Invariants.

--------------------

A class invariant is a boolean expression that is always true at the end of the initialization routine, and at the start and end of every public method.

Bjarne Stroustrup, the designer of the C++ language, says: "You should create a class if and only if you have an invariant to protect."

In C, the notion of a class more or less conforms to the notion of a compilation module.

A good way of finding invariants is to look at the state held by this module seeking...

  • constraints like bounds on indices, and address ranges for pointers on variables internal to your module.
  • things that must vary together, eg. if this changes so must that.
  • things that must always be true if this module is to work correctly.

You can sweep all these assertions into a private "check_invariant" method which you can call at the start and end of every public method.
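Continuing the hypothetical ring buffer sketch from section 6.1, the invariant ties the indices together, and every public function checks it on the way in and on the way out:

#include <assert.h>

/* Private sanity check: bounds on the indices, plus the relationship
 * that must always hold between head, tail and count. */
static void check_invariant(const ringbuf_t *rb)
{
    assert(rb != NULL && rb->data != NULL);
    assert(rb->head < rb->capacity);
    assert(rb->tail < rb->capacity);
    assert(rb->count <= rb->capacity);
    assert((rb->tail + rb->count) % rb->capacity == rb->head);
}

int ringbuf_put(ringbuf_t *rb, unsigned char byte)
{
    check_invariant(rb);                 /* entry check */
    if (rb->count == rb->capacity) {
        return -1;                       /* full: an expected condition, not a bug */
    }
    rb->data[rb->head] = byte;
    rb->head = (rb->head + 1u) % rb->capacity;
    rb->count++;
    check_invariant(rb);                 /* exit check */
    return 0;
}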

If the internal state of a module is not correlated in some way, it is a strong hint that your module lacks cohesion and should be split into two or more smaller modules. If it completely lacks cohesion, then perhaps it is not a module at all, but a Plain Old Data struct.

6.3.Asserts and Layered Software.

--------------------------------

Assume the following call graph....

void A( int i)
{
    // Some stuff
    B(i);
}

void C( int i)
{
    // Do other stuff
    B(i);
}

void B(int i)
{
    D(i);
}

void D(int i)
{
    for( int j = 0; j < lots; ++j) {
        E(i);
    }
}

void E(int i)
{
    // Do stuff that requires i > 0
}

Where do you place the precondition assert(s)?

Let's consider the options...

1.Everywhere.

This means i gets tested against the same constraint 3 times on every call.

Recommendation : Don't do this.

2.At the top level in function A and C.

This means the fault is found early, but the fact "i must be greater than zero" appears in two places, both far from the code that has the actual constraint.

Recommendation: Only do this en route to performing a hazardous operation, so that less needs to be put back into a safe mode.

3.At the start of the first common function B.

Pro: The fault gets caught early, and the constraint is documented (by the assert) at a higher level.

Con: The constraint is documented and enforced at a distance from the code where it actually applies.

Recommendation: Don't do this.

4.The assert is placed at the level at which things actually go wrong if the constraint is not met. ie. In function E.

Pro: Constraint and Assert live together and are easier to change together, one fact one place.

Con: Assert now lives in an inner loop.

Recommendation: Do this until profiling / inspection demonstrates you mustn't. "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

Options thereafter are:

1.Move the assert one level up to start of function D.

2.Make E an inline function and trust the compiler's optimizer to move constant sub-expressions out of the loop.

These recommendations can be summed up as...

The "One Fact, One Place" / Don't Repeat Yourself (DRY) rule.

Keep the assert as near as is reasonable to the place where the constraint actually pinches.

1.Trust in your system's ability to do backtraces to show you how you got into that invalid state. (See reference [7], eCrash.)

2.If your assert is far from the constraint, you risk introducing assert bugs as the code evolves and people forget to update the asserts to match the new constraints.

3.Constructors / Initializers are an exception to this rule. Often an invalid parameter will be stashed in an instance variable. If the constraint only pinches in a subsequent method call, the bug will not be on the backtrace.
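Applied to the call graph in section 6.3, the recommendation puts the assert in E itself; a sketch ("lots" is stood in for by an arbitrary constant):

#include <assert.h>

static void E(int i)
{
    assert(i > 0);   /* one fact, one place: the constraint lives where it pinches */
    /* ... do stuff that requires i > 0 ... */
}

static void D(int i)
{
    /* If profiling later shows this inner loop is too hot, hoist the
     * assert one level up to here, or make E inline. */
    for (int j = 0; j < 1000; ++j) {
        E(i);
    }
}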

7.Assert failures.

------------------

So the worst has happened: your program has a defect and an assert has fired in a customer's device in the field, on a remote mountain top, in a fireman's hands. What happens now?

7.1.Hazardous Operations

Suppose the assert is in the path towards the device performing a hazardous operation. Unless it gets it exactly Right, Very Bad Things will happen. eg. The device melts, the user's lunch gets eaten, ...

The assert has fired, thus we know the higher level software is Barking Mad. What should we do?

Answer 1: Back off carefully, putting things into a safe state as we go and then reset.

Answer 2: We cannot trust any service to be operating sanely. Quite possibly other threads are already gibberingly bonkers and we have only now woken up to the fact things are going wrong. Thus it is safest just to reboot and assume that the boot processes will put everything in a safe state.

Recommendation: Trust to the boot process.

7.2.Standard Operations and Software Reliability.

------------------------------------------------

In considering what to do with non-hazardous operations the following facts are pertinent...

The vast majority of software defects either do not have any impact on the customer, or have only minimal impact.

Assert expressions also have defects.

Thus for most operations, taking any customer-perceptible action on assert failure will decrease the reliability of the software!

Recommendation: Log the Instruction Counter and carry on.

8.Programming Style for Asserts.

Once you have written a function, review it looking for assumptions you have made. Consider replacing certain comments by assertions. (Think of assertions as "executable comments")

Document these assumptions in the form of..

  • precondition "require" asserts
  • Compile time asserts. (Very useful for catching portability "gotchas" on word sizes and the like.)

Review the function looking for ways in which it could go wrong. Asserts within the body of the code document how you believe the algorithm should work.

Unless the reason for the assert is obvious, add a comment explaining why the constraint exists or the assumption is necessary. If you don't, the next programmer (or you yourself six months later), on meeting a failed assert, may blame the assert instead of themselves!

Robust coding practices may mask errors. Use asserts to make them explicit. eg. Many programmers would rather do...

while( i < N) { .... }

instead of

while( i != N) { ... }

since it defends against the possibility that i may step past N, creating an infinite loop. A slightly better style is...

while( i < N) { .... }
assert( N == i); // i should start < N and then single step

Asserts should remain in the code for the full life of the product. Do not remove them.

8.1.Use the right assert.

---------------------------

By choosing the right assert from the palette of available asserts, you create more options for creative optimizations without loss of value.

Use a "require" assert for preconditions and an "ensure" assert for postconditions. This enables you to selectively disable postcondition asserts if need be.

If your system throws an exception on a null pointer access, why bother to assert that the pointer is not null? Trust the hardware and the exception handler to do The Right Thing.

However,

  • This is not portable coding.
  • It loses the documentation value of the assert.

Thus instead of...

assert( pointer);
// or....
assert( pointer != NULL);

use...

assert_valid_pointer( pointer);

This permits the system architect to elect to trust to the hardware, or create an assert that checks that the pointer points to a valid address.
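A minimal sketch of how assert_valid_pointer might be wired up; the RAM window and the configuration switch are hypothetical and would be set per target:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RAM_START 0x20000000u   /* hypothetical start of this target's RAM */
#define RAM_END   0x20010000u   /* hypothetical end of this target's RAM   */

#if TRUST_HARDWARE_TRAP
/* The CPU faults on a wild dereference anyway; keep the documentation
 * value of the call but generate no code. */
#define assert_valid_pointer(p)  ((void)0)
#else
/* Explicit check: non-null and within the RAM window. */
#define assert_valid_pointer(p) \
    assert((p) != NULL && \
           (uintptr_t)(p) >= RAM_START && (uintptr_t)(p) < RAM_END)
#endif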

8.2.Compile Time Asserts Are Better Than Run Time Asserts.

----------------------------------------------------------

Why wait until the customer gets it? Fail asserts at compile time if you can at all.

Good candidates for compile time asserts are to document non-portable assumptions about word sizes, endianness and non-standard compiler features.

However, better than compile time asserts is to use the C99 standard fixed-width typedefs (from <stdint.h>) to select the appropriate word size.

Other candidates are library version numbers and constraints on sizes of statically allocated resources.
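For the word-size case, a sketch of the <stdint.h> approach (the variable names are illustrative):

#include <stdint.h>

uint32_t      packet_crc;       /* exactly 32 bits, or this does not compile     */
int16_t       temperature_q8;   /* exactly 16 bits (hypothetical Q8 fixed point) */
uint_least8_t flags;            /* at least 8 bits, smallest type available      */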

8.3.Real Time Programming needs Real Time Asserts.

--------------------------------------------------

So we have real time deadlines. Do we meet them? We don't know unless a customer whinges. Use a real time assert that asserts how long an operation takes.
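A minimal sketch, assuming a hypothetical monotonic millisecond tick source ticks_now(); the macro names and the 50 ms budget are illustrative only:

#include <assert.h>
#include <stdint.h>

extern uint32_t ticks_now(void);   /* hypothetical: read a free-running ms timer */

/* Assert that the code between the two macros completes within 'budget' ms. */
#define RT_ASSERT_BEGIN()      uint32_t rt_assert_start_ = ticks_now()
#define RT_ASSERT_END(budget)  assert((uint32_t)(ticks_now() - rt_assert_start_) <= (budget))

void handle_keypress(void)
{
    RT_ASSERT_BEGIN();
    /* ... work that must complete within the soft deadline ... */
    RT_ASSERT_END(50u);            /* hypothetical 50 ms budget */
}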

9.Implementing Asserts.

The primary "gotcha" in implementing asserts is to forget to handle the following case...

if( expression)
    assert( other_expression);
else
    do_stuff();

Whatever implementation you choose, that example should compile correctly without warnings whether asserts are compiled in or compiled out.

Furthermore, when compiled out, it should vanish entirely from the generated machine code.
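The usual way to satisfy both requirements is to make the compiled-out form expand to a void expression rather than to nothing; a sketch with hypothetical names (the real implementation appears in 9.3):

void my_assert_failed(const char *expr, const char *file, int line);
void do_stuff(void);

#if MY_ASSERTS_ENABLED
#define MY_ASSERT(e)  ((e) ? (void)0 : my_assert_failed(#e, __FILE__, __LINE__))
#else
#define MY_ASSERT(e)  ((void)0)   /* still parses as a statement, generates no code */
#endif

void example(int expression, int other_expression)
{
    if( expression)
        MY_ASSERT(other_expression);   /* still binds to this 'if' ...       */
    else
        do_stuff();                    /* ... so this 'else' is not orphaned */
}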

9.1.Implementing Compile time asserts.

-------------------------------------

The simplest version does not permit being switched off (note the negation: the error fires when the asserted condition is false)...

#if !(EXPRESSION)
#error Whinge, Whinge, Moan, Despair!
#endif

The version below relies on the compiler to complain when a typedef to an array of negative size is declared. The nice thing is it disappears entirely from the compiled code.

#define CompileTimeAssert(ex) do {\
    typedef char COMPILE_TIME_ASSERTION_FAILURE[(ex) ? 1 : -1];\
} while(0)

The above version can only be used within the body of a function. At file scope, you need this one...

#define FileScopedCompileTimeAssert(ex) \
    extern char COMPILE_TIME_ASSERTION_FAILURE[(ex) ? 1 : -1];

9.2.Trust Your Exception Handlers to do The Right Thing.

--------------------------------------------------------

If you are hunting an extra burst of speed from your assert implementation, no code is faster than no code! If your CPU traps invalid pointer dereferences anyway, implement "assert_valid_pointer" as an empty no operation.

Read your CPU spec sheet and create a palette of assertion definitions that can be switched off and entrusted to the hardware.

9.3.An implementation of the macros.

-----------------------------------

The following code also does The Right Thing if you check your code with splint.

#ifndef S_SPLINT_S

#if ERROR_ENABLE_ASSERT
#if 1 /* change to 0 to save space */
#define error_Assert(expression) \
    ((expression) ? (void)NULL : error_LocalAssert(#expression, __FILE__, __LINE__))
#else
#define error_Assert(expression) \
    ((expression) ? (void)NULL : error_LocalAssert((char *)NULL, __FILE__, __LINE__))
#endif
#else
#define error_Assert(expression) ((void)NULL)
#endif

#else /* to get splint side effect checking */
extern /*@noreturnwhenfalse@*/ void error_Assert(/*@sef@*/ bool_t expression);
#endif

A valid criticism of this implementation from the embedded system point of view is that it takes way too much rom space to store all those stringified expressions and __FILE__ string constants.

A better implementation would use a non-portable, one-assembler-instruction hack to just log the program counter.
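As a hedged sketch of that idea, assuming GCC extensions (local labels and labels-as-values) and a hypothetical log_address() hook into persistent store; the error_Assert name is reused from above purely for illustration:

#include <stdint.h>

extern void log_address(uintptr_t where);   /* hypothetical .noinit / EEPROM hook */

#if defined(__GNUC__)
/* Record only the address of the assert site; no strings in rom.
 * The address is mapped back to a source line on the host using the
 * linker map file or addr2line. */
#define error_Assert(e)                                \
    do {                                               \
        if (!(e)) {                                    \
            __label__ assert_site;                     \
        assert_site:                                   \
            log_address((uintptr_t)&&assert_site);     \
        }                                              \
    } while (0)
#endif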

10.Programming Style Downstream of an Assert.

--------------------------------------------

Suppose you have asserted that a certain condition cannot happen.

How do you code after the assert?

After all, sometimes the asserts will be turned on, sometimes they will be turned off. Sometimes they will trigger a reset, sometimes they will just log the event and fall through.

An Assert Failure means there is a software defect fixable only by a new version of the firmware!

So the answer is simple.

The Certainty Principle Prior to the assert you code as if the verity of the expression is in doubt. After the assert you code as if the expression was an absolute unquestionable certainty.

To have code that checks for that condition after the assert is undoubtedly a waste of lines of code, CPU cycles, ram and rom, and makes the system less reliable, not more.

Non-Redundancy Corollary The body of a routine should never test for the routine's precondition.

11.Maintaining Service in the Presence of Errors.

Consider my two examples.

1.The "unattended device on a remote mountain top" 2.The "user with way higher priorities still requiring emergency priority 'push services' no matter what the state of the device is"

How do we design error handling services to minimise the disruption of service?

The issues here are...

By the time an assert fires, the system is already Barking Mad, in a way that you simply didn't and quite likely could not foresee. To attempt to code your way out of all imaginable scenarios is the fast route to spaghetti and stress breakdowns.

Since the system is Barking Mad, we cannot really rely on any service performing to spec.

We need to return the system to a sane state as rapidly as possible, WITHOUT MANUAL INTERVENTION, so that we can receive incoming emergency push services.

Now there are two possibilities here.

1.The assert expression is possibly buggy and even if it isn't, the bug probably won't impact the customer.

Recommendation: The Program Counter is logged in persistent store and control flow continues as if nothing happened.

2.The system undoubtedly will die messily if it proceeds.

Recommendation:

If by dying we mean, "throw a hardware exception", then implement the assert as a null op and trust to the exception handler to do "The Right Thing".

Otherwise, disable all interrupts, stash error information in persistent memory and warm reboot. At the end of the boot process the error message is recovered from persistent store and displayed.

12.Conclusion : Asserts in or Asserts out of Production Code.

------------------------------------------------------------

In a field fraught with subtle disagreements, there is one universally agreed principle with regard to asserts.

Test like you Fly, Fly what you tested.

Whatever you decide regarding asserts in production code, send exactly the same thing to the test team as you send to the customer.

Removing asserts from production code has been likened to removing your seat belts once you hit the highway.

Recommendations: Asserts in Production Software

1.Disable postcondition asserts.

2.Use real time asserts to check you aren't missing deadlines.

3.Profile the asserts and remove or move asserts in hot spots.

4.Where sensible move asserts to constructors / initialization / start up code.

5.If an assert fires, log the PC counter and continue.

6.For hazardous operations have a "fatal assert" macro that ALWAYS logs PC counter and resets and is NEVER disabled.

7.Trust to your hardware exceptions and disable assert types that will be caught by the hardware anyway.
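A sketch of the "fatal assert" from recommendation 6; the platform hooks are hypothetical and would be supplied per target:

#include <stdint.h>

/* Hypothetical platform hooks. */
extern void      disable_interrupts(void);
extern void      persist_failure_record(uintptr_t where);  /* .noinit / EEPROM */
extern void      warm_reset(void);                         /* does not return  */
extern uintptr_t current_pc(void);                          /* PC capture hack  */

/* Never compiled out, regardless of how ordinary asserts are configured. */
#define FATAL_ASSERT(e)                             \
    do {                                            \
        if (!(e)) {                                 \
            disable_interrupts();                   \
            persist_failure_record(current_pc());   \
            warm_reset();                           \
        }                                           \
    } while (0)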

12.1.Glossary

-------------

Cold start - initial power on reset, usually includes some power on self tests like configuration database sanity checks.

Warm Start - A fast reboot that assumes the system was on previously; some flags may have been cached in persistent memory to indicate that alternate start up actions should be performed, and extended Power on Self Tests are skipped.

Pull Services - The web browser is a classic "pull service". The user has to initiate an action which "pulls" the service from a remote server.

Push Services - An incoming phone call is the classic "push service". Without requiring any user action, by remote initiation, the phone starts to ring.

12.2.References

---------------

[1]"Object Design: Roles, Responsibilities and Collaborations" by R. Wirfs-Brock & A. McKean [2]"Object-Oriented Software Construction: 2nd Ed." by B. Meyer. [3] The Stable Dependencies Principle.
formatting link
[4] Design by Contract Programming in C++
formatting link
[5] "Writing Solid Code " Steve Maguire [6] "A Different Take On Asserts" by Jack Ganssle
formatting link
[7] "eCrash: Debugging without Core Dumps" By David Frascone
formatting link
formatting link

13.Anecdotes

From an interview by KernelTrap with Matthew Dillon, a well-known FreeBSD kernel hacker. (link)

I always document code as I work on it, to make it easier both for me and for anyone else working on the system, and I am not shy about putting assertions in the code for conditions that are supposed to be true. I would much rather hit the assertion and panic early then allow an incorrect assumption to slowly corrupt the system. I started doing this in the 4.X codebase and it greatly contributed to our famed stability in 4.0 and later releases. Introduced instabilities, either due to bugs or purposeful assertions, typically lasted no more then a few days. The result of this has had a long term stabilizing effect on the codebase. Even now if someone breaks something horribly in the system there's a good chance their breakage will be noticed quickly due to assertions I and others have strewn all over the VM system. Assertions are good.

Sometimes my 'fixes' are misinterpreted as mistakes. This contributed to some of the friction I had with older developers circa 1998. The most noteable example of this is the VM Page cache. The cache contains several page queues including a 'cache' queue which is only supposed to contain clean pages. The system is allowed to free pages in the cache queue at any time, so a dirty page in this queue could lead to a loss of data. People had noticed that, in fact, dirty pages could wind up in the cache queue. Instead of fixing the problem they instead applied a bandaid in one of the code paths where they noticed the case and then proceeded to move the page out of the queue. This led to at least three bugs in the VM system going unnoticed (or being noticed but not being traceable) for over a year.

When I came across this piece of code I ripped it out without a second thought and then added an assertion to panic the system if a dirty page was ever added to or found in the cache queue. The result was about two weeks of system instability in the development branch during which I found and fixed 3 serious bugs exposed by the assertion, and we've never had a problem with that particular area of the system since. This practice of asserting conditions as a reality check against a documented algorithm is now standard practice in FreeBSD.

This is why I hate bandaids. A bandaid, in the long term, only adds to the instability of a system. The correct solution is to make the code do what it is supposed to do and assert (panic the system) if it does something it isn't supposed to do. You might get a few panics in the short term, but in the long term you solve the problem. Permanently. Bandaids have the effect of causing problems to return and haunt you, sometimes for years. The dirty-cache-page bug was in the system for at least 3 years because of a bandaid.
