I'd like to implement a Power-On Self-Test (POST) to be sure all (or at least many) parts of the electronics are functioning well.
The tests to be done to check external hardware depend on the actual hardware that is present.
What about the MCU itself, which features internal Flash (where the code resides), RAM, and some peripherals? Are there any strategies to test whether the internal RAM or Flash are good? Do you think these kinds of tests could be useful?
What about a test of the clock based on an external crystal?
You can test whatever you think you need confidence in prior to declaring the system healthy enough to boot.
Historically, a small area of ROM was relied upon to contain enough code to verify its own integrity along with the rest of the POST/BIST code's integrity. This was done without referencing RAM (which may be defective).
Some folks would include "CPU tests" to verify the basic integrity of the processor. I think these are dubious as it likely either works or it doesn't ("Gee, it can ADD but can't JMP!"). With more advanced CPUs, you'd likely want to verify the cache, ECC and VMM hardware behave as expected (sometimes this requires adding hooks in order to be able to synthesize faults).
RAM was then tested using strategies appropriate to the RAM technology used. E.g., DRAM wanted to be tested with long delays between the write and read-verify to ensure any failures in hardware refresh mechanisms were given an opportunity to manifest. (In certain hardware configurations, even nonexistent memory can appear to be present and functional if the test is poorly designed.)
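A minimal sketch of the data side of such a test, assuming an ordinary buffer stands in for the RAM under test (on target you'd point it at the real region, and for DRAM you'd insert the long write-to-verify delay mentioned above):

```c
#include <stdint.h>

/* Walking-ones data test on a single cell: catches data bits that are
   stuck high/low or shorted to a neighbor. The pointer is volatile so
   the compiler actually performs each write and read-back. */
int walking_ones(volatile uint32_t *cell)
{
    for (int bit = 0; bit < 32; bit++) {
        uint32_t pattern = 1u << bit;
        *cell = pattern;
        if (*cell != pattern)
            return -1;          /* stuck-at or coupling fault on this bit */
    }
    return 0;
}
```

You'd run this over every word of the region; by itself it says nothing about address decoding, which needs a separate pass (e.g., an address-in-address pattern).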
From there, different peripherals could be tested while relying on the now assumed *functional* ROM & RAM to conduct those tests. I.e., the test application can start to look like a more full-featured application instead of tight little bits of code.
You're at the mercy of the hardware designer to incorporate appropriate hooks to test many aspects of the circuitry. E.g., can you generate a serial data stream to test if a UART is receiving correctly? transmitting? Does your MAC let you push octets onto the wire and see them or is the loopback interface purely inside the NIC?
In the past, I've taken unused outputs and used them as termination voltages for high impedance pullups/pulldowns that I could use to determine if an external bit of kit was plugged into the system. I.e., drive the termination up, then down -- possibly multiple times, depending on what those "inputs" feed -- and see if anything is detected. If not, it is hopefully because the external device is driving those inputs with lower impedance signals. So, test the external device!
You can test for stuck keys/buttons -- if you can rely on the user (or mechanism) NOT to activate them during the POST.
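As a sketch of that idea, assuming the buttons appear as active-high bits in a sampled input word (the sampling mechanism is hypothetical; only the filtering logic is shown):

```c
#include <stdint.h>

/* During POST, with the user instructed to touch nothing, take several
   samples of the button inputs. A bit that reads active in EVERY sample
   is a candidate "stuck" button; a transient press drops out of the AND. */
uint32_t stuck_buttons(const uint32_t samples[], int n)
{
    uint32_t stuck = ~0u;           /* start by assuming every bit is stuck */
    for (int i = 0; i < n; i++)
        stuck &= samples[i];        /* a bit survives only if always active */
    return stuck;                   /* nonzero => those buttons look stuck */
}
```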
You can test for a functional XTAL -- but only if some other timebase (which may be crude/inaccurate) is operational.
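One common shape for that check: count crystal-driven timer ticks over a window measured by the crude timebase (say, the internal RC oscillator) and verify the count is plausible. The comparison logic, separated from any particular timer hardware, might look like this -- note the tolerance has to be wide enough to cover the RC oscillator's own inaccuracy:

```c
#include <stdbool.h>
#include <stdint.h>

/* Is the measured tick count within +/- tolerance_percent of what a
   good crystal should have produced over the measurement window?
   A dead or grossly off-frequency crystal falls outside the band. */
bool xtal_plausible(uint32_t xtal_ticks, uint32_t expected_ticks,
                    uint32_t tolerance_percent)
{
    uint64_t lo = (uint64_t)expected_ticks * (100 - tolerance_percent) / 100;
    uint64_t hi = (uint64_t)expected_ticks * (100 + tolerance_percent) / 100;
    return xtal_ticks >= lo && xtal_ticks <= hi;
}
```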
[I once diagnosed a pinball machine as having a defective crystal simply by observing the refresh of the displays with my unaided eyes -- PGDs appear to vibrate when lit. Had the POST for the machine been able to detect -- and flag -- that, it could have diagnosed itself!]
You also have to decide what role the test will play in the user's device experience; will you flash an indicator telling the user that a fault has been detected (if so, what will the user do)? Or, will you attempt to work around any faults? How reliable will your "indicator" be? Will you want to convey anything more informative than "check engine"?
I've added circuitry to my designs to allow me to dynamically (POST as well as BIST) verify the operational status of the hardware. E.g., every speaker is accompanied by a microphone -- so I can "listen" to the sounds that I'm generating to verify the speaker is operational. And, likewise, so I can generate sounds of particular characteristics to know that my microphone is working! Of course, having those "devices" on hand means I can also find uses for them that I might not have originally included in the design!
In my application, I can move the device to a "testing" state at any time. In this state, I can load diagnostics (once the device itself has verified that it is capable of executing those diagnostics!) to do whatever testing I deem necessary. E.g., if I encounter lots of ECC errors in the onboard RAM, I can take the device offline and run a comprehensive memory diagnostic. Depending on the results of that test, I can recertify the device for normal operation, some subset of normal *or* flag it as faulted.
But, my environment expects the devices to operate "unattended" for very long periods of time, 24/7/365, so I can't rely on the activation of a POST at power-up.
Think hard about the types of failures you EXPECT to see (i.e., many are USER errors!) and don't invest too much time detecting things that will likely never fail OR whose failure you won't be able to do much about.
A lot of testing "requirements" that are specified are completely pointless - or far worse than useless, as they introduce real points of failure in their attempts to cover everything.
First, figure out what you should /not/ test.
Don't bother testing something unless you can usefully handle the failure. If the way you communicate errors is through a UART, there is no point in trying to check that the UART is working. If you have a single microcontroller in the system, there is no point in trying to check the cpu or the on-chip ram. There is no point in checking that you can write to flash or on-chip eeprom - all you do is reduce its lifetime and make it more likely to fail.
Don't write any test code which cannot itself be tested. If you cannot induce a failure, or at least simulate it reasonably, do not write code to check or handle that failure. The reality is that the untested code will have a higher risk of problems than the thing you are testing.
Don't check the ram or the flash of the microcontroller - there's nothing you can do if there is a failure. (You can check that you have successfully loaded a new software update, or that there hasn't been a reset during an update - a CRC for that kind of thing is a good idea.) If you have a system that is important enough that ram or flash failures need to be checked and handled, use a safety-qualified microcontroller with ECC ram, flash, cache, etc., and perhaps even redundant cores (you get these with PowerPC and Cortex-R cores).
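For the update-verification case, a CRC of the stored image compared against a value appended by the build/update tool is the usual approach. A minimal CRC-32 sketch (reflected form, polynomial 0xEDB88320, as used by zlib and Ethernet -- the bitwise version is slow but tiny, which suits a bootloader):

```c
#include <stddef.h>
#include <stdint.h>

/* Compute CRC-32 over an image. After an update, compare the result
   against the CRC the update tool appended; a mismatch means the image
   is incomplete or corrupted and must not be booted. */
uint32_t crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}
```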
And think about what can reasonably go wrong, how it can go wrong, and what can be done about it. Other than for devices susceptible to current surges (like filament light bulbs), most hardware failures are in usage, not while power is off - checking on power-up (rather than while the system is in use) usually only makes sense if it is likely for a user to see there is a problem and try to "fix" it by turning power off and on again.
Testing RAM can be useful, letting the system fail gracefully rather than acting flaky, perhaps just locking up into a tight loop flashing an LED as a fault indicator. Similarly, you could CRC-check the program flash, and fail on an error, preferably falling into a minimal system that allows the user to reflash, but it might mean just bricking.
Note that, as you say, most faults will happen while powered up, but many faults will cause a system crash, which the user is likely to power-cycle to try to clear -- so power-up is a good time to check (since many things are a lot harder to check while the system is running in operation).
Have you ever seen microcontroller RAM that failed? It's a possibility for dynamic RAM on PCs that is pushed to its limits for power and speed, and made as cheaply as possible. But for static RAM in a microcontroller, the risk of failures is pretty much negligible. The exception is if a bit is hit by a cosmic ray (or other serious radiation), which can flip a bit, but that won't be detected by any RAM test of this kind.
Testing RAM is useful /if/ it can fail, and /if/ you can do something useful when it fails. (I agree that "a tight loop flashing an LED" might count as something useful, depending on the situation.)
I've seen "safety standards requirements" that included regular ram tests. Such requirements generally originate decades ago, and are not appropriately nuanced for real-life systems. I've seen resulting code used to implement such tests, added solely to fulfil such requirements. And I've seen such code written in a way that is untested and untestable, and in a way that has risks that /hugely/ outweigh those of a fault occurring in the on-board RAM.
If the OP is in the situation where there are customer requirements for fulfilling certain safety requirements that include ram tests, and where "mindlessly obeying these rules no matter how pointless they are in reality" is the right choice to please arse-covering lawyers, then go for it. If not, then think long and hard about the realism of such a failure and such a test, and whether it is truly a positive contribution to the project as a whole.
The possibility of a flash failure is a great deal higher than that of a RAM failure. Flash writes are analogue - a bit can be written in such a way that it reads back correctly at programming time, but goes outside the margins over time or at different temperatures or voltages. So yes, sometimes a CRC of the flash is worth doing. But remember that the program doing the check is just as much at risk of such failures (perhaps even more so, if you have a "boot" program that does the check of the "main" program, as the boot program is less likely to be updated and thus its bits will have decayed over a longer time).
If flash fails are a real risk, and the system is important enough, it's better to pick a microcontroller with ECC flash.
Yes, I mentioned that. (It assumes the embedded system has a user that can do such a power-cycle.)
Exactly. If memory is expected to work -- and NEVER expected to fail -- then it's a small cost to make some attempt to prove that is actually the case. Otherwise, when that "Can't Happen" actually does, you're left clueless.
[In the 70's, a common system failure I encountered was an address decoding error which would effectively disable all memory (think misprogrammed PLA). It was readily apparent as the processor would be found halted at ~0x0076 (IIRC) -- 0x76 being the opcode for HALT which would be the low byte of the address still "floating" on the multiplexed address/data bus. Nowadays, one can imagine similar failures -- including grown defects -- deleteriously affecting deployed product.]
You're assuming that there is only one, predetermined way to get into the self-test routine. And, that nothing in the machine has failed that would render that assumption false.
At each point in your code, you should know what assumptions are safe and which are yet to be proven/made safe. If you're in the self-test routine, you shouldn't have to wonder if memory works, is configured as you expect it to be, etc. ("Hey, I'm running code so why bother to TEST the code image??") Assuming that the memory is operational NOW (while I am executing this piece of self-test code) is a hazard waiting to happen.
For example, an errant RETURN could land the program IN the self-test code WITHOUT the benefit of having been through the controlled, repeatable start-up sequence. (i.e., the RAM -- or other resource -- may NOW be mapped to a different location in the address space such that the code written under the assumption that it resides in its "power on reset" configuration no longer works properly.)
I'd rather have that code FAIL and report the error to me -- because it tried to verify some assumption(s) and failed -- than have the code continue to operate FROM THERE on the assumption that it is actually (later) talking to functional RAM that has yet-to-be reconfigured. Otherwise, you get a "fluke" that you can never resolve (and, because you can't easily sort out what might have happened in order to reproduce and repair, you shrug it off due to time pressures -- even though YOU SAW IT FAIL!).
[I have an entry point in all of my products called RESET. It manually and deliberately works to restore the hardware to the same condition that it was in just after the application of power. So, any code that executes after passing through that entry point -- to "RESETTED" -- SHOULD behave the same regardless of whether power was just applied, or not]
Note that there's a difference between the sort of "confidence testing" that occurs at POST (how many devices perform exhaustive tests at POST? How many users would tolerate that sort of delay?) and "diagnostic testing" which truly provides an assessment of the health of the device and can often be used to assist in determining the need for replacement (or, for self-healing).
In most cases, you can test RAM with a single write pass followed by a verification read pass and be reasonably sure that you've caught stuck-at failures as well as decode failures -- no need for a whole barrage of different tests when you're typically looking for a simple Go-NoGo.
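One classic single-pass pattern for exactly that Go-NoGo purpose is "address-in-address": write each word's own address into it, then read-verify. A sketch, host-runnable here on an ordinary buffer standing in for the RAM under test:

```c
#include <stddef.h>
#include <stdint.h>

/* Write pass stores each word's own index; verify pass reads it back.
   Aliased or open address lines make two cells share the same storage,
   so the verify pass finds the LATER value, not the expected one --
   catching decode faults as well as simple stuck-at data failures. */
int address_in_address(volatile uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++)
        base[i] = (uint32_t)i;            /* single write pass */
    for (size_t i = 0; i < words; i++)
        if (base[i] != (uint32_t)i)
            return -1;                    /* decode or stuck-at fault */
    return 0;
}
```

A second run with the values complemented (~i) additionally catches bits stuck at whatever the first pattern happened to write.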
[I run three passes on a 512MB block and use that as a crude assessment as to whether or not the memory will LIKELY accept a program image. Installing -- and verifying -- that image acts as a further test of the memory's crude functionality. Thereafter, I swap pages of memory out and exercise them to verify that I'm not seeing an increase in ECC activity in a particular region -- which I will remap if need be.]
You also need to know how the device is fabricated; a memory module will experience different errors than memory that is soldered down (and, in the latter case, you have to be prepared for the memory to NOT be what you THOUGHT it was going to be, at design time). And, soldered-down memory will behave differently from chip-on-chip.
Folks write ONE memory test and then assume all memory behaves (fails!) the same.
If you don't understand your hardware and how it can fail, you shouldn't be the one who is designing the test suite!
That bit is correct. The rest - well, I don't want to get into a long and protracted argument.
Any system is made up of layers. Higher level layers assume that lower level layers work according to specification (which may include indicating an error for some kinds of detectable fault). If you think the higher level part can fully verify the lower level parts - "prove" that the assumptions hold - you are fooling yourself. When you design a system based on a microcontroller, you pick a device that is as reliable as you need it to be - so that you /can/ assume the core parts (cpu, ram, flash, interrupts, etc.) work well enough for your needs. If you are not sure it is reliable enough, pick a different device or make a redundant system.
No amount of testing can /ever/ prove that something works - it can only prove that something does /not/ work.
That might not exist. E.g. it's common for security processors and software to continuously self-test while running, since the user might be trying to tamper with them. "Differential fault analysis" is a relevant search string. The attacker does stuff like intentionally overclock the processor in the hope of introducing errors, so they can observe the difference between the error result and the normal result, and infer stuff about the supposedly-secured info inside the processor. There is no magic way to defeat these attacks, but the cpu designers do what they can.
I think I have seen it once; the part had gotten electrically stressed in debugging and one of the banks of internal RAM failed. We had only put the test in because the unit was going to be in critical infrastructure where certain types of malfunctions could present dangers to people in the area.
It is also possible that many failures might trip a watchdog that forces a reset, and the unit then finds the fault and locks itself 'safe'.
Security against deliberate attacks is a completely different ballgame.
If you have made a system where an attacker can cause processor overclocking, and such attacks are realistic, then you need to put whatever checks, tests and mitigations are needed to deal with that situation.
If there are no feasible scenarios where the processor clock can be suddenly increased to the point where hardware or software becomes unreliable, then any tests or handling of such a situation is worse than useless. You are just adding more stuff that can go wrong (or be attacked), without any benefits.
Presumably you are careful about keeping the systems that developers have potentially broken separate from the systems that get delivered to customers. (Another possible cause of this kind of failure is ESD damage. Production departments are usually a lot more meticulous about ESD than developers.)
If you have a system that is safety critical, you have to do an analysis of the risks of things going wrong, the consequences of those failures, and how these (risks and consequences) can be reduced or mitigated. If you figure out that static failure of the memory is a risk, then testing can be worth doing. You might also decide that ECC ram, or redundant devices, or external monitors are a better solution. There's no fixed answer.
That is definitely possible. But again, be very careful with watchdogs -- watchdog handling code is rarely properly tested, because it is handling situations that don't occur. (Usually it /can/ be tested, but that does not mean it /is/ tested.)
I wasn't saying that such a test does make sense, but that such a test CAN be done reasonably, if for some legal/political reason it is introduced as a requirement. I brought up the example to show that this type of error CAN occur. Yes, unless some externally imposed requirement says to test internal RAM, I am unlikely to add such a test for a production system (I have at times done it in development, mostly to confirm that I understand the limitations and operation of the device).