I worked on a project substantially larger than a single microcontroller, but the idea we applied might still be appropriate. We took a very hard line on this: the charter of the group was that no bugs were going to be delivered to customers. In some of the functions we wrote, it was feasible to add one or a few "sanity checks", small tests that evaluated whether the arguments being passed and/or the state variables had values that were appropriate at that moment.
If a sanity check failed we displayed "Fatal Error nnnnn", where nnnnn was the program counter at the point where the check failed, and then we halted the processor.
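As a rough illustration of what such a check can look like in C (a sketch, not the code from that project): `SANITY_CHECK`, `fatal_error`, `set_motor_speed` and the number 1001 are all made-up names and values. The sketch also assumes an explicit number per check rather than the program counter, and it records the failure instead of halting so that it can be run on a PC.

```c
#include <stdint.h>
#include <stdio.h>

/* Number of the most recent failed check; 0 means none has failed.
 * This bookkeeping exists only so the sketch can be exercised on a host. */
static uint32_t last_fatal_code = 0;

/* Hypothetical handler. On a real target this would put the number on the
 * display and then halt the processor; here it only records and prints it. */
static void fatal_error(uint32_t code)
{
    last_fatal_code = code;
    printf("Fatal Error %05lu\n", (unsigned long)code);
}

/* Each check is identified by an explicit number (an assumption; the scheme
 * described in the text used the program counter instead). The `return` is
 * reached only in this host build, because on target fatal_error() would
 * never come back. It limits this version to functions returning void. */
#define SANITY_CHECK(code, cond)        \
    do {                                \
        if (!(cond)) {                  \
            fatal_error(code);          \
            return;                     \
        }                               \
    } while (0)

static int motor_speed_percent = 0;

/* Example function: the caller must pass a speed between 0 and 100. */
void set_motor_speed(int percent)
{
    SANITY_CHECK(1001u, percent >= 0 && percent <= 100);
    motor_speed_percent = percent;
}
```

With this in place, `set_motor_speed(150)` reports check 1001 and leaves the stored speed untouched, which is the kind of immediate, traceable failure described above.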
This had a number of interesting and sometimes unexpected consequences. First, nobody wanted to be the one responsible for passing bad data to someone else's sanity check, and that seemed to make people much more careful about what they passed. Second, it became very popular to craft these checks carefully, so that you would not be the one blamed for a failure.
Third, in an embedded environment where everyone is in a panic to get the work done, when the box just locks up and you know it will take hours to figure out what happened, it seems much more reasonable to hit the reset button and get on with your own work. But when "Fatal Error nnnnn" pops up and within seconds you can look at the build file and tell exactly where the error happened and which sanity check failed, you are much more likely to yell "FATAL ERROR NNNNN!" over the wall. Everybody on the team would cringe, hoping they were not the one who had just called that function with bad data. And the person who had observed the failure and the person who had inserted that sanity check were both "the good guys." This soon led us to add a sanity check whenever the box crashed in some strange way and it took us hours to realize we had not been catching some bad case.
But this then led us to being able to test in a novel way. We wrote some code on a test harness that would hammer the box with random input. It would poke buttons and send in commands and present data, pretty much completely randomly, but at 100 commands/second! Within seconds of trying this a check blew up and we had another Fatal Error nnnnn. But that let us find and fix an oversight quickly. After a number of iterations we were to the point where this would run all weekend with zero failures.
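A harness along those lines could be sketched roughly as follows. This is an illustration under assumptions, not the original test code: the event types, the `inject_fn` callback, `hammer` and the two toy "boxes" are invented for the example, and the pacing to a fixed number of commands per second is left out. Using a seeded generator is a design choice of the sketch, so that a failing run can be replayed exactly.

```c
#include <stdint.h>

/* Hypothetical kinds of input the harness can throw at the box. */
enum event_kind { EV_BUTTON, EV_COMMAND, EV_DATA, EV_KIND_COUNT };

struct event {
    enum event_kind kind;
    uint16_t value;
};

/* Delivers one event to the box. Returns 0 if the box is still fine, or the
 * Fatal Error number if a sanity check fired. */
typedef uint32_t (*inject_fn)(const struct event *ev);

static uint32_t rng_state = 1;

/* xorshift32: a small deterministic generator; the state must be nonzero. */
static uint32_t rng_next(void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng_state = x;
    return x;
}

/* Sends up to max_events random events. Returns how many were delivered
 * before the first failure (max_events if none failed) and stores the
 * reported error number, or 0, in *fatal_code. */
unsigned long hammer(uint32_t seed, unsigned long max_events,
                     inject_fn inject, uint32_t *fatal_code)
{
    unsigned long sent;

    rng_state = (seed != 0) ? seed : 1;
    *fatal_code = 0;
    for (sent = 0; sent < max_events; sent++) {
        struct event ev;
        ev.kind = (enum event_kind)(rng_next() % (uint32_t)EV_KIND_COUNT);
        ev.value = (uint16_t)(rng_next() >> 16);
        *fatal_code = inject(&ev);
        if (*fatal_code != 0)
            return sent;
    }
    return max_events;
}

/* Toy stand-in for a box with no remaining bugs. */
uint32_t box_ok(const struct event *ev)
{
    (void)ev;
    return 0;
}

/* Toy stand-in for a box where check 2002 fires on certain button values. */
uint32_t box_fragile(const struct event *ev)
{
    if (ev->kind == EV_BUTTON && (ev->value & 0xFFu) == 0x42u)
        return 2002u;
    return 0;
}
```

Calling `hammer` twice with the same seed and the same box gives the same event count and error number, so a failure found over a weekend run can be reproduced on demand.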
Then the decision was made: we were going to leave all of these checks in the code, and enabled, when we shipped. Another team, working across the wall on a similar product, was horrified: "You don't want your customers to know you have BUGS, DO YOU?!?!" Our reply was that the customers were going to know one way or the other. We shipped. And we waited. And we waited. The checks had apparently made us find almost all of the bugs before the product went out the door.
One afternoon I did get a call from the marketing rep. He had a message from the marketing secretary. She had a message from the receptionist. She had a call from Hughes. They had been using this and it had popped up "Fatal Error nnnnn" and just locked up. They were so astonished that they went over to another building, got a camera, brought it back and took a picture. Then they called. And I got nnnnn from 1500 miles away. In 30 seconds I knew which check had failed, knew that it was a single variable, knew it must have been out of range and I could now hammer the box until I could figure out a way to find and fix that. I did.
After 18 months, with 2000 units in the field being used pretty much full time, customers had reported 3 Fatal Errors. I believe those were pretty much all that were ever seen, because the manual told customers that if they ever saw this message they should call a given phone number and tell us the number so we could fix it for them. I found and fixed those 3, plus a number of others that I knew about but that no customer would likely ever see.
The guys across the wall had ten times the support team and didn't even bother with bugs that didn't crash the box, and if one did, they just cycled the power and went on. I even tried to get marketing to run a campaign in which I'd PAY customers for the first Fatal Error found. They squashed that because it would have made the other team look bad.
One other thing helped with the sanity checks: we filled all memory with 0xAAAA initially, and again whenever some memory was released. That oddball value was unlikely to be a reasonable value for most state variables, so stale or uninitialized data was more likely to trip a sanity check.
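To make the memory-fill idea concrete, here is a small sketch with a fixed-block pool. Everything in it is assumed for illustration: the pool layout, the block sizes, the function names and the "mode" state variable with a valid range of 0 to 3 are not from the original product.

```c
#include <stddef.h>
#include <stdint.h>

#define POISON_WORD  0xAAAAu
#define BLOCK_WORDS  8
#define BLOCK_COUNT  4

/* Hypothetical pool of fixed-size blocks of 16-bit words. */
static uint16_t pool[BLOCK_COUNT][BLOCK_WORDS];
static uint8_t  in_use[BLOCK_COUNT];

/* Overwrite a region with the poison value. */
static void poison(uint16_t *mem, size_t words)
{
    size_t i;
    for (i = 0; i < words; i++)
        mem[i] = POISON_WORD;
}

/* Fill all pool memory with the poison value at start-up. */
void pool_init(void)
{
    size_t b;
    for (b = 0; b < BLOCK_COUNT; b++) {
        poison(pool[b], BLOCK_WORDS);
        in_use[b] = 0;
    }
}

/* Hand out the first free block, or NULL if the pool is exhausted. */
uint16_t *pool_alloc(void)
{
    size_t b;
    for (b = 0; b < BLOCK_COUNT; b++) {
        if (!in_use[b]) {
            in_use[b] = 1;
            return pool[b];
        }
    }
    return NULL;
}

/* Release a block and poison it again, so stale pointers read 0xAAAA. */
void pool_free(uint16_t *block)
{
    size_t b;
    for (b = 0; b < BLOCK_COUNT; b++) {
        if (pool[b] == block) {
            poison(pool[b], BLOCK_WORDS);
            in_use[b] = 0;
            return;
        }
    }
}

/* Range test for a hypothetical state variable whose valid values are 0..3.
 * The poison value 0xAAAA is far outside that range. */
int mode_is_sane(uint16_t mode)
{
    return mode <= 3u;
}
```

A freshly allocated block that nobody has initialized, or a block read after it was freed, holds 0xAAAA in every word, so a range test like `mode_is_sane` rejects it instead of quietly accepting a plausible-looking leftover value.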