I have been having some private email discussions about this thread, and an example that first brought home to me the impact of standard reliability math on software came from an early software project that I had a contract for.
This project involved fixing daisy wheel printer software that someone had written as a strange state machine; it was losing characters both on paper and in the serial protocol. The printer had evolved over time to gain more features and support faster serial speeds, and it had what appeared to be random failures, resulting in increasing customer complaints. Among other things, the printer was being used to print banking check masters.
The fix was a total of maybe 100 lines of code (out of about 20k lines) that changed the scheduler. The original was truly round robin, with about 75 items in the loop, many of them duplicated because they needed to be serviced in less time than one full pass of the loop took. Most items, most of the time, didn't execute, but sometimes many did at once, all of them contributing to the serial communication irregularities and timing problems for everybody.
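For anyone who hasn't met the structure: that original loop was, in effect, a flat table of service routines walked end to end, with the time-critical entries listed more than once so they were polled often enough. A minimal sketch of the shape in C follows; the real code on an F8 would almost certainly not have looked like this, and every name and the entry count here is hypothetical, illustration only:

    /* Stubbed service routines for illustration only; the real loop had
       about 75 entries, each checking a flag and usually returning at once. */
    static void poll_serial(void)     { /* poll UART, maybe read a byte */ }
    static void feed_print_head(void) { /* advance daisy wheel if needed */ }
    static void step_carriage(void)   { /* move carriage if needed */ }
    static void check_paper(void)     { /* paper-out sensor, etc. */ }

    typedef void (*task_fn)(void);

    /* Time-critical entries appear more than once so they are serviced
       in less time than one full pass of the loop takes. */
    static const task_fn tasks[] = {
        poll_serial,
        feed_print_head,
        poll_serial,                  /* deliberately duplicated */
        step_carriage,
        poll_serial,
        check_paper,
        /* ... */
    };

    int main(void)
    {
        for (;;) {
            /* Pass time varies wildly: usually almost nothing has work,
               but occasionally many entries do at once. */
            for (unsigned i = 0; i < sizeof tasks / sizeof tasks[0]; i++)
                tasks[i]();
        }
    }

The failure mode falls out of the structure: the pass time depends on how many entries happen to have work on that pass, so even the duplicated serial entries get serviced at wildly varying intervals.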
The serial communication was separated from the printer parts and two co-routine loops were created. The implementation had no interrupts. Instead of executing all the possible printer functions on each pass, I executed one printer function per pass, dropping the 75 count back to about 30 by eliminating some boolean flags and some duplicated and split functions. The printer actually ran faster than before, with about 30% of the processor (an F8; this does go back a while) still available. Most of the code was unchanged (about 0.5% of the code was changed). The primary failure was timing, plus in a few cases random combinations of the order of execution of events. The two conflicting timing parts, the serial data and the actual print functions, were isolated so they could not interfere with each other.
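For comparison, a minimal sketch of the restructured shape, again in C with entirely hypothetical names and stubs; the points it is meant to show are the two separated co-routine loops and the single printer function per pass:

    /* Stubbed hardware helpers and printer steps, illustration only. */
    static int  serial_byte_ready(void)     { return 0; /* poll UART */ }
    static unsigned char serial_read(void)  { return 0; }
    static void buffer_put(unsigned char c) { (void)c;  /* enqueue */ }

    static void advance_daisy_wheel(void) { }
    static void fire_hammer(void)         { }
    static void step_carriage(void)       { }
    static void feed_paper(void)          { }

    typedef void (*printer_fn)(void);
    static const printer_fn printer_tasks[] = {
        advance_daisy_wheel, fire_hammer, step_carriage, feed_paper,
        /* ... about 30 entries after the rework ... */
    };

    /* Serial co-routine: one small, bounded unit of work per call. */
    static void serial_side(void)
    {
        if (serial_byte_ready())
            buffer_put(serial_read());
    }

    /* Printer co-routine: exactly ONE printer function per call, so a
       pass of the main loop has a short, predictable worst-case time. */
    static void printer_side(void)
    {
        static unsigned next = 0;

        printer_tasks[next]();
        next = (next + 1) % (sizeof printer_tasks / sizeof printer_tasks[0]);
    }

    int main(void)
    {
        for (;;) {              /* no interrupts: pure cooperative loop */
            serial_side();      /* serial is looked at on every pass    */
            printer_side();     /* printing advances one step per pass  */
        }
    }

Because each pass does one bounded unit of serial work and exactly one printer step, the worst-case loop time becomes short and predictable, so the serial side is always serviced within its deadline regardless of what the printer happens to be doing.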
The fix actually took less than a week, and rigorous testing on the site of a particularly critical (and knowledgeable) customer confirmed the approach.
Walter Banks
Byte Craft Limited