Detailed article about Mars Rover falure in EE Times

- J
- Jim Stewart
  
  Contact options for registered users
posted
20 years ago

Sun, Feb 22, 2004 5:56 AM

formatting link

- S
- Steve at fivetrees
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Sun, Feb 22, 2004 10:57 AM

Thanks for this.

we'd told it to do," Klemm lamented.

- J
- Janvi
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Sun, Feb 22, 2004 12:21 PM

that is a good article but I believe many many more things went wrong If you read this link from February the 17th:

formatting link

simply click to the "View all oportunity images from this press release" found below the right hand located photo. This will change the release date of the event from February the 17th to January the 17th. If you go on reading below "The Road Less Traveled" they obviosly exchanged the

1,4 meter distance to 4.6 meter instead 4.6 feet. The world is not only laughing about ESA.

Thinking more "interplanetary assistance", most people on the earth do not have any access to the internet and even we cannot achieve to have sufficient potable water to survive in many huge areas ...

- C
- CBFalconer
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Sun, Feb 22, 2004 1:28 PM

I would class it more as operator error. They allowed the first half of a composite command to execute and leave the system in a critical state, even though the second half had not been uploaded. Sounds like something E. Robert Tisdale might do.

The design problem seems to have been more in the creation of that composite command in the first place. There must have been safer paths.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
     USE worldnet address!

- S
- Scott Moore
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Tue, Feb 24, 2004 7:54 AM

And failed for the standard reasons.

First, the case of being out of memory was not extensively tested for. That is, one test would have been to fill the file system with nonsense files and check the behavior during the full condition, with frequent program faults.

Second, there was no planned fallback action for the software to perform. A fault was generated, and the "program" (actually, more appropriately termed a "task" here), "faulted", which simply means to terminate. No attempt to suspend it until file space freed up, no attempt to do anything about the file full condition, etc.

Third, the system was allowed to create a directory system that would not have room enough to load during a restart condition.

These are all standard failure modes for off the shelf software. The case of "edge conditions" are inadequately checked for. Windows used to crash very reliably when placed deliberately in a full ram or disk enivronment. I suspect the fact it no longer does has more to do with virtual memory than any program improvement, but it remains a fact that most systems, even major operating systems, don't do well with their disk full or nearly full.

Further, programmers rarely think of the consequences of hitting an error such as file system full. The current program/task can be killed, but this is not only likely to continue to fail programs, but may make it worse if simply starting these doomed to fail programs takes more memory. The answer is to DO something about it. If the OS does something about it, the program/task need not even know about it, it can be held in suspend until the problem is cleared.

I would guess for the above situation, dumping files by either age or priority, or both would be appropriate.

I am not stating this to "prove I am a genius", but simply to state what I have stated all along: the current state of software is NOT good, and using industry standard practices and operating systems is not the path to reliability.

- S
- Steve at fivetrees
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Tue, Feb 24, 2004 9:41 AM

I totally agree.

In the early days of NASA, Bob Gilruth (NASA head) used to implore his guys to "keep it simple". The simpler the system, the fewer failures modes needed to be considered. Fairly elementary stuff.

As I've said before, as a hardware/software engineer I'm fascinated by the difference between these two disciplines. In hardware, we're fairly mature, and we're adept at "complexity management". In software, we seem determined to throw out all the lessons learned and start over - in an undisciplined and sloppy kind of way (so far). We approach it horizontally when we should be thinking vertically i.e. hierarchically.

Time and time again I have seen projects compromised by a decision to "save time" by buying in an RTOS, or basing the product on Windows CE, etc etc - decisions which increase the complexity (and hence failure modes) of the product by leaps and bounds. When comprehensibility is reduced, so is reliability. Bob Gilruth would turn in his grave - from his obit at

formatting link

things are important," [Alan Bean] said. "With the quality control and documentation, you had the history of everything and could lay your hands on it in a flash."

- B
- Bob Stephens
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Tue, Feb 24, 2004 3:06 PM

I agree with your points about the importance of structured, disciplined design as opposed to hack and patch, but given the fallibility of humans in general, and NASA in particular - design by the lowest bidder - I am surprised that some redundancy wasn't built in to accomodate a catastrophic unanticipated failure mode. I realize that weight is at a premium, but even so.

Bob

- S
- Scott Moore
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Tue, Feb 24, 2004 7:03 PM

I am also cross discipline, and this is really key to getting better software. In the old days, you could say that hardware is "simply simpler" than software. But nowdays, the two disciplines are converging. Verilog/vhdl designs are reaching or have reached massive complexity, and appear as complex text software descriptions of hardware. Even C occasionally gets compiled to hardware. So the question is not academic: "why does hardware quality lead software quality by such a large margin". The answer is simple to anyone doing hardware work today. The mindsets are completely different. Here are some of the points:

Hardware is completely simulated. Although some PCB level (Printed Circuit board, or multiple chip level) designs remain unsimulated, virtually no designers just "roll and roll" the dice by trying design iterations emperically, even using FPGA chips that make such behavior possible (FPGAs can be downloaded and run in a few minutes time).Hardware engineers know that simulation delivers better observability and consistency than emperical tests.
Hardware is fully verified. Chip designs are not considered done until all sections have been excersized and automatically verified. Tools exist for hardware to discover sections that have not been excersized by the tests, and more tests are added until "%100 test coverage" is acheived.

There is interest in applying these methods to software, and it can be done. Profilers can find if sections of code have been run, even down to the statement and machine instruction level. Automatic test methods are not making as much progress, but there is no fundamental reason why they won't work. Finally, software engineers need to understand that ANYTHING can be simulated. There is far too much temptation to simply defer testing until the hardware comes back. But this serializes software and hardware development, and I believe it significantly degrades software reliability by deferring simple software bugs (I.e., not related to timing or interface issues) to the environment with the least potential for observability and automatic verification.

"Hackaneering" :-)

I would only add that the counter to "it takes good programmers" idea is that, certainly, the programers must do the job, but there is also the idea of "best practices". This is the idea that good programmers produce best work by adopting the best practices they can. For example, a type safe language does not make for automatic quality, but adding a type safe language as a tool for a good programmer, along with other best practices like simulation, automatic verification, modularity and other things will allow the maximum reliability to be achieved.

- S
- Scott Moore
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Tue, Feb 24, 2004 7:08 PM

You mean like a complete 2nd processor ? I suspect that NASA decided that this covers hardware problems, but not software problems, due to having the same software run on both processors. I suspect the correct answer would be to have three voting processors whose software was written by three totally separate groups who were disallowed to communicate with each other (the groups that is). Of course, that's probably a recipe for high costs !

- M
- Michael N. Moran
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Tue, Feb 24, 2004 11:36 PM

reaching

descriptions

The answer is definitely *not* simple.

Without getting into a theological discussion of why software

*is* more complicated than hardware:

o Hardware is usually applied to well bounded problems, often with the unstated idea that the software will magically fill in the gaps.

o Software interfaces vary much more widely than the ones,zeros and clocks of digital logic.

designers

Simulation allows emperical testing. Without vectors (test harness), there is no test.

I agree that simulation is valuable for executing test harnesses before the target hardware is available.

One of the nice things about the trend toward using Linux in embedded systems is that much of the application (and even driver) work can often be done on a PC improving the development throughput.

Usually, the difficult thing about using simulators for developing embedded software is that much of the software must interact with the target hardware, and most target simulators don't provide a good way to model the hardware behavior. Even if the simulator

*does* provide a hardware modelling method, building the models is time consuming and error prone.

hardware

Hmmm. If hardware were "fully verified" I would expect

*much* shorter errata sheets ;-)

statement

Many of us have an interest in improving quality, unfortunately, there are many counter forces.

o Apathy and narrow mindedness of the engineers. o Business pressures. o Lack of experience, both individually and collectively.

Notice that the first bullet places the blame squarely on the shoulders of the practioners.

Just read this news-group and you will find endless presentations by those who claim that they know the answer, and that most of their colleagues are fools. C rocks! C++ sucks! Real men write assembler! RTOS users are fools! You get the idea.

I agree that being consistent and using a "best practice" approach (reducing the number of variables) is an excellent way to improve the stability of any software. However, this can also lead to stagnation and narrow mindedness.

Under-stating and over-simplifying the problem of software by saying "all-you-have-to-do-is-xyz" does not contribute to a solution.

There is no substitute for experienced engineers. If they're experienced enough, then they're in it for love ... not money ;-)

Lets not forget the most important best practices of all: Solid requirements, careful analysis and a design that covers the temporal aspects as well as procedural aspects.

Wouldn't it be nice if those things took as little time as many *think* they take! :-)

--
Michael N. Moran           (h) 770 516 7918
5009 Old Field Ct.         (c) 678 521 5460
Kennesaw, GA, USA 30144

"... abstractions save us time working, but they don't
  save us time learning."
Joel Spolsky, The Law of Leaky Abstractions

The Beatles were wrong: 1 & 1 & 1 is 1

- S
- Steve at fivetrees
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Wed, Feb 25, 2004 5:54 AM

Just a few points:

FWIW, that's not my position. The right tool for the job etc. However, I do notice a tendency to overcomplicate. Recently I worked on a project that, quite typically, had grown out of proportion over the years and no-one now fully understood. It had been based on an RTOS when a simple round-robin scheduler would have been adequate - and far more comprehensible. In this case, as in many that I've had first-hand experience of, there were none of the real justifications for using an RTOS. IME, this is far from unusual.

Agreed.

If you mean that a true craftsman is always ready to revise and refine his definition of "best practices", then I agree.

Not sure about this one. As designers, I feel our job is to reduce complexity into a collection of simple elements, with simple well-defined interfaces and no side-effects. Sorry to belabour this point, but I do see it being missed far more often than makes sense.

I am *not* saying it's easy, BTW. Good decompositional skills are, I believe, the hardest thing to learn, and, in my experience, far more rare in the software domain than I'd reasonably expect. I think this is significant - in hardware many of the subassemblies are available already partitioned (ICs, discretes, modules, products). Not so in software (usually).

Now *this* is an interesting point. I have noticed that one of the side-benefits of a simple, comprehensible design is a reduction in implementation time. The design time is increased, but the overall time is reduced - often considerably.

Steve

formatting link

- S
- Scott Moore
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Wed, Feb 25, 2004 9:10 AM

I guess to keep this from being a "do as I say, not as I do" discussion, I should outline how I use these principles in my own projects. The proprietary details have been removed.

Current project. Large, actually stunningly large (100's of thousands of lines), written in stages since 1980, and maintained since.

o Currently a Windows XP based system, formerly other computers. o Written using type safe language. o No simulation, since there is no target (not an embedded system). o Generous use of "assert" type constructs. I long ago learned to check virtually every possible bad condition, even if done redundantly. Now it is very rare not to have even serious problems trip one of my asserts, which result in a message giving the exact location, in source, of the fault. o Formal testing methodology. A series of extensive test files give automatic coverage of software faults. After that, an extensive series of real world examples are run for further testing.

Results: Virtually all errors are caught in a sensible way. The errors that don't result in asserts are then caught by the type protections in the language. Program development proceeds virtually without system faults from the operating system, even on new, untried sections of the code.

Last project:

o Embedded to new hardware, which used IBM-PC standard hardware. o Written using C. o Simulation of chip related tests. Basic workability of test platform was performed by "emulating" the full software on a standard PC, made possible by the commonality of target and PC hardware. However, one version used POSIX I/O and worked on both Windows and Linux, and was used to give a complete preview of the system, long before hardware was even designed. o Formal test methodology was used for chip tests, which ran both against a full hardware simulation, then transferred to real hardware. Test platform was fully scriptable, and allowed for building complete regression tests.

Results: full functionality with real hardware in 2 weeks after hardware proved functional.

Before that:

o Embedded, custom platform. o Written using C. o Full simulation, and test of code in simulation environment. An arrangement was used where the code used to compile the hardware chips in Verilog was coupled with a CPU model that executed the code. The result was a full simulation with real code, against the actual code used to construct the hardware. o Formal test methodology. TCL was used to drive the system under test for full regression testing.

Results: We brought up several platforms. It was very common to have the software running the same DAY as hardware declared the unit to be running.

Final comment: I design in whatever language my client wants. Personally, however, I find that projects proceed twice as fast using type safe languages. The development time to write the code is the same, and most debugging occurs at the same pace. However, it has been my experience that C, and probally all type unsafe languages, will throw out several problems per year that consume more than one week to solve, such as lost pointer errors or array overruns. These problems typically cause random serious schedule slips to occur. The occurance of these types of problems have effects beyond just the problem itself. For example, I typically run a much tighter circle of write-to-test for C, because I want to know if I have a serious fault show up a better idea of just what code might have introduced this fault. This kind of defensive programming costs development time. Also, debug of type safe languages occurs faster because of the better diagnostics that occur with even minor errors.

Because I design perhaps %70, and design 10's to 100's of thousands of lines each year typically, I don't believe that the above effects occur because I have "better knowledge" of one language or another. On the contrary, that should favor C. I also don't believe that it is an effect of programming knowledge in general, since I am often the debugger of choice for serious system errors, especially in compiled low level code, since I have written several compilers, including a C compiler.

Do I make much progress convincing my clients to use type safe languages ? No, unless you consider my work with TCL or Perl. Since I am a low level (close to hardware) specialist, I rarely even see requests for C++ code, much less anything higher level than that. It is my experience that C is usually picked for projects without even debating or even asking the programmers what they would like to use.

I like to be paid, so I don't bring language issues to work unless asked. Its enough work just to try to get my clients to use modular concepts. However, I would draw the line, and have drawn the line, at systems critical to human life. I would most certainly avoid working on any life support system such as medical, aircraft naviagation, etc., if it were performed in C.

- S
- Steve at fivetrees
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Wed, Feb 25, 2004 10:03 AM

"Scott Moore" wrote in message news:XDZ_b.118983$jk2.515792@attbi_s53...

Interesting. My "angle" is just slightly different.

- My background is primarily true embedded products (firmware) in a market (process control) where there is *no* tolerance of s/w bugs: if it fails, it's broken, and the result might be lawsuits and/or product recalls. - I consider defensive programming (within reason) to represent good value for money - it usually saves time further down the line, and as a hardware engineer, I still embrace the catechism that debug costs escalate at every stage. - I tend to write my own "type-safeness" into the application ;). That is, I use strong typing (more than C actually supports) and I explicitly check bounds etc. Anything a language can do, I figure I can do too, but with more insight into what's actually going on at runtime. (Which is one reason I'm not a fan of C++.) - I almost never run into stray pointers etc, whether I'm using C, assembler or whatever. When I do, it's at an early stage - like you I use asserts etc, along with a variety of other means of making oversights jump out at me. - Many of my applications are close to life-critical (in all but legal terms), and most are certainly mission-critical. Beyond the process control work, I've written safety monitoring applications involving naval navigation, fire alarm reporting, and personnel-at-risk monitoring, amongst others. I wouldn't be able to sleep nights if I had anything other than 100% confidence in them ;). - As I've said here before, I avoid debugging. I hate it ;). Instead, I basically write lots of small, trivially simple elements, test them individually and collectively, and ensure that *all* runtime errors are caught and dealt with sensibly. Many of my colleagues find this process strange and tedious - but I find it far more constructive than debugging, which can only yield empirical results.

I apologise if I'm blowing my own trumpet, and reiterating points I've made before in other threads. I'm genuinely curious as to why my bug count is so low when the average for the industry is so high. It ain't because I'm a genius - if anything, I like things simple because I'm *not* a genius. I assume it's because I was a hardware engineer first, and have been trained and indoctrinated in *engineering*. Possibly also the exposure to mission-critical apps, which has forced me to find ways of making software robust.

NASA, sounds like you need me ;).

Steve

formatting link

- R
- Rene
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Wed, Feb 25, 2004 8:30 PM

[more snips]

until all

for hardware

more tests

Haha, I would have laughed if it were funny. Hardware 100% verified... Have you ever come close to a PowerPC CPU, AMD Ethernet Chip (Lance), Infineon DuSLIC, Motorola MPC 180 Crypto Chip and so on ?

I am a doing low level software and some hardware design. I have stopped counting the hardware bugs in components we used. The lesson was that we now evaluate parts by their errate sheet.

- Rene

- E
- Elder Costa
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Wed, Feb 25, 2004 11:29 PM

Adding another point of view on the hardware side. I am reading a book about worst case analysis that was recommended by somebody in this NG (I don't remember the title/author from the top of my head). Hardware can become a nightmare if one don't take worst case figures in account when designing, just the typical ones. I'm affraid that must be more the rule than the exception. Not to mention digital hardware based on software (VHDL, Verilog etc.) :-)

Still software is IMHO by far more complex than hardware.

Just my $0,00999999999.

- B
- Brian Dean
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Thu, Feb 26, 2004 12:56 AM

Oh, that reminds me:

"We are Pentium of Borg. Division is Futile. You will be approximated."

Cheers,

-Brian

--
Brian Dean, bsd@bdmicro.com
BDMICRO - MAVRIC & MAVRIC-II ATmega128 Based Dev Boards
http://www.bdmicro.com/

- J
- Jim Stewart
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Thu, Feb 26, 2004 1:11 AM

A Pentium FPU engineer goes into a bar and orders a drink. The bartender says "That'll be five dollars", the Pentium engineer slaps a 5 dollar bill on the bar and says "keep the change"

- J
- Jim Granville
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Thu, Feb 26, 2004 1:39 AM

and the variant that has "Precision is futile"....

- B
- Brian Dean
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Thu, Feb 26, 2004 7:07 AM

Your version is more appropriate - I probably just remembered it incorrectly. My memory was "approximated" :-) Would that be a software bug or a hardware glitch?

-Brian

- S
- Scott Moore
  
  Contact options for registered users
Vote on answer
posted
20 years ago

Thu, Feb 26, 2004 7:27 AM

Didn't say %100 verified. Said %100 coverage. There is a difference.