Detailed article about Mars Rover failure in EE Times

http://www.eetimes.com/sys/news/OEG20040220S0046

Re: Detailed article about Mars Rover failure in EE Times

Thanks for this.

>> [...] we'd told it to do," Klemm lamented. <<

As I rather suspected, it appears to have been a comprehension/system design
issue rather than a coding issue per se.

Steve
http://www.fivetrees.com
http://www.sfdesign.co.uk



Re: Detailed article about Mars Rover failure in EE Times

I would class it more as operator error.  They allowed the first
half of a composite command to execute and leave the system in a
critical state, even though the second half had not been
uploaded.  Sounds like something E. Robert Tisdale might do.

The design problem seems to have been more in the creation of that
composite command in the first place.  There must have been safer
paths.

--
Chuck F ( snipped-for-privacy@yahoo.com) ( snipped-for-privacy@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
Re: Detailed article about Mars Rover failure in EE Times
That is a good article, but I believe many, many more things went wrong.
If you read this link from February 17th:

http://origin.mars5.jpl.nasa.gov/newsroom/pressreleases/20040217a.html

simply click on "View all Opportunity images from this press release",
found below the photo on the right-hand side. This will change the release
date of the event from February 17th to January 17th. If you read on below
"The Road Less Traveled", they have obviously mixed up their units, giving
the 1.4 meter distance as 4.6 meters instead of 4.6 feet. The world is not
laughing only at ESA.

Thinking more about "interplanetary assistance": most people on Earth have
no access to the internet at all, and in many huge areas we cannot even
manage sufficient potable water for survival ...

Re: Detailed article about Mars Rover failure in EE Times

And failed for the standard reasons.

First, the out-of-memory case was not extensively tested for. One such
test would have been to fill the file system with nonsense files and then
check the behavior under the full condition, with frequent program faults.
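
For what it's worth, a minimal host-side sketch of that kind of test,
assuming a POSIX-style C environment, might look like the following. The
mount point, file size and messages are made up for illustration; they
are not taken from the rover software.

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char name[64];
    static char block[4096];
    memset(block, 0xAA, sizeof block);

    /* Step 1: fill the volume with nonsense files until it reports full. */
    for (int i = 0; ; i++) {
        snprintf(name, sizeof name, "/mnt/target/junk%05d", i);
        FILE *f = fopen(name, "wb");
        if (f == NULL) {
            printf("create failed after %d files: %s\n", i, strerror(errno));
            break;
        }
        size_t n = fwrite(block, 1, sizeof block, f);
        int rc = fclose(f);        /* the error may only surface at close */
        if (n != sizeof block || rc != 0) {
            printf("volume full after %d files: %s\n", i, strerror(errno));
            break;
        }
    }

    /* Step 2: exercise the unit under test and confirm it degrades
       gracefully (reports the condition, keeps running) instead of
       faulting.  Here just a placeholder write attempt.              */
    FILE *f = fopen("/mnt/target/app_data", "wb");
    if (f == NULL) {
        printf("application write refused cleanly: %s\n", strerror(errno));
        return 0;
    }
    fclose(f);
    return 0;
}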

Second, there was no planned fallback action for the software to
perform. A fault was generated, and the "program" (actually, more
appropriately termed a "task" here) "faulted", which simply means it
terminated. There was no attempt to suspend it until file space freed up,
no attempt to do anything about the file-full condition, etc.

Third, the system was allowed to create a directory structure so large
that there was not enough room to load it during a restart.

These are all standard failure modes for off-the-shelf software.
Edge conditions are inadequately checked for. Windows used to crash very
reliably when placed deliberately in a full-RAM or full-disk environment.
I suspect the fact that it no longer does has more to do with virtual
memory than with any program improvement, but it remains a fact that most
systems, even major operating systems, don't do well with their disk full
or nearly full.

Further, programmers rarely think through the consequences of hitting
an error such as "file system full". The current program/task can be
killed, but that not only lets programs keep failing, it may make things
worse if simply starting those doomed-to-fail programs takes more memory.
The answer is to DO something about it. If the OS does something about
it, the program/task need not even know; it can be held suspended until
the problem is cleared.

For the situation above, I would guess that dumping files by age or by
priority, or both, would be appropriate.
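
To make that concrete, here is a minimal sketch in C of a
dump-by-age-or-priority fallback. The product table, paths and policy are
invented for illustration only; they are obviously not the flight
software's real interfaces.

#include <stdio.h>

struct product {
    const char *path;
    long        age;        /* seconds since creation                  */
    int         priority;   /* lower value = safer to discard          */
    int         present;
};

static struct product table[] = {
    { "/data/img_0001.dat", 90000, 0, 1 },
    { "/data/img_0002.dat", 40000, 1, 1 },
    { "/data/eng_0003.dat", 10000, 2, 1 },
};
enum { NPROD = sizeof table / sizeof table[0] };

/* Drop the oldest, lowest-priority product; return 0 if nothing is left. */
static int drop_one_product(void)
{
    int victim = -1;
    for (int i = 0; i < NPROD; i++) {
        if (!table[i].present)
            continue;
        if (victim < 0 ||
            table[i].priority < table[victim].priority ||
            (table[i].priority == table[victim].priority &&
             table[i].age > table[victim].age))
            victim = i;
    }
    if (victim < 0)
        return 0;
    remove(table[victim].path);            /* reclaim the space          */
    table[victim].present = 0;
    return 1;
}

/* Instead of letting the task fault, keep reclaiming until the write fits. */
static int write_with_fallback(const char *path, const void *buf, size_t len)
{
    for (;;) {
        FILE *f = fopen(path, "wb");
        if (f != NULL) {
            size_t n = fwrite(buf, 1, len, f);
            if (fclose(f) == 0 && n == len)
                return 0;                  /* success                    */
            remove(path);                  /* partial write: clean up    */
        }
        if (!drop_one_product())
            return -1;                     /* nothing left to free       */
    }
}

int main(void)
{
    static const char msg[] = "new science data";
    if (write_with_fallback("/data/new_0004.dat", msg, sizeof msg) != 0)
        printf("could not make room; report the condition instead\n");
    return 0;
}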

I am not stating this to "prove I am a genius", but simply to
state what I have stated all along: the current state of software
is NOT good, and using industry standard practices and operating
systems is not the path to reliability.



Re: Detailed article about Mars Rover failure in EE Times

I totally agree.

In the early days of NASA, Bob Gilruth (NASA head) used to implore his guys
to "keep it simple". The simpler the system, the fewer failures modes needed
to be considered. Fairly elementary stuff.

As I've said before, as a hardware/software engineer I'm fascinated by the
difference between these two disciplines. In hardware, we're fairly mature,
and we're adept at "complexity management". In software, we seem determined
to throw out all the lessons learned and start over - in an undisciplined
and sloppy kind of way (so far). We approach it horizontally when we should
be thinking vertically i.e. hierarchically.

Time and time again I have seen projects compromised by a decision to "save
time" by buying in an RTOS, or basing the product on Windows CE, etc etc -
decisions which increase the complexity (and hence failure modes) of the
product by leaps and bounds. When comprehensibility is reduced, so is
reliability. Bob Gilruth would turn in his grave - from his obit at
http://www.space.com/peopleinterviews/gilruth_obit_000817.html :
>> [...] things are important," [Alan Bean] said. "With the quality control and
documentation, you had the history of everything and could lay your hands on
it in a flash." <<

Compare this with the current state of software engineering. Actually, I
hesitate to call it "engineering" at the moment. There is still far too much
emphasis on "hack and debug" and not enough on complexity management i.e.
good, solid hierarchical design. A complex design should be broken down into
many, simple, pieces. If the designers can't understand the hierarchy,
WARNING. If the elements are too clever to be comprehensible, WARNING. If
the designers can't "lay [their] hands on it in a flash", WARNING. We need
to be more disciplined, and to see simplicity and comprehensibility as
virtues. Too often the design evolves from the code in an ad-hoc fashion -
analogous to designing a car by starting with a piece of metal and a
hacksaw.

In the original "Mars Rover" thread here there was much emphasis on coding
issues, e.g. type-safe languages. This, in my view, is a symptom of the
problem - coding is NOT the main issue. (Sure, it matters, but it's not the
root problem.) Good *design* is the issue. Given a good, comprehensible
design and a competent, disciplined coder, I don't care if it's coded in C,
assembler, or Ada - it'll work, and stay working. If it's not 100%
comprehensible, it WILL fail.

</rant>

Steve
http://www.fivetrees.com
http://www.sfdesign.co.uk



Re: Detailed article about Mars Rover failure in EE Times


I agree with your points about the importance of structured, disciplined
design as opposed to hack and patch, but given the fallibility of humans in
general, and NASA in particular - design by the lowest bidder - I am
surprised that some redundancy wasn't built in to accommodate a catastrophic,
unanticipated failure mode. I realize that weight is at a premium, but even
so.

Bob

Re: Detailed article about Mars Rover failure in EE Times


You mean like a complete 2nd processor? I suspect that NASA decided that
this covers hardware problems, but not software problems, since the same
software runs on both processors. I suspect the correct answer would be to
have three voting processors whose software was written by three totally
separate groups that were not allowed to communicate with each other (the
groups, that is). Of course, that's probably a recipe for high costs!



Re: Detailed article about Mars Rover failure in EE Times


I am also cross-discipline, and this is really key to getting better
software. In the old days, you could say that hardware was "simply simpler"
than software. But nowadays the two disciplines are converging. Verilog/VHDL
designs are reaching, or have reached, massive complexity, and appear as
complex textual software descriptions of hardware. Even C occasionally gets
compiled to hardware. So the question is not academic: why does hardware
quality lead software quality by such a large margin? The answer is simple
to anyone doing hardware work today: the mindsets are completely different.
Here are some of the points:

1. Hardware is completely simulated. Although some PCB-level (printed
circuit board, or multiple-chip) designs remain unsimulated, virtually no
designers just roll and re-roll the dice by trying design iterations
empirically, even using FPGA chips that make such behavior possible (FPGAs
can be downloaded and run in a few minutes' time). Hardware engineers know
that simulation delivers better observability and consistency than
empirical tests.

2. Hardware is fully verified. Chip designs are not considered done until
all sections have been exercised and automatically verified. Tools exist
for hardware to discover sections that have not been exercised by the
tests, and more tests are added until "100% test coverage" is achieved.

There is interest in applying these methods to software, and it can be
done. Profilers can find out whether sections of code have been run, even
down to the statement and machine-instruction level. Automatic test methods
are not making as much progress, but there is no fundamental reason why they
won't work. Finally, software engineers need to understand that ANYTHING can
be simulated. There is far too much temptation to simply defer testing until
the hardware comes back. But this serializes software and hardware
development, and I believe it significantly degrades software reliability by
deferring simple software bugs (i.e., those not related to timing or
interface issues) to the environment with the least potential for
observability and automatic verification.
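
As a toy illustration of the statement-level point (hand-rolled, not any
particular profiler's interface), coverage marks can even be added by hand
in C and the untouched ones reported after a test run:

#include <stdio.h>

#define COVER_POINTS 4
static unsigned cover_hits[COVER_POINTS];
#define COVER(n) (cover_hits[(n)]++)    /* mark that this point executed */

static int clamp(int v, int lo, int hi)
{
    if (v < lo) { COVER(0); return lo; }
    if (v > hi) { COVER(1); return hi; }
    COVER(2);
    return v;
}

int main(void)
{
    clamp(5, 0, 10);              /* exercises only the "in range" path */
    COVER(3);

    for (int i = 0; i < COVER_POINTS; i++)
        if (cover_hits[i] == 0)
            printf("coverage point %d was never executed\n", i);
    return 0;
}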


"Hackaneering" :-)


I would only add that the counter to the "it takes good programmers" idea
is that, certainly, the programmers must do the job, but there is also the
idea of "best practices": good programmers produce their best work by
adopting the best practices they can. For example, a type safe language
does not make for automatic quality, but putting a type safe language in
the hands of a good programmer, along with other best practices like
simulation, automatic verification, modularity and so on, will allow the
maximum reliability to be achieved.




Re: Detailed article about Mars Rover failure in EE Times
[...snip...]

The answer is definitely *not* simple.

Without getting into a theological discussion of why software
*is* more complicated than hardware:

o Hardware is usually applied to well bounded problems,
   often with the unstated idea that the software will
   magically fill in the gaps.

o Software interfaces vary much more widely than the
   ones, zeros and clocks of digital logic.


Simulation allows empirical testing. Without vectors
(test harness), there is no test.

I agree that simulation is valuable for executing
test harnesses before the target hardware is available.

One of the nice things about the trend toward using
Linux in embedded systems is that much of the application
(and even driver) work can often be done on a PC, improving the
development throughput.

Usually, the difficult thing about using simulators
for developing embedded software is that much of the
software must interact with the target hardware, and
most target simulators don't provide a good way to
model the hardware behavior. Even if the simulator
*does* provide a hardware modelling method, building
the models is time consuming and error prone.


Hmmm. If hardware were "fully verified" I would expect
*much* shorter errata sheets ;-)


Many of us have an interest in improving quality;
unfortunately, there are many counter forces.

o Apathy and narrow mindedness of the engineers.
o Business pressures.
o Lack of experience, both individually and collectively.

Notice that the first bullet places the blame
squarely on the shoulders of the practitioners.

Just read this news-group and you will find
endless presentations by those who claim that
they know the answer, and that most of their
colleagues are fools. C rocks! C++ sucks!
Real men write assembler! RTOS users are fools!
You get the idea.

I agree that being consistent and using a
"best practice" approach (reducing the number
of variables) is an excellent way to improve
the stability of any software. However, this
can also lead to stagnation and narrow mindedness.

Under-stating and over-simplifying the problem
of software by saying "all-you-have-to-do-is-xyz"
does not contribute to a solution.



There is no substitute for experienced engineers.
If they're experienced enough, then they're in it
for love ... not money ;-)

Let's not forget the most important best practices
of all: Solid requirements, careful analysis and
a design that covers the temporal aspects as well
as procedural aspects.

Wouldn't it be nice if those things took as little
time as many *think* they take! :-)


--
Michael N. Moran           (h) 770 516 7918
5009 Old Field Ct.         (c) 678 521 5460
Re: Detailed article about Mars Rover failure in EE Times

Just a few points:


FWIW, that's not my position. The right tool for the job etc. However, I do
notice a tendency to overcomplicate. Recently I worked on a project that,
quite typically, had grown out of proportion over the years and which no one
now fully understood. It had been based on an RTOS when a simple round-robin
scheduler would have been adequate - and far more comprehensible. In this
case, as in many that I've had first-hand experience of, there were none of
the real justifications for using an RTOS. IME, this is far from unusual.
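
For the avoidance of doubt, by "round-robin scheduler" I mean nothing more
elaborate than the sketch below: a fixed table of short, non-blocking task
functions called in turn from the main loop. The task names here are purely
illustrative.

static void poll_sensors(void)  { /* read inputs; never block       */ }
static void run_control(void)   { /* update the control law         */ }
static void service_comms(void) { /* move a few bytes of I/O        */ }

typedef void (*task_fn)(void);

static task_fn const task_table[] = {
    poll_sensors,
    run_control,
    service_comms,
};

int main(void)
{
    for (;;) {                                      /* forever          */
        for (unsigned i = 0;
             i < sizeof task_table / sizeof task_table[0]; i++)
            task_table[i]();       /* each task runs to completion,     */
    }                              /* in turn, then yields the loop     */
}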


Agreed.


If you mean that a true craftsman is always ready to revise and refine his
definition of "best practices", then I agree.


Not sure about this one. As designers, I feel our job is to reduce
complexity into a collection of simple elements, with simple well-defined
interfaces and no side-effects. Sorry to belabour this point, but I do see
it being missed far more often than makes sense.

I am *not* saying it's easy, BTW. Good decompositional skills are, I
believe, the hardest thing to learn, and, in my experience, far more rare in
the software domain than I'd reasonably expect. I think this is
significant - in hardware many of the subassemblies are available already
partitioned (ICs, discretes, modules, products). Not so in software
(usually).


Now *this* is an interesting point. I have noticed that one of the
side-benefits of a simple, comprehensible design is a reduction in
implementation time. The design time is increased, but the overall time is
reduced - often considerably.

Steve
http://www.fivetrees.com
http://www.sfdesign.co.uk



Re: Detailed article about Mars Rover failure in EE Times


I guess to keep this from being a "do as I say, not as I do"
discussion, I should outline how I use these principles in my
own projects. The proprietary details have been removed.

Current project: large, actually stunningly large (hundreds of thousands
of lines), written in stages since 1980, and maintained since.

o Currently a Windows XP based system, formerly other computers.
o Written using type safe language.
o No simulation, since there is no target (not an embedded system).
o Generous use of "assert" type constructs. I long ago learned to
  check virtually every possible bad condition, even if done
  redundantly. Now it is very rare for even a serious problem not to
  trip one of my asserts, which produce a message giving the exact
  location, in source, of the fault. (A minimal sketch of the idea
  appears after this list.)
o Formal testing methodology. A series of extensive test files give
  automatic coverage of software faults. After that, an extensive
  series of real world examples are run for further testing.
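
(The project itself is written in a type safe language, not C, but for
readers here a minimal C-flavored sketch of the "assert with exact source
location" construct I mean might look like this; CHECK and lookup are
invented names.)

#include <stdio.h>
#include <stdlib.h>

#define CHECK(cond)                                                \
    do {                                                           \
        if (!(cond)) {                                             \
            fprintf(stderr, "check failed: %s at %s:%d\n",         \
                    #cond, __FILE__, __LINE__);                    \
            abort();      /* or log and drop to a safe state */    \
        }                                                          \
    } while (0)

static int lookup(const int *table, int n, int index)
{
    CHECK(table != NULL);
    CHECK(index >= 0 && index < n);    /* redundant checks are cheap */
    return table[index];
}

int main(void)
{
    int t[3] = { 10, 20, 30 };
    printf("%d\n", lookup(t, 3, 1));
    return 0;
}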

Results: Virtually all errors are caught in a sensible way. The errors
that don't result in asserts are then caught by the type protections
in the language. Program development proceeds virtually without system
faults from the operating system, even on new, untried sections of the
code.

Last project:

o Embedded to new hardware, which used IBM-PC standard hardware.
o Written using C.
o Simulation of chip related tests. The basic workability of the test
  platform was established by "emulating" the full software on a standard
  PC, made possible by the commonality of target and PC hardware. However,
  one version used POSIX I/O and worked on both Windows and Linux, and was
  used to give a complete preview of the system long before the hardware
  was even designed. (A sketch of the kind of I/O abstraction involved
  appears after this list.)
o Formal test methodology was used for chip tests, which ran first against
  a full hardware simulation and were then transferred to real hardware.
  Test platform was fully scriptable, and allowed for building complete
  regression tests.
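
(Here is the minimal sketch promised above of the kind of I/O abstraction
that lets the same application code run against either simulated or real
hardware. All names are invented for illustration; the project's real
interfaces were different.)

#include <stdio.h>
#include <stdint.h>

struct port_ops {
    int (*write_reg)(uint32_t addr, uint32_t value);
    int (*read_reg)(uint32_t addr, uint32_t *value);
};

/* Host-side implementation: "registers" are just an array, and every
   access is logged so a regression script can diff the traces.       */
static uint32_t sim_regs[256];

static int sim_write(uint32_t addr, uint32_t value)
{
    sim_regs[addr & 0xFF] = value;
    printf("WR %08lx <= %08lx\n", (unsigned long)addr, (unsigned long)value);
    return 0;
}

static int sim_read(uint32_t addr, uint32_t *value)
{
    *value = sim_regs[addr & 0xFF];
    printf("RD %08lx => %08lx\n", (unsigned long)addr, (unsigned long)*value);
    return 0;
}

static const struct port_ops sim_port = { sim_write, sim_read };

/* The application only ever sees the interface, so the same source runs
   unchanged on the target once a real port_ops is supplied.            */
static void app_init(const struct port_ops *io)
{
    io->write_reg(0x10, 0x1);     /* e.g. bring the device out of reset */
}

int main(void)
{
    app_init(&sim_port);
    return 0;
}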

Results: full functionality with real hardware in 2 weeks after hardware
proved functional.

Before that:

o Embedded, custom platform.
o Written using C.
o Full simulation, and test of the code in the simulation environment. An
  arrangement was used where the Verilog code used to build the hardware
  chips was coupled with a CPU model that executed the software. The result
  was a full simulation running the real code against the actual code used
  to construct the hardware.
o Formal test methodology. TCL was used to drive the system under
  test for full regression testing.

Results: We brought up several platforms. It was very common to have
the software running the same DAY the hardware group declared the unit
to be running.

Final comment: I design in whatever language my client wants. Personally,
however, I find that projects proceed twice as fast using type safe
languages. The development time to write the code is the same, and most
debugging occurs at the same pace. However, it has been my experience
that C, and probably all type unsafe languages, will throw out several
problems per year that consume more than a week to solve, such as
lost-pointer errors or array overruns. These problems typically cause
random, serious schedule slips. Their effects also go beyond the problem
itself. For example, I typically run a much tighter write-to-test cycle
in C, because if a serious fault shows up I want a better idea of just
what code might have introduced it. This kind of defensive programming
costs development time. Also, debugging in type safe languages goes
faster because of the better diagnostics produced by even minor errors.

Because I design perhaps 70%, and typically design tens to hundreds of
thousands of lines each year, I don't believe that the above effects
occur because I have "better knowledge" of one language or another.
On the contrary, that should favor C. I also don't believe that it
is an effect of programming knowledge in general, since I am often
the debugger of choice for serious system errors, especially in
compiled low-level code, since I have written several compilers,
including a C compiler.

Do I make much progress convincing my clients to use type safe languages?
No, unless you count my work with TCL or Perl. Since I am a low-level
(close to the hardware) specialist, I rarely even see requests for C++
code, much less anything higher level than that. It is my experience
that C is usually picked for projects without any debate, or even without
asking the programmers what they would like to use.

I like to be paid, so I don't bring language issues to work unless
asked. It's enough work just trying to get my clients to use modular
concepts. However, I would draw the line, and have drawn the line,
at systems critical to human life. I would most certainly avoid
working on any life-critical system, such as medical or aircraft
navigation equipment, if it were done in C.



Re: Detailed article about Mars Rover failure in EE Times

Interesting. My "angle" is just slightly different.

 - My background is primarily true embedded products (firmware) in a market
(process control) where there is *no* tolerance of s/w bugs: if it fails,
it's broken, and the result might be lawsuits and/or product recalls.
 - I consider defensive programming (within reason) to represent good value
for money - it usually saves time further down the line, and as a hardware
engineer, I still embrace the catechism that debug costs escalate at every
stage.
 - I tend to write my own "type-safeness" into the application ;). That is,
I use strong typing (more than C actually supports) and I explicitly check
bounds etc.; a minimal sketch of what I mean follows this list. Anything a
language can do, I figure I can do too, but with more insight into what's
actually going on at runtime. (Which is one reason I'm not a fan of C++.)
 - I almost never run into stray pointers etc, whether I'm using C,
assembler or whatever. When I do, it's at an early stage - like you I use
asserts etc, along with a variety of other means of making oversights jump
out at me.
 - Many of my applications are close to life-critical (in all but legal
terms), and most are certainly mission-critical. Beyond the process control
work, I've written safety monitoring applications involving naval
navigation, fire alarm reporting, and personnel-at-risk monitoring, amongst
others. I wouldn't be able to sleep nights if I had anything other than 100%
confidence in them ;).
 - As I've said here before, I avoid debugging. I hate it ;). Instead, I
basically write lots of small, trivially simple elements, test them
individually and collectively, and ensure that *all* runtime errors are
caught and dealt with sensibly. Many of my colleagues find this process
strange and tedious - but I find it far more constructive than debugging,
which can only yield empirical results.
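
Here, for what it's worth, is the minimal sketch I promised above of what I
mean by rolling strong typing and explicit bounds checks into plain C; the
names are invented purely for illustration.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int raw; } channel_t;     /* a channel number ...      */
typedef struct { int raw; } millivolts_t;  /* ... is not a voltage      */

#define NCHANNELS 8
static millivolts_t readings[NCHANNELS];

static millivolts_t reading_get(channel_t ch)
{
    if (ch.raw < 0 || ch.raw >= NCHANNELS) {  /* explicit bounds check  */
        fprintf(stderr, "bad channel %d\n", ch.raw);
        abort();                              /* fail loudly and early  */
    }
    return readings[ch.raw];
}

int main(void)
{
    channel_t ch = { 3 };
    /* reading_get(readings[0]);  would not compile: wrong wrapper type */
    printf("%d mV\n", reading_get(ch).raw);
    return 0;
}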

I apologise if I'm blowing my own trumpet, and reiterating points I've made
before in other threads. I'm genuinely curious as to why my bug count is so
low when the average for the industry is so high. It ain't because I'm a
genius - if anything, I like things simple because I'm *not* a genius. I
assume it's because I was a hardware engineer first, and have been trained
and indoctrinated in *engineering*. Possibly also the exposure to
mission-critical apps, which has forced me to find ways of making software
robust.

NASA, sounds like you need me ;).

Steve
http://www.fivetrees.com
http://www.sfdesign.co.uk



Re: Detailed article about Mars Rover failure in EE Times


Haha, I would have laughed if it were funny. Hardware 100% verified... Have
you ever come close to a PowerPC CPU, an AMD Ethernet chip (LANCE), an
Infineon DuSLIC, a Motorola MPC 180 crypto chip, and so on?

I am doing low-level software and some hardware design. I have stopped
counting the hardware bugs in the components we have used. The lesson is
that we now evaluate parts by their errata sheet.

- Rene



Re: Detailed article about Mars Rover failure in EE Times

Adding another point of view from the hardware side. I am reading a book
about worst-case analysis that was recommended by somebody in this NG (I
don't remember the title/author off the top of my head). Hardware can
become a nightmare if one doesn't take worst-case figures into account
when designing, just the typical ones. I'm afraid that must be more the
rule than the exception. Not to mention digital hardware based on software
(VHDL, Verilog etc.) :-)

Still, software is IMHO by far more complex than hardware.

Just my $0.00999999999.


Re: Detailed article about Mars Rover failure in EE Times

This is precisely the myth I'm trying to debunk.

All problems, hardware or software, tend to be complex until the complexity
is designed out. If it's still complex at the end of the exercise, then the
exercise has failed.

Steve
http://www.fivetrees.com
http://www.sfdesign.co.uk



Re: Detailed article about Mars Rover failure in EE Times

I totally agree.  It's just as possible to write a half-page of C that
is totally undecipherable as it is to write 50 pages of easily understood
assembly code.

As to the software being more complex than hardware,
the most complex hardware designs are in silicon and
are done in VHDL or Verilog.  So you're back to
virtually the same issues as software.




Re: Detailed article about Mars Rover failure in EE Times
On Thu, 26 Feb 2004 19:14:31 -0000, "Steve at fivetrees"


It's hard to disprove the truth.

OK, that's a bit glib.  But it seems pretty obvious to me that in
order for software to fully exploit the hardware for which it was
written, it must be at least an order of magnitude more complex than
said hardware.


This statement is true.  To a point, anyway.  Hoare said there are two
ways to write software: so complex there are no obvious deficiencies,
or so simple there are obviously no deficiencies.  But Einstein said
you should make your system as simple as possible, but no simpler.
(Note: both paraphrases)

But that says nothing about the relative complexity of hardware and
software.

Regards,

                               -=Dave
--
Change is inevitable, progress is not.

Re: Detailed article about Mars Rover failure in EE Times

I suspect we mean different things by "complex". The point I'm making is
that complex functionality can be achieved by hierarchical collections of
simple things. A case in point is the TCP/IP stack - everyone seems scared
of it, and the tendency is to buy it in or use an RTOS that includes it. But
layer by layer, it's not rocket science.


I'm strongly in the "so simple there are obviously no deficiencies" camp
;) - as a conscious survival strategy, not as naive idealism. As for the
Einstein quote, I'm not sure how it applies to engineering - if it meets the
spec, and does what is required, why does it need to be any less simple?

Steve
http://www.fivetrees.com
http://www.sfdesign.co.uk



Re: Detailed article about Mars Rover failure in EE Times

I understand your point, though I don't fully agree with it. Software
with hundreds or thousands of functionalities will be complex no matter
how much you break it down into smaller, simpler pieces. Still, your
approach, if realizable, can potentially make better software, even
though it remains complex. :-)
