Spirit rover OS problems

I'm kinda surprised not to have seen discussion here of the Flash memory/OS problems suffered by the Spirit rover. It seems noteworthy that several $100 million's worth of kit was crippled for so long by what was reported as a priority-inversion issue. And you can't get much more hard-realtime than the Spirit rover.

NASA have said that their biggest problem with the WindRiver OS is comprehension. For me, this neatly underlines the problems I've had with RTOSs... and why I avoid them as far as possible.

Discuss ;).

Steve


Reply to
Steve at fivetrees

The subject of RTOSs is a religious issue, and I make it a rule never to discuss religion. ;-)

Tanya

Reply to
news.bigpond.com

Heh ;).

Steve


Reply to
Steve at fivetrees


Priority inversion (and Wind River) also glitched Pathfinder.. Now that this enigma clearly has a price tag associated with it, someone may catch on that configuring priority inheritance is a lot cheaper than a 110,000,000 mile control-alt-delete.
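
For readers who have not met it, priority inheritance is usually a per-mutex setting. A minimal sketch using the POSIX threads API rather than VxWorks, so that it compiles on an ordinary desktop; the helper name init_pi_mutex is made up for illustration, and nothing here is taken from the rover code:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/* Initialise a mutex that uses the priority-inheritance protocol.
 * Returns 0 on success, or the error code of the pthread call that failed.
 * init_pi_mutex is a hypothetical helper name, not part of any RTOS API. */
int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc;

    assert(m != NULL);

    rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;

    /* While a low-priority thread holds the mutex and a higher-priority
     * thread blocks on it, the holder temporarily runs at the waiter's
     * priority, so medium-priority threads cannot keep it off the CPU. */
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

The mutex is then locked and unlocked as usual; the boost is applied by the scheduler and dropped again when the holder unlocks.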

After all, it's not rocket science..

Reply to
Ian McBride

First I've heard of a priority-inversion problem there. I understood them to be having the equivalent of a buffer overrun in stored data, but that is very hazy. I haven't seen anything with any details of the actual problem.

If they are using an RTOS without having sources, that seems incredible. Source is always the ultimate documentation. I would assume suitable non-disclosure agreements. Unfortunately this use-it-blindly attitude infests many areas, especially including medical devices. I would expect WindRiver to gladly supply source (with non-disclosure) for the promotional value of operating on Mars.

cross-posted to c.l.c, because at least some WindRiver people hang out there. FUPs set to take c.l.c off again.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
Reply to
CBFalconer

Life is too short to roll everything by hand for every single project, though. My book is really about this topic, to a large degree. I don't want to trust an "RTOS" with ultimate control over my ship, but I want some of the high-level services it can provide. So I do the real trusty stuff in external OSless micros and the sloppy stuff (cameras, networking, bulk data storage) is done on an SBC.

For applications with few or no consequences for forced watchdog reboots, user-intervention-reboots, etc (consumer electronics for instance), I'd say damn the torpedoes and full speed ahead - use the RTOS, follow the vendor's suggested best practices, and point the finger back at them if there's a fatal problem.

Gee, maybe I'm qualified to be a PHB.

Reply to
Lewin A.R.W. Edwards

I agree with you.

Another big problem with RTOSs is driver support. If you are using a PC based platform you will only find drivers for Windows and Linux. The drivers for RTOSs have to be handcrafted.

Is there any RTOS that would work directly with drivers written for Windows/Linux?

Sandeep

--
EventStudio 2.0 - Real-time and Embedded System Design CASE Tool

Reply to
EventHelix.com

Linux with realtime extensions.

Reply to
Lewin A.R.W. Edwards

I am personally surprised and saddened by the news that NASA is using:

A. C, vs. a more reliable language for the rovers.

B. An off-the-shelf RTOS with priority-based management, vs. a system with schedule-based management.

Although there were several discussions here of the rover using Java, the news from NASA mentions "C code downloads".

Bottom line is that the software industry is in a very poor state right now for reliability. Most projects are done in C, which, in the words of its own authors, was designed for system and low-level tasks, and done without concern for reliability, hence the abysmal reliability record of off-the-shelf and even embedded software.

Although the COTS (commercial off the shelf) program has yielded cost benefits, that is no excuse for inhaling the worst practices of the software industry today.

So the result now is that systems costing the major portion of a billion dollars of our money are more likely to stop because of a software glitch than a hardware one. It's ridiculous that such an expensive vehicle that has survived radiation and heat with amazing hardware reliability is felled by such simple nonsense.

What would I do ? I would use a language, any language, with type security. Java, Ada, Pascal, virtually any language but C or C++. C has no type security whatever, and C++ only has security if you refrain from using C constructs within it, which nobody does.

Second, priority-based scheduling is NOT deterministic. In fact, it basically amounts to putting a random number generator in charge of your scheduling. There have been MUCH better systems detailed in the literature, such as "deadline" based scheduling, that ARE deterministic. I will admit that I have been evangelistic on this subject, but the industry's determination to stick with a scheme that has so many demonstrable flaws truly stuns me.
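
To make the contrast concrete, here is a rough sketch of the selection rule behind earliest-deadline-first scheduling, the textbook form of deadline scheduling. The struct task layout and the name edf_pick are assumptions for illustration; a real scheduler would also need admission control and handling for tick-counter wraparound:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task record, for illustration only. */
struct task {
    const char *name;
    unsigned long deadline;   /* absolute deadline, in ticks */
    int ready;                /* non-zero if the task can run now */
};

/* Earliest-deadline-first selection: return the index of the ready task
 * whose absolute deadline is soonest, or -1 if no task is ready.
 * Ties go to the lowest index, so the choice is repeatable. */
int edf_pick(const struct task *tasks, size_t n)
{
    int best = -1;

    assert(tasks != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].ready)
            continue;
        if (best < 0 || tasks[i].deadline < tasks[best].deadline)
            best = (int)i;
    }
    return best;
}
```

The scheduler would call this at every scheduling point; the task that runs is decided by the deadlines alone, not by a fixed priority number assigned at design time.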

My 2 cents.

Reply to
Scott Moore

For an answer to any question like this, consider why the broadcast industry in North America stuck with NTSC despite the fact that it is a system with numerous demonstrable flaws. It solved a problem when it was introduced, a big investment in knowledge and equipment was made, it's a huge job to uproot it.

The cost benefits of using COTS stuff are very real. Consider it like this: If it's going to cost $200 million to send a COTS mission with a 75% likelihood of success, or $600 million to send a proprietary mission with a 95% likelihood of success, what makes better sense? You could send three of the cheap missions and get a much better target coverage, even if one of the cheapo probes is a total failure. Especially since the definition of "success" in this context is pretty vague.
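
Working those illustrative figures through, on the assumption (not stated above) that the three cheap missions fail independently of one another: the chance that at least one of them succeeds is 1 - 0.25^3, about 98.4%, for the same $600 million that buys one 95% mission. A small sketch of the arithmetic, with p_at_least_one as a made-up helper name:

```c
#include <assert.h>

/* Probability that at least one of n independent missions succeeds,
 * given that each one succeeds with probability p. */
double p_at_least_one(double p, int n)
{
    double all_fail = 1.0;

    assert(p >= 0.0 && p <= 1.0 && n >= 0);

    for (int i = 0; i < n; i++)
        all_fail *= (1.0 - p);   /* every mission must fail for a total loss */

    return 1.0 - all_fail;
}
```

If the failures share a common cause, such as the same software bug on every probe, the independence assumption breaks down and the real figure would be lower.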

NASA is under big pressure to spend less per mission. And at the end of the day, this COTS problem cost us... what? Seven or eight days out of a nominal 90-day designed mission life, which is really (according to the reports I've read) going to be a 180-day life? Do you think there is some specific scientific goal that we're now only ALMOST going to reach - the arm is going to be stretching out for the rock with the Martian trilobite fossil in it when the mission runs out of time and the rover fails?

I have no problem seeing my tax dollars going into C and COTS projects in "el cheapo" interplanetary missions. Hell, if NASA lost Federal funding and had to rely on donations, I'd give what I could freely.

Reply to
Lewin A.R.W. Edwards

All I know is that they use a VME rack and VxWorks; there is no information at all about the cards in the slots. I believe the public will never hear the truth about what happened. NASA ran many experiments on flash reliability under extreme temperature, X-ray and other particle exposure, but there is still the question of whether the fault was caused by hardware or software.

In today's press release they announce that they have an "undefined workaround", and I assume that in the meantime they know the exact reason.

However, over the last few days I experimented with the Infineon XC16 chips and was surprised to find hardware flash error correction. That uC contains an 8-bit hardware ECC for each 64 bits of the flash array. Single-bit errors are corrected on the fly and double-bit errors are flagged. I had not seen such a feature in uC flash before, and it seems they were stung by the former Siemens SAB 87C166 debacle (not for new designs).

NASA wrote in earlier tech docs that they have error correction in RAM, and they wrote about EEPROM, but this is the first time they have talked about flash, and there is probably no hardware ECC available for it.

I assume critical data such as the file system is always available as a second copy. In other words, crashing a complete system the way they did would require more than one bug.

Reply to
Janvi

I am glad I am not the only one. I have used WindRiver a couple of times in jobs, but I was always very disappointed by the nature and level of support. Once I was trying to get some info on why a wait on a resource was consuming nearly a millisecond after the resource became available, using a 66 MHz 486 processor. In the end they would only tell me that "there are no known bugs" in that OS code. They refused to discuss it further. After that, I swore I would never use another WindRiver product again.

Seems that great minds think alike... no, wait, NASA is still using it!

--
Rick "rickman" Collins

rick.collins@XYarius.com
Reply to
Rick Collins

I don't have a problem with COTS. The error is in taking the most popular techniques of the industry as is, to wit C and priority-based management (I left out relying on watchdogs, which are akin to helping a heart attack patient by hitting him in the head with a hammer once a minute). NASA needed to take the "best of" the industry practices. The software industry is a mess. We are reaching new lows in reliability on a daily basis, so finding the "best of" reliability practices is certainly a challenge, but it IS doable.

No, it killed the last lander. This time they dodged the bullet. The Odyssey has had software problems as well. The Pathfinder died entirely. That's a 100% failure rate in Mars missions. I would fire anyone with that rate of failure.

Reply to
Scott Moore

The last description of the problem by NASA was that the flash file storage system did not correctly handle being full or nearly full. Most software does not handle these edge conditions properly. Windows itself would reliably crash if you insisted on leaving main memory or the disk full or nearly full. In C language terms, all you have to do is look at how many times malloc is called with no checking for a null return (meaning no more memory available), and that is only the most obvious case. Most code that does handle the out-of-memory condition does nothing but abort the program currently running, which is not much improvement on not handling the null case, since that will lead to an invalid address exception and program termination.
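
A minimal sketch of the checked pattern, in which the failure is passed up to a caller that can free something or degrade gracefully instead of aborting. copy_buffer is a hypothetical helper, and what the caller does on failure is application-specific:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicate a buffer, reporting failure to the caller
 * instead of dereferencing a null pointer or aborting the program.
 * Returns 0 on success, -1 if memory could not be allocated. */
int copy_buffer(const void *src, size_t len, void **out)
{
    void *p;

    assert(out != NULL);
    *out = NULL;

    if (len == 0)
        return 0;             /* nothing to copy; *out stays NULL */

    p = malloc(len);
    if (p == NULL)
        return -1;            /* caller decides: retry, free caches, degrade */

    memcpy(p, src, len);
    *out = p;
    return 0;
}
```

The same shape applies to a full file system: the write path reports "no space" as an ordinary result, and the code above it has a planned response.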

Reply to
Scott Moore

NASA should require that Wind River perform on-site maintenance on the product :-)

Reply to
Scott Moore

Travelling expenses and daily allowance paid by customer?

aha

--
Every program has at least one bug and can be reduced by at least one
line.  By induction, then, every program can be reduced to a single
Reply to
Andreas Hadler

It's not the language that's reliable or not, it's the programmers. A craftsman can write bulletproof code in any language and a hack can screw up in any language.

--
Rich Webb   Norfolk, VA
Reply to
Rich Webb

If you really believe that all languages are equivalent for reliability, then you either need to go back to school or you need a lot more experience.

The only thing I know that is common to *all* programmers is that they are *never* perfect and can use all the help they can get when designing a large system.

--
Rick "rickman" Collins

rick.collins@XYarius.com
Reply to
Rick Collins

Like another poster said, it's not the tools, it's the craftsman. You're not appreciating the full sweep of my assertion above: Using /common/ tools and techniques is a substantial part of cost saving. Anti-C hysteria is not going to achieve any increase in reliability.

More like whacking him in the chest if he doesn't answer "yes" when you ask him if he's OK.

For manned missions, yes. For robots, the criteria are cheap and fast. There's a bit of "and it has to work enough" mixed in there, so the public doesn't lose interest. It's a balance of money vs. PR. There are no lives at stake. Do it fast, do it cheap, try to get it as right as possible.

Viking lasted a couple of years on the surface. MER-A will last a few months. The difference isn't the software, it's the fact that Viking had RTGs to keep it warm all the time, and MER-A has to rely on solar cells, NiMH backups, and presumably the outpourings of bleeding-heart environmentalists. Remember what killed the last Viking lander, by the way - ground operator error. And VO/VLs weren't programmed in C.

Reply to
Lewin A.R.W. Edwards

Those who use an RTOS do, those who don't do not, and never the two shall meet.

Reply to
???
