How to avoid a task not executed in a real-time OS?

- R
- Robert Willy
  
  Contact options for registered users
posted
5 years ago

Mon, Jan 21, 2019 12:40 PM

Hi, I was asked the question in the title some time ago. I have some real-time embedded system experience, but not with an RTOS. A watch-dog can avoid a task not called in a real-time system. But in an RTOS, what is the right answer? I know that task priority and timer-triggered events can both influence task execution, but they didn't look like the correct answer. What do you think the right answer is?

Best Regards,

- John Speth

Mon, Jan 21, 2019 2:27 PM

It's a peculiar choice of words in asking these questions. I'm not sure exactly what you mean by "avoid".

In a typically designed single-core system, at least one RTOS task is always present (blocked or waiting) but not always actively doing work (running). There is always more than one task; otherwise there is no justification for using an RTOS. The programmer can (usually) start (create) and stop (kill) tasks. Task creation usually happens at init time, but there's no rule that forces that design. If you want to avoid the task running state, kill the task or inhibit the input that unblocks it. You decide as the programmer.

Task priority is a means to direct a lower priority task to yield CPU usage while a higher priority task is running. When the high priority task is done running (that is, it blocks), the lower priority task automatically runs if it's runnable (not blocking). Task priority can delay a task running in response to what unblocked it but not inhibit it.

That's a simplified answer. Some RTOS's are quite sophisticated and the answers get complicated when you use their advanced features.

JJS

- Stef

Mon, Jan 21, 2019 4:55 PM

On 2019-01-21 Robert Willy wrote in comp.arch.embedded:

A watchdog cannot avoid a task not being called. In most cases the watchdog just resets the complete system if it is not serviced fast enough (by the task(s) being watched). There is, however, no guarantee that the offending task will run after the reset. There may be some permanent failure that prevents it from running.

You can use a watchdog in an RTOS as well, but you have to make sure your task will run within the required watchdog service interval.

What do you think the 'RT' in 'RTOS' stands for? What is the difference between your 'real time' and using an RTOS?

--
Stef    (remove caps, dashes and .invalid from e-mail address to reply by mail) 

It is better to have loved a short man than never to have loved a tall.

- Richard Damon

Mon, Jan 21, 2019 6:05 PM

If your question is how to make sure that every task gets the time it needs within its required time limits, in general this is impossible: if you mis-design your system, you may end up with 110 units of work to do in 100 units of time, which is fundamentally unsolvable.
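The "110 units of work in 100 units of time" case corresponds to a total CPU utilization above 1. A minimal sketch of that check in C, assuming each periodic task is described by a worst-case execution time and a period (the `task_params` struct and both function names are hypothetical, not taken from any RTOS):

```c
#include <stddef.h>

/* Hypothetical task description: worst-case execution time and period,
 * both expressed in the same time unit (for example microseconds). */
typedef struct {
    unsigned wcet;
    unsigned period;
} task_params;

/* Total CPU utilization U = sum(wcet_i / period_i). */
double total_utilization(const task_params *tasks, size_t n)
{
    double u = 0.0;
    for (size_t i = 0; i < n; ++i) {
        u += (double)tasks[i].wcet / (double)tasks[i].period;
    }
    return u;
}

/* Returns 1 if the demand could fit at all (U <= 1), 0 if overloaded.
 * This is only a necessary condition: a load below 1 can still miss
 * deadlines, depending on the scheduling method. */
int load_is_feasible(const task_params *tasks, size_t n)
{
    return total_utilization(tasks, n) <= 1.0;
}
```

Two tasks needing 60 and 50 units out of every 100 give a utilization of 1.1, so `load_is_feasible` returns 0 regardless of how priorities are assigned.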

Different RTOSes have different scheduling methods, mostly drawn from a few standard techniques.

If you have an RTOS with strict execution priorities, then the only things that can keep a given task from getting the CPU to try and meet its deadlines are tasks with higher priorities (or lower-priority tasks that hold resources it needs). If you can characterize and limit the time that those higher-priority tasks might consume, then you can come up with some guarantees for execution of that task.
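One standard way to turn "characterize and limit the time of higher-priority tasks" into a number is fixed-priority response-time analysis. It is not something the post names, so treat the following as an illustrative sketch. It assumes independent periodic tasks, no blocking on shared resources, nonzero periods, deadlines equal to periods, and an array sorted with the highest priority first; `rt_task` and `response_time` are hypothetical names.

```c
#include <stddef.h>

typedef struct {
    unsigned wcet;    /* worst-case execution time */
    unsigned period;  /* period, also used as the deadline here */
} rt_task;

/* Worst-case response time of tasks[i], where tasks[0..i-1] have higher
 * priority. Iterates R = C_i + sum(ceil(R / T_j) * C_j) to a fixed point.
 * Returns 0 if the response time would exceed the task's period. */
unsigned response_time(const rt_task *tasks, size_t i)
{
    unsigned r = tasks[i].wcet;
    for (;;) {
        unsigned next = tasks[i].wcet;
        for (size_t j = 0; j < i; ++j) {
            /* ceil(r / period_j): releases of higher-priority task j */
            unsigned releases = (r + tasks[j].period - 1) / tasks[j].period;
            next += releases * tasks[j].wcet;
        }
        if (next > tasks[i].period) {
            return 0;  /* deadline cannot be guaranteed */
        }
        if (next == r) {
            return r;  /* converged */
        }
        r = next;
    }
}
```

For tasks with (execution time, period) of (1, 4), (2, 6) and (3, 12), the lowest-priority task converges to a worst-case response time of 10, which is inside its period of 12.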

Another method sets priorities based on how close each task is to its deadline. As long as no other tasks have fallen behind their deadlines (if past-deadline tasks are given priority), a task should get some time as its deadline approaches.
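The deadline-based method described above is commonly called earliest-deadline-first (EDF) scheduling. A rough sketch of the selection step, assuming a small table of tasks with absolute deadlines in ticks (`edf_task` and `edf_pick` are hypothetical names, and tick-counter wraparound is ignored for brevity):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int      ready;     /* nonzero if the task is runnable */
    uint32_t deadline;  /* absolute deadline in ticks */
} edf_task;

/* Returns the index of the ready task with the earliest deadline,
 * or -1 if no task is ready. Ties go to the lower index. */
int edf_pick(const edf_task *tasks, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; ++i) {
        if (!tasks[i].ready) {
            continue;
        }
        if (best < 0 || tasks[i].deadline < tasks[(size_t)best].deadline) {
            best = (int)i;
        }
    }
    return best;
}
```

A task that is already past its deadline has the smallest deadline value, so it is picked first, which is the "past deadline is given priority" case mentioned in the post.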

You can also create ad-hoc solutions where some high priority task periodically adjusts priorities to try and make sure a given task gets the time it needs (normally an indication that something wasn't designed right the first time).

- Hans-Bernhard Bröker

Mon, Jan 21, 2019 9:44 PM

On 21.01.2019 at 13:40, Robert Willy wrote:

No, it really can't. It can only _react_ to a supervised task not having been called in time. But it cannot keep that from happening in the first place, i.e. it cannot "avoid" it. Nor can it usually guarantee that a bite of the dog will actually resolve the issue.

Quite probably the same, because there is not really a difference between "real-time system" and "RTOS".

Any sufficiently badly designed application can fail to perform some task before its allotted deadline. If that renders the performance of said application unacceptable, that means the buzz-word "real-time" has been applied to it correctly --- quite often it isn't.

Nor is there such a thing as "the" right answer to the question "what should I do if this happened to me?" Like for all somewhat interesting questions, the only universally applicable answer is "It depends."

In a perfect world the answer might be "Just make sure your system has sufficient resources in all areas that this simply cannot happen, and then some" --- but in most corners of this here reality bean-counters will tell you, in excruciating detail if you insist, why that answer is not just incorrect, but actually blasphemous.

- upsidedown

Tue, Jan 22, 2019 6:05 AM

There are some simple rules of thumb that I have used successfully for decades.

1.) Analyze your tasks and find out which priorities can be _lowered_ without harming the total system performance. In this way, the few higher priority tasks should have plenty of execution time, without constantly fighting with low priority tasks for CPU time.

2.) The higher the priority, the shorter the execution time should be. Treat the highest priority tasks like "pseudo interrupts". Move any non-hard-RT functions into the null task, which can consume all CPU time after all high priority tasks have been served.

3.) If some task takes too long to execute for its priority, consider splitting the functionality into two tasks: less time-critical things go to a lower priority task (or even the null task) and the essential things into a small high priority task.

4.) Avoid uncontrolled resource locking. If some resource needs to be locked, use a dedicated high priority transaction handler task with well-defined execution time to handle the transaction from beginning to end.

- David Brown

Tue, Jan 22, 2019 8:19 AM

I have heard the use of watchdogs described as being like hitting a dead man on the head with a hammer in the hope that it will wake him.

Watchdogs with reset can be useful if you have hardware issues - dodgy power supplies, radiation, or something that has a risk of giving an unexpected one-off glitch that stops the system working properly. If the problem is in software, however, it will just lead to the same situation again and again.

For software issues, watchdogs are more about helping you identify and debug problems - they don't fix anything, they just let you know you have a problem.

But in no sense does a watchdog /avoid/ a task not doing its job. At best, it can help you see that the task has not succeeded. There are usually better ways (RTOS or not) to do this, if you think it is necessary.

The only way to avoid a task not being called is to be sure that you design the system properly.

- George Neuner

Tue, Jan 22, 2019 10:01 AM

The "RT" in RTOS stands for "real time".

A watchdog generally tells you that some task either was not run, or was not run on time. It knows this because the task resets the watchdog's timer as part of its operation. If the watchdog expires, you know that the task, for whatever reason, failed to reset it.

So the task running properly avoids the watchdog expiring.
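The mechanism described above can be modelled in a few lines of C. This is only a software stand-in for a hardware watchdog, with hypothetical names (`soft_wdt`, `wdt_kick`, `wdt_tick`); a real watchdog peripheral would count down on its own clock and reset the system instead of setting a flag.

```c
#include <stdint.h>

typedef struct {
    uint32_t timeout;    /* reload value in ticks */
    uint32_t remaining;  /* ticks left before expiry */
    int      expired;    /* latched once the counter reaches zero */
} soft_wdt;

void wdt_init(soft_wdt *w, uint32_t timeout_ticks)
{
    w->timeout = timeout_ticks;
    w->remaining = timeout_ticks;
    w->expired = 0;
}

/* Called by the supervised task as part of its normal operation. */
void wdt_kick(soft_wdt *w)
{
    if (!w->expired) {
        w->remaining = w->timeout;
    }
}

/* Called once per timer tick, independently of the supervised task. */
void wdt_tick(soft_wdt *w)
{
    if (w->expired) {
        return;
    }
    if (w->remaining > 0) {
        --w->remaining;
    }
    if (w->remaining == 0) {
        w->expired = 1;  /* the task failed to kick in time */
    }
}
```

As long as the task calls `wdt_kick` more often than every `timeout` ticks, `expired` stays 0. Once it is set, the model only records that the task did not run in time; it does nothing to make the task run.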

From another point of view: a watchdog could be designed to start a particular task when it expires. In this case execution of that task could be avoided by something else continually resetting the watchdog.

If neither of these answers your question, then I don't understand what you're asking.

George

- Hans-Bernhard Bröker

Tue, Jan 22, 2019 9:10 PM

That's overstating it a teeny little bit. ;-)

It's more like hitting a newly dead heart with a hefty jolt of electricity to possibly make it restart --- a procedure that is quite definitely not recommended to be used on a non-dead one.

And just like the bite of a watchdog, that sometimes actually does work.

That's by no means certain. It all depends on how it happened that the software got itself stuck in a situation that didn't occur during testing (or the software would never have been released into the wild, right?) But somehow, right now it did.

If, e.g., a once-in-a-blue-moon "forbidden" exceedance of design limits on some input was the reason, a watch-dog reset cures the problem until the next event of that kind --- i.e. possibly forever.

- Dave Nadler

Tue, Jan 22, 2019 9:28 PM

I just returned from a presentation by Don Eyles about the Apollo missions. He discussed how the watchdog kept resetting the LEM computer during the Apollo 11 descent. Even more scary was the Astronaut-hand-applied patch during Apollo 14 to circumvent an intermittent short in the LEM "Abort" button. Don't get too concerned, just press on...

Looking forward to reading his memoir (had to buy a copy)!

See ya, Dave

PS: Now, how's THAT for thread drift?

- David Brown

Wed, Jan 23, 2019 7:57 AM

Perhaps. But like a defibrillator, the watchdog does nothing to deal with the actual cause of the problem.

We can say without doubt that the watchdog does not cure the problem. If software causes a hang that triggers the watchdog, there is a bug in the software. That applies regardless of how it happened, how good or bad testing you had, what the input values were, etc. (Note that if something exceeded /specified/ design limitations, then that is outside the realm of the software.) Since the watchdog does not magically fix the software, the problem remains.

Clearly, not all systems need the same level of quality, reliability, and robustness. You don't design and test your "amusing" singing birthday card to the same levels as you do for your submarine control system. And so sometimes, a watchdog reset on software hang is a good enough way to handle the symptoms of some kinds of software bugs. You balance the cost of the unreliability against the cost of fixing it - engineering is about making things good enough, not perfect.

But you do need to be aware of what the watchdog actually does, and does not do. Some developers use it as a crutch to avoid the effort of writing correct code, or testing appropriately. "If there is an error on the communication line, it will lead to a timeout - the watchdog will reset the system, so that's fine." "The tasks will only have a conflict and a deadlock if the user presses the button at the same time as the screen is updating - that is unlikely to happen, and the watchdog will fix it if it does". Some use it as a crutch to avoid debugging and fixing problems. "The software hung during testing, but the watchdog restarted it fine. We don't think it will happen at the customer's site."

Or "A watch-dog can avoid a task not called in a real-time system", as the OP claimed.

So that does /not/ mean I don't recommend a watchdog (though frequently I do not enable them - I'd rather the customer reported the problem so we can fix it properly). You just have to know /why/ you have a watchdog, and use it appropriately.

- Paul Rubin

Wed, Jan 23, 2019 9:13 PM

Obligatory:

formatting link

- David Brown

Thu, Jan 24, 2019 7:31 AM

We used a Playstation controller for a whole submarine, not just a periscope. (It was an ROV - remote operated vehicle. So no people in it.)

- StateMachineCOM

Mon, Jan 28, 2019 3:17 PM

A watchdog timer is really a hardware-assisted, time-based assertion in the code. As such, it is just a part of the larger software development strategy known as Design by Contract (DbC).

The value of identifying the watchdog timer as an *assertion* is that it informs you what to expect from it. For example, you can't expect an assertion to "avoid" or "fix" a problem (like in the OP "avoid a task not executed"). This is because assertions neither handle nor prevent errors, in the same way as fuses in electrical circuits don't prevent accidents or abuse. In fact, a fuse is an intentionally introduced weak spot in the circuit that is designed to fail sooner than anything else, so actually the whole circuit with a fuse is less robust than without it.

Now, regarding using watchdog timers in the context of an RTOS: you should service the watchdog from the context of the task. A common mistake is to service a watchdog from a periodic timer service. RTOS timers typically run in the ISR context, so they might be running and being serviced, while the task is starving. Another mistake along these lines is to service a watchdog from various RTOS callbacks, also known as "hooks", which might also run in a different context than your task.
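With several tasks and one hardware watchdog, one common arrangement (an assumption here, not something the post spells out) is for each task to check in from its own loop, and for the hardware watchdog to be serviced only when every task has checked in. The task IDs and function names below are hypothetical, and `hw_watchdog_kick` is a stand-in for writing the real reload register.

```c
#include <stdint.h>

#define TASK_SENSOR  (1u << 0)  /* hypothetical task IDs */
#define TASK_CONTROL (1u << 1)
#define TASK_COMMS   (1u << 2)
#define ALL_TASKS    (TASK_SENSOR | TASK_CONTROL | TASK_COMMS)

static volatile uint32_t checked_in;  /* one bit per supervised task */
static unsigned hw_kicks;             /* counts simulated hardware kicks */

/* Stand-in for writing the real watchdog reload register. */
static void hw_watchdog_kick(void)
{
    ++hw_kicks;
}

/* Called by each task from its own loop, i.e. in task context.
 * On a real RTOS this read-modify-write needs a critical section
 * or an atomic operation. */
void task_check_in(uint32_t task_bit)
{
    checked_in |= task_bit;
}

/* Called from a monitor task, not from a timer ISR or an RTOS hook.
 * Services the hardware only if every task has checked in since the
 * last service. Returns 1 if the watchdog was serviced, 0 otherwise. */
int monitor_service_watchdog(void)
{
    if ((checked_in & ALL_TASKS) != ALL_TASKS) {
        return 0;  /* at least one task is starving or stuck */
    }
    checked_in = 0;
    hw_watchdog_kick();
    return 1;
}
```

If any single task stops checking in, the monitor stops servicing the hardware and the watchdog expires, which is the behavior a timer-ISR kick would hide.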

Once you use a watchdog timer, you need to carefully design (and test!) the behavior of the system when the watchdog expires. Here again, identifying the watchdog as an assertion helps, because you can use your general strategy of handling failed assertions. I've written more about this in the blog post "A nail for a fuse".

I am always amazed by embedded designs, where developers go to great lengths to apply memory protection (MPU or MMU) or watchdogs, while at the same time they don't sprinkle their code with basic code assertions that perform rudimentary sanity checks.

Even more bizarre to me is when developers use assertions, but *disable* them in the production release (while keeping the MPU and the watchdogs.) I'm sure the readers of this forum never do such an illogical thing, and always ship the products with carefully designed assertions, right?

- A.P.Richelieu

Mon, Jan 28, 2019 7:15 PM

On 2019-01-28 at 16:17, StateMachineCOM wrote:

Assertions are there to check that your code is sane. They are designed to be removed in production code.

Assertions are not the same thing as checking your input. You definitely need to check your input, but once validated, it does not need revalidation. If the input is not valid, an intelligent handling/recovery of the erroneous output is preferred over some rough action generated by an assertion failure.

- StateMachineCOM

Mon, Jan 28, 2019 8:31 PM

Absolutely. You need to very carefully distinguish between the erroneous behavior (a.k.a. bug) and exceptional condition, which is rare but can arise legitimately. Assertions are for errors. I've written specifically about it in the Dr. Dobb's article "An Exception or a Bug?"

I'm exactly challenging this beaten-path point of view, because it suggests to stop checking the sanity of the production code. This would work if *all* errors are completely removed during debugging. Are they really removed in YOUR code?

And also, relevant for the OP, are you really suggesting to leave the watchdog in the production code while disabling other assertions? If so, WHY?

I'm looking forward to interesting discussion...

- Phil Hobbs

Mon, Jan 28, 2019 9:22 PM

A generally very sensible article.

I'm all for having error checking in production code, but I don't call those 'assertions'. I don't like the idea of leaving _assertions_ in, though, because (a) abort() or a hard reset is a mighty big hammer to apply that broadly, and (b) it deprives me of a very useful facility for debugging, because I can't use as many of them as I want if they all have to be left in the production builds.

I have a few macros like yours that supply a finer-grained set of options.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs 
Principal Consultant 
ElectroOptical Innovations LLC / Hobbs ElectroOptics 
Optics, Electro-optics, Photonics, Analog Electronics 
Briarcliff Manor NY 10510 

http://electrooptical.net 
http://hobbs-eo.com

- Clifford Heath

Mon, Jan 28, 2019 11:17 PM

We had a set of assert macros that would abort in the test environment, but return an error code when run in production so the caller needed to explicitly ignore or handle the error condition. That gives you proper feedback during testing but proper error handling in prod.
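The original macros are not shown in the thread, so the following is only a guess at the general shape: a check that aborts when a hypothetical `TEST_BUILD` symbol is defined and returns an error code otherwise. `CHECK_OR_RETURN`, `ERR_CONTRACT`, and `set_duty_cycle` are illustrative names.

```c
#include <stdio.h>
#include <stdlib.h>

#define ERR_OK       0
#define ERR_CONTRACT (-1)

#ifdef TEST_BUILD
/* Test environment: report the failed condition and stop immediately. */
#define CHECK_OR_RETURN(cond)                                    \
    do {                                                         \
        if (!(cond)) {                                           \
            fprintf(stderr, "%s:%d: check failed: %s\n",         \
                    __FILE__, __LINE__, #cond);                  \
            abort();                                             \
        }                                                        \
    } while (0)
#else
/* Production: hand the failure to the caller as an error code. */
#define CHECK_OR_RETURN(cond)                                    \
    do {                                                         \
        if (!(cond)) {                                           \
            return ERR_CONTRACT;                                 \
        }                                                        \
    } while (0)
#endif

/* Example function using the macro; it must return int. */
int set_duty_cycle(int percent, int *out)
{
    CHECK_OR_RETURN(out != NULL);
    CHECK_OR_RETURN(percent >= 0 && percent <= 100);
    *out = percent;
    return ERR_OK;
}
```

In a production build, `set_duty_cycle(150, &d)` returns `ERR_CONTRACT` and leaves `d` untouched; what happens next depends entirely on the caller checking that return value.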

Clifford Heath.

- Phil Hobbs

Tue, Jan 29, 2019 3:08 PM

I'm talking mostly about things like enforcing class invariants and so on. Putting those in inline functions, for instance, can be a big performance and code size hit, and once testing is done, you can be pretty sure they won't fire in production.

Memory corruption, null pointers, deadlocks, etc. definitely have to have run time checks. So it's nice to leave assert() for debug and roll your own macro set for runtime. That way you can have the fault tolerance of defensive programming without hiding bugs. (Maguire is still a good read.)
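A sketch of what that split could look like, with the caveat that the post does not show its macros: standard `assert()` covers invariants and disappears when `NDEBUG` is defined, while a hypothetical `RUNTIME_CHECK` macro stays in every build and reports through a fault hook. `on_runtime_fault`, `last_fault`, and `buffer_read` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

static const char *last_fault;  /* text of the most recent failed check */

/* Hypothetical production fault hook. Here it only records the failed
 * condition; a real target might log it and then reset. */
void on_runtime_fault(const char *what)
{
    last_fault = what;
}

/* Always compiled in: evaluates to 1 on success, or reports and yields 0. */
#define RUNTIME_CHECK(cond) ((cond) ? 1 : (on_runtime_fault(#cond), 0))

int buffer_read(const unsigned char *buf, size_t len, size_t idx,
                unsigned char *out)
{
    assert(len > 0);  /* debug-only invariant, removed by -DNDEBUG */

    if (!RUNTIME_CHECK(buf != NULL && out != NULL)) {
        return -1;
    }
    if (!RUNTIME_CHECK(idx < len)) {
        return -1;
    }
    *out = buf[idx];
    return 0;
}
```

The debug-only checks can be used as liberally as needed because they cost nothing in release builds, while the small set of always-on checks covers the faults that must be caught in the field.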

Most of my code is embedded or else console-mode simulations, so I don't really do a lot of error recovery.

Cheers

Phil Hobbs

- StateMachineCOM

Tue, Jan 29, 2019 5:30 PM

Seriously? Do you really believe that the error codes are checked and proper actions taken in *all* cases? Isn't this just kicking the can down the road and into some other code, which is ill-prepared to "handle" your bugs?

I'm not sure what you are proposing by "rolling your own" for production code. What are those "other versions" of assert macros in production code supposed to do?

For the OP, what is your advice specific to watchdog timers? Would you switch the watchdog off for production code? In that case, is it worth implementing a watchdog only for debugging?

On the other hand, if you recommend keeping the watchdog in production code, why do you choose the watchdog and suppress other assertions? What's so special about the watchdog, and what should be done when the watchdog expires in production code?

The main point remains: Bugs don't miraculously go away just because you stop checking for them. Do they?

Miro Samek state-machine.com