How I tripped over sched_rt_runtime_us

How I Tripped Over sched_rt_runtime_us or

Don't Try to Control What You Can't Measure

I recently spent almost two weeks banging my head on the wall trying to figure out why some software I was working on was failing. There were a number of SCHED_FIFO tasks triggered by timers. They ran with periods of typically 4,8,10 and 20 ms. Each woke up based on a periodic timerfd and ran for a few hundred microseconds. Most times when I would start the program, everything ran fine -- and would run fine indefinitely.

But, about 1 time out of 20 maybe, I would start up the programs, and they just wouldn't work right. Hardware devices would repeatedly time-out because they weren't being serviced. There would be gaps in message streams, etc.

After poking about for a few days, I determined that a thread that was supposed to be waking up every 4ms (for example), would run fine for "a while" and then just "not run" for many tens of ms. Sometimes it just didn't wake up, and sometimes it got suspended while it was running. When running, it wasn't doing anything that could block. When it got suspended while running, the location where it got suspended was pretty much random.

After instrumenting code a bit more, I determined that the most common failure state resulted in all my SCHED_FIFO threads being suspended for 50ms once every second. _Exactly_ 50ms, _exactly_ once every second.

Changing to SCHED_RR didn't change anything. Rearranging the priorities didn't change anything. However when I changed one of the threads to SCHED_OTHER, it started running without interruption.

The SCHED_OTHER thread was running _while_ SCHED_FIFO threads were being suspended "against their will".

I had read about the real-time throttling feature in the kernel, but it was only supposed to kick in when realtime tasks were using up 95% of CPU. I had just measured my CPU usage, and the realtime tasks were barely using 5% -- not anywhere close to the 95% limit.

But still, SCHED_FIFO threads being suspended for 50ms every 1s while SCHED_OTHER threads continue to run. That sure smells like somebody is intentionally blocking all realtime tasks 5% of the time.

Then it dawned on my that (at least on the ARM9), the mechanism used by the kernel to measure per-thread CPU usage is based on the assumption that tasks execute _randomly_ and their execution times are not correlated with kernel timers.

Mine _are_ controlled by kernel timers.

If you're running threads triggered by timers, then the kernel's CPU usage numbers are complete garbage[*]. I knew this, and in the past had to work out my own CPU usage measurement schemes because of it.

_If_ a task is a realtime task, _and_ it happens to be in phase with the CPU usage sampling, then the realtime usage looks like it's 100% and the sched_rt_runtime_us limiter kicks in and locks out all realtime threads for 50ms once every 1s. In this case, that breaks all sorts of seemingly unrelated stuff.

The moral to the story: if you're are using an architecture with without _real_ per-thread CPU usage measurements and you are triggering realtime tasks using timers, you _must_ disable the realtime limiter by setting sched_rt_runtime_us to -1 because the limiter is being fed CPU measurements that are, in effect, random.

If you have a thread that uses 2% of the CPU time, but it's synchronized with the CPU usage sampling, then it will usually appear to use either 0% or 100% depending on the phase difference between the sampling and the task's execution. If the phase difference is _just_ right you will see a number that varies wildly between 0 and 100.

How I tripped over sched_rt_runtime_us

Join the Discussion

Didn't find your answer?