Hello group!
When running our application at full load, we see occasional delays of 100 ms. Above 80% CPU load they appear very often, about once every few seconds. At lower load the delays still appear, but less often.

Our investigation shows that the delays occur in semop() when acquiring a semaphore. This semaphore protects a critical section that is executed very intensively (~4200 times/s). Further investigation led us to conclude that the delays are caused by the round-robin scheduler: sched_rr_get_interval() returns a period of exactly 100 ms, and the application processes run at RT priorities with the SCHED_RR policy.

When a process exhausts its time slice, the scheduler preempts it for one RR interval. If this happens while the process holds the semaphore, nobody can release it for 100 ms. In the meantime, many other processes try to execute the same critical section and block. As a consequence, all other processes, including those with higher priority, are blocked in the middle of their own transitions (they are supposed to process messages from their message queues). This results in full queues (>1000 messages) for the blocked processes, which then have to run for longer periods to drain their queues, and that in turn gives them a good chance of being preempted by the RR scheduler again. The circle is closed.

When the scheduling policy is changed to SCHED_FIFO, none of this happens any more, but the application then runs more in bursts, which is not desirable.

Do you agree with our findings above? Do you have any suggestions for how to prevent the described problem while keeping the SCHED_RR policy?
Best regards,
Sani