linux-rt-users.vger.kernel.org archive mirror
* rcu stall caused by rt task with high minor page fault rate
@ 2023-05-30 10:51 Kegl Rohit
  2023-08-18  9:42 ` Sebastian Andrzej Siewior
From: Kegl Rohit @ 2023-05-30 10:51 UTC (permalink / raw)
  To: linux-rt-users

Hello!

Running 5.10.104-rt63 SMP PREEMPT_RT on a dual-core imx7d.

Currently I am debugging an RCU stall issue caused by a user-space RT task.

The test procedure to reproduce the issue is:
1. Boot up the system.
2. initd starts the RT task application (SCHED_FIFO, priority -13,
affinity set to core0).
3. Log in via ssh and start, e.g., memtester with the maximum amount
of free RAM available.
4. memtester locks its memory with mlock successfully.
5. After some time, the RT task gets stuck consuming 100% system time
on core0.
6. The kernel produces RCU stall warnings because the RCU kthread does
not get any CPU time on core0.

The VM statistics of the RT thread show a minor page fault rate of
more than 350k/s.
So the process is stuck in memory handling, and because of the core
binding, the RCU kthread does not get any CPU time on core0 and RCU
stall warnings are produced.

According to https://wiki.linuxfoundation.org/realtime/documentation/technical_details/rcu
CONFIG_RCU_BOOST=y should be the solution for such issues.
However, enabling RCU_BOOST did not change anything.

See link above:
> However, bugs can happen, including bugs involving infinite loops in high-priority real-time threads. Debugging these problems is more difficult if the system keeps hanging due to OOM. One way to ease debugging is to build with CONFIG_RCU_BOOST=y,

The main cause of the minor page faults is the missing mlock in the
application.
mlock is always necessary for RT apps.

But as I understand it, shouldn't RCU_BOOST help here, even if the RT
app is not implemented correctly?

Thanks in advance!


* Re: rcu stall caused by rt task with high minor page fault rate
@ 2023-08-18  9:42 ` Sebastian Andrzej Siewior
From: Sebastian Andrzej Siewior @ 2023-08-18  9:42 UTC (permalink / raw)
  To: Kegl Rohit; +Cc: linux-rt-users

On 2023-05-30 12:51:48 [+0200], Kegl Rohit wrote:
> Hello!
Hi,

> But for my understanding RCU_BOOST should help here, even if the rt
> app is not implemented correctly?

RCU_BOOST helps move a task out of its RCU read-side critical section
(rcu_read_lock() ... rcu_read_unlock()) if it got preempted there.
Otherwise the RCU machinery can't make progress and free memory after a
grace period. If a task with low priority is stuck in such a section,
then RCU_BOOST helps move it out. For this to work, the high-priority
task has to get into an RCU section (usually not a problem, but that
wouldn't happen if a task loops 100% in userland).

From what you describe, your RCU warning probably came from the fact
that your application is consuming 100% of the CPU. Your RT task needs
to rest from time to time so that other kernel threads get their turn.

By enabling callback offloading (CONFIG_RCU_NOCB_CPU=y plus the
rcu_nocbs= boot parameter) and moving that RCU thread to another CPU,
you ensure that the completion callbacks get invoked.
Then you need to ensure that RCU itself can make progress. This is
probably what triggers here. There is another thread for that, but
this one can only be offloaded with CONFIG_NOHZ_FULL.

> Thanks in advance!

Sebastian

