From: "Paul E. McKenney" <paulmck@kernel.org>
To: donghai qiao <donghai.w.qiao@gmail.com>
Cc: rcu@vger.kernel.org
Subject: Re: RCU: rcu stall issues and an approach to the fix
Date: Thu, 22 Jul 2021 13:41:11 -0700
Message-ID: <20210722204111.GB4397@paulmck-ThinkPad-P17-Gen-1>
In-Reply-To: <CAOzhvcPb=NMCHCqyDiyWiX9DjR=RXgdQRfJDkJOdazugHe7+Zw@mail.gmail.com>

On Thu, Jul 22, 2021 at 04:08:06PM -0400, donghai qiao wrote:
> RCU experts,
> 
> When you reply, please also keep me CC'ed.
> 
> The problem of RCU stall might be an old problem and it can happen quite often.
> As I have observed, when the problem occurs, there is at least one CPU in the
> system whose rdp->gp_seq falls behind the others by 4 (qs).
> 
> e.g.  On CPU 0, rdp->gp_seq = 0x13889d, but on other CPUs, their
> rdp->gp_seq = 0x1388a1.

CPUs that are idle for long periods of time can fall much farther behind.
I have seen systems with CPUs having rdp->gp_seq thousands of grace
periods behind.  So yes, this can happen when there are RCU CPU stall
warnings, but it can also happen other ways.
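
For reference, here is a minimal sketch of how the two values quoted above
decode. It assumes the layout in kernel/rcu/rcu.h, where the low two bits of
gp_seq hold grace-period state (RCU_SEQ_CTR_SHIFT is 2) and the remaining
bits count grace periods. The helper names below are hypothetical userspace
stand-ins, not kernel APIs. Under that assumption, a raw difference of 4
between 0x13889d and 0x1388a1 corresponds to one grace period.

```c
/* Mirrors RCU_SEQ_CTR_SHIFT and RCU_SEQ_STATE_MASK from kernel/rcu/rcu.h. */
#define SEQ_CTR_SHIFT  2
#define SEQ_STATE_MASK ((1UL << SEQ_CTR_SHIFT) - 1)

/* Grace-period counter: everything above the two state bits. */
static inline unsigned long gp_seq_counter(unsigned long s)
{
	return s >> SEQ_CTR_SHIFT;
}

/* State bits: nonzero while a grace period is in progress. */
static inline unsigned long gp_seq_state(unsigned long s)
{
	return s & SEQ_STATE_MASK;
}

/* Number of whole grace periods by which "lagging" trails "current". */
static inline unsigned long gp_seq_periods_behind(unsigned long lagging,
						  unsigned long current)
{
	return gp_seq_counter(current) - gp_seq_counter(lagging);
}
```

Calling gp_seq_periods_behind(0x13889d, 0x1388a1) with the values from the
report returns 1, and both values have state bits equal to 1.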

> Because RCU stall issues can last a long period of time, the number of callbacks
> in the list rdp->cblist of all CPUs can accumulate to thousands. In the worst
> case, it triggers panic.

That is quite true.

> When looking into the problem further, I'd think the problem is related to the
> Linux scheduler. When the RCU core detects the stall on a CPU, rcu_gp_kthread
> would send a rescheduling request via send_IPI to that CPU to try to force a
> context switch to make some progress. However, at least one situation can
> defeat this effort: when the CPU is running a user thread that is the only
> user thread in the rq, the attempted context switch will not happen
> immediately. In particular, if the system is also configured with NOHZ_FULL
> for that CPU, then as long as the user thread is running, the forced context
> switch will never happen unless the user thread voluntarily yields the CPU.
> I think this
> should be one of the major root causes of these RCU stall issues. Even if
> NOHZ_FULL is not configured, there will be at least 1 tick delay which can
> affect the realtime kernel, by the way.
> 
> But it seems not a good idea to craft a fix from the scheduler side because
> this has to invalidate some existing scheduling optimizations. The current
> scheduler is deliberately optimized to avoid such context switching.  So my
> question is why the RCU core cannot effectively update qs for the stalled CPU
> when it detects that the stalled CPU is running a user thread?  The reason
> is pretty obvious because when a CPU is running a user thread, it must not
> be in any kernel read-side critical sections. So it should be safe to close
> its current RCU grace period on this CPU. Also, with this approach we can make
> RCU work more efficiently than the approach of context switch which needs to
> go through an IPI interrupt and the destination CPU needs to wake up its
> ksoftirqd or wait for the next scheduling cycle.
> 
> If my suggested approach makes sense, I can go ahead to fix it that way.

If you have not yet read through Documentation/RCU/stallwarn.rst,
please do so.  There are many potential underlying causes of RCU CPU
stall warnings, most of which are simply bugs that need to be fixed.
One common example bug is a very long in-kernel loop that is missing
a cond_resched() -- it is after all hard to provide 500-millisecond
response time when your kernel has a 21-second tight loop.
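
To make that example concrete, here is a minimal userspace sketch of the
pattern, not actual kernel code. cond_resched_standin() is a hypothetical
stand-in for the kernel's cond_resched(); it uses sched_yield() and a counter
purely for illustration, and process_items() is likewise an invented name.
The point is only that a long loop offers a rescheduling point on every pass.

```c
#include <sched.h>
#include <stddef.h>

/* Counts how many rescheduling points the loop has offered so far. */
static unsigned long resched_points;

/*
 * Hypothetical userspace stand-in for the kernel's cond_resched().
 * In the kernel, cond_resched() lets the scheduler run if needed.
 */
static void cond_resched_standin(void)
{
	resched_points++;
	sched_yield();
}

/* A long-running loop that offers a rescheduling point on each iteration. */
static unsigned long process_items(const unsigned int *items, size_t n)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		sum += items[i];
		cond_resched_standin();	/* the call a buggy loop would be missing */
	}
	return sum;
}
```

Without the call inside the loop, a large enough n would keep the CPU in the
loop with no opportunity to reschedule, which is the kind of bug described
above.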

Now, if you are seeing RCU CPU stall warnings for no apparent reason,
let's take a look and find the root cause.

							Thanx, Paul
