Re: call_rcu data race patch

* Re: call_rcu data race patch
       [not found] <20210917191555.GA2198@bender.morinfr.org>
@ 2021-09-17 21:11 ` Paul E. McKenney
  2021-09-17 21:34   ` Guillaume Morin
  0 siblings, 1 reply; 17+ messages in thread
From: Paul E. McKenney @ 2021-09-17 21:11 UTC (permalink / raw)
  To: Guillaume Morin; +Cc: linux-kernel

On Fri, Sep 17, 2021 at 09:15:57PM +0200, Guillaume Morin wrote:
> Hello Paul,
> 
> I've been researching some RCU warnings we see that lead to full lockups
> with longterm 5.x kernels.
> 
> Basically the rcu_advance_cbs() == true warning in
> rcu_advance_cbs_nowake() is firing then everything eventually gets
> stuck on RCU synchronization because the GP thread stays asleep while
> rcu_state.gp_flags & 1 == 1 (this is a bunch of nohz_full cpus)
> 
> During that search I found your patch from July 12th
> https://www.spinics.net/lists/rcu/msg05731.html that seems related (all
> warnings we've seen happened in the __fput call path). Is there a reason
> this patch was not pushed? Is there an issue with this patch or did it
> fall just through the cracks?

It is still in -rcu:

2431774f04d1 ("rcu: Mark accesses to rcu_state.n_force_qs")

It is slated for the v5.16 merge window.  But does it really fix the
problem that you are seeing?

> Thanks in advance for your help,
> 
> Guillaume.
> 
> PS: FYI during my research, I've found another similar report in bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=208685

Huh.  First I have heard of it.  It looks like they hit this after about
nine days of uptime.  I have run way more than nine days of testing of
nohz_full RCU operation with rcutorture, and have never seen it myself.

Can you reproduce this?  If so, can you reproduce it on mainline kernels
(as opposed to -stable kernels as in that bugzilla)?

The theory behind that WARN_ON_ONCE() is as follows:

o	The check of rcu_seq_state(rcu_seq_current(&rnp->gp_seq))
	says that there is a grace period either in effect or just
	now ending.

o	In the latter case, the grace-period cleanup has not yet
	reached the current rcu_node structure, which means that
	it has not yet checked to see if another grace period
	is needed.

o	Either way, the RCU_GP_FLAG_INIT will cause the next grace
	period to start.  (This flag is protected by the root
	rcu_node structure's ->lock.)

Again, can you reproduce this, especially in mainline?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 17+ messages in thread