rcu.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@kernel.org>
To: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: josh@joshtriplett.org, joel@joelfernandes.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	jiangshanlai@gmail.com, linux-kernel@vger.kernel.org,
	rcu@vger.kernel.org, pkondeti@codeaurora.org,
	prsood@codeaurora.org, gkohli@codeaurora.org
Subject: Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
Date: Tue, 19 Nov 2019 12:06:47 -0800	[thread overview]
Message-ID: <20191119200647.GC2889@paulmck-ThinkPad-P72> (raw)
In-Reply-To: <20191119150931.GX2889@paulmck-ThinkPad-P72>

On Tue, Nov 19, 2019 at 07:09:31AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 19, 2019 at 07:03:14AM +0000, Neeraj Upadhyay wrote:
> > Hi Paul,
> > 
> > On 11/19/2019 9:35 AM, Paul E. McKenney wrote:
> > > On Tue, Nov 19, 2019 at 03:35:15AM +0000, Neeraj Upadhyay wrote:
> > > > Hi Paul,
> > > > 
> > > > On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
> > > > > On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
> > > > > > Hi Paul,
> > > > > > 
> > > > > > 
> > > > > > On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
> > > > > > > On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
> > > > > > > > Hi Paul,
> > > > > > > > 
> > > > > > > > On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
> > > > > > > > > On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
> > > > > > > > > > For the tasks waiting in exp_wq inside exp_funnel_lock(),
> > > > > > > > > > there is a chance that they might be indefinitely blocked
> > > > > > > > > > in below scenario:
> > > > > > > > > > 
> > > > > > > > > > 1. There is a task waiting on exp sequence 0b'100' inside
> > > > > > > > > >        exp_funnel_lock().
> > > > > > > > > > 
> > > > > > > > > >        _synchronize_rcu_expedited()
> > > > > > > > > 
> > > > > > > > > This symbol went away a few versions back, but let's see how this
> > > > > > > > > plays out in current -rcu.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Sorry; for us this problem is observed on 4.19 stable version; I had
> > > > > > > > checked against the -rcu code, and the relevant portions were present
> > > > > > > > there.
> > > > > > > > 
> > > > > > > > > >          s = 0b'100
> > > > > > > > > >          exp_funnel_lock()
> > > > > > > > > >            wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
> > > > > > > > > 
> > > > > > > > > All of the above could still happen if the expedited grace
> > > > > > > > > period number was zero (or a bit less) when that task invoked
> > > > > > > > 
> > > > > > > > Yes
> > > > > > > > 
> > > > > > > > > synchronize_rcu_expedited().  What is the relation, if any,
> > > > > > > > > between this task and "task1" below?  Seems like you want them to
> > > > > > > > > be different tasks.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > This task is the one which is waiting for the expedited sequence, which
> > > > > > > > "task1" completes ("task1" holds the exp_mutex for it). "task1" would
> > > > > > > > wake up this task, on exp GP completion.
> > > > > > > > 
> > > > > > > > > Does this task actually block, or is it just getting ready
> > > > > > > > > to block?  Seems like you need it to have actually blocked.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Yes, it actually blocked in wait queue.
> > > > > > > > 
> > > > > > > > > > 2. The Exp GP completes and task (task1) holding exp_mutex queues
> > > > > > > > > >        worker and schedules out.
> > > > > > > > > 
> > > > > > > > > "The Exp GP" being the one that was initiated when the .expedited_sequence
> > > > > > > > > counter was zero, correct?  (Looks that way below.)
> > > > > > > > > 
> > > > > > > > Yes, correct.
> > > > > > > > 
> > > > > > > > > >        _synchronize_rcu_expedited()
> > > > > > > > > >          s = 0b'100
> > > > > > > > > >          queue_work(rcu_gp_wq, &rew.rew_work)
> > > > > > > > > >            wake_up_worker()
> > > > > > > > > >              schedule()
> > > > > > > > > > 
> > > > > > > > > > 3. kworker A picks up the queued work and completes the exp gp
> > > > > > > > > >        sequence.
> > > > > > > > > > 
> > > > > > > > > >        rcu_exp_wait_wake()
> > > > > > > > > >          rcu_exp_wait_wake()
> > > > > > > > > >            rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
> > > > > > > > > >                                    // to 0b'100'
> > > > > > > > > > 
> > > > > > > > > > 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
> > > > > > > > > >        and releases exp_mutex.
> > > > > > > > > > 
> > > > > > > > > >        wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > > > > > > > > >          sync_exp_work_done(rsp, s));
> > > > > > > > > >        mutex_unlock(&rsp->exp_mutex);
> > > > > > > > > 
> > > > > > > > > So task1 is the one that initiated the expedited grace period that
> > > > > > > > > started when .expedited_sequence was zero, right?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Yes, right.
> > > > > > > > 
> > > > > > > > > > 5. Next exp GP completes, and sequence number is incremented:
> > > > > > > > > > 
> > > > > > > > > >        rcu_exp_wait_wake()
> > > > > > > > > >          rcu_exp_wait_wake()
> > > > > > > > > >            rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
> > > > > > > > > > 
> > > > > > > > > > 6. As kworker A uses current expedited_sequence, it wakes up workers
> > > > > > > > > >        from wrong wait queue index - it should have worken wait queue
> > > > > > > > > >        corresponding to 0b'100' sequence, but wakes up the ones for
> > > > > > > > > >        0b'200' sequence. This results in task at step 1 indefinitely blocked.
> > > > > > > > > > 
> > > > > > > > > >        rcu_exp_wait_wake()
> > > > > > > > > >          wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > > > > > > > 
> > > > > > > > > So the issue is that the next expedited RCU grace period might
> > > > > > > > > have completed before the completion of the wakeups for the previous
> > > > > > > > > expedited RCU grace period, correct?  Then expedited grace periods have
> > > > > > > > 
> > > > > > > > Yes. Actually from the ftraces, I saw that next expedited RCU grace
> > > > > > > > period completed while kworker A was in D state, while waiting for
> > > > > > > > exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
> > > > > > > > its wake_up_all() call; so, task (point 1) was never woken up, as it was
> > > > > > > > waiting on wq index 1.
> > > > > > > > 
> > > > > > > > > to have stopped to prevent any future wakeup from happening, correct?
> > > > > > > > > (Which would make it harder for rcutorture to trigger this, though it
> > > > > > > > > really does have code that attempts to trigger this sort of thing.)
> > > > > > > > > 
> > > > > > > > > Is this theoretical in nature, or have you actually triggered it?
> > > > > > > > > If actually triggered, what did you do to make this happen?
> > > > > > > > 
> > > > > > > > This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
> > > > > > > > another instance in Nov 2018 (on 4.14 kernel), in our customer reported
> > > > > > > > issues. Both instances were in downstream drivers and we didn't have RCU
> > > > > > > > traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
> > > > > > > > enabled, where it was observed in suspend scenario, where we are observing
> > > > > > > > "DPM device timeout" [1], as scsi device is stuck in
> > > > > > > > _synchronize_rcu_expedited().
> > > > > > > > 
> > > > > > > > schedule+0x70/0x90
> > > > > > > > _synchronize_rcu_expedited+0x590/0x5f8
> > > > > > > > synchronize_rcu+0x50/0xa0
> > > > > > > > scsi_device_quiesce+0x50/0x120
> > > > > > > > scsi_bus_suspend+0x70/0xe8
> > > > > > > > dpm_run_callback+0x148/0x388
> > > > > > > > __device_suspend+0x430/0x8a8
> > > > > > > > 
> > > > > > > > [1]
> > > > > > > > https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
> > > > > > > > 
> > > > > > > > > What have you done to test the change?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > I have given this for testing; will share the results . Current analysis
> > > > > > > > and patch is based on going through ftrace and code review.
> > > > > > > 
> > > > > > > OK, very good.  Please include the failure information in the changelog
> > > > > > > of the next version of this patch.
> > > > 
> > > > Done.
> > > > 
> > > > > > > 
> > > > > > > I prefer your original patch, that just uses "s", over the one below
> > > > > > > that moves the rcu_exp_gp_seq_end().  The big advantage of your original
> > > > > > > patch is that it allow more concurrency between a consecutive pair of
> > > > > > > expedited RCU grace periods.  Plus it would not be easy to convince
> > > > > > > myself that moving rcu_exp_gp_seq_end() down is safe, so your original
> > > > > > > is also conceptually simpler with a more manageable state space.
> > > > 
> > > > The reason for highlighting the alternate approach of doing gp end inside
> > > > exp_wake_mutex is the requirement of 3 wqs. Now, this is a theoretical case;
> > > > please correct me if I am wrong here:
> > > > 
> > > > 1. task0 holds exp_wake_mutex, and is preempted.
> > > 
> > > Presumably after it has awakened the kthread that initiated the prior
> > > expedited grace period (the one with seq number = -4).
> > > 
> > > > 2. task1 initiates new GP (current seq number = 0).
> > > 
> > > Yes, this can happen.
> > > 
> > > > 3. task1 queues worker kworker1 and schedules out.
> > > 
> > > And thus still holds .exp_mutex, but yes.
> > > 
> > > > 4. kworker1 sets exp GP to 1 and waits on exp_wake_mutex
> > > 
> > > And thus cannot yet have awakened task1.
> > > 
> > > > 5. task1 releases exp mutex, w/o entering waitq.
> > > 
> > > So I do not believe that we can get to #5.  What am I missing here?
> > > 
> > 
> > As mentioned in this patch, task1 could have scheduled out after queuing
> > work:
> > 
> > queue_work(rcu_gp_wq, &rew.rew_work)
> >            wake_up_worker()
> >              schedule()
> > 
> > kworker1 runs and picks up this queued work, and sets exp GP to 1 and waits
> > on exp_wake_mutex.
> > 
> > task1 gets scheduled in and checks sync_exp_work_done(rsp, s), which return
> > true and it does not enter wait queue and releases exp_mutex.
> > 
> > wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> >          sync_exp_work_done(rsp, s));
> 
> Well, I have certainly given enough people a hard time about missing the
> didn't-actually-sleep case, so good show on finding one in my code!  ;-)
> 
> Which also explains why deferring the rcu_exp_gp_seq_end() is safe:
> The .exp_mutex won't be released until after it happens, and the
> next manipulation of the sequence number cannot happen until after
> .exp_mutex is next acquired.
> 
> Good catch!  And keep up the good work!!!

And here is the commit corresponding to your earlier patch.  Please let
me know of any needed adjustments.

							Thanx, Paul

------------------------------------------------------------------------

commit 3ec440b52831eea172061c5db3d2990b22904863
Author: Neeraj Upadhyay <neeraju@codeaurora.org>
Date:   Tue Nov 19 11:50:52 2019 -0800

    rcu: Allow only one expedited GP to run concurrently with wakeups
    
    The current expedited RCU grace-period code expects that a task
    requesting an expedited grace period cannot awaken until that grace
    period has reached the wakeup phase.  However, it is possible for a long
    preemption to result in the waiting task never sleeping.  For example,
    consider the following sequence of events:
    
    1.      Task A starts an expedited grace period by invoking
            synchronize_rcu_expedited().  It proceeds normally up to the
            wait_event() near the end of that function, and is then preempted
            (or interrupted or whatever).
    
    2.      The expedited grace period completes, and a kworker task starts
            the awaken phase, having incremented the counter and acquired
            the rcu_state structure's .exp_wake_mutex.  This kworker task
            is then preempted or interrupted or whatever.
    
    3.      Task A resumes and enters wait_event(), which notes that the
            expedited grace period has completed, and thus doesn't sleep.
    
    4.      Task B starts an expedited grace period exactly as did Task A,
            complete with the preemption (or whatever delay) just before
            the call to wait_event().
    
    5.      The expedited grace period completes, and another kworker
            task starts the awaken phase, having incremented the counter.
            However, it blocks when attempting to acquire the rcu_state
            structure's .exp_wake_mutex because step 2's kworker task has
            not yet released it.
    
    6.      Steps 4 and 5 repeat, resulting in overflow of the rcu_node
            structure's ->exp_wq[] array.
    
    In theory, this is harmless.  Tasks waiting on the various ->exp_wq[]
    array will just be spuriously awakened, but they will just sleep again
    on noting that the rcu_state structure's ->expedited_sequence value has
    not advanced far enough.
    
    In practice, this wastes CPU time and is an accident waiting to happen.
    This commit therefore moves the rcu_exp_gp_seq_end() call that officially
    ends the expedited grace period (along with associate tracing) until
    after the ->exp_wake_mutex has been acquired.  This prevents Task A from
    awakening prematurely, thus preventing more than one expedited grace
    period from being in flight during a previous expedited grace period's
    wakeup phase.
    
    Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
    [ paulmck: Added updated comment. ]
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 4433d00a..8840729 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -539,14 +539,13 @@ static void rcu_exp_wait_wake(unsigned long s)
 	struct rcu_node *rnp;
 
 	synchronize_sched_expedited_wait();
-	rcu_exp_gp_seq_end();
-	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
 
-	/*
-	 * Switch over to wakeup mode, allowing the next GP, but -only- the
-	 * next GP, to proceed.
-	 */
+	// Switch over to wakeup mode, allowing the next GP to proceed.
+	// End the previous grace period only after acquiring the mutex
+	// to ensure that only one GP runs concurrently with wakeups.
 	mutex_lock(&rcu_state.exp_wake_mutex);
+	rcu_exp_gp_seq_end();
+	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
 
 	rcu_for_each_node_breadth_first(rnp) {
 		if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s)) {

  reply	other threads:[~2019-11-19 20:06 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-15 17:28 [PATCH] rcu: Fix missed wakeup of exp_wq waiters Neeraj Upadhyay
2019-11-17 21:36 ` Paul E. McKenney
2019-11-18  9:28   ` Neeraj Upadhyay
     [not found]   ` <0101016e7dd7b824-50600058-ab5e-44d8-8d24-94cf095f1783-000000@us-west-2.amazonses.com>
2019-11-18 15:08     ` Paul E. McKenney
2019-11-18 16:41       ` Neeraj Upadhyay
     [not found]       ` <0101016e7f644106-90b40f19-ba9e-4974-bdbe-1062b52222a2-000000@us-west-2.amazonses.com>
2019-11-18 17:24         ` Paul E. McKenney
2019-11-19  3:35           ` Neeraj Upadhyay
     [not found]           ` <0101016e81ba8865-028ac62b-3fcd-481b-b657-be8519c4edf5-000000@us-west-2.amazonses.com>
2019-11-19  4:05             ` Paul E. McKenney
2019-11-19  7:03               ` Neeraj Upadhyay
2019-11-19 15:09                 ` Paul E. McKenney
2019-11-19 20:06                   ` Paul E. McKenney [this message]
2019-11-20  5:09                     ` Neeraj Upadhyay
     [not found]                     ` <eeb7de82-c24b-c914-05fb-683088c9f8e2@codeaurora.org>
2019-11-21  4:10                       ` Paul E. McKenney
2019-11-21  4:36                         ` Neeraj Upadhyay
     [not found]                         ` <71de1f25-75c5-a4bb-49d9-19af5f8ae4fe@codeaurora.org>
2019-11-21 18:55                           ` Paul E. McKenney
2019-11-22  3:44                             ` Neeraj Upadhyay

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191119200647.GC2889@paulmck-ThinkPad-P72 \
    --to=paulmck@kernel.org \
    --cc=gkohli@codeaurora.org \
    --cc=jiangshanlai@gmail.com \
    --cc=joel@joelfernandes.org \
    --cc=josh@joshtriplett.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=neeraju@codeaurora.org \
    --cc=pkondeti@codeaurora.org \
    --cc=prsood@codeaurora.org \
    --cc=rcu@vger.kernel.org \
    --cc=rostedt@goodmis.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).