Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters

From: Neeraj Upadhyay <neeraju@codeaurora.org>
To: paulmck@kernel.org
Cc: josh@joshtriplett.org, joel@joelfernandes.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	jiangshanlai@gmail.com, linux-kernel@vger.kernel.org,
	rcu@vger.kernel.org, pkondeti@codeaurora.org,
	prsood@codeaurora.org, gkohli@codeaurora.org
Subject: Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
Date: Tue, 19 Nov 2019 07:03:14 +0000	[thread overview]
Message-ID: <0101016e8278f225-9df7f507-2b6d-45e4-9c4d-d37141a1c5c6-000000@us-west-2.amazonses.com> (raw)
In-Reply-To: <20191119040533.GS2889@paulmck-ThinkPad-P72>

Hi Paul,

On 11/19/2019 9:35 AM, Paul E. McKenney wrote:
> On Tue, Nov 19, 2019 at 03:35:15AM +0000, Neeraj Upadhyay wrote:
>> Hi Paul,
>>
>> On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
>>> On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
>>>> Hi Paul,
>>>>
>>>>
>>>> On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
>>>>> On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
>>>>>> Hi Paul,
>>>>>>
>>>>>> On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
>>>>>>> On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
>>>>>>>> For the tasks waiting in exp_wq inside exp_funnel_lock(),
>>>>>>>> there is a chance that they might be indefinitely blocked
>>>>>>>> in below scenario:
>>>>>>>>
>>>>>>>> 1. There is a task waiting on exp sequence 0b'100' inside
>>>>>>>>        exp_funnel_lock().
>>>>>>>>
>>>>>>>>        _synchronize_rcu_expedited()
>>>>>>>
>>>>>>> This symbol went away a few versions back, but let's see how this
>>>>>>> plays out in current -rcu.
>>>>>>>
>>>>>>
>>>>>> Sorry; for us this problem is observed on 4.19 stable version; I had
>>>>>> checked against the -rcu code, and the relevant portions were present
>>>>>> there.
>>>>>>
>>>>>>>>          s = 0b'100
>>>>>>>>          exp_funnel_lock()
>>>>>>>>            wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
>>>>>>>
>>>>>>> All of the above could still happen if the expedited grace
>>>>>>> period number was zero (or a bit less) when that task invoked
>>>>>>
>>>>>> Yes
>>>>>>
>>>>>>> synchronize_rcu_expedited().  What is the relation, if any,
>>>>>>> between this task and "task1" below?  Seems like you want them to
>>>>>>> be different tasks.
>>>>>>>
>>>>>>
>>>>>> This task is the one which is waiting for the expedited sequence, which
>>>>>> "task1" completes ("task1" holds the exp_mutex for it). "task1" would
>>>>>> wake up this task, on exp GP completion.
>>>>>>
>>>>>>> Does this task actually block, or is it just getting ready
>>>>>>> to block?  Seems like you need it to have actually blocked.
>>>>>>>
>>>>>>
>>>>>> Yes, it actually blocked in wait queue.
>>>>>>
>>>>>>>> 2. The Exp GP completes and task (task1) holding exp_mutex queues
>>>>>>>>        worker and schedules out.
>>>>>>>
>>>>>>> "The Exp GP" being the one that was initiated when the .expedited_sequence
>>>>>>> counter was zero, correct?  (Looks that way below.)
>>>>>>>
>>>>>> Yes, correct.
>>>>>>
>>>>>>>>        _synchronize_rcu_expedited()
>>>>>>>>          s = 0b'100
>>>>>>>>          queue_work(rcu_gp_wq, &rew.rew_work)
>>>>>>>>            wake_up_worker()
>>>>>>>>              schedule()
>>>>>>>>
>>>>>>>> 3. kworker A picks up the queued work and completes the exp gp
>>>>>>>>        sequence.
>>>>>>>>
>>>>>>>>        rcu_exp_wait_wake()
>>>>>>>>          rcu_exp_wait_wake()
>>>>>>>>            rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>>>>>>>>                                    // to 0b'100'
>>>>>>>>
>>>>>>>> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>>>>>>>>        and releases exp_mutex.
>>>>>>>>
>>>>>>>>        wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>>>>>>>          sync_exp_work_done(rsp, s));
>>>>>>>>        mutex_unlock(&rsp->exp_mutex);
>>>>>>>
>>>>>>> So task1 is the one that initiated the expedited grace period that
>>>>>>> started when .expedited_sequence was zero, right?
>>>>>>>
>>>>>>
>>>>>> Yes, right.
>>>>>>
>>>>>>>> 5. Next exp GP completes, and sequence number is incremented:
>>>>>>>>
>>>>>>>>        rcu_exp_wait_wake()
>>>>>>>>          rcu_exp_wait_wake()
>>>>>>>>            rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
>>>>>>>>
>>>>>>>> 6. As kworker A uses current expedited_sequence, it wakes up workers
>>>>>>>>        from wrong wait queue index - it should have worken wait queue
>>>>>>>>        corresponding to 0b'100' sequence, but wakes up the ones for
>>>>>>>>        0b'200' sequence. This results in task at step 1 indefinitely blocked.
>>>>>>>>
>>>>>>>>        rcu_exp_wait_wake()
>>>>>>>>          wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>>>>>
>>>>>>> So the issue is that the next expedited RCU grace period might
>>>>>>> have completed before the completion of the wakeups for the previous
>>>>>>> expedited RCU grace period, correct?  Then expedited grace periods have
>>>>>>
>>>>>> Yes. Actually from the ftraces, I saw that next expedited RCU grace
>>>>>> period completed while kworker A was in D state, while waiting for
>>>>>> exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
>>>>>> its wake_up_all() call; so, task (point 1) was never woken up, as it was
>>>>>> waiting on wq index 1.
>>>>>>
>>>>>>> to have stopped to prevent any future wakeup from happening, correct?
>>>>>>> (Which would make it harder for rcutorture to trigger this, though it
>>>>>>> really does have code that attempts to trigger this sort of thing.)
>>>>>>>
>>>>>>> Is this theoretical in nature, or have you actually triggered it?
>>>>>>> If actually triggered, what did you do to make this happen?
>>>>>>
>>>>>> This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
>>>>>> another instance in Nov 2018 (on 4.14 kernel), in our customer reported
>>>>>> issues. Both instances were in downstream drivers and we didn't have RCU
>>>>>> traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
>>>>>> enabled, where it was observed in suspend scenario, where we are observing
>>>>>> "DPM device timeout" [1], as scsi device is stuck in
>>>>>> _synchronize_rcu_expedited().
>>>>>>
>>>>>> schedule+0x70/0x90
>>>>>> _synchronize_rcu_expedited+0x590/0x5f8
>>>>>> synchronize_rcu+0x50/0xa0
>>>>>> scsi_device_quiesce+0x50/0x120
>>>>>> scsi_bus_suspend+0x70/0xe8
>>>>>> dpm_run_callback+0x148/0x388
>>>>>> __device_suspend+0x430/0x8a8
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
>>>>>>
>>>>>>> What have you done to test the change?
>>>>>>>
>>>>>>
>>>>>> I have given this for testing; will share the results . Current analysis
>>>>>> and patch is based on going through ftrace and code review.
>>>>>
>>>>> OK, very good.  Please include the failure information in the changelog
>>>>> of the next version of this patch.
>>
>> Done.
>>
>>>>>
>>>>> I prefer your original patch, that just uses "s", over the one below
>>>>> that moves the rcu_exp_gp_seq_end().  The big advantage of your original
>>>>> patch is that it allow more concurrency between a consecutive pair of
>>>>> expedited RCU grace periods.  Plus it would not be easy to convince
>>>>> myself that moving rcu_exp_gp_seq_end() down is safe, so your original
>>>>> is also conceptually simpler with a more manageable state space.
>>
>> The reason for highlighting the alternate approach of doing gp end inside
>> exp_wake_mutex is the requirement of 3 wqs. Now, this is a theoretical case;
>> please correct me if I am wrong here:
>>
>> 1. task0 holds exp_wake_mutex, and is preempted.
> 
> Presumably after it has awakened the kthread that initiated the prior
> expedited grace period (the one with seq number = -4).
> 
>> 2. task1 initiates new GP (current seq number = 0).
> 
> Yes, this can happen.
> 
>> 3. task1 queues worker kworker1 and schedules out.
> 
> And thus still holds .exp_mutex, but yes.
> 
>> 4. kworker1 sets exp GP to 1 and waits on exp_wake_mutex
> 
> And thus cannot yet have awakened task1.
> 
>> 5. task1 releases exp mutex, w/o entering waitq.
> 
> So I do not believe that we can get to #5.  What am I missing here?
> 

As mentioned in this patch, task1 could have scheduled out after queuing 
work:

queue_work(rcu_gp_wq, &rew.rew_work)
            wake_up_worker()
              schedule()

kworker1 runs and picks up this queued work, and sets exp GP to 1 and 
waits on exp_wake_mutex.

task1 gets scheduled in and checks sync_exp_work_done(rsp, s), which 
return true and it does not enter wait queue and releases exp_mutex.

wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
          sync_exp_work_done(rsp, s));

Thanks
Neeraj

>> 6. task2 initiates new GP (current seq number = 1).
>> 7. task2 queues worker kworker1 and schedules out.
>> 8. kworker 2 sets exp GP to 2 and waits on exp_wake_mutex.
>> ...
>>
>> This sequence would break the requirement of max 3 wqs.
>>
>> If we hold the exp_wake_mutex before exp seq end, there will be atmost three
>> pending GPs - 1. GP which current owner is doing wakeups for,
>> 2. GP which the waiter of exp_wake_mutex would complete, 3. Next GP
>> waiters, which started after GP @ point 2 had started. This also is inline
>> with the statement in [1]:
>>
>> "he key point is that the ->exp_mutex is not released until
>> the first wakeup is complete, which means that the ->exp_wake_mutex
>> has already been acquired at that point"
>>
>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
> 
> And I believe that this still holds.  The task holding .exp_mutex cannot
> release it until it has been awakened, and it won't be awakened until after
> the .exp_wake_mutex has been acquired.  Again, what am I missing here?
> 
>>>>> Please also add the WARN_ON(), though at first glance your change seems
>>>>> to have lost the wakeup.  (But it is early, so maybe it is just that I
>>>>> am not yet fully awake.)
>>>>
>>>> My bad, I posted incomplete diff in previous mail:
>>>>
>>>>    static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>>>>    {
>>>>    	struct rcu_node *rnp;
>>>> +	unsigned long exp_low;
>>>> +	unsigned long s_low = rcu_seq_ctr(s) & 0x3;
>>>>
>>>>    	synchronize_sched_expedited_wait(rsp);
>>>>    	rcu_exp_gp_seq_end(rsp);
>>>> @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>>>> unsigned long s)
>>>>    			spin_unlock(&rnp->exp_lock);
>>>>    		}
>>>>    		smp_mb(); /* All above changes before wakeup. */
>>>> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>> +		exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
>>>> +		WARN_ON(s_low != exp_low);
>>>> +		wake_up_all(&rnp->exp_wq[exp_low]);
>>>
>>> Much better!
>>>
>>> But I suggest using s_low in the wake_up_all.  This hunk is of course
>>> only for testing purposes, not for actual inclusion.  (My earlier email
>>> didn't make that clear.)
>>>
>>> 							Thanx, Paul
>>>
>>
>> Ok sure, got it. I will share the results, once the issue is reproduced.
> 
> Sounds good!
> 
> 							Thanx, Paul
> 
>> Thanks
>> Neeraj
>>
>>>>    	}
>>>>
>>>>
>>>> Thanks
>>>> Neeraj
>>>>
>>>>>
>>>>> 							Thanx, Paul
>>>>>
>>>>>> I was thinking of another way of addressing this problem: Doing exp seq end
>>>>>> inside exp_wake_mutex. This will also ensure that, if we extend the current
>>>>>> scenario and there are multiple expedited GP sequence, which have completed,
>>>>>> before exp_wake_mutex is held, we need to preserve the requirement of 3 wq
>>>>>> entries [2].
>>>>>>
>>>>>>
>>>>>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
>>>>>>
>>>>>>
>>>>>> @@ -595,8 +595,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>>>>>> unsigned long s)
>>>>>>            struct rcu_node *rnp;
>>>>>>
>>>>>>            synchronize_sched_expedited_wait(rsp);
>>>>>> -       rcu_exp_gp_seq_end(rsp);
>>>>>> -       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
>>>>>>
>>>>>>            /*
>>>>>>             * Switch over to wakeup mode, allowing the next GP, but -only- the
>>>>>> @@ -604,6 +602,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>>>>>> unsigned long s)
>>>>>>             */
>>>>>>            mutex_lock(&rsp->exp_wake_mutex);
>>>>>>
>>>>>> +       rcu_exp_gp_seq_end(rsp);
>>>>>> +       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
>>>>>> +
>>>>>>
>>>>>>
>>>>>>
>>>>>>> (Using a WARN_ON() to check for the lower bits of the counter portion
>>>>>>> of rcu_state.expedited_sequence differing from the same bits of s
>>>>>>> would be one way to detect this problem.)
>>>>>>>
>>>>>>> 							Thanx, Paul
>>>>>>>
>>>>>>
>>>>>> I have also given the patch for this, for testing:
>>>>>>
>>>>>>     static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>>>>>>     {
>>>>>>            struct rcu_node *rnp;
>>>>>> +       unsigned long exp_low;
>>>>>> +       unsigned long s_low = rcu_seq_ctr(s) & 0x3;
>>>>>>
>>>>>>            synchronize_sched_expedited_wait(rsp);
>>>>>>            rcu_exp_gp_seq_end(rsp);
>>>>>> @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>>>>>> unsigned long s)
>>>>>>                            spin_unlock(&rnp->exp_lock);
>>>>>>                    }
>>>>>>                    smp_mb(); /* All above changes before wakeup. */
>>>>>> - wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>>>> +               exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
>>>>>> +               WARN_ON(s_low != exp_low);
>>>>>> +
>>>>>>
>>>>>> Thanks
>>>>>> Neeraj
>>>>>>
>>>>>>>> Fix this by using the correct sequence number for wake_up_all() inside
>>>>>>>> rcu_exp_wait_wake().
>>>>>>>>
>>>>>>>> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>>>>>>>> ---
>>>>>>>>      kernel/rcu/tree_exp.h | 2 +-
>>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>>>>>>> index e4b77d3..28979d3 100644
>>>>>>>> --- a/kernel/rcu/tree_exp.h
>>>>>>>> +++ b/kernel/rcu/tree_exp.h
>>>>>>>> @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
>>>>>>>>      			spin_unlock(&rnp->exp_lock);
>>>>>>>>      		}
>>>>>>>>      		smp_mb(); /* All above changes before wakeup. */
>>>>>>>> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>>>>>>>> +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
>>>>>>>>      	}
>>>>>>>>      	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
>>>>>>>>      	mutex_unlock(&rcu_state.exp_wake_mutex);
>>>>>>>> -- 
>>>>>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>>>>>>>> member of the Code Aurora Forum, hosted by The Linux Foundation
>>>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>>>>>> the Code Aurora Forum, hosted by The Linux Foundation
>>>>
>>>> -- 
>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>>>> the Code Aurora Forum, hosted by The Linux Foundation
>>
>> -- 
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>> the Code Aurora Forum, hosted by The Linux Foundation

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member of the Code Aurora Forum, hosted by The Linux Foundation