Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters

From: Neeraj Upadhyay <neeraju@codeaurora.org>
To: paulmck@kernel.org
Cc: josh@joshtriplett.org, joel@joelfernandes.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	jiangshanlai@gmail.com, linux-kernel@vger.kernel.org,
	rcu@vger.kernel.org, pkondeti@codeaurora.org,
	prsood@codeaurora.org, gkohli@codeaurora.org
Subject: Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
Date: Mon, 18 Nov 2019 09:28:39 +0000	[thread overview]
Message-ID: <0101016e7dd7b7ff-d30f5eea-0548-4882-8885-f3658dc16798-000000@us-west-2.amazonses.com> (raw)
In-Reply-To: <20191117213624.GB2889@paulmck-ThinkPad-P72>

Hi Paul,

On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
> On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
>> For the tasks waiting in exp_wq inside exp_funnel_lock(),
>> there is a chance that they might be indefinitely blocked
>> in below scenario:
>>
>> 1. There is a task waiting on exp sequence 0b'100' inside
>>     exp_funnel_lock().
>>
>>     _synchronize_rcu_expedited()
> 
> This symbol went away a few versions back, but let's see how this
> plays out in current -rcu.
> 

Sorry; for us this problem is observed on 4.19 stable version; I had
checked against the -rcu code, and the relevant portions were present
there.

>>       s = 0b'100
>>       exp_funnel_lock()
>>         wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
> 
> All of the above could still happen if the expedited grace
> period number was zero (or a bit less) when that task invoked

Yes

> synchronize_rcu_expedited().  What is the relation, if any,
> between this task and "task1" below?  Seems like you want them to
> be different tasks.
> 

This task is the one which is waiting for the expedited sequence, which
"task1" completes ("task1" holds the exp_mutex for it). "task1" would
wake up this task, on exp GP completion.

> Does this task actually block, or is it just getting ready
> to block?  Seems like you need it to have actually blocked.
> 

Yes, it actually blocked in wait queue.

>> 2. The Exp GP completes and task (task1) holding exp_mutex queues
>>     worker and schedules out.
> 
> "The Exp GP" being the one that was initiated when the .expedited_sequence
> counter was zero, correct?  (Looks that way below.)
> 
Yes, correct.

>>     _synchronize_rcu_expedited()
>>       s = 0b'100
>>       queue_work(rcu_gp_wq, &rew.rew_work)
>>         wake_up_worker()
>>           schedule()
>>
>> 3. kworker A picks up the queued work and completes the exp gp
>>     sequence.
>>
>>     rcu_exp_wait_wake()
>>       rcu_exp_wait_wake()
>>         rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>>                                 // to 0b'100'
>>
>> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>>     and releases exp_mutex.
>>
>>     wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>       sync_exp_work_done(rsp, s));
>>     mutex_unlock(&rsp->exp_mutex);
> 
> So task1 is the one that initiated the expedited grace period that
> started when .expedited_sequence was zero, right?
> 

Yes, right.

>> 5. Next exp GP completes, and sequence number is incremented:
>>
>>     rcu_exp_wait_wake()
>>       rcu_exp_wait_wake()
>>         rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
>>
>> 6. As kworker A uses current expedited_sequence, it wakes up workers
>>     from wrong wait queue index - it should have worken wait queue
>>     corresponding to 0b'100' sequence, but wakes up the ones for
>>     0b'200' sequence. This results in task at step 1 indefinitely blocked.
>>
>>     rcu_exp_wait_wake()
>>       wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> 
> So the issue is that the next expedited RCU grace period might
> have completed before the completion of the wakeups for the previous
> expedited RCU grace period, correct?  Then expedited grace periods have

Yes. Actually from the ftraces, I saw that next expedited RCU grace
period completed while kworker A was in D state, while waiting for 
exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) 
for its wake_up_all() call; so, task (point 1) was never woken up, as it 
was waiting on wq index 1.

> to have stopped to prevent any future wakeup from happening, correct?
> (Which would make it harder for rcutorture to trigger this, though it
> really does have code that attempts to trigger this sort of thing.)
> 
> Is this theoretical in nature, or have you actually triggered it?
> If actually triggered, what did you do to make this happen?

This issue, we had seen previously - 1 instance in May 2018 (on 4.9 
kernel), another instance in Nov 2018 (on 4.14 kernel), in our customer 
reported issues. Both instances were in downstream drivers and we didn't 
have RCU traces. Now 2 days back, it was reported on 4.19 kernel, with 
RCU traces enabled, where it was observed in suspend scenario, where we 
are observing "DPM device timeout" [1], as scsi device is stuck in 
_synchronize_rcu_expedited().

schedule+0x70/0x90
_synchronize_rcu_expedited+0x590/0x5f8
synchronize_rcu+0x50/0xa0
scsi_device_quiesce+0x50/0x120
scsi_bus_suspend+0x70/0xe8
dpm_run_callback+0x148/0x388
__device_suspend+0x430/0x8a8

[1] 
https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489

> What have you done to test the change?
> 

I have given this for testing; will share the results . Current analysis
and patch is based on going through ftrace and code review.

I was thinking of another way of addressing this problem: Doing exp seq 
end inside exp_wake_mutex. This will also ensure that, if we extend the 
current scenario and there are multiple expedited GP sequence, which 
have completed, before exp_wake_mutex is held, we need to preserve the 
requirement of 3 wq entries [2].


[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev 


@@ -595,8 +595,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, 
unsigned long s)
         struct rcu_node *rnp;

         synchronize_sched_expedited_wait(rsp);
-       rcu_exp_gp_seq_end(rsp);
-       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));

         /*
          * Switch over to wakeup mode, allowing the next GP, but -only- the
@@ -604,6 +602,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, 
unsigned long s)
          */
         mutex_lock(&rsp->exp_wake_mutex);

+       rcu_exp_gp_seq_end(rsp);
+       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
+



> (Using a WARN_ON() to check for the lower bits of the counter portion
> of rcu_state.expedited_sequence differing from the same bits of s
> would be one way to detect this problem.)
> 
> 							Thanx, Paul
> 

I have also given the patch for this, for testing:

  static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
  {
         struct rcu_node *rnp;
+       unsigned long exp_low;
+       unsigned long s_low = rcu_seq_ctr(s) & 0x3;

         synchronize_sched_expedited_wait(rsp);
         rcu_exp_gp_seq_end(rsp);
@@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, 
unsigned long s)
                         spin_unlock(&rnp->exp_lock);
                 }
                 smp_mb(); /* All above changes before wakeup. */
- 
wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
+               exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
+               WARN_ON(s_low != exp_low);
+

Thanks
Neeraj

>> Fix this by using the correct sequence number for wake_up_all() inside
>> rcu_exp_wait_wake().
>>
>> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>> ---
>>   kernel/rcu/tree_exp.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>> index e4b77d3..28979d3 100644
>> --- a/kernel/rcu/tree_exp.h
>> +++ b/kernel/rcu/tree_exp.h
>> @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
>>   			spin_unlock(&rnp->exp_lock);
>>   		}
>>   		smp_mb(); /* All above changes before wakeup. */
>> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>> +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
>>   	}
>>   	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
>>   	mutex_unlock(&rcu_state.exp_wake_mutex);
>> -- 
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>> member of the Code Aurora Forum, hosted by The Linux Foundation
>>

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member of the Code Aurora Forum, hosted by The Linux Foundation