rcu.vger.kernel.org archive mirror
* [PATCH] rcu: Fix missed wakeup of exp_wq waiters
@ 2019-11-15 17:28 Neeraj Upadhyay
  2019-11-17 21:36 ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Neeraj Upadhyay @ 2019-11-15 17:28 UTC (permalink / raw)
  To: paulmck, josh, joel, rostedt, mathieu.desnoyers, jiangshanlai
  Cc: linux-kernel, rcu, pkondeti, prsood, gkohli, Neeraj Upadhyay

Tasks waiting in exp_wq inside exp_funnel_lock() can be blocked
indefinitely in the following scenario:

1. There is a task waiting on exp sequence 0b'100' inside
   exp_funnel_lock().

   _synchronize_rcu_expedited()
     s = 0b'100'
     exp_funnel_lock()
       wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]

2. The exp GP completes, and the task (task1) holding exp_mutex queues
   a worker and schedules out.

   _synchronize_rcu_expedited()
     s = 0b'100'
     queue_work(rcu_gp_wq, &rew.rew_work)
       wake_up_worker()
         schedule()

3. kworker A picks up the queued work and completes the exp GP
   sequence.

   rcu_exp_sel_wait_wake()
     rcu_exp_wait_wake()
       rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
                               // to 0b'100'

4. task1 does not enter the wait queue, as sync_exp_work_done() returns
   true, and releases exp_mutex.

   wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
     sync_exp_work_done(rsp, s));
   mutex_unlock(&rsp->exp_mutex);

5. The next exp GP completes, and the sequence number is incremented:

   rcu_exp_sel_wait_wake()
     rcu_exp_wait_wake()
       rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'1000'

6. As kworker A uses the current expedited_sequence, it wakes up
   waiters on the wrong wait queue index - it should have woken the
   wait queue corresponding to the 0b'100' sequence, but wakes up the
   one for the 0b'1000' sequence. This leaves the task from step 1
   blocked indefinitely.

   rcu_exp_wait_wake()
     wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);

Fix this by using the sequence number ("s") of the just-completed GP,
rather than the current expedited_sequence, for the wake_up_all()
inside rcu_exp_wait_wake().
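
For illustration only - this is not part of the patch - the wait queue
index arithmetic above can be checked with a small standalone program,
which re-implements rcu_seq_ctr() using the same definition as
kernel/rcu/rcu.h (the counter sits above two state bits):

#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT 2	/* same as kernel/rcu/rcu.h */

static unsigned long rcu_seq_ctr(unsigned long s)
{
	return s >> RCU_SEQ_CTR_SHIFT;
}

int main(void)
{
	unsigned long s = 0x4;			/* 0b'100, snapshot from step 1 */
	unsigned long expedited_sequence = 0x8;	/* 0b'1000, value after step 5 */

	/* The task from step 1 sleeps on exp_wq index 1. */
	printf("waiter index:       %lu\n", rcu_seq_ctr(s) & 0x3);
	/* The old code wakes index 2, where nobody is waiting. */
	printf("old wakeup index:   %lu\n", rcu_seq_ctr(expedited_sequence) & 0x3);
	/* The fixed code wakes index 1, matching the waiter. */
	printf("fixed wakeup index: %lu\n", rcu_seq_ctr(s) & 0x3);
	return 0;
}

It prints 1, 2 and 1: only the fixed index matches the waiter from step 1.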

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
---
 kernel/rcu/tree_exp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index e4b77d3..28979d3 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
 			spin_unlock(&rnp->exp_lock);
 		}
 		smp_mb(); /* All above changes before wakeup. */
-		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
+		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
 	}
 	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
 	mutex_unlock(&rcu_state.exp_wake_mutex);
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
  2019-11-15 17:28 [PATCH] rcu: Fix missed wakeup of exp_wq waiters Neeraj Upadhyay
@ 2019-11-17 21:36 ` Paul E. McKenney
  2019-11-18  9:28   ` Neeraj Upadhyay
       [not found]   ` <0101016e7dd7b824-50600058-ab5e-44d8-8d24-94cf095f1783-000000@us-west-2.amazonses.com>
  0 siblings, 2 replies; 16+ messages in thread
From: Paul E. McKenney @ 2019-11-17 21:36 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli

On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
> For the tasks waiting in exp_wq inside exp_funnel_lock(),
> there is a chance that they might be indefinitely blocked
> in below scenario:
> 
> 1. There is a task waiting on exp sequence 0b'100' inside
>    exp_funnel_lock().
> 
>    _synchronize_rcu_expedited()

This symbol went away a few versions back, but let's see how this
plays out in current -rcu.

>      s = 0b'100
>      exp_funnel_lock()
>        wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]

All of the above could still happen if the expedited grace
period number was zero (or a bit less) when that task invoked
synchronize_rcu_expedited().  What is the relation, if any,
between this task and "task1" below?  Seems like you want them to
be different tasks.

Does this task actually block, or is it just getting ready
to block?  Seems like you need it to have actually blocked.

> 2. The Exp GP completes and task (task1) holding exp_mutex queues
>    worker and schedules out.

"The Exp GP" being the one that was initiated when the .expedited_sequence
counter was zero, correct?  (Looks that way below.)

>    _synchronize_rcu_expedited()
>      s = 0b'100
>      queue_work(rcu_gp_wq, &rew.rew_work)
>        wake_up_worker()
>          schedule()
> 
> 3. kworker A picks up the queued work and completes the exp gp
>    sequence.
> 
>    rcu_exp_wait_wake()
>      rcu_exp_wait_wake()
>        rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>                                // to 0b'100'
> 
> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>    and releases exp_mutex.
> 
>    wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>      sync_exp_work_done(rsp, s));
>    mutex_unlock(&rsp->exp_mutex);

So task1 is the one that initiated the expedited grace period that
started when .expedited_sequence was zero, right?

> 5. Next exp GP completes, and sequence number is incremented:
> 
>    rcu_exp_wait_wake()
>      rcu_exp_wait_wake()
>        rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
> 
> 6. As kworker A uses current expedited_sequence, it wakes up workers
>    from wrong wait queue index - it should have worken wait queue
>    corresponding to 0b'100' sequence, but wakes up the ones for
>    0b'200' sequence. This results in task at step 1 indefinitely blocked.
> 
>    rcu_exp_wait_wake()
>      wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);

So the issue is that the next expedited RCU grace period might
have completed before the completion of the wakeups for the previous
expedited RCU grace period, correct?  Then expedited grace periods have
to have stopped to prevent any future wakeup from happening, correct?
(Which would make it harder for rcutorture to trigger this, though it
really does have code that attempts to trigger this sort of thing.)

Is this theoretical in nature, or have you actually triggered it?
If actually triggered, what did you do to make this happen?
What have you done to test the change?

(Using a WARN_ON() to check for the lower bits of the counter portion
of rcu_state.expedited_sequence differing from the same bits of s
would be one way to detect this problem.)

							Thanx, Paul

> Fix this by using the correct sequence number for wake_up_all() inside
> rcu_exp_wait_wake().
> 
> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> ---
>  kernel/rcu/tree_exp.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index e4b77d3..28979d3 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
>  			spin_unlock(&rnp->exp_lock);
>  		}
>  		smp_mb(); /* All above changes before wakeup. */
> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
>  	}
>  	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
>  	mutex_unlock(&rcu_state.exp_wake_mutex);
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of the Code Aurora Forum, hosted by The Linux Foundation
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
  2019-11-17 21:36 ` Paul E. McKenney
@ 2019-11-18  9:28   ` Neeraj Upadhyay
       [not found]   ` <0101016e7dd7b824-50600058-ab5e-44d8-8d24-94cf095f1783-000000@us-west-2.amazonses.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Neeraj Upadhyay @ 2019-11-18  9:28 UTC (permalink / raw)
  To: paulmck
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli

Hi Paul,

On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
> On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
>> For the tasks waiting in exp_wq inside exp_funnel_lock(),
>> there is a chance that they might be indefinitely blocked
>> in below scenario:
>>
>> 1. There is a task waiting on exp sequence 0b'100' inside
>>     exp_funnel_lock().
>>
>>     _synchronize_rcu_expedited()
> 
> This symbol went away a few versions back, but let's see how this
> plays out in current -rcu.
> 

Sorry; for us this problem was observed on a 4.19 stable kernel; I had
checked against the -rcu code, and the relevant portions were present
there.

>>       s = 0b'100
>>       exp_funnel_lock()
>>         wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
> 
> All of the above could still happen if the expedited grace
> period number was zero (or a bit less) when that task invoked

Yes

> synchronize_rcu_expedited().  What is the relation, if any,
> between this task and "task1" below?  Seems like you want them to
> be different tasks.
> 

This task is the one waiting for the expedited sequence which "task1"
completes ("task1" holds the exp_mutex for it). "task1" would wake up
this task on exp GP completion.

> Does this task actually block, or is it just getting ready
> to block?  Seems like you need it to have actually blocked.
> 

Yes, it is actually blocked in the wait queue.

>> 2. The Exp GP completes and task (task1) holding exp_mutex queues
>>     worker and schedules out.
> 
> "The Exp GP" being the one that was initiated when the .expedited_sequence
> counter was zero, correct?  (Looks that way below.)
> 
Yes, correct.

>>     _synchronize_rcu_expedited()
>>       s = 0b'100
>>       queue_work(rcu_gp_wq, &rew.rew_work)
>>         wake_up_worker()
>>           schedule()
>>
>> 3. kworker A picks up the queued work and completes the exp gp
>>     sequence.
>>
>>     rcu_exp_wait_wake()
>>       rcu_exp_wait_wake()
>>         rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>>                                 // to 0b'100'
>>
>> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>>     and releases exp_mutex.
>>
>>     wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>       sync_exp_work_done(rsp, s));
>>     mutex_unlock(&rsp->exp_mutex);
> 
> So task1 is the one that initiated the expedited grace period that
> started when .expedited_sequence was zero, right?
> 

Yes, right.

>> 5. Next exp GP completes, and sequence number is incremented:
>>
>>     rcu_exp_wait_wake()
>>       rcu_exp_wait_wake()
>>         rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
>>
>> 6. As kworker A uses current expedited_sequence, it wakes up workers
>>     from wrong wait queue index - it should have worken wait queue
>>     corresponding to 0b'100' sequence, but wakes up the ones for
>>     0b'200' sequence. This results in task at step 1 indefinitely blocked.
>>
>>     rcu_exp_wait_wake()
>>       wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> 
> So the issue is that the next expedited RCU grace period might
> have completed before the completion of the wakeups for the previous
> expedited RCU grace period, correct?  Then expedited grace periods have

Yes. Actually, from the ftrace data, I saw that the next expedited RCU
grace period completed while kworker A was in D state, waiting for
exp_wake_mutex. This led to kworker A using wq index 2 (instead of 1)
for its wake_up_all() call; so the task from step 1 was never woken up,
as it was waiting on wq index 1.

> to have stopped to prevent any future wakeup from happening, correct?
> (Which would make it harder for rcutorture to trigger this, though it
> really does have code that attempts to trigger this sort of thing.)
> 
> Is this theoretical in nature, or have you actually triggered it?
> If actually triggered, what did you do to make this happen?

We had seen this issue previously - one instance in May 2018 (on a 4.9
kernel) and another in Nov 2018 (on a 4.14 kernel), both in
customer-reported issues. Both instances were in downstream drivers and
we did not have RCU traces. Two days ago, it was reported on a 4.19
kernel, with RCU traces enabled, in a suspend scenario where we are
observing a "DPM device timeout" [1], as a scsi device is stuck in
_synchronize_rcu_expedited().

schedule+0x70/0x90
_synchronize_rcu_expedited+0x590/0x5f8
synchronize_rcu+0x50/0xa0
scsi_device_quiesce+0x50/0x120
scsi_bus_suspend+0x70/0xe8
dpm_run_callback+0x148/0x388
__device_suspend+0x430/0x8a8

[1] 
https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489

> What have you done to test the change?
> 

I have given this for testing and will share the results. The current
analysis and patch are based on going through the ftrace data and code
review.

I was thinking of another way of addressing this problem: doing the exp
seq end inside exp_wake_mutex. This would also ensure that, if we
extend the current scenario so that multiple expedited GP sequences
complete before exp_wake_mutex is held, the requirement of 3 wq
entries [2] is still preserved.


[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev 


@@ -595,8 +595,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
         struct rcu_node *rnp;

         synchronize_sched_expedited_wait(rsp);
-       rcu_exp_gp_seq_end(rsp);
-       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));

         /*
          * Switch over to wakeup mode, allowing the next GP, but -only- the
@@ -604,6 +602,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
          */
         mutex_lock(&rsp->exp_wake_mutex);

+       rcu_exp_gp_seq_end(rsp);
+       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
+



> (Using a WARN_ON() to check for the lower bits of the counter portion
> of rcu_state.expedited_sequence differing from the same bits of s
> would be one way to detect this problem.)
> 
> 							Thanx, Paul
> 

I have also given the following patch for testing:

  static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
  {
         struct rcu_node *rnp;
+       unsigned long exp_low;
+       unsigned long s_low = rcu_seq_ctr(s) & 0x3;

         synchronize_sched_expedited_wait(rsp);
         rcu_exp_gp_seq_end(rsp);
@@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
                         spin_unlock(&rnp->exp_lock);
                 }
                 smp_mb(); /* All above changes before wakeup. */
-		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
+               exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
+               WARN_ON(s_low != exp_low);
+

Thanks
Neeraj

>> Fix this by using the correct sequence number for wake_up_all() inside
>> rcu_exp_wait_wake().
>>
>> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>> ---
>>   kernel/rcu/tree_exp.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>> index e4b77d3..28979d3 100644
>> --- a/kernel/rcu/tree_exp.h
>> +++ b/kernel/rcu/tree_exp.h
>> @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
>>   			spin_unlock(&rnp->exp_lock);
>>   		}
>>   		smp_mb(); /* All above changes before wakeup. */
>> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>> +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
>>   	}
>>   	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
>>   	mutex_unlock(&rcu_state.exp_wake_mutex);
>> -- 
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>> member of the Code Aurora Forum, hosted by The Linux Foundation
>>

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
       [not found]   ` <0101016e7dd7b824-50600058-ab5e-44d8-8d24-94cf095f1783-000000@us-west-2.amazonses.com>
@ 2019-11-18 15:08     ` Paul E. McKenney
  2019-11-18 16:41       ` Neeraj Upadhyay
       [not found]       ` <0101016e7f644106-90b40f19-ba9e-4974-bdbe-1062b52222a2-000000@us-west-2.amazonses.com>
  0 siblings, 2 replies; 16+ messages in thread
From: Paul E. McKenney @ 2019-11-18 15:08 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli

On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
> Hi Paul,
> 
> On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
> > On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
> > > For the tasks waiting in exp_wq inside exp_funnel_lock(),
> > > there is a chance that they might be indefinitely blocked
> > > in below scenario:
> > > 
> > > 1. There is a task waiting on exp sequence 0b'100' inside
> > >     exp_funnel_lock().
> > > 
> > >     _synchronize_rcu_expedited()
> > 
> > This symbol went away a few versions back, but let's see how this
> > plays out in current -rcu.
> > 
> 
> Sorry; for us this problem is observed on 4.19 stable version; I had
> checked against the -rcu code, and the relevant portions were present
> there.
> 
> > >       s = 0b'100
> > >       exp_funnel_lock()
> > >         wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
> > 
> > All of the above could still happen if the expedited grace
> > period number was zero (or a bit less) when that task invoked
> 
> Yes
> 
> > synchronize_rcu_expedited().  What is the relation, if any,
> > between this task and "task1" below?  Seems like you want them to
> > be different tasks.
> > 
> 
> This task is the one which is waiting for the expedited sequence, which
> "task1" completes ("task1" holds the exp_mutex for it). "task1" would
> wake up this task, on exp GP completion.
> 
> > Does this task actually block, or is it just getting ready
> > to block?  Seems like you need it to have actually blocked.
> > 
> 
> Yes, it actually blocked in wait queue.
> 
> > > 2. The Exp GP completes and task (task1) holding exp_mutex queues
> > >     worker and schedules out.
> > 
> > "The Exp GP" being the one that was initiated when the .expedited_sequence
> > counter was zero, correct?  (Looks that way below.)
> > 
> Yes, correct.
> 
> > >     _synchronize_rcu_expedited()
> > >       s = 0b'100
> > >       queue_work(rcu_gp_wq, &rew.rew_work)
> > >         wake_up_worker()
> > >           schedule()
> > > 
> > > 3. kworker A picks up the queued work and completes the exp gp
> > >     sequence.
> > > 
> > >     rcu_exp_wait_wake()
> > >       rcu_exp_wait_wake()
> > >         rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
> > >                                 // to 0b'100'
> > > 
> > > 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
> > >     and releases exp_mutex.
> > > 
> > >     wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > >       sync_exp_work_done(rsp, s));
> > >     mutex_unlock(&rsp->exp_mutex);
> > 
> > So task1 is the one that initiated the expedited grace period that
> > started when .expedited_sequence was zero, right?
> > 
> 
> Yes, right.
> 
> > > 5. Next exp GP completes, and sequence number is incremented:
> > > 
> > >     rcu_exp_wait_wake()
> > >       rcu_exp_wait_wake()
> > >         rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
> > > 
> > > 6. As kworker A uses current expedited_sequence, it wakes up workers
> > >     from wrong wait queue index - it should have worken wait queue
> > >     corresponding to 0b'100' sequence, but wakes up the ones for
> > >     0b'200' sequence. This results in task at step 1 indefinitely blocked.
> > > 
> > >     rcu_exp_wait_wake()
> > >       wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > 
> > So the issue is that the next expedited RCU grace period might
> > have completed before the completion of the wakeups for the previous
> > expedited RCU grace period, correct?  Then expedited grace periods have
> 
> Yes. Actually from the ftraces, I saw that next expedited RCU grace
> period completed while kworker A was in D state, while waiting for
> exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
> its wake_up_all() call; so, task (point 1) was never woken up, as it was
> waiting on wq index 1.
> 
> > to have stopped to prevent any future wakeup from happening, correct?
> > (Which would make it harder for rcutorture to trigger this, though it
> > really does have code that attempts to trigger this sort of thing.)
> > 
> > Is this theoretical in nature, or have you actually triggered it?
> > If actually triggered, what did you do to make this happen?
> 
> This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
> another instance in Nov 2018 (on 4.14 kernel), in our customer reported
> issues. Both instances were in downstream drivers and we didn't have RCU
> traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
> enabled, where it was observed in suspend scenario, where we are observing
> "DPM device timeout" [1], as scsi device is stuck in
> _synchronize_rcu_expedited().
> 
> schedule+0x70/0x90
> _synchronize_rcu_expedited+0x590/0x5f8
> synchronize_rcu+0x50/0xa0
> scsi_device_quiesce+0x50/0x120
> scsi_bus_suspend+0x70/0xe8
> dpm_run_callback+0x148/0x388
> __device_suspend+0x430/0x8a8
> 
> [1]
> https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
> 
> > What have you done to test the change?
> > 
> 
> I have given this for testing; will share the results . Current analysis
> and patch is based on going through ftrace and code review.

OK, very good.  Please include the failure information in the changelog
of the next version of this patch.

I prefer your original patch, which just uses "s", over the one below
that moves the rcu_exp_gp_seq_end().  The big advantage of your original
patch is that it allows more concurrency between a consecutive pair of
expedited RCU grace periods.  Plus it would not be easy to convince
myself that moving rcu_exp_gp_seq_end() down is safe, so your original
is also conceptually simpler with a more manageable state space.

Please also add the WARN_ON(), though at first glance your change seems
to have lost the wakeup.  (But it is early, so maybe it is just that I
am not yet fully awake.)

							Thanx, Paul

> I was thinking of another way of addressing this problem: Doing exp seq end
> inside exp_wake_mutex. This will also ensure that, if we extend the current
> scenario and there are multiple expedited GP sequence, which have completed,
> before exp_wake_mutex is held, we need to preserve the requirement of 3 wq
> entries [2].
> 
> 
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
> 
> 
> @@ -595,8 +595,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> unsigned long s)
>         struct rcu_node *rnp;
> 
>         synchronize_sched_expedited_wait(rsp);
> -       rcu_exp_gp_seq_end(rsp);
> -       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
> 
>         /*
>          * Switch over to wakeup mode, allowing the next GP, but -only- the
> @@ -604,6 +602,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> unsigned long s)
>          */
>         mutex_lock(&rsp->exp_wake_mutex);
> 
> +       rcu_exp_gp_seq_end(rsp);
> +       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
> +
> 
> 
> 
> > (Using a WARN_ON() to check for the lower bits of the counter portion
> > of rcu_state.expedited_sequence differing from the same bits of s
> > would be one way to detect this problem.)
> > 
> > 							Thanx, Paul
> > 
> 
> I have also given the patch for this, for testing:
> 
>  static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>  {
>         struct rcu_node *rnp;
> +       unsigned long exp_low;
> +       unsigned long s_low = rcu_seq_ctr(s) & 0x3;
> 
>         synchronize_sched_expedited_wait(rsp);
>         rcu_exp_gp_seq_end(rsp);
> @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> unsigned long s)
>                         spin_unlock(&rnp->exp_lock);
>                 }
>                 smp_mb(); /* All above changes before wakeup. */
> - wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> +               exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
> +               WARN_ON(s_low != exp_low);
> +
> 
> Thanks
> Neeraj
> 
> > > Fix this by using the correct sequence number for wake_up_all() inside
> > > rcu_exp_wait_wake().
> > > 
> > > Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> > > ---
> > >   kernel/rcu/tree_exp.h | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > index e4b77d3..28979d3 100644
> > > --- a/kernel/rcu/tree_exp.h
> > > +++ b/kernel/rcu/tree_exp.h
> > > @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
> > >   			spin_unlock(&rnp->exp_lock);
> > >   		}
> > >   		smp_mb(); /* All above changes before wakeup. */
> > > -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> > > +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
> > >   	}
> > >   	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
> > >   	mutex_unlock(&rcu_state.exp_wake_mutex);
> > > -- 
> > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> > > member of the Code Aurora Forum, hosted by The Linux Foundation
> > > 
> 
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
  2019-11-18 15:08     ` Paul E. McKenney
@ 2019-11-18 16:41       ` Neeraj Upadhyay
       [not found]       ` <0101016e7f644106-90b40f19-ba9e-4974-bdbe-1062b52222a2-000000@us-west-2.amazonses.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Neeraj Upadhyay @ 2019-11-18 16:41 UTC (permalink / raw)
  To: paulmck
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli

Hi Paul,


On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
> On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
>> Hi Paul,
>>
>> On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
>>> On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
>>>> For the tasks waiting in exp_wq inside exp_funnel_lock(),
>>>> there is a chance that they might be indefinitely blocked
>>>> in below scenario:
>>>>
>>>> 1. There is a task waiting on exp sequence 0b'100' inside
>>>>      exp_funnel_lock().
>>>>
>>>>      _synchronize_rcu_expedited()
>>>
>>> This symbol went away a few versions back, but let's see how this
>>> plays out in current -rcu.
>>>
>>
>> Sorry; for us this problem is observed on 4.19 stable version; I had
>> checked against the -rcu code, and the relevant portions were present
>> there.
>>
>>>>        s = 0b'100
>>>>        exp_funnel_lock()
>>>>          wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
>>>
>>> All of the above could still happen if the expedited grace
>>> period number was zero (or a bit less) when that task invoked
>>
>> Yes
>>
>>> synchronize_rcu_expedited().  What is the relation, if any,
>>> between this task and "task1" below?  Seems like you want them to
>>> be different tasks.
>>>
>>
>> This task is the one which is waiting for the expedited sequence, which
>> "task1" completes ("task1" holds the exp_mutex for it). "task1" would
>> wake up this task, on exp GP completion.
>>
>>> Does this task actually block, or is it just getting ready
>>> to block?  Seems like you need it to have actually blocked.
>>>
>>
>> Yes, it actually blocked in wait queue.
>>
>>>> 2. The Exp GP completes and task (task1) holding exp_mutex queues
>>>>      worker and schedules out.
>>>
>>> "The Exp GP" being the one that was initiated when the .expedited_sequence
>>> counter was zero, correct?  (Looks that way below.)
>>>
>> Yes, correct.
>>
>>>>      _synchronize_rcu_expedited()
>>>>        s = 0b'100
>>>>        queue_work(rcu_gp_wq, &rew.rew_work)
>>>>          wake_up_worker()
>>>>            schedule()
>>>>
>>>> 3. kworker A picks up the queued work and completes the exp gp
>>>>      sequence.
>>>>
>>>>      rcu_exp_wait_wake()
>>>>        rcu_exp_wait_wake()
>>>>          rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>>>>                                  // to 0b'100'
>>>>
>>>> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>>>>      and releases exp_mutex.
>>>>
>>>>      wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>>>        sync_exp_work_done(rsp, s));
>>>>      mutex_unlock(&rsp->exp_mutex);
>>>
>>> So task1 is the one that initiated the expedited grace period that
>>> started when .expedited_sequence was zero, right?
>>>
>>
>> Yes, right.
>>
>>>> 5. Next exp GP completes, and sequence number is incremented:
>>>>
>>>>      rcu_exp_wait_wake()
>>>>        rcu_exp_wait_wake()
>>>>          rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
>>>>
>>>> 6. As kworker A uses current expedited_sequence, it wakes up workers
>>>>      from wrong wait queue index - it should have worken wait queue
>>>>      corresponding to 0b'100' sequence, but wakes up the ones for
>>>>      0b'200' sequence. This results in task at step 1 indefinitely blocked.
>>>>
>>>>      rcu_exp_wait_wake()
>>>>        wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>
>>> So the issue is that the next expedited RCU grace period might
>>> have completed before the completion of the wakeups for the previous
>>> expedited RCU grace period, correct?  Then expedited grace periods have
>>
>> Yes. Actually from the ftraces, I saw that next expedited RCU grace
>> period completed while kworker A was in D state, while waiting for
>> exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
>> its wake_up_all() call; so, task (point 1) was never woken up, as it was
>> waiting on wq index 1.
>>
>>> to have stopped to prevent any future wakeup from happening, correct?
>>> (Which would make it harder for rcutorture to trigger this, though it
>>> really does have code that attempts to trigger this sort of thing.)
>>>
>>> Is this theoretical in nature, or have you actually triggered it?
>>> If actually triggered, what did you do to make this happen?
>>
>> This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
>> another instance in Nov 2018 (on 4.14 kernel), in our customer reported
>> issues. Both instances were in downstream drivers and we didn't have RCU
>> traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
>> enabled, where it was observed in suspend scenario, where we are observing
>> "DPM device timeout" [1], as scsi device is stuck in
>> _synchronize_rcu_expedited().
>>
>> schedule+0x70/0x90
>> _synchronize_rcu_expedited+0x590/0x5f8
>> synchronize_rcu+0x50/0xa0
>> scsi_device_quiesce+0x50/0x120
>> scsi_bus_suspend+0x70/0xe8
>> dpm_run_callback+0x148/0x388
>> __device_suspend+0x430/0x8a8
>>
>> [1]
>> https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
>>
>>> What have you done to test the change?
>>>
>>
>> I have given this for testing; will share the results . Current analysis
>> and patch is based on going through ftrace and code review.
> 
> OK, very good.  Please include the failure information in the changelog
> of the next version of this patch.
> 
> I prefer your original patch, that just uses "s", over the one below
> that moves the rcu_exp_gp_seq_end().  The big advantage of your original
> patch is that it allow more concurrency between a consecutive pair of
> expedited RCU grace periods.  Plus it would not be easy to convince
> myself that moving rcu_exp_gp_seq_end() down is safe, so your original
> is also conceptually simpler with a more manageable state space.
> 
> Please also add the WARN_ON(), though at first glance your change seems
> to have lost the wakeup.  (But it is early, so maybe it is just that I
> am not yet fully awake.)

My bad, I posted an incomplete diff in the previous mail:

  static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
  {
  	struct rcu_node *rnp;
+	unsigned long exp_low;
+	unsigned long s_low = rcu_seq_ctr(s) & 0x3;

  	synchronize_sched_expedited_wait(rsp);
  	rcu_exp_gp_seq_end(rsp);
@@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
  			spin_unlock(&rnp->exp_lock);
  		}
  		smp_mb(); /* All above changes before wakeup. */
-		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
+		exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
+		WARN_ON(s_low != exp_low);
+		wake_up_all(&rnp->exp_wq[exp_low]);
  	}


Thanks
Neeraj

> 
> 							Thanx, Paul
> 
>> I was thinking of another way of addressing this problem: Doing exp seq end
>> inside exp_wake_mutex. This will also ensure that, if we extend the current
>> scenario and there are multiple expedited GP sequence, which have completed,
>> before exp_wake_mutex is held, we need to preserve the requirement of 3 wq
>> entries [2].
>>
>>
>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
>>
>>
>> @@ -595,8 +595,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>> unsigned long s)
>>          struct rcu_node *rnp;
>>
>>          synchronize_sched_expedited_wait(rsp);
>> -       rcu_exp_gp_seq_end(rsp);
>> -       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
>>
>>          /*
>>           * Switch over to wakeup mode, allowing the next GP, but -only- the
>> @@ -604,6 +602,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>> unsigned long s)
>>           */
>>          mutex_lock(&rsp->exp_wake_mutex);
>>
>> +       rcu_exp_gp_seq_end(rsp);
>> +       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
>> +
>>
>>
>>
>>> (Using a WARN_ON() to check for the lower bits of the counter portion
>>> of rcu_state.expedited_sequence differing from the same bits of s
>>> would be one way to detect this problem.)
>>>
>>> 							Thanx, Paul
>>>
>>
>> I have also given the patch for this, for testing:
>>
>>   static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>>   {
>>          struct rcu_node *rnp;
>> +       unsigned long exp_low;
>> +       unsigned long s_low = rcu_seq_ctr(s) & 0x3;
>>
>>          synchronize_sched_expedited_wait(rsp);
>>          rcu_exp_gp_seq_end(rsp);
>> @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>> unsigned long s)
>>                          spin_unlock(&rnp->exp_lock);
>>                  }
>>                  smp_mb(); /* All above changes before wakeup. */
>> - wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>> +               exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
>> +               WARN_ON(s_low != exp_low);
>> +
>>
>> Thanks
>> Neeraj
>>
>>>> Fix this by using the correct sequence number for wake_up_all() inside
>>>> rcu_exp_wait_wake().
>>>>
>>>> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>>>> ---
>>>>    kernel/rcu/tree_exp.h | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>>> index e4b77d3..28979d3 100644
>>>> --- a/kernel/rcu/tree_exp.h
>>>> +++ b/kernel/rcu/tree_exp.h
>>>> @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
>>>>    			spin_unlock(&rnp->exp_lock);
>>>>    		}
>>>>    		smp_mb(); /* All above changes before wakeup. */
>>>> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>>>> +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
>>>>    	}
>>>>    	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
>>>>    	mutex_unlock(&rcu_state.exp_wake_mutex);
>>>> -- 
>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>>>> member of the Code Aurora Forum, hosted by The Linux Foundation
>>>>
>>
>> -- 
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>> the Code Aurora Forum, hosted by The Linux Foundation

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
       [not found]       ` <0101016e7f644106-90b40f19-ba9e-4974-bdbe-1062b52222a2-000000@us-west-2.amazonses.com>
@ 2019-11-18 17:24         ` Paul E. McKenney
  2019-11-19  3:35           ` Neeraj Upadhyay
       [not found]           ` <0101016e81ba8865-028ac62b-3fcd-481b-b657-be8519c4edf5-000000@us-west-2.amazonses.com>
  0 siblings, 2 replies; 16+ messages in thread
From: Paul E. McKenney @ 2019-11-18 17:24 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli

On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
> Hi Paul,
> 
> 
> On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
> > On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
> > > Hi Paul,
> > > 
> > > On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
> > > > On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
> > > > > For the tasks waiting in exp_wq inside exp_funnel_lock(),
> > > > > there is a chance that they might be indefinitely blocked
> > > > > in below scenario:
> > > > > 
> > > > > 1. There is a task waiting on exp sequence 0b'100' inside
> > > > >      exp_funnel_lock().
> > > > > 
> > > > >      _synchronize_rcu_expedited()
> > > > 
> > > > This symbol went away a few versions back, but let's see how this
> > > > plays out in current -rcu.
> > > > 
> > > 
> > > Sorry; for us this problem is observed on 4.19 stable version; I had
> > > checked against the -rcu code, and the relevant portions were present
> > > there.
> > > 
> > > > >        s = 0b'100
> > > > >        exp_funnel_lock()
> > > > >          wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
> > > > 
> > > > All of the above could still happen if the expedited grace
> > > > period number was zero (or a bit less) when that task invoked
> > > 
> > > Yes
> > > 
> > > > synchronize_rcu_expedited().  What is the relation, if any,
> > > > between this task and "task1" below?  Seems like you want them to
> > > > be different tasks.
> > > > 
> > > 
> > > This task is the one which is waiting for the expedited sequence, which
> > > "task1" completes ("task1" holds the exp_mutex for it). "task1" would
> > > wake up this task, on exp GP completion.
> > > 
> > > > Does this task actually block, or is it just getting ready
> > > > to block?  Seems like you need it to have actually blocked.
> > > > 
> > > 
> > > Yes, it actually blocked in wait queue.
> > > 
> > > > > 2. The Exp GP completes and task (task1) holding exp_mutex queues
> > > > >      worker and schedules out.
> > > > 
> > > > "The Exp GP" being the one that was initiated when the .expedited_sequence
> > > > counter was zero, correct?  (Looks that way below.)
> > > > 
> > > Yes, correct.
> > > 
> > > > >      _synchronize_rcu_expedited()
> > > > >        s = 0b'100
> > > > >        queue_work(rcu_gp_wq, &rew.rew_work)
> > > > >          wake_up_worker()
> > > > >            schedule()
> > > > > 
> > > > > 3. kworker A picks up the queued work and completes the exp gp
> > > > >      sequence.
> > > > > 
> > > > >      rcu_exp_wait_wake()
> > > > >        rcu_exp_wait_wake()
> > > > >          rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
> > > > >                                  // to 0b'100'
> > > > > 
> > > > > 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
> > > > >      and releases exp_mutex.
> > > > > 
> > > > >      wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > > > >        sync_exp_work_done(rsp, s));
> > > > >      mutex_unlock(&rsp->exp_mutex);
> > > > 
> > > > So task1 is the one that initiated the expedited grace period that
> > > > started when .expedited_sequence was zero, right?
> > > > 
> > > 
> > > Yes, right.
> > > 
> > > > > 5. Next exp GP completes, and sequence number is incremented:
> > > > > 
> > > > >      rcu_exp_wait_wake()
> > > > >        rcu_exp_wait_wake()
> > > > >          rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
> > > > > 
> > > > > 6. As kworker A uses current expedited_sequence, it wakes up workers
> > > > >      from wrong wait queue index - it should have worken wait queue
> > > > >      corresponding to 0b'100' sequence, but wakes up the ones for
> > > > >      0b'200' sequence. This results in task at step 1 indefinitely blocked.
> > > > > 
> > > > >      rcu_exp_wait_wake()
> > > > >        wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > > 
> > > > So the issue is that the next expedited RCU grace period might
> > > > have completed before the completion of the wakeups for the previous
> > > > expedited RCU grace period, correct?  Then expedited grace periods have
> > > 
> > > Yes. Actually from the ftraces, I saw that next expedited RCU grace
> > > period completed while kworker A was in D state, while waiting for
> > > exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
> > > its wake_up_all() call; so, task (point 1) was never woken up, as it was
> > > waiting on wq index 1.
> > > 
> > > > to have stopped to prevent any future wakeup from happening, correct?
> > > > (Which would make it harder for rcutorture to trigger this, though it
> > > > really does have code that attempts to trigger this sort of thing.)
> > > > 
> > > > Is this theoretical in nature, or have you actually triggered it?
> > > > If actually triggered, what did you do to make this happen?
> > > 
> > > This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
> > > another instance in Nov 2018 (on 4.14 kernel), in our customer reported
> > > issues. Both instances were in downstream drivers and we didn't have RCU
> > > traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
> > > enabled, where it was observed in suspend scenario, where we are observing
> > > "DPM device timeout" [1], as scsi device is stuck in
> > > _synchronize_rcu_expedited().
> > > 
> > > schedule+0x70/0x90
> > > _synchronize_rcu_expedited+0x590/0x5f8
> > > synchronize_rcu+0x50/0xa0
> > > scsi_device_quiesce+0x50/0x120
> > > scsi_bus_suspend+0x70/0xe8
> > > dpm_run_callback+0x148/0x388
> > > __device_suspend+0x430/0x8a8
> > > 
> > > [1]
> > > https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
> > > 
> > > > What have you done to test the change?
> > > > 
> > > 
> > > I have given this for testing; will share the results . Current analysis
> > > and patch is based on going through ftrace and code review.
> > 
> > OK, very good.  Please include the failure information in the changelog
> > of the next version of this patch.
> > 
> > I prefer your original patch, that just uses "s", over the one below
> > that moves the rcu_exp_gp_seq_end().  The big advantage of your original
> > patch is that it allow more concurrency between a consecutive pair of
> > expedited RCU grace periods.  Plus it would not be easy to convince
> > myself that moving rcu_exp_gp_seq_end() down is safe, so your original
> > is also conceptually simpler with a more manageable state space.
> > 
> > Please also add the WARN_ON(), though at first glance your change seems
> > to have lost the wakeup.  (But it is early, so maybe it is just that I
> > am not yet fully awake.)
> 
> My bad, I posted incomplete diff in previous mail:
> 
>  static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>  {
>  	struct rcu_node *rnp;
> +	unsigned long exp_low;
> +	unsigned long s_low = rcu_seq_ctr(s) & 0x3;
> 
>  	synchronize_sched_expedited_wait(rsp);
>  	rcu_exp_gp_seq_end(rsp);
> @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> unsigned long s)
>  			spin_unlock(&rnp->exp_lock);
>  		}
>  		smp_mb(); /* All above changes before wakeup. */
> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> +		exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
> +		WARN_ON(s_low != exp_low);
> +		wake_up_all(&rnp->exp_wq[exp_low]);

Much better!

But I suggest using s_low in the wake_up_all.  This hunk is of course
only for testing purposes, not for actual inclusion.  (My earlier email
didn't make that clear.)
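
For concreteness, the suggested variant of that test-only hunk would look
roughly like this (a sketch of the suggestion above, not a diff that was
posted in the thread):

		smp_mb(); /* All above changes before wakeup. */
		exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
		WARN_ON(s_low != exp_low);
		wake_up_all(&rnp->exp_wq[s_low]);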

							Thanx, Paul

>  	}
> 
> 
> Thanks
> Neeraj
> 
> > 
> > 							Thanx, Paul
> > 
> > > I was thinking of another way of addressing this problem: Doing exp seq end
> > > inside exp_wake_mutex. This will also ensure that, if we extend the current
> > > scenario and there are multiple expedited GP sequence, which have completed,
> > > before exp_wake_mutex is held, we need to preserve the requirement of 3 wq
> > > entries [2].
> > > 
> > > 
> > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
> > > 
> > > 
> > > @@ -595,8 +595,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > unsigned long s)
> > >          struct rcu_node *rnp;
> > > 
> > >          synchronize_sched_expedited_wait(rsp);
> > > -       rcu_exp_gp_seq_end(rsp);
> > > -       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
> > > 
> > >          /*
> > >           * Switch over to wakeup mode, allowing the next GP, but -only- the
> > > @@ -604,6 +602,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > unsigned long s)
> > >           */
> > >          mutex_lock(&rsp->exp_wake_mutex);
> > > 
> > > +       rcu_exp_gp_seq_end(rsp);
> > > +       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
> > > +
> > > 
> > > 
> > > 
> > > > (Using a WARN_ON() to check for the lower bits of the counter portion
> > > > of rcu_state.expedited_sequence differing from the same bits of s
> > > > would be one way to detect this problem.)
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > 
> > > I have also given the patch for this, for testing:
> > > 
> > >   static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
> > >   {
> > >          struct rcu_node *rnp;
> > > +       unsigned long exp_low;
> > > +       unsigned long s_low = rcu_seq_ctr(s) & 0x3;
> > > 
> > >          synchronize_sched_expedited_wait(rsp);
> > >          rcu_exp_gp_seq_end(rsp);
> > > @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > unsigned long s)
> > >                          spin_unlock(&rnp->exp_lock);
> > >                  }
> > >                  smp_mb(); /* All above changes before wakeup. */
> > > - wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > +               exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
> > > +               WARN_ON(s_low != exp_low);
> > > +
> > > 
> > > Thanks
> > > Neeraj
> > > 
> > > > > Fix this by using the correct sequence number for wake_up_all() inside
> > > > > rcu_exp_wait_wake().
> > > > > 
> > > > > Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> > > > > ---
> > > > >    kernel/rcu/tree_exp.h | 2 +-
> > > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > > > index e4b77d3..28979d3 100644
> > > > > --- a/kernel/rcu/tree_exp.h
> > > > > +++ b/kernel/rcu/tree_exp.h
> > > > > @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
> > > > >    			spin_unlock(&rnp->exp_lock);
> > > > >    		}
> > > > >    		smp_mb(); /* All above changes before wakeup. */
> > > > > -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> > > > > +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
> > > > >    	}
> > > > >    	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
> > > > >    	mutex_unlock(&rcu_state.exp_wake_mutex);
> > > > > -- 
> > > > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> > > > > member of the Code Aurora Forum, hosted by The Linux Foundation
> > > > > 
> > > 
> > > -- 
> > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> > > the Code Aurora Forum, hosted by The Linux Foundation
> 
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
  2019-11-18 17:24         ` Paul E. McKenney
@ 2019-11-19  3:35           ` Neeraj Upadhyay
       [not found]           ` <0101016e81ba8865-028ac62b-3fcd-481b-b657-be8519c4edf5-000000@us-west-2.amazonses.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Neeraj Upadhyay @ 2019-11-19  3:35 UTC (permalink / raw)
  To: paulmck
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli

Hi Paul,

On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
> On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
>> Hi Paul,
>>
>>
>> On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
>>> On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
>>>> Hi Paul,
>>>>
>>>> On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
>>>>> On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
>>>>>> For the tasks waiting in exp_wq inside exp_funnel_lock(),
>>>>>> there is a chance that they might be indefinitely blocked
>>>>>> in below scenario:
>>>>>>
>>>>>> 1. There is a task waiting on exp sequence 0b'100' inside
>>>>>>       exp_funnel_lock().
>>>>>>
>>>>>>       _synchronize_rcu_expedited()
>>>>>
>>>>> This symbol went away a few versions back, but let's see how this
>>>>> plays out in current -rcu.
>>>>>
>>>>
>>>> Sorry; for us this problem is observed on 4.19 stable version; I had
>>>> checked against the -rcu code, and the relevant portions were present
>>>> there.
>>>>
>>>>>>         s = 0b'100
>>>>>>         exp_funnel_lock()
>>>>>>           wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
>>>>>
>>>>> All of the above could still happen if the expedited grace
>>>>> period number was zero (or a bit less) when that task invoked
>>>>
>>>> Yes
>>>>
>>>>> synchronize_rcu_expedited().  What is the relation, if any,
>>>>> between this task and "task1" below?  Seems like you want them to
>>>>> be different tasks.
>>>>>
>>>>
>>>> This task is the one which is waiting for the expedited sequence, which
>>>> "task1" completes ("task1" holds the exp_mutex for it). "task1" would
>>>> wake up this task, on exp GP completion.
>>>>
>>>>> Does this task actually block, or is it just getting ready
>>>>> to block?  Seems like you need it to have actually blocked.
>>>>>
>>>>
>>>> Yes, it actually blocked in wait queue.
>>>>
>>>>>> 2. The Exp GP completes and task (task1) holding exp_mutex queues
>>>>>>       worker and schedules out.
>>>>>
>>>>> "The Exp GP" being the one that was initiated when the .expedited_sequence
>>>>> counter was zero, correct?  (Looks that way below.)
>>>>>
>>>> Yes, correct.
>>>>
>>>>>>       _synchronize_rcu_expedited()
>>>>>>         s = 0b'100
>>>>>>         queue_work(rcu_gp_wq, &rew.rew_work)
>>>>>>           wake_up_worker()
>>>>>>             schedule()
>>>>>>
>>>>>> 3. kworker A picks up the queued work and completes the exp gp
>>>>>>       sequence.
>>>>>>
>>>>>>       rcu_exp_wait_wake()
>>>>>>         rcu_exp_wait_wake()
>>>>>>           rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>>>>>>                                   // to 0b'100'
>>>>>>
>>>>>> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>>>>>>       and releases exp_mutex.
>>>>>>
>>>>>>       wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>>>>>         sync_exp_work_done(rsp, s));
>>>>>>       mutex_unlock(&rsp->exp_mutex);
>>>>>
>>>>> So task1 is the one that initiated the expedited grace period that
>>>>> started when .expedited_sequence was zero, right?
>>>>>
>>>>
>>>> Yes, right.
>>>>
>>>>>> 5. Next exp GP completes, and sequence number is incremented:
>>>>>>
>>>>>>       rcu_exp_wait_wake()
>>>>>>         rcu_exp_wait_wake()
>>>>>>           rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
>>>>>>
>>>>>> 6. As kworker A uses current expedited_sequence, it wakes up workers
>>>>>>       from wrong wait queue index - it should have worken wait queue
>>>>>>       corresponding to 0b'100' sequence, but wakes up the ones for
>>>>>>       0b'200' sequence. This results in task at step 1 indefinitely blocked.
>>>>>>
>>>>>>       rcu_exp_wait_wake()
>>>>>>         wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>>>
>>>>> So the issue is that the next expedited RCU grace period might
>>>>> have completed before the completion of the wakeups for the previous
>>>>> expedited RCU grace period, correct?  Then expedited grace periods have
>>>>
>>>> Yes. Actually from the ftraces, I saw that next expedited RCU grace
>>>> period completed while kworker A was in D state, while waiting for
>>>> exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
>>>> its wake_up_all() call; so, task (point 1) was never woken up, as it was
>>>> waiting on wq index 1.
>>>>
>>>>> to have stopped to prevent any future wakeup from happening, correct?
>>>>> (Which would make it harder for rcutorture to trigger this, though it
>>>>> really does have code that attempts to trigger this sort of thing.)
>>>>>
>>>>> Is this theoretical in nature, or have you actually triggered it?
>>>>> If actually triggered, what did you do to make this happen?
>>>>
>>>> This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
>>>> another instance in Nov 2018 (on 4.14 kernel), in our customer reported
>>>> issues. Both instances were in downstream drivers and we didn't have RCU
>>>> traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
>>>> enabled, where it was observed in suspend scenario, where we are observing
>>>> "DPM device timeout" [1], as scsi device is stuck in
>>>> _synchronize_rcu_expedited().
>>>>
>>>> schedule+0x70/0x90
>>>> _synchronize_rcu_expedited+0x590/0x5f8
>>>> synchronize_rcu+0x50/0xa0
>>>> scsi_device_quiesce+0x50/0x120
>>>> scsi_bus_suspend+0x70/0xe8
>>>> dpm_run_callback+0x148/0x388
>>>> __device_suspend+0x430/0x8a8
>>>>
>>>> [1]
>>>> https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
>>>>
>>>>> What have you done to test the change?
>>>>>
>>>>
>>>> I have given this for testing; will share the results . Current analysis
>>>> and patch is based on going through ftrace and code review.
>>>
>>> OK, very good.  Please include the failure information in the changelog
>>> of the next version of this patch.

Done.

>>>
>>> I prefer your original patch, that just uses "s", over the one below
>>> that moves the rcu_exp_gp_seq_end().  The big advantage of your original
>>> patch is that it allow more concurrency between a consecutive pair of
>>> expedited RCU grace periods.  Plus it would not be easy to convince
>>> myself that moving rcu_exp_gp_seq_end() down is safe, so your original
>>> is also conceptually simpler with a more manageable state space.

The reason for highlighting the alternate approach of doing the GP end
inside exp_wake_mutex is the requirement of 3 wqs. Now, this is a
theoretical case; please correct me if I am wrong here:

1. task0 holds exp_wake_mutex, and is preempted.
2. task1 initiates a new GP (current seq number = 0).
3. task1 queues worker kworker1 and schedules out.
4. kworker1 sets exp GP to 1 and waits on exp_wake_mutex.
5. task1 releases exp_mutex, w/o entering the waitq.
6. task2 initiates a new GP (current seq number = 1).
7. task2 queues worker kworker2 and schedules out.
8. kworker2 sets exp GP to 2 and waits on exp_wake_mutex.
...

This sequence would break the requirement of max 3 wqs.

If we hold the exp_wake_mutex before the exp seq end, there will be at most
three pending GPs - 1. the GP which the current owner is doing wakeups for,
2. the GP which the waiter of exp_wake_mutex would complete, 3. the next GP's
waiters, which started after the GP at point 2 had started. This is also in
line with the statement in [1]:

"The key point is that the ->exp_mutex is not released until
the first wakeup is complete, which means that the ->exp_wake_mutex
has already been acquired at that point"

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
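
To make the wait-queue arithmetic concrete, below is a minimal userspace
sketch (not kernel code; seq_ctr(), seq_snap(), seq_start() and seq_end()
are simplified stand-ins for the kernel's rcu_seq helpers, with the memory
barriers and READ_ONCE()/WRITE_ONCE() omitted, and the other names are
local to the sketch). It replays the changelog scenario and shows why
indexing the wakeup by the snapshot "s", as in the posted fix, targets the
bucket the waiters are actually sleeping on, even after a later GP has
advanced expedited_sequence:

/*
 * Minimal userspace sketch (not kernel code).  The low two bits of the
 * sequence carry GP-in-progress state, the rest is the GP counter, and
 * the exp_wq[] bucket is the low two bits of that counter.
 */
#include <stdio.h>

#define SEQ_CTR_SHIFT   2
#define SEQ_STATE_MASK  ((1UL << SEQ_CTR_SHIFT) - 1)

static unsigned long seq_ctr(unsigned long s)
{
        return s >> SEQ_CTR_SHIFT;
}

/* Sequence value a new waiter must wait for (cf. rcu_seq_snap()). */
static unsigned long seq_snap(unsigned long seq)
{
        return (seq + 2 * SEQ_STATE_MASK + 1) & ~SEQ_STATE_MASK;
}

/* Begin a GP: low state bits become nonzero (cf. rcu_seq_start()). */
static void seq_start(unsigned long *seq)
{
        (*seq)++;
}

/* End a GP: clear the state bits and bump the counter (cf. rcu_seq_end()). */
static void seq_end(unsigned long *seq)
{
        *seq = (*seq | SEQ_STATE_MASK) + 1;
}

int main(void)
{
        unsigned long expedited_sequence = 0;

        /* Waiter snapshots its target while the counter is still 0. */
        unsigned long s = seq_snap(expedited_sequence);   /* 0b100 */
        unsigned long sleep_bucket = seq_ctr(s) & 0x3;    /* bucket 1 */

        seq_start(&expedited_sequence);
        seq_end(&expedited_sequence);     /* GP for "s" ends: 0b100 */

        seq_start(&expedited_sequence);
        seq_end(&expedited_sequence);     /* next GP also ends: 0b1000 */

        printf("waiter sleeps on exp_wq[%lu]\n", sleep_bucket);
        printf("wakeup indexed by live sequence -> exp_wq[%lu] (missed)\n",
               seq_ctr(expedited_sequence) & 0x3);
        printf("wakeup indexed by snapshot s    -> exp_wq[%lu] (fix)\n",
               seq_ctr(s) & 0x3);
        return 0;
}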

>>>
>>> Please also add the WARN_ON(), though at first glance your change seems
>>> to have lost the wakeup.  (But it is early, so maybe it is just that I
>>> am not yet fully awake.)
>>
>> My bad, I posted incomplete diff in previous mail:
>>
>>   static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>>   {
>>   	struct rcu_node *rnp;
>> +	unsigned long exp_low;
>> +	unsigned long s_low = rcu_seq_ctr(s) & 0x3;
>>
>>   	synchronize_sched_expedited_wait(rsp);
>>   	rcu_exp_gp_seq_end(rsp);
>> @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>> unsigned long s)
>>   			spin_unlock(&rnp->exp_lock);
>>   		}
>>   		smp_mb(); /* All above changes before wakeup. */
>> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>> +		exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
>> +		WARN_ON(s_low != exp_low);
>> +		wake_up_all(&rnp->exp_wq[exp_low]);
> 
> Much better!
> 
> But I suggest using s_low in the wake_up_all.  This hunk is of course
> only for testing purposes, not for actual inclusion.  (My earlier email
> didn't make that clear.)
> 
> 							Thanx, Paul
> 

Ok sure, got it. I will share the results once the issue is reproduced.


Thanks
Neeraj

>>   	}
>>
>>
>> Thanks
>> Neeraj
>>
>>>
>>> 							Thanx, Paul
>>>
>>>> I was thinking of another way of addressing this problem: Doing exp seq end
>>>> inside exp_wake_mutex. This will also ensure that, if we extend the current
>>>> scenario and there are multiple expedited GP sequence, which have completed,
>>>> before exp_wake_mutex is held, we need to preserve the requirement of 3 wq
>>>> entries [2].
>>>>
>>>>
>>>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
>>>>
>>>>
>>>> @@ -595,8 +595,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>>>> unsigned long s)
>>>>           struct rcu_node *rnp;
>>>>
>>>>           synchronize_sched_expedited_wait(rsp);
>>>> -       rcu_exp_gp_seq_end(rsp);
>>>> -       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
>>>>
>>>>           /*
>>>>            * Switch over to wakeup mode, allowing the next GP, but -only- the
>>>> @@ -604,6 +602,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>>>> unsigned long s)
>>>>            */
>>>>           mutex_lock(&rsp->exp_wake_mutex);
>>>>
>>>> +       rcu_exp_gp_seq_end(rsp);
>>>> +       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
>>>> +
>>>>
>>>>
>>>>
>>>>> (Using a WARN_ON() to check for the lower bits of the counter portion
>>>>> of rcu_state.expedited_sequence differing from the same bits of s
>>>>> would be one way to detect this problem.)
>>>>>
>>>>> 							Thanx, Paul
>>>>>
>>>>
>>>> I have also given the patch for this, for testing:
>>>>
>>>>    static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>>>>    {
>>>>           struct rcu_node *rnp;
>>>> +       unsigned long exp_low;
>>>> +       unsigned long s_low = rcu_seq_ctr(s) & 0x3;
>>>>
>>>>           synchronize_sched_expedited_wait(rsp);
>>>>           rcu_exp_gp_seq_end(rsp);
>>>> @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>>>> unsigned long s)
>>>>                           spin_unlock(&rnp->exp_lock);
>>>>                   }
>>>>                   smp_mb(); /* All above changes before wakeup. */
>>>> - wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>> +               exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
>>>> +               WARN_ON(s_low != exp_low);
>>>> +
>>>>
>>>> Thanks
>>>> Neeraj
>>>>
>>>>>> Fix this by using the correct sequence number for wake_up_all() inside
>>>>>> rcu_exp_wait_wake().
>>>>>>
>>>>>> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>>>>>> ---
>>>>>>     kernel/rcu/tree_exp.h | 2 +-
>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>>>>> index e4b77d3..28979d3 100644
>>>>>> --- a/kernel/rcu/tree_exp.h
>>>>>> +++ b/kernel/rcu/tree_exp.h
>>>>>> @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
>>>>>>     			spin_unlock(&rnp->exp_lock);
>>>>>>     		}
>>>>>>     		smp_mb(); /* All above changes before wakeup. */
>>>>>> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>>>>>> +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
>>>>>>     	}
>>>>>>     	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
>>>>>>     	mutex_unlock(&rcu_state.exp_wake_mutex);
>>>>>> -- 
>>>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>>>>>> member of the Code Aurora Forum, hosted by The Linux Foundation
>>>>>>
>>>>
>>>> -- 
>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>>>> the Code Aurora Forum, hosted by The Linux Foundation
>>
>> -- 
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>> the Code Aurora Forum, hosted by The Linux Foundation

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
       [not found]           ` <0101016e81ba8865-028ac62b-3fcd-481b-b657-be8519c4edf5-000000@us-west-2.amazonses.com>
@ 2019-11-19  4:05             ` Paul E. McKenney
  2019-11-19  7:03               ` Neeraj Upadhyay
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2019-11-19  4:05 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli

On Tue, Nov 19, 2019 at 03:35:15AM +0000, Neeraj Upadhyay wrote:
> Hi Paul,
> 
> On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
> > On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
> > > Hi Paul,
> > > 
> > > 
> > > On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
> > > > On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
> > > > > Hi Paul,
> > > > > 
> > > > > On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
> > > > > > On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
> > > > > > > For the tasks waiting in exp_wq inside exp_funnel_lock(),
> > > > > > > there is a chance that they might be indefinitely blocked
> > > > > > > in below scenario:
> > > > > > > 
> > > > > > > 1. There is a task waiting on exp sequence 0b'100' inside
> > > > > > >       exp_funnel_lock().
> > > > > > > 
> > > > > > >       _synchronize_rcu_expedited()
> > > > > > 
> > > > > > This symbol went away a few versions back, but let's see how this
> > > > > > plays out in current -rcu.
> > > > > > 
> > > > > 
> > > > > Sorry; for us this problem is observed on 4.19 stable version; I had
> > > > > checked against the -rcu code, and the relevant portions were present
> > > > > there.
> > > > > 
> > > > > > >         s = 0b'100
> > > > > > >         exp_funnel_lock()
> > > > > > >           wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
> > > > > > 
> > > > > > All of the above could still happen if the expedited grace
> > > > > > period number was zero (or a bit less) when that task invoked
> > > > > 
> > > > > Yes
> > > > > 
> > > > > > synchronize_rcu_expedited().  What is the relation, if any,
> > > > > > between this task and "task1" below?  Seems like you want them to
> > > > > > be different tasks.
> > > > > > 
> > > > > 
> > > > > This task is the one which is waiting for the expedited sequence, which
> > > > > "task1" completes ("task1" holds the exp_mutex for it). "task1" would
> > > > > wake up this task, on exp GP completion.
> > > > > 
> > > > > > Does this task actually block, or is it just getting ready
> > > > > > to block?  Seems like you need it to have actually blocked.
> > > > > > 
> > > > > 
> > > > > Yes, it actually blocked in wait queue.
> > > > > 
> > > > > > > 2. The Exp GP completes and task (task1) holding exp_mutex queues
> > > > > > >       worker and schedules out.
> > > > > > 
> > > > > > "The Exp GP" being the one that was initiated when the .expedited_sequence
> > > > > > counter was zero, correct?  (Looks that way below.)
> > > > > > 
> > > > > Yes, correct.
> > > > > 
> > > > > > >       _synchronize_rcu_expedited()
> > > > > > >         s = 0b'100
> > > > > > >         queue_work(rcu_gp_wq, &rew.rew_work)
> > > > > > >           wake_up_worker()
> > > > > > >             schedule()
> > > > > > > 
> > > > > > > 3. kworker A picks up the queued work and completes the exp gp
> > > > > > >       sequence.
> > > > > > > 
> > > > > > >       rcu_exp_wait_wake()
> > > > > > >         rcu_exp_wait_wake()
> > > > > > >           rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
> > > > > > >                                   // to 0b'100'
> > > > > > > 
> > > > > > > 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
> > > > > > >       and releases exp_mutex.
> > > > > > > 
> > > > > > >       wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > > > > > >         sync_exp_work_done(rsp, s));
> > > > > > >       mutex_unlock(&rsp->exp_mutex);
> > > > > > 
> > > > > > So task1 is the one that initiated the expedited grace period that
> > > > > > started when .expedited_sequence was zero, right?
> > > > > > 
> > > > > 
> > > > > Yes, right.
> > > > > 
> > > > > > > 5. Next exp GP completes, and sequence number is incremented:
> > > > > > > 
> > > > > > >       rcu_exp_wait_wake()
> > > > > > >         rcu_exp_wait_wake()
> > > > > > >           rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
> > > > > > > 
> > > > > > > 6. As kworker A uses current expedited_sequence, it wakes up workers
> > > > > > >       from wrong wait queue index - it should have worken wait queue
> > > > > > >       corresponding to 0b'100' sequence, but wakes up the ones for
> > > > > > >       0b'200' sequence. This results in task at step 1 indefinitely blocked.
> > > > > > > 
> > > > > > >       rcu_exp_wait_wake()
> > > > > > >         wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > > > > 
> > > > > > So the issue is that the next expedited RCU grace period might
> > > > > > have completed before the completion of the wakeups for the previous
> > > > > > expedited RCU grace period, correct?  Then expedited grace periods have
> > > > > 
> > > > > Yes. Actually from the ftraces, I saw that next expedited RCU grace
> > > > > period completed while kworker A was in D state, while waiting for
> > > > > exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
> > > > > its wake_up_all() call; so, task (point 1) was never woken up, as it was
> > > > > waiting on wq index 1.
> > > > > 
> > > > > > to have stopped to prevent any future wakeup from happening, correct?
> > > > > > (Which would make it harder for rcutorture to trigger this, though it
> > > > > > really does have code that attempts to trigger this sort of thing.)
> > > > > > 
> > > > > > Is this theoretical in nature, or have you actually triggered it?
> > > > > > If actually triggered, what did you do to make this happen?
> > > > > 
> > > > > This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
> > > > > another instance in Nov 2018 (on 4.14 kernel), in our customer reported
> > > > > issues. Both instances were in downstream drivers and we didn't have RCU
> > > > > traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
> > > > > enabled, where it was observed in suspend scenario, where we are observing
> > > > > "DPM device timeout" [1], as scsi device is stuck in
> > > > > _synchronize_rcu_expedited().
> > > > > 
> > > > > schedule+0x70/0x90
> > > > > _synchronize_rcu_expedited+0x590/0x5f8
> > > > > synchronize_rcu+0x50/0xa0
> > > > > scsi_device_quiesce+0x50/0x120
> > > > > scsi_bus_suspend+0x70/0xe8
> > > > > dpm_run_callback+0x148/0x388
> > > > > __device_suspend+0x430/0x8a8
> > > > > 
> > > > > [1]
> > > > > https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
> > > > > 
> > > > > > What have you done to test the change?
> > > > > > 
> > > > > 
> > > > > I have given this for testing; will share the results . Current analysis
> > > > > and patch is based on going through ftrace and code review.
> > > > 
> > > > OK, very good.  Please include the failure information in the changelog
> > > > of the next version of this patch.
> 
> Done.
> 
> > > > 
> > > > I prefer your original patch, that just uses "s", over the one below
> > > > that moves the rcu_exp_gp_seq_end().  The big advantage of your original
> > > > patch is that it allow more concurrency between a consecutive pair of
> > > > expedited RCU grace periods.  Plus it would not be easy to convince
> > > > myself that moving rcu_exp_gp_seq_end() down is safe, so your original
> > > > is also conceptually simpler with a more manageable state space.
> 
> The reason for highlighting the alternate approach of doing gp end inside
> exp_wake_mutex is the requirement of 3 wqs. Now, this is a theoretical case;
> please correct me if I am wrong here:
> 
> 1. task0 holds exp_wake_mutex, and is preempted.

Presumably after it has awakened the kthread that initiated the prior
expedited grace period (the one with seq number = -4).

> 2. task1 initiates new GP (current seq number = 0).

Yes, this can happen.

> 3. task1 queues worker kworker1 and schedules out.

And thus still holds .exp_mutex, but yes.

> 4. kworker1 sets exp GP to 1 and waits on exp_wake_mutex

And thus cannot yet have awakened task1.

> 5. task1 releases exp mutex, w/o entering waitq.

So I do not believe that we can get to #5.  What am I missing here?

> 6. task2 initiates new GP (current seq number = 1).
> 7. task2 queues worker kworker1 and schedules out.
> 8. kworker 2 sets exp GP to 2 and waits on exp_wake_mutex.
> ...
> 
> This sequence would break the requirement of max 3 wqs.
> 
> If we hold the exp_wake_mutex before exp seq end, there will be atmost three
> pending GPs - 1. GP which current owner is doing wakeups for,
> 2. GP which the waiter of exp_wake_mutex would complete, 3. Next GP
> waiters, which started after GP @ point 2 had started. This also is inline
> with the statement in [1]:
> 
> "he key point is that the ->exp_mutex is not released until
> the first wakeup is complete, which means that the ->exp_wake_mutex
> has already been acquired at that point"
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev

And I believe that this still holds.  The task holding .exp_mutex cannot
release it until it has been awakened, and it won't be awakened until after
the .exp_wake_mutex has been acquired.  Again, what am I missing here?

> > > > Please also add the WARN_ON(), though at first glance your change seems
> > > > to have lost the wakeup.  (But it is early, so maybe it is just that I
> > > > am not yet fully awake.)
> > > 
> > > My bad, I posted incomplete diff in previous mail:
> > > 
> > >   static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
> > >   {
> > >   	struct rcu_node *rnp;
> > > +	unsigned long exp_low;
> > > +	unsigned long s_low = rcu_seq_ctr(s) & 0x3;
> > > 
> > >   	synchronize_sched_expedited_wait(rsp);
> > >   	rcu_exp_gp_seq_end(rsp);
> > > @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > unsigned long s)
> > >   			spin_unlock(&rnp->exp_lock);
> > >   		}
> > >   		smp_mb(); /* All above changes before wakeup. */
> > > -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > +		exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
> > > +		WARN_ON(s_low != exp_low);
> > > +		wake_up_all(&rnp->exp_wq[exp_low]);
> > 
> > Much better!
> > 
> > But I suggest using s_low in the wake_up_all.  This hunk is of course
> > only for testing purposes, not for actual inclusion.  (My earlier email
> > didn't make that clear.)
> > 
> > 							Thanx, Paul
> > 
> 
> Ok sure, got it. I will share the results, once the issue is reproduced.

Sounds good!

							Thanx, Paul

> Thanks
> Neeraj
> 
> > >   	}
> > > 
> > > 
> > > Thanks
> > > Neeraj
> > > 
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > > I was thinking of another way of addressing this problem: Doing exp seq end
> > > > > inside exp_wake_mutex. This will also ensure that, if we extend the current
> > > > > scenario and there are multiple expedited GP sequence, which have completed,
> > > > > before exp_wake_mutex is held, we need to preserve the requirement of 3 wq
> > > > > entries [2].
> > > > > 
> > > > > 
> > > > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
> > > > > 
> > > > > 
> > > > > @@ -595,8 +595,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > > > unsigned long s)
> > > > >           struct rcu_node *rnp;
> > > > > 
> > > > >           synchronize_sched_expedited_wait(rsp);
> > > > > -       rcu_exp_gp_seq_end(rsp);
> > > > > -       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
> > > > > 
> > > > >           /*
> > > > >            * Switch over to wakeup mode, allowing the next GP, but -only- the
> > > > > @@ -604,6 +602,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > > > unsigned long s)
> > > > >            */
> > > > >           mutex_lock(&rsp->exp_wake_mutex);
> > > > > 
> > > > > +       rcu_exp_gp_seq_end(rsp);
> > > > > +       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
> > > > > +
> > > > > 
> > > > > 
> > > > > 
> > > > > > (Using a WARN_ON() to check for the lower bits of the counter portion
> > > > > > of rcu_state.expedited_sequence differing from the same bits of s
> > > > > > would be one way to detect this problem.)
> > > > > > 
> > > > > > 							Thanx, Paul
> > > > > > 
> > > > > 
> > > > > I have also given the patch for this, for testing:
> > > > > 
> > > > >    static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
> > > > >    {
> > > > >           struct rcu_node *rnp;
> > > > > +       unsigned long exp_low;
> > > > > +       unsigned long s_low = rcu_seq_ctr(s) & 0x3;
> > > > > 
> > > > >           synchronize_sched_expedited_wait(rsp);
> > > > >           rcu_exp_gp_seq_end(rsp);
> > > > > @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > > > unsigned long s)
> > > > >                           spin_unlock(&rnp->exp_lock);
> > > > >                   }
> > > > >                   smp_mb(); /* All above changes before wakeup. */
> > > > > - wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > > > +               exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
> > > > > +               WARN_ON(s_low != exp_low);
> > > > > +
> > > > > 
> > > > > Thanks
> > > > > Neeraj
> > > > > 
> > > > > > > Fix this by using the correct sequence number for wake_up_all() inside
> > > > > > > rcu_exp_wait_wake().
> > > > > > > 
> > > > > > > Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> > > > > > > ---
> > > > > > >     kernel/rcu/tree_exp.h | 2 +-
> > > > > > >     1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > 
> > > > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > > > > > index e4b77d3..28979d3 100644
> > > > > > > --- a/kernel/rcu/tree_exp.h
> > > > > > > +++ b/kernel/rcu/tree_exp.h
> > > > > > > @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
> > > > > > >     			spin_unlock(&rnp->exp_lock);
> > > > > > >     		}
> > > > > > >     		smp_mb(); /* All above changes before wakeup. */
> > > > > > > -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> > > > > > > +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
> > > > > > >     	}
> > > > > > >     	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
> > > > > > >     	mutex_unlock(&rcu_state.exp_wake_mutex);
> > > > > > > -- 
> > > > > > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> > > > > > > member of the Code Aurora Forum, hosted by The Linux Foundation
> > > > > > > 
> > > > > 
> > > > > -- 
> > > > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> > > > > the Code Aurora Forum, hosted by The Linux Foundation
> > > 
> > > -- 
> > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> > > the Code Aurora Forum, hosted by The Linux Foundation
> 
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
  2019-11-19  4:05             ` Paul E. McKenney
@ 2019-11-19  7:03               ` Neeraj Upadhyay
  2019-11-19 15:09                 ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Neeraj Upadhyay @ 2019-11-19  7:03 UTC (permalink / raw)
  To: paulmck
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli

Hi Paul,

On 11/19/2019 9:35 AM, Paul E. McKenney wrote:
> On Tue, Nov 19, 2019 at 03:35:15AM +0000, Neeraj Upadhyay wrote:
>> Hi Paul,
>>
>> On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
>>> On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
>>>> Hi Paul,
>>>>
>>>>
>>>> On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
>>>>> On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
>>>>>> Hi Paul,
>>>>>>
>>>>>> On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
>>>>>>> On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
>>>>>>>> For the tasks waiting in exp_wq inside exp_funnel_lock(),
>>>>>>>> there is a chance that they might be indefinitely blocked
>>>>>>>> in below scenario:
>>>>>>>>
>>>>>>>> 1. There is a task waiting on exp sequence 0b'100' inside
>>>>>>>>        exp_funnel_lock().
>>>>>>>>
>>>>>>>>        _synchronize_rcu_expedited()
>>>>>>>
>>>>>>> This symbol went away a few versions back, but let's see how this
>>>>>>> plays out in current -rcu.
>>>>>>>
>>>>>>
>>>>>> Sorry; for us this problem is observed on 4.19 stable version; I had
>>>>>> checked against the -rcu code, and the relevant portions were present
>>>>>> there.
>>>>>>
>>>>>>>>          s = 0b'100
>>>>>>>>          exp_funnel_lock()
>>>>>>>>            wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
>>>>>>>
>>>>>>> All of the above could still happen if the expedited grace
>>>>>>> period number was zero (or a bit less) when that task invoked
>>>>>>
>>>>>> Yes
>>>>>>
>>>>>>> synchronize_rcu_expedited().  What is the relation, if any,
>>>>>>> between this task and "task1" below?  Seems like you want them to
>>>>>>> be different tasks.
>>>>>>>
>>>>>>
>>>>>> This task is the one which is waiting for the expedited sequence, which
>>>>>> "task1" completes ("task1" holds the exp_mutex for it). "task1" would
>>>>>> wake up this task, on exp GP completion.
>>>>>>
>>>>>>> Does this task actually block, or is it just getting ready
>>>>>>> to block?  Seems like you need it to have actually blocked.
>>>>>>>
>>>>>>
>>>>>> Yes, it actually blocked in wait queue.
>>>>>>
>>>>>>>> 2. The Exp GP completes and task (task1) holding exp_mutex queues
>>>>>>>>        worker and schedules out.
>>>>>>>
>>>>>>> "The Exp GP" being the one that was initiated when the .expedited_sequence
>>>>>>> counter was zero, correct?  (Looks that way below.)
>>>>>>>
>>>>>> Yes, correct.
>>>>>>
>>>>>>>>        _synchronize_rcu_expedited()
>>>>>>>>          s = 0b'100
>>>>>>>>          queue_work(rcu_gp_wq, &rew.rew_work)
>>>>>>>>            wake_up_worker()
>>>>>>>>              schedule()
>>>>>>>>
>>>>>>>> 3. kworker A picks up the queued work and completes the exp gp
>>>>>>>>        sequence.
>>>>>>>>
>>>>>>>>        rcu_exp_wait_wake()
>>>>>>>>          rcu_exp_wait_wake()
>>>>>>>>            rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>>>>>>>>                                    // to 0b'100'
>>>>>>>>
>>>>>>>> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>>>>>>>>        and releases exp_mutex.
>>>>>>>>
>>>>>>>>        wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>>>>>>>          sync_exp_work_done(rsp, s));
>>>>>>>>        mutex_unlock(&rsp->exp_mutex);
>>>>>>>
>>>>>>> So task1 is the one that initiated the expedited grace period that
>>>>>>> started when .expedited_sequence was zero, right?
>>>>>>>
>>>>>>
>>>>>> Yes, right.
>>>>>>
>>>>>>>> 5. Next exp GP completes, and sequence number is incremented:
>>>>>>>>
>>>>>>>>        rcu_exp_wait_wake()
>>>>>>>>          rcu_exp_wait_wake()
>>>>>>>>            rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
>>>>>>>>
>>>>>>>> 6. As kworker A uses current expedited_sequence, it wakes up workers
>>>>>>>>        from wrong wait queue index - it should have worken wait queue
>>>>>>>>        corresponding to 0b'100' sequence, but wakes up the ones for
>>>>>>>>        0b'200' sequence. This results in task at step 1 indefinitely blocked.
>>>>>>>>
>>>>>>>>        rcu_exp_wait_wake()
>>>>>>>>          wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>>>>>
>>>>>>> So the issue is that the next expedited RCU grace period might
>>>>>>> have completed before the completion of the wakeups for the previous
>>>>>>> expedited RCU grace period, correct?  Then expedited grace periods have
>>>>>>
>>>>>> Yes. Actually from the ftraces, I saw that next expedited RCU grace
>>>>>> period completed while kworker A was in D state, while waiting for
>>>>>> exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
>>>>>> its wake_up_all() call; so, task (point 1) was never woken up, as it was
>>>>>> waiting on wq index 1.
>>>>>>
>>>>>>> to have stopped to prevent any future wakeup from happening, correct?
>>>>>>> (Which would make it harder for rcutorture to trigger this, though it
>>>>>>> really does have code that attempts to trigger this sort of thing.)
>>>>>>>
>>>>>>> Is this theoretical in nature, or have you actually triggered it?
>>>>>>> If actually triggered, what did you do to make this happen?
>>>>>>
>>>>>> This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
>>>>>> another instance in Nov 2018 (on 4.14 kernel), in our customer reported
>>>>>> issues. Both instances were in downstream drivers and we didn't have RCU
>>>>>> traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
>>>>>> enabled, where it was observed in suspend scenario, where we are observing
>>>>>> "DPM device timeout" [1], as scsi device is stuck in
>>>>>> _synchronize_rcu_expedited().
>>>>>>
>>>>>> schedule+0x70/0x90
>>>>>> _synchronize_rcu_expedited+0x590/0x5f8
>>>>>> synchronize_rcu+0x50/0xa0
>>>>>> scsi_device_quiesce+0x50/0x120
>>>>>> scsi_bus_suspend+0x70/0xe8
>>>>>> dpm_run_callback+0x148/0x388
>>>>>> __device_suspend+0x430/0x8a8
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
>>>>>>
>>>>>>> What have you done to test the change?
>>>>>>>
>>>>>>
>>>>>> I have given this for testing; will share the results . Current analysis
>>>>>> and patch is based on going through ftrace and code review.
>>>>>
>>>>> OK, very good.  Please include the failure information in the changelog
>>>>> of the next version of this patch.
>>
>> Done.
>>
>>>>>
>>>>> I prefer your original patch, that just uses "s", over the one below
>>>>> that moves the rcu_exp_gp_seq_end().  The big advantage of your original
>>>>> patch is that it allow more concurrency between a consecutive pair of
>>>>> expedited RCU grace periods.  Plus it would not be easy to convince
>>>>> myself that moving rcu_exp_gp_seq_end() down is safe, so your original
>>>>> is also conceptually simpler with a more manageable state space.
>>
>> The reason for highlighting the alternate approach of doing gp end inside
>> exp_wake_mutex is the requirement of 3 wqs. Now, this is a theoretical case;
>> please correct me if I am wrong here:
>>
>> 1. task0 holds exp_wake_mutex, and is preempted.
> 
> Presumably after it has awakened the kthread that initiated the prior
> expedited grace period (the one with seq number = -4).
> 
>> 2. task1 initiates new GP (current seq number = 0).
> 
> Yes, this can happen.
> 
>> 3. task1 queues worker kworker1 and schedules out.
> 
> And thus still holds .exp_mutex, but yes.
> 
>> 4. kworker1 sets exp GP to 1 and waits on exp_wake_mutex
> 
> And thus cannot yet have awakened task1.
> 
>> 5. task1 releases exp mutex, w/o entering waitq.
> 
> So I do not believe that we can get to #5.  What am I missing here?
> 

As mentioned in this patch, task1 could have scheduled out after queuing 
work:

queue_work(rcu_gp_wq, &rew.rew_work)
            wake_up_worker()
              schedule()

kworker1 runs, picks up this queued work, sets the exp GP to 1, and
waits on exp_wake_mutex.

task1 gets scheduled in and checks sync_exp_work_done(rsp, s), which
returns true, so it does not enter the wait queue and releases exp_mutex:

wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
          sync_exp_work_done(rsp, s));
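
The key mechanism is that wait_event() evaluates its condition before the
caller is ever added to the wait queue, so a task whose condition is
already true never sleeps. Here is a minimal userspace sketch of that fast
path (work_done() and model_wait_event() are simplified stand-ins for
sync_exp_work_done() and wait_event(); the signed comparison stands in for
rcu_seq_done() with the memory barriers omitted):

#include <stdbool.h>
#include <stdio.h>

static unsigned long expedited_sequence; /* models rcu_state.expedited_sequence */
static unsigned long s;                  /* snapshot task1 is waiting for */

/* Simplified stand-in for sync_exp_work_done()/rcu_seq_done(). */
static bool work_done(void)
{
        return (long)(expedited_sequence - s) >= 0;
}

/* Simplified stand-in for wait_event(): condition first, sleep only if false. */
static bool model_wait_event(bool (*cond)(void))
{
        if (cond())
                return false;   /* fast path: never enqueued on exp_wq[] */
        /* slow path (not modeled): enqueue on exp_wq[rcu_seq_ctr(s) & 0x3] and sleep */
        return true;
}

int main(void)
{
        s = 0x4;                  /* 0b100: the GP task1 initiated */
        expedited_sequence = 0x4; /* kworker1 already ran rcu_exp_gp_seq_end() */

        if (!model_wait_event(work_done))
                printf("task1 never slept and may release exp_mutex right away\n");
        return 0;
}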

Thanks
Neeraj

>> 6. task2 initiates new GP (current seq number = 1).
>> 7. task2 queues worker kworker1 and schedules out.
>> 8. kworker 2 sets exp GP to 2 and waits on exp_wake_mutex.
>> ...
>>
>> This sequence would break the requirement of max 3 wqs.
>>
>> If we hold the exp_wake_mutex before exp seq end, there will be atmost three
>> pending GPs - 1. GP which current owner is doing wakeups for,
>> 2. GP which the waiter of exp_wake_mutex would complete, 3. Next GP
>> waiters, which started after GP @ point 2 had started. This also is inline
>> with the statement in [1]:
>>
>> "he key point is that the ->exp_mutex is not released until
>> the first wakeup is complete, which means that the ->exp_wake_mutex
>> has already been acquired at that point"
>>
>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
> 
> And I believe that this still holds.  The task holding .exp_mutex cannot
> release it until it has been awakened, and it won't be awakened until after
> the .exp_wake_mutex has been acquired.  Again, what am I missing here?
> 
>>>>> Please also add the WARN_ON(), though at first glance your change seems
>>>>> to have lost the wakeup.  (But it is early, so maybe it is just that I
>>>>> am not yet fully awake.)
>>>>
>>>> My bad, I posted incomplete diff in previous mail:
>>>>
>>>>    static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>>>>    {
>>>>    	struct rcu_node *rnp;
>>>> +	unsigned long exp_low;
>>>> +	unsigned long s_low = rcu_seq_ctr(s) & 0x3;
>>>>
>>>>    	synchronize_sched_expedited_wait(rsp);
>>>>    	rcu_exp_gp_seq_end(rsp);
>>>> @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>>>> unsigned long s)
>>>>    			spin_unlock(&rnp->exp_lock);
>>>>    		}
>>>>    		smp_mb(); /* All above changes before wakeup. */
>>>> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>> +		exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
>>>> +		WARN_ON(s_low != exp_low);
>>>> +		wake_up_all(&rnp->exp_wq[exp_low]);
>>>
>>> Much better!
>>>
>>> But I suggest using s_low in the wake_up_all.  This hunk is of course
>>> only for testing purposes, not for actual inclusion.  (My earlier email
>>> didn't make that clear.)
>>>
>>> 							Thanx, Paul
>>>
>>
>> Ok sure, got it. I will share the results, once the issue is reproduced.
> 
> Sounds good!
> 
> 							Thanx, Paul
> 
>> Thanks
>> Neeraj
>>
>>>>    	}
>>>>
>>>>
>>>> Thanks
>>>> Neeraj
>>>>
>>>>>
>>>>> 							Thanx, Paul
>>>>>
>>>>>> I was thinking of another way of addressing this problem: Doing exp seq end
>>>>>> inside exp_wake_mutex. This will also ensure that, if we extend the current
>>>>>> scenario and there are multiple expedited GP sequence, which have completed,
>>>>>> before exp_wake_mutex is held, we need to preserve the requirement of 3 wq
>>>>>> entries [2].
>>>>>>
>>>>>>
>>>>>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
>>>>>>
>>>>>>
>>>>>> @@ -595,8 +595,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>>>>>> unsigned long s)
>>>>>>            struct rcu_node *rnp;
>>>>>>
>>>>>>            synchronize_sched_expedited_wait(rsp);
>>>>>> -       rcu_exp_gp_seq_end(rsp);
>>>>>> -       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
>>>>>>
>>>>>>            /*
>>>>>>             * Switch over to wakeup mode, allowing the next GP, but -only- the
>>>>>> @@ -604,6 +602,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>>>>>> unsigned long s)
>>>>>>             */
>>>>>>            mutex_lock(&rsp->exp_wake_mutex);
>>>>>>
>>>>>> +       rcu_exp_gp_seq_end(rsp);
>>>>>> +       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
>>>>>> +
>>>>>>
>>>>>>
>>>>>>
>>>>>>> (Using a WARN_ON() to check for the lower bits of the counter portion
>>>>>>> of rcu_state.expedited_sequence differing from the same bits of s
>>>>>>> would be one way to detect this problem.)
>>>>>>>
>>>>>>> 							Thanx, Paul
>>>>>>>
>>>>>>
>>>>>> I have also given the patch for this, for testing:
>>>>>>
>>>>>>     static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>>>>>>     {
>>>>>>            struct rcu_node *rnp;
>>>>>> +       unsigned long exp_low;
>>>>>> +       unsigned long s_low = rcu_seq_ctr(s) & 0x3;
>>>>>>
>>>>>>            synchronize_sched_expedited_wait(rsp);
>>>>>>            rcu_exp_gp_seq_end(rsp);
>>>>>> @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>>>>>> unsigned long s)
>>>>>>                            spin_unlock(&rnp->exp_lock);
>>>>>>                    }
>>>>>>                    smp_mb(); /* All above changes before wakeup. */
>>>>>> - wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>>>> +               exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
>>>>>> +               WARN_ON(s_low != exp_low);
>>>>>> +
>>>>>>
>>>>>> Thanks
>>>>>> Neeraj
>>>>>>
>>>>>>>> Fix this by using the correct sequence number for wake_up_all() inside
>>>>>>>> rcu_exp_wait_wake().
>>>>>>>>
>>>>>>>> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>>>>>>>> ---
>>>>>>>>      kernel/rcu/tree_exp.h | 2 +-
>>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>>>>>>> index e4b77d3..28979d3 100644
>>>>>>>> --- a/kernel/rcu/tree_exp.h
>>>>>>>> +++ b/kernel/rcu/tree_exp.h
>>>>>>>> @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
>>>>>>>>      			spin_unlock(&rnp->exp_lock);
>>>>>>>>      		}
>>>>>>>>      		smp_mb(); /* All above changes before wakeup. */
>>>>>>>> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>>>>>>>> +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
>>>>>>>>      	}
>>>>>>>>      	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
>>>>>>>>      	mutex_unlock(&rcu_state.exp_wake_mutex);
>>>>>>>> -- 
>>>>>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>>>>>>>> member of the Code Aurora Forum, hosted by The Linux Foundation
>>>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>>>>>> the Code Aurora Forum, hosted by The Linux Foundation
>>>>
>>>> -- 
>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>>>> the Code Aurora Forum, hosted by The Linux Foundation
>>
>> -- 
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>> the Code Aurora Forum, hosted by The Linux Foundation

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
  2019-11-19  7:03               ` Neeraj Upadhyay
@ 2019-11-19 15:09                 ` Paul E. McKenney
  2019-11-19 20:06                   ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2019-11-19 15:09 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli

On Tue, Nov 19, 2019 at 07:03:14AM +0000, Neeraj Upadhyay wrote:
> Hi Paul,
> 
> On 11/19/2019 9:35 AM, Paul E. McKenney wrote:
> > On Tue, Nov 19, 2019 at 03:35:15AM +0000, Neeraj Upadhyay wrote:
> > > Hi Paul,
> > > 
> > > On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
> > > > On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
> > > > > Hi Paul,
> > > > > 
> > > > > 
> > > > > On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
> > > > > > On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
> > > > > > > Hi Paul,
> > > > > > > 
> > > > > > > On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
> > > > > > > > On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
> > > > > > > > > For the tasks waiting in exp_wq inside exp_funnel_lock(),
> > > > > > > > > there is a chance that they might be indefinitely blocked
> > > > > > > > > in below scenario:
> > > > > > > > > 
> > > > > > > > > 1. There is a task waiting on exp sequence 0b'100' inside
> > > > > > > > >        exp_funnel_lock().
> > > > > > > > > 
> > > > > > > > >        _synchronize_rcu_expedited()
> > > > > > > > 
> > > > > > > > This symbol went away a few versions back, but let's see how this
> > > > > > > > plays out in current -rcu.
> > > > > > > > 
> > > > > > > 
> > > > > > > Sorry; for us this problem is observed on 4.19 stable version; I had
> > > > > > > checked against the -rcu code, and the relevant portions were present
> > > > > > > there.
> > > > > > > 
> > > > > > > > >          s = 0b'100
> > > > > > > > >          exp_funnel_lock()
> > > > > > > > >            wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
> > > > > > > > 
> > > > > > > > All of the above could still happen if the expedited grace
> > > > > > > > period number was zero (or a bit less) when that task invoked
> > > > > > > 
> > > > > > > Yes
> > > > > > > 
> > > > > > > > synchronize_rcu_expedited().  What is the relation, if any,
> > > > > > > > between this task and "task1" below?  Seems like you want them to
> > > > > > > > be different tasks.
> > > > > > > > 
> > > > > > > 
> > > > > > > This task is the one which is waiting for the expedited sequence, which
> > > > > > > "task1" completes ("task1" holds the exp_mutex for it). "task1" would
> > > > > > > wake up this task, on exp GP completion.
> > > > > > > 
> > > > > > > > Does this task actually block, or is it just getting ready
> > > > > > > > to block?  Seems like you need it to have actually blocked.
> > > > > > > > 
> > > > > > > 
> > > > > > > Yes, it actually blocked in wait queue.
> > > > > > > 
> > > > > > > > > 2. The Exp GP completes and task (task1) holding exp_mutex queues
> > > > > > > > >        worker and schedules out.
> > > > > > > > 
> > > > > > > > "The Exp GP" being the one that was initiated when the .expedited_sequence
> > > > > > > > counter was zero, correct?  (Looks that way below.)
> > > > > > > > 
> > > > > > > Yes, correct.
> > > > > > > 
> > > > > > > > >        _synchronize_rcu_expedited()
> > > > > > > > >          s = 0b'100
> > > > > > > > >          queue_work(rcu_gp_wq, &rew.rew_work)
> > > > > > > > >            wake_up_worker()
> > > > > > > > >              schedule()
> > > > > > > > > 
> > > > > > > > > 3. kworker A picks up the queued work and completes the exp gp
> > > > > > > > >        sequence.
> > > > > > > > > 
> > > > > > > > >        rcu_exp_wait_wake()
> > > > > > > > >          rcu_exp_wait_wake()
> > > > > > > > >            rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
> > > > > > > > >                                    // to 0b'100'
> > > > > > > > > 
> > > > > > > > > 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
> > > > > > > > >        and releases exp_mutex.
> > > > > > > > > 
> > > > > > > > >        wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > > > > > > > >          sync_exp_work_done(rsp, s));
> > > > > > > > >        mutex_unlock(&rsp->exp_mutex);
> > > > > > > > 
> > > > > > > > So task1 is the one that initiated the expedited grace period that
> > > > > > > > started when .expedited_sequence was zero, right?
> > > > > > > > 
> > > > > > > 
> > > > > > > Yes, right.
> > > > > > > 
> > > > > > > > > 5. Next exp GP completes, and sequence number is incremented:
> > > > > > > > > 
> > > > > > > > >        rcu_exp_wait_wake()
> > > > > > > > >          rcu_exp_wait_wake()
> > > > > > > > >            rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
> > > > > > > > > 
> > > > > > > > > 6. As kworker A uses current expedited_sequence, it wakes up workers
> > > > > > > > >        from wrong wait queue index - it should have worken wait queue
> > > > > > > > >        corresponding to 0b'100' sequence, but wakes up the ones for
> > > > > > > > >        0b'200' sequence. This results in task at step 1 indefinitely blocked.
> > > > > > > > > 
> > > > > > > > >        rcu_exp_wait_wake()
> > > > > > > > >          wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > > > > > > 
> > > > > > > > So the issue is that the next expedited RCU grace period might
> > > > > > > > have completed before the completion of the wakeups for the previous
> > > > > > > > expedited RCU grace period, correct?  Then expedited grace periods have
> > > > > > > 
> > > > > > > Yes. Actually from the ftraces, I saw that next expedited RCU grace
> > > > > > > period completed while kworker A was in D state, while waiting for
> > > > > > > exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
> > > > > > > its wake_up_all() call; so, task (point 1) was never woken up, as it was
> > > > > > > waiting on wq index 1.
> > > > > > > 
> > > > > > > > to have stopped to prevent any future wakeup from happening, correct?
> > > > > > > > (Which would make it harder for rcutorture to trigger this, though it
> > > > > > > > really does have code that attempts to trigger this sort of thing.)
> > > > > > > > 
> > > > > > > > Is this theoretical in nature, or have you actually triggered it?
> > > > > > > > If actually triggered, what did you do to make this happen?
> > > > > > > 
> > > > > > > This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
> > > > > > > another instance in Nov 2018 (on 4.14 kernel), in our customer reported
> > > > > > > issues. Both instances were in downstream drivers and we didn't have RCU
> > > > > > > traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
> > > > > > > enabled, where it was observed in suspend scenario, where we are observing
> > > > > > > "DPM device timeout" [1], as scsi device is stuck in
> > > > > > > _synchronize_rcu_expedited().
> > > > > > > 
> > > > > > > schedule+0x70/0x90
> > > > > > > _synchronize_rcu_expedited+0x590/0x5f8
> > > > > > > synchronize_rcu+0x50/0xa0
> > > > > > > scsi_device_quiesce+0x50/0x120
> > > > > > > scsi_bus_suspend+0x70/0xe8
> > > > > > > dpm_run_callback+0x148/0x388
> > > > > > > __device_suspend+0x430/0x8a8
> > > > > > > 
> > > > > > > [1]
> > > > > > > https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
> > > > > > > 
> > > > > > > > What have you done to test the change?
> > > > > > > > 
> > > > > > > 
> > > > > > > I have given this for testing; will share the results . Current analysis
> > > > > > > and patch is based on going through ftrace and code review.
> > > > > > 
> > > > > > OK, very good.  Please include the failure information in the changelog
> > > > > > of the next version of this patch.
> > > 
> > > Done.
> > > 
> > > > > > 
> > > > > > I prefer your original patch, that just uses "s", over the one below
> > > > > > that moves the rcu_exp_gp_seq_end().  The big advantage of your original
> > > > > > patch is that it allow more concurrency between a consecutive pair of
> > > > > > expedited RCU grace periods.  Plus it would not be easy to convince
> > > > > > myself that moving rcu_exp_gp_seq_end() down is safe, so your original
> > > > > > is also conceptually simpler with a more manageable state space.
> > > 
> > > The reason for highlighting the alternate approach of doing gp end inside
> > > exp_wake_mutex is the requirement of 3 wqs. Now, this is a theoretical case;
> > > please correct me if I am wrong here:
> > > 
> > > 1. task0 holds exp_wake_mutex, and is preempted.
> > 
> > Presumably after it has awakened the kthread that initiated the prior
> > expedited grace period (the one with seq number = -4).
> > 
> > > 2. task1 initiates new GP (current seq number = 0).
> > 
> > Yes, this can happen.
> > 
> > > 3. task1 queues worker kworker1 and schedules out.
> > 
> > And thus still holds .exp_mutex, but yes.
> > 
> > > 4. kworker1 sets exp GP to 1 and waits on exp_wake_mutex
> > 
> > And thus cannot yet have awakened task1.
> > 
> > > 5. task1 releases exp mutex, w/o entering waitq.
> > 
> > So I do not believe that we can get to #5.  What am I missing here?
> > 
> 
> As mentioned in this patch, task1 could have scheduled out after queuing
> work:
> 
> queue_work(rcu_gp_wq, &rew.rew_work)
>            wake_up_worker()
>              schedule()
> 
> kworker1 runs and picks up this queued work, and sets exp GP to 1 and waits
> on exp_wake_mutex.
> 
> task1 gets scheduled in and checks sync_exp_work_done(rsp, s), which return
> true and it does not enter wait queue and releases exp_mutex.
> 
> wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>          sync_exp_work_done(rsp, s));

Well, I have certainly given enough people a hard time about missing the
didn't-actually-sleep case, so good show on finding one in my code!  ;-)

Which also explains why deferring the rcu_exp_gp_seq_end() is safe:
The .exp_mutex won't be released until after it happens, and the
next manipulation of the sequence number cannot happen until after
.exp_mutex is next acquired.

Good catch!  And keep up the good work!!!

							Thanx, Paul

> Thanks
> Neeraj
> 
> > > 6. task2 initiates new GP (current seq number = 1).
> > > 7. task2 queues worker kworker1 and schedules out.
> > > 8. kworker 2 sets exp GP to 2 and waits on exp_wake_mutex.
> > > ...
> > > 
> > > This sequence would break the requirement of max 3 wqs.
> > > 
> > > If we hold the exp_wake_mutex before exp seq end, there will be atmost three
> > > pending GPs - 1. GP which current owner is doing wakeups for,
> > > 2. GP which the waiter of exp_wake_mutex would complete, 3. Next GP
> > > waiters, which started after GP @ point 2 had started. This also is inline
> > > with the statement in [1]:
> > > 
> > > "he key point is that the ->exp_mutex is not released until
> > > the first wakeup is complete, which means that the ->exp_wake_mutex
> > > has already been acquired at that point"
> > > 
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
> > 
> > And I believe that this still holds.  The task holding .exp_mutex cannot
> > release it until it has been awakened, and it won't be awakened until after
> > the .exp_wake_mutex has been acquired.  Again, what am I missing here?
> > 
> > > > > > Please also add the WARN_ON(), though at first glance your change seems
> > > > > > to have lost the wakeup.  (But it is early, so maybe it is just that I
> > > > > > am not yet fully awake.)
> > > > > 
> > > > > My bad, I posted incomplete diff in previous mail:
> > > > > 
> > > > >    static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
> > > > >    {
> > > > >    	struct rcu_node *rnp;
> > > > > +	unsigned long exp_low;
> > > > > +	unsigned long s_low = rcu_seq_ctr(s) & 0x3;
> > > > > 
> > > > >    	synchronize_sched_expedited_wait(rsp);
> > > > >    	rcu_exp_gp_seq_end(rsp);
> > > > > @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > > > unsigned long s)
> > > > >    			spin_unlock(&rnp->exp_lock);
> > > > >    		}
> > > > >    		smp_mb(); /* All above changes before wakeup. */
> > > > > -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > > > +		exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
> > > > > +		WARN_ON(s_low != exp_low);
> > > > > +		wake_up_all(&rnp->exp_wq[exp_low]);
> > > > 
> > > > Much better!
> > > > 
> > > > But I suggest using s_low in the wake_up_all.  This hunk is of course
> > > > only for testing purposes, not for actual inclusion.  (My earlier email
> > > > didn't make that clear.)
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > 
> > > Ok sure, got it. I will share the results, once the issue is reproduced.
> > 
> > Sounds good!
> > 
> > 							Thanx, Paul
> > 
> > > Thanks
> > > Neeraj
> > > 
> > > > >    	}
> > > > > 
> > > > > 
> > > > > Thanks
> > > > > Neeraj
> > > > > 
> > > > > > 
> > > > > > 							Thanx, Paul
> > > > > > 
> > > > > > > I was thinking of another way of addressing this problem: Doing exp seq end
> > > > > > > inside exp_wake_mutex. This will also ensure that, if we extend the current
> > > > > > > scenario and there are multiple expedited GP sequence, which have completed,
> > > > > > > before exp_wake_mutex is held, we need to preserve the requirement of 3 wq
> > > > > > > entries [2].
> > > > > > > 
> > > > > > > 
> > > > > > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
> > > > > > > 
> > > > > > > 
> > > > > > > @@ -595,8 +595,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > > > > > unsigned long s)
> > > > > > >            struct rcu_node *rnp;
> > > > > > > 
> > > > > > >            synchronize_sched_expedited_wait(rsp);
> > > > > > > -       rcu_exp_gp_seq_end(rsp);
> > > > > > > -       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
> > > > > > > 
> > > > > > >            /*
> > > > > > >             * Switch over to wakeup mode, allowing the next GP, but -only- the
> > > > > > > @@ -604,6 +602,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > > > > > unsigned long s)
> > > > > > >             */
> > > > > > >            mutex_lock(&rsp->exp_wake_mutex);
> > > > > > > 
> > > > > > > +       rcu_exp_gp_seq_end(rsp);
> > > > > > > +       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
> > > > > > > +
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > (Using a WARN_ON() to check for the lower bits of the counter portion
> > > > > > > > of rcu_state.expedited_sequence differing from the same bits of s
> > > > > > > > would be one way to detect this problem.)
> > > > > > > > 
> > > > > > > > 							Thanx, Paul
> > > > > > > > 
> > > > > > > 
> > > > > > > I have also given the patch for this, for testing:
> > > > > > > 
> > > > > > >     static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
> > > > > > >     {
> > > > > > >            struct rcu_node *rnp;
> > > > > > > +       unsigned long exp_low;
> > > > > > > +       unsigned long s_low = rcu_seq_ctr(s) & 0x3;
> > > > > > > 
> > > > > > >            synchronize_sched_expedited_wait(rsp);
> > > > > > >            rcu_exp_gp_seq_end(rsp);
> > > > > > > @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > > > > > unsigned long s)
> > > > > > >                            spin_unlock(&rnp->exp_lock);
> > > > > > >                    }
> > > > > > >                    smp_mb(); /* All above changes before wakeup. */
> > > > > > > - wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > > > > > +               exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
> > > > > > > +               WARN_ON(s_low != exp_low);
> > > > > > > +
> > > > > > > 
> > > > > > > Thanks
> > > > > > > Neeraj
> > > > > > > 
> > > > > > > > > Fix this by using the correct sequence number for wake_up_all() inside
> > > > > > > > > rcu_exp_wait_wake().
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> > > > > > > > > ---
> > > > > > > > >      kernel/rcu/tree_exp.h | 2 +-
> > > > > > > > >      1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > > > > > > > index e4b77d3..28979d3 100644
> > > > > > > > > --- a/kernel/rcu/tree_exp.h
> > > > > > > > > +++ b/kernel/rcu/tree_exp.h
> > > > > > > > > @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
> > > > > > > > >      			spin_unlock(&rnp->exp_lock);
> > > > > > > > >      		}
> > > > > > > > >      		smp_mb(); /* All above changes before wakeup. */
> > > > > > > > > -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> > > > > > > > > +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
> > > > > > > > >      	}
> > > > > > > > >      	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
> > > > > > > > >      	mutex_unlock(&rcu_state.exp_wake_mutex);
> > > > > > > > > -- 
> > > > > > > > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> > > > > > > > > member of the Code Aurora Forum, hosted by The Linux Foundation
> > > > > > > > > 
> > > > > > > 
> > > > > > > -- 
> > > > > > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> > > > > > > the Code Aurora Forum, hosted by The Linux Foundation
> > > > > 
> > > > > -- 
> > > > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> > > > > the Code Aurora Forum, hosted by The Linux Foundation
> > > 
> > > -- 
> > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> > > the Code Aurora Forum, hosted by The Linux Foundation
> 
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
  2019-11-19 15:09                 ` Paul E. McKenney
@ 2019-11-19 20:06                   ` Paul E. McKenney
  2019-11-20  5:09                     ` Neeraj Upadhyay
       [not found]                     ` <eeb7de82-c24b-c914-05fb-683088c9f8e2@codeaurora.org>
  0 siblings, 2 replies; 16+ messages in thread
From: Paul E. McKenney @ 2019-11-19 20:06 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli

On Tue, Nov 19, 2019 at 07:09:31AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 19, 2019 at 07:03:14AM +0000, Neeraj Upadhyay wrote:
> > Hi Paul,
> > 
> > On 11/19/2019 9:35 AM, Paul E. McKenney wrote:
> > > On Tue, Nov 19, 2019 at 03:35:15AM +0000, Neeraj Upadhyay wrote:
> > > > Hi Paul,
> > > > 
> > > > On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
> > > > > On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
> > > > > > Hi Paul,
> > > > > > 
> > > > > > 
> > > > > > On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
> > > > > > > On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
> > > > > > > > Hi Paul,
> > > > > > > > 
> > > > > > > > On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
> > > > > > > > > On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
> > > > > > > > > > For the tasks waiting in exp_wq inside exp_funnel_lock(),
> > > > > > > > > > there is a chance that they might be indefinitely blocked
> > > > > > > > > > in below scenario:
> > > > > > > > > > 
> > > > > > > > > > 1. There is a task waiting on exp sequence 0b'100' inside
> > > > > > > > > >        exp_funnel_lock().
> > > > > > > > > > 
> > > > > > > > > >        _synchronize_rcu_expedited()
> > > > > > > > > 
> > > > > > > > > This symbol went away a few versions back, but let's see how this
> > > > > > > > > plays out in current -rcu.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Sorry; for us this problem is observed on 4.19 stable version; I had
> > > > > > > > checked against the -rcu code, and the relevant portions were present
> > > > > > > > there.
> > > > > > > > 
> > > > > > > > > >          s = 0b'100
> > > > > > > > > >          exp_funnel_lock()
> > > > > > > > > >            wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
> > > > > > > > > 
> > > > > > > > > All of the above could still happen if the expedited grace
> > > > > > > > > period number was zero (or a bit less) when that task invoked
> > > > > > > > 
> > > > > > > > Yes
> > > > > > > > 
> > > > > > > > > synchronize_rcu_expedited().  What is the relation, if any,
> > > > > > > > > between this task and "task1" below?  Seems like you want them to
> > > > > > > > > be different tasks.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > This task is the one which is waiting for the expedited sequence, which
> > > > > > > > "task1" completes ("task1" holds the exp_mutex for it). "task1" would
> > > > > > > > wake up this task, on exp GP completion.
> > > > > > > > 
> > > > > > > > > Does this task actually block, or is it just getting ready
> > > > > > > > > to block?  Seems like you need it to have actually blocked.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Yes, it actually blocked in wait queue.
> > > > > > > > 
> > > > > > > > > > 2. The Exp GP completes and task (task1) holding exp_mutex queues
> > > > > > > > > >        worker and schedules out.
> > > > > > > > > 
> > > > > > > > > "The Exp GP" being the one that was initiated when the .expedited_sequence
> > > > > > > > > counter was zero, correct?  (Looks that way below.)
> > > > > > > > > 
> > > > > > > > Yes, correct.
> > > > > > > > 
> > > > > > > > > >        _synchronize_rcu_expedited()
> > > > > > > > > >          s = 0b'100
> > > > > > > > > >          queue_work(rcu_gp_wq, &rew.rew_work)
> > > > > > > > > >            wake_up_worker()
> > > > > > > > > >              schedule()
> > > > > > > > > > 
> > > > > > > > > > 3. kworker A picks up the queued work and completes the exp gp
> > > > > > > > > >        sequence.
> > > > > > > > > > 
> > > > > > > > > >        rcu_exp_wait_wake()
> > > > > > > > > >          rcu_exp_wait_wake()
> > > > > > > > > >            rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
> > > > > > > > > >                                    // to 0b'100'
> > > > > > > > > > 
> > > > > > > > > > 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
> > > > > > > > > >        and releases exp_mutex.
> > > > > > > > > > 
> > > > > > > > > >        wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > > > > > > > > >          sync_exp_work_done(rsp, s));
> > > > > > > > > >        mutex_unlock(&rsp->exp_mutex);
> > > > > > > > > 
> > > > > > > > > So task1 is the one that initiated the expedited grace period that
> > > > > > > > > started when .expedited_sequence was zero, right?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Yes, right.
> > > > > > > > 
> > > > > > > > > > 5. Next exp GP completes, and sequence number is incremented:
> > > > > > > > > > 
> > > > > > > > > >        rcu_exp_wait_wake()
> > > > > > > > > >          rcu_exp_wait_wake()
> > > > > > > > > >            rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
> > > > > > > > > > 
> > > > > > > > > > 6. As kworker A uses current expedited_sequence, it wakes up workers
> > > > > > > > > >        from wrong wait queue index - it should have worken wait queue
> > > > > > > > > >        corresponding to 0b'100' sequence, but wakes up the ones for
> > > > > > > > > >        0b'200' sequence. This results in task at step 1 indefinitely blocked.
> > > > > > > > > > 
> > > > > > > > > >        rcu_exp_wait_wake()
> > > > > > > > > >          wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > > > > > > > 
> > > > > > > > > So the issue is that the next expedited RCU grace period might
> > > > > > > > > have completed before the completion of the wakeups for the previous
> > > > > > > > > expedited RCU grace period, correct?  Then expedited grace periods have
> > > > > > > > 
> > > > > > > > Yes. Actually from the ftraces, I saw that next expedited RCU grace
> > > > > > > > period completed while kworker A was in D state, while waiting for
> > > > > > > > exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
> > > > > > > > its wake_up_all() call; so, task (point 1) was never woken up, as it was
> > > > > > > > waiting on wq index 1.
> > > > > > > > 
> > > > > > > > > to have stopped to prevent any future wakeup from happening, correct?
> > > > > > > > > (Which would make it harder for rcutorture to trigger this, though it
> > > > > > > > > really does have code that attempts to trigger this sort of thing.)
> > > > > > > > > 
> > > > > > > > > Is this theoretical in nature, or have you actually triggered it?
> > > > > > > > > If actually triggered, what did you do to make this happen?
> > > > > > > > 
> > > > > > > > This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
> > > > > > > > another instance in Nov 2018 (on 4.14 kernel), in our customer reported
> > > > > > > > issues. Both instances were in downstream drivers and we didn't have RCU
> > > > > > > > traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
> > > > > > > > enabled, where it was observed in suspend scenario, where we are observing
> > > > > > > > "DPM device timeout" [1], as scsi device is stuck in
> > > > > > > > _synchronize_rcu_expedited().
> > > > > > > > 
> > > > > > > > schedule+0x70/0x90
> > > > > > > > _synchronize_rcu_expedited+0x590/0x5f8
> > > > > > > > synchronize_rcu+0x50/0xa0
> > > > > > > > scsi_device_quiesce+0x50/0x120
> > > > > > > > scsi_bus_suspend+0x70/0xe8
> > > > > > > > dpm_run_callback+0x148/0x388
> > > > > > > > __device_suspend+0x430/0x8a8
> > > > > > > > 
> > > > > > > > [1]
> > > > > > > > https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
> > > > > > > > 
> > > > > > > > > What have you done to test the change?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > I have given this for testing; will share the results . Current analysis
> > > > > > > > and patch is based on going through ftrace and code review.
> > > > > > > 
> > > > > > > OK, very good.  Please include the failure information in the changelog
> > > > > > > of the next version of this patch.
> > > > 
> > > > Done.
> > > > 
> > > > > > > 
> > > > > > > I prefer your original patch, that just uses "s", over the one below
> > > > > > > that moves the rcu_exp_gp_seq_end().  The big advantage of your original
> > > > > > > patch is that it allow more concurrency between a consecutive pair of
> > > > > > > expedited RCU grace periods.  Plus it would not be easy to convince
> > > > > > > myself that moving rcu_exp_gp_seq_end() down is safe, so your original
> > > > > > > is also conceptually simpler with a more manageable state space.
> > > > 
> > > > The reason for highlighting the alternate approach of doing gp end inside
> > > > exp_wake_mutex is the requirement of 3 wqs. Now, this is a theoretical case;
> > > > please correct me if I am wrong here:
> > > > 
> > > > 1. task0 holds exp_wake_mutex, and is preempted.
> > > 
> > > Presumably after it has awakened the kthread that initiated the prior
> > > expedited grace period (the one with seq number = -4).
> > > 
> > > > 2. task1 initiates new GP (current seq number = 0).
> > > 
> > > Yes, this can happen.
> > > 
> > > > 3. task1 queues worker kworker1 and schedules out.
> > > 
> > > And thus still holds .exp_mutex, but yes.
> > > 
> > > > 4. kworker1 sets exp GP to 1 and waits on exp_wake_mutex
> > > 
> > > And thus cannot yet have awakened task1.
> > > 
> > > > 5. task1 releases exp mutex, w/o entering waitq.
> > > 
> > > So I do not believe that we can get to #5.  What am I missing here?
> > > 
> > 
> > As mentioned in this patch, task1 could have scheduled out after queuing
> > work:
> > 
> > queue_work(rcu_gp_wq, &rew.rew_work)
> >            wake_up_worker()
> >              schedule()
> > 
> > kworker1 runs and picks up this queued work, and sets exp GP to 1 and waits
> > on exp_wake_mutex.
> > 
> > task1 gets scheduled in and checks sync_exp_work_done(rsp, s), which return
> > true and it does not enter wait queue and releases exp_mutex.
> > 
> > wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> >          sync_exp_work_done(rsp, s));
> 
> Well, I have certainly given enough people a hard time about missing the
> didn't-actually-sleep case, so good show on finding one in my code!  ;-)
> 
> Which also explains why deferring the rcu_exp_gp_seq_end() is safe:
> The .exp_mutex won't be released until after it happens, and the
> next manipulation of the sequence number cannot happen until after
> .exp_mutex is next acquired.
> 
> Good catch!  And keep up the good work!!!

And here is the commit corresponding to your earlier patch.  Please let
me know of any needed adjustments.

							Thanx, Paul

------------------------------------------------------------------------

commit 3ec440b52831eea172061c5db3d2990b22904863
Author: Neeraj Upadhyay <neeraju@codeaurora.org>
Date:   Tue Nov 19 11:50:52 2019 -0800

    rcu: Allow only one expedited GP to run concurrently with wakeups
    
    The current expedited RCU grace-period code expects that a task
    requesting an expedited grace period cannot awaken until that grace
    period has reached the wakeup phase.  However, it is possible for a long
    preemption to result in the waiting task never sleeping.  For example,
    consider the following sequence of events:
    
    1.      Task A starts an expedited grace period by invoking
            synchronize_rcu_expedited().  It proceeds normally up to the
            wait_event() near the end of that function, and is then preempted
            (or interrupted or whatever).
    
    2.      The expedited grace period completes, and a kworker task starts
            the awaken phase, having incremented the counter and acquired
            the rcu_state structure's .exp_wake_mutex.  This kworker task
            is then preempted or interrupted or whatever.
    
    3.      Task A resumes and enters wait_event(), which notes that the
            expedited grace period has completed, and thus doesn't sleep.
    
    4.      Task B starts an expedited grace period exactly as did Task A,
            complete with the preemption (or whatever delay) just before
            the call to wait_event().
    
    5.      The expedited grace period completes, and another kworker
            task starts the awaken phase, having incremented the counter.
            However, it blocks when attempting to acquire the rcu_state
            structure's .exp_wake_mutex because step 2's kworker task has
            not yet released it.
    
    6.      Steps 4 and 5 repeat, resulting in overflow of the rcu_node
            structure's ->exp_wq[] array.
    
    In theory, this is harmless.  Tasks waiting on the various ->exp_wq[]
    arrays will just be spuriously awakened, but they will then sleep again
    on noting that the rcu_state structure's ->expedited_sequence value has
    not advanced far enough.
    
    In practice, this wastes CPU time and is an accident waiting to happen.
    This commit therefore moves the rcu_exp_gp_seq_end() call that officially
    ends the expedited grace period (along with the associated tracing) to a
    point after the ->exp_wake_mutex has been acquired.  This prevents Task A
    from awakening prematurely, thus preventing more than one expedited grace
    period from being in flight during a previous expedited grace period's
    wakeup phase.
    
    Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
    [ paulmck: Added updated comment. ]
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 4433d00a..8840729 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -539,14 +539,13 @@ static void rcu_exp_wait_wake(unsigned long s)
 	struct rcu_node *rnp;
 
 	synchronize_sched_expedited_wait();
-	rcu_exp_gp_seq_end();
-	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
 
-	/*
-	 * Switch over to wakeup mode, allowing the next GP, but -only- the
-	 * next GP, to proceed.
-	 */
+	// Switch over to wakeup mode, allowing the next GP to proceed.
+	// End the previous grace period only after acquiring the mutex
+	// to ensure that only one GP runs concurrently with wakeups.
 	mutex_lock(&rcu_state.exp_wake_mutex);
+	rcu_exp_gp_seq_end();
+	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
 
 	rcu_for_each_node_breadth_first(rnp) {
 		if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s)) {
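
As an aside, here is a minimal stand-alone sketch (user-space C, not kernel
code) of the index arithmetic in question.  It assumes RCU_SEQ_CTR_SHIFT is 2,
so that rcu_seq_ctr() simply strips the two low-order grace-period-state bits,
and the sequence values 0x4 and 0x8 are just example snapshots standing in for
"one expedited GP" and "one more expedited GP".  It only illustrates how a
waiter's snapshot "s" and the live counter can select different ->exp_wq[]
slots, which is the mismatch being closed off here:

#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT 2	/* assumption: low two bits hold GP state */

static unsigned long rcu_seq_ctr(unsigned long s)
{
	return s >> RCU_SEQ_CTR_SHIFT;	/* strip the state bits */
}

int main(void)
{
	unsigned long s = 0x4;			/* waiter's snapshot           */
	unsigned long expedited_sequence = 0x8;	/* live counter one GP later   */

	/* The waiter sleeps on slot 1 ... */
	printf("waiter slot: %lu\n", rcu_seq_ctr(s) & 0x3);
	/* ... but a wakeup keyed off the live counter targets slot 2. */
	printf("wakeup slot: %lu\n", rcu_seq_ctr(expedited_sequence) & 0x3);
	return 0;
}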

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
  2019-11-19 20:06                   ` Paul E. McKenney
@ 2019-11-20  5:09                     ` Neeraj Upadhyay
       [not found]                     ` <eeb7de82-c24b-c914-05fb-683088c9f8e2@codeaurora.org>
  1 sibling, 0 replies; 16+ messages in thread
From: Neeraj Upadhyay @ 2019-11-20  5:09 UTC (permalink / raw)
  To: paulmck
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli



On 11/20/2019 1:36 AM, Paul E. McKenney wrote:
> On Tue, Nov 19, 2019 at 07:09:31AM -0800, Paul E. McKenney wrote:
>> On Tue, Nov 19, 2019 at 07:03:14AM +0000, Neeraj Upadhyay wrote:
>>> Hi Paul,
>>>
>>> On 11/19/2019 9:35 AM, Paul E. McKenney wrote:
>>>> On Tue, Nov 19, 2019 at 03:35:15AM +0000, Neeraj Upadhyay wrote:
>>>>> Hi Paul,
>>>>>
>>>>> On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
>>>>>> On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>>
>>>>>>> On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
>>>>>>>> On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>> On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
>>>>>>>>>> On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
>>>>>>>>>>> For the tasks waiting in exp_wq inside exp_funnel_lock(),
>>>>>>>>>>> there is a chance that they might be indefinitely blocked
>>>>>>>>>>> in below scenario:
>>>>>>>>>>>
>>>>>>>>>>> 1. There is a task waiting on exp sequence 0b'100' inside
>>>>>>>>>>>         exp_funnel_lock().
>>>>>>>>>>>
>>>>>>>>>>>         _synchronize_rcu_expedited()
>>>>>>>>>>
>>>>>>>>>> This symbol went away a few versions back, but let's see how this
>>>>>>>>>> plays out in current -rcu.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sorry; for us this problem is observed on 4.19 stable version; I had
>>>>>>>>> checked against the -rcu code, and the relevant portions were present
>>>>>>>>> there.
>>>>>>>>>
>>>>>>>>>>>           s = 0b'100
>>>>>>>>>>>           exp_funnel_lock()
>>>>>>>>>>>             wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
>>>>>>>>>>
>>>>>>>>>> All of the above could still happen if the expedited grace
>>>>>>>>>> period number was zero (or a bit less) when that task invoked
>>>>>>>>>
>>>>>>>>> Yes
>>>>>>>>>
>>>>>>>>>> synchronize_rcu_expedited().  What is the relation, if any,
>>>>>>>>>> between this task and "task1" below?  Seems like you want them to
>>>>>>>>>> be different tasks.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This task is the one which is waiting for the expedited sequence, which
>>>>>>>>> "task1" completes ("task1" holds the exp_mutex for it). "task1" would
>>>>>>>>> wake up this task, on exp GP completion.
>>>>>>>>>
>>>>>>>>>> Does this task actually block, or is it just getting ready
>>>>>>>>>> to block?  Seems like you need it to have actually blocked.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, it actually blocked in wait queue.
>>>>>>>>>
>>>>>>>>>>> 2. The Exp GP completes and task (task1) holding exp_mutex queues
>>>>>>>>>>>         worker and schedules out.
>>>>>>>>>>
>>>>>>>>>> "The Exp GP" being the one that was initiated when the .expedited_sequence
>>>>>>>>>> counter was zero, correct?  (Looks that way below.)
>>>>>>>>>>
>>>>>>>>> Yes, correct.
>>>>>>>>>
>>>>>>>>>>>         _synchronize_rcu_expedited()
>>>>>>>>>>>           s = 0b'100
>>>>>>>>>>>           queue_work(rcu_gp_wq, &rew.rew_work)
>>>>>>>>>>>             wake_up_worker()
>>>>>>>>>>>               schedule()
>>>>>>>>>>>
>>>>>>>>>>> 3. kworker A picks up the queued work and completes the exp gp
>>>>>>>>>>>         sequence.
>>>>>>>>>>>
>>>>>>>>>>>         rcu_exp_wait_wake()
>>>>>>>>>>>           rcu_exp_wait_wake()
>>>>>>>>>>>             rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>>>>>>>>>>>                                     // to 0b'100'
>>>>>>>>>>>
>>>>>>>>>>> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>>>>>>>>>>>         and releases exp_mutex.
>>>>>>>>>>>
>>>>>>>>>>>         wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>>>>>>>>>>           sync_exp_work_done(rsp, s));
>>>>>>>>>>>         mutex_unlock(&rsp->exp_mutex);
>>>>>>>>>>
>>>>>>>>>> So task1 is the one that initiated the expedited grace period that
>>>>>>>>>> started when .expedited_sequence was zero, right?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, right.
>>>>>>>>>
>>>>>>>>>>> 5. Next exp GP completes, and sequence number is incremented:
>>>>>>>>>>>
>>>>>>>>>>>         rcu_exp_wait_wake()
>>>>>>>>>>>           rcu_exp_wait_wake()
>>>>>>>>>>>             rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
>>>>>>>>>>>
>>>>>>>>>>> 6. As kworker A uses current expedited_sequence, it wakes up workers
>>>>>>>>>>>         from wrong wait queue index - it should have worken wait queue
>>>>>>>>>>>         corresponding to 0b'100' sequence, but wakes up the ones for
>>>>>>>>>>>         0b'200' sequence. This results in task at step 1 indefinitely blocked.
>>>>>>>>>>>
>>>>>>>>>>>         rcu_exp_wait_wake()
>>>>>>>>>>>           wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>>>>>>>>
>>>>>>>>>> So the issue is that the next expedited RCU grace period might
>>>>>>>>>> have completed before the completion of the wakeups for the previous
>>>>>>>>>> expedited RCU grace period, correct?  Then expedited grace periods have
>>>>>>>>>
>>>>>>>>> Yes. Actually from the ftraces, I saw that next expedited RCU grace
>>>>>>>>> period completed while kworker A was in D state, while waiting for
>>>>>>>>> exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
>>>>>>>>> its wake_up_all() call; so, task (point 1) was never woken up, as it was
>>>>>>>>> waiting on wq index 1.
>>>>>>>>>
>>>>>>>>>> to have stopped to prevent any future wakeup from happening, correct?
>>>>>>>>>> (Which would make it harder for rcutorture to trigger this, though it
>>>>>>>>>> really does have code that attempts to trigger this sort of thing.)
>>>>>>>>>>
>>>>>>>>>> Is this theoretical in nature, or have you actually triggered it?
>>>>>>>>>> If actually triggered, what did you do to make this happen?
>>>>>>>>>
>>>>>>>>> This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
>>>>>>>>> another instance in Nov 2018 (on 4.14 kernel), in our customer reported
>>>>>>>>> issues. Both instances were in downstream drivers and we didn't have RCU
>>>>>>>>> traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
>>>>>>>>> enabled, where it was observed in suspend scenario, where we are observing
>>>>>>>>> "DPM device timeout" [1], as scsi device is stuck in
>>>>>>>>> _synchronize_rcu_expedited().
>>>>>>>>>
>>>>>>>>> schedule+0x70/0x90
>>>>>>>>> _synchronize_rcu_expedited+0x590/0x5f8
>>>>>>>>> synchronize_rcu+0x50/0xa0
>>>>>>>>> scsi_device_quiesce+0x50/0x120
>>>>>>>>> scsi_bus_suspend+0x70/0xe8
>>>>>>>>> dpm_run_callback+0x148/0x388
>>>>>>>>> __device_suspend+0x430/0x8a8
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
>>>>>>>>>
>>>>>>>>>> What have you done to test the change?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have given this for testing; will share the results . Current analysis
>>>>>>>>> and patch is based on going through ftrace and code review.
>>>>>>>>
>>>>>>>> OK, very good.  Please include the failure information in the changelog
>>>>>>>> of the next version of this patch.
>>>>>
>>>>> Done.
>>>>>
>>>>>>>>
>>>>>>>> I prefer your original patch, that just uses "s", over the one below
>>>>>>>> that moves the rcu_exp_gp_seq_end().  The big advantage of your original
>>>>>>>> patch is that it allow more concurrency between a consecutive pair of
>>>>>>>> expedited RCU grace periods.  Plus it would not be easy to convince
>>>>>>>> myself that moving rcu_exp_gp_seq_end() down is safe, so your original
>>>>>>>> is also conceptually simpler with a more manageable state space.
>>>>>
>>>>> The reason for highlighting the alternate approach of doing gp end inside
>>>>> exp_wake_mutex is the requirement of 3 wqs. Now, this is a theoretical case;
>>>>> please correct me if I am wrong here:
>>>>>
>>>>> 1. task0 holds exp_wake_mutex, and is preempted.
>>>>
>>>> Presumably after it has awakened the kthread that initiated the prior
>>>> expedited grace period (the one with seq number = -4).
>>>>
>>>>> 2. task1 initiates new GP (current seq number = 0).
>>>>
>>>> Yes, this can happen.
>>>>
>>>>> 3. task1 queues worker kworker1 and schedules out.
>>>>
>>>> And thus still holds .exp_mutex, but yes.
>>>>
>>>>> 4. kworker1 sets exp GP to 1 and waits on exp_wake_mutex
>>>>
>>>> And thus cannot yet have awakened task1.
>>>>
>>>>> 5. task1 releases exp mutex, w/o entering waitq.
>>>>
>>>> So I do not believe that we can get to #5.  What am I missing here?
>>>>
>>>
>>> As mentioned in this patch, task1 could have scheduled out after queuing
>>> work:
>>>
>>> queue_work(rcu_gp_wq, &rew.rew_work)
>>>             wake_up_worker()
>>>               schedule()
>>>
>>> kworker1 runs and picks up this queued work, and sets exp GP to 1 and waits
>>> on exp_wake_mutex.
>>>
>>> task1 gets scheduled in and checks sync_exp_work_done(rsp, s), which return
>>> true and it does not enter wait queue and releases exp_mutex.
>>>
>>> wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>>           sync_exp_work_done(rsp, s));
>>
>> Well, I have certainly given enough people a hard time about missing the
>> didn't-actually-sleep case, so good show on finding one in my code!  ;-)
>>
>> Which also explains why deferring the rcu_exp_gp_seq_end() is safe:
>> The .exp_mutex won't be released until after it happens, and the
>> next manipulation of the sequence number cannot happen until after
>> .exp_mutex is next acquired.
>>
>> Good catch!  And keep up the good work!!!
> 
> And here is the commit corresponding to your earlier patch.  Please let
> me know of any needed adjustments.
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit 3ec440b52831eea172061c5db3d2990b22904863
> Author: Neeraj Upadhyay <neeraju@codeaurora.org>
> Date:   Tue Nov 19 11:50:52 2019 -0800
> 
>      rcu: Allow only one expedited GP to run concurrently with wakeups
>      
>      The current expedited RCU grace-period code expects that a task
>      requesting an expedited grace period cannot awaken until that grace
>      period has reached the wakeup phase.  However, it is possible for a long
>      preemption to result in the waiting task never sleeping.  For example,
>      consider the following sequence of events:
>      
>      1.      Task A starts an expedited grace period by invoking
>              synchronize_rcu_expedited().  It proceeds normally up to the
>              wait_event() near the end of that function, and is then preempted
>              (or interrupted or whatever).
>      
>      2.      The expedited grace period completes, and a kworker task starts
>              the awaken phase, having incremented the counter and acquired
>              the rcu_state structure's .exp_wake_mutex.  This kworker task
>              is then preempted or interrupted or whatever.
>      
>      3.      Task A resumes and enters wait_event(), which notes that the
>              expedited grace period has completed, and thus doesn't sleep.
>      
>      4.      Task B starts an expedited grace period exactly as did Task A,
>              complete with the preemption (or whatever delay) just before
>              the call to wait_event().
>      
>      5.      The expedited grace period completes, and another kworker
>              task starts the awaken phase, having incremented the counter.
>              However, it blocks when attempting to acquire the rcu_state
>              structure's .exp_wake_mutex because step 2's kworker task has
>              not yet released it.
>      
>      6.      Steps 4 and 5 repeat, resulting in overflow of the rcu_node
>              structure's ->exp_wq[] array.
>      
>      In theory, this is harmless.  Tasks waiting on the various ->exp_wq[]
>      array will just be spuriously awakened, but they will just sleep again
>      on noting that the rcu_state structure's ->expedited_sequence value has
>      not advanced far enough.
>      
>      In practice, this wastes CPU time and is an accident waiting to happen.
>      This commit therefore moves the rcu_exp_gp_seq_end() call that officially
>      ends the expedited grace period (along with associate tracing) until
>      after the ->exp_wake_mutex has been acquired.  This prevents Task A from
>      awakening prematurely, thus preventing more than one expedited grace
>      period from being in flight during a previous expedited grace period's
>      wakeup phase.
>      

I am not sure if a "Fixes:" tag is required for it.

>      Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>      [ paulmck: Added updated comment. ]
>      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 4433d00a..8840729 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -539,14 +539,13 @@ static void rcu_exp_wait_wake(unsigned long s)
>   	struct rcu_node *rnp;
>   
>   	synchronize_sched_expedited_wait();
> -	rcu_exp_gp_seq_end();
> -	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
>   
> -	/*
> -	 * Switch over to wakeup mode, allowing the next GP, but -only- the
> -	 * next GP, to proceed.
> -	 */
> +	// Switch over to wakeup mode, allowing the next GP to proceed.
> +	// End the previous grace period only after acquiring the mutex
> +	// to ensure that only one GP runs concurrently with wakeups.
Should comment style be changed to below?

/* Switch over to wakeup mode, allowing the next GP to proceed.
  * End the previous grace period only after acquiring the mutex
  * to ensure that only one GP runs concurrently with wakeups.
  */


>   	mutex_lock(&rcu_state.exp_wake_mutex);
> +	rcu_exp_gp_seq_end();
> +	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
>   
>   	rcu_for_each_node_breadth_first(rnp) {
>   		if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s)) {
> 

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
       [not found]                     ` <eeb7de82-c24b-c914-05fb-683088c9f8e2@codeaurora.org>
@ 2019-11-21  4:10                       ` Paul E. McKenney
  2019-11-21  4:36                         ` Neeraj Upadhyay
       [not found]                         ` <71de1f25-75c5-a4bb-49d9-19af5f8ae4fe@codeaurora.org>
  0 siblings, 2 replies; 16+ messages in thread
From: Paul E. McKenney @ 2019-11-21  4:10 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli

On Wed, Nov 20, 2019 at 10:38:56AM +0530, Neeraj Upadhyay wrote:
> 
> 
> On 11/20/2019 1:36 AM, Paul E. McKenney wrote:
> > On Tue, Nov 19, 2019 at 07:09:31AM -0800, Paul E. McKenney wrote:
> > > On Tue, Nov 19, 2019 at 07:03:14AM +0000, Neeraj Upadhyay wrote:
> > > > Hi Paul,
> > > > 
> > > > On 11/19/2019 9:35 AM, Paul E. McKenney wrote:
> > > > > On Tue, Nov 19, 2019 at 03:35:15AM +0000, Neeraj Upadhyay wrote:
> > > > > > Hi Paul,
> > > > > > 
> > > > > > On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
> > > > > > > On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
> > > > > > > > Hi Paul,
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
> > > > > > > > > On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
> > > > > > > > > > Hi Paul,
> > > > > > > > > > 
> > > > > > > > > > On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
> > > > > > > > > > > On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
> > > > > > > > > > > > For the tasks waiting in exp_wq inside exp_funnel_lock(),
> > > > > > > > > > > > there is a chance that they might be indefinitely blocked
> > > > > > > > > > > > in below scenario:
> > > > > > > > > > > > 
> > > > > > > > > > > > 1. There is a task waiting on exp sequence 0b'100' inside
> > > > > > > > > > > >         exp_funnel_lock().
> > > > > > > > > > > > 
> > > > > > > > > > > >         _synchronize_rcu_expedited()
> > > > > > > > > > > 
> > > > > > > > > > > This symbol went away a few versions back, but let's see how this
> > > > > > > > > > > plays out in current -rcu.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Sorry; for us this problem is observed on 4.19 stable version; I had
> > > > > > > > > > checked against the -rcu code, and the relevant portions were present
> > > > > > > > > > there.
> > > > > > > > > > 
> > > > > > > > > > > >           s = 0b'100
> > > > > > > > > > > >           exp_funnel_lock()
> > > > > > > > > > > >             wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
> > > > > > > > > > > 
> > > > > > > > > > > All of the above could still happen if the expedited grace
> > > > > > > > > > > period number was zero (or a bit less) when that task invoked
> > > > > > > > > > 
> > > > > > > > > > Yes
> > > > > > > > > > 
> > > > > > > > > > > synchronize_rcu_expedited().  What is the relation, if any,
> > > > > > > > > > > between this task and "task1" below?  Seems like you want them to
> > > > > > > > > > > be different tasks.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > This task is the one which is waiting for the expedited sequence, which
> > > > > > > > > > "task1" completes ("task1" holds the exp_mutex for it). "task1" would
> > > > > > > > > > wake up this task, on exp GP completion.
> > > > > > > > > > 
> > > > > > > > > > > Does this task actually block, or is it just getting ready
> > > > > > > > > > > to block?  Seems like you need it to have actually blocked.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Yes, it actually blocked in wait queue.
> > > > > > > > > > 
> > > > > > > > > > > > 2. The Exp GP completes and task (task1) holding exp_mutex queues
> > > > > > > > > > > >         worker and schedules out.
> > > > > > > > > > > 
> > > > > > > > > > > "The Exp GP" being the one that was initiated when the .expedited_sequence
> > > > > > > > > > > counter was zero, correct?  (Looks that way below.)
> > > > > > > > > > > 
> > > > > > > > > > Yes, correct.
> > > > > > > > > > 
> > > > > > > > > > > >         _synchronize_rcu_expedited()
> > > > > > > > > > > >           s = 0b'100
> > > > > > > > > > > >           queue_work(rcu_gp_wq, &rew.rew_work)
> > > > > > > > > > > >             wake_up_worker()
> > > > > > > > > > > >               schedule()
> > > > > > > > > > > > 
> > > > > > > > > > > > 3. kworker A picks up the queued work and completes the exp gp
> > > > > > > > > > > >         sequence.
> > > > > > > > > > > > 
> > > > > > > > > > > >         rcu_exp_wait_wake()
> > > > > > > > > > > >           rcu_exp_wait_wake()
> > > > > > > > > > > >             rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
> > > > > > > > > > > >                                     // to 0b'100'
> > > > > > > > > > > > 
> > > > > > > > > > > > 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
> > > > > > > > > > > >         and releases exp_mutex.
> > > > > > > > > > > > 
> > > > > > > > > > > >         wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > > > > > > > > > > >           sync_exp_work_done(rsp, s));
> > > > > > > > > > > >         mutex_unlock(&rsp->exp_mutex);
> > > > > > > > > > > 
> > > > > > > > > > > So task1 is the one that initiated the expedited grace period that
> > > > > > > > > > > started when .expedited_sequence was zero, right?
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Yes, right.
> > > > > > > > > > 
> > > > > > > > > > > > 5. Next exp GP completes, and sequence number is incremented:
> > > > > > > > > > > > 
> > > > > > > > > > > >         rcu_exp_wait_wake()
> > > > > > > > > > > >           rcu_exp_wait_wake()
> > > > > > > > > > > >             rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
> > > > > > > > > > > > 
> > > > > > > > > > > > 6. As kworker A uses current expedited_sequence, it wakes up workers
> > > > > > > > > > > >         from wrong wait queue index - it should have worken wait queue
> > > > > > > > > > > >         corresponding to 0b'100' sequence, but wakes up the ones for
> > > > > > > > > > > >         0b'200' sequence. This results in task at step 1 indefinitely blocked.
> > > > > > > > > > > > 
> > > > > > > > > > > >         rcu_exp_wait_wake()
> > > > > > > > > > > >           wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > > > > > > > > > 
> > > > > > > > > > > So the issue is that the next expedited RCU grace period might
> > > > > > > > > > > have completed before the completion of the wakeups for the previous
> > > > > > > > > > > expedited RCU grace period, correct?  Then expedited grace periods have
> > > > > > > > > > 
> > > > > > > > > > Yes. Actually from the ftraces, I saw that next expedited RCU grace
> > > > > > > > > > period completed while kworker A was in D state, while waiting for
> > > > > > > > > > exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
> > > > > > > > > > its wake_up_all() call; so, task (point 1) was never woken up, as it was
> > > > > > > > > > waiting on wq index 1.
> > > > > > > > > > 
> > > > > > > > > > > to have stopped to prevent any future wakeup from happening, correct?
> > > > > > > > > > > (Which would make it harder for rcutorture to trigger this, though it
> > > > > > > > > > > really does have code that attempts to trigger this sort of thing.)
> > > > > > > > > > > 
> > > > > > > > > > > Is this theoretical in nature, or have you actually triggered it?
> > > > > > > > > > > If actually triggered, what did you do to make this happen?
> > > > > > > > > > 
> > > > > > > > > > This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
> > > > > > > > > > another instance in Nov 2018 (on 4.14 kernel), in our customer reported
> > > > > > > > > > issues. Both instances were in downstream drivers and we didn't have RCU
> > > > > > > > > > traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
> > > > > > > > > > enabled, where it was observed in suspend scenario, where we are observing
> > > > > > > > > > "DPM device timeout" [1], as scsi device is stuck in
> > > > > > > > > > _synchronize_rcu_expedited().
> > > > > > > > > > 
> > > > > > > > > > schedule+0x70/0x90
> > > > > > > > > > _synchronize_rcu_expedited+0x590/0x5f8
> > > > > > > > > > synchronize_rcu+0x50/0xa0
> > > > > > > > > > scsi_device_quiesce+0x50/0x120
> > > > > > > > > > scsi_bus_suspend+0x70/0xe8
> > > > > > > > > > dpm_run_callback+0x148/0x388
> > > > > > > > > > __device_suspend+0x430/0x8a8
> > > > > > > > > > 
> > > > > > > > > > [1]
> > > > > > > > > > https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
> > > > > > > > > > 
> > > > > > > > > > > What have you done to test the change?
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > I have given this for testing; will share the results . Current analysis
> > > > > > > > > > and patch is based on going through ftrace and code review.
> > > > > > > > > 
> > > > > > > > > OK, very good.  Please include the failure information in the changelog
> > > > > > > > > of the next version of this patch.
> > > > > > 
> > > > > > Done.
> > > > > > 
> > > > > > > > > 
> > > > > > > > > I prefer your original patch, that just uses "s", over the one below
> > > > > > > > > that moves the rcu_exp_gp_seq_end().  The big advantage of your original
> > > > > > > > > patch is that it allow more concurrency between a consecutive pair of
> > > > > > > > > expedited RCU grace periods.  Plus it would not be easy to convince
> > > > > > > > > myself that moving rcu_exp_gp_seq_end() down is safe, so your original
> > > > > > > > > is also conceptually simpler with a more manageable state space.
> > > > > > 
> > > > > > The reason for highlighting the alternate approach of doing gp end inside
> > > > > > exp_wake_mutex is the requirement of 3 wqs. Now, this is a theoretical case;
> > > > > > please correct me if I am wrong here:
> > > > > > 
> > > > > > 1. task0 holds exp_wake_mutex, and is preempted.
> > > > > 
> > > > > Presumably after it has awakened the kthread that initiated the prior
> > > > > expedited grace period (the one with seq number = -4).
> > > > > 
> > > > > > 2. task1 initiates new GP (current seq number = 0).
> > > > > 
> > > > > Yes, this can happen.
> > > > > 
> > > > > > 3. task1 queues worker kworker1 and schedules out.
> > > > > 
> > > > > And thus still holds .exp_mutex, but yes.
> > > > > 
> > > > > > 4. kworker1 sets exp GP to 1 and waits on exp_wake_mutex
> > > > > 
> > > > > And thus cannot yet have awakened task1.
> > > > > 
> > > > > > 5. task1 releases exp mutex, w/o entering waitq.
> > > > > 
> > > > > So I do not believe that we can get to #5.  What am I missing here?
> > > > > 
> > > > 
> > > > As mentioned in this patch, task1 could have scheduled out after queuing
> > > > work:
> > > > 
> > > > queue_work(rcu_gp_wq, &rew.rew_work)
> > > >             wake_up_worker()
> > > >               schedule()
> > > > 
> > > > kworker1 runs and picks up this queued work, and sets exp GP to 1 and waits
> > > > on exp_wake_mutex.
> > > > 
> > > > task1 gets scheduled in and checks sync_exp_work_done(rsp, s), which return
> > > > true and it does not enter wait queue and releases exp_mutex.
> > > > 
> > > > wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > > >           sync_exp_work_done(rsp, s));
> > > 
> > > Well, I have certainly given enough people a hard time about missing the
> > > didn't-actually-sleep case, so good show on finding one in my code!  ;-)
> > > 
> > > Which also explains why deferring the rcu_exp_gp_seq_end() is safe:
> > > The .exp_mutex won't be released until after it happens, and the
> > > next manipulation of the sequence number cannot happen until after
> > > .exp_mutex is next acquired.
> > > 
> > > Good catch!  And keep up the good work!!!
> > 
> > And here is the commit corresponding to your earlier patch.  Please let
> > me know of any needed adjustments.
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > commit 3ec440b52831eea172061c5db3d2990b22904863
> > Author: Neeraj Upadhyay <neeraju@codeaurora.org>
> > Date:   Tue Nov 19 11:50:52 2019 -0800
> > 
> >      rcu: Allow only one expedited GP to run concurrently with wakeups
> >      The current expedited RCU grace-period code expects that a task
> >      requesting an expedited grace period cannot awaken until that grace
> >      period has reached the wakeup phase.  However, it is possible for a long
> >      preemption to result in the waiting task never sleeping.  For example,
> >      consider the following sequence of events:
> >      1.      Task A starts an expedited grace period by invoking
> >              synchronize_rcu_expedited().  It proceeds normally up to the
> >              wait_event() near the end of that function, and is then preempted
> >              (or interrupted or whatever).
> >      2.      The expedited grace period completes, and a kworker task starts
> >              the awaken phase, having incremented the counter and acquired
> >              the rcu_state structure's .exp_wake_mutex.  This kworker task
> >              is then preempted or interrupted or whatever.
> >      3.      Task A resumes and enters wait_event(), which notes that the
> >              expedited grace period has completed, and thus doesn't sleep.
> >      4.      Task B starts an expedited grace period exactly as did Task A,
> >              complete with the preemption (or whatever delay) just before
> >              the call to wait_event().
> >      5.      The expedited grace period completes, and another kworker
> >              task starts the awaken phase, having incremented the counter.
> >              However, it blocks when attempting to acquire the rcu_state
> >              structure's .exp_wake_mutex because step 2's kworker task has
> >              not yet released it.
> >      6.      Steps 4 and 5 repeat, resulting in overflow of the rcu_node
> >              structure's ->exp_wq[] array.
> >      In theory, this is harmless.  Tasks waiting on the various ->exp_wq[]
> >      array will just be spuriously awakened, but they will just sleep again
> >      on noting that the rcu_state structure's ->expedited_sequence value has
> >      not advanced far enough.
> >      In practice, this wastes CPU time and is an accident waiting to happen.
> >      This commit therefore moves the rcu_exp_gp_seq_end() call that officially
> >      ends the expedited grace period (along with associate tracing) until
> >      after the ->exp_wake_mutex has been acquired.  This prevents Task A from
> >      awakening prematurely, thus preventing more than one expedited grace
> >      period from being in flight during a previous expedited grace period's
> >      wakeup phase.
> 
> I am not sure, if a "fixes" tag is required for it.

If you have a suggested commit, I would be happy to add it.

> >      Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> >      [ paulmck: Added updated comment. ]
> >      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > 
> > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > index 4433d00a..8840729 100644
> > --- a/kernel/rcu/tree_exp.h
> > +++ b/kernel/rcu/tree_exp.h
> > @@ -539,14 +539,13 @@ static void rcu_exp_wait_wake(unsigned long s)
> >   	struct rcu_node *rnp;
> >   	synchronize_sched_expedited_wait();
> > -	rcu_exp_gp_seq_end();
> > -	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
> > -	/*
> > -	 * Switch over to wakeup mode, allowing the next GP, but -only- the
> > -	 * next GP, to proceed.
> > -	 */
> > +	// Switch over to wakeup mode, allowing the next GP to proceed.
> > +	// End the previous grace period only after acquiring the mutex
> > +	// to ensure that only one GP runs concurrently with wakeups.
> 
> Should comment style be changed to below?
> 
> /* Switch over to wakeup mode, allowing the next GP to proceed.
>  * End the previous grace period only after acquiring the mutex
>  * to ensure that only one GP runs concurrently with wakeups.
>  */

No.  "//" is acceptable comment format, aside from docbook headers.
The "//" approach saves three characters per line compared to "/* ... */"
single-line comments and a line compared to the style you show above.

So yes, some maintainers prefer the style you show, but not me.
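
For comparison, the same comment in the classic block form, shown only to
make the line counts concrete:

	/*
	 * Switch over to wakeup mode, allowing the next GP to proceed.
	 * End the previous grace period only after acquiring the mutex
	 * to ensure that only one GP runs concurrently with wakeups.
	 */

That is five lines against three for the "//" form in the patch above.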

							Thanx, Paul

> >   	mutex_lock(&rcu_state.exp_wake_mutex);
> > +	rcu_exp_gp_seq_end();
> > +	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
> >   	rcu_for_each_node_breadth_first(rnp) {
> >   		if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s)) {
> > 
> 
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
  2019-11-21  4:10                       ` Paul E. McKenney
@ 2019-11-21  4:36                         ` Neeraj Upadhyay
       [not found]                         ` <71de1f25-75c5-a4bb-49d9-19af5f8ae4fe@codeaurora.org>
  1 sibling, 0 replies; 16+ messages in thread
From: Neeraj Upadhyay @ 2019-11-21  4:36 UTC (permalink / raw)
  To: paulmck
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli



On 11/21/2019 9:40 AM, Paul E. McKenney wrote:
> On Wed, Nov 20, 2019 at 10:38:56AM +0530, Neeraj Upadhyay wrote:
>>
>>
>> On 11/20/2019 1:36 AM, Paul E. McKenney wrote:
>>> On Tue, Nov 19, 2019 at 07:09:31AM -0800, Paul E. McKenney wrote:
>>>> On Tue, Nov 19, 2019 at 07:03:14AM +0000, Neeraj Upadhyay wrote:
>>>>> Hi Paul,
>>>>>
>>>>> On 11/19/2019 9:35 AM, Paul E. McKenney wrote:
>>>>>> On Tue, Nov 19, 2019 at 03:35:15AM +0000, Neeraj Upadhyay wrote:
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
>>>>>>>> On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
>>>>>>>>>> On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>
>>>>>>>>>>> On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
>>>>>>>>>>>> On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
>>>>>>>>>>>>> For the tasks waiting in exp_wq inside exp_funnel_lock(),
>>>>>>>>>>>>> there is a chance that they might be indefinitely blocked
>>>>>>>>>>>>> in below scenario:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. There is a task waiting on exp sequence 0b'100' inside
>>>>>>>>>>>>>          exp_funnel_lock().
>>>>>>>>>>>>>
>>>>>>>>>>>>>          _synchronize_rcu_expedited()
>>>>>>>>>>>>
>>>>>>>>>>>> This symbol went away a few versions back, but let's see how this
>>>>>>>>>>>> plays out in current -rcu.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Sorry; for us this problem is observed on 4.19 stable version; I had
>>>>>>>>>>> checked against the -rcu code, and the relevant portions were present
>>>>>>>>>>> there.
>>>>>>>>>>>
>>>>>>>>>>>>>            s = 0b'100
>>>>>>>>>>>>>            exp_funnel_lock()
>>>>>>>>>>>>>              wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
>>>>>>>>>>>>
>>>>>>>>>>>> All of the above could still happen if the expedited grace
>>>>>>>>>>>> period number was zero (or a bit less) when that task invoked
>>>>>>>>>>>
>>>>>>>>>>> Yes
>>>>>>>>>>>
>>>>>>>>>>>> synchronize_rcu_expedited().  What is the relation, if any,
>>>>>>>>>>>> between this task and "task1" below?  Seems like you want them to
>>>>>>>>>>>> be different tasks.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This task is the one which is waiting for the expedited sequence, which
>>>>>>>>>>> "task1" completes ("task1" holds the exp_mutex for it). "task1" would
>>>>>>>>>>> wake up this task, on exp GP completion.
>>>>>>>>>>>
>>>>>>>>>>>> Does this task actually block, or is it just getting ready
>>>>>>>>>>>> to block?  Seems like you need it to have actually blocked.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yes, it actually blocked in wait queue.
>>>>>>>>>>>
>>>>>>>>>>>>> 2. The Exp GP completes and task (task1) holding exp_mutex queues
>>>>>>>>>>>>>          worker and schedules out.
>>>>>>>>>>>>
>>>>>>>>>>>> "The Exp GP" being the one that was initiated when the .expedited_sequence
>>>>>>>>>>>> counter was zero, correct?  (Looks that way below.)
>>>>>>>>>>>>
>>>>>>>>>>> Yes, correct.
>>>>>>>>>>>
>>>>>>>>>>>>>          _synchronize_rcu_expedited()
>>>>>>>>>>>>>            s = 0b'100
>>>>>>>>>>>>>            queue_work(rcu_gp_wq, &rew.rew_work)
>>>>>>>>>>>>>              wake_up_worker()
>>>>>>>>>>>>>                schedule()
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3. kworker A picks up the queued work and completes the exp gp
>>>>>>>>>>>>>          sequence.
>>>>>>>>>>>>>
>>>>>>>>>>>>>          rcu_exp_wait_wake()
>>>>>>>>>>>>>            rcu_exp_wait_wake()
>>>>>>>>>>>>>              rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>>>>>>>>>>>>>                                      // to 0b'100'
>>>>>>>>>>>>>
>>>>>>>>>>>>> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>>>>>>>>>>>>>          and releases exp_mutex.
>>>>>>>>>>>>>
>>>>>>>>>>>>>          wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>>>>>>>>>>>>            sync_exp_work_done(rsp, s));
>>>>>>>>>>>>>          mutex_unlock(&rsp->exp_mutex);
>>>>>>>>>>>>
>>>>>>>>>>>> So task1 is the one that initiated the expedited grace period that
>>>>>>>>>>>> started when .expedited_sequence was zero, right?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yes, right.
>>>>>>>>>>>
>>>>>>>>>>>>> 5. Next exp GP completes, and sequence number is incremented:
>>>>>>>>>>>>>
>>>>>>>>>>>>>          rcu_exp_wait_wake()
>>>>>>>>>>>>>            rcu_exp_wait_wake()
>>>>>>>>>>>>>              rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
>>>>>>>>>>>>>
>>>>>>>>>>>>> 6. As kworker A uses current expedited_sequence, it wakes up workers
>>>>>>>>>>>>>          from wrong wait queue index - it should have worken wait queue
>>>>>>>>>>>>>          corresponding to 0b'100' sequence, but wakes up the ones for
>>>>>>>>>>>>>          0b'200' sequence. This results in task at step 1 indefinitely blocked.
>>>>>>>>>>>>>
>>>>>>>>>>>>>          rcu_exp_wait_wake()
>>>>>>>>>>>>>            wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>>>>>>>>>>
>>>>>>>>>>>> So the issue is that the next expedited RCU grace period might
>>>>>>>>>>>> have completed before the completion of the wakeups for the previous
>>>>>>>>>>>> expedited RCU grace period, correct?  Then expedited grace periods have
>>>>>>>>>>>
>>>>>>>>>>> Yes. Actually from the ftraces, I saw that next expedited RCU grace
>>>>>>>>>>> period completed while kworker A was in D state, while waiting for
>>>>>>>>>>> exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
>>>>>>>>>>> its wake_up_all() call; so, task (point 1) was never woken up, as it was
>>>>>>>>>>> waiting on wq index 1.
>>>>>>>>>>>
>>>>>>>>>>>> to have stopped to prevent any future wakeup from happening, correct?
>>>>>>>>>>>> (Which would make it harder for rcutorture to trigger this, though it
>>>>>>>>>>>> really does have code that attempts to trigger this sort of thing.)
>>>>>>>>>>>>
>>>>>>>>>>>> Is this theoretical in nature, or have you actually triggered it?
>>>>>>>>>>>> If actually triggered, what did you do to make this happen?
>>>>>>>>>>>
>>>>>>>>>>> This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
>>>>>>>>>>> another instance in Nov 2018 (on 4.14 kernel), in our customer reported
>>>>>>>>>>> issues. Both instances were in downstream drivers and we didn't have RCU
>>>>>>>>>>> traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
>>>>>>>>>>> enabled, where it was observed in suspend scenario, where we are observing
>>>>>>>>>>> "DPM device timeout" [1], as scsi device is stuck in
>>>>>>>>>>> _synchronize_rcu_expedited().
>>>>>>>>>>>
>>>>>>>>>>> schedule+0x70/0x90
>>>>>>>>>>> _synchronize_rcu_expedited+0x590/0x5f8
>>>>>>>>>>> synchronize_rcu+0x50/0xa0
>>>>>>>>>>> scsi_device_quiesce+0x50/0x120
>>>>>>>>>>> scsi_bus_suspend+0x70/0xe8
>>>>>>>>>>> dpm_run_callback+0x148/0x388
>>>>>>>>>>> __device_suspend+0x430/0x8a8
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
>>>>>>>>>>>
>>>>>>>>>>>> What have you done to test the change?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I have given this for testing; will share the results . Current analysis
>>>>>>>>>>> and patch is based on going through ftrace and code review.
>>>>>>>>>>
>>>>>>>>>> OK, very good.  Please include the failure information in the changelog
>>>>>>>>>> of the next version of this patch.
>>>>>>>
>>>>>>> Done.
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I prefer your original patch, that just uses "s", over the one below
>>>>>>>>>> that moves the rcu_exp_gp_seq_end().  The big advantage of your original
>>>>>>>>>> patch is that it allow more concurrency between a consecutive pair of
>>>>>>>>>> expedited RCU grace periods.  Plus it would not be easy to convince
>>>>>>>>>> myself that moving rcu_exp_gp_seq_end() down is safe, so your original
>>>>>>>>>> is also conceptually simpler with a more manageable state space.
>>>>>>>
>>>>>>> The reason for highlighting the alternate approach of doing gp end inside
>>>>>>> exp_wake_mutex is the requirement of 3 wqs. Now, this is a theoretical case;
>>>>>>> please correct me if I am wrong here:
>>>>>>>
>>>>>>> 1. task0 holds exp_wake_mutex, and is preempted.
>>>>>>
>>>>>> Presumably after it has awakened the kthread that initiated the prior
>>>>>> expedited grace period (the one with seq number = -4).
>>>>>>
>>>>>>> 2. task1 initiates new GP (current seq number = 0).
>>>>>>
>>>>>> Yes, this can happen.
>>>>>>
>>>>>>> 3. task1 queues worker kworker1 and schedules out.
>>>>>>
>>>>>> And thus still holds .exp_mutex, but yes.
>>>>>>
>>>>>>> 4. kworker1 sets exp GP to 1 and waits on exp_wake_mutex
>>>>>>
>>>>>> And thus cannot yet have awakened task1.
>>>>>>
>>>>>>> 5. task1 releases exp mutex, w/o entering waitq.
>>>>>>
>>>>>> So I do not believe that we can get to #5.  What am I missing here?
>>>>>>
>>>>>
>>>>> As mentioned in this patch, task1 could have scheduled out after queuing
>>>>> work:
>>>>>
>>>>> queue_work(rcu_gp_wq, &rew.rew_work)
>>>>>              wake_up_worker()
>>>>>                schedule()
>>>>>
>>>>> kworker1 runs and picks up this queued work, and sets exp GP to 1 and waits
>>>>> on exp_wake_mutex.
>>>>>
>>>>> task1 gets scheduled in and checks sync_exp_work_done(rsp, s), which return
>>>>> true and it does not enter wait queue and releases exp_mutex.
>>>>>
>>>>> wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>>>>            sync_exp_work_done(rsp, s));
>>>>
>>>> Well, I have certainly given enough people a hard time about missing the
>>>> didn't-actually-sleep case, so good show on finding one in my code!  ;-)
>>>>
>>>> Which also explains why deferring the rcu_exp_gp_seq_end() is safe:
>>>> The .exp_mutex won't be released until after it happens, and the
>>>> next manipulation of the sequence number cannot happen until after
>>>> .exp_mutex is next acquired.
>>>>
>>>> Good catch!  And keep up the good work!!!
>>>
>>> And here is the commit corresponding to your earlier patch.  Please let
>>> me know of any needed adjustments.
>>>
>>> 							Thanx, Paul
>>>
>>> ------------------------------------------------------------------------
>>>
>>> commit 3ec440b52831eea172061c5db3d2990b22904863
>>> Author: Neeraj Upadhyay <neeraju@codeaurora.org>
>>> Date:   Tue Nov 19 11:50:52 2019 -0800
>>>
>>>       rcu: Allow only one expedited GP to run concurrently with wakeups
>>>       The current expedited RCU grace-period code expects that a task
>>>       requesting an expedited grace period cannot awaken until that grace
>>>       period has reached the wakeup phase.  However, it is possible for a long
>>>       preemption to result in the waiting task never sleeping.  For example,
>>>       consider the following sequence of events:
>>>       1.      Task A starts an expedited grace period by invoking
>>>               synchronize_rcu_expedited().  It proceeds normally up to the
>>>               wait_event() near the end of that function, and is then preempted
>>>               (or interrupted or whatever).
>>>       2.      The expedited grace period completes, and a kworker task starts
>>>               the awaken phase, having incremented the counter and acquired
>>>               the rcu_state structure's .exp_wake_mutex.  This kworker task
>>>               is then preempted or interrupted or whatever.
>>>       3.      Task A resumes and enters wait_event(), which notes that the
>>>               expedited grace period has completed, and thus doesn't sleep.
>>>       4.      Task B starts an expedited grace period exactly as did Task A,
>>>               complete with the preemption (or whatever delay) just before
>>>               the call to wait_event().
>>>       5.      The expedited grace period completes, and another kworker
>>>               task starts the awaken phase, having incremented the counter.
>>>               However, it blocks when attempting to acquire the rcu_state
>>>               structure's .exp_wake_mutex because step 2's kworker task has
>>>               not yet released it.
>>>       6.      Steps 4 and 5 repeat, resulting in overflow of the rcu_node
>>>               structure's ->exp_wq[] array.
>>>       In theory, this is harmless.  Tasks waiting on the various ->exp_wq[]
>>>       arrays will just be spuriously awakened, but they will simply sleep again
>>>       on noting that the rcu_state structure's ->expedited_sequence value has
>>>       not advanced far enough.
>>>       In practice, this wastes CPU time and is an accident waiting to happen.
>>>       This commit therefore moves the rcu_exp_gp_seq_end() call that officially
>>>       ends the expedited grace period (along with associated tracing) until
>>>       after the ->exp_wake_mutex has been acquired.  This prevents Task A from
>>>       awakening prematurely, thus preventing more than one expedited grace
>>>       period from being in flight during a previous expedited grace period's
>>>       wakeup phase.
>>
>> I am not sure if a "Fixes:" tag is required for it.
> 
> If you have a suggested commit, I would be happy to add it.
> 

I think it is either of the 2 below - the first one is the one touching the
tree_exp.h file, the second one looks to be the original commit.

1. https://github.com/torvalds/linux/commit/3549c2bc2c4ea8ecfeb9d21cb81cb00c6002b011

2. https://github.com/torvalds/linux/commit/3b5f668e715bc19610ad967ef97a7e8c55a186ec


Thanks
Neeraj

>>>       Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>>>       [ paulmck: Added updated comment. ]
>>>       Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>>>
>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>> index 4433d00a..8840729 100644
>>> --- a/kernel/rcu/tree_exp.h
>>> +++ b/kernel/rcu/tree_exp.h
>>> @@ -539,14 +539,13 @@ static void rcu_exp_wait_wake(unsigned long s)
>>>    	struct rcu_node *rnp;
>>>    	synchronize_sched_expedited_wait();
>>> -	rcu_exp_gp_seq_end();
>>> -	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
>>> -	/*
>>> -	 * Switch over to wakeup mode, allowing the next GP, but -only- the
>>> -	 * next GP, to proceed.
>>> -	 */
>>> +	// Switch over to wakeup mode, allowing the next GP to proceed.
>>> +	// End the previous grace period only after acquiring the mutex
>>> +	// to ensure that only one GP runs concurrently with wakeups.
>>
>> Should comment style be changed to below?
>>
>> /* Switch over to wakeup mode, allowing the next GP to proceed.
>>   * End the previous grace period only after acquiring the mutex
>>   * to ensure that only one GP runs concurrently with wakeups.
>>   */
> 
> No.  "//" is acceptable comment format, aside from docbook headers.
> The "//" approach saves three characters per line compared to "/* ... */"
> single-line comments and a line compared to the style you show above.
> 
> So yes, some maintainers prefer the style you show, but not me.
> 
> 							Thanx, Paul
> 
>>>    	mutex_lock(&rcu_state.exp_wake_mutex);
>>> +	rcu_exp_gp_seq_end();
>>> +	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
>>>    	rcu_for_each_node_breadth_first(rnp) {
>>>    		if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s)) {
>>>
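
Read together with the hunk above, a rough timeline of the reordered flow - a
sketch only, with the trailing ->exp_wake_mutex unlock assumed from the rest of
the function rather than shown in this hunk:

    /*
     * initiator (holds ->exp_mutex)        kworker (rcu_exp_wait_wake())
     * -----------------------------        -------------------------------------
     * queue_work(rcu_gp_wq, ...);          synchronize_sched_expedited_wait();
     * wait_event(exp_wq[...],              mutex_lock(&rcu_state.exp_wake_mutex);
     *            sync_exp_work_done(s));   rcu_exp_gp_seq_end();   <-- moved here
     * mutex_unlock(&rcu_state.exp_mutex);  wake_up_all(&rnp->exp_wq[...]);
     *                                      mutex_unlock(&rcu_state.exp_wake_mutex);
     *
     * With the sequence number ending only after ->exp_wake_mutex is held, the
     * initiator cannot see its grace period as complete (and so cannot drop
     * ->exp_mutex and let the next expedited GP begin) before the wakeup phase
     * of its own GP has started.
     */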
>>
>> -- 
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>> the Code Aurora Forum, hosted by The Linux Foundation

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
       [not found]                         ` <71de1f25-75c5-a4bb-49d9-19af5f8ae4fe@codeaurora.org>
@ 2019-11-21 18:55                           ` Paul E. McKenney
  2019-11-22  3:44                             ` Neeraj Upadhyay
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2019-11-21 18:55 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli

On Thu, Nov 21, 2019 at 10:06:40AM +0530, Neeraj Upadhyay wrote:
> 
> 
> On 11/21/2019 9:40 AM, Paul E. McKenney wrote:
> > On Wed, Nov 20, 2019 at 10:38:56AM +0530, Neeraj Upadhyay wrote:
> > > 
> > > 
> > > On 11/20/2019 1:36 AM, Paul E. McKenney wrote:
> > > > On Tue, Nov 19, 2019 at 07:09:31AM -0800, Paul E. McKenney wrote:
> > > > > On Tue, Nov 19, 2019 at 07:03:14AM +0000, Neeraj Upadhyay wrote:
> > > > > > Hi Paul,
> > > > > > 
> > > > > > On 11/19/2019 9:35 AM, Paul E. McKenney wrote:
> > > > > > > On Tue, Nov 19, 2019 at 03:35:15AM +0000, Neeraj Upadhyay wrote:
> > > > > > > > Hi Paul,
> > > > > > > > 
> > > > > > > > On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
> > > > > > > > > On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
> > > > > > > > > > Hi Paul,
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
> > > > > > > > > > > On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
> > > > > > > > > > > > Hi Paul,
> > > > > > > > > > > > 
> > > > > > > > > > > > On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
> > > > > > > > > > > > > On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
> > > > > > > > > > > > > > For the tasks waiting in exp_wq inside exp_funnel_lock(),
> > > > > > > > > > > > > > there is a chance that they might be indefinitely blocked
> > > > > > > > > > > > > > in below scenario:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 1. There is a task waiting on exp sequence 0b'100' inside
> > > > > > > > > > > > > >          exp_funnel_lock().
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >          _synchronize_rcu_expedited()
> > > > > > > > > > > > > 
> > > > > > > > > > > > > This symbol went away a few versions back, but let's see how this
> > > > > > > > > > > > > plays out in current -rcu.
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Sorry; for us this problem is observed on 4.19 stable version; I had
> > > > > > > > > > > > checked against the -rcu code, and the relevant portions were present
> > > > > > > > > > > > there.
> > > > > > > > > > > > 
> > > > > > > > > > > > > >            s = 0b'100
> > > > > > > > > > > > > >            exp_funnel_lock()
> > > > > > > > > > > > > >              wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
> > > > > > > > > > > > > 
> > > > > > > > > > > > > All of the above could still happen if the expedited grace
> > > > > > > > > > > > > period number was zero (or a bit less) when that task invoked
> > > > > > > > > > > > 
> > > > > > > > > > > > Yes
> > > > > > > > > > > > 
> > > > > > > > > > > > > synchronize_rcu_expedited().  What is the relation, if any,
> > > > > > > > > > > > > between this task and "task1" below?  Seems like you want them to
> > > > > > > > > > > > > be different tasks.
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > This task is the one which is waiting for the expedited sequence, which
> > > > > > > > > > > > "task1" completes ("task1" holds the exp_mutex for it). "task1" would
> > > > > > > > > > > > wake up this task, on exp GP completion.
> > > > > > > > > > > > 
> > > > > > > > > > > > > Does this task actually block, or is it just getting ready
> > > > > > > > > > > > > to block?  Seems like you need it to have actually blocked.
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Yes, it actually blocked in wait queue.
> > > > > > > > > > > > 
> > > > > > > > > > > > > > 2. The Exp GP completes and task (task1) holding exp_mutex queues
> > > > > > > > > > > > > >          worker and schedules out.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > "The Exp GP" being the one that was initiated when the .expedited_sequence
> > > > > > > > > > > > > counter was zero, correct?  (Looks that way below.)
> > > > > > > > > > > > > 
> > > > > > > > > > > > Yes, correct.
> > > > > > > > > > > > 
> > > > > > > > > > > > > >          _synchronize_rcu_expedited()
> > > > > > > > > > > > > >            s = 0b'100
> > > > > > > > > > > > > >            queue_work(rcu_gp_wq, &rew.rew_work)
> > > > > > > > > > > > > >              wake_up_worker()
> > > > > > > > > > > > > >                schedule()
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 3. kworker A picks up the queued work and completes the exp gp
> > > > > > > > > > > > > >          sequence.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >          rcu_exp_wait_wake()
> > > > > > > > > > > > > >            rcu_exp_wait_wake()
> > > > > > > > > > > > > >              rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
> > > > > > > > > > > > > >                                      // to 0b'100'
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
> > > > > > > > > > > > > >          and releases exp_mutex.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >          wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > > > > > > > > > > > > >            sync_exp_work_done(rsp, s));
> > > > > > > > > > > > > >          mutex_unlock(&rsp->exp_mutex);
> > > > > > > > > > > > > 
> > > > > > > > > > > > > So task1 is the one that initiated the expedited grace period that
> > > > > > > > > > > > > started when .expedited_sequence was zero, right?
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Yes, right.
> > > > > > > > > > > > 
> > > > > > > > > > > > > > 5. Next exp GP completes, and sequence number is incremented:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >          rcu_exp_wait_wake()
> > > > > > > > > > > > > >            rcu_exp_wait_wake()
> > > > > > > > > > > > > >              rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 6. As kworker A uses current expedited_sequence, it wakes up workers
> > > > > > > > > > > > > >          from wrong wait queue index - it should have worken wait queue
> > > > > > > > > > > > > >          corresponding to 0b'100' sequence, but wakes up the ones for
> > > > > > > > > > > > > >          0b'200' sequence. This results in task at step 1 indefinitely blocked.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >          rcu_exp_wait_wake()
> > > > > > > > > > > > > >            wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > > > > > > > > > > > 
> > > > > > > > > > > > > So the issue is that the next expedited RCU grace period might
> > > > > > > > > > > > > have completed before the completion of the wakeups for the previous
> > > > > > > > > > > > > expedited RCU grace period, correct?  Then expedited grace periods have
> > > > > > > > > > > > 
> > > > > > > > > > > > Yes. Actually from the ftraces, I saw that next expedited RCU grace
> > > > > > > > > > > > period completed while kworker A was in D state, while waiting for
> > > > > > > > > > > > exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
> > > > > > > > > > > > its wake_up_all() call; so, task (point 1) was never woken up, as it was
> > > > > > > > > > > > waiting on wq index 1.
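
A quick sketch of the index arithmetic (assuming rcu_seq_ctr() drops the two
low state bits, i.e. rcu_seq_ctr(x) == x >> 2, and each completed expedited GP
advances the sequence by 4):

    /*
     * s (task1's snapshot)                = 0b100 -> rcu_seq_ctr(s) & 0x3 = 1
     * expedited_sequence after the next
     * GP also ends                        = 8     -> rcu_seq_ctr(..) & 0x3 = 2
     *
     * So the delayed kworker's wake_up_all() hits exp_wq[2], while the stuck
     * task from step 1 sleeps on exp_wq[1]; passing the snapshot "s" keeps
     * both sides pointing at the same index.
     */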
> > > > > > > > > > > > 
> > > > > > > > > > > > > to have stopped to prevent any future wakeup from happening, correct?
> > > > > > > > > > > > > (Which would make it harder for rcutorture to trigger this, though it
> > > > > > > > > > > > > really does have code that attempts to trigger this sort of thing.)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Is this theoretical in nature, or have you actually triggered it?
> > > > > > > > > > > > > If actually triggered, what did you do to make this happen?
> > > > > > > > > > > > 
> > > > > > > > > > > > This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
> > > > > > > > > > > > another instance in Nov 2018 (on 4.14 kernel), in our customer reported
> > > > > > > > > > > > issues. Both instances were in downstream drivers and we didn't have RCU
> > > > > > > > > > > > traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
> > > > > > > > > > > > enabled, where it was observed in suspend scenario, where we are observing
> > > > > > > > > > > > "DPM device timeout" [1], as scsi device is stuck in
> > > > > > > > > > > > _synchronize_rcu_expedited().
> > > > > > > > > > > > 
> > > > > > > > > > > > schedule+0x70/0x90
> > > > > > > > > > > > _synchronize_rcu_expedited+0x590/0x5f8
> > > > > > > > > > > > synchronize_rcu+0x50/0xa0
> > > > > > > > > > > > scsi_device_quiesce+0x50/0x120
> > > > > > > > > > > > scsi_bus_suspend+0x70/0xe8
> > > > > > > > > > > > dpm_run_callback+0x148/0x388
> > > > > > > > > > > > __device_suspend+0x430/0x8a8
> > > > > > > > > > > > 
> > > > > > > > > > > > [1]
> > > > > > > > > > > > https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
> > > > > > > > > > > > 
> > > > > > > > > > > > > What have you done to test the change?
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > I have given this for testing; will share the results . Current analysis
> > > > > > > > > > > > and patch is based on going through ftrace and code review.
> > > > > > > > > > > 
> > > > > > > > > > > OK, very good.  Please include the failure information in the changelog
> > > > > > > > > > > of the next version of this patch.
> > > > > > > > 
> > > > > > > > Done.
> > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > I prefer your original patch, that just uses "s", over the one below
> > > > > > > > > > > that moves the rcu_exp_gp_seq_end().  The big advantage of your original
> > > > > > > > > > > patch is that it allow more concurrency between a consecutive pair of
> > > > > > > > > > > expedited RCU grace periods.  Plus it would not be easy to convince
> > > > > > > > > > > myself that moving rcu_exp_gp_seq_end() down is safe, so your original
> > > > > > > > > > > is also conceptually simpler with a more manageable state space.
> > > > > > > > 
> > > > > > > > The reason for highlighting the alternate approach of doing gp end inside
> > > > > > > > exp_wake_mutex is the requirement of 3 wqs. Now, this is a theoretical case;
> > > > > > > > please correct me if I am wrong here:
> > > > > > > > 
> > > > > > > > 1. task0 holds exp_wake_mutex, and is preempted.
> > > > > > > 
> > > > > > > Presumably after it has awakened the kthread that initiated the prior
> > > > > > > expedited grace period (the one with seq number = -4).
> > > > > > > 
> > > > > > > > 2. task1 initiates new GP (current seq number = 0).
> > > > > > > 
> > > > > > > Yes, this can happen.
> > > > > > > 
> > > > > > > > 3. task1 queues worker kworker1 and schedules out.
> > > > > > > 
> > > > > > > And thus still holds .exp_mutex, but yes.
> > > > > > > 
> > > > > > > > 4. kworker1 sets exp GP to 1 and waits on exp_wake_mutex
> > > > > > > 
> > > > > > > And thus cannot yet have awakened task1.
> > > > > > > 
> > > > > > > > 5. task1 releases exp mutex, w/o entering waitq.
> > > > > > > 
> > > > > > > So I do not believe that we can get to #5.  What am I missing here?
> > > > > > > 
> > > > > > 
> > > > > > As mentioned in this patch, task1 could have scheduled out after queuing
> > > > > > work:
> > > > > > 
> > > > > > queue_work(rcu_gp_wq, &rew.rew_work)
> > > > > >              wake_up_worker()
> > > > > >                schedule()
> > > > > > 
> > > > > > kworker1 runs and picks up this queued work, and sets exp GP to 1 and waits
> > > > > > on exp_wake_mutex.
> > > > > > 
> > > > > > task1 gets scheduled in and checks sync_exp_work_done(rsp, s), which return
> > > > > > true and it does not enter wait queue and releases exp_mutex.
> > > > > > 
> > > > > > wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > > > > >            sync_exp_work_done(rsp, s));
> > > > > 
> > > > > Well, I have certainly given enough people a hard time about missing the
> > > > > didn't-actually-sleep case, so good show on finding one in my code!  ;-)
> > > > > 
> > > > > Which also explains why deferring the rcu_exp_gp_seq_end() is safe:
> > > > > The .exp_mutex won't be released until after it happens, and the
> > > > > next manipulation of the sequence number cannot happen until after
> > > > > .exp_mutex is next acquired.
> > > > > 
> > > > > Good catch!  And keep up the good work!!!
> > > > 
> > > > And here is the commit corresponding to your earlier patch.  Please let
> > > > me know of any needed adjustments.
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > > commit 3ec440b52831eea172061c5db3d2990b22904863
> > > > Author: Neeraj Upadhyay <neeraju@codeaurora.org>
> > > > Date:   Tue Nov 19 11:50:52 2019 -0800
> > > > 
> > > >       rcu: Allow only one expedited GP to run concurrently with wakeups
> > > >       The current expedited RCU grace-period code expects that a task
> > > >       requesting an expedited grace period cannot awaken until that grace
> > > >       period has reached the wakeup phase.  However, it is possible for a long
> > > >       preemption to result in the waiting task never sleeping.  For example,
> > > >       consider the following sequence of events:
> > > >       1.      Task A starts an expedited grace period by invoking
> > > >               synchronize_rcu_expedited().  It proceeds normally up to the
> > > >               wait_event() near the end of that function, and is then preempted
> > > >               (or interrupted or whatever).
> > > >       2.      The expedited grace period completes, and a kworker task starts
> > > >               the awaken phase, having incremented the counter and acquired
> > > >               the rcu_state structure's .exp_wake_mutex.  This kworker task
> > > >               is then preempted or interrupted or whatever.
> > > >       3.      Task A resumes and enters wait_event(), which notes that the
> > > >               expedited grace period has completed, and thus doesn't sleep.
> > > >       4.      Task B starts an expedited grace period exactly as did Task A,
> > > >               complete with the preemption (or whatever delay) just before
> > > >               the call to wait_event().
> > > >       5.      The expedited grace period completes, and another kworker
> > > >               task starts the awaken phase, having incremented the counter.
> > > >               However, it blocks when attempting to acquire the rcu_state
> > > >               structure's .exp_wake_mutex because step 2's kworker task has
> > > >               not yet released it.
> > > >       6.      Steps 4 and 5 repeat, resulting in overflow of the rcu_node
> > > >               structure's ->exp_wq[] array.
> > > >       In theory, this is harmless.  Tasks waiting on the various ->exp_wq[]
> > > >       arrays will just be spuriously awakened, but they will simply sleep again
> > > >       on noting that the rcu_state structure's ->expedited_sequence value has
> > > >       not advanced far enough.
> > > >       In practice, this wastes CPU time and is an accident waiting to happen.
> > > >       This commit therefore moves the rcu_exp_gp_seq_end() call that officially
> > > >       ends the expedited grace period (along with associated tracing) until
> > > >       after the ->exp_wake_mutex has been acquired.  This prevents Task A from
> > > >       awakening prematurely, thus preventing more than one expedited grace
> > > >       period from being in flight during a previous expedited grace period's
> > > >       wakeup phase.
> > > 
> > > I am not sure if a "Fixes:" tag is required for it.
> > 
> > If you have a suggested commit, I would be happy to add it.
> > 
> 
> I think it is either of the 2 below - the first one is the one touching the
> tree_exp.h file, the second one looks to be the original commit.
> 
> 1. https://github.com/torvalds/linux/commit/3549c2bc2c4ea8ecfeb9d21cb81cb00c6002b011
> 
> 2. https://github.com/torvalds/linux/commit/3b5f668e715bc19610ad967ef97a7e8c55a186ec

Agreed, this second commit is the one that introduced the bug.  I placed
"Fixes:" tags on both of your commits for this one.  And thank you for
digging them both up!

							Thanx, Paul

> Thanks
> Neeraj
> 
> > > >       Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> > > >       [ paulmck: Added updated comment. ]
> > > >       Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > > 
> > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > > index 4433d00a..8840729 100644
> > > > --- a/kernel/rcu/tree_exp.h
> > > > +++ b/kernel/rcu/tree_exp.h
> > > > @@ -539,14 +539,13 @@ static void rcu_exp_wait_wake(unsigned long s)
> > > >    	struct rcu_node *rnp;
> > > >    	synchronize_sched_expedited_wait();
> > > > -	rcu_exp_gp_seq_end();
> > > > -	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
> > > > -	/*
> > > > -	 * Switch over to wakeup mode, allowing the next GP, but -only- the
> > > > -	 * next GP, to proceed.
> > > > -	 */
> > > > +	// Switch over to wakeup mode, allowing the next GP to proceed.
> > > > +	// End the previous grace period only after acquiring the mutex
> > > > +	// to ensure that only one GP runs concurrently with wakeups.
> > > 
> > > Should comment style be changed to below?
> > > 
> > > /* Switch over to wakeup mode, allowing the next GP to proceed.
> > >   * End the previous grace period only after acquiring the mutex
> > >   * to ensure that only one GP runs concurrently with wakeups.
> > >   */
> > 
> > No.  "//" is acceptable comment format, aside from docbook headers.
> > The "//" approach saves three characters per line compared to "/* ... */"
> > single-line comments and a line compared to the style you show above.
> > 
> > So yes, some maintainers prefer the style you show, but not me.
> > 
> > 							Thanx, Paul
> > 
> > > >    	mutex_lock(&rcu_state.exp_wake_mutex);
> > > > +	rcu_exp_gp_seq_end();
> > > > +	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
> > > >    	rcu_for_each_node_breadth_first(rnp) {
> > > >    		if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s)) {
> > > > 
> > > 
> > > -- 
> > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> > > the Code Aurora Forum, hosted by The Linux Foundation
> 
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
  2019-11-21 18:55                           ` Paul E. McKenney
@ 2019-11-22  3:44                             ` Neeraj Upadhyay
  0 siblings, 0 replies; 16+ messages in thread
From: Neeraj Upadhyay @ 2019-11-22  3:44 UTC (permalink / raw)
  To: paulmck
  Cc: josh, joel, rostedt, mathieu.desnoyers, jiangshanlai,
	linux-kernel, rcu, pkondeti, prsood, gkohli



On 11/22/2019 12:25 AM, Paul E. McKenney wrote:
> On Thu, Nov 21, 2019 at 10:06:40AM +0530, Neeraj Upadhyay wrote:
>>
>>
>> On 11/21/2019 9:40 AM, Paul E. McKenney wrote:
>>> On Wed, Nov 20, 2019 at 10:38:56AM +0530, Neeraj Upadhyay wrote:
>>>>
>>>>
>>>> On 11/20/2019 1:36 AM, Paul E. McKenney wrote:
>>>>> On Tue, Nov 19, 2019 at 07:09:31AM -0800, Paul E. McKenney wrote:
>>>>>> On Tue, Nov 19, 2019 at 07:03:14AM +0000, Neeraj Upadhyay wrote:
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> On 11/19/2019 9:35 AM, Paul E. McKenney wrote:
>>>>>>>> On Tue, Nov 19, 2019 at 03:35:15AM +0000, Neeraj Upadhyay wrote:
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>> On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
>>>>>>>>>> On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
>>>>>>>>>>>> On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
>>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
>>>>>>>>>>>>>> On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
>>>>>>>>>>>>>>> For the tasks waiting in exp_wq inside exp_funnel_lock(),
>>>>>>>>>>>>>>> there is a chance that they might be indefinitely blocked
>>>>>>>>>>>>>>> in below scenario:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. There is a task waiting on exp sequence 0b'100' inside
>>>>>>>>>>>>>>>           exp_funnel_lock().
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>           _synchronize_rcu_expedited()
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This symbol went away a few versions back, but let's see how this
>>>>>>>>>>>>>> plays out in current -rcu.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry; for us this problem is observed on 4.19 stable version; I had
>>>>>>>>>>>>> checked against the -rcu code, and the relevant portions were present
>>>>>>>>>>>>> there.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>             s = 0b'100
>>>>>>>>>>>>>>>             exp_funnel_lock()
>>>>>>>>>>>>>>>               wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> All of the above could still happen if the expedited grace
>>>>>>>>>>>>>> period number was zero (or a bit less) when that task invoked
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes
>>>>>>>>>>>>>
>>>>>>>>>>>>>> synchronize_rcu_expedited().  What is the relation, if any,
>>>>>>>>>>>>>> between this task and "task1" below?  Seems like you want them to
>>>>>>>>>>>>>> be different tasks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This task is the one which is waiting for the expedited sequence, which
>>>>>>>>>>>>> "task1" completes ("task1" holds the exp_mutex for it). "task1" would
>>>>>>>>>>>>> wake up this task, on exp GP completion.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Does this task actually block, or is it just getting ready
>>>>>>>>>>>>>> to block?  Seems like you need it to have actually blocked.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, it actually blocked in wait queue.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. The Exp GP completes and task (task1) holding exp_mutex queues
>>>>>>>>>>>>>>>           worker and schedules out.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "The Exp GP" being the one that was initiated when the .expedited_sequence
>>>>>>>>>>>>>> counter was zero, correct?  (Looks that way below.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, correct.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>           _synchronize_rcu_expedited()
>>>>>>>>>>>>>>>             s = 0b'100
>>>>>>>>>>>>>>>             queue_work(rcu_gp_wq, &rew.rew_work)
>>>>>>>>>>>>>>>               wake_up_worker()
>>>>>>>>>>>>>>>                 schedule()
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3. kworker A picks up the queued work and completes the exp gp
>>>>>>>>>>>>>>>           sequence.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>           rcu_exp_wait_wake()
>>>>>>>>>>>>>>>             rcu_exp_wait_wake()
>>>>>>>>>>>>>>>               rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>>>>>>>>>>>>>>>                                       // to 0b'100'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>>>>>>>>>>>>>>>           and releases exp_mutex.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>           wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>>>>>>>>>>>>>>             sync_exp_work_done(rsp, s));
>>>>>>>>>>>>>>>           mutex_unlock(&rsp->exp_mutex);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So task1 is the one that initiated the expedited grace period that
>>>>>>>>>>>>>> started when .expedited_sequence was zero, right?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, right.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 5. Next exp GP completes, and sequence number is incremented:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>           rcu_exp_wait_wake()
>>>>>>>>>>>>>>>             rcu_exp_wait_wake()
>>>>>>>>>>>>>>>               rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 6. As kworker A uses current expedited_sequence, it wakes up workers
>>>>>>>>>>>>>>>           from wrong wait queue index - it should have worken wait queue
>>>>>>>>>>>>>>>           corresponding to 0b'100' sequence, but wakes up the ones for
>>>>>>>>>>>>>>>           0b'200' sequence. This results in task at step 1 indefinitely blocked.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>           rcu_exp_wait_wake()
>>>>>>>>>>>>>>>             wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So the issue is that the next expedited RCU grace period might
>>>>>>>>>>>>>> have completed before the completion of the wakeups for the previous
>>>>>>>>>>>>>> expedited RCU grace period, correct?  Then expedited grace periods have
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes. Actually from the ftraces, I saw that next expedited RCU grace
>>>>>>>>>>>>> period completed while kworker A was in D state, while waiting for
>>>>>>>>>>>>> exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
>>>>>>>>>>>>> its wake_up_all() call; so, task (point 1) was never woken up, as it was
>>>>>>>>>>>>> waiting on wq index 1.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> to have stopped to prevent any future wakeup from happening, correct?
>>>>>>>>>>>>>> (Which would make it harder for rcutorture to trigger this, though it
>>>>>>>>>>>>>> really does have code that attempts to trigger this sort of thing.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is this theoretical in nature, or have you actually triggered it?
>>>>>>>>>>>>>> If actually triggered, what did you do to make this happen?
>>>>>>>>>>>>>
>>>>>>>>>>>>> This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
>>>>>>>>>>>>> another instance in Nov 2018 (on 4.14 kernel), in our customer reported
>>>>>>>>>>>>> issues. Both instances were in downstream drivers and we didn't have RCU
>>>>>>>>>>>>> traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
>>>>>>>>>>>>> enabled, where it was observed in suspend scenario, where we are observing
>>>>>>>>>>>>> "DPM device timeout" [1], as scsi device is stuck in
>>>>>>>>>>>>> _synchronize_rcu_expedited().
>>>>>>>>>>>>>
>>>>>>>>>>>>> schedule+0x70/0x90
>>>>>>>>>>>>> _synchronize_rcu_expedited+0x590/0x5f8
>>>>>>>>>>>>> synchronize_rcu+0x50/0xa0
>>>>>>>>>>>>> scsi_device_quiesce+0x50/0x120
>>>>>>>>>>>>> scsi_bus_suspend+0x70/0xe8
>>>>>>>>>>>>> dpm_run_callback+0x148/0x388
>>>>>>>>>>>>> __device_suspend+0x430/0x8a8
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
>>>>>>>>>>>>>
>>>>>>>>>>>>>> What have you done to test the change?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have given this for testing; will share the results . Current analysis
>>>>>>>>>>>>> and patch is based on going through ftrace and code review.
>>>>>>>>>>>>
>>>>>>>>>>>> OK, very good.  Please include the failure information in the changelog
>>>>>>>>>>>> of the next version of this patch.
>>>>>>>>>
>>>>>>>>> Done.
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I prefer your original patch, that just uses "s", over the one below
>>>>>>>>>>>> that moves the rcu_exp_gp_seq_end().  The big advantage of your original
>>>>>>>>>>>> patch is that it allow more concurrency between a consecutive pair of
>>>>>>>>>>>> expedited RCU grace periods.  Plus it would not be easy to convince
>>>>>>>>>>>> myself that moving rcu_exp_gp_seq_end() down is safe, so your original
>>>>>>>>>>>> is also conceptually simpler with a more manageable state space.
>>>>>>>>>
>>>>>>>>> The reason for highlighting the alternate approach of doing gp end inside
>>>>>>>>> exp_wake_mutex is the requirement of 3 wqs. Now, this is a theoretical case;
>>>>>>>>> please correct me if I am wrong here:
>>>>>>>>>
>>>>>>>>> 1. task0 holds exp_wake_mutex, and is preempted.
>>>>>>>>
>>>>>>>> Presumably after it has awakened the kthread that initiated the prior
>>>>>>>> expedited grace period (the one with seq number = -4).
>>>>>>>>
>>>>>>>>> 2. task1 initiates new GP (current seq number = 0).
>>>>>>>>
>>>>>>>> Yes, this can happen.
>>>>>>>>
>>>>>>>>> 3. task1 queues worker kworker1 and schedules out.
>>>>>>>>
>>>>>>>> And thus still holds .exp_mutex, but yes.
>>>>>>>>
>>>>>>>>> 4. kworker1 sets exp GP to 1 and waits on exp_wake_mutex
>>>>>>>>
>>>>>>>> And thus cannot yet have awakened task1.
>>>>>>>>
>>>>>>>>> 5. task1 releases exp mutex, w/o entering waitq.
>>>>>>>>
>>>>>>>> So I do not believe that we can get to #5.  What am I missing here?
>>>>>>>>
>>>>>>>
>>>>>>> As mentioned in this patch, task1 could have scheduled out after queuing
>>>>>>> work:
>>>>>>>
>>>>>>> queue_work(rcu_gp_wq, &rew.rew_work)
>>>>>>>               wake_up_worker()
>>>>>>>                 schedule()
>>>>>>>
>>>>>>> kworker1 runs and picks up this queued work, and sets exp GP to 1 and waits
>>>>>>> on exp_wake_mutex.
>>>>>>>
>>>>>>> task1 gets scheduled in and checks sync_exp_work_done(rsp, s), which return
>>>>>>> true and it does not enter wait queue and releases exp_mutex.
>>>>>>>
>>>>>>> wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>>>>>>             sync_exp_work_done(rsp, s));
>>>>>>
>>>>>> Well, I have certainly given enough people a hard time about missing the
>>>>>> didn't-actually-sleep case, so good show on finding one in my code!  ;-)
>>>>>>
>>>>>> Which also explains why deferring the rcu_exp_gp_seq_end() is safe:
>>>>>> The .exp_mutex won't be released until after it happens, and the
>>>>>> next manipulation of the sequence number cannot happen until after
>>>>>> .exp_mutex is next acquired.
>>>>>>
>>>>>> Good catch!  And keep up the good work!!!
>>>>>
>>>>> And here is the commit corresponding to your earlier patch.  Please let
>>>>> me know of any needed adjustments.
>>>>>
>>>>> 							Thanx, Paul
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> commit 3ec440b52831eea172061c5db3d2990b22904863
>>>>> Author: Neeraj Upadhyay <neeraju@codeaurora.org>
>>>>> Date:   Tue Nov 19 11:50:52 2019 -0800
>>>>>
>>>>>        rcu: Allow only one expedited GP to run concurrently with wakeups
>>>>>        The current expedited RCU grace-period code expects that a task
>>>>>        requesting an expedited grace period cannot awaken until that grace
>>>>>        period has reached the wakeup phase.  However, it is possible for a long
>>>>>        preemption to result in the waiting task never sleeping.  For example,
>>>>>        consider the following sequence of events:
>>>>>        1.      Task A starts an expedited grace period by invoking
>>>>>                synchronize_rcu_expedited().  It proceeds normally up to the
>>>>>                wait_event() near the end of that function, and is then preempted
>>>>>                (or interrupted or whatever).
>>>>>        2.      The expedited grace period completes, and a kworker task starts
>>>>>                the awaken phase, having incremented the counter and acquired
>>>>>                the rcu_state structure's .exp_wake_mutex.  This kworker task
>>>>>                is then preempted or interrupted or whatever.
>>>>>        3.      Task A resumes and enters wait_event(), which notes that the
>>>>>                expedited grace period has completed, and thus doesn't sleep.
>>>>>        4.      Task B starts an expedited grace period exactly as did Task A,
>>>>>                complete with the preemption (or whatever delay) just before
>>>>>                the call to wait_event().
>>>>>        5.      The expedited grace period completes, and another kworker
>>>>>                task starts the awaken phase, having incremented the counter.
>>>>>                However, it blocks when attempting to acquire the rcu_state
>>>>>                structure's .exp_wake_mutex because step 2's kworker task has
>>>>>                not yet released it.
>>>>>        6.      Steps 4 and 5 repeat, resulting in overflow of the rcu_node
>>>>>                structure's ->exp_wq[] array.
>>>>>        In theory, this is harmless.  Tasks waiting on the various ->exp_wq[]
>>>>>        arrays will just be spuriously awakened, but they will simply sleep again
>>>>>        on noting that the rcu_state structure's ->expedited_sequence value has
>>>>>        not advanced far enough.
>>>>>        In practice, this wastes CPU time and is an accident waiting to happen.
>>>>>        This commit therefore moves the rcu_exp_gp_seq_end() call that officially
>>>>>        ends the expedited grace period (along with associated tracing) until
>>>>>        after the ->exp_wake_mutex has been acquired.  This prevents Task A from
>>>>>        awakening prematurely, thus preventing more than one expedited grace
>>>>>        period from being in flight during a previous expedited grace period's
>>>>>        wakeup phase.
>>>>
>>>> I am not sure if a "Fixes:" tag is required for it.
>>>
>>> If you have a suggested commit, I would be happy to add it.
>>>
>>
>> I think it is either of the 2 below - the first one is the one touching the
>> tree_exp.h file, the second one looks to be the original commit.
>>
>> 1. https://github.com/torvalds/linux/commit/3549c2bc2c4ea8ecfeb9d21cb81cb00c6002b011
>>
>> 2. https://github.com/torvalds/linux/commit/3b5f668e715bc19610ad967ef97a7e8c55a186ec
> 
> Agreed, this second commit is the one that introduced the bug.  I placed
> "Fixes:" tags on both of your commits for this one.  And thank you for
> digging them both up!
> 
> 							Thanx, Paul
> 

No problem.

>> Thanks
>> Neeraj
>>
>>>>>        Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>>>>>        [ paulmck: Added updated comment. ]
>>>>>        Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>>>>>
>>>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>>>> index 4433d00a..8840729 100644
>>>>> --- a/kernel/rcu/tree_exp.h
>>>>> +++ b/kernel/rcu/tree_exp.h
>>>>> @@ -539,14 +539,13 @@ static void rcu_exp_wait_wake(unsigned long s)
>>>>>     	struct rcu_node *rnp;
>>>>>     	synchronize_sched_expedited_wait();
>>>>> -	rcu_exp_gp_seq_end();
>>>>> -	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
>>>>> -	/*
>>>>> -	 * Switch over to wakeup mode, allowing the next GP, but -only- the
>>>>> -	 * next GP, to proceed.
>>>>> -	 */
>>>>> +	// Switch over to wakeup mode, allowing the next GP to proceed.
>>>>> +	// End the previous grace period only after acquiring the mutex
>>>>> +	// to ensure that only one GP runs concurrently with wakeups.
>>>>
>>>> Should comment style be changed to below?
>>>>
>>>> /* Switch over to wakeup mode, allowing the next GP to proceed.
>>>>    * End the previous grace period only after acquiring the mutex
>>>>    * to ensure that only one GP runs concurrently with wakeups.
>>>>    */
>>>
>>> No.  "//" is acceptable comment format, aside from docbook headers.
>>> The "//" approach saves three characters per line compared to "/* ... */"
>>> single-line comments and a line compared to the style you show above.
>>>
>>> So yes, some maintainers prefer the style you show, but not me.
>>>
>>> 							Thanx, Paul
>>>

Got it.


Thanks
Neeraj

>>>>>     	mutex_lock(&rcu_state.exp_wake_mutex);
>>>>> +	rcu_exp_gp_seq_end();
>>>>> +	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("end"));
>>>>>     	rcu_for_each_node_breadth_first(rnp) {
>>>>>     		if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s)) {
>>>>>
>>>>
>>>> -- 
>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>>>> the Code Aurora Forum, hosted by The Linux Foundation
>>
>> -- 
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>> the Code Aurora Forum, hosted by The Linux Foundation

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2019-11-22  3:44 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-15 17:28 [PATCH] rcu: Fix missed wakeup of exp_wq waiters Neeraj Upadhyay
2019-11-17 21:36 ` Paul E. McKenney
2019-11-18  9:28   ` Neeraj Upadhyay
     [not found]   ` <0101016e7dd7b824-50600058-ab5e-44d8-8d24-94cf095f1783-000000@us-west-2.amazonses.com>
2019-11-18 15:08     ` Paul E. McKenney
2019-11-18 16:41       ` Neeraj Upadhyay
     [not found]       ` <0101016e7f644106-90b40f19-ba9e-4974-bdbe-1062b52222a2-000000@us-west-2.amazonses.com>
2019-11-18 17:24         ` Paul E. McKenney
2019-11-19  3:35           ` Neeraj Upadhyay
     [not found]           ` <0101016e81ba8865-028ac62b-3fcd-481b-b657-be8519c4edf5-000000@us-west-2.amazonses.com>
2019-11-19  4:05             ` Paul E. McKenney
2019-11-19  7:03               ` Neeraj Upadhyay
2019-11-19 15:09                 ` Paul E. McKenney
2019-11-19 20:06                   ` Paul E. McKenney
2019-11-20  5:09                     ` Neeraj Upadhyay
     [not found]                     ` <eeb7de82-c24b-c914-05fb-683088c9f8e2@codeaurora.org>
2019-11-21  4:10                       ` Paul E. McKenney
2019-11-21  4:36                         ` Neeraj Upadhyay
     [not found]                         ` <71de1f25-75c5-a4bb-49d9-19af5f8ae4fe@codeaurora.org>
2019-11-21 18:55                           ` Paul E. McKenney
2019-11-22  3:44                             ` Neeraj Upadhyay

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).