RCU Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v2] rcu: Fix missed wakeup of exp_wq waiters
@ 2019-11-19  3:17 Neeraj Upadhyay
  2019-11-19 19:38 ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Neeraj Upadhyay @ 2019-11-19  3:17 UTC (permalink / raw)
  To: paulmck, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel
  Cc: linux-kernel, gkohli, prsood, pkondeti, rcu, Neeraj Upadhyay

For the tasks waiting in exp_wq inside exp_funnel_lock(),
there is a chance that they might be indefinitely blocked
in below scenario:

1. There is a task waiting on exp sequence 0b'100' inside
   exp_funnel_lock(). This task blocks at wq index 1.

   synchronize_rcu_expedited()
     s = 0b'100'
     exp_funnel_lock()
       wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]

2. The expedited grace period (which above task blocks for)
   completes and task (task1) holding exp_mutex queues
   worker and schedules out.

   synchronize_rcu_expedited()
     s = 0b'100'
     queue_work(rcu_gp_wq, &rew.rew_work)
       wake_up_worker()
         schedule()

3. kworker A picks up the queued work and completes the exp gp
   sequence and then blocks on exp_wake_mutex, which is held
   by another kworker, which is doing wakeups for expedited_sequence
   0.

   rcu_exp_wait_wake()
     rcu_exp_wait_wake()
       rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
                               // to 0b'100'
       mutex_lock(&rcu_state.exp_wake_mutex)

4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
   and releases exp_mutex.

   wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
     sync_exp_work_done(rsp, s));
   mutex_unlock(&rsp->exp_mutex);

5. Next exp GP completes, and sequence number is incremented:

   rcu_exp_wait_wake()
     rcu_exp_wait_wake()
       rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'

6. kworker A acquires exp_wake_mutex. As it uses current
   expedited_sequence, it wakes up workers from wrong wait queue
   index - it should have worken wait queue corresponding to
   0b'100' sequence, but wakes up the ones for 0b'200' sequence.
   This results in task at step 1 indefinitely blocked.

   rcu_exp_wait_wake()
     wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);

This issue manifested as DPM device timeout during suspend, as scsi
device was stuck in _synchronize_rcu_expedited().

schedule()
synchronize_rcu_expedited()
synchronize_rcu()
scsi_device_quiesce()
scsi_bus_suspend()
dpm_run_callback()
__device_suspend()

Fix this by using the correct exp sequence number, the one which
owner of the exp_mutex initiated and passed to kworker,
to index the wait queue, inside rcu_exp_wait_wake().

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
---

Changes since v1:
 - Updates the commit log with failure information.

 kernel/rcu/tree_exp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index e4b77d3..28979d3 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
 			spin_unlock(&rnp->exp_lock);
 		}
 		smp_mb(); /* All above changes before wakeup. */
-		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
+		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
 	}
 	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
 	mutex_unlock(&rcu_state.exp_wake_mutex);
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] rcu: Fix missed wakeup of exp_wq waiters
  2019-11-19  3:17 [PATCH v2] rcu: Fix missed wakeup of exp_wq waiters Neeraj Upadhyay
@ 2019-11-19 19:38 ` Paul E. McKenney
  2019-11-20  4:58   ` Neeraj Upadhyay
       [not found]   ` <25499aba-04a1-8d03-e2d9-fc89d7794b66@codeaurora.org>
  0 siblings, 2 replies; 7+ messages in thread
From: Paul E. McKenney @ 2019-11-19 19:38 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, rostedt, mathieu.desnoyers, jiangshanlai, joel,
	linux-kernel, gkohli, prsood, pkondeti, rcu

On Tue, Nov 19, 2019 at 03:17:07AM +0000, Neeraj Upadhyay wrote:
> For the tasks waiting in exp_wq inside exp_funnel_lock(),
> there is a chance that they might be indefinitely blocked
> in below scenario:
> 
> 1. There is a task waiting on exp sequence 0b'100' inside
>    exp_funnel_lock(). This task blocks at wq index 1.
> 
>    synchronize_rcu_expedited()
>      s = 0b'100'
>      exp_funnel_lock()
>        wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
> 
> 2. The expedited grace period (which above task blocks for)
>    completes and task (task1) holding exp_mutex queues
>    worker and schedules out.
> 
>    synchronize_rcu_expedited()
>      s = 0b'100'
>      queue_work(rcu_gp_wq, &rew.rew_work)
>        wake_up_worker()
>          schedule()
> 
> 3. kworker A picks up the queued work and completes the exp gp
>    sequence and then blocks on exp_wake_mutex, which is held
>    by another kworker, which is doing wakeups for expedited_sequence
>    0.
> 
>    rcu_exp_wait_wake()
>      rcu_exp_wait_wake()
>        rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>                                // to 0b'100'
>        mutex_lock(&rcu_state.exp_wake_mutex)
> 
> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>    and releases exp_mutex.
> 
>    wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>      sync_exp_work_done(rsp, s));
>    mutex_unlock(&rsp->exp_mutex);
> 
> 5. Next exp GP completes, and sequence number is incremented:
> 
>    rcu_exp_wait_wake()
>      rcu_exp_wait_wake()
>        rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
> 
> 6. kworker A acquires exp_wake_mutex. As it uses current
>    expedited_sequence, it wakes up workers from wrong wait queue
>    index - it should have worken wait queue corresponding to
>    0b'100' sequence, but wakes up the ones for 0b'200' sequence.
>    This results in task at step 1 indefinitely blocked.
> 
>    rcu_exp_wait_wake()
>      wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> 
> This issue manifested as DPM device timeout during suspend, as scsi
> device was stuck in _synchronize_rcu_expedited().
> 
> schedule()
> synchronize_rcu_expedited()
> synchronize_rcu()
> scsi_device_quiesce()
> scsi_bus_suspend()
> dpm_run_callback()
> __device_suspend()
> 
> Fix this by using the correct exp sequence number, the one which
> owner of the exp_mutex initiated and passed to kworker,
> to index the wait queue, inside rcu_exp_wait_wake().
> 
> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>

Queued, thank you!

I reworked the commit message to make it easier to follow the sequence
of events.  Please see below and let me know if I messed anything up.

							Thanx, Paul

------------------------------------------------------------------------

commit d887fd2a66861f51ed93b5dde894b9646a5569dd
Author: Neeraj Upadhyay <neeraju@codeaurora.org>
Date:   Tue Nov 19 03:17:07 2019 +0000

    rcu: Fix missed wakeup of exp_wq waiters
    
    Tasks waiting within exp_funnel_lock() for an expedited grace period to
    elapse can be starved due to the following sequence of events:
    
    1.      Tasks A and B both attempt to start an expedited grace
            period at about the same time.  This grace period will have
            completed when the lower four bits of the rcu_state structure's
            ->expedited_sequence field are 0b'0100', for example, when the
            initial value of this counter is zero.  Task A wins, and thus
            does the actual work of starting the grace period, including
            acquiring the rcu_state structure's .exp_mutex and sets the
            counter to 0b'0001'.
    
    2.      Because task B lost the race to start the grace period, it
            waits on ->expedited_sequence to reach 0b'0100' inside of
            exp_funnel_lock(). This task therefore blocks on the rcu_node
            structure's ->exp_wq[1] field, keeping in mind that the
            end-of-grace-period value of ->expedited_sequence (0b'0100')
            is shifted down two bits before indexing the ->exp_wq[] field.
    
    3.      Task C attempts to start another expedited grace period,
            but blocks on ->exp_mutex, which is still held by Task A.
    
    4.      The aforementioned expedited grace period completes, so that
            ->expedited_sequence now has the value 0b'0100'.  A kworker task
            therefore acquires the rcu_state structure's ->exp_wake_mutex
            and starts awakening any tasks waiting for this grace period.
    
    5.      One of the first tasks awakened happens to be Task A.  Task A
            therefore releases the rcu_state structure's ->exp_mutex,
            which allows Task C to start the next expedited grace period,
            which causes the lower four bits of the rcu_state structure's
            ->expedited_sequence field to become 0b'0101'.
    
    6.      Task C's expedited grace period completes, so that the lower four
            bits of the rcu_state structure's ->expedited_sequence field now
            become 0b'1000'.
    
    7.      The kworker task from step 4 above continues its wakeups.
            Unfortunately, the wake_up_all() refetches the rcu_state
            structure's .expedited_sequence field:
    
            wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
    
            This results in the wakeup being applied to the rcu_node
            structure's ->exp_wq[2] field, which is unfortunate given that
            Task B is instead waiting on ->exp_wq[1].
    
    On a busy system, no harm is done (or at least no permanent harm is done).
    Some later expedited grace period will redo the wakeup.  But on a quiet
    system, such as many embedded systems, it might be a good long time before
    there was another expedited grace period.  On such embedded systems,
    this situation could therefore result in a system hang.
    
    This issue manifested as DPM device timeout during suspend (which
    usually qualifies as a quiet time) due to a SCSI device being stuck in
    _synchronize_rcu_expedited(), with the following stack trace:
    
            schedule()
            synchronize_rcu_expedited()
            synchronize_rcu()
            scsi_device_quiesce()
            scsi_bus_suspend()
            dpm_run_callback()
            __device_suspend()
    
    This commit therefore prevents such delays, timeouts, and hangs by
    making rcu_exp_wait_wake() use its "s" argument consistently instead of
    refetching from rcu_state.expedited_sequence.
    
    Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 6ce598d..4433d00a 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
 			spin_unlock(&rnp->exp_lock);
 		}
 		smp_mb(); /* All above changes before wakeup. */
-		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
+		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
 	}
 	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
 	mutex_unlock(&rcu_state.exp_wake_mutex);

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] rcu: Fix missed wakeup of exp_wq waiters
  2019-11-19 19:38 ` Paul E. McKenney
@ 2019-11-20  4:58   ` Neeraj Upadhyay
       [not found]   ` <25499aba-04a1-8d03-e2d9-fc89d7794b66@codeaurora.org>
  1 sibling, 0 replies; 7+ messages in thread
From: Neeraj Upadhyay @ 2019-11-20  4:58 UTC (permalink / raw)
  To: paulmck
  Cc: josh, rostedt, mathieu.desnoyers, jiangshanlai, joel,
	linux-kernel, gkohli, prsood, pkondeti, rcu


On 11/20/2019 1:08 AM, Paul E. McKenney wrote:
> On Tue, Nov 19, 2019 at 03:17:07AM +0000, Neeraj Upadhyay wrote:
>> For the tasks waiting in exp_wq inside exp_funnel_lock(),
>> there is a chance that they might be indefinitely blocked
>> in below scenario:
>>
>> 1. There is a task waiting on exp sequence 0b'100' inside
>>     exp_funnel_lock(). This task blocks at wq index 1.
>>
>>     synchronize_rcu_expedited()
>>       s = 0b'100'
>>       exp_funnel_lock()
>>         wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
>>
>> 2. The expedited grace period (which above task blocks for)
>>     completes and task (task1) holding exp_mutex queues
>>     worker and schedules out.
>>
>>     synchronize_rcu_expedited()
>>       s = 0b'100'
>>       queue_work(rcu_gp_wq, &rew.rew_work)
>>         wake_up_worker()
>>           schedule()
>>
>> 3. kworker A picks up the queued work and completes the exp gp
>>     sequence and then blocks on exp_wake_mutex, which is held
>>     by another kworker, which is doing wakeups for expedited_sequence
>>     0.
>>
>>     rcu_exp_wait_wake()
>>       rcu_exp_wait_wake()
>>         rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>>                                 // to 0b'100'
>>         mutex_lock(&rcu_state.exp_wake_mutex)
>>
>> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>>     and releases exp_mutex.
>>
>>     wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>       sync_exp_work_done(rsp, s));
>>     mutex_unlock(&rsp->exp_mutex);
>>
>> 5. Next exp GP completes, and sequence number is incremented:
>>
>>     rcu_exp_wait_wake()
>>       rcu_exp_wait_wake()
>>         rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
>>
>> 6. kworker A acquires exp_wake_mutex. As it uses current
>>     expedited_sequence, it wakes up workers from wrong wait queue
>>     index - it should have worken wait queue corresponding to
>>     0b'100' sequence, but wakes up the ones for 0b'200' sequence.
>>     This results in task at step 1 indefinitely blocked.
>>
>>     rcu_exp_wait_wake()
>>       wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>
>> This issue manifested as DPM device timeout during suspend, as scsi
>> device was stuck in _synchronize_rcu_expedited().
>>
>> schedule()
>> synchronize_rcu_expedited()
>> synchronize_rcu()
>> scsi_device_quiesce()
>> scsi_bus_suspend()
>> dpm_run_callback()
>> __device_suspend()
>>
>> Fix this by using the correct exp sequence number, the one which
>> owner of the exp_mutex initiated and passed to kworker,
>> to index the wait queue, inside rcu_exp_wait_wake().
>>
>> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> 
> Queued, thank you!
> 
> I reworked the commit message to make it easier to follow the sequence
> of events.  Please see below and let me know if I messed anything up.
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit d887fd2a66861f51ed93b5dde894b9646a5569dd
> Author: Neeraj Upadhyay <neeraju@codeaurora.org>
> Date:   Tue Nov 19 03:17:07 2019 +0000
> 
>      rcu: Fix missed wakeup of exp_wq waiters
>      
>      Tasks waiting within exp_funnel_lock() for an expedited grace period to
>      elapse can be starved due to the following sequence of events:
>      
>      1.      Tasks A and B both attempt to start an expedited grace
>              period at about the same time.  This grace period will have
>              completed when the lower four bits of the rcu_state structure's
>              ->expedited_sequence field are 0b'0100', for example, when the
>              initial value of this counter is zero.  Task A wins, and thus
>              does the actual work of starting the grace period, including
>              acquiring the rcu_state structure's .exp_mutex and sets the
>              counter to 0b'0001'.
>      
>      2.      Because task B lost the race to start the grace period, it
>              waits on ->expedited_sequence to reach 0b'0100' inside of
>              exp_funnel_lock(). This task therefore blocks on the rcu_node
>              structure's ->exp_wq[1] field, keeping in mind that the
>              end-of-grace-period value of ->expedited_sequence (0b'0100')
>              is shifted down two bits before indexing the ->exp_wq[] field.
>      
>      3.      Task C attempts to start another expedited grace period,
>              but blocks on ->exp_mutex, which is still held by Task A.
>      
>      4.      The aforementioned expedited grace period completes, so that
>              ->expedited_sequence now has the value 0b'0100'.  A kworker task
>              therefore acquires the rcu_state structure's ->exp_wake_mutex
>              and starts awakening any tasks waiting for this grace period.
>      
>      5.      One of the first tasks awakened happens to be Task A.  Task A
>              therefore releases the rcu_state structure's ->exp_mutex,
>              which allows Task C to start the next expedited grace period,
>              which causes the lower four bits of the rcu_state structure's
>              ->expedited_sequence field to become 0b'0101'.
>      
>      6.      Task C's expedited grace period completes, so that the lower four
>              bits of the rcu_state structure's ->expedited_sequence field now
>              become 0b'1000'.
>      
>      7.      The kworker task from step 4 above continues its wakeups.
>              Unfortunately, the wake_up_all() refetches the rcu_state
>              structure's .expedited_sequence field:

This might not be true. I think wake_up_all(), which internally calls 
__wake_up(), will use a single wq_head for doing all wakeups. So, a 
single .expedited_sequence value would be used to get wq_head.

void __wake_up(struct wait_queue_head *wq_head, ...)

However, below sequence of events would result in problem:

1.      Tasks A starts an expedited grace period at about the same time.
         This grace period will have completed when the lower four bits
		of the rcu_state structure's ->expedited_sequence field are 0b'0100',
		for example, when the initial value of this counter is zero.
		Task A wins, acquires the rcu_state structure's .exp_mutex and
		sets the counter to 0b'0001'.

2.      The aforementioned expedited grace period completes, so that
         ->expedited_sequence now has the value 0b'0100'.  A kworker task
         therefore acquires the rcu_state structure's ->exp_wake_mutex
         and starts awakening any tasks waiting for this grace period.
         This kworker gets preempted while unlocking wq_head lock

         wake_up_all()
           __wake_up()
             __wake_up_common_lock()
               spin_unlock_irqrestore()
                 __raw_spin_unlock_irqrestore()
                   preempt_enable()
                     __preempt_schedule()

3.      One of the first tasks awakened happens to be Task A.  Task A
         therefore releases the rcu_state structure's ->exp_mutex,

4.      Tasks B and C both attempt to start an expedited grace
         period at about the same time.  This grace period will have
         completed when the lower four bits of the rcu_state structure's
         ->expedited_sequence field are 0b'1000'. Task B wins, and thus
         does the actual work of starting the grace period, including
         acquiring the rcu_state structure's .exp_mutex and sets the
         counter to 0b'0101'.

5.      Because task C lost the race to start the grace period, it
         waits on ->expedited_sequence to reach 0b'1000' inside of
         exp_funnel_lock(). This task therefore blocks on the rcu_node
         structure's ->exp_wq[2] field, keeping in mind that the
         end-of-grace-period value of ->expedited_sequence (0b'1000')
         is shifted down two bits before indexing the ->exp_wq[] field.

6.      Task B queues work to complete expedited grace period. This
         task is preempted just before wait_event call. Kworker task picks
		up the work queued by task B and and completes grace period, so
		that the lower four bits of the rcu_state structure's
		->expedited_sequence field now become 0b'1000'. This kworker starts
		waiting on the exp_wake_mutex, which is owned by kworker doing
		wakeups for expedited sequence initiated by task A.

7.      Task B schedules in and finds its expedited sequence snapshot has
         completed; so, it does not enter waitq and releases exp_mutex. This
		allows Task D to start the next expedited grace period,
         which causes the lower four bits of the rcu_state structure's
         ->expedited_sequence field to become 0b'1001'.

8.      Task D's expedited grace period completes, so that the lower four
         bits of the rcu_state structure's ->expedited_sequence field now
         become 0b'1100'.

9.      kworker from step 2 is scheduled in and releases exp_wake_mutex;
         kworker correspnding to Task B's expedited grace period acquires
		exp_wake_mutex and starts wakeups. Unfortunately, it used the
		rcu_state structure's .expedited_sequence field for determining
		the waitq index.

 
wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);

         This results in the wakeup being applied to the rcu_node
         structure's ->exp_wq[3] field, which is unfortunate given that
         Task C is instead waiting on ->exp_wq[2].


>      
>              wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>      
>              This results in the wakeup being applied to the rcu_node
>              structure's ->exp_wq[2] field, which is unfortunate given that
>              Task B is instead waiting on ->exp_wq[1].
>      
>      On a busy system, no harm is done (or at least no permanent harm is done).
>      Some later expedited grace period will redo the wakeup.  But on a quiet
>      system, such as many embedded systems, it might be a good long time before
>      there was another expedited grace period.  On such embedded systems,
>      this situation could therefore result in a system hang.
>      
>      This issue manifested as DPM device timeout during suspend (which
>      usually qualifies as a quiet time) due to a SCSI device being stuck in
>      _synchronize_rcu_expedited(), with the following stack trace:
>      
>              schedule()
>              synchronize_rcu_expedited()
>              synchronize_rcu()
>              scsi_device_quiesce()
>              scsi_bus_suspend()
>              dpm_run_callback()
>              __device_suspend()
>      
>      This commit therefore prevents such delays, timeouts, and hangs by
>      making rcu_exp_wait_wake() use its "s" argument consistently instead of
>      refetching from rcu_state.expedited_sequence.
>      

Do we need a "fixes" tag here?

>      Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>      Signed-off-by: Paul E. McKenney <paulmck@kernel.org
> 
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 6ce598d..4433d00a 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
>   			spin_unlock(&rnp->exp_lock);
>   		}
>   		smp_mb(); /* All above changes before wakeup. */
> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
>   	}
>   	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
>   	mutex_unlock(&rcu_state.exp_wake_mutex);
> 

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] rcu: Fix missed wakeup of exp_wq waiters
       [not found]   ` <25499aba-04a1-8d03-e2d9-fc89d7794b66@codeaurora.org>
@ 2019-11-21  4:07     ` Paul E. McKenney
  2019-11-21  4:18       ` Neeraj Upadhyay
       [not found]       ` <8536f970-d22a-885c-8203-54a7c15a4f1f@codeaurora.org>
  0 siblings, 2 replies; 7+ messages in thread
From: Paul E. McKenney @ 2019-11-21  4:07 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, rostedt, mathieu.desnoyers, jiangshanlai, joel,
	linux-kernel, gkohli, prsood, pkondeti, rcu

On Wed, Nov 20, 2019 at 10:28:38AM +0530, Neeraj Upadhyay wrote:
> 
> On 11/20/2019 1:08 AM, Paul E. McKenney wrote:
> > On Tue, Nov 19, 2019 at 03:17:07AM +0000, Neeraj Upadhyay wrote:
> > > For the tasks waiting in exp_wq inside exp_funnel_lock(),
> > > there is a chance that they might be indefinitely blocked
> > > in below scenario:
> > > 
> > > 1. There is a task waiting on exp sequence 0b'100' inside
> > >     exp_funnel_lock(). This task blocks at wq index 1.
> > > 
> > >     synchronize_rcu_expedited()
> > >       s = 0b'100'
> > >       exp_funnel_lock()
> > >         wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
> > > 
> > > 2. The expedited grace period (which above task blocks for)
> > >     completes and task (task1) holding exp_mutex queues
> > >     worker and schedules out.
> > > 
> > >     synchronize_rcu_expedited()
> > >       s = 0b'100'
> > >       queue_work(rcu_gp_wq, &rew.rew_work)
> > >         wake_up_worker()
> > >           schedule()
> > > 
> > > 3. kworker A picks up the queued work and completes the exp gp
> > >     sequence and then blocks on exp_wake_mutex, which is held
> > >     by another kworker, which is doing wakeups for expedited_sequence
> > >     0.
> > > 
> > >     rcu_exp_wait_wake()
> > >       rcu_exp_wait_wake()
> > >         rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
> > >                                 // to 0b'100'
> > >         mutex_lock(&rcu_state.exp_wake_mutex)
> > > 
> > > 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
> > >     and releases exp_mutex.
> > > 
> > >     wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > >       sync_exp_work_done(rsp, s));
> > >     mutex_unlock(&rsp->exp_mutex);
> > > 
> > > 5. Next exp GP completes, and sequence number is incremented:
> > > 
> > >     rcu_exp_wait_wake()
> > >       rcu_exp_wait_wake()
> > >         rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
> > > 
> > > 6. kworker A acquires exp_wake_mutex. As it uses current
> > >     expedited_sequence, it wakes up workers from wrong wait queue
> > >     index - it should have worken wait queue corresponding to
> > >     0b'100' sequence, but wakes up the ones for 0b'200' sequence.
> > >     This results in task at step 1 indefinitely blocked.
> > > 
> > >     rcu_exp_wait_wake()
> > >       wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > 
> > > This issue manifested as DPM device timeout during suspend, as scsi
> > > device was stuck in _synchronize_rcu_expedited().
> > > 
> > > schedule()
> > > synchronize_rcu_expedited()
> > > synchronize_rcu()
> > > scsi_device_quiesce()
> > > scsi_bus_suspend()
> > > dpm_run_callback()
> > > __device_suspend()
> > > 
> > > Fix this by using the correct exp sequence number, the one which
> > > owner of the exp_mutex initiated and passed to kworker,
> > > to index the wait queue, inside rcu_exp_wait_wake().
> > > 
> > > Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> > 
> > Queued, thank you!
> > 
> > I reworked the commit message to make it easier to follow the sequence
> > of events.  Please see below and let me know if I messed anything up.
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > commit d887fd2a66861f51ed93b5dde894b9646a5569dd
> > Author: Neeraj Upadhyay <neeraju@codeaurora.org>
> > Date:   Tue Nov 19 03:17:07 2019 +0000
> > 
> >      rcu: Fix missed wakeup of exp_wq waiters
> >      Tasks waiting within exp_funnel_lock() for an expedited grace period to
> >      elapse can be starved due to the following sequence of events:
> >      1.      Tasks A and B both attempt to start an expedited grace
> >              period at about the same time.  This grace period will have
> >              completed when the lower four bits of the rcu_state structure's
> >              ->expedited_sequence field are 0b'0100', for example, when the
> >              initial value of this counter is zero.  Task A wins, and thus
> >              does the actual work of starting the grace period, including
> >              acquiring the rcu_state structure's .exp_mutex and sets the
> >              counter to 0b'0001'.
> >      2.      Because task B lost the race to start the grace period, it
> >              waits on ->expedited_sequence to reach 0b'0100' inside of
> >              exp_funnel_lock(). This task therefore blocks on the rcu_node
> >              structure's ->exp_wq[1] field, keeping in mind that the
> >              end-of-grace-period value of ->expedited_sequence (0b'0100')
> >              is shifted down two bits before indexing the ->exp_wq[] field.
> >      3.      Task C attempts to start another expedited grace period,
> >              but blocks on ->exp_mutex, which is still held by Task A.
> >      4.      The aforementioned expedited grace period completes, so that
> >              ->expedited_sequence now has the value 0b'0100'.  A kworker task
> >              therefore acquires the rcu_state structure's ->exp_wake_mutex
> >              and starts awakening any tasks waiting for this grace period.
> >      5.      One of the first tasks awakened happens to be Task A.  Task A
> >              therefore releases the rcu_state structure's ->exp_mutex,
> >              which allows Task C to start the next expedited grace period,
> >              which causes the lower four bits of the rcu_state structure's
> >              ->expedited_sequence field to become 0b'0101'.
> >      6.      Task C's expedited grace period completes, so that the lower four
> >              bits of the rcu_state structure's ->expedited_sequence field now
> >              become 0b'1000'.
> >      7.      The kworker task from step 4 above continues its wakeups.
> >              Unfortunately, the wake_up_all() refetches the rcu_state
> >              structure's .expedited_sequence field:
> 
> This might not be true. I think wake_up_all(), which internally calls
> __wake_up(), will use a single wq_head for doing all wakeups. So, a single
> .expedited_sequence value would be used to get wq_head.
> 
> void __wake_up(struct wait_queue_head *wq_head, ...)

The wake_up_all() really is selecting among four different ->exp_wq[]
array entries:

wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);

So I lost you here.  Are you saying that the wake_up_all() will somehow
be operating on ->exp_wq[1], which is where Task B is blocked?  Or that
Task B would instead be blocked on ->exp_wq[2]?  Or that failing to wake
up Task B is OK for some reason?  Or something else entirely?

							Thanx, Paul

> However, below sequence of events would result in problem:
> 
> 1.      Tasks A starts an expedited grace period at about the same time.
>         This grace period will have completed when the lower four bits
> 		of the rcu_state structure's ->expedited_sequence field are 0b'0100',
> 		for example, when the initial value of this counter is zero.
> 		Task A wins, acquires the rcu_state structure's .exp_mutex and
> 		sets the counter to 0b'0001'.
> 
> 2.      The aforementioned expedited grace period completes, so that
>         ->expedited_sequence now has the value 0b'0100'.  A kworker task
>         therefore acquires the rcu_state structure's ->exp_wake_mutex
>         and starts awakening any tasks waiting for this grace period.
>         This kworker gets preempted while unlocking wq_head lock
> 
>         wake_up_all()
>           __wake_up()
>             __wake_up_common_lock()
>               spin_unlock_irqrestore()
>                 __raw_spin_unlock_irqrestore()
>                   preempt_enable()
>                     __preempt_schedule()
> 
> 3.      One of the first tasks awakened happens to be Task A.  Task A
>         therefore releases the rcu_state structure's ->exp_mutex,
> 
> 4.      Tasks B and C both attempt to start an expedited grace
>         period at about the same time.  This grace period will have
>         completed when the lower four bits of the rcu_state structure's
>         ->expedited_sequence field are 0b'1000'. Task B wins, and thus
>         does the actual work of starting the grace period, including
>         acquiring the rcu_state structure's .exp_mutex and sets the
>         counter to 0b'0101'.
> 
> 5.      Because task C lost the race to start the grace period, it
>         waits on ->expedited_sequence to reach 0b'1000' inside of
>         exp_funnel_lock(). This task therefore blocks on the rcu_node
>         structure's ->exp_wq[2] field, keeping in mind that the
>         end-of-grace-period value of ->expedited_sequence (0b'1000')
>         is shifted down two bits before indexing the ->exp_wq[] field.
> 
> 6.      Task B queues work to complete expedited grace period. This
>         task is preempted just before wait_event call. Kworker task picks
> 		up the work queued by task B and and completes grace period, so
> 		that the lower four bits of the rcu_state structure's
> 		->expedited_sequence field now become 0b'1000'. This kworker starts
> 		waiting on the exp_wake_mutex, which is owned by kworker doing
> 		wakeups for expedited sequence initiated by task A.
> 
> 7.      Task B schedules in and finds its expedited sequence snapshot has
>         completed; so, it does not enter waitq and releases exp_mutex. This
> 		allows Task D to start the next expedited grace period,
>         which causes the lower four bits of the rcu_state structure's
>         ->expedited_sequence field to become 0b'1001'.
> 
> 8.      Task D's expedited grace period completes, so that the lower four
>         bits of the rcu_state structure's ->expedited_sequence field now
>         become 0b'1100'.
> 
> 9.      kworker from step 2 is scheduled in and releases exp_wake_mutex;
>         kworker correspnding to Task B's expedited grace period acquires
> 		exp_wake_mutex and starts wakeups. Unfortunately, it used the
> 		rcu_state structure's .expedited_sequence field for determining
> 		the waitq index.
> 
> 
> wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> 
>         This results in the wakeup being applied to the rcu_node
>         structure's ->exp_wq[3] field, which is unfortunate given that
>         Task C is instead waiting on ->exp_wq[2].
> 
> 
> >              wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> >              This results in the wakeup being applied to the rcu_node
> >              structure's ->exp_wq[2] field, which is unfortunate given that
> >              Task B is instead waiting on ->exp_wq[1].
> >      On a busy system, no harm is done (or at least no permanent harm is done).
> >      Some later expedited grace period will redo the wakeup.  But on a quiet
> >      system, such as many embedded systems, it might be a good long time before
> >      there was another expedited grace period.  On such embedded systems,
> >      this situation could therefore result in a system hang.
> >      This issue manifested as DPM device timeout during suspend (which
> >      usually qualifies as a quiet time) due to a SCSI device being stuck in
> >      _synchronize_rcu_expedited(), with the following stack trace:
> >              schedule()
> >              synchronize_rcu_expedited()
> >              synchronize_rcu()
> >              scsi_device_quiesce()
> >              scsi_bus_suspend()
> >              dpm_run_callback()
> >              __device_suspend()
> >      This commit therefore prevents such delays, timeouts, and hangs by
> >      making rcu_exp_wait_wake() use its "s" argument consistently instead of
> >      refetching from rcu_state.expedited_sequence.
> 
> Do we need a "fixes" tag here?
> 
> >      Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> >      Signed-off-by: Paul E. McKenney <paulmck@kernel.org
> > 
> > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > index 6ce598d..4433d00a 100644
> > --- a/kernel/rcu/tree_exp.h
> > +++ b/kernel/rcu/tree_exp.h
> > @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
> >   			spin_unlock(&rnp->exp_lock);
> >   		}
> >   		smp_mb(); /* All above changes before wakeup. */
> > -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> > +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
> >   	}
> >   	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
> >   	mutex_unlock(&rcu_state.exp_wake_mutex);
> > 
> 
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] rcu: Fix missed wakeup of exp_wq waiters
  2019-11-21  4:07     ` Paul E. McKenney
@ 2019-11-21  4:18       ` Neeraj Upadhyay
       [not found]       ` <8536f970-d22a-885c-8203-54a7c15a4f1f@codeaurora.org>
  1 sibling, 0 replies; 7+ messages in thread
From: Neeraj Upadhyay @ 2019-11-21  4:18 UTC (permalink / raw)
  To: paulmck
  Cc: josh, rostedt, mathieu.desnoyers, jiangshanlai, joel,
	linux-kernel, gkohli, prsood, pkondeti, rcu



On 11/21/2019 9:37 AM, Paul E. McKenney wrote:
> On Wed, Nov 20, 2019 at 10:28:38AM +0530, Neeraj Upadhyay wrote:
>>
>> On 11/20/2019 1:08 AM, Paul E. McKenney wrote:
>>> On Tue, Nov 19, 2019 at 03:17:07AM +0000, Neeraj Upadhyay wrote:
>>>> For the tasks waiting in exp_wq inside exp_funnel_lock(),
>>>> there is a chance that they might be indefinitely blocked
>>>> in below scenario:
>>>>
>>>> 1. There is a task waiting on exp sequence 0b'100' inside
>>>>      exp_funnel_lock(). This task blocks at wq index 1.
>>>>
>>>>      synchronize_rcu_expedited()
>>>>        s = 0b'100'
>>>>        exp_funnel_lock()
>>>>          wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
>>>>
>>>> 2. The expedited grace period (which above task blocks for)
>>>>      completes and task (task1) holding exp_mutex queues
>>>>      worker and schedules out.
>>>>
>>>>      synchronize_rcu_expedited()
>>>>        s = 0b'100'
>>>>        queue_work(rcu_gp_wq, &rew.rew_work)
>>>>          wake_up_worker()
>>>>            schedule()
>>>>
>>>> 3. kworker A picks up the queued work and completes the exp gp
>>>>      sequence and then blocks on exp_wake_mutex, which is held
>>>>      by another kworker, which is doing wakeups for expedited_sequence
>>>>      0.
>>>>
>>>>      rcu_exp_wait_wake()
>>>>        rcu_exp_wait_wake()
>>>>          rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>>>>                                  // to 0b'100'
>>>>          mutex_lock(&rcu_state.exp_wake_mutex)
>>>>
>>>> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>>>>      and releases exp_mutex.
>>>>
>>>>      wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>>>        sync_exp_work_done(rsp, s));
>>>>      mutex_unlock(&rsp->exp_mutex);
>>>>
>>>> 5. Next exp GP completes, and sequence number is incremented:
>>>>
>>>>      rcu_exp_wait_wake()
>>>>        rcu_exp_wait_wake()
>>>>          rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
>>>>
>>>> 6. kworker A acquires exp_wake_mutex. As it uses current
>>>>      expedited_sequence, it wakes up workers from wrong wait queue
>>>>      index - it should have worken wait queue corresponding to
>>>>      0b'100' sequence, but wakes up the ones for 0b'200' sequence.
>>>>      This results in task at step 1 indefinitely blocked.
>>>>
>>>>      rcu_exp_wait_wake()
>>>>        wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>>
>>>> This issue manifested as DPM device timeout during suspend, as scsi
>>>> device was stuck in _synchronize_rcu_expedited().
>>>>
>>>> schedule()
>>>> synchronize_rcu_expedited()
>>>> synchronize_rcu()
>>>> scsi_device_quiesce()
>>>> scsi_bus_suspend()
>>>> dpm_run_callback()
>>>> __device_suspend()
>>>>
>>>> Fix this by using the correct exp sequence number, the one which
>>>> owner of the exp_mutex initiated and passed to kworker,
>>>> to index the wait queue, inside rcu_exp_wait_wake().
>>>>
>>>> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>>>
>>> Queued, thank you!
>>>
>>> I reworked the commit message to make it easier to follow the sequence
>>> of events.  Please see below and let me know if I messed anything up.
>>>
>>> 							Thanx, Paul
>>>
>>> ------------------------------------------------------------------------
>>>
>>> commit d887fd2a66861f51ed93b5dde894b9646a5569dd
>>> Author: Neeraj Upadhyay <neeraju@codeaurora.org>
>>> Date:   Tue Nov 19 03:17:07 2019 +0000
>>>
>>>       rcu: Fix missed wakeup of exp_wq waiters
>>>       Tasks waiting within exp_funnel_lock() for an expedited grace period to
>>>       elapse can be starved due to the following sequence of events:
>>>       1.      Tasks A and B both attempt to start an expedited grace
>>>               period at about the same time.  This grace period will have
>>>               completed when the lower four bits of the rcu_state structure's
>>>               ->expedited_sequence field are 0b'0100', for example, when the
>>>               initial value of this counter is zero.  Task A wins, and thus
>>>               does the actual work of starting the grace period, including
>>>               acquiring the rcu_state structure's .exp_mutex and sets the
>>>               counter to 0b'0001'.
>>>       2.      Because task B lost the race to start the grace period, it
>>>               waits on ->expedited_sequence to reach 0b'0100' inside of
>>>               exp_funnel_lock(). This task therefore blocks on the rcu_node
>>>               structure's ->exp_wq[1] field, keeping in mind that the
>>>               end-of-grace-period value of ->expedited_sequence (0b'0100')
>>>               is shifted down two bits before indexing the ->exp_wq[] field.
>>>       3.      Task C attempts to start another expedited grace period,
>>>               but blocks on ->exp_mutex, which is still held by Task A.
>>>       4.      The aforementioned expedited grace period completes, so that
>>>               ->expedited_sequence now has the value 0b'0100'.  A kworker task
>>>               therefore acquires the rcu_state structure's ->exp_wake_mutex
>>>               and starts awakening any tasks waiting for this grace period.
>>>       5.      One of the first tasks awakened happens to be Task A.  Task A
>>>               therefore releases the rcu_state structure's ->exp_mutex,
>>>               which allows Task C to start the next expedited grace period,
>>>               which causes the lower four bits of the rcu_state structure's
>>>               ->expedited_sequence field to become 0b'0101'.
>>>       6.      Task C's expedited grace period completes, so that the lower four
>>>               bits of the rcu_state structure's ->expedited_sequence field now
>>>               become 0b'1000'.
>>>       7.      The kworker task from step 4 above continues its wakeups.
>>>               Unfortunately, the wake_up_all() refetches the rcu_state
>>>               structure's .expedited_sequence field:
>>
>> This might not be true. I think wake_up_all(), which internally calls
>> __wake_up(), will use a single wq_head for doing all wakeups. So, a single
>> .expedited_sequence value would be used to get wq_head.
>>
>> void __wake_up(struct wait_queue_head *wq_head, ...)
> 
> The wake_up_all() really is selecting among four different ->exp_wq[]
> array entries:
> 
> wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> 
> So I lost you here.  Are you saying that the wake_up_all() will somehow
> be operating on ->exp_wq[1], which is where Task B is blocked?  Or that
> Task B would instead be blocked on ->exp_wq[2]?  Or that failing to wake
> up Task B is OK for some reason?  Or something else entirely?

My bad; I was thinking only of the case where there is only one
rnp node (which is also the root) in RCU tree. In case of only
one rnp node also, issue can be seen. Please ignore my
comment. The commit description looks good to me.


Thanks
Neeraj
> 
> 							Thanx, Paul
> 
>> However, below sequence of events would result in problem:
>>
>> 1.      Tasks A starts an expedited grace period at about the same time.
>>          This grace period will have completed when the lower four bits
>> 		of the rcu_state structure's ->expedited_sequence field are 0b'0100',
>> 		for example, when the initial value of this counter is zero.
>> 		Task A wins, acquires the rcu_state structure's .exp_mutex and
>> 		sets the counter to 0b'0001'.
>>
>> 2.      The aforementioned expedited grace period completes, so that
>>          ->expedited_sequence now has the value 0b'0100'.  A kworker task
>>          therefore acquires the rcu_state structure's ->exp_wake_mutex
>>          and starts awakening any tasks waiting for this grace period.
>>          This kworker gets preempted while unlocking wq_head lock
>>
>>          wake_up_all()
>>            __wake_up()
>>              __wake_up_common_lock()
>>                spin_unlock_irqrestore()
>>                  __raw_spin_unlock_irqrestore()
>>                    preempt_enable()
>>                      __preempt_schedule()
>>
>> 3.      One of the first tasks awakened happens to be Task A.  Task A
>>          therefore releases the rcu_state structure's ->exp_mutex,
>>
>> 4.      Tasks B and C both attempt to start an expedited grace
>>          period at about the same time.  This grace period will have
>>          completed when the lower four bits of the rcu_state structure's
>>          ->expedited_sequence field are 0b'1000'. Task B wins, and thus
>>          does the actual work of starting the grace period, including
>>          acquiring the rcu_state structure's .exp_mutex and sets the
>>          counter to 0b'0101'.
>>
>> 5.      Because task C lost the race to start the grace period, it
>>          waits on ->expedited_sequence to reach 0b'1000' inside of
>>          exp_funnel_lock(). This task therefore blocks on the rcu_node
>>          structure's ->exp_wq[2] field, keeping in mind that the
>>          end-of-grace-period value of ->expedited_sequence (0b'1000')
>>          is shifted down two bits before indexing the ->exp_wq[] field.
>>
>> 6.      Task B queues work to complete expedited grace period. This
>>          task is preempted just before wait_event call. Kworker task picks
>> 		up the work queued by task B and and completes grace period, so
>> 		that the lower four bits of the rcu_state structure's
>> 		->expedited_sequence field now become 0b'1000'. This kworker starts
>> 		waiting on the exp_wake_mutex, which is owned by kworker doing
>> 		wakeups for expedited sequence initiated by task A.
>>
>> 7.      Task B schedules in and finds its expedited sequence snapshot has
>>          completed; so, it does not enter waitq and releases exp_mutex. This
>> 		allows Task D to start the next expedited grace period,
>>          which causes the lower four bits of the rcu_state structure's
>>          ->expedited_sequence field to become 0b'1001'.
>>
>> 8.      Task D's expedited grace period completes, so that the lower four
>>          bits of the rcu_state structure's ->expedited_sequence field now
>>          become 0b'1100'.
>>
>> 9.      kworker from step 2 is scheduled in and releases exp_wake_mutex;
>>          kworker correspnding to Task B's expedited grace period acquires
>> 		exp_wake_mutex and starts wakeups. Unfortunately, it used the
>> 		rcu_state structure's .expedited_sequence field for determining
>> 		the waitq index.
>>
>>
>> wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>>
>>          This results in the wakeup being applied to the rcu_node
>>          structure's ->exp_wq[3] field, which is unfortunate given that
>>          Task C is instead waiting on ->exp_wq[2].
>>
>>
>>>               wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>>>               This results in the wakeup being applied to the rcu_node
>>>               structure's ->exp_wq[2] field, which is unfortunate given that
>>>               Task B is instead waiting on ->exp_wq[1].
>>>       On a busy system, no harm is done (or at least no permanent harm is done).
>>>       Some later expedited grace period will redo the wakeup.  But on a quiet
>>>       system, such as many embedded systems, it might be a good long time before
>>>       there was another expedited grace period.  On such embedded systems,
>>>       this situation could therefore result in a system hang.
>>>       This issue manifested as DPM device timeout during suspend (which
>>>       usually qualifies as a quiet time) due to a SCSI device being stuck in
>>>       _synchronize_rcu_expedited(), with the following stack trace:
>>>               schedule()
>>>               synchronize_rcu_expedited()
>>>               synchronize_rcu()
>>>               scsi_device_quiesce()
>>>               scsi_bus_suspend()
>>>               dpm_run_callback()
>>>               __device_suspend()
>>>       This commit therefore prevents such delays, timeouts, and hangs by
>>>       making rcu_exp_wait_wake() use its "s" argument consistently instead of
>>>       refetching from rcu_state.expedited_sequence.
>>
>> Do we need a "fixes" tag here?
>>
>>>       Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>>>       Signed-off-by: Paul E. McKenney <paulmck@kernel.org
>>>
>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>> index 6ce598d..4433d00a 100644
>>> --- a/kernel/rcu/tree_exp.h
>>> +++ b/kernel/rcu/tree_exp.h
>>> @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
>>>    			spin_unlock(&rnp->exp_lock);
>>>    		}
>>>    		smp_mb(); /* All above changes before wakeup. */
>>> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>>> +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
>>>    	}
>>>    	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
>>>    	mutex_unlock(&rcu_state.exp_wake_mutex);
>>>
>>
>> -- 
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>> the Code Aurora Forum, hosted by The Linux Foundation

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] rcu: Fix missed wakeup of exp_wq waiters
       [not found]       ` <8536f970-d22a-885c-8203-54a7c15a4f1f@codeaurora.org>
@ 2019-11-21 15:01         ` Paul E. McKenney
  2019-11-22  3:40           ` Neeraj Upadhyay
  0 siblings, 1 reply; 7+ messages in thread
From: Paul E. McKenney @ 2019-11-21 15:01 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, rostedt, mathieu.desnoyers, jiangshanlai, joel,
	linux-kernel, gkohli, prsood, pkondeti, rcu

On Thu, Nov 21, 2019 at 09:48:05AM +0530, Neeraj Upadhyay wrote:
> On 11/21/2019 9:37 AM, Paul E. McKenney wrote:
> > On Wed, Nov 20, 2019 at 10:28:38AM +0530, Neeraj Upadhyay wrote:
> > > 
> > > On 11/20/2019 1:08 AM, Paul E. McKenney wrote:
> > > > On Tue, Nov 19, 2019 at 03:17:07AM +0000, Neeraj Upadhyay wrote:
> > > > > For the tasks waiting in exp_wq inside exp_funnel_lock(),
> > > > > there is a chance that they might be indefinitely blocked
> > > > > in below scenario:
> > > > > 
> > > > > 1. There is a task waiting on exp sequence 0b'100' inside
> > > > >      exp_funnel_lock(). This task blocks at wq index 1.
> > > > > 
> > > > >      synchronize_rcu_expedited()
> > > > >        s = 0b'100'
> > > > >        exp_funnel_lock()
> > > > >          wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
> > > > > 
> > > > > 2. The expedited grace period (which above task blocks for)
> > > > >      completes and task (task1) holding exp_mutex queues
> > > > >      worker and schedules out.
> > > > > 
> > > > >      synchronize_rcu_expedited()
> > > > >        s = 0b'100'
> > > > >        queue_work(rcu_gp_wq, &rew.rew_work)
> > > > >          wake_up_worker()
> > > > >            schedule()
> > > > > 
> > > > > 3. kworker A picks up the queued work and completes the exp gp
> > > > >      sequence and then blocks on exp_wake_mutex, which is held
> > > > >      by another kworker, which is doing wakeups for expedited_sequence
> > > > >      0.
> > > > > 
> > > > >      rcu_exp_wait_wake()
> > > > >        rcu_exp_wait_wake()
> > > > >          rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
> > > > >                                  // to 0b'100'
> > > > >          mutex_lock(&rcu_state.exp_wake_mutex)
> > > > > 
> > > > > 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
> > > > >      and releases exp_mutex.
> > > > > 
> > > > >      wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > > > >        sync_exp_work_done(rsp, s));
> > > > >      mutex_unlock(&rsp->exp_mutex);
> > > > > 
> > > > > 5. Next exp GP completes, and sequence number is incremented:
> > > > > 
> > > > >      rcu_exp_wait_wake()
> > > > >        rcu_exp_wait_wake()
> > > > >          rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
> > > > > 
> > > > > 6. kworker A acquires exp_wake_mutex. As it uses current
> > > > >      expedited_sequence, it wakes up workers from wrong wait queue
> > > > >      index - it should have worken wait queue corresponding to
> > > > >      0b'100' sequence, but wakes up the ones for 0b'200' sequence.
> > > > >      This results in task at step 1 indefinitely blocked.
> > > > > 
> > > > >      rcu_exp_wait_wake()
> > > > >        wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > > > 
> > > > > This issue manifested as DPM device timeout during suspend, as scsi
> > > > > device was stuck in _synchronize_rcu_expedited().
> > > > > 
> > > > > schedule()
> > > > > synchronize_rcu_expedited()
> > > > > synchronize_rcu()
> > > > > scsi_device_quiesce()
> > > > > scsi_bus_suspend()
> > > > > dpm_run_callback()
> > > > > __device_suspend()
> > > > > 
> > > > > Fix this by using the correct exp sequence number, the one which
> > > > > owner of the exp_mutex initiated and passed to kworker,
> > > > > to index the wait queue, inside rcu_exp_wait_wake().
> > > > > 
> > > > > Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> > > > 
> > > > Queued, thank you!
> > > > 
> > > > I reworked the commit message to make it easier to follow the sequence
> > > > of events.  Please see below and let me know if I messed anything up.
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > > commit d887fd2a66861f51ed93b5dde894b9646a5569dd
> > > > Author: Neeraj Upadhyay <neeraju@codeaurora.org>
> > > > Date:   Tue Nov 19 03:17:07 2019 +0000
> > > > 
> > > >       rcu: Fix missed wakeup of exp_wq waiters
> > > >       Tasks waiting within exp_funnel_lock() for an expedited grace period to
> > > >       elapse can be starved due to the following sequence of events:
> > > >       1.      Tasks A and B both attempt to start an expedited grace
> > > >               period at about the same time.  This grace period will have
> > > >               completed when the lower four bits of the rcu_state structure's
> > > >               ->expedited_sequence field are 0b'0100', for example, when the
> > > >               initial value of this counter is zero.  Task A wins, and thus
> > > >               does the actual work of starting the grace period, including
> > > >               acquiring the rcu_state structure's .exp_mutex and sets the
> > > >               counter to 0b'0001'.
> > > >       2.      Because task B lost the race to start the grace period, it
> > > >               waits on ->expedited_sequence to reach 0b'0100' inside of
> > > >               exp_funnel_lock(). This task therefore blocks on the rcu_node
> > > >               structure's ->exp_wq[1] field, keeping in mind that the
> > > >               end-of-grace-period value of ->expedited_sequence (0b'0100')
> > > >               is shifted down two bits before indexing the ->exp_wq[] field.
> > > >       3.      Task C attempts to start another expedited grace period,
> > > >               but blocks on ->exp_mutex, which is still held by Task A.
> > > >       4.      The aforementioned expedited grace period completes, so that
> > > >               ->expedited_sequence now has the value 0b'0100'.  A kworker task
> > > >               therefore acquires the rcu_state structure's ->exp_wake_mutex
> > > >               and starts awakening any tasks waiting for this grace period.
> > > >       5.      One of the first tasks awakened happens to be Task A.  Task A
> > > >               therefore releases the rcu_state structure's ->exp_mutex,
> > > >               which allows Task C to start the next expedited grace period,
> > > >               which causes the lower four bits of the rcu_state structure's
> > > >               ->expedited_sequence field to become 0b'0101'.
> > > >       6.      Task C's expedited grace period completes, so that the lower four
> > > >               bits of the rcu_state structure's ->expedited_sequence field now
> > > >               become 0b'1000'.
> > > >       7.      The kworker task from step 4 above continues its wakeups.
> > > >               Unfortunately, the wake_up_all() refetches the rcu_state
> > > >               structure's .expedited_sequence field:
> > > 
> > > This might not be true. I think wake_up_all(), which internally calls
> > > __wake_up(), will use a single wq_head for doing all wakeups. So, a single
> > > .expedited_sequence value would be used to get wq_head.
> > > 
> > > void __wake_up(struct wait_queue_head *wq_head, ...)
> > 
> > The wake_up_all() really is selecting among four different ->exp_wq[]
> > array entries:
> > 
> > wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> > 
> > So I lost you here.  Are you saying that the wake_up_all() will somehow
> > be operating on ->exp_wq[1], which is where Task B is blocked?  Or that
> > Task B would instead be blocked on ->exp_wq[2]?  Or that failing to wake
> > up Task B is OK for some reason?  Or something else entirely?
> 
> My bad; I was thinking only of the case where there is only one
> rnp node (which is also the root) in RCU tree. In case of only
> one rnp node also, issue can be seen. Please ignore my
> comment. The commit description looks good to me.

Thank you for checking!

And the sequence of events below looks greatly improved over your
original.  I suspect that there are more similar bugs to find in
Linux-kernel RCU, so please use a carefully labeled style like that
below when reporting the next one.

							Thanx, Paul

> Thanks
> Neeraj
> > 
> > 							Thanx, Paul
> > 
> > > However, below sequence of events would result in problem:
> > > 
> > > 1.      Tasks A starts an expedited grace period at about the same time.
> > >          This grace period will have completed when the lower four bits
> > > 		of the rcu_state structure's ->expedited_sequence field are 0b'0100',
> > > 		for example, when the initial value of this counter is zero.
> > > 		Task A wins, acquires the rcu_state structure's .exp_mutex and
> > > 		sets the counter to 0b'0001'.
> > > 
> > > 2.      The aforementioned expedited grace period completes, so that
> > >          ->expedited_sequence now has the value 0b'0100'.  A kworker task
> > >          therefore acquires the rcu_state structure's ->exp_wake_mutex
> > >          and starts awakening any tasks waiting for this grace period.
> > >          This kworker gets preempted while unlocking wq_head lock
> > > 
> > >          wake_up_all()
> > >            __wake_up()
> > >              __wake_up_common_lock()
> > >                spin_unlock_irqrestore()
> > >                  __raw_spin_unlock_irqrestore()
> > >                    preempt_enable()
> > >                      __preempt_schedule()
> > > 
> > > 3.      One of the first tasks awakened happens to be Task A.  Task A
> > >          therefore releases the rcu_state structure's ->exp_mutex,
> > > 
> > > 4.      Tasks B and C both attempt to start an expedited grace
> > >          period at about the same time.  This grace period will have
> > >          completed when the lower four bits of the rcu_state structure's
> > >          ->expedited_sequence field are 0b'1000'. Task B wins, and thus
> > >          does the actual work of starting the grace period, including
> > >          acquiring the rcu_state structure's .exp_mutex and sets the
> > >          counter to 0b'0101'.
> > > 
> > > 5.      Because task C lost the race to start the grace period, it
> > >          waits on ->expedited_sequence to reach 0b'1000' inside of
> > >          exp_funnel_lock(). This task therefore blocks on the rcu_node
> > >          structure's ->exp_wq[2] field, keeping in mind that the
> > >          end-of-grace-period value of ->expedited_sequence (0b'1000')
> > >          is shifted down two bits before indexing the ->exp_wq[] field.
> > > 
> > > 6.      Task B queues work to complete expedited grace period. This
> > >          task is preempted just before wait_event call. Kworker task picks
> > > 		up the work queued by task B and and completes grace period, so
> > > 		that the lower four bits of the rcu_state structure's
> > > 		->expedited_sequence field now become 0b'1000'. This kworker starts
> > > 		waiting on the exp_wake_mutex, which is owned by kworker doing
> > > 		wakeups for expedited sequence initiated by task A.
> > > 
> > > 7.      Task B schedules in and finds its expedited sequence snapshot has
> > >          completed; so, it does not enter waitq and releases exp_mutex. This
> > > 		allows Task D to start the next expedited grace period,
> > >          which causes the lower four bits of the rcu_state structure's
> > >          ->expedited_sequence field to become 0b'1001'.
> > > 
> > > 8.      Task D's expedited grace period completes, so that the lower four
> > >          bits of the rcu_state structure's ->expedited_sequence field now
> > >          become 0b'1100'.
> > > 
> > > 9.      kworker from step 2 is scheduled in and releases exp_wake_mutex;
> > >          kworker correspnding to Task B's expedited grace period acquires
> > > 		exp_wake_mutex and starts wakeups. Unfortunately, it used the
> > > 		rcu_state structure's .expedited_sequence field for determining
> > > 		the waitq index.
> > > 
> > > 
> > > wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> > > 
> > >          This results in the wakeup being applied to the rcu_node
> > >          structure's ->exp_wq[3] field, which is unfortunate given that
> > >          Task C is instead waiting on ->exp_wq[2].
> > > 
> > > 
> > > >               wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> > > >               This results in the wakeup being applied to the rcu_node
> > > >               structure's ->exp_wq[2] field, which is unfortunate given that
> > > >               Task B is instead waiting on ->exp_wq[1].
> > > >       On a busy system, no harm is done (or at least no permanent harm is done).
> > > >       Some later expedited grace period will redo the wakeup.  But on a quiet
> > > >       system, such as many embedded systems, it might be a good long time before
> > > >       there was another expedited grace period.  On such embedded systems,
> > > >       this situation could therefore result in a system hang.
> > > >       This issue manifested as DPM device timeout during suspend (which
> > > >       usually qualifies as a quiet time) due to a SCSI device being stuck in
> > > >       _synchronize_rcu_expedited(), with the following stack trace:
> > > >               schedule()
> > > >               synchronize_rcu_expedited()
> > > >               synchronize_rcu()
> > > >               scsi_device_quiesce()
> > > >               scsi_bus_suspend()
> > > >               dpm_run_callback()
> > > >               __device_suspend()
> > > >       This commit therefore prevents such delays, timeouts, and hangs by
> > > >       making rcu_exp_wait_wake() use its "s" argument consistently instead of
> > > >       refetching from rcu_state.expedited_sequence.
> > > 
> > > Do we need a "fixes" tag here?
> > > 
> > > >       Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> > > >       Signed-off-by: Paul E. McKenney <paulmck@kernel.org
> > > > 
> > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > > index 6ce598d..4433d00a 100644
> > > > --- a/kernel/rcu/tree_exp.h
> > > > +++ b/kernel/rcu/tree_exp.h
> > > > @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
> > > >    			spin_unlock(&rnp->exp_lock);
> > > >    		}
> > > >    		smp_mb(); /* All above changes before wakeup. */
> > > > -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> > > > +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
> > > >    	}
> > > >    	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
> > > >    	mutex_unlock(&rcu_state.exp_wake_mutex);
> > > > 
> > > 
> > > -- 
> > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> > > the Code Aurora Forum, hosted by The Linux Foundation
> 
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] rcu: Fix missed wakeup of exp_wq waiters
  2019-11-21 15:01         ` Paul E. McKenney
@ 2019-11-22  3:40           ` Neeraj Upadhyay
  0 siblings, 0 replies; 7+ messages in thread
From: Neeraj Upadhyay @ 2019-11-22  3:40 UTC (permalink / raw)
  To: paulmck
  Cc: josh, rostedt, mathieu.desnoyers, jiangshanlai, joel,
	linux-kernel, gkohli, prsood, pkondeti, rcu



On 11/21/2019 8:31 PM, Paul E. McKenney wrote:
> On Thu, Nov 21, 2019 at 09:48:05AM +0530, Neeraj Upadhyay wrote:
>> On 11/21/2019 9:37 AM, Paul E. McKenney wrote:
>>> On Wed, Nov 20, 2019 at 10:28:38AM +0530, Neeraj Upadhyay wrote:
>>>>
>>>> On 11/20/2019 1:08 AM, Paul E. McKenney wrote:
>>>>> On Tue, Nov 19, 2019 at 03:17:07AM +0000, Neeraj Upadhyay wrote:
>>>>>> For the tasks waiting in exp_wq inside exp_funnel_lock(),
>>>>>> there is a chance that they might be indefinitely blocked
>>>>>> in below scenario:
>>>>>>
>>>>>> 1. There is a task waiting on exp sequence 0b'100' inside
>>>>>>       exp_funnel_lock(). This task blocks at wq index 1.
>>>>>>
>>>>>>       synchronize_rcu_expedited()
>>>>>>         s = 0b'100'
>>>>>>         exp_funnel_lock()
>>>>>>           wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
>>>>>>
>>>>>> 2. The expedited grace period (which above task blocks for)
>>>>>>       completes and task (task1) holding exp_mutex queues
>>>>>>       worker and schedules out.
>>>>>>
>>>>>>       synchronize_rcu_expedited()
>>>>>>         s = 0b'100'
>>>>>>         queue_work(rcu_gp_wq, &rew.rew_work)
>>>>>>           wake_up_worker()
>>>>>>             schedule()
>>>>>>
>>>>>> 3. kworker A picks up the queued work and completes the exp gp
>>>>>>       sequence and then blocks on exp_wake_mutex, which is held
>>>>>>       by another kworker, which is doing wakeups for expedited_sequence
>>>>>>       0.
>>>>>>
>>>>>>       rcu_exp_wait_wake()
>>>>>>         rcu_exp_wait_wake()
>>>>>>           rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>>>>>>                                   // to 0b'100'
>>>>>>           mutex_lock(&rcu_state.exp_wake_mutex)
>>>>>>
>>>>>> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>>>>>>       and releases exp_mutex.
>>>>>>
>>>>>>       wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>>>>>         sync_exp_work_done(rsp, s));
>>>>>>       mutex_unlock(&rsp->exp_mutex);
>>>>>>
>>>>>> 5. Next exp GP completes, and sequence number is incremented:
>>>>>>
>>>>>>       rcu_exp_wait_wake()
>>>>>>         rcu_exp_wait_wake()
>>>>>>           rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
>>>>>>
>>>>>> 6. kworker A acquires exp_wake_mutex. As it uses current
>>>>>>       expedited_sequence, it wakes up workers from wrong wait queue
>>>>>>       index - it should have worken wait queue corresponding to
>>>>>>       0b'100' sequence, but wakes up the ones for 0b'200' sequence.
>>>>>>       This results in task at step 1 indefinitely blocked.
>>>>>>
>>>>>>       rcu_exp_wait_wake()
>>>>>>         wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>>>>
>>>>>> This issue manifested as DPM device timeout during suspend, as scsi
>>>>>> device was stuck in _synchronize_rcu_expedited().
>>>>>>
>>>>>> schedule()
>>>>>> synchronize_rcu_expedited()
>>>>>> synchronize_rcu()
>>>>>> scsi_device_quiesce()
>>>>>> scsi_bus_suspend()
>>>>>> dpm_run_callback()
>>>>>> __device_suspend()
>>>>>>
>>>>>> Fix this by using the correct exp sequence number, the one which
>>>>>> owner of the exp_mutex initiated and passed to kworker,
>>>>>> to index the wait queue, inside rcu_exp_wait_wake().
>>>>>>
>>>>>> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>>>>>
>>>>> Queued, thank you!
>>>>>
>>>>> I reworked the commit message to make it easier to follow the sequence
>>>>> of events.  Please see below and let me know if I messed anything up.
>>>>>
>>>>> 							Thanx, Paul
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> commit d887fd2a66861f51ed93b5dde894b9646a5569dd
>>>>> Author: Neeraj Upadhyay <neeraju@codeaurora.org>
>>>>> Date:   Tue Nov 19 03:17:07 2019 +0000
>>>>>
>>>>>        rcu: Fix missed wakeup of exp_wq waiters
>>>>>        Tasks waiting within exp_funnel_lock() for an expedited grace period to
>>>>>        elapse can be starved due to the following sequence of events:
>>>>>        1.      Tasks A and B both attempt to start an expedited grace
>>>>>                period at about the same time.  This grace period will have
>>>>>                completed when the lower four bits of the rcu_state structure's
>>>>>                ->expedited_sequence field are 0b'0100', for example, when the
>>>>>                initial value of this counter is zero.  Task A wins, and thus
>>>>>                does the actual work of starting the grace period, including
>>>>>                acquiring the rcu_state structure's .exp_mutex and sets the
>>>>>                counter to 0b'0001'.
>>>>>        2.      Because task B lost the race to start the grace period, it
>>>>>                waits on ->expedited_sequence to reach 0b'0100' inside of
>>>>>                exp_funnel_lock(). This task therefore blocks on the rcu_node
>>>>>                structure's ->exp_wq[1] field, keeping in mind that the
>>>>>                end-of-grace-period value of ->expedited_sequence (0b'0100')
>>>>>                is shifted down two bits before indexing the ->exp_wq[] field.
>>>>>        3.      Task C attempts to start another expedited grace period,
>>>>>                but blocks on ->exp_mutex, which is still held by Task A.
>>>>>        4.      The aforementioned expedited grace period completes, so that
>>>>>                ->expedited_sequence now has the value 0b'0100'.  A kworker task
>>>>>                therefore acquires the rcu_state structure's ->exp_wake_mutex
>>>>>                and starts awakening any tasks waiting for this grace period.
>>>>>        5.      One of the first tasks awakened happens to be Task A.  Task A
>>>>>                therefore releases the rcu_state structure's ->exp_mutex,
>>>>>                which allows Task C to start the next expedited grace period,
>>>>>                which causes the lower four bits of the rcu_state structure's
>>>>>                ->expedited_sequence field to become 0b'0101'.
>>>>>        6.      Task C's expedited grace period completes, so that the lower four
>>>>>                bits of the rcu_state structure's ->expedited_sequence field now
>>>>>                become 0b'1000'.
>>>>>        7.      The kworker task from step 4 above continues its wakeups.
>>>>>                Unfortunately, the wake_up_all() refetches the rcu_state
>>>>>                structure's .expedited_sequence field:
>>>>
>>>> This might not be true. I think wake_up_all(), which internally calls
>>>> __wake_up(), will use a single wq_head for doing all wakeups. So, a single
>>>> .expedited_sequence value would be used to get wq_head.
>>>>
>>>> void __wake_up(struct wait_queue_head *wq_head, ...)
>>>
>>> The wake_up_all() really is selecting among four different ->exp_wq[]
>>> array entries:
>>>
>>> wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>>>
>>> So I lost you here.  Are you saying that the wake_up_all() will somehow
>>> be operating on ->exp_wq[1], which is where Task B is blocked?  Or that
>>> Task B would instead be blocked on ->exp_wq[2]?  Or that failing to wake
>>> up Task B is OK for some reason?  Or something else entirely?
>>
>> My bad; I was thinking only of the case where there is only one
>> rnp node (which is also the root) in RCU tree. In case of only
>> one rnp node also, issue can be seen. Please ignore my
>> comment. The commit description looks good to me.
> 
> Thank you for checking!
> 
> And the sequence of events below looks greatly improved over your
> original.  I suspect that there are more similar bugs to find in
> Linux-kernel RCU, so please use a carefully labeled style like that
> below when reporting the next one.
> 
> 							Thanx, Paul
> 

Ok, thanks. I will keep that in mind, while reporting any issues in future.


Thanks
Neeraj

>> Thanks
>> Neeraj
>>>
>>> 							Thanx, Paul
>>>
>>>> However, below sequence of events would result in problem:
>>>>
>>>> 1.      Tasks A starts an expedited grace period at about the same time.
>>>>           This grace period will have completed when the lower four bits
>>>> 		of the rcu_state structure's ->expedited_sequence field are 0b'0100',
>>>> 		for example, when the initial value of this counter is zero.
>>>> 		Task A wins, acquires the rcu_state structure's .exp_mutex and
>>>> 		sets the counter to 0b'0001'.
>>>>
>>>> 2.      The aforementioned expedited grace period completes, so that
>>>>           ->expedited_sequence now has the value 0b'0100'.  A kworker task
>>>>           therefore acquires the rcu_state structure's ->exp_wake_mutex
>>>>           and starts awakening any tasks waiting for this grace period.
>>>>           This kworker gets preempted while unlocking wq_head lock
>>>>
>>>>           wake_up_all()
>>>>             __wake_up()
>>>>               __wake_up_common_lock()
>>>>                 spin_unlock_irqrestore()
>>>>                   __raw_spin_unlock_irqrestore()
>>>>                     preempt_enable()
>>>>                       __preempt_schedule()
>>>>
>>>> 3.      One of the first tasks awakened happens to be Task A.  Task A
>>>>           therefore releases the rcu_state structure's ->exp_mutex,
>>>>
>>>> 4.      Tasks B and C both attempt to start an expedited grace
>>>>           period at about the same time.  This grace period will have
>>>>           completed when the lower four bits of the rcu_state structure's
>>>>           ->expedited_sequence field are 0b'1000'. Task B wins, and thus
>>>>           does the actual work of starting the grace period, including
>>>>           acquiring the rcu_state structure's .exp_mutex and sets the
>>>>           counter to 0b'0101'.
>>>>
>>>> 5.      Because task C lost the race to start the grace period, it
>>>>           waits on ->expedited_sequence to reach 0b'1000' inside of
>>>>           exp_funnel_lock(). This task therefore blocks on the rcu_node
>>>>           structure's ->exp_wq[2] field, keeping in mind that the
>>>>           end-of-grace-period value of ->expedited_sequence (0b'1000')
>>>>           is shifted down two bits before indexing the ->exp_wq[] field.
>>>>
>>>> 6.      Task B queues work to complete expedited grace period. This
>>>>           task is preempted just before wait_event call. Kworker task picks
>>>> 		up the work queued by task B and and completes grace period, so
>>>> 		that the lower four bits of the rcu_state structure's
>>>> 		->expedited_sequence field now become 0b'1000'. This kworker starts
>>>> 		waiting on the exp_wake_mutex, which is owned by kworker doing
>>>> 		wakeups for expedited sequence initiated by task A.
>>>>
>>>> 7.      Task B schedules in and finds its expedited sequence snapshot has
>>>>           completed; so, it does not enter waitq and releases exp_mutex. This
>>>> 		allows Task D to start the next expedited grace period,
>>>>           which causes the lower four bits of the rcu_state structure's
>>>>           ->expedited_sequence field to become 0b'1001'.
>>>>
>>>> 8.      Task D's expedited grace period completes, so that the lower four
>>>>           bits of the rcu_state structure's ->expedited_sequence field now
>>>>           become 0b'1100'.
>>>>
>>>> 9.      kworker from step 2 is scheduled in and releases exp_wake_mutex;
>>>>           kworker correspnding to Task B's expedited grace period acquires
>>>> 		exp_wake_mutex and starts wakeups. Unfortunately, it used the
>>>> 		rcu_state structure's .expedited_sequence field for determining
>>>> 		the waitq index.
>>>>
>>>>
>>>> wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>>>>
>>>>           This results in the wakeup being applied to the rcu_node
>>>>           structure's ->exp_wq[3] field, which is unfortunate given that
>>>>           Task C is instead waiting on ->exp_wq[2].
>>>>
>>>>
>>>>>                wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>>>>>                This results in the wakeup being applied to the rcu_node
>>>>>                structure's ->exp_wq[2] field, which is unfortunate given that
>>>>>                Task B is instead waiting on ->exp_wq[1].
>>>>>        On a busy system, no harm is done (or at least no permanent harm is done).
>>>>>        Some later expedited grace period will redo the wakeup.  But on a quiet
>>>>>        system, such as many embedded systems, it might be a good long time before
>>>>>        there was another expedited grace period.  On such embedded systems,
>>>>>        this situation could therefore result in a system hang.
>>>>>        This issue manifested as DPM device timeout during suspend (which
>>>>>        usually qualifies as a quiet time) due to a SCSI device being stuck in
>>>>>        _synchronize_rcu_expedited(), with the following stack trace:
>>>>>                schedule()
>>>>>                synchronize_rcu_expedited()
>>>>>                synchronize_rcu()
>>>>>                scsi_device_quiesce()
>>>>>                scsi_bus_suspend()
>>>>>                dpm_run_callback()
>>>>>                __device_suspend()
>>>>>        This commit therefore prevents such delays, timeouts, and hangs by
>>>>>        making rcu_exp_wait_wake() use its "s" argument consistently instead of
>>>>>        refetching from rcu_state.expedited_sequence.
>>>>
>>>> Do we need a "fixes" tag here?
>>>>
>>>>>        Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>>>>>        Signed-off-by: Paul E. McKenney <paulmck@kernel.org
>>>>>
>>>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>>>> index 6ce598d..4433d00a 100644
>>>>> --- a/kernel/rcu/tree_exp.h
>>>>> +++ b/kernel/rcu/tree_exp.h
>>>>> @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
>>>>>     			spin_unlock(&rnp->exp_lock);
>>>>>     		}
>>>>>     		smp_mb(); /* All above changes before wakeup. */
>>>>> -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>>>>> +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
>>>>>     	}
>>>>>     	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
>>>>>     	mutex_unlock(&rcu_state.exp_wake_mutex);
>>>>>
>>>>
>>>> -- 
>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>>>> the Code Aurora Forum, hosted by The Linux Foundation
>>
>> -- 
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>> the Code Aurora Forum, hosted by The Linux Foundation

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, back to index

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-19  3:17 [PATCH v2] rcu: Fix missed wakeup of exp_wq waiters Neeraj Upadhyay
2019-11-19 19:38 ` Paul E. McKenney
2019-11-20  4:58   ` Neeraj Upadhyay
     [not found]   ` <25499aba-04a1-8d03-e2d9-fc89d7794b66@codeaurora.org>
2019-11-21  4:07     ` Paul E. McKenney
2019-11-21  4:18       ` Neeraj Upadhyay
     [not found]       ` <8536f970-d22a-885c-8203-54a7c15a4f1f@codeaurora.org>
2019-11-21 15:01         ` Paul E. McKenney
2019-11-22  3:40           ` Neeraj Upadhyay

RCU Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/rcu/0 rcu/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 rcu rcu/ https://lore.kernel.org/rcu \
		rcu@vger.kernel.org
	public-inbox-index rcu

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.rcu


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git