* cpu stopper threads and setaffinity leads to deadlock
@ 2018-08-02  1:34 Sodagudi Prasad
  2018-08-02  8:12 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Sodagudi Prasad @ 2018-08-02  1:34 UTC (permalink / raw)
  To: peterz, mingo, gregkh, bigeasy, tglx
  Cc: isaacm, psodagud, linux-kernel, mingo

Hi Peter and Tglx,

We are observing another deadlock caused by commit
0b26351b91 (stop_machine, sched: Fix migrate_swap() vs. active_balance()
deadlock), even after taking the following fix
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1740526.html
on the Linux-4.14.56 kernel.

Here is the scenario that leads to this deadlock.
We have used the stress-ng-64 --affinity test case to reproduce this 
issue in a controlled environment, while simultaneously running CPU hot 
plug and task migrations.

Stress-ng-affin (call stack shown below) is changing its own affinity
from cpu3 to cpu7. Stress-ng-affin is preempted in the
cpu_stop_queue_work() function as soon as the stopper lock for
migration/3 is released. At the same time, on cpu7, a cross migration of
tasks happens between cpu3 and cpu7.

=======================================================
Process: stress-ng-affin, cpu: 3 pid: 1748 start: 0xffffffd8817e4480
=====================================================
     Task name: stress-ng-affin pid: 1748 cpu: 3 start: ffffffd8817e4480
     state: 0x0 exit_state: 0x0 stack base: 0xffffff801c8e8000 Prio: 120
     Stack:
     [<ffffff87754864f4>] __switch_to+0xb8
     [<ffffff87763ebf8c>] __schedule+0x690
     [<ffffff87763ec388>] preempt_schedule_common+0x100
     [<ffffff87763eb8f4>] preempt_schedule+0x24
     [<ffffff87763f0e58>] _raw_spin_unlock_irqrestore+0x64
     [<ffffff8775574f8c>] cpu_stop_queue_work+0x9c
     [<ffffff8775574dfc>] stop_one_cpu+0x58
     [<ffffff87754e4884>] __set_cpus_allowed_ptr+0x234
     [<ffffff87754e8888>] sched_setaffinity+0x150
     [<ffffff87754e8ad8>] SyS_sched_setaffinity+0xcc
     [<ffffff87754837c0>] el0_svc_naked+0x34
     [<0>] UNKNOWN+0x0

Due to the cross migration of tasks between cpu7 and cpu3, migration/7 has
started executing and waits for the migration/3 task, so that they can
proceed through the multi cpu stop state machine together.
Unfortunately, stress-ng-affin is now affine to cpu7, and since migration/7
has started running and has monopolized cpu7’s execution, stress-ng-affin
will never run on cpu7. It therefore never reaches wake_up_q(), and cpu3’s
migration task is never woken up.

Essentially:
Due to the nature of the wake_q interface, a thread can be in at most
one wake queue at a time.
migration/3 is currently in stress-ng-affin’s wake_q. This means that no
other thread can add migration/3 to its own wake queue.
Thus, even if an attempt is made to stop CPU 3 (e.g. cross-migration,
hot plugging, etc.), no thread will wake up migration/3.
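
The "at most one wake queue" property described above can be sketched in
userspace. This is only a rough model, assuming the kernel's wake_q_add()
claims a task by atomically swapping its queue link from NULL to a tail
marker. The names struct task, struct wake_queue, wake_queue_add(),
wake_queue_flush() and demo_stuck_wakeup() are hypothetical stand-ins, not
kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a task with an embedded wake-queue link. */
struct task {
	const char *name;
	_Atomic(struct task *) wake_next;	/* NULL: not on any wake queue */
	bool woken;
};

static struct task tail_sentinel;	/* marks the end of a queue */
#define WAKE_TAIL (&tail_sentinel)

struct wake_queue {
	_Atomic(struct task *) first;
	_Atomic(struct task *) *lastp;
};

static void wake_queue_init(struct wake_queue *q)
{
	atomic_init(&q->first, WAKE_TAIL);
	q->lastp = &q->first;
}

/* Returns false if the task is already claimed by some wake queue. */
static bool wake_queue_add(struct wake_queue *q, struct task *t)
{
	struct task *expected = NULL;

	assert(t != NULL);
	if (!atomic_compare_exchange_strong(&t->wake_next, &expected, WAKE_TAIL))
		return false;		/* someone else will (hopefully) wake it */
	atomic_store(q->lastp, t);
	q->lastp = &t->wake_next;
	return true;
}

/* Wakes every queued task and returns how many were woken. */
static int wake_queue_flush(struct wake_queue *q)
{
	struct task *t = atomic_load(&q->first);
	int n = 0;

	while (t != WAKE_TAIL) {
		struct task *next = atomic_load(&t->wake_next);

		atomic_store(&t->wake_next, NULL);	/* task may be queued again */
		t->woken = true;
		n++;
		t = next;
	}
	atomic_store(&q->first, WAKE_TAIL);
	q->lastp = &q->first;
	return n;
}

/* Replays the reported situation; returns true if migration/3 stays asleep. */
static bool demo_stuck_wakeup(void)
{
	struct task migration3 = { .name = "migration/3" };
	struct wake_queue affin_q, swap_q;
	bool second_add;

	atomic_init(&migration3.wake_next, NULL);
	wake_queue_init(&affin_q);
	wake_queue_init(&swap_q);

	/* stress-ng-affin queues migration/3, then is preempted before flushing. */
	wake_queue_add(&affin_q, &migration3);
	/* The cross-migration path tries to queue the same task and is refused. */
	second_add = wake_queue_add(&swap_q, &migration3);
	wake_queue_flush(&swap_q);	/* wakes nobody */

	return !second_add && !migration3.woken;
}
```

In this model the second queue silently skips migration/3, so the wakeup
depends entirely on the first owner getting to run its flush.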

The change below fixed this deadlock in our testing.
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index e190d1e..f932e1e 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -87,9 +87,9 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
                 __cpu_stop_queue_work(stopper, work, &wakeq);
         else if (work->done)
                 cpu_stop_signal_done(work->done);
-       raw_spin_unlock_irqrestore(&stopper->lock, flags);

         wake_up_q(&wakeq);
+       raw_spin_unlock_irqrestore(&stopper->lock, flags);


-Thanks, Prasad

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project


* Re: cpu stopper threads and setaffinity leads to deadlock
  2018-08-02  1:34 cpu stopper threads and setaffinity leads to deadlock Sodagudi Prasad
@ 2018-08-02  8:12 ` Peter Zijlstra
  2018-08-02  8:27   ` Mike Galbraith
  2018-08-02  8:45 ` Peter Zijlstra
  2018-08-02  9:49 ` Peter Zijlstra
  2 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2018-08-02  8:12 UTC (permalink / raw)
  To: Sodagudi Prasad; +Cc: mingo, gregkh, bigeasy, tglx, isaacm, linux-kernel

On Wed, Aug 01, 2018 at 06:34:40PM -0700, Sodagudi Prasad wrote:
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index e190d1e..f932e1e 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -87,9 +87,9 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
>                 __cpu_stop_queue_work(stopper, work, &wakeq);
>         else if (work->done)
>                 cpu_stop_signal_done(work->done);
> -       raw_spin_unlock_irqrestore(&stopper->lock, flags);
> 
>         wake_up_q(&wakeq);
> +       raw_spin_unlock_irqrestore(&stopper->lock, flags);
> 

That puts the wakeup back under stopper lock, which causes another
deadlock iirc.


* Re: cpu stopper threads and setaffinity leads to deadlock
  2018-08-02  8:12 ` Peter Zijlstra
@ 2018-08-02  8:27   ` Mike Galbraith
  0 siblings, 0 replies; 7+ messages in thread
From: Mike Galbraith @ 2018-08-02  8:27 UTC (permalink / raw)
  To: Peter Zijlstra, Sodagudi Prasad
  Cc: mingo, gregkh, bigeasy, tglx, isaacm, linux-kernel

On Thu, 2018-08-02 at 10:12 +0200, Peter Zijlstra wrote:
> On Wed, Aug 01, 2018 at 06:34:40PM -0700, Sodagudi Prasad wrote:
> > diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> > index e190d1e..f932e1e 100644
> > --- a/kernel/stop_machine.c
> > +++ b/kernel/stop_machine.c
> > @@ -87,9 +87,9 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
> >                 __cpu_stop_queue_work(stopper, work, &wakeq);
> >         else if (work->done)
> >                 cpu_stop_signal_done(work->done);
> > -       raw_spin_unlock_irqrestore(&stopper->lock, flags);
> > 
> >         wake_up_q(&wakeq);
> > +       raw_spin_unlock_irqrestore(&stopper->lock, flags);
> > 
> 
> That puts the wakeup back under stopper lock, which causes another
> deadlock iirc.

Yup, one you fixed.

0b26351b910fb (Peter Zijlstra 2018-04-20 11:50:05 +0200 92)     wake_up_q(&wakeq);


* Re: cpu stopper threads and setaffinity leads to deadlock
  2018-08-02  1:34 cpu stopper threads and setaffinity leads to deadlock Sodagudi Prasad
  2018-08-02  8:12 ` Peter Zijlstra
@ 2018-08-02  8:45 ` Peter Zijlstra
  2018-08-02  9:49 ` Peter Zijlstra
  2 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2018-08-02  8:45 UTC (permalink / raw)
  To: Sodagudi Prasad; +Cc: mingo, gregkh, bigeasy, tglx, isaacm, linux-kernel

On Wed, Aug 01, 2018 at 06:34:40PM -0700, Sodagudi Prasad wrote:
> the Linux-4.14.56  kernel.

Can you also please run on something recent...


* Re: cpu stopper threads and setaffinity leads to deadlock
  2018-08-02  1:34 cpu stopper threads and setaffinity leads to deadlock Sodagudi Prasad
  2018-08-02  8:12 ` Peter Zijlstra
  2018-08-02  8:45 ` Peter Zijlstra
@ 2018-08-02  9:49 ` Peter Zijlstra
  2018-08-03 11:41   ` Thomas Gleixner
  2 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2018-08-02  9:49 UTC (permalink / raw)
  To: Sodagudi Prasad; +Cc: mingo, gregkh, bigeasy, tglx, isaacm, linux-kernel

On Wed, Aug 01, 2018 at 06:34:40PM -0700, Sodagudi Prasad wrote:
> Due to cross migration of tasks between cpu7 and cpu3, migration/7 has
> started executing and waits for the migration/3 task, so that they can
> proceed within the multi cpu stop state machine together.
> Unfortunately stress-ng-affin is affine to cpu7, and since migration 7 has
> started running, and has monopolized cpu7’s execution, stress-ng will never
> run on cpu7, and cpu3’s migration task is never woken up.

> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index e190d1e..f932e1e 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -87,9 +87,9 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct
> cpu_stop_work *work)
>                 __cpu_stop_queue_work(stopper, work, &wakeq);
>         else if (work->done)
>                 cpu_stop_signal_done(work->done);
> -       raw_spin_unlock_irqrestore(&stopper->lock, flags);
> 
>         wake_up_q(&wakeq);
> +       raw_spin_unlock_irqrestore(&stopper->lock, flags);
> 

So why didn't you do the 'obvious' parallel to what you did for
cpu_stop_queue_two_works(), namely:

--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -81,6 +81,7 @@ static bool cpu_stop_queue_work(unsigned
 	unsigned long flags;
 	bool enabled;
 
+	preempt_disable();
 	raw_spin_lock_irqsave(&stopper->lock, flags);
 	enabled = stopper->enabled;
 	if (enabled)
@@ -90,6 +91,7 @@ static bool cpu_stop_queue_work(unsigned
 	raw_spin_unlock_irqrestore(&stopper->lock, flags);
 
 	wake_up_q(&wakeq);
+	preempt_enable();
 
 	return enabled;
 }
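
As a rough illustration of why bracketing the sequence with
preempt_disable()/preempt_enable() closes the window, here is a toy
single-threaded model. It assumes that releasing the stopper lock is a
preemption point (as the reported stack trace suggests) and that, in the
reported scenario, a caller preempted there never runs again. struct sim and
all sim_*() names are hypothetical, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the caller's preemption state; not kernel code. */
struct sim {
	int preempt_count;	/* > 0: preemption is disabled */
	bool need_resched;	/* a reschedule request is pending */
	bool scheduled_out;	/* caller was preempted (and, here, never returns) */
	bool stopper_woken;	/* wake_up_q() was reached */
};

/* Preemption only happens when nothing has disabled it. */
static void sim_maybe_preempt(struct sim *s)
{
	if (s->preempt_count == 0 && s->need_resched)
		s->scheduled_out = true;
}

static void sim_preempt_disable(struct sim *s)
{
	s->preempt_count++;
}

static void sim_preempt_enable(struct sim *s)
{
	assert(s->preempt_count > 0);
	if (--s->preempt_count == 0)
		sim_maybe_preempt(s);
}

/* Taking a raw spinlock also disables preemption; releasing re-enables it. */
static void sim_lock(struct sim *s)   { sim_preempt_disable(s); }
static void sim_unlock(struct sim *s) { sim_preempt_enable(s); }

/* Returns true if the stopper thread gets woken in this model. */
static bool sim_queue_work(bool with_fix)
{
	struct sim s = { 0 };

	if (with_fix)
		sim_preempt_disable(&s);

	sim_lock(&s);
	/* Work is queued; a reschedule request arrives while the lock is held. */
	s.need_resched = true;
	sim_unlock(&s);			/* preemption point unless count > 0 */

	if (!s.scheduled_out)
		s.stopper_woken = true;	/* stands in for wake_up_q(&wakeq) */

	if (with_fix)
		sim_preempt_enable(&s);	/* preemption may happen only now */

	return s.stopper_woken;
}
```

In this model sim_queue_work(false) loses the wakeup, while
sim_queue_work(true) defers the preemption until after the wakeup has been
issued.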


* Re: cpu stopper threads and setaffinity leads to deadlock
  2018-08-02  9:49 ` Peter Zijlstra
@ 2018-08-03 11:41   ` Thomas Gleixner
  2018-08-03 18:57     ` Sodagudi Prasad
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2018-08-03 11:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sodagudi Prasad, mingo, gregkh, bigeasy, isaacm, linux-kernel

Prasad.

On Thu, 2 Aug 2018, Peter Zijlstra wrote:
> 
> So why didn't you do the 'obvious' parallel to what you did for
> cpu_stop_queue_two_works(), namely:

Is that patch fixing the issue for you?

> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -81,6 +81,7 @@ static bool cpu_stop_queue_work(unsigned
>  	unsigned long flags;
>  	bool enabled;
>  
> +	preempt_disable();
>  	raw_spin_lock_irqsave(&stopper->lock, flags);
>  	enabled = stopper->enabled;
>  	if (enabled)
> @@ -90,6 +91,7 @@ static bool cpu_stop_queue_work(unsigned
>  	raw_spin_unlock_irqrestore(&stopper->lock, flags);
>  
>  	wake_up_q(&wakeq);
> +	preempt_enable();
>  
>  	return enabled;
>  }
> 


* Re: cpu stopper threads and setaffinity leads to deadlock
  2018-08-03 11:41   ` Thomas Gleixner
@ 2018-08-03 18:57     ` Sodagudi Prasad
  0 siblings, 0 replies; 7+ messages in thread
From: Sodagudi Prasad @ 2018-08-03 18:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, mingo, gregkh, bigeasy, isaacm, linux-kernel

On 2018-08-03 04:41, Thomas Gleixner wrote:
> Prasad.
> 
> On Thu, 2 Aug 2018, Peter Zijlstra wrote:
>> 
>> So why didn't you do the 'obvious' parallel to what you did for
>> cpu_stop_queue_two_works(), namely:
> 
> Is that patch fixing the issue for you?
<Prasad> Hi Thomas and Peter,

Yes. I tested both versions of the patch, and both work on Qualcomm
devices under stress testing of setaffinity and task cross-migration,
which previously led to the deadlock.

-Thanks, Prasad
> 
>> --- a/kernel/stop_machine.c
>> +++ b/kernel/stop_machine.c
>> @@ -81,6 +81,7 @@ static bool cpu_stop_queue_work(unsigned
>>  	unsigned long flags;
>>  	bool enabled;
>> 
>> +	preempt_disable();
>>  	raw_spin_lock_irqsave(&stopper->lock, flags);
>>  	enabled = stopper->enabled;
>>  	if (enabled)
>> @@ -90,6 +91,7 @@ static bool cpu_stop_queue_work(unsigned
>>  	raw_spin_unlock_irqrestore(&stopper->lock, flags);
>> 
>>  	wake_up_q(&wakeq);
>> +	preempt_enable();
>> 
>>  	return enabled;
>>  }
>> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project
