* [RFC] workqueue: Fix irq inversion deadlock in manage_workers()
@ 2017-10-08  9:02 Boqun Feng
  2017-10-08 19:03 ` Tejun Heo
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Boqun Feng @ 2017-10-08  9:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Josef Bacik, Boqun Feng, Peter Zijlstra, Tejun Heo, Lai Jiangshan

Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
lockdep:

| [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
| [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
| [ 1270.473240] -----------------------------------------------------
| [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
| [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
| [ 1270.474994]
| [ 1270.474994] and this task is already holding:
| [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
| [ 1270.476046] which would create a new lock dependency:
| [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
| [ 1270.476949]
| [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
| [ 1270.477553]  (&pool->lock/1){-.-.}
...
| [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
| [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
...
| [ 1270.494735]  Possible interrupt unsafe locking scenario:
| [ 1270.494735]
| [ 1270.495250]        CPU0                    CPU1
| [ 1270.495600]        ----                    ----
| [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
| [ 1270.496295]                                local_irq_disable();
| [ 1270.496753]                                lock(&pool->lock/1);
| [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
| [ 1270.497744]   <Interrupt>
| [ 1270.497948]     lock(&pool->lock/1);

This will cause an irq inversion deadlock if the above locking
scenario happens.
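
For illustration, here is a generic sketch of the pattern that
produces this kind of dependency (the lock names are placeholders,
not actual kernel symbols): unlocking a mutex while holding a
spinlock taken with spin_lock_irq() makes lockdep record a
HARDIRQ-safe -> HARDIRQ-unsafe dependency, because the mutex unlock
slowpath grabs the mutex's internal wait_lock:

	spin_lock_irq(&some_spinlock);	/* HARDIRQ-safe lock, irqs off */
	...
	mutex_unlock(&some_mutex);	/*
					 * unlock slowpath takes
					 * some_mutex.wait_lock, a
					 * HARDIRQ-unsafe spinlock, so
					 * lockdep records
					 * some_spinlock -> wait_lock
					 */
	...
	spin_unlock_irq(&some_spinlock);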

The root cause of this safe -> unsafe lock order is the
mutex_unlock(pool::manager_arb) in manage_workers() with pool::lock
held. An obvious fix is to drop pool::lock before mutex_unlock() and
re-grab it afterwards, which however would introduce a race condition
between worker_thread() and put_unbound_pool():

put_unbound_pool() will grab both pool::manager_arb and pool::lock to
set all current IDLE workers to DIE, and may wait on the
pool::detach_completion for the last worker to detach from the pool.

And when manage_workers() is called, the calling worker_thread() is in
a non-IDLE state, so if the worker dropped both pool::{manager_arb,
lock} and got delayed long enough for a put_unbound_pool() to run, the
put_unbound_pool() would not switch that worker to DIE. As a result,
the worker will not detach from the pool because it is not DIE, and
put_unbound_pool() will not proceed because it is waiting for the last
worker to detach, hence a deadlock.

To overcome this, put the worker back to IDLE state before it drops
pool::lock in manage_workers(), and make the worker check again whether
it's DIE after it re-grabs the pool::lock. In this way, we fix the
potential deadlock reported by lockdep without introducing another.

Reported-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 kernel/workqueue.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 64d0edf428f8..2ea7b04cc48b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1997,7 +1997,40 @@ static bool manage_workers(struct worker *worker)
 	maybe_create_worker(pool);
 
 	pool->manager = NULL;
+
+	/*
+	 * Put the manager back on ->idle_list; this allows us to drop
+	 * pool->lock safely without racing with put_unbound_pool()
+	 *
+	 * 						<in "manager worker" thread>
+	 * 						worker_thread():
+	 * 						  spin_lock_irq(&pool->lock);
+	 * 						  worker_leave_idle();
+	 * 						  manage_workers(): // return true
+	 * 						    mutex_trylock(&pool->manager_arb);
+	 *						    <without entering idle here>
+	 * 						    spin_unlock_irq(&pool->lock);
+	 * 						    mutex_unlock(&pool->manager_arb);
+	 *
+	 * 	put_unbound_pool():
+	 * 	  mutex_lock(&pool->manager_arb);
+	 * 	  spin_lock_irq(&pool->lock);
+	 * 	  <set IDLE worker to DIE>
+	 * 	  <the manager worker is not set to be DIE, because it's not IDLE>
+	 * 	  ...
+	 * 	  wait_for_completion(&pool->detach_completion);
+	 * 	  <no one will complete() because pool->workers is not empty>
+	 *
+	 * 	  					  spin_lock_irq(&pool->lock);
+	 * 	  					  <pool->worklist is empty, go to sleep>
+	 *
+	 * No one is going to wake up the manager worker; even if someone did,
+	 * it wouldn't complete(->detach_completion), since it's not a DIE worker.
+	 */
+	worker_enter_idle(worker);
+	spin_unlock_irq(&pool->lock);
 	mutex_unlock(&pool->manager_arb);
+	spin_lock_irq(&pool->lock);
 	return true;
 }
 
@@ -2202,6 +2235,7 @@ static int worker_thread(void *__worker)
 woke_up:
 	spin_lock_irq(&pool->lock);
 
+recheck:
 	/* am I supposed to die? */
 	if (unlikely(worker->flags & WORKER_DIE)) {
 		spin_unlock_irq(&pool->lock);
@@ -2216,7 +2250,6 @@ static int worker_thread(void *__worker)
 	}
 
 	worker_leave_idle(worker);
-recheck:
 	/* no more worker necessary? */
 	if (!need_more_worker(pool))
 		goto sleep;
-- 
2.14.1


* Re: [RFC] workqueue: Fix irq inversion deadlock in manage_workers()
  2017-10-08  9:02 [RFC] workqueue: Fix irq inversion deadlock in manage_workers() Boqun Feng
@ 2017-10-08 19:03 ` Tejun Heo
  2017-10-09  3:24   ` Boqun Feng
  2017-10-09  6:42   ` Peter Zijlstra
  2017-10-09  9:40 ` Lai Jiangshan
  2017-10-09 13:21 ` [PATCH workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag Tejun Heo
  2 siblings, 2 replies; 19+ messages in thread
From: Tejun Heo @ 2017-10-08 19:03 UTC (permalink / raw)
  To: Boqun Feng; +Cc: linux-kernel, Josef Bacik, Peter Zijlstra, Lai Jiangshan

Hello, Boqun.

On Sun, Oct 08, 2017 at 05:02:23PM +0800, Boqun Feng wrote:
> Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
> lockdep:
> 
> | [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
> | [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
> | [ 1270.473240] -----------------------------------------------------
> | [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> | [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
> | [ 1270.474994]
> | [ 1270.474994] and this task is already holding:
> | [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
> | [ 1270.476046] which would create a new lock dependency:
> | [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
> | [ 1270.476949]
> | [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
> | [ 1270.477553]  (&pool->lock/1){-.-.}
> ...
> | [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
> | [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
> ...
> | [ 1270.494735]  Possible interrupt unsafe locking scenario:
> | [ 1270.494735]
> | [ 1270.495250]        CPU0                    CPU1
> | [ 1270.495600]        ----                    ----
> | [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
> | [ 1270.496295]                                local_irq_disable();
> | [ 1270.496753]                                lock(&pool->lock/1);
> | [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
> | [ 1270.497744]   <Interrupt>
> | [ 1270.497948]     lock(&pool->lock/1);
> 
> , which will cause a irq inversion deadlock if the above lock scenario
> happens.
> 
> The root cause of this safe -> unsafe lock order is the
> mutex_unlock(pool::manager_arb) in manage_workers() with pool::lock

So, if I'm not mistaken, this is a regression caused by b9c16a0e1f73
("locking/mutex: Fix lockdep_assert_held() fail"), which seems to have
replaced the irqsave operations inside the mutex code with
unconditional irq ones.

I suppose it's a requirement we can add but that needs to be an
explicit change with backing rationales.

Thanks.

-- 
tejun


* Re: [RFC] workqueue: Fix irq inversion deadlock in manage_workers()
  2017-10-08 19:03 ` Tejun Heo
@ 2017-10-09  3:24   ` Boqun Feng
  2017-10-09  6:42   ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Boqun Feng @ 2017-10-09  3:24 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Josef Bacik, Peter Zijlstra, Lai Jiangshan

On Sun, Oct 08, 2017 at 07:03:47PM +0000, Tejun Heo wrote:
> Hello, Boqun.
> 

Hi Tejun,

> On Sun, Oct 08, 2017 at 05:02:23PM +0800, Boqun Feng wrote:
> > Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
> > lockdep:
> > 
> > | [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
> > | [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
> > | [ 1270.473240] -----------------------------------------------------
> > | [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> > | [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
> > | [ 1270.474994]
> > | [ 1270.474994] and this task is already holding:
> > | [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
> > | [ 1270.476046] which would create a new lock dependency:
> > | [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
> > | [ 1270.476949]
> > | [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
> > | [ 1270.477553]  (&pool->lock/1){-.-.}
> > ...
> > | [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
> > | [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
> > ...
> > | [ 1270.494735]  Possible interrupt unsafe locking scenario:
> > | [ 1270.494735]
> > | [ 1270.495250]        CPU0                    CPU1
> > | [ 1270.495600]        ----                    ----
> > | [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
> > | [ 1270.496295]                                local_irq_disable();
> > | [ 1270.496753]                                lock(&pool->lock/1);
> > | [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
> > | [ 1270.497744]   <Interrupt>
> > | [ 1270.497948]     lock(&pool->lock/1);
> > 
> > , which will cause a irq inversion deadlock if the above lock scenario
> > happens.
> > 
> > The root cause of this safe -> unsafe lock order is the
> > mutex_unlock(pool::manager_arb) in manage_workers() with pool::lock
> 
> So, if I'm not mistaken, this is a regression caused by b9c16a0e1f73
> ("locking/mutex: Fix lockdep_assert_held() fail") which seems to
> replace irqsave operations inside mutex to unconditional irq ones.
> 

Hmm.. I don't think that commit replaced the _irqsave operations with
_irq ones. That commit simply exposed mutex::wait_lock to lockdep
(especially for irq inversion deadlock checking).

Even before that commit, the !DEBUG version of spin_{,un}lock_mutex()
didn't disable interrupts. So this is not a regression; it's a real
potential deadlock, which had not been detected until commit
b9c16a0e1f73.

Regards,
Boqun

> I suppose it's a requirement we can add but that needs to be an
> explicit change with backing rationales.
> 
> Thanks.
> 
> -- 
> tejun


* Re: [RFC] workqueue: Fix irq inversion deadlock in manage_workers()
  2017-10-08 19:03 ` Tejun Heo
  2017-10-09  3:24   ` Boqun Feng
@ 2017-10-09  6:42   ` Peter Zijlstra
  2017-10-09 13:07     ` Tejun Heo
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2017-10-09  6:42 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Boqun Feng, linux-kernel, Josef Bacik, Lai Jiangshan

On Sun, Oct 08, 2017 at 12:03:47PM -0700, Tejun Heo wrote:
> So, if I'm not mistaken, this is a regression caused by b9c16a0e1f73
> ("locking/mutex: Fix lockdep_assert_held() fail") which seems to
> replace irqsave operations inside mutex to unconditional irq ones.

No, it existed before that. You're looking at the DEBUG_MUTEXES case;
the normal case looked like:

diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 4410a4af42a3..6ebc1902f779 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -9,10 +9,6 @@
  * !CONFIG_DEBUG_MUTEXES case. Most of them are NOPs:
  */
 
-#define spin_lock_mutex(lock, flags) \
-		do { spin_lock(lock); (void)(flags); } while (0)
-#define spin_unlock_mutex(lock, flags) \
-		do { spin_unlock(lock); (void)(flags); } while (0)
 #define mutex_remove_waiter(lock, waiter, task) \
 		__list_del((waiter)->list.prev, (waiter)->list.next)
 

Which is exactly what lives today.


* Re: [RFC] workqueue: Fix irq inversion deadlock in manage_workers()
  2017-10-08  9:02 [RFC] workqueue: Fix irq inversion deadlock in manage_workers() Boqun Feng
  2017-10-08 19:03 ` Tejun Heo
@ 2017-10-09  9:40 ` Lai Jiangshan
  2017-10-09 12:40   ` Boqun Feng
  2017-10-09 15:24   ` Peter Zijlstra
  2017-10-09 13:21 ` [PATCH workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag Tejun Heo
  2 siblings, 2 replies; 19+ messages in thread
From: Lai Jiangshan @ 2017-10-09  9:40 UTC (permalink / raw)
  To: Boqun Feng; +Cc: LKML, Josef Bacik, Peter Zijlstra, Tejun Heo

On Sun, Oct 8, 2017 at 5:02 PM, Boqun Feng <boqun.feng@gmail.com> wrote:
> Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
> lockdep:
>
> | [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
> | [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
> | [ 1270.473240] -----------------------------------------------------
> | [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> | [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
> | [ 1270.474994]
> | [ 1270.474994] and this task is already holding:
> | [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
> | [ 1270.476046] which would create a new lock dependency:
> | [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
> | [ 1270.476949]
> | [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
> | [ 1270.477553]  (&pool->lock/1){-.-.}
> ...
> | [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
> | [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
> ...
> | [ 1270.494735]  Possible interrupt unsafe locking scenario:
> | [ 1270.494735]
> | [ 1270.495250]        CPU0                    CPU1
> | [ 1270.495600]        ----                    ----
> | [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
> | [ 1270.496295]                                local_irq_disable();
> | [ 1270.496753]                                lock(&pool->lock/1);
> | [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
> | [ 1270.497744]   <Interrupt>
> | [ 1270.497948]     lock(&pool->lock/1);
>
> , which will cause a irq inversion deadlock if the above lock scenario
> happens.
>
> The root cause of this safe -> unsafe lock order is the
> mutex_unlock(pool::manager_arb) in manage_workers() with pool::lock
> held.

I don't think this kind of pattern is very seldom; I remember seeing
several instances of it:
  mutex_lock();
  do_something();
  spin_lock_irq();
  record_the_state_for_do_something();
  // keep the spin lock held to hold the state for do_more_things().
  mutex_unlock(); // unlock() is suggested to be called right when exiting the C.S.
  do_more_things();
  spin_unlock_irq();

Has all code of this pattern been removed?
Could it be possible that the mutex will be changed to allow it?

(If the mutex can't be changed...)
I think a slightly more proper fix is to move the 'mutex_unlock();'
down. In the case of manager_arb, 'mutex_unlock();' could be called in
process_one_work() and before the worker sleeps. However, a variable
might be needed to indicate whether it should be called.
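
A rough sketch of that idea, just for illustration (the field name
->deferred_arb_unlock is made up for this sketch and does not exist
in the kernel):

	/* in manage_workers(), instead of unlocking under pool->lock: */
	pool->manager = NULL;
	worker->deferred_arb_unlock = true;	/* hypothetical flag: we still own manager_arb */
	return true;

	/* later, in worker_thread()/process_one_work(), at a point where
	 * pool->lock is no longer held: */
	if (worker->deferred_arb_unlock) {
		worker->deferred_arb_unlock = false;
		mutex_unlock(&pool->manager_arb);
	}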

It doesn't mean I don't like this fix. But 'spin_unlock_irq(&pool->lock);'
before 'mutex_unlock();' makes things more complicated, and more time
is needed to understand it. I also leave one concern about
worker_leave_idle() below.

> An obvious fix is dropping the pool::lock before mutex_unlock()
> and re-grabing afterwards, which however will introduce a race condition
> between worker_thread() and put_unbound_pool():
>
> put_unbound_pool() will grab both pool::manager_arb and pool::lock to
> set all current IDLE workers to DIE, and may wait on the
> pool::detach_completion for the last worker to detach from the pool.
>
> And when manage_workers() is called, the caller worker_thread is in
> non-ILDE state, so if the worker dropped both pool::{manager_arb, lock}
> and got delayed for a while long enough for a put_unbound_pool(), the
> put_unbound_pool() would not switch that worker to DIE. As a result, the
> worker will not detach from the pool as it's not DIE and the
> put_unbound_pool() will not proceed as it's waiting for the last worker
> to detach, therefore deadlock.
>
> To overcome this, put the worker back to IDLE state before it drops
> pool::lock in manage_workers(), and make the worker check again whether
> it's DIE after it re-grabs the pool::lock. In this way, we fix the
> potential deadlock reported by lockdep without introducing another.
>
> Reported-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/workqueue.c | 35 ++++++++++++++++++++++++++++++++++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 64d0edf428f8..2ea7b04cc48b 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1997,7 +1997,40 @@ static bool manage_workers(struct worker *worker)
>         maybe_create_worker(pool);
>
>         pool->manager = NULL;
> +
> +       /*
> +        * Put the manager back to ->idle_list, this allows us to drop the
> +        * pool->lock safely without racing with put_unbound_pool()
> +        *
> +        *                                              <in "manager worker" thread>
> +        *                                              worker_thread():
> +        *                                                spin_lock_irq(&pool->lock);
> +        *                                                worker_leave_idle();
> +        *                                                manage_workers(): // return true
> +        *                                                  mutex_trylock(&pool->manager_arb);
> +        *                                                  <without entering idle here>
> +        *                                                  spin_unlock_irq(&pool->lock);
> +        *                                                  mutex_unlock(&pool->manager_arb);
> +        *
> +        *      put_unbound_pool():
> +        *        mutex_lock(&pool->manager_arb);
> +        *        spin_lock_irq(&pool->lock);
> +        *        <set ILDE worker to DIE>
> +        *        <the manager worker is not set to be DIE, because it's not IDLE>
> +        *        ...
> +        *        wait_for_completion(&pool->detach_completion);
> +        *        <no one will complete() because pool->workers is not empty>
> +        *
> +        *                                                spin_lock_irq(&pool->lock);
> +        *                                                <pool->worklist is empty, go to sleep>
> +        *
> +        * No one is going to wake up the manager worker, even so, it won't
> +        * complete(->detach_completion), since it's not a DIE worker.
> +        */
> +       worker_enter_idle(worker);
> +       spin_unlock_irq(&pool->lock);
>         mutex_unlock(&pool->manager_arb);
> +       spin_lock_irq(&pool->lock);
>         return true;
>  }
>
> @@ -2202,6 +2235,7 @@ static int worker_thread(void *__worker)
>  woke_up:
>         spin_lock_irq(&pool->lock);
>
> +recheck:
>         /* am I supposed to die? */
>         if (unlikely(worker->flags & WORKER_DIE)) {
>                 spin_unlock_irq(&pool->lock);
> @@ -2216,7 +2250,6 @@ static int worker_thread(void *__worker)
>         }
>
>         worker_leave_idle(worker);

I think worker_leave_idle() might be called multiple times,
which might cause bugs, since recheck is moved up.

> -recheck:
>         /* no more worker necessary? */
>         if (!need_more_worker(pool))
>                 goto sleep;
> --
> 2.14.1
>


* Re: [RFC] workqueue: Fix irq inversion deadlock in manage_workers()
  2017-10-09  9:40 ` Lai Jiangshan
@ 2017-10-09 12:40   ` Boqun Feng
  2017-10-09 15:24   ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Boqun Feng @ 2017-10-09 12:40 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: LKML, Josef Bacik, Peter Zijlstra, Tejun Heo

On Mon, Oct 09, 2017 at 09:40:43AM +0000, Lai Jiangshan wrote:
[...]
> > Reported-by: Josef Bacik <josef@toxicpanda.com>
> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > ---
> >  kernel/workqueue.c | 35 ++++++++++++++++++++++++++++++++++-
> >  1 file changed, 34 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 64d0edf428f8..2ea7b04cc48b 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -1997,7 +1997,40 @@ static bool manage_workers(struct worker *worker)
> >         maybe_create_worker(pool);
> >
> >         pool->manager = NULL;
> > +
> > +       /*
> > +        * Put the manager back to ->idle_list, this allows us to drop the
> > +        * pool->lock safely without racing with put_unbound_pool()
> > +        *
> > +        *                                              <in "manager worker" thread>
> > +        *                                              worker_thread():
> > +        *                                                spin_lock_irq(&pool->lock);
> > +        *                                                worker_leave_idle();
> > +        *                                                manage_workers(): // return true
> > +        *                                                  mutex_trylock(&pool->manager_arb);
> > +        *                                                  <without entering idle here>
> > +        *                                                  spin_unlock_irq(&pool->lock);
> > +        *                                                  mutex_unlock(&pool->manager_arb);
> > +        *
> > +        *      put_unbound_pool():
> > +        *        mutex_lock(&pool->manager_arb);
> > +        *        spin_lock_irq(&pool->lock);
> > +        *        <set ILDE worker to DIE>
> > +        *        <the manager worker is not set to be DIE, because it's not IDLE>
> > +        *        ...
> > +        *        wait_for_completion(&pool->detach_completion);
> > +        *        <no one will complete() because pool->workers is not empty>
> > +        *
> > +        *                                                spin_lock_irq(&pool->lock);
> > +        *                                                <pool->worklist is empty, go to sleep>
> > +        *
> > +        * No one is going to wake up the manager worker, even so, it won't
> > +        * complete(->detach_completion), since it's not a DIE worker.
> > +        */
> > +       worker_enter_idle(worker);
> > +       spin_unlock_irq(&pool->lock);
> >         mutex_unlock(&pool->manager_arb);
> > +       spin_lock_irq(&pool->lock);
> >         return true;
> >  }
> >
> > @@ -2202,6 +2235,7 @@ static int worker_thread(void *__worker)
> >  woke_up:
> >         spin_lock_irq(&pool->lock);
> >
> > +recheck:
> >         /* am I supposed to die? */
> >         if (unlikely(worker->flags & WORKER_DIE)) {
> >                 spin_unlock_irq(&pool->lock);
> > @@ -2216,7 +2250,6 @@ static int worker_thread(void *__worker)
> >         }
> >
> >         worker_leave_idle(worker);
> 
> I think worker_leave_idle() might be called multiple times,

"Multiple times" is not a problem, the problem is unbalanced
enter/leave, right?

> which might cause bugs, since recheck is moved up.
> 

And that's impossible, as we only goto recheck if manage_workers()
returns true, and that means the worker took the manager_arb and set
itself idle again. So it's fine.

Regards,
Boqun

> > -recheck:
> >         /* no more worker necessary? */
> >         if (!need_more_worker(pool))
> >                 goto sleep;
> > --
> > 2.14.1
> >


* Re: [RFC] workqueue: Fix irq inversion deadlock in manage_workers()
  2017-10-09  6:42   ` Peter Zijlstra
@ 2017-10-09 13:07     ` Tejun Heo
  0 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2017-10-09 13:07 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Boqun Feng, linux-kernel, Josef Bacik, Lai Jiangshan

On Mon, Oct 09, 2017 at 08:42:57AM +0200, Peter Zijlstra wrote:
> On Sun, Oct 08, 2017 at 12:03:47PM -0700, Tejun Heo wrote:
> > So, if I'm not mistaken, this is a regression caused by b9c16a0e1f73
> > ("locking/mutex: Fix lockdep_assert_held() fail") which seems to
> > replace irqsave operations inside mutex to unconditional irq ones.
> 
> No, it existed before that. You're looking at the DEBUG_MUTEX case, the
> normal case looked like:

Ah, I see.  That mutex usage was a stretch anyway.  I'll get rid of
it.

Thanks.

-- 
tejun


* [PATCH workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag
  2017-10-08  9:02 [RFC] workqueue: Fix irq inversion deadlock in manage_workers() Boqun Feng
  2017-10-08 19:03 ` Tejun Heo
  2017-10-09  9:40 ` Lai Jiangshan
@ 2017-10-09 13:21 ` Tejun Heo
  2017-10-09 14:21   ` Boqun Feng
                     ` (2 more replies)
  2 siblings, 3 replies; 19+ messages in thread
From: Tejun Heo @ 2017-10-09 13:21 UTC (permalink / raw)
  To: Boqun Feng; +Cc: linux-kernel, Josef Bacik, Peter Zijlstra, Lai Jiangshan

Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
lockdep:

 [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
 [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
 [ 1270.473240] -----------------------------------------------------
 [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
 [ 1270.474994]
 [ 1270.474994] and this task is already holding:
 [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
 [ 1270.476046] which would create a new lock dependency:
 [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
 [ 1270.476949]
 [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
 [ 1270.477553]  (&pool->lock/1){-.-.}
 ...
 [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
 [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
 ...
 [ 1270.494735]  Possible interrupt unsafe locking scenario:
 [ 1270.494735]
 [ 1270.495250]        CPU0                    CPU1
 [ 1270.495600]        ----                    ----
 [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
 [ 1270.496295]                                local_irq_disable();
 [ 1270.496753]                                lock(&pool->lock/1);
 [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
 [ 1270.497744]   <Interrupt>
 [ 1270.497948]     lock(&pool->lock/1);

This will cause an irq inversion deadlock if the above locking
scenario happens.

The root cause of this safe -> unsafe lock order is the
mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
held.

Unlocking a mutex while holding an irq spinlock was never safe and
this problem has been around forever, but it never got noticed because
the mutex is usually only trylocked while holding the irq spinlock,
making actual failures very unlikely, and the lockdep annotation
missed the condition until the recent b9c16a0e1f73 ("locking/mutex:
Fix lockdep_assert_held() fail").

Using a mutex for pool->manager_arb has always been a bit of a
stretch. It primarily is a mechanism to arbitrate managership between
workers, which can easily be done with a pool flag. The only reason it
became a mutex is that the pool destruction path wants to exclude
parallel managing operations.

This patch replaces the mutex with a new pool flag, POOL_MANAGER_ACTIVE,
and makes the destruction path wait for the current manager on a wait
queue.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josef Bacik <josef@toxicpanda.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: stable@vger.kernel.org
---
Hello,

Boqun, thanks for the patch and explanation; I shamelessly lifted
parts of your patch description. I took an alternative approach
because there's no reason to make this any more complex when the issue
is essentially caused by abusing a mutex.

The patch seems to work fine but, Lai, can you please review the
patch?

Thanks.

 kernel/workqueue.c |   39 ++++++++++++++++++---------------------
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 64d0edf..8739b6de 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -68,6 +68,7 @@ enum {
 	 * attach_mutex to avoid changing binding state while
 	 * worker_attach_to_pool() is in progress.
 	 */
+	POOL_MANAGER_ACTIVE	= 1 << 0,	/* being managed */
 	POOL_DISASSOCIATED	= 1 << 2,	/* cpu can't serve workers */
 
 	/* worker flags */
@@ -165,7 +166,6 @@ struct worker_pool {
 						/* L: hash of busy workers */
 
 	/* see manage_workers() for details on the two manager mutexes */
-	struct mutex		manager_arb;	/* manager arbitration */
 	struct worker		*manager;	/* L: purely informational */
 	struct mutex		attach_mutex;	/* attach/detach exclusion */
 	struct list_head	workers;	/* A: attached workers */
@@ -299,6 +299,7 @@ static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
 
 static DEFINE_MUTEX(wq_pool_mutex);	/* protects pools and workqueues list */
 static DEFINE_SPINLOCK(wq_mayday_lock);	/* protects wq->maydays list */
+static DECLARE_WAIT_QUEUE_HEAD(wq_manager_wait); /* wait for manager to go away */
 
 static LIST_HEAD(workqueues);		/* PR: list of all workqueues */
 static bool workqueue_freezing;		/* PL: have wqs started freezing? */
@@ -801,7 +802,7 @@ static bool need_to_create_worker(struct worker_pool *pool)
 /* Do we have too many workers and should some go away? */
 static bool too_many_workers(struct worker_pool *pool)
 {
-	bool managing = mutex_is_locked(&pool->manager_arb);
+	bool managing = pool->flags & POOL_MANAGER_ACTIVE;
 	int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
 	int nr_busy = pool->nr_workers - nr_idle;
 
@@ -1980,24 +1981,17 @@ static bool manage_workers(struct worker *worker)
 {
 	struct worker_pool *pool = worker->pool;
 
-	/*
-	 * Anyone who successfully grabs manager_arb wins the arbitration
-	 * and becomes the manager.  mutex_trylock() on pool->manager_arb
-	 * failure while holding pool->lock reliably indicates that someone
-	 * else is managing the pool and the worker which failed trylock
-	 * can proceed to executing work items.  This means that anyone
-	 * grabbing manager_arb is responsible for actually performing
-	 * manager duties.  If manager_arb is grabbed and released without
-	 * actual management, the pool may stall indefinitely.
-	 */
-	if (!mutex_trylock(&pool->manager_arb))
+	if (pool->flags & POOL_MANAGER_ACTIVE)
 		return false;
+
+	pool->flags |= POOL_MANAGER_ACTIVE;
 	pool->manager = worker;
 
 	maybe_create_worker(pool);
 
 	pool->manager = NULL;
-	mutex_unlock(&pool->manager_arb);
+	pool->flags &= ~POOL_MANAGER_ACTIVE;
+	wake_up(&wq_manager_wait);
 	return true;
 }
 
@@ -3248,7 +3242,6 @@ static int init_worker_pool(struct worker_pool *pool)
 	setup_timer(&pool->mayday_timer, pool_mayday_timeout,
 		    (unsigned long)pool);
 
-	mutex_init(&pool->manager_arb);
 	mutex_init(&pool->attach_mutex);
 	INIT_LIST_HEAD(&pool->workers);
 
@@ -3318,13 +3311,14 @@ static void put_unbound_pool(struct worker_pool *pool)
 	hash_del(&pool->hash_node);
 
 	/*
-	 * Become the manager and destroy all workers.  Grabbing
-	 * manager_arb prevents @pool's workers from blocking on
-	 * attach_mutex.
+	 * Become the manager and destroy all workers.  Becoming manager
+	 * prevents @pool's workers from blocking on attach_mutex.
 	 */
-	mutex_lock(&pool->manager_arb);
-
 	spin_lock_irq(&pool->lock);
+	wait_event_lock_irq(wq_manager_wait,
+			    !(pool->flags & POOL_MANAGER_ACTIVE), pool->lock);
+	pool->flags |= POOL_MANAGER_ACTIVE;
+
 	while ((worker = first_idle_worker(pool)))
 		destroy_worker(worker);
 	WARN_ON(pool->nr_workers || pool->nr_idle);
@@ -3338,7 +3332,10 @@ static void put_unbound_pool(struct worker_pool *pool)
 	if (pool->detach_completion)
 		wait_for_completion(pool->detach_completion);
 
-	mutex_unlock(&pool->manager_arb);
+	spin_lock_irq(&pool->lock);
+	pool->flags &= ~POOL_MANAGER_ACTIVE;
+	wake_up(&wq_manager_wait);
+	spin_unlock_irq(&pool->lock);
 
 	/* shut down the timers */
 	del_timer_sync(&pool->idle_timer);


* Re: [PATCH workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag
  2017-10-09 13:21 ` [PATCH workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag Tejun Heo
@ 2017-10-09 14:21   ` Boqun Feng
  2017-10-09 14:47     ` Tejun Heo
  2017-10-09 15:02   ` Lai Jiangshan
  2017-10-09 15:04   ` [PATCH v2 " Tejun Heo
  2 siblings, 1 reply; 19+ messages in thread
From: Boqun Feng @ 2017-10-09 14:21 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Josef Bacik, Peter Zijlstra, Lai Jiangshan

On Mon, Oct 09, 2017 at 01:21:04PM +0000, Tejun Heo wrote:
> Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
> lockdep:
> 
>  [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
>  [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
>  [ 1270.473240] -----------------------------------------------------
>  [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
>  [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
>  [ 1270.474994]
>  [ 1270.474994] and this task is already holding:
>  [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
>  [ 1270.476046] which would create a new lock dependency:
>  [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
>  [ 1270.476949]
>  [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
>  [ 1270.477553]  (&pool->lock/1){-.-.}
>  ...
>  [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
>  [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
>  ...
>  [ 1270.494735]  Possible interrupt unsafe locking scenario:
>  [ 1270.494735]
>  [ 1270.495250]        CPU0                    CPU1
>  [ 1270.495600]        ----                    ----
>  [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
>  [ 1270.496295]                                local_irq_disable();
>  [ 1270.496753]                                lock(&pool->lock/1);
>  [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
>  [ 1270.497744]   <Interrupt>
>  [ 1270.497948]     lock(&pool->lock/1);
> 
> , which will cause a irq inversion deadlock if the above lock scenario
> happens.
> 
> The root cause of this safe -> unsafe lock order is the
> mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
> held.
> 
> Unlocking mutex while holding an irq spinlock was never safe and this
> problem has been around forever but it never got noticed because the
> only time the mutex is usually trylocked while holding irqlock making
> actual failures very unlikely and lockdep annotation missed the
> condition until the recent b9c16a0e1f73 ("locking/mutex: Fix
> lockdep_assert_held() fail").
> 
> Using mutex for pool->manager_arb has always been a bit of stretch.
> It primarily is an mechanism to arbitrate managership between workers
> which can easily be done with a pool flag.  The only reason it became
> a mutex is that pool destruction path wants to exclude parallel
> managing operations.
> 
> This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
> and make the destruction path wait for the current manager on a wait
> queue.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Josef Bacik <josef@toxicpanda.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: stable@vger.kernel.org
> ---
> Hello,
> 
> Boqun, thanks for the patch and explanation and I shamelessly lifted
> parts of your patch description.  I took an alternative approach

That's fine ;-)

> because there's no reason to make this any more complex when the issue
> is essentially caused by abusing mutex.
> 

Agreed.

> The patch seems to work fine but, Lai, can you please review the
> patch?
> 
> Thanks.
> 
>  kernel/workqueue.c |   39 ++++++++++++++++++---------------------
>  1 file changed, 18 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 64d0edf..8739b6de 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -68,6 +68,7 @@ enum {
>  	 * attach_mutex to avoid changing binding state while
>  	 * worker_attach_to_pool() is in progress.
>  	 */
> +	POOL_MANAGER_ACTIVE	= 1 << 0,	/* being managed */
>  	POOL_DISASSOCIATED	= 1 << 2,	/* cpu can't serve workers */
>  
>  	/* worker flags */
> @@ -165,7 +166,6 @@ struct worker_pool {
>  						/* L: hash of busy workers */
>  
>  	/* see manage_workers() for details on the two manager mutexes */
> -	struct mutex		manager_arb;	/* manager arbitration */
>  	struct worker		*manager;	/* L: purely informational */
>  	struct mutex		attach_mutex;	/* attach/detach exclusion */
>  	struct list_head	workers;	/* A: attached workers */
> @@ -299,6 +299,7 @@ static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
>  
>  static DEFINE_MUTEX(wq_pool_mutex);	/* protects pools and workqueues list */
>  static DEFINE_SPINLOCK(wq_mayday_lock);	/* protects wq->maydays list */
> +static DECLARE_WAIT_QUEUE_HEAD(wq_manager_wait); /* wait for manager to go away */

I think this wait_queue_head would be better as a per-pool one rather
than one shared among all pools?
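
Something like the following minimal sketch of the per-pool variant
(illustrative only, not a tested patch):

	/* struct worker_pool would gain its own wait queue ... */
	wait_queue_head_t	manager_wait;	/* waiting for manager to go away */

	/* ... initialized in init_worker_pool(): */
	init_waitqueue_head(&pool->manager_wait);

	/* put_unbound_pool() would wait on the pool's own queue: */
	wait_event_lock_irq(pool->manager_wait,
			    !(pool->flags & POOL_MANAGER_ACTIVE), pool->lock);

	/* and manage_workers() would wake it instead of the global one: */
	wake_up(&pool->manager_wait);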

>  
>  static LIST_HEAD(workqueues);		/* PR: list of all workqueues */
>  static bool workqueue_freezing;		/* PL: have wqs started freezing? */
> @@ -801,7 +802,7 @@ static bool need_to_create_worker(struct worker_pool *pool)
>  /* Do we have too many workers and should some go away? */
>  static bool too_many_workers(struct worker_pool *pool)
>  {
> -	bool managing = mutex_is_locked(&pool->manager_arb);
> +	bool managing = pool->flags & POOL_MANAGER_ACTIVE;
>  	int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
>  	int nr_busy = pool->nr_workers - nr_idle;
>  
> @@ -1980,24 +1981,17 @@ static bool manage_workers(struct worker *worker)
>  {
>  	struct worker_pool *pool = worker->pool;
>  
> -	/*
> -	 * Anyone who successfully grabs manager_arb wins the arbitration
> -	 * and becomes the manager.  mutex_trylock() on pool->manager_arb
> -	 * failure while holding pool->lock reliably indicates that someone
> -	 * else is managing the pool and the worker which failed trylock
> -	 * can proceed to executing work items.  This means that anyone
> -	 * grabbing manager_arb is responsible for actually performing
> -	 * manager duties.  If manager_arb is grabbed and released without
> -	 * actual management, the pool may stall indefinitely.
> -	 */
> -	if (!mutex_trylock(&pool->manager_arb))
> +	if (pool->flags & POOL_MANAGER_ACTIVE)
>  		return false;
> +
> +	pool->flags |= POOL_MANAGER_ACTIVE;
>  	pool->manager = worker;
>  
>  	maybe_create_worker(pool);
>  
>  	pool->manager = NULL;
> -	mutex_unlock(&pool->manager_arb);
> +	pool->flags &= ~POOL_MANAGER_ACTIVE;
> +	wake_up(&wq_manager_wait);
>  	return true;
>  }
>  
> @@ -3248,7 +3242,6 @@ static int init_worker_pool(struct worker_pool *pool)
>  	setup_timer(&pool->mayday_timer, pool_mayday_timeout,
>  		    (unsigned long)pool);
>  
> -	mutex_init(&pool->manager_arb);
>  	mutex_init(&pool->attach_mutex);
>  	INIT_LIST_HEAD(&pool->workers);
>  
> @@ -3318,13 +3311,14 @@ static void put_unbound_pool(struct worker_pool *pool)
>  	hash_del(&pool->hash_node);
>  
>  	/*
> -	 * Become the manager and destroy all workers.  Grabbing
> -	 * manager_arb prevents @pool's workers from blocking on
> -	 * attach_mutex.
> +	 * Become the manager and destroy all workers.  Becoming manager
> +	 * prevents @pool's workers from blocking on attach_mutex.
>  	 */
> -	mutex_lock(&pool->manager_arb);
> -
>  	spin_lock_irq(&pool->lock);
> +	wait_event_lock_irq(wq_manager_wait,
> +			    !(pool->flags & POOL_MANAGER_ACTIVE), pool->lock);
> +	pool->flags |= POOL_MANAGER_ACTIVE;
> +
>  	while ((worker = first_idle_worker(pool)))
>  		destroy_worker(worker);
>  	WARN_ON(pool->nr_workers || pool->nr_idle);
> @@ -3338,7 +3332,10 @@ static void put_unbound_pool(struct worker_pool *pool)
>  	if (pool->detach_completion)
>  		wait_for_completion(pool->detach_completion);
>  
> -	mutex_unlock(&pool->manager_arb);
> +	spin_lock_irq(&pool->lock);
> +	pool->flags &= ~POOL_MANAGER_ACTIVE;
> +	wake_up(&wq_manager_wait);
> +	spin_unlock_irq(&pool->lock);
>  

Is the above code necessary? IIUC, we are going to free the pool
entirely, so whether the manager is active is pointless here, and no
one is waiting for the ->flags of *this* pool to become
!POOL_MANAGER_ACTIVE.

Am I missing something subtle here?

Regards,
Boqun

>  	/* shut down the timers */
>  	del_timer_sync(&pool->idle_timer);


* Re: [PATCH workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag
  2017-10-09 14:21   ` Boqun Feng
@ 2017-10-09 14:47     ` Tejun Heo
  0 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2017-10-09 14:47 UTC (permalink / raw)
  To: Boqun Feng; +Cc: linux-kernel, Josef Bacik, Peter Zijlstra, Lai Jiangshan

Hello, Boqun.

On Mon, Oct 09, 2017 at 10:21:17PM +0800, Boqun Feng wrote:
> > +static DECLARE_WAIT_QUEUE_HEAD(wq_manager_wait); /* wait for manager to go away */
> 
> I think this wait_queue_head better be a per-pool one rather than shared
> among pools?

It should be fine either way.  All the involved operations are pretty
low frequency.

> > @@ -3338,7 +3332,10 @@ static void put_unbound_pool(struct worker_pool *pool)
> >  	if (pool->detach_completion)
> >  		wait_for_completion(pool->detach_completion);
> >  
> > -	mutex_unlock(&pool->manager_arb);
> > +	spin_lock_irq(&pool->lock);
> > +	pool->flags &= ~POOL_MANAGER_ACTIVE;
> > +	wake_up(&wq_manager_wait);
> > +	spin_unlock_irq(&pool->lock);
> >  
> 
> Is the above code necesarry? IIUC, we are going to free the pool
> entirely, so whether manager is active is pointless here and no one is
> waiting for the ->flags of *this* pool to be !POOL_MANAGER_ACTIVE.
> 
> Am I missing something subtle here?

Ah, true.  I'll drop the above chunk.

Thanks.

-- 
tejun


* Re: [PATCH workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag
  2017-10-09 13:21 ` [PATCH workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag Tejun Heo
  2017-10-09 14:21   ` Boqun Feng
@ 2017-10-09 15:02   ` Lai Jiangshan
  2017-10-09 15:08     ` Tejun Heo
  2017-10-09 15:04   ` [PATCH v2 " Tejun Heo
  2 siblings, 1 reply; 19+ messages in thread
From: Lai Jiangshan @ 2017-10-09 15:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Boqun Feng, LKML, Josef Bacik, Peter Zijlstra

On Mon, Oct 9, 2017 at 9:21 PM, Tejun Heo <tj@kernel.org> wrote:
> Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
> lockdep:
>
>  [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
>  [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
>  [ 1270.473240] -----------------------------------------------------
>  [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
>  [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
>  [ 1270.474994]
>  [ 1270.474994] and this task is already holding:
>  [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
>  [ 1270.476046] which would create a new lock dependency:
>  [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
>  [ 1270.476949]
>  [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
>  [ 1270.477553]  (&pool->lock/1){-.-.}
>  ...
>  [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
>  [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
>  ...
>  [ 1270.494735]  Possible interrupt unsafe locking scenario:
>  [ 1270.494735]
>  [ 1270.495250]        CPU0                    CPU1
>  [ 1270.495600]        ----                    ----
>  [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
>  [ 1270.496295]                                local_irq_disable();
>  [ 1270.496753]                                lock(&pool->lock/1);
>  [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
>  [ 1270.497744]   <Interrupt>
>  [ 1270.497948]     lock(&pool->lock/1);
>
> , which will cause a irq inversion deadlock if the above lock scenario
> happens.
>
> The root cause of this safe -> unsafe lock order is the
> mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
> held.
>
> Unlocking mutex while holding an irq spinlock was never safe and this
> problem has been around forever but it never got noticed because the
> only time the mutex is usually trylocked while holding irqlock making
> actual failures very unlikely and lockdep annotation missed the
> condition until the recent b9c16a0e1f73 ("locking/mutex: Fix
> lockdep_assert_held() fail").
>
> Using mutex for pool->manager_arb has always been a bit of stretch.
> It primarily is an mechanism to arbitrate managership between workers
> which can easily be done with a pool flag.  The only reason it became
> a mutex is that pool destruction path wants to exclude parallel
> managing operations.
>
> This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
> and make the destruction path wait for the current manager on a wait
> queue.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Josef Bacik <josef@toxicpanda.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: stable@vger.kernel.org
> ---
> Hello,
>
> Boqun, thanks for the patch and explanation and I shamelessly lifted
> parts of your patch description.  I took an alternative approach
> because there's no reason to make this any more complex when the issue
> is essentially caused by abusing mutex.
>
> The patch seems to work fine but, Lai, can you please review the
> patch?
>
> Thanks.
>
>  kernel/workqueue.c |   39 ++++++++++++++++++---------------------
>  1 file changed, 18 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 64d0edf..8739b6de 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -68,6 +68,7 @@ enum {
>          * attach_mutex to avoid changing binding state while
>          * worker_attach_to_pool() is in progress.
>          */
> +       POOL_MANAGER_ACTIVE     = 1 << 0,       /* being managed */
>         POOL_DISASSOCIATED      = 1 << 2,       /* cpu can't serve workers */
>
>         /* worker flags */
> @@ -165,7 +166,6 @@ struct worker_pool {
>                                                 /* L: hash of busy workers */
>
>         /* see manage_workers() for details on the two manager mutexes */
> -       struct mutex            manager_arb;    /* manager arbitration */
>         struct worker           *manager;       /* L: purely informational */
>         struct mutex            attach_mutex;   /* attach/detach exclusion */
>         struct list_head        workers;        /* A: attached workers */
> @@ -299,6 +299,7 @@ static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
>
>  static DEFINE_MUTEX(wq_pool_mutex);    /* protects pools and workqueues list */
>  static DEFINE_SPINLOCK(wq_mayday_lock);        /* protects wq->maydays list */
> +static DECLARE_WAIT_QUEUE_HEAD(wq_manager_wait); /* wait for manager to go away */
>
>  static LIST_HEAD(workqueues);          /* PR: list of all workqueues */
>  static bool workqueue_freezing;                /* PL: have wqs started freezing? */
> @@ -801,7 +802,7 @@ static bool need_to_create_worker(struct worker_pool *pool)
>  /* Do we have too many workers and should some go away? */
>  static bool too_many_workers(struct worker_pool *pool)
>  {
> -       bool managing = mutex_is_locked(&pool->manager_arb);
> +       bool managing = pool->flags & POOL_MANAGER_ACTIVE;
>         int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
>         int nr_busy = pool->nr_workers - nr_idle;
>
> @@ -1980,24 +1981,17 @@ static bool manage_workers(struct worker *worker)
>  {
>         struct worker_pool *pool = worker->pool;
>
> -       /*
> -        * Anyone who successfully grabs manager_arb wins the arbitration
> -        * and becomes the manager.  mutex_trylock() on pool->manager_arb
> -        * failure while holding pool->lock reliably indicates that someone
> -        * else is managing the pool and the worker which failed trylock
> -        * can proceed to executing work items.  This means that anyone
> -        * grabbing manager_arb is responsible for actually performing
> -        * manager duties.  If manager_arb is grabbed and released without
> -        * actual management, the pool may stall indefinitely.
> -        */
> -       if (!mutex_trylock(&pool->manager_arb))
> +       if (pool->flags & POOL_MANAGER_ACTIVE)
>                 return false;
> +
> +       pool->flags |= POOL_MANAGER_ACTIVE;
>         pool->manager = worker;

Hello, tj

I don't find anything wrong with this patch so far.
It looks good, though I haven't tested it.

I was also thinking about alternative code while reviewing.
The first idea is quite obvious: testing POOL_MANAGER_ACTIVE
can be replaced by testing pool->manager, and then
POOL_MANAGER_ACTIVE is not needed at all, is it?
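
A minimal sketch of that first alternative (illustrative and untested;
it keeps the global wq_manager_wait from your patch):

	static bool manage_workers(struct worker *worker)
	{
		struct worker_pool *pool = worker->pool;

		if (pool->manager)	/* someone else is already managing */
			return false;

		pool->manager = worker;
		maybe_create_worker(pool);
		pool->manager = NULL;

		wake_up(&wq_manager_wait);
		return true;
	}

	/* too_many_workers() and put_unbound_pool() would then test
	 * pool->manager instead of the POOL_MANAGER_ACTIVE flag. */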

The second thing is to make manage_workers()
and put_unbound_pool() exclusive.
Waiting for an event on POOL_MANAGER_ACTIVE (or pool->manager)
is one way. However, the pool's refcnt cannot possibly drop
to zero now, since the caller still holds the pool->lock
and some pwqs of the works on the worklist. So the other way
to enforce the exclusion could be to simply do
get_pwq() on the first pwq of the worklist and put_pwq() when
manage_workers() is done. Then the code about
pool->manager_arb in put_unbound_pool() can be
simply removed.

I just wanted to send my thoughts early; I haven't
thought about it deeply enough yet.

What do you think?

thx
Lai


>
>         maybe_create_worker(pool);
>
>         pool->manager = NULL;
> -       mutex_unlock(&pool->manager_arb);
> +       pool->flags &= ~POOL_MANAGER_ACTIVE;
> +       wake_up(&wq_manager_wait);
>         return true;
>  }
>
> @@ -3248,7 +3242,6 @@ static int init_worker_pool(struct worker_pool *pool)
>         setup_timer(&pool->mayday_timer, pool_mayday_timeout,
>                     (unsigned long)pool);
>
> -       mutex_init(&pool->manager_arb);
>         mutex_init(&pool->attach_mutex);
>         INIT_LIST_HEAD(&pool->workers);
>
> @@ -3318,13 +3311,14 @@ static void put_unbound_pool(struct worker_pool *pool)
>         hash_del(&pool->hash_node);
>
>         /*
> -        * Become the manager and destroy all workers.  Grabbing
> -        * manager_arb prevents @pool's workers from blocking on
> -        * attach_mutex.
> +        * Become the manager and destroy all workers.  Becoming manager
> +        * prevents @pool's workers from blocking on attach_mutex.
>          */
> -       mutex_lock(&pool->manager_arb);
> -
>         spin_lock_irq(&pool->lock);
> +       wait_event_lock_irq(wq_manager_wait,
> +                           !(pool->flags & POOL_MANAGER_ACTIVE), pool->lock);
> +       pool->flags |= POOL_MANAGER_ACTIVE;
> +
>         while ((worker = first_idle_worker(pool)))
>                 destroy_worker(worker);
>         WARN_ON(pool->nr_workers || pool->nr_idle);
> @@ -3338,7 +3332,10 @@ static void put_unbound_pool(struct worker_pool *pool)
>         if (pool->detach_completion)
>                 wait_for_completion(pool->detach_completion);
>
> -       mutex_unlock(&pool->manager_arb);
> +       spin_lock_irq(&pool->lock);
> +       pool->flags &= ~POOL_MANAGER_ACTIVE;
> +       wake_up(&wq_manager_wait);
> +       spin_unlock_irq(&pool->lock);
>
>         /* shut down the timers */
>         del_timer_sync(&pool->idle_timer);


* [PATCH v2 workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag
  2017-10-09 13:21 ` [PATCH workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag Tejun Heo
  2017-10-09 14:21   ` Boqun Feng
  2017-10-09 15:02   ` Lai Jiangshan
@ 2017-10-09 15:04   ` Tejun Heo
  2017-10-10  9:55     ` Lai Jiangshan
  2017-10-10 14:15     ` Tejun Heo
  2 siblings, 2 replies; 19+ messages in thread
From: Tejun Heo @ 2017-10-09 15:04 UTC (permalink / raw)
  To: Boqun Feng; +Cc: linux-kernel, Josef Bacik, Peter Zijlstra, Lai Jiangshan

Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
lockdep:

 [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
 [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
 [ 1270.473240] -----------------------------------------------------
 [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
 [ 1270.474994]
 [ 1270.474994] and this task is already holding:
 [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
 [ 1270.476046] which would create a new lock dependency:
 [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
 [ 1270.476949]
 [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
 [ 1270.477553]  (&pool->lock/1){-.-.}
 ...
 [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
 [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
 ...
 [ 1270.494735]  Possible interrupt unsafe locking scenario:
 [ 1270.494735]
 [ 1270.495250]        CPU0                    CPU1
 [ 1270.495600]        ----                    ----
 [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
 [ 1270.496295]                                local_irq_disable();
 [ 1270.496753]                                lock(&pool->lock/1);
 [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
 [ 1270.497744]   <Interrupt>
 [ 1270.497948]     lock(&pool->lock/1);

This will cause an irq inversion deadlock if the above locking
scenario happens.

The root cause of this safe -> unsafe lock order is the
mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
held.

Unlocking a mutex while holding an irq spinlock was never safe and
this problem has been around forever, but it never got noticed because
the mutex is usually only trylocked while holding the irq spinlock,
making actual failures very unlikely, and the lockdep annotation
missed the condition until the recent b9c16a0e1f73 ("locking/mutex:
Fix lockdep_assert_held() fail").

Using a mutex for pool->manager_arb has always been a bit of a stretch.
It is primarily a mechanism to arbitrate managership between workers,
which can easily be done with a pool flag.  The only reason it became
a mutex is that the pool destruction path wants to exclude parallel
managing operations.

This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
and makes the destruction path wait for the current manager on a wait
queue.
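
In miniature, and distilled from the diff below rather than adding
anything new, the two sides of the arbitration end up looking like
this (the manager side is entered with pool->lock held; the
destruction side takes it first):

    /* manager side, manage_workers() */
    if (pool->flags & POOL_MANAGER_ACTIVE)
            return false;               /* someone else is managing */
    pool->flags |= POOL_MANAGER_ACTIVE;
    maybe_create_worker(pool);          /* may temporarily drop pool->lock */
    pool->flags &= ~POOL_MANAGER_ACTIVE;
    wake_up(&wq_manager_wait);

    /* destruction side, put_unbound_pool() */
    spin_lock_irq(&pool->lock);
    wait_event_lock_irq(wq_manager_wait,
                        !(pool->flags & POOL_MANAGER_ACTIVE), pool->lock);
    pool->flags |= POOL_MANAGER_ACTIVE;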

v2: Drop unnecessary flag clearing before pool destruction as
    suggested by Boqun.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josef Bacik <josef@toxicpanda.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: stable@vger.kernel.org
---
 kernel/workqueue.c |   37 +++++++++++++++----------------------
 1 file changed, 15 insertions(+), 22 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 64d0edf..a2dccfe 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -68,6 +68,7 @@ enum {
 	 * attach_mutex to avoid changing binding state while
 	 * worker_attach_to_pool() is in progress.
 	 */
+	POOL_MANAGER_ACTIVE	= 1 << 0,	/* being managed */
 	POOL_DISASSOCIATED	= 1 << 2,	/* cpu can't serve workers */
 
 	/* worker flags */
@@ -165,7 +166,6 @@ struct worker_pool {
 						/* L: hash of busy workers */
 
 	/* see manage_workers() for details on the two manager mutexes */
-	struct mutex		manager_arb;	/* manager arbitration */
 	struct worker		*manager;	/* L: purely informational */
 	struct mutex		attach_mutex;	/* attach/detach exclusion */
 	struct list_head	workers;	/* A: attached workers */
@@ -299,6 +299,7 @@ static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
 
 static DEFINE_MUTEX(wq_pool_mutex);	/* protects pools and workqueues list */
 static DEFINE_SPINLOCK(wq_mayday_lock);	/* protects wq->maydays list */
+static DECLARE_WAIT_QUEUE_HEAD(wq_manager_wait); /* wait for manager to go away */
 
 static LIST_HEAD(workqueues);		/* PR: list of all workqueues */
 static bool workqueue_freezing;		/* PL: have wqs started freezing? */
@@ -801,7 +802,7 @@ static bool need_to_create_worker(struct worker_pool *pool)
 /* Do we have too many workers and should some go away? */
 static bool too_many_workers(struct worker_pool *pool)
 {
-	bool managing = mutex_is_locked(&pool->manager_arb);
+	bool managing = pool->flags & POOL_MANAGER_ACTIVE;
 	int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
 	int nr_busy = pool->nr_workers - nr_idle;
 
@@ -1980,24 +1981,17 @@ static bool manage_workers(struct worker *worker)
 {
 	struct worker_pool *pool = worker->pool;
 
-	/*
-	 * Anyone who successfully grabs manager_arb wins the arbitration
-	 * and becomes the manager.  mutex_trylock() on pool->manager_arb
-	 * failure while holding pool->lock reliably indicates that someone
-	 * else is managing the pool and the worker which failed trylock
-	 * can proceed to executing work items.  This means that anyone
-	 * grabbing manager_arb is responsible for actually performing
-	 * manager duties.  If manager_arb is grabbed and released without
-	 * actual management, the pool may stall indefinitely.
-	 */
-	if (!mutex_trylock(&pool->manager_arb))
+	if (pool->flags & POOL_MANAGER_ACTIVE)
 		return false;
+
+	pool->flags |= POOL_MANAGER_ACTIVE;
 	pool->manager = worker;
 
 	maybe_create_worker(pool);
 
 	pool->manager = NULL;
-	mutex_unlock(&pool->manager_arb);
+	pool->flags &= ~POOL_MANAGER_ACTIVE;
+	wake_up(&wq_manager_wait);
 	return true;
 }
 
@@ -3248,7 +3242,6 @@ static int init_worker_pool(struct worker_pool *pool)
 	setup_timer(&pool->mayday_timer, pool_mayday_timeout,
 		    (unsigned long)pool);
 
-	mutex_init(&pool->manager_arb);
 	mutex_init(&pool->attach_mutex);
 	INIT_LIST_HEAD(&pool->workers);
 
@@ -3318,13 +3311,15 @@ static void put_unbound_pool(struct worker_pool *pool)
 	hash_del(&pool->hash_node);
 
 	/*
-	 * Become the manager and destroy all workers.  Grabbing
-	 * manager_arb prevents @pool's workers from blocking on
-	 * attach_mutex.
+	 * Become the manager and destroy all workers.  This prevents
+	 * @pool's workers from blocking on attach_mutex.  We're the last
+	 * manager and @pool gets freed with the flag set.
 	 */
-	mutex_lock(&pool->manager_arb);
-
 	spin_lock_irq(&pool->lock);
+	wait_event_lock_irq(wq_manager_wait,
+			    !(pool->flags & POOL_MANAGER_ACTIVE), pool->lock);
+	pool->flags |= POOL_MANAGER_ACTIVE;
+
 	while ((worker = first_idle_worker(pool)))
 		destroy_worker(worker);
 	WARN_ON(pool->nr_workers || pool->nr_idle);
@@ -3338,8 +3333,6 @@ static void put_unbound_pool(struct worker_pool *pool)
 	if (pool->detach_completion)
 		wait_for_completion(pool->detach_completion);
 
-	mutex_unlock(&pool->manager_arb);
-
 	/* shut down the timers */
 	del_timer_sync(&pool->idle_timer);
 	del_timer_sync(&pool->mayday_timer);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag
  2017-10-09 15:02   ` Lai Jiangshan
@ 2017-10-09 15:08     ` Tejun Heo
  2017-10-09 15:14       ` Lai Jiangshan
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2017-10-09 15:08 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: Boqun Feng, LKML, Josef Bacik, Peter Zijlstra

Hello,

On Mon, Oct 09, 2017 at 11:02:34PM +0800, Lai Jiangshan wrote:
> I was also thinking about alternative code while reviewing.
> The first idea is quite obvious: testing POOL_MANAGER_ACTIVE
> can be replaced by testing pool->manager, so
> POOL_MANAGER_ACTIVE is not needed.  Isn't it?

put_unbound_pool() doesn't have to be called from a kworker context
and we don't really have a kworker pointer to set pool->manager to.
We can use a bogus value and then update pool->manager dereferences
accordingly but I think it's cleaner to simply use a separate flag.
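
Purely for illustration, the pool->manager-as-arbiter variant discussed
above would need something like the following hypothetical sketch
(POOL_MANAGER_SELF is a made-up sentinel, not an existing symbol):

    /* hypothetical sentinel so put_unbound_pool() can claim managership */
    #define POOL_MANAGER_SELF   ((struct worker *)-1)

    /* manage_workers() */
    if (pool->manager)
            return false;
    pool->manager = worker;

    /* put_unbound_pool(), which has no struct worker of its own */
    pool->manager = POOL_MANAGER_SELF;

    /* every pool->manager dereference would then have to special-case
       the sentinel, which is why a separate flag is cleaner */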

> The second thing is to make manage_workers()
> and put_unbound_pool() exclusive.
> Waiting for an event on POOL_MANAGER_ACTIVE (or pool->manager)
> is one way.  However, the pool's refcnt cannot drop to zero
> now since the caller still holds the pool->lock

wait_event_lock_irq() drops the lock if the condition is not met
before going to sleep (otherwise it wouldn't be able to sleep).
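
Roughly, and as a simplified sketch of the macro's behaviour rather
than the exact code in include/linux/wait.h (wq_head, condition and
lock stand in for the macro arguments):

    DEFINE_WAIT(wait);

    /* caller enters with spin_lock_irq(&lock) held */
    for (;;) {
            prepare_to_wait(&wq_head, &wait, TASK_UNINTERRUPTIBLE);
            if (condition)
                    break;
            spin_unlock_irq(&lock);     /* dropped so others can progress */
            schedule();
            spin_lock_irq(&lock);       /* re-taken before rechecking */
    }
    finish_wait(&wq_head, &wait);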

> and some pwqs of the works in the worklist.  So the other way
> to enforce the exclusion could be just doing
> get_pwq(the first pwq of the worklist) and put_pwq() when
> manage_workers() is done.  And the code about
> pool->manager_arb in put_unbound_pool() can be
> simply removed.

Yeah, that part is removed.

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag
  2017-10-09 15:08     ` Tejun Heo
@ 2017-10-09 15:14       ` Lai Jiangshan
  2017-10-09 15:33         ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Lai Jiangshan @ 2017-10-09 15:14 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Boqun Feng, LKML, Josef Bacik, Peter Zijlstra

On Mon, Oct 9, 2017 at 11:08 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Mon, Oct 09, 2017 at 11:02:34PM +0800, Lai Jiangshan wrote:
>> I was also thinking about alternative code while reviewing.
>> The first idea is quite obvious: testing POOL_MANAGER_ACTIVE
>> can be replaced by testing pool->manager, so
>> POOL_MANAGER_ACTIVE is not needed.  Isn't it?
>
> put_unbound_pool() doesn't have to be called from a kworker context
> and we don't really have a kworker pointer to set pool->manager to.
> We can use a bogus value and then update pool->manager dereferences
> accordingly but I think it's cleaner to simply use a separate flag.
>
>> The second thing is to make manage_workers()
>> and put_unbound_pool() exclusive.
>> Waiting for an event on POOL_MANAGER_ACTIVE (or pool->manager)
>> is one way.  However, the pool's refcnt cannot drop to zero
>> now since the caller still holds the pool->lock
>
> wait_event_lock_irq() drops the lock if the condition is not met
> before going to sleep (otherwise it wouldn't be able to sleep).

I think just using get_pwq()/put_pwq() in manage_workers(), as
described in the quoted text below, is simpler than using
wait_event_lock_irq().
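
A rough sketch of that alternative, added here purely for illustration
(it was never posted as a patch; it assumes pool->lock is held and
pool->worklist is non-empty, and it leaves out the worker-vs-worker
arbitration check):

    struct work_struct *work = list_first_entry(&pool->worklist,
                                                struct work_struct, entry);
    struct pool_workqueue *pwq = get_work_pwq(work);

    get_pwq(pwq);               /* pwq holds a ref on the pool */
    pool->manager = worker;

    maybe_create_worker(pool);  /* may drop and re-grab pool->lock */

    pool->manager = NULL;
    put_pwq(pwq);               /* pool->lock held, as put_pwq() requires */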

thanks
Lai

>
>> and some pwqs of the works in the worklist.  So the other way
>> to enforce the exclusion could be just doing
>> get_pwq(the first pwq of the worklist) and put_pwq() when
>> manage_workers() is done.  And the code about
>> pool->manager_arb in put_unbound_pool() can be
>> simply removed.
>
> Yeah, that part is removed.
>
> Thanks!
>
> --
> tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] workqueue: Fix irq inversion deadlock in manage_workers()
  2017-10-09  9:40 ` Lai Jiangshan
  2017-10-09 12:40   ` Boqun Feng
@ 2017-10-09 15:24   ` Peter Zijlstra
  2017-10-09 15:29     ` Tejun Heo
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2017-10-09 15:24 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: Boqun Feng, LKML, Josef Bacik, Tejun Heo, Thomas Gleixner

On Mon, Oct 09, 2017 at 05:40:43PM +0800, Lai Jiangshan wrote:
> I didn't think this kind of pattern was all that rare.  I remember I saw several.

>   mutex_lock();
>   do_something();
>   spin_lock_irq();
>   record_the_state_for_do_something();
>    // keep the spin lock held to hold the state for do_more_things().
>   mutex_unlock(); // unlock() is suggested to be called when just exiting C.S.
>   do_more_things();
>   spin_unlock_irq();
> 
> Has all code of this pattern been removed?
> Could it be possible that the mutex will be changed to allow this?

So I think we did something similar to the rt_mutex in:

  b4abf91047cf ("rtmutex: Make wait_lock irq safe")

And I would not be entirely against doing the same for our normal mutex,
but I've not really had time to read/think through this thread.
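
For reference, the gist of b4abf91047cf is that the rtmutex slowpaths
now take wait_lock with interrupts disabled, roughly like the following
(paraphrased sketch, not the actual rtmutex code):

    raw_spin_lock_irqsave(&lock->wait_lock, flags);
    /* dequeue and wake the next waiter */
    raw_spin_unlock_irqrestore(&lock->wait_lock, flags);

so unlocking an rtmutex while an irq-disabled spinlock is held no
longer creates a HARDIRQ-safe -> HARDIRQ-unsafe dependency.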

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] workqueue: Fix irq inversion deadlock in manage_workers()
  2017-10-09 15:24   ` Peter Zijlstra
@ 2017-10-09 15:29     ` Tejun Heo
  0 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2017-10-09 15:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Boqun Feng, LKML, Josef Bacik, Thomas Gleixner

Hello,

On Mon, Oct 09, 2017 at 05:24:49PM +0200, Peter Zijlstra wrote:
> So I think we did something similar to the rt_mutex in:
> 
>   b4abf91047cf ("rtmutex: Make wait_lock irq safe")
> 
> And I would not be entirely against doing the same for our normal mutex,
> but I've not really had time to read/think through this thread.

We may want to do that if there are other more valid cases but this
workqueue one shouldn't be the reason.  It's something which shouldn't
have been a mutex from the get-go.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag
  2017-10-09 15:14       ` Lai Jiangshan
@ 2017-10-09 15:33         ` Tejun Heo
  0 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2017-10-09 15:33 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: Boqun Feng, LKML, Josef Bacik, Peter Zijlstra

Hello, Lai.

On Mon, Oct 09, 2017 at 11:14:20PM +0800, Lai Jiangshan wrote:
> > wait_event_lock_irq() drops the lock if the condition is not met
> > before going to sleep (otherwise it wouldn't be able to sleep).
> 
> I think just using get_pwq()/put_pwq() in manage_workers()
> as the following said is simpler than using wait_event_lock_irq()

Hmm... let's stay with the current implementation for the fix as it's
more straightforward.  We can simplify it later if that'd be
worthwhile.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag
  2017-10-09 15:04   ` [PATCH v2 " Tejun Heo
@ 2017-10-10  9:55     ` Lai Jiangshan
  2017-10-10 14:15     ` Tejun Heo
  1 sibling, 0 replies; 19+ messages in thread
From: Lai Jiangshan @ 2017-10-10  9:55 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Boqun Feng, LKML, Josef Bacik, Peter Zijlstra

On Mon, Oct 9, 2017 at 11:04 PM, Tejun Heo <tj@kernel.org> wrote:
> Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
> lockdep:
>
>  [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
>  [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
>  [ 1270.473240] -----------------------------------------------------
>  [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
>  [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
>  [ 1270.474994]
>  [ 1270.474994] and this task is already holding:
>  [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
>  [ 1270.476046] which would create a new lock dependency:
>  [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
>  [ 1270.476949]
>  [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
>  [ 1270.477553]  (&pool->lock/1){-.-.}
>  ...
>  [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
>  [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
>  ...
>  [ 1270.494735]  Possible interrupt unsafe locking scenario:
>  [ 1270.494735]
>  [ 1270.495250]        CPU0                    CPU1
>  [ 1270.495600]        ----                    ----
>  [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
>  [ 1270.496295]                                local_irq_disable();
>  [ 1270.496753]                                lock(&pool->lock/1);
>  [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
>  [ 1270.497744]   <Interrupt>
>  [ 1270.497948]     lock(&pool->lock/1);
>
> This will cause an irq inversion deadlock if the above lock scenario
> happens.
>
> The root cause of this safe -> unsafe lock order is the
> mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
> held.
>
> Unlocking a mutex while holding an irq spinlock was never safe and this
> problem has been around forever, but it never got noticed: the mutex is
> usually only trylocked while holding the irq lock, which makes actual
> failures very unlikely, and the lockdep annotation missed the condition
> until the recent b9c16a0e1f73 ("locking/mutex: Fix
> lockdep_assert_held() fail").
>
> Using a mutex for pool->manager_arb has always been a bit of a stretch.
> It is primarily a mechanism to arbitrate managership between workers,
> which can easily be done with a pool flag.  The only reason it became
> a mutex is that the pool destruction path wants to exclude parallel
> managing operations.
>
> This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
> and makes the destruction path wait for the current manager on a wait
> queue.
>
> v2: Drop unnecessary flag clearing before pool destruction as
>     suggested by Boqun.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Josef Bacik <josef@toxicpanda.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: stable@vger.kernel.org
> ---
>  kernel/workqueue.c |   37 +++++++++++++++----------------------
>  1 file changed, 15 insertions(+), 22 deletions(-)

Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 64d0edf..a2dccfe 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -68,6 +68,7 @@ enum {
>          * attach_mutex to avoid changing binding state while
>          * worker_attach_to_pool() is in progress.
>          */
> +       POOL_MANAGER_ACTIVE     = 1 << 0,       /* being managed */
>         POOL_DISASSOCIATED      = 1 << 2,       /* cpu can't serve workers */
>
>         /* worker flags */
> @@ -165,7 +166,6 @@ struct worker_pool {
>                                                 /* L: hash of busy workers */
>
>         /* see manage_workers() for details on the two manager mutexes */
> -       struct mutex            manager_arb;    /* manager arbitration */
>         struct worker           *manager;       /* L: purely informational */
>         struct mutex            attach_mutex;   /* attach/detach exclusion */
>         struct list_head        workers;        /* A: attached workers */
> @@ -299,6 +299,7 @@ static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
>
>  static DEFINE_MUTEX(wq_pool_mutex);    /* protects pools and workqueues list */
>  static DEFINE_SPINLOCK(wq_mayday_lock);        /* protects wq->maydays list */
> +static DECLARE_WAIT_QUEUE_HEAD(wq_manager_wait); /* wait for manager to go away */
>
>  static LIST_HEAD(workqueues);          /* PR: list of all workqueues */
>  static bool workqueue_freezing;                /* PL: have wqs started freezing? */
> @@ -801,7 +802,7 @@ static bool need_to_create_worker(struct worker_pool *pool)
>  /* Do we have too many workers and should some go away? */
>  static bool too_many_workers(struct worker_pool *pool)
>  {
> -       bool managing = mutex_is_locked(&pool->manager_arb);
> +       bool managing = pool->flags & POOL_MANAGER_ACTIVE;
>         int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
>         int nr_busy = pool->nr_workers - nr_idle;
>
> @@ -1980,24 +1981,17 @@ static bool manage_workers(struct worker *worker)
>  {
>         struct worker_pool *pool = worker->pool;
>
> -       /*
> -        * Anyone who successfully grabs manager_arb wins the arbitration
> -        * and becomes the manager.  mutex_trylock() on pool->manager_arb
> -        * failure while holding pool->lock reliably indicates that someone
> -        * else is managing the pool and the worker which failed trylock
> -        * can proceed to executing work items.  This means that anyone
> -        * grabbing manager_arb is responsible for actually performing
> -        * manager duties.  If manager_arb is grabbed and released without
> -        * actual management, the pool may stall indefinitely.
> -        */
> -       if (!mutex_trylock(&pool->manager_arb))
> +       if (pool->flags & POOL_MANAGER_ACTIVE)
>                 return false;
> +
> +       pool->flags |= POOL_MANAGER_ACTIVE;
>         pool->manager = worker;
>
>         maybe_create_worker(pool);
>
>         pool->manager = NULL;
> -       mutex_unlock(&pool->manager_arb);
> +       pool->flags &= ~POOL_MANAGER_ACTIVE;
> +       wake_up(&wq_manager_wait);
>         return true;
>  }
>
> @@ -3248,7 +3242,6 @@ static int init_worker_pool(struct worker_pool *pool)
>         setup_timer(&pool->mayday_timer, pool_mayday_timeout,
>                     (unsigned long)pool);
>
> -       mutex_init(&pool->manager_arb);
>         mutex_init(&pool->attach_mutex);
>         INIT_LIST_HEAD(&pool->workers);
>
> @@ -3318,13 +3311,15 @@ static void put_unbound_pool(struct worker_pool *pool)
>         hash_del(&pool->hash_node);
>
>         /*
> -        * Become the manager and destroy all workers.  Grabbing
> -        * manager_arb prevents @pool's workers from blocking on
> -        * attach_mutex.
> +        * Become the manager and destroy all workers.  This prevents
> +        * @pool's workers from blocking on attach_mutex.  We're the last
> +        * manager and @pool gets freed with the flag set.
>          */
> -       mutex_lock(&pool->manager_arb);
> -
>         spin_lock_irq(&pool->lock);
> +       wait_event_lock_irq(wq_manager_wait,
> +                           !(pool->flags & POOL_MANAGER_ACTIVE), pool->lock);
> +       pool->flags |= POOL_MANAGER_ACTIVE;
> +
>         while ((worker = first_idle_worker(pool)))
>                 destroy_worker(worker);
>         WARN_ON(pool->nr_workers || pool->nr_idle);
> @@ -3338,8 +3333,6 @@ static void put_unbound_pool(struct worker_pool *pool)
>         if (pool->detach_completion)
>                 wait_for_completion(pool->detach_completion);
>
> -       mutex_unlock(&pool->manager_arb);
> -
>         /* shut down the timers */
>         del_timer_sync(&pool->idle_timer);
>         del_timer_sync(&pool->mayday_timer);

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag
  2017-10-09 15:04   ` [PATCH v2 " Tejun Heo
  2017-10-10  9:55     ` Lai Jiangshan
@ 2017-10-10 14:15     ` Tejun Heo
  1 sibling, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2017-10-10 14:15 UTC (permalink / raw)
  To: Boqun Feng; +Cc: linux-kernel, Josef Bacik, Peter Zijlstra, Lai Jiangshan

On Mon, Oct 09, 2017 at 08:04:13AM -0700, Tejun Heo wrote:
> Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
> lockdep:
> 
>  [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
>  [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
>  [ 1270.473240] -----------------------------------------------------
>  [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
>  [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
>  [ 1270.474994]
>  [ 1270.474994] and this task is already holding:
>  [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
>  [ 1270.476046] which would create a new lock dependency:
>  [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
>  [ 1270.476949]
>  [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
>  [ 1270.477553]  (&pool->lock/1){-.-.}
>  ...
>  [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
>  [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
>  ...
>  [ 1270.494735]  Possible interrupt unsafe locking scenario:
>  [ 1270.494735]
>  [ 1270.495250]        CPU0                    CPU1
>  [ 1270.495600]        ----                    ----
>  [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
>  [ 1270.496295]                                local_irq_disable();
>  [ 1270.496753]                                lock(&pool->lock/1);
>  [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
>  [ 1270.497744]   <Interrupt>
>  [ 1270.497948]     lock(&pool->lock/1);
> 
> This will cause an irq inversion deadlock if the above lock scenario
> happens.
> 
> The root cause of this safe -> unsafe lock order is the
> mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
> held.
> 
> Unlocking a mutex while holding an irq spinlock was never safe and this
> problem has been around forever, but it never got noticed: the mutex is
> usually only trylocked while holding the irq lock, which makes actual
> failures very unlikely, and the lockdep annotation missed the condition
> until the recent b9c16a0e1f73 ("locking/mutex: Fix
> lockdep_assert_held() fail").
> 
> Using a mutex for pool->manager_arb has always been a bit of a stretch.
> It is primarily a mechanism to arbitrate managership between workers,
> which can easily be done with a pool flag.  The only reason it became
> a mutex is that the pool destruction path wants to exclude parallel
> managing operations.
> 
> This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
> and makes the destruction path wait for the current manager on a wait
> queue.
> 
> v2: Drop unnecessary flag clearing before pool destruction as
>     suggested by Boqun.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Josef Bacik <josef@toxicpanda.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: stable@vger.kernel.org

Applied to wq/for-4.14-fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2017-10-10 14:15 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-08  9:02 [RFC] workqueue: Fix irq inversion deadlock in manage_workers() Boqun Feng
2017-10-08 19:03 ` Tejun Heo
2017-10-09  3:24   ` Boqun Feng
2017-10-09  6:42   ` Peter Zijlstra
2017-10-09 13:07     ` Tejun Heo
2017-10-09  9:40 ` Lai Jiangshan
2017-10-09 12:40   ` Boqun Feng
2017-10-09 15:24   ` Peter Zijlstra
2017-10-09 15:29     ` Tejun Heo
2017-10-09 13:21 ` [PATCH workqueue/for-4.14-fixes] workqueue: replace pool->manager_arb mutex with a flag Tejun Heo
2017-10-09 14:21   ` Boqun Feng
2017-10-09 14:47     ` Tejun Heo
2017-10-09 15:02   ` Lai Jiangshan
2017-10-09 15:08     ` Tejun Heo
2017-10-09 15:14       ` Lai Jiangshan
2017-10-09 15:33         ` Tejun Heo
2017-10-09 15:04   ` [PATCH v2 " Tejun Heo
2017-10-10  9:55     ` Lai Jiangshan
2017-10-10 14:15     ` Tejun Heo
