* [PATCH 1/2] workqueue: Catch more locking problems with flush_work()
@ 2012-04-19  3:25 Stephen Boyd
  2012-04-19  3:25 ` [PATCH 2/2] ks8851: Fix mutex deadlock in ks8851_net_stop() Stephen Boyd
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Stephen Boyd @ 2012-04-19  3:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Tejun Heo, netdev, Ben Dooks

If a workqueue is flushed but the work item is not scheduled to
run, lockdep checking will be circumvented. For example:

 static DEFINE_MUTEX(mutex);

 static void my_work(struct work_struct *w)
 {
         mutex_lock(&mutex);
         mutex_unlock(&mutex);
 }

 static DECLARE_WORK(work, my_work);

 static int __init start_test_module(void)
 {
         schedule_work(&work);
         return 0;
 }
 module_init(start_test_module);

 static void __exit stop_test_module(void)
 {
         mutex_lock(&mutex);
         flush_work(&work);
         mutex_unlock(&mutex);
 }
 module_exit(stop_test_module);

would only print a warning if the work item was actively running
when flush_work() was called. Otherwise flush_work() returns
early. In this trivial example nothing could go wrong, but if the
work item is scheduled via an interrupt we could potentially have a
scenario where the work item is running just at the time flush_work()
is called. This could become a classic AB-BA locking problem.

Add a lockdep hint in flush_work() in the "work not running" case
so that we always catch this potential deadlock scenario.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
---
 kernel/workqueue.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 66ec08d..eb800df 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2513,8 +2513,11 @@ bool flush_work(struct work_struct *work)
 		wait_for_completion(&barr.done);
 		destroy_work_on_stack(&barr.work);
 		return true;
-	} else
+	} else {
+		lock_map_acquire(&work->lockdep_map);
+		lock_map_release(&work->lockdep_map);
 		return false;
+	}
 }
 EXPORT_SYMBOL_GPL(flush_work);
 
-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/2] ks8851: Fix mutex deadlock in ks8851_net_stop()
  2012-04-19  3:25 [PATCH 1/2] workqueue: Catch more locking problems with flush_work() Stephen Boyd
@ 2012-04-19  3:25 ` Stephen Boyd
  2012-04-21 19:34   ` David Miller
  2012-04-19  8:10 ` [PATCH 1/2] workqueue: Catch more locking problems with flush_work() Yong Zhang
  2012-04-19 15:28 ` Tejun Heo
  2 siblings, 1 reply; 19+ messages in thread
From: Stephen Boyd @ 2012-04-19  3:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, Tejun Heo, Ben Dooks

There is a potential deadlock scenario when the ks8851 driver
is removed. The interrupt handler schedules a workqueue which
acquires a mutex that ks8851_net_stop() also acquires before
flushing the workqueue. Previously lockdep wouldn't be able
to find this problem but now that it has the support we can
trigger this lockdep warning by rmmoding the driver after
an ifconfig up.

Fix the possible deadlock by disabling the interrupts in
the chip and then releasing the lock across the workqueue
flushing. The mutex is only there to protect the registers
anyway so this should be ok.

=======================================================
[ INFO: possible circular locking dependency detected ]
3.0.21-00021-g8b33780-dirty #2911
-------------------------------------------------------
rmmod/125 is trying to acquire lock:
 ((&ks->irq_work)){+.+...}, at: [<c019e0b8>] flush_work+0x0/0xac

but task is already holding lock:
 (&ks->lock){+.+...}, at: [<bf00b850>] ks8851_net_stop+0x64/0x138 [ks8851]

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&ks->lock){+.+...}:
       [<c01b89c8>] __lock_acquire+0x940/0x9f8
       [<c01b9058>] lock_acquire+0x10c/0x130
       [<c083dbec>] mutex_lock_nested+0x68/0x3dc
       [<bf00bd48>] ks8851_irq_work+0x24/0x46c [ks8851]
       [<c019c580>] process_one_work+0x2d8/0x518
       [<c019cb98>] worker_thread+0x220/0x3a0
       [<c01a2ad4>] kthread+0x88/0x94
       [<c0107008>] kernel_thread_exit+0x0/0x8

-> #0 ((&ks->irq_work)){+.+...}:
       [<c01b7984>] validate_chain+0x914/0x1018
       [<c01b89c8>] __lock_acquire+0x940/0x9f8
       [<c01b9058>] lock_acquire+0x10c/0x130
       [<c019e104>] flush_work+0x4c/0xac
       [<bf00b858>] ks8851_net_stop+0x6c/0x138 [ks8851]
       [<c06b209c>] __dev_close_many+0x98/0xcc
       [<c06b2174>] dev_close_many+0x68/0xd0
       [<c06b22ec>] rollback_registered_many+0xcc/0x2b8
       [<c06b2554>] rollback_registered+0x28/0x34
       [<c06b25b8>] unregister_netdevice_queue+0x58/0x7c
       [<c06b25f4>] unregister_netdev+0x18/0x20
       [<bf00c1f4>] ks8851_remove+0x64/0xb4 [ks8851]
       [<c049ddf0>] spi_drv_remove+0x18/0x1c
       [<c0468e98>] __device_release_driver+0x7c/0xbc
       [<c0468f64>] driver_detach+0x8c/0xb4
       [<c0467f00>] bus_remove_driver+0xb8/0xe8
       [<c01c1d20>] sys_delete_module+0x1e8/0x27c
       [<c0105ec0>] ret_fast_syscall+0x0/0x3c

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ks->lock);
                               lock((&ks->irq_work));
                               lock(&ks->lock);
  lock((&ks->irq_work));

 *** DEADLOCK ***

4 locks held by rmmod/125:
 #0:  (&__lockdep_no_validate__){+.+.+.}, at: [<c0468f44>] driver_detach+0x6c/0xb4
 #1:  (&__lockdep_no_validate__){+.+.+.}, at: [<c0468f50>] driver_detach+0x78/0xb4
 #2:  (rtnl_mutex){+.+.+.}, at: [<c06b25e8>] unregister_netdev+0xc/0x20
 #3:  (&ks->lock){+.+...}, at: [<bf00b850>] ks8851_net_stop+0x64/0x138 [ks8851]

Cc: Ben Dooks <ben-linux@fluff.org>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
---
 drivers/net/ethernet/micrel/ks8851.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/micrel/ks8851.c b/drivers/net/ethernet/micrel/ks8851.c
index c722aa60..20f50ec 100644
--- a/drivers/net/ethernet/micrel/ks8851.c
+++ b/drivers/net/ethernet/micrel/ks8851.c
@@ -889,16 +889,17 @@ static int ks8851_net_stop(struct net_device *dev)
 	netif_stop_queue(dev);
 
 	mutex_lock(&ks->lock);
+	/* turn off the IRQs and ack any outstanding */
+	ks8851_wrreg16(ks, KS_IER, 0x0000);
+	ks8851_wrreg16(ks, KS_ISR, 0xffff);
+	mutex_unlock(&ks->lock);
 
 	/* stop any outstanding work */
 	flush_work(&ks->irq_work);
 	flush_work(&ks->tx_work);
 	flush_work(&ks->rxctrl_work);
 
-	/* turn off the IRQs and ack any outstanding */
-	ks8851_wrreg16(ks, KS_IER, 0x0000);
-	ks8851_wrreg16(ks, KS_ISR, 0xffff);
-
+	mutex_lock(&ks->lock);
 	/* shutdown RX process */
 	ks8851_wrreg16(ks, KS_RXCR1, 0x0000);
 
@@ -907,6 +908,7 @@ static int ks8851_net_stop(struct net_device *dev)
 
 	/* set powermode to soft power down to save power */
 	ks8851_set_powermode(ks, PMECR_PM_SOFTDOWN);
+	mutex_unlock(&ks->lock);
 
 	/* ensure any queued tx buffers are dumped */
 	while (!skb_queue_empty(&ks->txq)) {
@@ -918,7 +920,6 @@ static int ks8851_net_stop(struct net_device *dev)
 		dev_kfree_skb(txb);
 	}
 
-	mutex_unlock(&ks->lock);
 	return 0;
 }
 
-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.



* Re: [PATCH 1/2] workqueue: Catch more locking problems with flush_work()
  2012-04-19  3:25 [PATCH 1/2] workqueue: Catch more locking problems with flush_work() Stephen Boyd
  2012-04-19  3:25 ` [PATCH 2/2] ks8851: Fix mutex deadlock in ks8851_net_stop() Stephen Boyd
@ 2012-04-19  8:10 ` Yong Zhang
  2012-04-19 18:36   ` Stephen Boyd
  2012-04-19 15:28 ` Tejun Heo
  2 siblings, 1 reply; 19+ messages in thread
From: Yong Zhang @ 2012-04-19  8:10 UTC (permalink / raw)
  To: Stephen Boyd; +Cc: linux-kernel, Tejun Heo, netdev, Ben Dooks

On Wed, Apr 18, 2012 at 08:25:57PM -0700, Stephen Boyd wrote:
> If a workqueue is flushed but the work item is not scheduled to
> run, lockdep checking will be circumvented. For example:
> 
>  static DEFINE_MUTEX(mutex);
> 
>  static void my_work(struct work_struct *w)
>  {
>          mutex_lock(&mutex);
>          mutex_unlock(&mutex);
>  }
> 
>  static DECLARE_WORK(work, my_work);
> 
>  static int __init start_test_module(void)
>  {
>          schedule_work(&work);
>          return 0;
>  }
>  module_init(start_test_module);
> 
>  static void __exit stop_test_module(void)
>  {
>          mutex_lock(&mutex);
>          flush_work(&work);
>          mutex_unlock(&mutex);
>  }
>  module_exit(stop_test_module);
> 
> would only print a warning if the work item was actively running
> when flush_work() was called. Otherwise flush_work() returns
> early. In this trivial example nothing could go wrong, but if the
> work item is scheduled via an interrupt we could potentially have a
> scenario where the work item is running just at the time flush_work()

You mean flush_work() could be called in an interrupt? I don't think
that is possible.

> is called. This could become a classic AB-BA locking problem.

I don't see how the deadlock happens; could you please be more
specific?

Thanks,
Yong


> 
> Add a lockdep hint in flush_work() in the "work not running" case
> so that we always catch this potential deadlock scenario.
> 
> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
> ---
>  kernel/workqueue.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 66ec08d..eb800df 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2513,8 +2513,11 @@ bool flush_work(struct work_struct *work)
>  		wait_for_completion(&barr.done);
>  		destroy_work_on_stack(&barr.work);
>  		return true;
> -	} else
> +	} else {
> +		lock_map_acquire(&work->lockdep_map);
> +		lock_map_release(&work->lockdep_map);
>  		return false;
> +	}
>  }
>  EXPORT_SYMBOL_GPL(flush_work);
>  
> -- 
> Sent by an employee of the Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
> 

-- 
Only stand for myself


* Re: [PATCH 1/2] workqueue: Catch more locking problems with flush_work()
  2012-04-19  3:25 [PATCH 1/2] workqueue: Catch more locking problems with flush_work() Stephen Boyd
  2012-04-19  3:25 ` [PATCH 2/2] ks8851: Fix mutex deadlock in ks8851_net_stop() Stephen Boyd
  2012-04-19  8:10 ` [PATCH 1/2] workqueue: Catch more locking problems with flush_work() Yong Zhang
@ 2012-04-19 15:28 ` Tejun Heo
  2012-04-19 18:10   ` Stephen Boyd
  2 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2012-04-19 15:28 UTC (permalink / raw)
  To: Stephen Boyd; +Cc: linux-kernel, netdev, Ben Dooks

On Wed, Apr 18, 2012 at 08:25:57PM -0700, Stephen Boyd wrote:
> @@ -2513,8 +2513,11 @@ bool flush_work(struct work_struct *work)
>  		wait_for_completion(&barr.done);
>  		destroy_work_on_stack(&barr.work);
>  		return true;
> -	} else
> +	} else {
> +		lock_map_acquire(&work->lockdep_map);
> +		lock_map_release(&work->lockdep_map);
>  		return false;

We don't have this annotation when start_flush_work() succeeds either,
right?  IOW, would lockdep trigger when an actual deadlock happens?
If not, why not add the acquire/release() before flush_work() does
anything?

Thanks.

-- 
tejun


* Re: [PATCH 1/2] workqueue: Catch more locking problems with flush_work()
  2012-04-19 15:28 ` Tejun Heo
@ 2012-04-19 18:10   ` Stephen Boyd
  2012-04-20 17:35     ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Stephen Boyd @ 2012-04-19 18:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, netdev, Ben Dooks

On 04/19/12 08:28, Tejun Heo wrote:
> On Wed, Apr 18, 2012 at 08:25:57PM -0700, Stephen Boyd wrote:
>> @@ -2513,8 +2513,11 @@ bool flush_work(struct work_struct *work)
>>  		wait_for_completion(&barr.done);
>>  		destroy_work_on_stack(&barr.work);
>>  		return true;
>> -	} else
>> +	} else {
>> +		lock_map_acquire(&work->lockdep_map);
>> +		lock_map_release(&work->lockdep_map);
>>  		return false;
> We don't have this annotation when start_flush_work() succeeds either,
> right?  IOW, would lockdep trigger when an actual deadlock happens?

I believe it does although I haven't tested it.

> If not, why not add the acquire/release() before flush_work() does
> anything?
>

I was worried about causing false positive lockdep warnings in the case
that start_flush_work() succeeds and returns true. In that case, lockdep
is told about the cwq lockdep map:

static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
                             bool wait_executing)
{

        .....

        if (cwq->wq->saved_max_active == 1 || cwq->wq->flags & WQ_RESCUER)
                lock_map_acquire(&cwq->wq->lockdep_map);
        else
                lock_map_acquire_read(&cwq->wq->lockdep_map);


and so if we acquired the work->lockdep_map before the
cwq->wq->lockdep_map we would get a warning about ABBA between these two
lockdep maps. At least that is what I'm lead to believe when I look at
what process_one_work() is doing. Please correct me if I'm wrong.

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.



* Re: [PATCH 1/2] workqueue: Catch more locking problems with flush_work()
  2012-04-19  8:10 ` [PATCH 1/2] workqueue: Catch more locking problems with flush_work() Yong Zhang
@ 2012-04-19 18:36   ` Stephen Boyd
  2012-04-20  5:26     ` Yong Zhang
  0 siblings, 1 reply; 19+ messages in thread
From: Stephen Boyd @ 2012-04-19 18:36 UTC (permalink / raw)
  To: Yong Zhang; +Cc: linux-kernel, Tejun Heo, netdev, Ben Dooks

On 04/19/12 01:10, Yong Zhang wrote:
> On Wed, Apr 18, 2012 at 08:25:57PM -0700, Stephen Boyd wrote:
>> If a workqueue is flushed but the work item is not scheduled to
>> run, lockdep checking will be circumvented. For example:
>>
>>  static DEFINE_MUTEX(mutex);
>>
>>  static void my_work(struct work_struct *w)
>>  {
>>          mutex_lock(&mutex);
>>          mutex_unlock(&mutex);
>>  }
>>
>>  static DECLARE_WORK(work, my_work);
>>
>>  static int __init start_test_module(void)
>>  {
>>          schedule_work(&work);
>>          return 0;
>>  }
>>  module_init(start_test_module);
>>
>>  static void __exit stop_test_module(void)
>>  {
>>          mutex_lock(&mutex);
>>          flush_work(&work);
>>          mutex_unlock(&mutex);
>>  }
>>  module_exit(stop_test_module);
>>
>> would only print a warning if the work item was actively running
>> when flush_work() was called. Otherwise flush_work() returns
>> early. In this trivial example nothing could go wrong, but if the
>> work item is scheduled via an interrupt we could potentially have a
>> scenario where the work item is running just at the time flush_work()
> You mean flush_work() could be called in an interrupt? I don't think
> that is possible.

No.

>
>> is called. This could become a classic AB-BA locking problem.
> I don't see how the deadlock happens; could you please be more
> specific?
>

Does looking at the second patch help? Basically schedule_work() can run
the callback right between the time the mutex is acquired and
flush_work() is called:

CPU0                        CPU1

<irq>
  schedule_work()           mutex_lock(&mutex)
<irq return>
    my_work()               flush_work() 
      mutex_lock(&mutex)    
      <deadlock>

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.



* Re: [PATCH 1/2] workqueue: Catch more locking problems with flush_work()
  2012-04-19 18:36   ` Stephen Boyd
@ 2012-04-20  5:26     ` Yong Zhang
  2012-04-20  6:01       ` Yong Zhang
  0 siblings, 1 reply; 19+ messages in thread
From: Yong Zhang @ 2012-04-20  5:26 UTC (permalink / raw)
  To: Stephen Boyd; +Cc: linux-kernel, Tejun Heo, netdev, Ben Dooks

On Thu, Apr 19, 2012 at 11:36:32AM -0700, Stephen Boyd wrote:
> Does looking at the second patch help? Basically schedule_work() can run
> the callback right between the time the mutex is acquired and
> flush_work() is called:
> 
> CPU0                        CPU1
> 
> <irq>
>   schedule_work()           mutex_lock(&mutex)
> <irq return>
>     my_work()               flush_work() 
>       mutex_lock(&mutex)    
>       <deadlock>

I get your point. It is a problem. But your patch could introduce false
positives, since by the time flush_work() is called that very work may
have finished running already.

So I think we need the lock_map_acquire()/lock_map_release() only when
the work is actually being processed, no?

Thanks,
Yong


* Re: [PATCH 1/2] workqueue: Catch more locking problems with flush_work()
  2012-04-20  5:26     ` Yong Zhang
@ 2012-04-20  6:01       ` Yong Zhang
  2012-04-20  6:26         ` Stephen Boyd
  0 siblings, 1 reply; 19+ messages in thread
From: Yong Zhang @ 2012-04-20  6:01 UTC (permalink / raw)
  To: Stephen Boyd; +Cc: linux-kernel, Tejun Heo, netdev, Ben Dooks

On Fri, Apr 20, 2012 at 01:26:33PM +0800, Yong Zhang wrote:
> On Thu, Apr 19, 2012 at 11:36:32AM -0700, Stephen Boyd wrote:
> > Does looking at the second patch help? Basically schedule_work() can run
> > the callback right between the time the mutex is acquired and
> > flush_work() is called:
> > 
> > CPU0                        CPU1
> > 
> > <irq>
> >   schedule_work()           mutex_lock(&mutex)
> > <irq return>
> >     my_work()               flush_work() 
> >       mutex_lock(&mutex)    
> >       <deadlock>
> 
> I get your point. It is a problem. But your patch could introduce false
> positives, since by the time flush_work() is called that very work may
> have finished running already.
> 
> So I think we need the lock_map_acquire()/lock_map_release() only when
> the work is actually being processed, no?

But start_flush_work() has tried to take care of this issue, except it
doesn't add work->lockdep_map into the chain.

So does the patch below help?

Thanks,
Yong

---
From: Yong Zhang <yong.zhang@windriver.com>
Date: Fri, 20 Apr 2012 13:44:16 +0800
Subject: [PATCH] workqueue:lockdep: make flush_work notice deadlock

Connect the lock chain by acquiring work->lockdep_map when
the to-be-flushed work is running.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Reported-by: Stephen Boyd <sboyd@codeaurora.org>
---
 kernel/workqueue.c |    2 ++
 1 file changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bc867e8..c096b05 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2461,6 +2461,8 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
 		lock_map_acquire(&cwq->wq->lockdep_map);
 	else
 		lock_map_acquire_read(&cwq->wq->lockdep_map);
+	lock_map_acquire(&work->lockdep_map);
+	lock_map_release(&work->lockdep_map);
 	lock_map_release(&cwq->wq->lockdep_map);
 
 	return true;
-- 
1.7.5.4



* Re: [PATCH 1/2] workqueue: Catch more locking problems with flush_work()
  2012-04-20  6:01       ` Yong Zhang
@ 2012-04-20  6:26         ` Stephen Boyd
  2012-04-20  7:18           ` Yong Zhang
  0 siblings, 1 reply; 19+ messages in thread
From: Stephen Boyd @ 2012-04-20  6:26 UTC (permalink / raw)
  To: Yong Zhang; +Cc: linux-kernel, Tejun Heo, netdev, Ben Dooks

On 4/19/2012 11:01 PM, Yong Zhang wrote:
> On Fri, Apr 20, 2012 at 01:26:33PM +0800, Yong Zhang wrote:
>> On Thu, Apr 19, 2012 at 11:36:32AM -0700, Stephen Boyd wrote:
>>> Does looking at the second patch help? Basically schedule_work() can run
>>> the callback right between the time the mutex is acquired and
>>> flush_work() is called:
>>>
>>> CPU0                        CPU1
>>>
>>> <irq>
>>>   schedule_work()           mutex_lock(&mutex)
>>> <irq return>
>>>     my_work()               flush_work() 
>>>       mutex_lock(&mutex)    
>>>       <deadlock>
>> I get your point. It is a problem. But your patch could introduce false
>> positives, since by the time flush_work() is called that very work may
>> have finished running already.
>>
>> So I think we need the lock_map_acquire()/lock_map_release() only when
>> the work is actually being processed, no?
> But start_flush_work() has tried to take care of this issue, except it
> doesn't add work->lockdep_map into the chain.
>
> So does the patch below help?
>
[snip]
> @@ -2461,6 +2461,8 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
>  		lock_map_acquire(&cwq->wq->lockdep_map);
>  	else
>  		lock_map_acquire_read(&cwq->wq->lockdep_map);
> +	lock_map_acquire(&work->lockdep_map);
> +	lock_map_release(&work->lockdep_map);
>  	lock_map_release(&cwq->wq->lockdep_map);
>  
>  	return true;

No, this doesn't help. The whole point of the patch is to get lockdep to
complain in the case where the work is not queued. That case is not a
false positive. We will get a lockdep warning if the work is running
(when start_flush_work() returns true) solely with the
lock_map_acquire() on cwq->wq->lockdep_map.

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.



* Re: [PATCH 1/2] workqueue: Catch more locking problems with flush_work()
  2012-04-20  6:26         ` Stephen Boyd
@ 2012-04-20  7:18           ` Yong Zhang
  2012-04-20  8:18             ` Stephen Boyd
  0 siblings, 1 reply; 19+ messages in thread
From: Yong Zhang @ 2012-04-20  7:18 UTC (permalink / raw)
  To: Stephen Boyd; +Cc: linux-kernel, Tejun Heo, netdev, Ben Dooks

On Thu, Apr 19, 2012 at 11:26:47PM -0700, Stephen Boyd wrote:
> complain in the case where the work is not queued. That case is not a
> false positive. We will get a lockdep warning if the work is running

IIRC, flush_work() is just a nop when a work is neither queued nor running.

> (when start_flush_work() returns true) solely with the
> lock_map_acquire() on cwq->wq->lockdep_map.

Yeah, that is the point of using lockdep to detect deadlocks for workqueues.

But looking at start_flush_work(), in the case where
!(cwq->wq->saved_max_active == 1 || cwq->wq->flags & WQ_RESCUER),
lock_map_acquire_read() is called, and a recursive read is not added to
the chain list. So when lock_map_acquire_read(&cwq->wq->lockdep_map)
is called, the deadlock will not be detected. I hope you don't hit that
special case.

Thanks,
Yong


* Re: [PATCH 1/2] workqueue: Catch more locking problems with flush_work()
  2012-04-20  7:18           ` Yong Zhang
@ 2012-04-20  8:18             ` Stephen Boyd
  2012-04-20  8:32               ` Yong Zhang
  0 siblings, 1 reply; 19+ messages in thread
From: Stephen Boyd @ 2012-04-20  8:18 UTC (permalink / raw)
  To: Yong Zhang; +Cc: linux-kernel, Tejun Heo, netdev, Ben Dooks

On 4/20/2012 12:18 AM, Yong Zhang wrote:
> On Thu, Apr 19, 2012 at 11:26:47PM -0700, Stephen Boyd wrote:
>> complain in the case where the work is not queued. That case is not a
>> false positive. We will get a lockdep warning if the work is running
> IIRC, flush_work() is just a nop when a work is neither queued nor running.

Agreed, but it's better to always print a lockdep warning instead of
only when the deadlock is going to occur.

>
>> (when start_flush_work() returns true) solely with the
>> lock_map_acquire() on cwq->wq->lockdep_map.
> Yeah, that is the point we use lockdep to detect deadlock for workqueue.
>
> But looking at start_flush_work(), in the case where
> !(cwq->wq->saved_max_active == 1 || cwq->wq->flags & WQ_RESCUER),
> lock_map_acquire_read() is called, and a recursive read is not added to
> the chain list. So when lock_map_acquire_read(&cwq->wq->lockdep_map)
> is called, the deadlock will not be detected. I hope you don't hit that
> special case.

Hmm. Originally I had what you suggested in my patch but I left it out
because I wasn't sure if it would cause false positives. Do you see any
possibility for false positives? I'll add it back in if not.

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.



* Re: [PATCH 1/2] workqueue: Catch more locking problems with flush_work()
  2012-04-20  8:18             ` Stephen Boyd
@ 2012-04-20  8:32               ` Yong Zhang
  2012-04-21  0:32                 ` Yong Zhang
  0 siblings, 1 reply; 19+ messages in thread
From: Yong Zhang @ 2012-04-20  8:32 UTC (permalink / raw)
  To: Stephen Boyd; +Cc: linux-kernel, Tejun Heo, netdev, Ben Dooks

On Fri, Apr 20, 2012 at 01:18:19AM -0700, Stephen Boyd wrote:
> On 4/20/2012 12:18 AM, Yong Zhang wrote:
> > On Thu, Apr 19, 2012 at 11:26:47PM -0700, Stephen Boyd wrote:
> >> complain in the case where the work is not queued. That case is not a
> >> false positive. We will get a lockdep warning if the work is running
> > IIRC, flush_work() is just a nop when a work is neither queued nor running.
> 
> Agreed, but it's better to always print a lockdep warning instead of
> only when the deadlock is going to occur.

It will, IMHO.

> 
> >
> >> (when start_flush_work() returns true) solely with the
> >> lock_map_acquire() on cwq->wq->lockdep_map.
> > Yeah, that is the point we use lockdep to detect deadlock for workqueue.
> >
> > But looking at start_flush_work(), in the case where
> > !(cwq->wq->saved_max_active == 1 || cwq->wq->flags & WQ_RESCUER),
> > lock_map_acquire_read() is called, and a recursive read is not added to
> > the chain list. So when lock_map_acquire_read(&cwq->wq->lockdep_map)
> > is called, the deadlock will not be detected. I hope you don't hit that
> > special case.
> 
> Hmm. Originally I had what you suggested in my patch but I left it out
> because I wasn't sure if it would cause false positives.
> Do you see any
> possibility for false positives?

No, I don't. My test indeed shows nothing (just build and boot).

> I'll add it back in if not.

It's great if you can try it :)

Thanks,
Yong


* Re: [PATCH 1/2] workqueue: Catch more locking problems with flush_work()
  2012-04-19 18:10   ` Stephen Boyd
@ 2012-04-20 17:35     ` Tejun Heo
  2012-04-20 23:15       ` Stephen Boyd
  2012-04-21  0:28       ` [PATCHv2] " Stephen Boyd
  0 siblings, 2 replies; 19+ messages in thread
From: Tejun Heo @ 2012-04-20 17:35 UTC (permalink / raw)
  To: Stephen Boyd; +Cc: linux-kernel, netdev, Ben Dooks

Hello,

On Thu, Apr 19, 2012 at 11:10:41AM -0700, Stephen Boyd wrote:
> On 04/19/12 08:28, Tejun Heo wrote:
> > On Wed, Apr 18, 2012 at 08:25:57PM -0700, Stephen Boyd wrote:
> >> @@ -2513,8 +2513,11 @@ bool flush_work(struct work_struct *work)
> >>  		wait_for_completion(&barr.done);
> >>  		destroy_work_on_stack(&barr.work);
> >>  		return true;
> >> -	} else
> >> +	} else {
> >> +		lock_map_acquire(&work->lockdep_map);
> >> +		lock_map_release(&work->lockdep_map);
> >>  		return false;
> > We don't have this annotation when start_flush_work() succeeds either,
> > right?  IOW, would lockdep trigger when an actual deadlock happens?
> 
> I believe it does although I haven't tested it.

How does it do that?  While wq->lockdep_map would be able to detect
some of the chaining, the read acquire paths would probably miss some
others.  In general, wq->lockdep_map is used to express dependencies
regarding workqueue flushing (and the self flushing) and it would
probably be better to express work item dependencies explicitly using
work->lockdep_map even if it becomes redundant through wq->lockdep_map
sometimes.

> > If not, why not add the acquire/release() before flush_work() does
> > anything?
> 
> I was worried about causing false positive lockdep warnings in the case
> that start_flush_work() succeeds and returns true. In that case, lockdep
> is told about the cwq lockdep map:
> 
> static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
>                              bool wait_executing)
> {
> 
>         .....
> 
>         if (cwq->wq->saved_max_active == 1 || cwq->wq->flags & WQ_RESCUER)
>                 lock_map_acquire(&cwq->wq->lockdep_map);
>         else
>                 lock_map_acquire_read(&cwq->wq->lockdep_map);
> 
> 
> and so if we acquired the work->lockdep_map before the
> cwq->wq->lockdep_map we would get a warning about ABBA between these two
> lockdep maps. At least that is what I'm lead to believe when I look at
> what process_one_work() is doing. Please correct me if I'm wrong.

All that's necessary is acquiring and releasing work->lockdep_map.
There's no need to nest start_flush_work() inside it.  Without
nesting, there's nothing to worry about ABBA lockdeps.

Thanks.

-- 
tejun


* Re: [PATCH 1/2] workqueue: Catch more locking problems with flush_work()
  2012-04-20 17:35     ` Tejun Heo
@ 2012-04-20 23:15       ` Stephen Boyd
  2012-04-21  0:28       ` [PATCHv2] " Stephen Boyd
  1 sibling, 0 replies; 19+ messages in thread
From: Stephen Boyd @ 2012-04-20 23:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, netdev, Ben Dooks

On 4/20/2012 10:35 AM, Tejun Heo wrote:
>
> All that's necessary is acquiring and releasing work->lockdep_map.
> There's no need to nest start_flush_work() inside it.  Without
> nesting, there's nothing to worry about ABBA lockdeps.

Yes, I see. This patch should work then?

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5abf42f..038cf64 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2506,6 +2506,9 @@ bool flush_work(struct work_struct *work)
 {
        struct wq_barrier barr;

+       lock_map_acquire(&work->lockdep_map);
+       lock_map_release(&work->lockdep_map);
+
        if (start_flush_work(work, &barr, true)) {
                wait_for_completion(&barr.done);
                destroy_work_on_stack(&barr.work);


-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.



* [PATCHv2] workqueue: Catch more locking problems with flush_work()
  2012-04-20 17:35     ` Tejun Heo
  2012-04-20 23:15       ` Stephen Boyd
@ 2012-04-21  0:28       ` Stephen Boyd
  2012-04-21  0:34         ` Yong Zhang
  2012-04-23 18:07         ` Tejun Heo
  1 sibling, 2 replies; 19+ messages in thread
From: Stephen Boyd @ 2012-04-21  0:28 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, netdev, Ben Dooks, Yong Zhang

If a workqueue is flushed with flush_work(), lockdep checking can
be circumvented. For example:

 static DEFINE_MUTEX(mutex);

 static void my_work(struct work_struct *w)
 {
         mutex_lock(&mutex);
         mutex_unlock(&mutex);
 }

 static DECLARE_WORK(work, my_work);

 static int __init start_test_module(void)
 {
         schedule_work(&work);
         return 0;
 }
 module_init(start_test_module);

 static void __exit stop_test_module(void)
 {
         mutex_lock(&mutex);
         flush_work(&work);
         mutex_unlock(&mutex);
 }
 module_exit(stop_test_module);

would not always print a warning when flush_work() was called.
In this trivial example nothing could go wrong since we are
guaranteed module_init() and module_exit() don't run concurrently,
but if the work item is scheduled asynchronously we could have a
scenario where the work item is running just at the time flush_work()
is called resulting in a classic ABBA locking problem.

Add a lockdep hint by acquiring and releasing the work item
lockdep_map in flush_work() so that we always catch this
potential deadlock scenario.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
---
 kernel/workqueue.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5abf42f..038cf64 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2506,6 +2506,9 @@ bool flush_work(struct work_struct *work)
 {
 	struct wq_barrier barr;
 
+	lock_map_acquire(&work->lockdep_map);
+	lock_map_release(&work->lockdep_map);
+
 	if (start_flush_work(work, &barr, true)) {
 		wait_for_completion(&barr.done);
 		destroy_work_on_stack(&barr.work);
-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
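Why the bare acquire/release of work->lockdep_map catches the inversion even when the work item is idle can be illustrated with a rough userspace analogy of lockdep's ordering check. This is a hypothetical Python sketch for illustration only; `LockdepMap` and its methods are invented names, not kernel API:

```python
# Hypothetical userspace analogy of lockdep's ordering check -- illustrative
# only, not kernel code.  LockdepMap records which "lock classes" are held
# when another class is acquired, and flags the reverse ordering later.
class LockdepMap:
    edges = set()      # recorded (held, acquired) orderings
    held = []          # classes currently held (single-threaded demo)

    def __init__(self, name):
        self.name = name

    def acquire(self):
        for h in LockdepMap.held:
            # Seeing A while holding B, after B was once seen while holding
            # A, is a potential ABBA deadlock whether or not it hit this run.
            if (self.name, h) in LockdepMap.edges:
                raise RuntimeError(
                    "possible ABBA between %s and %s" % (h, self.name))
            LockdepMap.edges.add((h, self.name))
        LockdepMap.held.append(self.name)

    def release(self):
        LockdepMap.held.remove(self.name)

mutex = LockdepMap("mutex")
work = LockdepMap("work->lockdep_map")

# A past run of my_work(): the work "lock" is held around the mutex.
work.acquire(); mutex.acquire(); mutex.release(); work.release()

# stop_test_module(): the acquire/release hint in flush_work() runs under
# the mutex, so the inverted ordering is recorded and flagged even though
# the work item is not running at the time.
warned = False
mutex.acquire()
try:
    work.acquire()     # analogous to the lock_map_acquire() this patch adds
except RuntimeError:
    warned = True
mutex.release()
print("warned:", warned)
```

The point of the patch is exactly the first comment in `acquire()`: recording the ordering unconditionally means the warning fires on every flush under the mutex, not just on the unlucky run where the work happens to be executing.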


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] workqueue: Catch more locking problems with flush_work()
  2012-04-20  8:32               ` Yong Zhang
@ 2012-04-21  0:32                 ` Yong Zhang
  0 siblings, 0 replies; 19+ messages in thread
From: Yong Zhang @ 2012-04-21  0:32 UTC (permalink / raw)
  To: Stephen Boyd; +Cc: linux-kernel, Tejun Heo, netdev, Ben Dooks

On Fri, Apr 20, 2012 at 4:32 PM, Yong Zhang <yong.zhang0@gmail.com> wrote:
> On Fri, Apr 20, 2012 at 01:18:19AM -0700, Stephen Boyd wrote:
>> On 4/20/2012 12:18 AM, Yong Zhang wrote:
>> > On Thu, Apr 19, 2012 at 11:26:47PM -0700, Stephen Boyd wrote:
>> >> complain in the case where the work is not queued. That case is not a
>> >> false positive. We will get a lockdep warning if the work is running
>> > IIRC, flush_work() is just a nop when a work is neither queued nor running.
>>
>> Agreed, but it's better to always print a lockdep warning instead of
>> printing one only when a deadlock is actually going to occur.
>
> It will IMHO.

After reconsidering this issue, I recognize I was wrong from the beginning.
My suggestion would only warn on a real deadlock, which doesn't follow
what lockdep wants to achieve. Sorry for that.

BTW, your latest patch should work :)

Thanks,
Yong

>
>>
>> >
>> >> (when start_flush_work() returns true) solely with the
>> >> lock_map_acquire() on cwq->wq->lockdep_map.
>> > Yeah, that is the point we use lockdep to detect deadlock for workqueue.
>> >
>> > But when looking at start_flush_work(), for some case
>> > !(cwq->wq->saved_max_active == 1 || cwq->wq->flags & WQ_RESCUER),
>> > lock_map_acquire_read() is called, but recursive read is not added to
>> > the chain list. So when lock_map_acquire_read(&cwq->wq->lockdep_map)
>> > is called, deadlock will not be detected. I hope you don't hit that
>> > special case.
>>
>> Hmm. Originally I had what you suggested in my patch but I left it out
>> because I wasn't sure if it would cause false positives.
>> Do you see any
>> possibility for false positives?
>
> No, I don't. My test indeed show nothing (just build and boot).
>
>>I'll add it back in if not.
>
> It's great if you can try it :)
>
> Thanks,
> Yong



-- 
Only stand for myself

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv2] workqueue: Catch more locking problems with flush_work()
  2012-04-21  0:28       ` [PATCHv2] " Stephen Boyd
@ 2012-04-21  0:34         ` Yong Zhang
  2012-04-23 18:07         ` Tejun Heo
  1 sibling, 0 replies; 19+ messages in thread
From: Yong Zhang @ 2012-04-21  0:34 UTC (permalink / raw)
  To: Stephen Boyd; +Cc: Tejun Heo, linux-kernel, netdev, Ben Dooks

On Sat, Apr 21, 2012 at 8:28 AM, Stephen Boyd <sboyd@codeaurora.org> wrote:
> If a workqueue is flushed with flush_work(), lockdep checking can
> be circumvented. For example:
>
>  static DEFINE_MUTEX(mutex);
>
>  static void my_work(struct work_struct *w)
>  {
>         mutex_lock(&mutex);
>         mutex_unlock(&mutex);
>  }
>
>  static DECLARE_WORK(work, my_work);
>
>  static int __init start_test_module(void)
>  {
>         schedule_work(&work);
>         return 0;
>  }
>  module_init(start_test_module);
>
>  static void __exit stop_test_module(void)
>  {
>         mutex_lock(&mutex);
>         flush_work(&work);
>         mutex_unlock(&mutex);
>  }
>  module_exit(stop_test_module);
>
> would not always print a warning when flush_work() was called.
> In this trivial example nothing could go wrong since we are
> guaranteed module_init() and module_exit() don't run concurrently,
> but if the work item is scheduled asynchronously we could have a
> scenario where the work item is running just at the time flush_work()
> is called resulting in a classic ABBA locking problem.
>
> Add a lockdep hint by acquiring and releasing the work item
> lockdep_map in flush_work() so that we always catch this
> potential deadlock scenario.
>
> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>

Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>

> ---
>  kernel/workqueue.c |    3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 5abf42f..038cf64 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2506,6 +2506,9 @@ bool flush_work(struct work_struct *work)
>  {
>        struct wq_barrier barr;
>
> +       lock_map_acquire(&work->lockdep_map);
> +       lock_map_release(&work->lockdep_map);
> +
>        if (start_flush_work(work, &barr, true)) {
>                wait_for_completion(&barr.done);
>                destroy_work_on_stack(&barr.work);
> --
> Sent by an employee of the Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>



-- 
Only stand for myself

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] ks8851: Fix mutex deadlock in ks8851_net_stop()
  2012-04-19  3:25 ` [PATCH 2/2] ks8851: Fix mutex deadlock in ks8851_net_stop() Stephen Boyd
@ 2012-04-21 19:34   ` David Miller
  0 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2012-04-21 19:34 UTC (permalink / raw)
  To: sboyd; +Cc: linux-kernel, netdev, tj, ben-linux

From: Stephen Boyd <sboyd@codeaurora.org>
Date: Wed, 18 Apr 2012 20:25:58 -0700

> There is a potential deadlock scenario when the ks8851 driver
> is removed. The interrupt handler schedules a workqueue which
> acquires a mutex that ks8851_net_stop() also acquires before
> flushing the workqueue. Previously lockdep wouldn't be able
> to find this problem but now that it has the support we can
> trigger this lockdep warning by rmmoding the driver after
> an ifconfig up.
> 
> Fix the possible deadlock by disabling the interrupts in
> the chip and then releasing the lock across the workqueue
> flush. The mutex is only there to protect the registers
> anyway, so this should be ok.
 ...
> Cc: Ben Dooks <ben-linux@fluff.org>
> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>

Applied.
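The shape of the fix described above (mask the chip interrupt while holding the lock, then drop the lock across the flush so the pending work item can take it) can be sketched with a userspace analogy. This is a hypothetical Python sketch of the pattern, not the actual ks8851 driver code:

```python
import threading

# Userspace analogy of the ks8851_net_stop() fix -- a hypothetical sketch,
# not the driver code.  The "IRQ work" needs the register mutex, so the
# stop path must not hold that mutex across the flush.
reg_mutex = threading.Lock()
irq_work_ran = threading.Event()

def irq_work():
    with reg_mutex:          # the work item touches chip registers
        pass
    irq_work_ran.set()

def net_stop():
    with reg_mutex:
        pass                 # mask the chip's interrupts under the lock
    # Drop the lock across the flush: a pending work item can now take
    # reg_mutex and finish instead of deadlocking against this thread.
    worker = threading.Thread(target=irq_work)
    worker.start()
    worker.join(timeout=5)   # stands in for flush_work()
    with reg_mutex:
        pass                 # remaining teardown re-takes the lock

net_stop()
print("stopped cleanly:", irq_work_ran.is_set())
```

Had `net_stop()` held `reg_mutex` across the join, a not-yet-finished `irq_work()` would block on the mutex forever while `net_stop()` waits for it: the classic ABBA the patch removes.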

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv2] workqueue: Catch more locking problems with flush_work()
  2012-04-21  0:28       ` [PATCHv2] " Stephen Boyd
  2012-04-21  0:34         ` Yong Zhang
@ 2012-04-23 18:07         ` Tejun Heo
  1 sibling, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2012-04-23 18:07 UTC (permalink / raw)
  To: Stephen Boyd; +Cc: linux-kernel, netdev, Ben Dooks, Yong Zhang

On Fri, Apr 20, 2012 at 05:28:50PM -0700, Stephen Boyd wrote:
> If a workqueue is flushed with flush_work() lockdep checking can
> be circumvented. For example:
> 
>  static DEFINE_MUTEX(mutex);
> 
>  static void my_work(struct work_struct *w)
>  {
>          mutex_lock(&mutex);
>          mutex_unlock(&mutex);
>  }
> 
>  static DECLARE_WORK(work, my_work);
> 
>  static int __init start_test_module(void)
>  {
>          schedule_work(&work);
>          return 0;
>  }
>  module_init(start_test_module);
> 
>  static void __exit stop_test_module(void)
>  {
>          mutex_lock(&mutex);
>          flush_work(&work);
>          mutex_unlock(&mutex);
>  }
>  module_exit(stop_test_module);
> 
> would not always print a warning when flush_work() was called.
> In this trivial example nothing could go wrong since we are
> guaranteed module_init() and module_exit() don't run concurrently,
> but if the work item is scheduled asynchronously we could have a
> scenario where the work item is running just at the time flush_work()
> is called resulting in a classic ABBA locking problem.
> 
> Add a lockdep hint by acquiring and releasing the work item
> lockdep_map in flush_work() so that we always catch this
> potential deadlock scenario.
> 
> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>

Applied to wq/for-3.5.  Let's see whether it triggers spuriously.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2012-04-23 18:08 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-19  3:25 [PATCH 1/2] workqueue: Catch more locking problems with flush_work() Stephen Boyd
2012-04-19  3:25 ` [PATCH 2/2] ks8851: Fix mutex deadlock in ks8851_net_stop() Stephen Boyd
2012-04-21 19:34   ` David Miller
2012-04-19  8:10 ` [PATCH 1/2] workqueue: Catch more locking problems with flush_work() Yong Zhang
2012-04-19 18:36   ` Stephen Boyd
2012-04-20  5:26     ` Yong Zhang
2012-04-20  6:01       ` Yong Zhang
2012-04-20  6:26         ` Stephen Boyd
2012-04-20  7:18           ` Yong Zhang
2012-04-20  8:18             ` Stephen Boyd
2012-04-20  8:32               ` Yong Zhang
2012-04-21  0:32                 ` Yong Zhang
2012-04-19 15:28 ` Tejun Heo
2012-04-19 18:10   ` Stephen Boyd
2012-04-20 17:35     ` Tejun Heo
2012-04-20 23:15       ` Stephen Boyd
2012-04-21  0:28       ` [PATCHv2] " Stephen Boyd
2012-04-21  0:34         ` Yong Zhang
2012-04-23 18:07         ` Tejun Heo
