linux-kernel.vger.kernel.org archive mirror
* [2.6.20] BUG: workqueue leaked lock
@ 2007-03-13 16:50 Folkert van Heusden
  2007-03-15 19:06 ` Andrew Morton
  0 siblings, 1 reply; 35+ messages in thread
From: Folkert van Heusden @ 2007-03-13 16:50 UTC (permalink / raw)
  To: linux-kernel

...
[ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
[ 1756.728271]     last function: laundromat_main+0x0/0x69 [nfsd]
[ 1756.728392] 2 locks held by nfsd4/3577:
[ 1756.728435]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[ 1756.728679]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[ 1756.728923]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
[ 1756.729015]  [<c1003d7f>] show_trace+0x12/0x14
[ 1756.729103]  [<c1003e79>] dump_stack+0x16/0x18
[ 1756.729187]  [<c102c2e8>] run_workqueue+0x167/0x170
[ 1756.729276]  [<c102c437>] worker_thread+0x146/0x165
[ 1756.729368]  [<c102f797>] kthread+0x97/0xc4
[ 1756.729456]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
[ 1756.729547]  =======================
[ 1792.436492] svc: unknown version (0 for prog 100003, nfsd)
[ 1846.683648] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
[ 1846.683701]     last function: laundromat_main+0x0/0x69 [nfsd]
[ 1846.683832] 2 locks held by nfsd4/3577:
[ 1846.683885]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[ 1846.683980]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[ 1846.683988]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
[ 1846.683994]  [<c1003d7f>] show_trace+0x12/0x14
[ 1846.683997]  [<c1003e79>] dump_stack+0x16/0x18
[ 1846.684001]  [<c102c2e8>] run_workqueue+0x167/0x170
[ 1846.684006]  [<c102c437>] worker_thread+0x146/0x165
[ 1846.684012]  [<c102f797>] kthread+0x97/0xc4
[ 1846.684023]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
[ 1846.684027]  =======================
[ 1936.638930] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
[ 1936.638992]     last function: laundromat_main+0x0/0x69 [nfsd]
[ 1936.639141] 2 locks held by nfsd4/3577:
[ 1936.639195]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[ 1936.639292]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[ 1936.639301]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
[ 1936.639306]  [<c1003d7f>] show_trace+0x12/0x14
[ 1936.639311]  [<c1003e79>] dump_stack+0x16/0x18
[ 1936.639314]  [<c102c2e8>] run_workqueue+0x167/0x170
[ 1936.639321]  [<c102c437>] worker_thread+0x146/0x165
[ 1936.639325]  [<c102f797>] kthread+0x97/0xc4
[ 1936.639329]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
[ 1936.639332]  =======================

This repeats several times.

Also:

[  137.324255] =======================================================
[  137.324359] [ INFO: possible circular locking dependency detected ]
[  137.324412] 2.6.20 #2
[  137.324457] -------------------------------------------------------
[  137.324510] tor/4857 is trying to acquire lock:
[  137.324559]  (tty_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[  137.324765]
[  137.324766] but task is already holding lock:
[  137.324859]  (&s->s_dquot.dqptr_sem){----}, at: [<c1099403>] dquot_alloc_space+0x50/0x189
[  137.325067]
[  137.325069] which lock already depends on the new lock.
[  137.325071]
[  137.325206]
[  137.325208] the existing dependency chain (in reverse order) is:
[  137.325300]
[  137.325301] -> #4 (&s->s_dquot.dqptr_sem){----}:
[  137.325501]        [<c10354b8>] check_prev_add+0x154/0x206
[  137.325852]        [<c10355d4>] check_prevs_add+0x6a/0xd5
[  137.326197]        [<c1036dd2>] __lock_acquire+0x61c/0xa05
[  137.326538]        [<c1037624>] lock_acquire+0x62/0x81
[  137.326887]        [<c1032b5f>] down_read+0x2b/0x3d
[  137.327241]        [<c1099403>] dquot_alloc_space+0x50/0x189
[  137.327588]        [<c10a98e6>] ext3_new_blocks+0x44b/0x5a2
[  137.327935]        [<c10abe78>] ext3_alloc_blocks+0x40/0xdf
[  137.328280]        [<c10abf67>] ext3_alloc_branch+0x50/0x21b
[  137.328622]        [<c10ac445>] ext3_get_blocks_handle+0x1b8/0x367
[  137.328980]        [<c10ac76e>] ext3_getblk+0x97/0x228
[  137.329330]        [<c10ac919>] ext3_bread+0x1a/0x78
[  137.329672]        [<c10b1bc2>] ext3_mkdir+0xf4/0x270
[  137.330022]        [<c1072a6a>] vfs_mkdir+0xb3/0x161
[  137.330368]        [<c1072ba4>] sys_mkdirat+0x8c/0xc4
[  137.330714]        [<c1072bfc>] sys_mkdir+0x20/0x22
[  137.331063]        [<c1002f78>] syscall_call+0x7/0xb
[  137.331406]        [<ffffffff>] 0xffffffff
[  137.331771]
[  137.331772] -> #3 (&ei->truncate_mutex){--..}:
[  137.331979]        [<c10354b8>] check_prev_add+0x154/0x206
[  137.332332]        [<c10355d4>] check_prevs_add+0x6a/0xd5
[  137.332676]        [<c1036dd2>] __lock_acquire+0x61c/0xa05
[  137.333025]        [<c1037624>] lock_acquire+0x62/0x81
[  137.333370]        [<c1205c09>] __mutex_lock_slowpath+0x75/0x28c
[  137.333930]        [<c1205b88>] mutex_lock+0x8/0xa
[  137.334271]        [<c10ae14f>] ext3_truncate+0x170/0x468
[  137.334613]        [<c1058e72>] vmtruncate+0xa6/0x116
[  137.334949]        [<c107d587>] inode_setattr+0x145/0x16c
[  137.335286]        [<c10af071>] ext3_setattr+0x150/0x22f
[  137.335627]        [<c107d909>] notify_change+0x35b/0x392
[  137.335968]        [<c10679d2>] do_truncate+0x52/0x75
[  137.336305]        [<c1071f5e>] may_open+0x1ec/0x231
[  137.336642]        [<c107211c>] open_namei+0xda/0x59b
[  137.336975]        [<c106863a>] do_filp_open+0x2c/0x53
[  137.337310]        [<c106896f>] do_sys_open+0x52/0xd8
[  137.337645]        [<c1068a11>] sys_open+0x1c/0x1e
[  137.337980]        [<c1002f78>] syscall_call+0x7/0xb
[  137.338315]        [<ffffffff>] 0xffffffff
[  137.338665]
[  137.338666] -> #2 (&inode->i_alloc_sem){--..}:
[  137.338864]        [<c10354b8>] check_prev_add+0x154/0x206
[  137.339200]        [<c10355d4>] check_prevs_add+0x6a/0xd5
[  137.339535]        [<c1036dd2>] __lock_acquire+0x61c/0xa05
[  137.339874]        [<c1037624>] lock_acquire+0x62/0x81
[  137.340207]        [<c1032bf5>] down_write+0x2b/0x45
[  137.340545]        [<c107d890>] notify_change+0x2e2/0x392
[  137.340886]        [<c10679d2>] do_truncate+0x52/0x75
[  137.341222]        [<c1071f5e>] may_open+0x1ec/0x231
[  137.341557]        [<c107211c>] open_namei+0xda/0x59b
[  137.341898]        [<c106863a>] do_filp_open+0x2c/0x53
[  137.342236]        [<c106896f>] do_sys_open+0x52/0xd8
[  137.342572]        [<c1068a11>] sys_open+0x1c/0x1e
[  137.343109]        [<c1002f78>] syscall_call+0x7/0xb
[  137.343444]        [<ffffffff>] 0xffffffff
[  137.343792]
[  137.343793] -> #1 (&sysfs_inode_imutex_key){--..}:
[  137.343988]        [<c10354b8>] check_prev_add+0x154/0x206
[  137.344320]        [<c10355d4>] check_prevs_add+0x6a/0xd5
[  137.344655]        [<c1036dd2>] __lock_acquire+0x61c/0xa05
[  137.344986]        [<c1037624>] lock_acquire+0x62/0x81
[  137.345321]        [<c1205c09>] __mutex_lock_slowpath+0x75/0x28c
[  137.345658]        [<c1205b88>] mutex_lock+0x8/0xa
[  137.345991]        [<c10a5447>] sysfs_hash_and_remove+0x43/0x11c
[  137.346328]        [<c10a5ef7>] sysfs_remove_file+0xd/0x12
[  137.346660]        [<c114b00c>] device_remove_file+0x32/0x44
[  137.346992]        [<c114b6d9>] device_del+0x174/0x1d2
[  137.347325]        [<c114b742>] device_unregister+0xb/0x15
[  137.347661]        [<c114b93d>] device_destroy+0x8d/0x9a
[  137.347994]        [<c1137a38>] vcs_remove_sysfs+0x1c/0x38
[  137.348328]        [<c113e0d8>] con_close+0x5e/0x6b
[  137.348661]        [<c1130103>] release_dev+0x4c4/0x6ce
[  137.348999]        [<c113077b>] tty_release+0x12/0x1c
[  137.349332]        [<c106a471>] __fput+0x145/0x16e
[  137.349671]        [<c106a32a>] fput+0x19/0x1b
[  137.350004]        [<c1068a92>] filp_close+0x3c/0x75
[  137.350338]        [<c1068b34>] sys_close+0x69/0xb4
[  137.350671]        [<c1002f78>] syscall_call+0x7/0xb
[  137.351003]        [<ffffffff>] 0xffffffff
[  137.351352]
[  137.351354] -> #0 (tty_mutex){--..}:
[  137.351544]        [<c1035398>] check_prev_add+0x34/0x206
[  137.352086]        [<c10355d4>] check_prevs_add+0x6a/0xd5
[  137.352421]        [<c1036dd2>] __lock_acquire+0x61c/0xa05
[  137.352759]        [<c1037624>] lock_acquire+0x62/0x81
[  137.353090]        [<c1205c09>] __mutex_lock_slowpath+0x75/0x28c
[  137.353426]        [<c1205b88>] mutex_lock+0x8/0xa
[  137.353761]        [<c1098d79>] print_warning+0x8c/0x15d
[  137.354096]        [<c1099537>] dquot_alloc_space+0x184/0x189
[  137.354434]        [<c10a98e6>] ext3_new_blocks+0x44b/0x5a2
[  137.354773]        [<c10abe78>] ext3_alloc_blocks+0x40/0xdf
[  137.355109]        [<c10abf67>] ext3_alloc_branch+0x50/0x21b
[  137.355446]        [<c10ac445>] ext3_get_blocks_handle+0x1b8/0x367
[  137.355783]        [<c10ac66c>] ext3_get_block+0x78/0xe3
[  137.356121]        [<c108a239>] __block_prepare_write+0x168/0x3fd
[  137.356459]        [<c108ad10>] block_prepare_write+0x28/0x3b
[  137.356800]        [<c10acbcd>] ext3_prepare_write+0x3f/0x18d
[  137.357138]        [<c104dbe5>] generic_file_buffered_write+0x197/0x5e6
[  137.357474]        [<c104e2e2>] __generic_file_aio_write_nolock+0x2ae/0x5ad
[  137.357817]        [<c104e6ea>] generic_file_aio_write+0x58/0xc4
[  137.358156]        [<c10aa723>] ext3_file_write+0x2d/0xba
[  137.358495]        [<c1069441>] do_sync_write+0xc7/0x116
[  137.358838]        [<c1069612>] vfs_write+0x182/0x187
[  137.359176]        [<c10696b8>] sys_write+0x3d/0x64
[  137.359513]        [<c1002f78>] syscall_call+0x7/0xb
[  137.359848]        [<ffffffff>] 0xffffffff
[  137.360200]
[  137.360202] other info that might help us debug this:
[  137.360204]
[  137.360335] 3 locks held by tor/4857:
[  137.360382]  #0:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[  137.360625]  #1:  (&ei->truncate_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[  137.361072]  #2:  (&s->s_dquot.dqptr_sem){----}, at: [<c1099403>] dquot_alloc_space+0x50/0x189
[  137.361313]
[  137.361314] stack backtrace:
[  137.361401]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
[  137.361488]  [<c1003d7f>] show_trace+0x12/0x14
[  137.361571]  [<c1003e79>] dump_stack+0x16/0x18
[  137.361663]  [<c1034d6d>] print_circular_bug_tail+0x6f/0x71
[  137.361752]  [<c1035398>] check_prev_add+0x34/0x206
[  137.361838]  [<c10355d4>] check_prevs_add+0x6a/0xd5
[  137.362786]  [<c1036dd2>] __lock_acquire+0x61c/0xa05
[  137.362877]  [<c1037624>] lock_acquire+0x62/0x81
[  137.362960]  [<c1205c09>] __mutex_lock_slowpath+0x75/0x28c
[  137.363047]  [<c1205b88>] mutex_lock+0x8/0xa
[  137.363132]  [<c1098d79>] print_warning+0x8c/0x15d
[  137.363217]  [<c1099537>] dquot_alloc_space+0x184/0x189
[  137.363302]  [<c10a98e6>] ext3_new_blocks+0x44b/0x5a2
[  137.363391]  [<c10abe78>] ext3_alloc_blocks+0x40/0xdf
[  137.363476]  [<c10abf67>] ext3_alloc_branch+0x50/0x21b
[  137.363561]  [<c10ac445>] ext3_get_blocks_handle+0x1b8/0x367
[  137.363650]  [<c10ac66c>] ext3_get_block+0x78/0xe3
[  137.363735]  [<c108a239>] __block_prepare_write+0x168/0x3fd
[  137.363822]  [<c108ad10>] block_prepare_write+0x28/0x3b
[  137.363908]  [<c10acbcd>] ext3_prepare_write+0x3f/0x18d
[  137.363996]  [<c104dbe5>] generic_file_buffered_write+0x197/0x5e6
[  137.364084]  [<c104e2e2>] __generic_file_aio_write_nolock+0x2ae/0x5ad
[  137.364171]  [<c104e6ea>] generic_file_aio_write+0x58/0xc4
[  137.364254]  [<c10aa723>] ext3_file_write+0x2d/0xba
[  137.364340]  [<c1069441>] do_sync_write+0xc7/0x116
[  137.364425]  [<c1069612>] vfs_write+0x182/0x187
[  137.364511]  [<c10696b8>] sys_write+0x3d/0x64
[  137.364595]  [<c1002f78>] syscall_call+0x7/0xb
[  137.364682]  =======================
[  137.471738] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
[  137.471812]     last function: laundromat_main+0x0/0x69 [nfsd]
[  137.471923] 2 locks held by nfsd4/3577:
[  137.471971]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[  137.472209]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[  137.472445]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
[  137.472528]  [<c1003d7f>] show_trace+0x12/0x14
[  137.472613]  [<c1003e79>] dump_stack+0x16/0x18
[  137.472695]  [<c102c2e8>] run_workqueue+0x167/0x170
[  137.472779]  [<c102c437>] worker_thread+0x146/0x165
[  137.472861]  [<c102f797>] kthread+0x97/0xc4
[  137.472943]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
[  137.473026]  =======================


Folkert van Heusden

-- 
MultiTail is a versatile tool for watching logfiles and output of
commands. Filtering, coloring, merging, diff-view, etc.
http://www.vanheusden.com/multitail/
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-13 16:50 [2.6.20] BUG: workqueue leaked lock Folkert van Heusden
@ 2007-03-15 19:06 ` Andrew Morton
  2007-03-15 19:17   ` Folkert van Heusden
  2007-03-16  8:41   ` Peter Zijlstra
  0 siblings, 2 replies; 35+ messages in thread
From: Andrew Morton @ 2007-03-15 19:06 UTC (permalink / raw)
  To: Folkert van Heusden; +Cc: linux-kernel, Oleg Nesterov, Neil Brown

> On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden <folkert@vanheusden.com> wrote:
> ...
> [ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
> [ 1756.728271]     last function: laundromat_main+0x0/0x69 [nfsd]
> [ 1756.728392] 2 locks held by nfsd4/3577:
> [ 1756.728435]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> [ 1756.728679]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> [ 1756.728923]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
> [ 1756.729015]  [<c1003d7f>] show_trace+0x12/0x14
> [ 1756.729103]  [<c1003e79>] dump_stack+0x16/0x18
> [ 1756.729187]  [<c102c2e8>] run_workqueue+0x167/0x170
> [ 1756.729276]  [<c102c437>] worker_thread+0x146/0x165
> [ 1756.729368]  [<c102f797>] kthread+0x97/0xc4
> [ 1756.729456]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
> [ 1756.729547]  =======================
> [ 1792.436492] svc: unknown version (0 for prog 100003, nfsd)
> [ 1846.683648] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
> [ 1846.683701]     last function: laundromat_main+0x0/0x69 [nfsd]
> [ 1846.683832] 2 locks held by nfsd4/3577:
> [ 1846.683885]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> [ 1846.683980]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> [ 1846.683988]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
> [ 1846.683994]  [<c1003d7f>] show_trace+0x12/0x14
> [ 1846.683997]  [<c1003e79>] dump_stack+0x16/0x18
> [ 1846.684001]  [<c102c2e8>] run_workqueue+0x167/0x170
> [ 1846.684006]  [<c102c437>] worker_thread+0x146/0x165
> [ 1846.684012]  [<c102f797>] kthread+0x97/0xc4
> [ 1846.684023]  [<c1003bdb>] kernel_thread_helper+0x7/0x10

Oleg, that's a fairly incomprehensible message we have in there.  Can you
please explain what it means?


* Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-15 19:06 ` Andrew Morton
@ 2007-03-15 19:17   ` Folkert van Heusden
  2007-03-16 14:49     ` Jarek Poplawski
  2007-03-20 11:17     ` dquot.c: possible circular locking " Jarek Poplawski
  2007-03-16  8:41   ` Peter Zijlstra
  1 sibling, 2 replies; 35+ messages in thread
From: Folkert van Heusden @ 2007-03-15 19:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Oleg Nesterov, Neil Brown

> > On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden <folkert@vanheusden.com> wrote:
> > ...
> > [ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
...
> > [ 1846.684023]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
> Oleg, that's a fairly incomprehensible message we have in there.  Can you
> please explain what it means?

Haha ok :-)

Right. Since I run 2.6.20 with these debugging switches enabled, I occasionally get errors like these. I ALWAYS get the following error right after the system boots, when the Tor executable is started:

[  137.324255] =======================================================
[  137.324359] [ INFO: possible circular locking dependency detected ]
[  137.324412] 2.6.20 #2
[  137.324457] -------------------------------------------------------
[  137.324510] tor/4857 is trying to acquire lock:
[  137.324559]  (tty_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[  137.324765]
[  137.324766] but task is already holding lock:
[  137.324859]  (&s->s_dquot.dqptr_sem){----}, at: [<c1099403>] dquot_alloc_space+0x50/0x189
[  137.325067]
[  137.325069] which lock already depends on the new lock.
[  137.325071]
[  137.325206]
[  137.325208] the existing dependency chain (in reverse order) is:
[  137.325300]
[  137.325301] -> #4 (&s->s_dquot.dqptr_sem){----}:
[  137.325501]        [<c10354b8>] check_prev_add+0x154/0x206
[  137.325852]        [<c10355d4>] check_prevs_add+0x6a/0xd5
[  137.326197]        [<c1036dd2>] __lock_acquire+0x61c/0xa05
[  137.326538]        [<c1037624>] lock_acquire+0x62/0x81
[  137.326887]        [<c1032b5f>] down_read+0x2b/0x3d
[  137.327241]        [<c1099403>] dquot_alloc_space+0x50/0x189
[  137.327588]        [<c10a98e6>] ext3_new_blocks+0x44b/0x5a2
[  137.327935]        [<c10abe78>] ext3_alloc_blocks+0x40/0xdf
[  137.328280]        [<c10abf67>] ext3_alloc_branch+0x50/0x21b
[  137.328622]        [<c10ac445>] ext3_get_blocks_handle+0x1b8/0x367
[  137.328980]        [<c10ac76e>] ext3_getblk+0x97/0x228
[  137.329330]        [<c10ac919>] ext3_bread+0x1a/0x78
[  137.329672]        [<c10b1bc2>] ext3_mkdir+0xf4/0x270
[  137.330022]        [<c1072a6a>] vfs_mkdir+0xb3/0x161
[  137.330368]        [<c1072ba4>] sys_mkdirat+0x8c/0xc4
[  137.330714]        [<c1072bfc>] sys_mkdir+0x20/0x22
[  137.331063]        [<c1002f78>] syscall_call+0x7/0xb
[  137.331406]        [<ffffffff>] 0xffffffff
[  137.331771]
[  137.331772] -> #3 (&ei->truncate_mutex){--..}:
[  137.331979]        [<c10354b8>] check_prev_add+0x154/0x206
[  137.332332]        [<c10355d4>] check_prevs_add+0x6a/0xd5
[  137.332676]        [<c1036dd2>] __lock_acquire+0x61c/0xa05
[  137.333025]        [<c1037624>] lock_acquire+0x62/0x81
[  137.333370]        [<c1205c09>] __mutex_lock_slowpath+0x75/0x28c
[  137.333930]        [<c1205b88>] mutex_lock+0x8/0xa
[  137.334271]        [<c10ae14f>] ext3_truncate+0x170/0x468
[  137.334613]        [<c1058e72>] vmtruncate+0xa6/0x116
[  137.334949]        [<c107d587>] inode_setattr+0x145/0x16c
[  137.335286]        [<c10af071>] ext3_setattr+0x150/0x22f
[  137.335627]        [<c107d909>] notify_change+0x35b/0x392
[  137.335968]        [<c10679d2>] do_truncate+0x52/0x75
[  137.336305]        [<c1071f5e>] may_open+0x1ec/0x231
[  137.336642]        [<c107211c>] open_namei+0xda/0x59b
[  137.336975]        [<c106863a>] do_filp_open+0x2c/0x53
[  137.337310]        [<c106896f>] do_sys_open+0x52/0xd8
[  137.337645]        [<c1068a11>] sys_open+0x1c/0x1e
[  137.337980]        [<c1002f78>] syscall_call+0x7/0xb
[  137.338315]        [<ffffffff>] 0xffffffff
[  137.338665]
[  137.338666] -> #2 (&inode->i_alloc_sem){--..}:
[  137.338864]        [<c10354b8>] check_prev_add+0x154/0x206
[  137.339200]        [<c10355d4>] check_prevs_add+0x6a/0xd5
[  137.339535]        [<c1036dd2>] __lock_acquire+0x61c/0xa05
[  137.339874]        [<c1037624>] lock_acquire+0x62/0x81
[  137.340207]        [<c1032bf5>] down_write+0x2b/0x45
[  137.340545]        [<c107d890>] notify_change+0x2e2/0x392
[  137.340886]        [<c10679d2>] do_truncate+0x52/0x75
[  137.341222]        [<c1071f5e>] may_open+0x1ec/0x231
[  137.341557]        [<c107211c>] open_namei+0xda/0x59b
[  137.341898]        [<c106863a>] do_filp_open+0x2c/0x53
[  137.342236]        [<c106896f>] do_sys_open+0x52/0xd8
[  137.342572]        [<c1068a11>] sys_open+0x1c/0x1e
[  137.343109]        [<c1002f78>] syscall_call+0x7/0xb
[  137.343444]        [<ffffffff>] 0xffffffff
[  137.343792]
[  137.343793] -> #1 (&sysfs_inode_imutex_key){--..}:
[  137.343988]        [<c10354b8>] check_prev_add+0x154/0x206
[  137.344320]        [<c10355d4>] check_prevs_add+0x6a/0xd5
[  137.344655]        [<c1036dd2>] __lock_acquire+0x61c/0xa05
[  137.344986]        [<c1037624>] lock_acquire+0x62/0x81
[  137.345321]        [<c1205c09>] __mutex_lock_slowpath+0x75/0x28c
[  137.345658]        [<c1205b88>] mutex_lock+0x8/0xa
[  137.345991]        [<c10a5447>] sysfs_hash_and_remove+0x43/0x11c
[  137.346328]        [<c10a5ef7>] sysfs_remove_file+0xd/0x12
[  137.346660]        [<c114b00c>] device_remove_file+0x32/0x44
[  137.346992]        [<c114b6d9>] device_del+0x174/0x1d2
[  137.347325]        [<c114b742>] device_unregister+0xb/0x15
[  137.347661]        [<c114b93d>] device_destroy+0x8d/0x9a
[  137.347994]        [<c1137a38>] vcs_remove_sysfs+0x1c/0x38
[  137.348328]        [<c113e0d8>] con_close+0x5e/0x6b
[  137.348661]        [<c1130103>] release_dev+0x4c4/0x6ce
[  137.348999]        [<c113077b>] tty_release+0x12/0x1c
[  137.349332]        [<c106a471>] __fput+0x145/0x16e
[  137.349671]        [<c106a32a>] fput+0x19/0x1b
[  137.350004]        [<c1068a92>] filp_close+0x3c/0x75
[  137.350338]        [<c1068b34>] sys_close+0x69/0xb4
[  137.350671]        [<c1002f78>] syscall_call+0x7/0xb
[  137.351003]        [<ffffffff>] 0xffffffff
[  137.351352]
[  137.351354] -> #0 (tty_mutex){--..}:
[  137.351544]        [<c1035398>] check_prev_add+0x34/0x206
[  137.352086]        [<c10355d4>] check_prevs_add+0x6a/0xd5
[  137.352421]        [<c1036dd2>] __lock_acquire+0x61c/0xa05
[  137.352759]        [<c1037624>] lock_acquire+0x62/0x81
[  137.353090]        [<c1205c09>] __mutex_lock_slowpath+0x75/0x28c
[  137.353426]        [<c1205b88>] mutex_lock+0x8/0xa
[  137.353761]        [<c1098d79>] print_warning+0x8c/0x15d
[  137.354096]        [<c1099537>] dquot_alloc_space+0x184/0x189
[  137.354434]        [<c10a98e6>] ext3_new_blocks+0x44b/0x5a2
[  137.354773]        [<c10abe78>] ext3_alloc_blocks+0x40/0xdf
[  137.355109]        [<c10abf67>] ext3_alloc_branch+0x50/0x21b
[  137.355446]        [<c10ac445>] ext3_get_blocks_handle+0x1b8/0x367
[  137.355783]        [<c10ac66c>] ext3_get_block+0x78/0xe3
[  137.356121]        [<c108a239>] __block_prepare_write+0x168/0x3fd
[  137.356459]        [<c108ad10>] block_prepare_write+0x28/0x3b
[  137.356800]        [<c10acbcd>] ext3_prepare_write+0x3f/0x18d
[  137.357138]        [<c104dbe5>] generic_file_buffered_write+0x197/0x5e6
[  137.357474]        [<c104e2e2>] __generic_file_aio_write_nolock+0x2ae/0x5ad
[  137.357817]        [<c104e6ea>] generic_file_aio_write+0x58/0xc4
[  137.358156]        [<c10aa723>] ext3_file_write+0x2d/0xba
[  137.358495]        [<c1069441>] do_sync_write+0xc7/0x116
[  137.358838]        [<c1069612>] vfs_write+0x182/0x187
[  137.359176]        [<c10696b8>] sys_write+0x3d/0x64
[  137.359513]        [<c1002f78>] syscall_call+0x7/0xb
[  137.359848]        [<ffffffff>] 0xffffffff
[  137.360200]
[  137.360202] other info that might help us debug this:
[  137.360204]
[  137.360335] 3 locks held by tor/4857:
[  137.360382]  #0:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[  137.360625]  #1:  (&ei->truncate_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[  137.361072]  #2:  (&s->s_dquot.dqptr_sem){----}, at: [<c1099403>] dquot_alloc_space+0x50/0x189
[  137.361313]
[  137.361314] stack backtrace:
[  137.361401]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
[  137.361488]  [<c1003d7f>] show_trace+0x12/0x14
[  137.361571]  [<c1003e79>] dump_stack+0x16/0x18
[  137.361663]  [<c1034d6d>] print_circular_bug_tail+0x6f/0x71
[  137.361752]  [<c1035398>] check_prev_add+0x34/0x206
[  137.361838]  [<c10355d4>] check_prevs_add+0x6a/0xd5
[  137.362786]  [<c1036dd2>] __lock_acquire+0x61c/0xa05
[  137.362877]  [<c1037624>] lock_acquire+0x62/0x81
[  137.362960]  [<c1205c09>] __mutex_lock_slowpath+0x75/0x28c
[  137.363047]  [<c1205b88>] mutex_lock+0x8/0xa
[  137.363132]  [<c1098d79>] print_warning+0x8c/0x15d
[  137.363217]  [<c1099537>] dquot_alloc_space+0x184/0x189
[  137.363302]  [<c10a98e6>] ext3_new_blocks+0x44b/0x5a2
[  137.363391]  [<c10abe78>] ext3_alloc_blocks+0x40/0xdf
[  137.363476]  [<c10abf67>] ext3_alloc_branch+0x50/0x21b
[  137.363561]  [<c10ac445>] ext3_get_blocks_handle+0x1b8/0x367
[  137.363650]  [<c10ac66c>] ext3_get_block+0x78/0xe3
[  137.363735]  [<c108a239>] __block_prepare_write+0x168/0x3fd
[  137.363822]  [<c108ad10>] block_prepare_write+0x28/0x3b
[  137.363908]  [<c10acbcd>] ext3_prepare_write+0x3f/0x18d
[  137.363996]  [<c104dbe5>] generic_file_buffered_write+0x197/0x5e6
[  137.364084]  [<c104e2e2>] __generic_file_aio_write_nolock+0x2ae/0x5ad
[  137.364171]  [<c104e6ea>] generic_file_aio_write+0x58/0xc4
[  137.364254]  [<c10aa723>] ext3_file_write+0x2d/0xba
[  137.364340]  [<c1069441>] do_sync_write+0xc7/0x116
[  137.364425]  [<c1069612>] vfs_write+0x182/0x187
[  137.364511]  [<c10696b8>] sys_write+0x3d/0x64
[  137.364595]  [<c1002f78>] syscall_call+0x7/0xb
[  137.364682]  =======================


Now, apart from this reproducible problem, I suddenly also get these errors in dmesg:

[  137.471738] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
[  137.471812]     last function: laundromat_main+0x0/0x69 [nfsd]
[  137.471923] 2 locks held by nfsd4/3577:
[  137.471971]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[  137.472209]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[  137.472445]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
[  137.472528]  [<c1003d7f>] show_trace+0x12/0x14
[  137.472613]  [<c1003e79>] dump_stack+0x16/0x18
[  137.472695]  [<c102c2e8>] run_workqueue+0x167/0x170
[  137.472779]  [<c102c437>] worker_thread+0x146/0x165
[  137.472861]  [<c102f797>] kthread+0x97/0xc4
[  137.472943]  [<c1003bdb>] kernel_thread_helper+0x7/0x10

These happen about twice per minute at first, and then once every 2 minutes.

The last one (while writing this message) is:

[182932.590394] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
[182932.590450]     last function: laundromat_main+0x0/0x69 [nfsd]
[182932.590557] 2 locks held by nfsd4/3577:
[182932.590600]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[182932.590830]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[182932.591061]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
[182932.591145]  [<c1003d7f>] show_trace+0x12/0x14
[182932.591234]  [<c1003e79>] dump_stack+0x16/0x18
[182932.591316]  [<c102c2e8>] run_workqueue+0x167/0x170
[182932.591399]  [<c102c437>] worker_thread+0x146/0x165
[182932.591482]  [<c102f797>] kthread+0x97/0xc4
[182932.591564]  [<c1003bdb>] kernel_thread_helper+0x7/0x10

Config can be found here:
http://keetweej.vanheusden.com/~folkert/config-2.6.20.txt

The system is a P4 @ 3.2 GHz with 2 GB of RAM and IDE hard disks; it has 5 mounts from a server connected via a 1 Gb LAN. The other end (the server) runs 2.6.19.1.

On the problem system, this is the output of nfsstat:

Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
1833       611        611        0          0

Server nfs v2:
null         getattr      setattr      root         lookup       readlink
611     100% 0         0% 0         0% 0         0% 0         0% 0         0%
read         wrcache      write        create       remove       rename
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
link         symlink      mkdir        rmdir        readdir      fsstat
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%

Server nfs v3:
null         getattr      setattr      lookup       access       readlink
611     100% 0         0% 0         0% 0         0% 0         0% 0         0%
read         write        create       mkdir        symlink      mknod
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
remove       rmdir        rename       link         readdir      readdirplus
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
fsstat       fsinfo       pathconf     commit
0         0% 0         0% 0         0% 0         0%

Server nfs v4:
null         compound
611     100% 0         0%

Server nfs v4 operations:
op0-unused   op1-unused   op2-future   access       close        commit
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
create       delegpurge   delegreturn  getattr      getfh        link
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
lock         lockt        locku        lookup       lookup_root  nverify
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
open         openattr     open_conf    open_dgrd    putfh        putpubfh
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
putrootfh    read         readdir      readlink     remove       rename
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
renew        restorefh    savefh       secinfo      setattr      setcltid
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
setcltidconf verify       write        rellockowner
0         0% 0         0% 0         0% 0         0%

Client rpc stats:
calls      retrans    authrefrsh
1404874    5502       0

Client nfs v2:
null         getattr      setattr      root         lookup       readlink
0         0% 157240   11% 11360     0% 0         0% 707396   50% 20434     1%
read         wrcache      write        create       remove       rename
18        0% 0         0% 471048   33% 2900      0% 323       0% 2888      0%
link         symlink      mkdir        rmdir        readdir      fsstat
0         0% 2         0% 60        0% 32        0% 27463     1% 3709      0%

Client nfs v4:
null         read         write        commit       open         open_conf
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
open_noat    open_dgrd    close        setattr      fsinfo       renew
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
setclntid    confirm      lock         lockt        locku        access
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
getattr      lookup       lookup_root  remove       rename       link
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
symlink      create       pathconf     statfs       readlink     readdir
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
server_caps  delegreturn
0         0% 0         0%


Folkert van Heusden

-- 
www.biglumber.com <- site where one can exchange PGP key signatures 
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com


* Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-15 19:06 ` Andrew Morton
  2007-03-15 19:17   ` Folkert van Heusden
@ 2007-03-16  8:41   ` Peter Zijlstra
  2007-03-16 11:39     ` Andrew Morton
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2007-03-16  8:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Folkert van Heusden, linux-kernel, Oleg Nesterov, Neil Brown

On Thu, 2007-03-15 at 11:06 -0800, Andrew Morton wrote:
> > On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden <folkert@vanheusden.com> wrote:
> > ...
> > [ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
> > [ 1756.728271]     last function: laundromat_main+0x0/0x69 [nfsd]
> > [ 1756.728392] 2 locks held by nfsd4/3577:
> > [ 1756.728435]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > [ 1756.728679]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > [ 1756.728923]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
> > [ 1756.729015]  [<c1003d7f>] show_trace+0x12/0x14
> > [ 1756.729103]  [<c1003e79>] dump_stack+0x16/0x18
> > [ 1756.729187]  [<c102c2e8>] run_workqueue+0x167/0x170
> > [ 1756.729276]  [<c102c437>] worker_thread+0x146/0x165
> > [ 1756.729368]  [<c102f797>] kthread+0x97/0xc4
> > [ 1756.729456]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
> > [ 1756.729547]  =======================
> > [ 1792.436492] svc: unknown version (0 for prog 100003, nfsd)
> > [ 1846.683648] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
> > [ 1846.683701]     last function: laundromat_main+0x0/0x69 [nfsd]
> > [ 1846.683832] 2 locks held by nfsd4/3577:
> > [ 1846.683885]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > [ 1846.683980]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > [ 1846.683988]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
> > [ 1846.683994]  [<c1003d7f>] show_trace+0x12/0x14
> > [ 1846.683997]  [<c1003e79>] dump_stack+0x16/0x18
> > [ 1846.684001]  [<c102c2e8>] run_workqueue+0x167/0x170
> > [ 1846.684006]  [<c102c437>] worker_thread+0x146/0x165
> > [ 1846.684012]  [<c102f797>] kthread+0x97/0xc4
> > [ 1846.684023]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
> 
> Oleg, that's a fairly incomprehensible message we have in there.  Can you
> please explain what it means?

I think I'm responsible for this message (commit
d5abe669172f20a4129a711de0f250a4e07db298); what it says is that the
function executed by the workqueue (here laundromat_main) either leaked
an atomic context or is still holding locks (2 locks in this case).




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-16  8:41   ` Peter Zijlstra
@ 2007-03-16 11:39     ` Andrew Morton
  2007-03-19  6:24       ` Neil Brown
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2007-03-16 11:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Folkert van Heusden, linux-kernel, Oleg Nesterov, Neil Brown

On Fri, 16 Mar 2007 09:41:20 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2007-03-15 at 11:06 -0800, Andrew Morton wrote:
> > > On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden <folkert@vanheusden.com> wrote:
> > > ...
> > > [ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
> > > [ 1756.728271]     last function: laundromat_main+0x0/0x69 [nfsd]
> > > [ 1756.728392] 2 locks held by nfsd4/3577:
> > > [ 1756.728435]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > > [ 1756.728679]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > > [ 1756.728923]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
> > > [ 1756.729015]  [<c1003d7f>] show_trace+0x12/0x14
> > > [ 1756.729103]  [<c1003e79>] dump_stack+0x16/0x18
> > > [ 1756.729187]  [<c102c2e8>] run_workqueue+0x167/0x170
> > > [ 1756.729276]  [<c102c437>] worker_thread+0x146/0x165
> > > [ 1756.729368]  [<c102f797>] kthread+0x97/0xc4
> > > [ 1756.729456]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
> > > [ 1756.729547]  =======================
> > > [ 1792.436492] svc: unknown version (0 for prog 100003, nfsd)
> > > [ 1846.683648] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
> > > [ 1846.683701]     last function: laundromat_main+0x0/0x69 [nfsd]
> > > [ 1846.683832] 2 locks held by nfsd4/3577:
> > > [ 1846.683885]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > > [ 1846.683980]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > > [ 1846.683988]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
> > > [ 1846.683994]  [<c1003d7f>] show_trace+0x12/0x14
> > > [ 1846.683997]  [<c1003e79>] dump_stack+0x16/0x18
> > > [ 1846.684001]  [<c102c2e8>] run_workqueue+0x167/0x170
> > > [ 1846.684006]  [<c102c437>] worker_thread+0x146/0x165
> > > [ 1846.684012]  [<c102f797>] kthread+0x97/0xc4
> > > [ 1846.684023]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
> > 
> > Oleg, that's a fairly incomprehensible message we have in there.  Can you
> > please explain what it means?
> 
> I think I'm responsible for this message (commit
> d5abe669172f20a4129a711de0f250a4e07db298); what it says is that the
> function executed by the workqueue (here laundromat_main) leaked an
> atomic context or is still holding locks (2 locks in this case).
> 

OK.  That's not necessarily a bug: one could envisage a (weird) piece of
code which takes a lock and then releases it on a later workqueue invocation.
But I'm not sure that nfs4_laundromat() is actually supposed to be doing
anything like that.

Then again, maybe it is: it seems to be waddling through a directory under
the control of a little state machine, with timeouts.

Neil: help?
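The pattern described above can be sketched (purely hypothetically; the two-phase split and all names are invented, not taken from nfsd) as a work function that acquires a lock on one invocation and releases it on a later one:

```c
#include <pthread.h>

/* Hypothetical sketch of the legal-but-unusual pattern: a work item,
 * driven by a small state machine, that deliberately returns with a
 * lock held and releases it on its next run. */

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static int phase;   /* 0: start the scan and keep the lock, 1: finish */

static void staged_work(void)
{
    if (phase == 0) {
        pthread_mutex_lock(&state_lock);  /* deliberately held past return */
        /* ... walk part of a directory, arm a timeout ... */
        phase = 1;
    } else {
        /* ... finish the walk on the next invocation ... */
        pthread_mutex_unlock(&state_lock);
        phase = 0;
    }
}
```

A leak check that only inspects the lock depth after each invocation would fire on the first call here even though the code, taken as a whole, balances its locking.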


* Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-15 19:17   ` Folkert van Heusden
@ 2007-03-16 14:49     ` Jarek Poplawski
  2007-03-20 11:17     ` dquot.c: possible circular locking " Jarek Poplawski
  1 sibling, 0 replies; 35+ messages in thread
From: Jarek Poplawski @ 2007-03-16 14:49 UTC (permalink / raw)
  To: Folkert van Heusden
  Cc: Andrew Morton, linux-kernel, Oleg Nesterov, Neil Brown

On 15-03-2007 20:17, Folkert van Heusden wrote:
>>> On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden <folkert@vanheusden.com> wrote:
>>> ...
>>> [ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
> ...
>>> [ 1846.684023]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
>> Oleg, that's a fairly incomprehensible message we have in there.  Can you
>> please explain what it means?
> 
> Haha ok :-)
> 
> Good: since I run 2.6.20 with these debugging switches turned on, I
> occasionally get errors like these. I ALWAYS get the following error
> when the system first boots, when the TOR executable is started:
> 
> [  137.324255] =======================================================
> [  137.324359] [ INFO: possible circular locking dependency detected ]
> [  137.324412] 2.6.20 #2
> [  137.324457] -------------------------------------------------------

Hi,

IMHO you should now switch off lockdep (the "Lock debugging: ..." options)
until you update or get a patch for this circular-locking problem.  Your
logs should be sufficient for the nfsd people, and lockdep turns off its
checking after the first report anyway. I don't know whether this is a
known, real locking problem, but it would be interesting now to see whether
the workqueue error reappears. It seems the nfsd code is bailing out
somewhere with locks held - maybe because of problems with lockdep's error
handling.

Regards,
Jarek P.



* Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-16 11:39     ` Andrew Morton
@ 2007-03-19  6:24       ` Neil Brown
  2007-03-20  9:37         ` [PATCH] " Jarek Poplawski
  0 siblings, 1 reply; 35+ messages in thread
From: Neil Brown @ 2007-03-19  6:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Folkert van Heusden, linux-kernel, Oleg Nesterov,
	J. Bruce Fields

On Friday March 16, akpm@linux-foundation.org wrote:
> 
> OK.  That's not necessarily a bug: one could envisage a (weird) piece of
> code which takes a lock then releases it on a later workqueue invocation.
> But I'm not sure that nfs4_laundromat() is actually supposed to be doing
> anything like that.
> 
> Then again, maybe it is: it seems to be waddling through a directory under
> the control of a little state machine, with timeouts.
> 
> Neil: help?

I'm quite certain that laundromat_main does *not* leave client_mutex
locked, as the last thing it does is call nfs4_unlock_state(), which is
simply:
	mutex_unlock(&client_mutex);
To me, that raises some doubt about whether the lock-leak check is
working properly...
It is somewhat harder to track the locking of i_mutex, but it seems to me
that every time it is taken, it is released again shortly afterwards.

So I think this must be a problem with the leak detection, not with NFSd.

NeilBrown


> On Fri, 16 Mar 2007 09:41:20 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Thu, 2007-03-15 at 11:06 -0800, Andrew Morton wrote:
> > > > On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden <folkert@vanheusden.com> wrote:
> > > > ...
> > > > [ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
> > > > [ 1756.728271]     last function: laundromat_main+0x0/0x69 [nfsd]
> > > > [ 1756.728392] 2 locks held by nfsd4/3577:
> > > > [ 1756.728435]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > > > [ 1756.728679]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > > > [ 1756.728923]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
> > > > [ 1756.729015]  [<c1003d7f>] show_trace+0x12/0x14
> > > > [ 1756.729103]  [<c1003e79>] dump_stack+0x16/0x18
> > > > [ 1756.729187]  [<c102c2e8>] run_workqueue+0x167/0x170
> > > > [ 1756.729276]  [<c102c437>] worker_thread+0x146/0x165
> > > > [ 1756.729368]  [<c102f797>] kthread+0x97/0xc4
> > > > [ 1756.729456]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
> > > > [ 1756.729547]  =======================
> > > > [ 1792.436492] svc: unknown version (0 for prog 100003, nfsd)
> > > > [ 1846.683648] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
> > > > [ 1846.683701]     last function: laundromat_main+0x0/0x69 [nfsd]
> > > > [ 1846.683832] 2 locks held by nfsd4/3577:
> > > > [ 1846.683885]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > > > [ 1846.683980]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > > > [ 1846.683988]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
> > > > [ 1846.683994]  [<c1003d7f>] show_trace+0x12/0x14
> > > > [ 1846.683997]  [<c1003e79>] dump_stack+0x16/0x18
> > > > [ 1846.684001]  [<c102c2e8>] run_workqueue+0x167/0x170
> > > > [ 1846.684006]  [<c102c437>] worker_thread+0x146/0x165
> > > > [ 1846.684012]  [<c102f797>] kthread+0x97/0xc4
> > > > [ 1846.684023]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
> > > 
> > > Oleg, that's a fairly incomprehensible message we have in there.  Can you
> > > please explain what it means?
> > 
> > I think I'm responsible for this message (commit
> > d5abe669172f20a4129a711de0f250a4e07db298); what it says is that the
> > function executed by the workqueue (here laundromat_main) leaked an
> > atomic context or is still holding locks (2 locks in this case).
> > 


* [PATCH] Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-19  6:24       ` Neil Brown
@ 2007-03-20  9:37         ` Jarek Poplawski
  2007-03-20 16:07           ` Oleg Nesterov
  0 siblings, 1 reply; 35+ messages in thread
From: Jarek Poplawski @ 2007-03-20  9:37 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Peter Zijlstra, Folkert van Heusden, linux-kernel,
	Oleg Nesterov, J. Bruce Fields, Ingo Molnar

On 19-03-2007 07:24, Neil Brown wrote:
> On Friday March 16, akpm@linux-foundation.org wrote:
>> OK.  That's not necessarily a bug: one could envisage a (weird) piece of
>> code which takes a lock then releases it on a later workqueue invocation.
>> But I'm not sure that nfs4_laundromat() is actually supposed to be doing
>> anything like that.
>>
>> Then again, maybe it is: it seems to be waddling through a directory under
>> the control of a little state machine, with timeouts.
>>
>> Neil: help?
> 
> I'm quite certain that laundromat_main does *not* leave client_mutex
> locked as the last thing it does is call nfs4_unlock_state which is
> 	mutex_unlock(&client_mutex);
> To me, that raises some doubt about whether the lock leak check is
> working properly...
> It is somewhat harder to track locking of i_mutex, but it seems to me
> that every time it is taken, it is released again shortly afterwards.
> 
> So I think this must be a problem with leak detection, not with NFSd.
> 
> NeilBrown
> 
> 
>> On Fri, 16 Mar 2007 09:41:20 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>>
>>> On Thu, 2007-03-15 at 11:06 -0800, Andrew Morton wrote:
>>>>> On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden <folkert@vanheusden.com> wrote:
>>>>> ...
>>>>> [ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
>>>>> [ 1756.728271]     last function: laundromat_main+0x0/0x69 [nfsd]
>>>>> [ 1756.728392] 2 locks held by nfsd4/3577:
>>>>> [ 1756.728435]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
>>>>> [ 1756.728679]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
>>>>> [ 1756.728923]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
>>>>> [ 1756.729015]  [<c1003d7f>] show_trace+0x12/0x14
>>>>> [ 1756.729103]  [<c1003e79>] dump_stack+0x16/0x18
>>>>> [ 1756.729187]  [<c102c2e8>] run_workqueue+0x167/0x170
>>>>> [ 1756.729276]  [<c102c437>] worker_thread+0x146/0x165
>>>>> [ 1756.729368]  [<c102f797>] kthread+0x97/0xc4
>>>>> [ 1756.729456]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
>>>>> [ 1756.729547]  =======================
...
>>> I think I'm responsible for this message (commit
>>> d5abe669172f20a4129a711de0f250a4e07db298); what it says is that the
>>> function executed by the workqueue (here laundromat_main) leaked an
>>> atomic context or is still holding locks (2 locks in this case).

This check is valid with keventd, but it looks like nfsd runs its own
kthread. I'm not sure it's illegal to hold locks across work items in
that case, so, if I'm not wrong about this, here is my patch proposal
(for testing) to take the lock depth on entry into account.

Reported-by: Folkert van Heusden <folkert@vanheusden.com>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

---

diff -Nurp 2.6.21-rc4-git4-/kernel/workqueue.c 2.6.21-rc4-git4/kernel/workqueue.c
--- 2.6.21-rc4-git4-/kernel/workqueue.c	2007-02-04 19:44:54.000000000 +0100
+++ 2.6.21-rc4-git4/kernel/workqueue.c	2007-03-20 09:30:46.000000000 +0100
@@ -316,6 +316,7 @@ static void run_workqueue(struct cpu_wor
 		struct work_struct *work = list_entry(cwq->worklist.next,
 						struct work_struct, entry);
 		work_func_t f = work->func;
+		int ld;
 
 		list_del_init(cwq->worklist.next);
 		spin_unlock_irqrestore(&cwq->lock, flags);
@@ -323,13 +324,15 @@ static void run_workqueue(struct cpu_wor
 		BUG_ON(get_wq_data(work) != cwq);
 		if (!test_bit(WORK_STRUCT_NOAUTOREL, work_data_bits(work)))
 			work_release(work);
+		ld = lockdep_depth(current);
+
 		f(work);
 
-		if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
+		if (unlikely(in_atomic() || (ld -= lockdep_depth(current)))) {
 			printk(KERN_ERR "BUG: workqueue leaked lock or atomic: "
-					"%s/0x%08x/%d\n",
+					"%s/0x%08x/%d/%d\n",
 					current->comm, preempt_count(),
-				       	current->pid);
+				       	current->pid, ld);
 			printk(KERN_ERR "    last function: ");
 			print_symbol("%s\n", (unsigned long)f);
 			debug_show_held_locks(current);


* dquot.c: possible circular locking Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-15 19:17   ` Folkert van Heusden
  2007-03-16 14:49     ` Jarek Poplawski
@ 2007-03-20 11:17     ` Jarek Poplawski
  2007-03-20 11:22       ` Jarek Poplawski
  1 sibling, 1 reply; 35+ messages in thread
From: Jarek Poplawski @ 2007-03-20 11:17 UTC (permalink / raw)
  To: Folkert van Heusden
  Cc: Andrew Morton, linux-kernel, Oleg Nesterov, Neil Brown,
	Christoph Hellwig, Jan Kara

On 15-03-2007 20:17, Folkert van Heusden wrote:
>>> On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden <folkert@vanheusden.com> wrote:
...
> Haha ok :-)
> 
> Good: since I run 2.6.20 with these debugging switches turned on, I
> occasionally get errors like these. I ALWAYS get the following error
> when the system first boots, when the TOR executable is started:
> 
> [  137.324255] =======================================================
> [  137.324359] [ INFO: possible circular locking dependency detected ]
> [  137.324412] 2.6.20 #2
> [  137.324457] -------------------------------------------------------
> [  137.324510] tor/4857 is trying to acquire lock:
> [  137.324559]  (tty_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> [  137.324765]
> [  137.324766] but task is already holding lock:
> [  137.324859]  (&s->s_dquot.dqptr_sem){----}, at: [<c1099403>] dquot_alloc_space+0x50/0x189
> [  137.325067]
> [  137.325069] which lock already depends on the new lock.

IMHO lockdep found that two locks are taken in different order:

-> #1: 1) tty_mutex in con_console() 2) dqptr_sem (somewhere later)
-> #0: 1) dqptr_sem 2) tty_console in dquot_alloc_space() with print_warning()

Probably print_warning() and flush_warnings() should be reworked to do
their printing without dqptr_sem held, or the locking order should be
changed. I hope somebody on this CC can work it out better.
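As a minimal userspace illustration of the inversion lockdep reports (dqptr_sem is an rwsem in the kernel; a plain mutex is used here for brevity, and both function bodies are invented): one path takes tty_mutex before dqptr_sem, the other takes them in the opposite order. Run back to back they complete fine, which is why the deadlock only bites under the right interleaving of two tasks.

```c
#include <pthread.h>

static pthread_mutex_t tty_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t dqptr_sem = PTHREAD_MUTEX_INITIALIZER;

/* -> #1: the console-close side: tty_mutex first, dqptr_sem later */
static int close_path(void)
{
    pthread_mutex_lock(&tty_mutex);
    pthread_mutex_lock(&dqptr_sem);
    pthread_mutex_unlock(&dqptr_sem);
    pthread_mutex_unlock(&tty_mutex);
    return 0;
}

/* -> #0: allocation printing a quota warning: dqptr_sem first, then
 * tty_mutex inside the warning path -- the reverse order */
static int warn_path(void)
{
    pthread_mutex_lock(&dqptr_sem);
    pthread_mutex_lock(&tty_mutex);
    pthread_mutex_unlock(&tty_mutex);
    pthread_mutex_unlock(&dqptr_sem);
    return 0;
}
```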

Regards,
Jarek P.


> [  137.325071]
> [  137.325206]
> [  137.325208] the existing dependency chain (in reverse order) is:
> [  137.325300]
> [  137.325301] -> #4 (&s->s_dquot.dqptr_sem){----}:
> [  137.325501]        [<c10354b8>] check_prev_add+0x154/0x206
> [  137.325852]        [<c10355d4>] check_prevs_add+0x6a/0xd5
> [  137.326197]        [<c1036dd2>] __lock_acquire+0x61c/0xa05
> [  137.326538]        [<c1037624>] lock_acquire+0x62/0x81
> [  137.326887]        [<c1032b5f>] down_read+0x2b/0x3d
> [  137.327241]        [<c1099403>] dquot_alloc_space+0x50/0x189
> [  137.327588]        [<c10a98e6>] ext3_new_blocks+0x44b/0x5a2
> [  137.327935]        [<c10abe78>] ext3_alloc_blocks+0x40/0xdf
> [  137.328280]        [<c10abf67>] ext3_alloc_branch+0x50/0x21b
> [  137.328622]        [<c10ac445>] ext3_get_blocks_handle+0x1b8/0x367
> [  137.328980]        [<c10ac76e>] ext3_getblk+0x97/0x228
> [  137.329330]        [<c10ac919>] ext3_bread+0x1a/0x78
> [  137.329672]        [<c10b1bc2>] ext3_mkdir+0xf4/0x270
> [  137.330022]        [<c1072a6a>] vfs_mkdir+0xb3/0x161
> [  137.330368]        [<c1072ba4>] sys_mkdirat+0x8c/0xc4
> [  137.330714]        [<c1072bfc>] sys_mkdir+0x20/0x22
> [  137.331063]        [<c1002f78>] syscall_call+0x7/0xb
> [  137.331406]        [<ffffffff>] 0xffffffff
> [  137.331771]
> [  137.331772] -> #3 (&ei->truncate_mutex){--..}:
> [  137.331979]        [<c10354b8>] check_prev_add+0x154/0x206
> [  137.332332]        [<c10355d4>] check_prevs_add+0x6a/0xd5
> [  137.332676]        [<c1036dd2>] __lock_acquire+0x61c/0xa05
> [  137.333025]        [<c1037624>] lock_acquire+0x62/0x81
> [  137.333370]        [<c1205c09>] __mutex_lock_slowpath+0x75/0x28c
> [  137.333930]        [<c1205b88>] mutex_lock+0x8/0xa
> [  137.334271]        [<c10ae14f>] ext3_truncate+0x170/0x468
> [  137.334613]        [<c1058e72>] vmtruncate+0xa6/0x116
> [  137.334949]        [<c107d587>] inode_setattr+0x145/0x16c
> [  137.335286]        [<c10af071>] ext3_setattr+0x150/0x22f
> [  137.335627]        [<c107d909>] notify_change+0x35b/0x392
> [  137.335968]        [<c10679d2>] do_truncate+0x52/0x75
> [  137.336305]        [<c1071f5e>] may_open+0x1ec/0x231
> [  137.336642]        [<c107211c>] open_namei+0xda/0x59b
> [  137.336975]        [<c106863a>] do_filp_open+0x2c/0x53
> [  137.337310]        [<c106896f>] do_sys_open+0x52/0xd8
> [  137.337645]        [<c1068a11>] sys_open+0x1c/0x1e
> [  137.337980]        [<c1002f78>] syscall_call+0x7/0xb
> [  137.338315]        [<ffffffff>] 0xffffffff
> [  137.338665]
> [  137.338666] -> #2 (&inode->i_alloc_sem){--..}:
> [  137.338864]        [<c10354b8>] check_prev_add+0x154/0x206
> [  137.339200]        [<c10355d4>] check_prevs_add+0x6a/0xd5
> [  137.339535]        [<c1036dd2>] __lock_acquire+0x61c/0xa05
> [  137.339874]        [<c1037624>] lock_acquire+0x62/0x81
> [  137.340207]        [<c1032bf5>] down_write+0x2b/0x45
> [  137.340545]        [<c107d890>] notify_change+0x2e2/0x392
> [  137.340886]        [<c10679d2>] do_truncate+0x52/0x75
> [  137.341222]        [<c1071f5e>] may_open+0x1ec/0x231
> [  137.341557]        [<c107211c>] open_namei+0xda/0x59b
> [  137.341898]          [<c1068a11>] sys_open+0x1c/0x1e
> [  137.343109]        [<c1002f78>] syscall_call+0x7/0xb
> [  137.343444]        [<ffffffff>] 0xffffffff
> [  137.343792]
> [  137.343793] -> #1 (&sysfs_inode_imutex_key){--..}:
> [  137.343988]        [<c10354b8>] check_prev_add+0x154/0x206
> [  137.344320]        [<c10355d4>] check_prevs_add+0x6a/0xd5
> [  137.344655]        [<c1036dd2>] __lock_acquire+0x61c/0xa05
> [  137.344986]        [<c1037624>] lock_acquire+0x62/0x81
> [  137.345321]        [<c1205c09>] __mutex_lock_slowpath+0x75/0x28c
> [  137.345658]        [<c1205b88>] mutex_lock+0x8/0xa
> [  137.345991]        [<c10a5447>] sysfs_hash_and_remove+0x43/0x11c
> [  137.346328]        [<c10a5ef7>] sysfs_remove_file+0xd/0x12
> [  137.346660]        [<c114b00c>] device_remove_file+0x32/0x44
> [  137.346992]        [<c114b6d9>] device_del+0x174/0x1d2
> [  137.347325]        [<c114b742>] device_unregister+0xb/0x15
> [  137.347661]        [<c114b93d>] device_destroy+0x8d/0x9a
> [  137.347994]        [<c1137a38>] vcs_remove_sysfs+0x1c/0x38
> [  137.348328]        [<c113e0d8>] con_close+0x5e/0x6b
> [  137.348661]        [<c1130103>] release_dev+0x4c4/0x6ce
> [  137.348999]        [<c113077b>] tty_release+0x12/0x1c
> [  137.349332]         [<c10aa723>] ext3_file_write+0x2d/0xba
> [  137.358495]        [<c1069441>] do_sync_write+0xc7/0x116
> [  137.358838]        [<c1069612>] vfs_write+0x182/0x187
> [  137.359176]        [<c10696b8>] sys_write+0x3d/0x64
> [  137.359513]        [<c1002f78>] syscall_call+0x7/0xb
> [  137.359848]        [<ffffffff>] 0xffffffff
> [  137.360200]
> [  137.360202] other info that might help us debug this:
> [  137.360204]
> [  137.360335] 3 locks held by tor/4857:
> [  137.360382]  #0:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> [  137.360625]  #1:  (&ei->truncate_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> [  137.361072]  #2:  (&s->s_dquot.dqptr_sem){----}, at: [<c1099403>] dquot_alloc_space+0x50/0x189
> [  137.361313]
> [  137.361314] stack backtrace:
> [  137.361401]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
> [  137.361488]  [<c1003d7f>] show_trace+0x12/0x14
> [  137.361571]  [<c1003e79>] dump_stack+0x16/0x18
> [  137.361663]  [<c1034d6d>] print_circular_bug_tail+0x6f/0x71
> [  137.361752]  [<c1035398>] check_prev_add+0x34/0x206
> [  137.361838]  [<c10355d4>] check_prevs_add+0x6a/0xd5
> [  137.362786]  [<c1036dd2>] __lock_acquire+0x61c/0xa05
> [  137.362877]  [<c1037624>] lock_acquire+0x62/0x81
> [  137.362960]  [<c1205c09>] __mutex_lock_slowpath+0x75/0x28c
> [  137.363047]  [<c1205b88>] mutex_lock+0x8/0xa
> [  137.363132]  [<c1098d79>] print_warning+0x8c/0x15d
> [  137.363217]  [<c1099537>] dquot_alloc_space+0x184/0x189
> [  137.363302]  [<c10a98e6>] ext3_new_blocks+0x44b/0x5a2
> [  137.363391]  [<c10abe78>] ext3_alloc_blocks+0x40/0xdf
> [  137.363476]  [<c10abf67>] ext3_alloc_branch+0x50/0x21b
> [  137.363561]  [<c10ac445>] ext3_get_blocks_handle+0x1b8/0x367
> [  137.363650]  [<c10ac66c>] ext3_get_block+0x78/0xe3
> [  137.363735]  [<c108a239>] __block_prepare_write+0x168/0x3fd
> [  137.363822]  [<c108ad10>] block_prepare_write+0x28/0x3b
> [  137.363908]  [<c10acbcd>] ext3_prepare_write+0x3f/0x18d
> [  137.363996]  [<c104dbe5>] generic_file_buffered_write+0x197/0x5e6
> [  137.364084]  [<c104e2e2>] __generic_file_aio_write_nolock+0x2ae/0x5ad
> [  137.364171]  [<c104e6ea>] generic_file_aio_write+0x58/0xc4
> [  137.364254]  [<c10aa723>] ext3_file_write+0x2d/0xba
> [  137.364340]  [<c1069441>] do_sync_write+0xc7/0x116
> [  137.364425]  [<c1069612>] vfs_write+0x182/0x187
> [  137.364511]  [<c10696b8>] sys_write+0x3d/0x64
> [  137.364595]  [<c1002f78>] syscall_call+0x7/0xb
> [  137.364682]  =======================
> 


* Re: dquot.c: possible circular locking Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-20 11:17     ` dquot.c: possible circular locking " Jarek Poplawski
@ 2007-03-20 11:22       ` Jarek Poplawski
  2007-03-20 11:31         ` Jarek Poplawski
  0 siblings, 1 reply; 35+ messages in thread
From: Jarek Poplawski @ 2007-03-20 11:22 UTC (permalink / raw)
  To: Folkert van Heusden
  Cc: Andrew Morton, linux-kernel, Oleg Nesterov, Neil Brown,
	Christoph Hellwig, Jan Kara

On Tue, Mar 20, 2007 at 12:17:01PM +0100, Jarek Poplawski wrote:
...
> IMHO lockdep found that two locks are taken in different order:
> 
> -> #1: 1) tty_mutex in con_console() 2) dqptr_sem (somewhere later)
> -> #0: 1) dqptr_sem 2) tty_console in dquot_alloc_space() with print_warning()

Should be:
-> #0: 1) dqptr_sem 2) tty_mutex in dquot_alloc_space() with print_warning()

Jarek P.


* Re: dquot.c: possible circular locking Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-20 11:22       ` Jarek Poplawski
@ 2007-03-20 11:31         ` Jarek Poplawski
  2007-03-20 12:19           ` Jan Kara
  0 siblings, 1 reply; 35+ messages in thread
From: Jarek Poplawski @ 2007-03-20 11:31 UTC (permalink / raw)
  To: Folkert van Heusden
  Cc: Andrew Morton, linux-kernel, Oleg Nesterov, Neil Brown,
	Christoph Hellwig, Jan Kara

On Tue, Mar 20, 2007 at 12:22:53PM +0100, Jarek Poplawski wrote:
> On Tue, Mar 20, 2007 at 12:17:01PM +0100, Jarek Poplawski wrote:
> ...
> > IMHO lockdep found that two locks are taken in different order:
> > 
> > -> #1: 1) tty_mutex in con_console() 2) dqptr_sem (somewhere later)
> > -> #0: 1) dqptr_sem 2) tty_console in dquot_alloc_space() with print_warning()

Once more - should be:
 -> #1: 1) tty_mutex in con_close() 2) dqptr_sem (somewhere later)
 -> #0: 1) dqptr_sem 2) tty_mutex in dquot_alloc_space() with print_warning()
 
Very sorry!!!
Jarek P.


* Re: dquot.c: possible circular locking Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-20 11:31         ` Jarek Poplawski
@ 2007-03-20 12:19           ` Jan Kara
  2007-03-20 13:35             ` Arjan van de Ven
  2007-03-20 13:44             ` Jarek Poplawski
  0 siblings, 2 replies; 35+ messages in thread
From: Jan Kara @ 2007-03-20 12:19 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Folkert van Heusden, Andrew Morton, linux-kernel, Oleg Nesterov,
	Neil Brown, Christoph Hellwig

On Tue 20-03-07 12:31:51, Jarek Poplawski wrote:
> On Tue, Mar 20, 2007 at 12:22:53PM +0100, Jarek Poplawski wrote:
> > On Tue, Mar 20, 2007 at 12:17:01PM +0100, Jarek Poplawski wrote:
> > ...
> > > IMHO lockdep found that two locks are taken in different order:
> > > 
> > > -> #1: 1) tty_mutex in con_console() 2) dqptr_sem (somewhere later)
> > > -> #0: 1) dqptr_sem 2) tty_console in dquot_alloc_space() with print_warning()
> 
> Once more - should be:
>  -> #1: 1) tty_mutex in con_close() 2) dqptr_sem (somewhere later)
>  -> #0: 1) dqptr_sem 2) tty_mutex in dquot_alloc_space() with print_warning()
  Yes, I was looking at it. Hmm, we can possibly get rid of tty_mutex
being acquired under dqptr_sem in the quota code. But looking at the path
from con_close(), there's another inversion with i_mutex, which is also
acquired along the sysfs path, and we can hardly get rid of that one in
the quota code.
  Now, none of these is a real deadlock, as quota should never call
print_warning() for sysfs (it doesn't use quota), but it's still nasty. I
suppose tty_mutex is above i_mutex because of those sysfs calls, and it
seems sysfs must be called under tty_mutex because of races with
init_dev(). So it's not easy to get rid of that dependency either.

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* Re: dquot.c: possible circular locking Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-20 12:19           ` Jan Kara
@ 2007-03-20 13:35             ` Arjan van de Ven
  2007-03-20 14:21               ` Jan Kara
  2007-03-20 13:44             ` Jarek Poplawski
  1 sibling, 1 reply; 35+ messages in thread
From: Arjan van de Ven @ 2007-03-20 13:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jarek Poplawski, Folkert van Heusden, Andrew Morton,
	linux-kernel, Oleg Nesterov, Neil Brown, Christoph Hellwig



>   Yes, I was looking at it. Hmm, we can possibly get rid of tty_mutex being
> acquired under dqptr_sem in quota code. But looking at the path from
> con_close() there's another inversion with i_mutex which is also acquired
> along the path for sysfs. And we can hardly get rid of it in the quota code.
>   Now none of these is a real deadlock as quota should never call
> print_warning() for sysfs (it doesn't use quota) but still it's nasty. I
> suppose tty_mutex is above i_mutex because of those sysfs calls and it
> seems sysfs must be called under tty_mutex because of races with
> init_dev(). So it's not easy to get rid of that dependency either.

maybe a far more serious option: Why on EARTH is the quota code going to
TTYs directly? That's just WRONG. Maybe it wasn't 10 years ago, but
nowadays most people use graphical user interfaces and the like...

can we please just get rid of this instead?



* Re: dquot.c: possible circular locking Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-20 12:19           ` Jan Kara
  2007-03-20 13:35             ` Arjan van de Ven
@ 2007-03-20 13:44             ` Jarek Poplawski
  2007-03-20 14:00               ` Jan Kara
  1 sibling, 1 reply; 35+ messages in thread
From: Jarek Poplawski @ 2007-03-20 13:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Folkert van Heusden, Andrew Morton, linux-kernel, Oleg Nesterov,
	Neil Brown, Christoph Hellwig

On Tue, Mar 20, 2007 at 01:19:09PM +0100, Jan Kara wrote:
> On Tue 20-03-07 12:31:51, Jarek Poplawski wrote:
> > On Tue, Mar 20, 2007 at 12:22:53PM +0100, Jarek Poplawski wrote:
> > > On Tue, Mar 20, 2007 at 12:17:01PM +0100, Jarek Poplawski wrote:
> > > ...
> > > > IMHO lockdep found that two locks are taken in different order:
> > > > 
> > > > -> #1: 1) tty_mutex in con_console() 2) dqptr_sem (somewhere later)
> > > > -> #0: 1) dqptr_sem 2) tty_console in dquot_alloc_space() with print_warning()
> > 
> > Once more - should be:
> >  -> #1: 1) tty_mutex in con_close() 2) dqptr_sem (somewhere later)
> >  -> #0: 1) dqptr_sem 2) tty_mutex in dquot_alloc_space() with print_warning()
>   Yes, I was looking at it. Hmm, we can possibly get rid of tty_mutex being
> acquired under dqptr_sem in quota code. But looking at the path from
> con_close() there's another inversion with i_mutex which is also acquired
> along the path for sysfs. And we can hardly get rid of it in the quota code.
>   Now none of these is a real deadlock as quota should never call
> print_warning() for sysfs (it doesn't use quota) but still it's nasty. I
> suppose tty_mutex is above i_mutex because of those sysfs calls and it
> seems sysfs must be called under tty_mutex because of races with
> init_dev(). So it's not easy to get rid of that dependency either.

I wonder whether this couldn't be done with a workqueue (write the message
to a buffer, maybe after a trylock on tty_mutex). And BTW, isn't there some
"modern" way to queue console messages like these?
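A userspace sketch of that deferral idea (entirely hypothetical: these helpers do not exist in dquot.c, and the names are invented): stash the warning in a small buffer while dqptr_sem is held, and let deferred work emit it later with no quota locks held, so tty_mutex is never taken under dqptr_sem.

```c
#include <stdio.h>

#define MAX_PENDING 4

static char pending[MAX_PENDING][64];
static int npending;

/* Called with dqptr_sem held: just records the text, no tty_mutex. */
static void queue_warning(const char *msg)
{
    if (npending < MAX_PENDING)
        snprintf(pending[npending++], sizeof(pending[0]), "%s", msg);
}

/* Deferred work (e.g. a workqueue item): runs with no quota locks
 * held, so it may safely take tty_mutex to print.  Returns the number
 * of warnings flushed. */
static int flush_warnings_deferred(void)
{
    int i, n = npending;

    for (i = 0; i < n; i++)
        printf("quota warning: %s\n", pending[i]);
    npending = 0;
    return n;
}
```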

Jarek P.


* Re: dquot.c: possible circular locking Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-20 13:44             ` Jarek Poplawski
@ 2007-03-20 14:00               ` Jan Kara
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Kara @ 2007-03-20 14:00 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Folkert van Heusden, Andrew Morton, linux-kernel, Oleg Nesterov,
	Neil Brown, Christoph Hellwig

On Tue 20-03-07 14:44:46, Jarek Poplawski wrote:
> On Tue, Mar 20, 2007 at 01:19:09PM +0100, Jan Kara wrote:
> > On Tue 20-03-07 12:31:51, Jarek Poplawski wrote:
> > > On Tue, Mar 20, 2007 at 12:22:53PM +0100, Jarek Poplawski wrote:
> > > > On Tue, Mar 20, 2007 at 12:17:01PM +0100, Jarek Poplawski wrote:
> > > > ...
> > > > > IMHO lockdep found that two locks are taken in different order:
> > > > > 
> > > > > -> #1: 1) tty_mutex in con_console() 2) dqptr_sem (somewhere later)
> > > > > -> #0: 1) dqptr_sem 2) tty_console in dquot_alloc_space() with print_warning()
> > > 
> > > Once more - should be:
> > >  -> #1: 1) tty_mutex in con_close() 2) dqptr_sem (somewhere later)
> > >  -> #0: 1) dqptr_sem 2) tty_mutex in dquot_alloc_space() with print_warning()
> >   Yes, I was looking at it. Hmm, we can possibly get rid of tty_mutex being
> > acquired under dqptr_sem in quota code. But looking at the path from
> > con_close() there's another inversion with i_mutex which is also acquired
> > along the path for sysfs. And we can hardly get rid of it in the quota code.
> >   Now none of these is a real deadlock as quota should never call
> > print_warning() for sysfs (it doesn't use quota) but still it's nasty. I
> > suppose tty_mutex is above i_mutex because of those sysfs calls and it
> > seems sysfs must be called under tty_mutex because of races with
> > init_dev(). So it's not easy to get rid of that dependency either.
> 
> I wonder if this cannot be done with a workqueue (message to a buffer,
> maybe after try_lock on tty_mutex) and BTW isn't there any "modern"
> way to queue console messages like these?
  I don't know about any such way...

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* Re: dquot.c: possible circular locking Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-20 14:21               ` Jan Kara
@ 2007-03-20 14:18                 ` Arjan van de Ven
  0 siblings, 0 replies; 35+ messages in thread
From: Arjan van de Ven @ 2007-03-20 14:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jarek Poplawski, Folkert van Heusden, Andrew Morton,
	linux-kernel, Oleg Nesterov, Neil Brown, Christoph Hellwig

On Tue, 2007-03-20 at 15:21 +0100, Jan Kara wrote:
> On Tue 20-03-07 14:35:10, Arjan van de Ven wrote:
> > 
> > 
> > >   Yes, I was looking at it. Hmm, we can possibly get rid of tty_mutex being
> > > acquired under dqptr_sem in quota code. But looking at the path from
> > > con_close() there's another inversion with i_mutex which is also acquired
> > > along the path for sysfs. And we can hardly get rid of it in the quota code.
> > >   Now none of these is a real deadlock as quota should never call
> > > print_warning() for sysfs (it doesn't use quota) but still it's nasty. I
> > > suppose tty_mutex is above i_mutex because of those sysfs calls and it
> > > seems sysfs must be called under tty_mutex because of races with
> > > init_dev(). So it's not easy to get rid of that dependency either.
> > 
> > maybe a far more serious option: Why on EARTH is the quota code going to
> > TTY's directly? That's just WRONG. Maybe it wasn't 10 years ago, but
> > nowadays most people use graphical user interfaces and the like...
>   We've been discussing this sometimes back in August ;)
> (http://lkml.org/lkml/2006/8/8/237) and we've decided to leave the code in.
> The only reason why I think it should stay in is the existence of quota
> softlimits. There it's nice to warn the user and there's no other way to
> propagate this information into userspace (as the write succeeds).
>   One solution would be to leave the warning to some userspace process
> (like warnquota) run from cron but still I'm not sure we should change the
> behavior.

or send a uevent or something


-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org



* Re: dquot.c: possible circular locking Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-20 13:35             ` Arjan van de Ven
@ 2007-03-20 14:21               ` Jan Kara
  2007-03-20 14:18                 ` Arjan van de Ven
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kara @ 2007-03-20 14:21 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jarek Poplawski, Folkert van Heusden, Andrew Morton,
	linux-kernel, Oleg Nesterov, Neil Brown, Christoph Hellwig

On Tue 20-03-07 14:35:10, Arjan van de Ven wrote:
> 
> 
> >   Yes, I was looking at it. Hmm, we can possibly get rid of tty_mutex being
> > acquired under dqptr_sem in quota code. But looking at the path from
> > con_close() there's another inversion with i_mutex which is also acquired
> > along the path for sysfs. And we can hardly get rid of it in the quota code.
> >   Now none of these is a real deadlock as quota should never call
> > print_warning() for sysfs (it doesn't use quota) but still it's nasty. I
> > suppose tty_mutex is above i_mutex because of those sysfs calls and it
> > seems sysfs must be called under tty_mutex because of races with
> > init_dev(). So it's not easy to get rid of that dependency either.
> 
> maybe a far more serious option: Why on EARTH is the quota code going to
> TTY's directly? That's just WRONG. Maybe it wasn't 10 years ago, but
> nowadays most people use graphical user interfaces and the like...
  We've been discussing this some time back in August ;)
(http://lkml.org/lkml/2006/8/8/237) and we decided to leave the code in.
The only reason why I think it should stay is the existence of quota
softlimits. There it's nice to warn the user, and there's no other way to
propagate this information to userspace (as the write succeeds).
  One solution would be to leave the warning to some userspace process
(like warnquota) run from cron, but I'm still not sure we should change
the behavior.

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* Re: [PATCH] Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-20  9:37         ` [PATCH] " Jarek Poplawski
@ 2007-03-20 16:07           ` Oleg Nesterov
  2007-03-21  8:05             ` Jarek Poplawski
  0 siblings, 1 reply; 35+ messages in thread
From: Oleg Nesterov @ 2007-03-20 16:07 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Neil Brown, Andrew Morton, Peter Zijlstra, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar

On 03/20, Jarek Poplawski wrote:
>
> >> On Fri, 16 Mar 2007 09:41:20 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >>
> >>> On Thu, 2007-03-15 at 11:06 -0800, Andrew Morton wrote:
> >>>>> On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden <folkert@vanheusden.com> wrote:
> >>>>> ...
> >>>>> [ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
> >>>>> [ 1756.728271]     last function: laundromat_main+0x0/0x69 [nfsd]
> >>>>> [ 1756.728392] 2 locks held by nfsd4/3577:
> >>>>> [ 1756.728435]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> >>>>> [ 1756.728679]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> >>>>> [ 1756.728923]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
> >>>>> [ 1756.729015]  [<c1003d7f>] show_trace+0x12/0x14
> >>>>> [ 1756.729103]  [<c1003e79>] dump_stack+0x16/0x18
> >>>>> [ 1756.729187]  [<c102c2e8>] run_workqueue+0x167/0x170
> >>>>> [ 1756.729276]  [<c102c437>] worker_thread+0x146/0x165
> >>>>> [ 1756.729368]  [<c102f797>] kthread+0x97/0xc4
> >>>>> [ 1756.729456]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
> >>>>> [ 1756.729547]  =======================
> ...
> >>> I think I'm responsible for this message (commit
> >>> d5abe669172f20a4129a711de0f250a4e07db298); what is says is that the
> >>> function executed by the workqueue (here laundromat_main) leaked an
> >>> atomic context or is still holding locks (2 locks in this case).
> 
> This check is valid with keventd, but it looks like nfsd runs
> kthread by itself. I'm not sure it's illegal to hold locks then,

nfsd creates laundry_wq by itself, yes, but cwq->thread runs with
lockdep_depth() == 0. Unless we have a bug with lockdep_depth(),
lockdep_depth() != 0 means that work->func() returns with a lock
held (or it can flush its own workqueue under lock, but in that case
we should have a different trace).

Personally I agree with Andrew:
>
> > OK.  That's not necessarily a bug: one could envisage a (weird) piece of
> > code which takes a lock then releases it on a later workqueue invokation.

> @@ -323,13 +324,15 @@ static void run_workqueue(struct cpu_wor
>  		BUG_ON(get_wq_data(work) != cwq);
>  		if (!test_bit(WORK_STRUCT_NOAUTOREL, work_data_bits(work)))
>  			work_release(work);
> +		ld = lockdep_depth(current);
> +
>  		f(work);
>  
> -		if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
> +		if (unlikely(in_atomic() || (ld -= lockdep_depth(current)))) {

and with this change we will also have a BUG report on "then releases it on a
later workqueue invokation".
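To make the before/after comparison in the quoted hunk concrete, here is a userspace model (invented names, not the kernel implementation) of why the delta check also catches the take-now-release-later case Andrew described, which a plain "depth > 0" test would miss on the releasing invocation:

```c
#include <assert.h>

static int lockdep_depth_model;      /* stands in for lockdep_depth(current) */

static void lock_model(void)   { lockdep_depth_model++; }
static void unlock_model(void) { lockdep_depth_model--; }

/* Mirrors the patched run_workqueue(): snapshot the depth before the
 * work function, compare after.  Nonzero means the held-lock count
 * changed in either direction. */
int run_one_work(void (*f)(void))
{
    int ld = lockdep_depth_model;    /* ld = lockdep_depth(current); */
    f();
    return ld - lockdep_depth_model; /* != 0 => leaked or over-released */
}

static void good_work(void)    { lock_model(); unlock_model(); }
static void leaky_work(void)   { lock_model(); }   /* returns with lock held */
static void freeing_work(void) { unlock_model(); } /* releases the leaked one */
```

Both leaky_work() and freeing_work() produce a nonzero delta, so the weird-but-legal pattern of releasing on a later invocation would now be flagged too, as Oleg notes.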

Btw. __nfs4_state_shutdown() does cancel_delayed_work(&laundromat_work).
This is unneeded and imho confusing: laundry_wq was already destroyed,
so it would be a BUG to have a pending delayed work.

Oleg.



* Re: [PATCH] Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-20 16:07           ` Oleg Nesterov
@ 2007-03-21  8:05             ` Jarek Poplawski
  2007-03-21 14:46               ` Oleg Nesterov
  0 siblings, 1 reply; 35+ messages in thread
From: Jarek Poplawski @ 2007-03-21  8:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Neil Brown, Andrew Morton, Peter Zijlstra, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar

On Tue, Mar 20, 2007 at 07:07:59PM +0300, Oleg Nesterov wrote:
> On 03/20, Jarek Poplawski wrote:
...
> > >>> On Thu, 2007-03-15 at 11:06 -0800, Andrew Morton wrote:
> > >>>>> On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden <folkert@vanheusden.com> wrote:
> > >>>>> ...
> > >>>>> [ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
> > >>>>> [ 1756.728271]     last function: laundromat_main+0x0/0x69 [nfsd]
> > >>>>> [ 1756.728392] 2 locks held by nfsd4/3577:
> > >>>>> [ 1756.728435]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > >>>>> [ 1756.728679]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > >>>>> [ 1756.728923]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
> > >>>>> [ 1756.729015]  [<c1003d7f>] show_trace+0x12/0x14
> > >>>>> [ 1756.729103]  [<c1003e79>] dump_stack+0x16/0x18
> > >>>>> [ 1756.729187]  [<c102c2e8>] run_workqueue+0x167/0x170
> > >>>>> [ 1756.729276]  [<c102c437>] worker_thread+0x146/0x165
> > >>>>> [ 1756.729368]  [<c102f797>] kthread+0x97/0xc4
> > >>>>> [ 1756.729456]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
> > >>>>> [ 1756.729547]  =======================
...
> > This check is valid with keventd, but it looks like nfsd runs
> > kthread by itself. I'm not sure it's illegal to hold locks then,
> 
> nfsd creates laundry_wq by itself, yes, but cwq->thread runs with
> lockdep_depth() == 0. Unless we have a bug with lockdep_depth(),
> lockdep_depth() != 0 means that work->func() returns with a lock
> held (or it can flush its own workqueue under lock, but in that case
> we should have a different trace).

IMHO you can only say this thread is supposed to run with
lockdep_depth() == 0. lockdep_depth is counted within a process,
which starts before f(), so the only way to say f() leaked locks
is to check these locks before and after f().

> 
> Personally I agree with Andrew:
> >
> > > OK.  That's not necessarily a bug: one could envisage a (weird) piece of
> > > code which takes a lock then releases it on a later workqueue invokation.

But the code named here is laundromat_main, and it doesn't
seem to work like that.

> 
> > @@ -323,13 +324,15 @@ static void run_workqueue(struct cpu_wor
> >  		BUG_ON(get_wq_data(work) != cwq);
> >  		if (!test_bit(WORK_STRUCT_NOAUTOREL, work_data_bits(work)))
> >  			work_release(work);
> > +		ld = lockdep_depth(current);
> > +
> >  		f(work);
> >  
> > -		if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
> > +		if (unlikely(in_atomic() || (ld -= lockdep_depth(current)))) {
> 
> and with this change we will also have a BUG report on "then releases it on a
> later workqueue invokation".

Then we could at least say this code is weird (buggy, in my opinion).
This patch doesn't change the way "standard" code is treated,
so I cannot see any possibility of making things worse than they are now.

Regards,
Jarek P.


* Re: [PATCH] Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-21  8:05             ` Jarek Poplawski
@ 2007-03-21 14:46               ` Oleg Nesterov
  2007-03-21 15:16                 ` Jarek Poplawski
  0 siblings, 1 reply; 35+ messages in thread
From: Oleg Nesterov @ 2007-03-21 14:46 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Neil Brown, Andrew Morton, Peter Zijlstra, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar

On 03/21, Jarek Poplawski wrote:
>
> On Tue, Mar 20, 2007 at 07:07:59PM +0300, Oleg Nesterov wrote:
> > On 03/20, Jarek Poplawski wrote:
> ...
> > > >>> On Thu, 2007-03-15 at 11:06 -0800, Andrew Morton wrote:
> > > >>>>> On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden <folkert@vanheusden.com> wrote:
> > > >>>>> ...
> > > >>>>> [ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
> > > >>>>> [ 1756.728271]     last function: laundromat_main+0x0/0x69 [nfsd]
> > > >>>>> [ 1756.728392] 2 locks held by nfsd4/3577:
> > > >>>>> [ 1756.728435]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > > >>>>> [ 1756.728679]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > > >>>>> [ 1756.728923]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
> > > >>>>> [ 1756.729015]  [<c1003d7f>] show_trace+0x12/0x14
> > > >>>>> [ 1756.729103]  [<c1003e79>] dump_stack+0x16/0x18
> > > >>>>> [ 1756.729187]  [<c102c2e8>] run_workqueue+0x167/0x170
> > > >>>>> [ 1756.729276]  [<c102c437>] worker_thread+0x146/0x165
> > > >>>>> [ 1756.729368]  [<c102f797>] kthread+0x97/0xc4
> > > >>>>> [ 1756.729456]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
> > > >>>>> [ 1756.729547]  =======================
> ...
> > > This check is valid with keventd, but it looks like nfsd runs
> > > kthread by itself. I'm not sure it's illegal to hold locks then,
> > 
> > nfsd creates laundry_wq by itself, yes, but cwq->thread runs with
> > lockdep_depth() == 0. Unless we have a bug with lockdep_depth(),
> > lockdep_depth() != 0 means that work->func() returns with a lock
> > held (or it can flush its own workqueue under lock, but in that case
> > we should have a different trace).
> 
> IMHO you can only say this thread is supposed to run with
> lockdep_depth() == 0. lockdep_depth is counted within a process,
> which starts before f(), so the only way to say f() leaked locks
> is to check these locks before and after f().

Sorry, I can't understand you. lockdep_depth is counted within a process,
which starts before f(), yes. This process is cwq->thread, it was forked
during create_workqueue(). It does not take any locks directly, only by
calling work->func(). laundry_wq doesn't differ from keventd_wq or any other
wq in this sense. nfsd does not "runs kthread by itself", it inserts the
work and wakes up cwq->thread.

> > Personally I agree with Andrew:
> > >
> > > > OK.  That's not necessarily a bug: one could envisage a (weird) piece of
> > > > code which takes a lock then releases it on a later workqueue invokation.
> 
> But this code is named here as laundromat_main and it doesn't
> seem to work like this.

This means we have a problem with leak detection.

> > > +		ld = lockdep_depth(current);
> > > +
> > >  		f(work);
> > >  
> > > -		if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
> > > +		if (unlikely(in_atomic() || (ld -= lockdep_depth(current)))) {
> > 
> > and with this change we will also have a BUG report on "then releases it on a
> > later workqueue invokation".
> 
> Then we could say at least this code is weird (buggy - in my opinion).
> This patch doesn't change the way the "standard" code is treated,
> so I cannot see any possibility to get it worse then now.

I didn't mean this patch makes things worse (except it conflicts with other
patches in -mm tree). In fact, it may improve the diagnostics. My point is
that this patch afaics has nothing to do with the discussed problem.

Oleg.



* Re: [PATCH] Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-21 14:46               ` Oleg Nesterov
@ 2007-03-21 15:16                 ` Jarek Poplawski
  2007-03-21 15:17                   ` Folkert van Heusden
  2007-03-21 15:29                   ` Oleg Nesterov
  0 siblings, 2 replies; 35+ messages in thread
From: Jarek Poplawski @ 2007-03-21 15:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Neil Brown, Andrew Morton, Peter Zijlstra, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar

On Wed, Mar 21, 2007 at 05:46:20PM +0300, Oleg Nesterov wrote:
> On 03/21, Jarek Poplawski wrote:
> >
> > On Tue, Mar 20, 2007 at 07:07:59PM +0300, Oleg Nesterov wrote:
> > > On 03/20, Jarek Poplawski wrote:
> > ...
> > > > >>> On Thu, 2007-03-15 at 11:06 -0800, Andrew Morton wrote:
> > > > >>>>> On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden <folkert@vanheusden.com> wrote:
> > > > >>>>> ...
> > > > >>>>> [ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
> > > > >>>>> [ 1756.728271]     last function: laundromat_main+0x0/0x69 [nfsd]
> > > > >>>>> [ 1756.728392] 2 locks held by nfsd4/3577:
> > > > >>>>> [ 1756.728435]  #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > > > >>>>> [ 1756.728679]  #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
> > > > >>>>> [ 1756.728923]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
> > > > >>>>> [ 1756.729015]  [<c1003d7f>] show_trace+0x12/0x14
> > > > >>>>> [ 1756.729103]  [<c1003e79>] dump_stack+0x16/0x18
> > > > >>>>> [ 1756.729187]  [<c102c2e8>] run_workqueue+0x167/0x170
> > > > >>>>> [ 1756.729276]  [<c102c437>] worker_thread+0x146/0x165
> > > > >>>>> [ 1756.729368]  [<c102f797>] kthread+0x97/0xc4
> > > > >>>>> [ 1756.729456]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
> > > > >>>>> [ 1756.729547]  =======================
> > ...
> > > > This check is valid with keventd, but it looks like nfsd runs
> > > > kthread by itself. I'm not sure it's illegal to hold locks then,
> > > 
> > > nfsd creates laundry_wq by itself, yes, but cwq->thread runs with
> > > lockdep_depth() == 0. Unless we have a bug with lockdep_depth(),
> > > lockdep_depth() != 0 means that work->func() returns with a lock
> > > held (or it can flush its own workqueue under lock, but in that case
> > > we should have a different trace).
> > 
> > IMHO you can only say this thread is supposed to run with
> > lockdep_depth() == 0. lockdep_depth is counted within a process,
> > which starts before f(), so the only way to say f() leaked locks
> > is to check these locks before and after f().
> 
> Sorry, I can't understand you. lockdep_depth is counted within a process,
> which starts before f(), yes. This process is cwq->thread, it was forked
> during create_workqueue(). It does not take any locks directly, only by
> calling work->func(). laundry_wq doesn't differ from keventd_wq or any other
> wq in this sense. nfsd does not "runs kthread by itself", it inserts the
> work and wakes up cwq->thread.

I think we know how it all should start and go. If analyzing
the code alone were enough, the current check of lockdep_depth()
would be unnecessary too, because the laundromat_main code looks as
good as the run_workqueue code. I sent the patch for testing and I
meant it: something strange is going on here, so some checks should
be added. I didn't say mine is the right one or that it will
certainly help. So, if you have another idea for tracking this
down, let us know.

> 
> > > Personally I agree with Andrew:
> > > >
> > > > > OK.  That's not necessarily a bug: one could envisage a (weird) piece of
> > > > > code which takes a lock then releases it on a later workqueue invokation.
> > 
> > But this code is named here as laundromat_main and it doesn't
> > seem to work like this.
> 
> This means we have a problem with leak detection.

I totally agree!

> 
> > > > +		ld = lockdep_depth(current);
> > > > +
> > > >  		f(work);
> > > >  
> > > > -		if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
> > > > +		if (unlikely(in_atomic() || (ld -= lockdep_depth(current)))) {
> > > 
> > > and with this change we will also have a BUG report on "then releases it on a
> > > later workqueue invokation".
> > 
> > Then we could say at least this code is weird (buggy - in my opinion).
> > This patch doesn't change the way the "standard" code is treated,
> > so I cannot see any possibility to get it worse then now.
> 
> I didn't mean this patch makes things worse (except it conflicts with other
> patches in -mm tree). In fact, it may improve the diagnostics. My point is
> that this patch afaics has nothing to do with the discussed problem.

I thought you were trying to diagnose a bug, and maybe Folkert could
test whether this patch makes any difference. Sorry if I missed the
real problem.

Cheers,
Jarek P.


* Re: [PATCH] Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-21 15:16                 ` Jarek Poplawski
@ 2007-03-21 15:17                   ` Folkert van Heusden
  2007-03-21 15:29                   ` Oleg Nesterov
  1 sibling, 0 replies; 35+ messages in thread
From: Folkert van Heusden @ 2007-03-21 15:17 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Oleg Nesterov, Neil Brown, Andrew Morton, Peter Zijlstra,
	linux-kernel, J. Bruce Fields, Ingo Molnar

> I thought you are trying to diagnose a bug and maybe Folkert could test,
> if this patch makes any difference. Sorry, if I missed the real problem.

Yes please let me know when I can test a patch.


Folkert van Heusden

-- 
www.biglumber.com <- site where one can exchange PGP key signatures 
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com


* Re: [PATCH] Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-21 15:16                 ` Jarek Poplawski
  2007-03-21 15:17                   ` Folkert van Heusden
@ 2007-03-21 15:29                   ` Oleg Nesterov
  2007-03-21 18:16                     ` Oleg Nesterov
  1 sibling, 1 reply; 35+ messages in thread
From: Oleg Nesterov @ 2007-03-21 15:29 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Neil Brown, Andrew Morton, Peter Zijlstra, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar

On 03/21, Jarek Poplawski wrote:
>
> > Sorry, I can't understand you. lockdep_depth is counted within a process,
> > which starts before f(), yes. This process is cwq->thread, it was forked
> > during create_workqueue(). It does not take any locks directly, only by
> > calling work->func(). laundry_wq doesn't differ from keventd_wq or any other
> > wq in this sense. nfsd does not "runs kthread by itself", it inserts the
> > work and wakes up cwq->thread.
> 
> I think we know how it all should start and go. If only analyzing
> the code could be enough, this current check of lockdep_depth()
> is unnecessary too, because laundromat_code code looks as good
> as run_workqueue code. I send it for testing and I mean it -
> something strange is going here, so some checks should be added
> - I didn't say mine is the right and will certainly help.

Ah, sorry, I misunderstood you. I had (my fault) a false impression
that you propose this patch as a "fix", while we seem to agree that
the real source of this problem remains unknown.

> So, if
> you have another idea for looking after it, let us know.

No, I don't.

Oleg.



* Re: [PATCH] Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-21 15:29                   ` Oleg Nesterov
@ 2007-03-21 18:16                     ` Oleg Nesterov
  2007-03-22  6:11                       ` [PATCH] lockdep: lockdep_depth vs. debug_locks " Jarek Poplawski
  0 siblings, 1 reply; 35+ messages in thread
From: Oleg Nesterov @ 2007-03-21 18:16 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Neil Brown, Andrew Morton, Peter Zijlstra, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar

On 03/21, Oleg Nesterov wrote:
>
> On 03/21, Jarek Poplawski wrote:
> >
> > So, if
> > you have another idea for looking after it, let us know.
>
> No, I don't.

Perhaps we can add lockdep_depth() checks to laundromat_main/nfs4_laundromat
to rule out workqueue.c. According to the trace, we return with client_mutex
held, which seems impossible. So we could also print the value of
mutex_is_locked() if lockdep_depth() != 0.
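A sketch of the cross-check Oleg is proposing, reduced to userspace (the function name and return codes are invented for illustration): comparing the depth counter against the actual mutex state would distinguish a genuine leak in nfsd from stale lockdep bookkeeping.

```c
#include <assert.h>

/* Hypothetical classifier: 'depth' stands in for lockdep_depth(current)
 * at the end of laundromat_main(), 'mutex_locked' for
 * mutex_is_locked(&client_mutex).  Return codes are made up. */
int classify_leak(int depth, int mutex_locked)
{
    if (depth == 0)
        return 0;     /* clean exit */
    if (mutex_locked)
        return 1;     /* real leak: the mutex is still held */
    return 2;         /* depth is stale: lockdep bookkeeping problem */
}
```

For the trace in this thread, outcome 2 would be the interesting one, pointing at frozen lockdep state rather than at nfsd itself.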

Oleg.



* [PATCH] lockdep: lockdep_depth vs. debug_locks Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-21 18:16                     ` Oleg Nesterov
@ 2007-03-22  6:11                       ` Jarek Poplawski
  2007-03-22  6:28                         ` Andrew Morton
                                           ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Jarek Poplawski @ 2007-03-22  6:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Neil Brown, Andrew Morton, Peter Zijlstra, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar


Here is some joke:

[PATCH] lockdep: lockdep_depth vs. debug_locks

lockdep really shouldn't be used when debug_locks == 0!


Reported-by: Folkert van Heusden <folkert@vanheusden.com>
Inspired-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

---

diff -Nurp 2.6.21-rc4-git4-/include/linux/lockdep.h 2.6.21-rc4-git4/include/linux/lockdep.h
--- 2.6.21-rc4-git4-/include/linux/lockdep.h	2007-03-20 20:24:17.000000000 +0100
+++ 2.6.21-rc4-git4/include/linux/lockdep.h	2007-03-21 22:32:41.000000000 +0100
@@ -245,7 +245,7 @@ extern void lock_release(struct lockdep_
 
 # define INIT_LOCKDEP				.lockdep_recursion = 0,
 
-#define lockdep_depth(tsk)	((tsk)->lockdep_depth)
+#define lockdep_depth(tsk)	(debug_locks ? (tsk)->lockdep_depth : 0)
 
 #else /* !LOCKDEP */
 

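Why the one-liner above helps can be replayed in a few lines of userspace C. Assume (as the thread establishes) that lockdep stops updating its counters once debug_locks goes to 0; the model below is illustrative only, with invented names:

```c
#include <assert.h>

static int debug_locks = 1;     /* models the global lockdep kill switch */
static int raw_depth;           /* models tsk->lockdep_depth */

static void model_lock(void)   { if (debug_locks) raw_depth++; }
static void model_unlock(void) { if (debug_locks) raw_depth--; }

/* The patched accessor: a frozen counter reads as 0 once tracking
 * has stopped, instead of reporting a phantom held lock. */
int depth(void) { return debug_locks ? raw_depth : 0; }

/* Replays the failure mode: lockdep trips while one lock is held,
 * the real unlock still happens, but the bookkeeping is frozen. */
int stale_raw_depth(void)
{
    model_lock();
    debug_locks = 0;    /* lockdep hit an error and turned itself off */
    model_unlock();     /* the real mutex_unlock() still runs */
    return raw_depth;   /* stays 1: stale */
}
```

With the guard, run_workqueue()'s leak check reads 0 instead of the frozen value, so it no longer fires on every subsequent work item.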

* Re: [PATCH] lockdep: lockdep_depth vs. debug_locks Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-22  6:11                       ` [PATCH] lockdep: lockdep_depth vs. debug_locks " Jarek Poplawski
@ 2007-03-22  6:28                         ` Andrew Morton
  2007-03-22  7:06                           ` Jarek Poplawski
                                             ` (2 more replies)
  2007-03-22  6:57                         ` [PATCH] lockdep: debug_show_all_locks & debug_show_held_locks vs. debug_locks Jarek Poplawski
  2007-03-22  7:22                         ` [PATCH] lockdep: lockdep_depth vs. debug_locks Re: [2.6.20] BUG: workqueue leaked lock Peter Zijlstra
  2 siblings, 3 replies; 35+ messages in thread
From: Andrew Morton @ 2007-03-22  6:28 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Oleg Nesterov, Neil Brown, Peter Zijlstra, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar

On Thu, 22 Mar 2007 07:11:19 +0100 Jarek Poplawski <jarkao2@o2.pl> wrote:

> 
> Here is some joke:
> 
> [PATCH] lockdep: lockdep_depth vs. debug_locks
> 
> lockdep really shouldn't be used when debug_locks == 0!
> 

This isn't a very good changelog.

> 
> Reported-by: Folkert van Heusden <folkert@vanheusden.com>
> Inspired-by: Oleg Nesterov <oleg@tv-sign.ru>
> Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
> 
> ---
> 
> diff -Nurp 2.6.21-rc4-git4-/include/linux/lockdep.h 2.6.21-rc4-git4/include/linux/lockdep.h
> --- 2.6.21-rc4-git4-/include/linux/lockdep.h	2007-03-20 20:24:17.000000000 +0100
> +++ 2.6.21-rc4-git4/include/linux/lockdep.h	2007-03-21 22:32:41.000000000 +0100
> @@ -245,7 +245,7 @@ extern void lock_release(struct lockdep_
>  
>  # define INIT_LOCKDEP				.lockdep_recursion = 0,
>  
> -#define lockdep_depth(tsk)	((tsk)->lockdep_depth)
> +#define lockdep_depth(tsk)	(debug_locks ? (tsk)->lockdep_depth : 0)
>  
>  #else /* !LOCKDEP */
>  

What problem does this solve, and how does it solve it?

I assume that some codepath is incrementing ->lockdep_depth even when
debug_locks==0.  Isn't that wrong of it?


* [PATCH] lockdep: debug_show_all_locks &  debug_show_held_locks vs. debug_locks
  2007-03-22  6:11                       ` [PATCH] lockdep: lockdep_depth vs. debug_locks " Jarek Poplawski
  2007-03-22  6:28                         ` Andrew Morton
@ 2007-03-22  6:57                         ` Jarek Poplawski
  2007-03-22  7:23                           ` Peter Zijlstra
  2007-03-22  7:22                         ` [PATCH] lockdep: lockdep_depth vs. debug_locks Re: [2.6.20] BUG: workqueue leaked lock Peter Zijlstra
  2 siblings, 1 reply; 35+ messages in thread
From: Jarek Poplawski @ 2007-03-22  6:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Neil Brown, Andrew Morton, Peter Zijlstra, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar


And here is an addition.

[PATCH] lockdep: debug_show_all_locks &  debug_show_held_locks vs. debug_locks
 
lockdep's data shouldn't be used when debug_locks == 0,
because it's no longer updated after that point, so it's more
misleading than helpful.

PS: probably lockdep's current-> fields should be reset after
it turns debug_locks off: i.e. after printing a bug report, but
before returning from the exported functions. But there are really
a lot of these places (e.g. after DEBUG_LOCKS_WARN_ON), so
something could be missed. (Of course, direct use of these
fields isn't recommended either.)

Reported-by: Folkert van Heusden <folkert@vanheusden.com>
Inspired-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

---

diff -Nurp 2.6.21-rc4-git4-/kernel/lockdep.c 2.6.21-rc4-git4/kernel/lockdep.c
--- 2.6.21-rc4-git4-/kernel/lockdep.c	2007-03-21 22:46:26.000000000 +0100
+++ 2.6.21-rc4-git4/kernel/lockdep.c	2007-03-21 23:05:17.000000000 +0100
@@ -2742,6 +2742,10 @@ void debug_show_all_locks(void)
 	int count = 10;
 	int unlock = 1;
 
+	if (unlikely(!debug_locks)) {
+		printk("INFO: lockdep is turned off.\n");
+		return;
+	}
 	printk("\nShowing all locks held in the system:\n");
 
 	/*
@@ -2785,6 +2789,10 @@ EXPORT_SYMBOL_GPL(debug_show_all_locks);
 
 void debug_show_held_locks(struct task_struct *task)
 {
+	if (unlikely(!debug_locks)) {
+		printk("INFO: lockdep is turned off.\n");
+		return;
+	}
 	lockdep_print_held_locks(task);
 }
 


* Re: [PATCH] lockdep: lockdep_depth vs. debug_locks Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-22  6:28                         ` Andrew Morton
@ 2007-03-22  7:06                           ` Jarek Poplawski
  2007-03-22  7:23                             ` Jarek Poplawski
  2007-03-22  7:13                           ` Jarek Poplawski
  2007-03-22  8:26                           ` Jarek Poplawski
  2 siblings, 1 reply; 35+ messages in thread
From: Jarek Poplawski @ 2007-03-22  7:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Neil Brown, Peter Zijlstra, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar

On Wed, Mar 21, 2007 at 10:28:02PM -0800, Andrew Morton wrote:
> On Thu, 22 Mar 2007 07:11:19 +0100 Jarek Poplawski <jarkao2@o2.pl> wrote:
> 
> > 
> > Here is some joke:
> > 
> > [PATCH] lockdep: lockdep_depth vs. debug_locks
> > 
> > lockdep really shouldn't be used when debug_locks == 0!
> > 
> 
> This isn't a very good changelog.
> 
> > 
> > Reported-by: Folkert van Heusden <folkert@vanheusden.com>
> > Inspired-by: Oleg Nesterov <oleg@tv-sign.ru>
> > Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
> > 
> > ---
> > 
> > diff -Nurp 2.6.21-rc4-git4-/include/linux/lockdep.h 2.6.21-rc4-git4/include/linux/lockdep.h
> > --- 2.6.21-rc4-git4-/include/linux/lockdep.h	2007-03-20 20:24:17.000000000 +0100
> > +++ 2.6.21-rc4-git4/include/linux/lockdep.h	2007-03-21 22:32:41.000000000 +0100
> > @@ -245,7 +245,7 @@ extern void lock_release(struct lockdep_
> >  
> >  # define INIT_LOCKDEP				.lockdep_recursion = 0,
> >  
> > -#define lockdep_depth(tsk)	((tsk)->lockdep_depth)
> > +#define lockdep_depth(tsk)	(debug_locks ? (tsk)->lockdep_depth : 0)
> >  
> >  #else /* !LOCKDEP */
> >  
> 
> What problem does this solve, and how does it solve it?
> 
> I assume that some codepath is incrementing ->lockdep_depth even when
> debug_locks==0.  Isn't that wrong of it?
> 

This should definitely solve the problem. As was said a few times
before, lockdep stops registering locks after a bug, so even the
lock which caused the warning isn't reported. Here lockdep found
a bug in a workqueue function, and after that point no lock or
unlock is counted or reported.

I think Ingo can tell more.

Regards,
Jarek P.


* Re: [PATCH] lockdep: lockdep_depth vs. debug_locks Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-22  6:28                         ` Andrew Morton
  2007-03-22  7:06                           ` Jarek Poplawski
@ 2007-03-22  7:13                           ` Jarek Poplawski
  2007-03-22  8:26                           ` Jarek Poplawski
  2 siblings, 0 replies; 35+ messages in thread
From: Jarek Poplawski @ 2007-03-22  7:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Neil Brown, Peter Zijlstra, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar

On Wed, Mar 21, 2007 at 10:28:02PM -0800, Andrew Morton wrote:
...
> I assume that some codepath is incrementing ->lockdep_depth even when
> debug_locks==0.  Isn't that wrong of it?
> 

lockdep simply stops updating lockdep_depth just after (or during)
a bug or a WARN.

Jarek P.


* Re: [PATCH] lockdep: lockdep_depth vs. debug_locks Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-22  6:11                       ` [PATCH] lockdep: lockdep_depth vs. debug_locks " Jarek Poplawski
  2007-03-22  6:28                         ` Andrew Morton
  2007-03-22  6:57                         ` [PATCH] lockdep: debug_show_all_locks & debug_show_held_locks vs. debug_locks Jarek Poplawski
@ 2007-03-22  7:22                         ` Peter Zijlstra
  2007-03-22  9:06                           ` Ingo Molnar
  2 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2007-03-22  7:22 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Oleg Nesterov, Neil Brown, Andrew Morton, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar

On Thu, 2007-03-22 at 07:11 +0100, Jarek Poplawski wrote:
> Here is some joke:
> 
> [PATCH] lockdep: lockdep_depth vs. debug_locks
> 
> lockdep really shouldn't be used when debug_locks == 0!

This happens when lockdep reports a fatal error, at which point
it will stop tracking locks and leave whatever state was there.

> Reported-by: Folkert van Heusden <folkert@vanheusden.com>
> Inspired-by: Oleg Nesterov <oleg@tv-sign.ru>
> Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

This looks sane, thanks for figuring this out.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

> ---
> 
> diff -Nurp 2.6.21-rc4-git4-/include/linux/lockdep.h 2.6.21-rc4-git4/include/linux/lockdep.h
> --- 2.6.21-rc4-git4-/include/linux/lockdep.h	2007-03-20 20:24:17.000000000 +0100
> +++ 2.6.21-rc4-git4/include/linux/lockdep.h	2007-03-21 22:32:41.000000000 +0100
> @@ -245,7 +245,7 @@ extern void lock_release(struct lockdep_
>  
>  # define INIT_LOCKDEP				.lockdep_recursion = 0,
>  
> -#define lockdep_depth(tsk)	((tsk)->lockdep_depth)
> +#define lockdep_depth(tsk)	(debug_locks ? (tsk)->lockdep_depth : 0)
>  
>  #else /* !LOCKDEP */
>  



* Re: [PATCH] lockdep: lockdep_depth vs. debug_locks Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-22  7:06                           ` Jarek Poplawski
@ 2007-03-22  7:23                             ` Jarek Poplawski
  0 siblings, 0 replies; 35+ messages in thread
From: Jarek Poplawski @ 2007-03-22  7:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Neil Brown, Peter Zijlstra, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar

On Thu, Mar 22, 2007 at 08:06:44AM +0100, Jarek Poplawski wrote:
...
> This should definitely solve the problem. As was said
> a few times before, lockdep stops registering locks after
> a bug, so even the lock which caused the warning isn't
> reported. Here lockdep found a bug in a workqueue function,

More precisely:

lockdep found a bug during a run of a workqueue function; this
could also have been caused by a bug in other code running
simultaneously.

Jarek P.


* Re: [PATCH] lockdep: debug_show_all_locks &  debug_show_held_locks vs. debug_locks
  2007-03-22  6:57                         ` [PATCH] lockdep: debug_show_all_locks & debug_show_held_locks vs. debug_locks Jarek Poplawski
@ 2007-03-22  7:23                           ` Peter Zijlstra
  2007-03-22  9:06                             ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2007-03-22  7:23 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Oleg Nesterov, Neil Brown, Andrew Morton, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar

On Thu, 2007-03-22 at 07:57 +0100, Jarek Poplawski wrote:
> And here is some addition.
> 
> [PATCH] lockdep: debug_show_all_locks &  debug_show_held_locks vs. debug_locks
>  
> lockdep's data shouldn't be used when debug_locks == 0
> because it's not updated after this, so it's more misleading
> than helpful.
> 
> PS: probably lockdep's current-> fields should be reset after
> it turns debug_locks off: so, after printing a bug report, but
> before return from exported functions, but there are really
> a lot of these possibilities (e.g. after DEBUG_LOCKS_WARN_ON),
> so, something could be missed. (Of course direct use of these
> fields isn't recommended either.)
> 
> Reported-by: Folkert van Heusden <folkert@vanheusden.com>
> Inspired-by: Oleg Nesterov <oleg@tv-sign.ru>
> Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

> ---
> 
> diff -Nurp 2.6.21-rc4-git4-/kernel/lockdep.c 2.6.21-rc4-git4/kernel/lockdep.c
> --- 2.6.21-rc4-git4-/kernel/lockdep.c	2007-03-21 22:46:26.000000000 +0100
> +++ 2.6.21-rc4-git4/kernel/lockdep.c	2007-03-21 23:05:17.000000000 +0100
> @@ -2742,6 +2742,10 @@ void debug_show_all_locks(void)
>  	int count = 10;
>  	int unlock = 1;
>  
> +	if (unlikely(!debug_locks)) {
> +		printk("INFO: lockdep is turned off.\n");
> +		return;
> +	}
>  	printk("\nShowing all locks held in the system:\n");
>  
>  	/*
> @@ -2785,6 +2789,10 @@ EXPORT_SYMBOL_GPL(debug_show_all_locks);
>  
>  void debug_show_held_locks(struct task_struct *task)
>  {
> +	if (unlikely(!debug_locks)) {
> +		printk("INFO: lockdep is turned off.\n");
> +		return;
> +	}
>  	lockdep_print_held_locks(task);
>  }
>  



* Re: [PATCH] lockdep: lockdep_depth vs. debug_locks Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-22  6:28                         ` Andrew Morton
  2007-03-22  7:06                           ` Jarek Poplawski
  2007-03-22  7:13                           ` Jarek Poplawski
@ 2007-03-22  8:26                           ` Jarek Poplawski
  2 siblings, 0 replies; 35+ messages in thread
From: Jarek Poplawski @ 2007-03-22  8:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Neil Brown, Peter Zijlstra, Folkert van Heusden,
	linux-kernel, J. Bruce Fields, Ingo Molnar

On Wed, Mar 21, 2007 at 10:28:02PM -0800, Andrew Morton wrote:
> On Thu, 22 Mar 2007 07:11:19 +0100 Jarek Poplawski <jarkao2@o2.pl> wrote:
> 
> > 
> > Here is some joke:
> > 
> > [PATCH] lockdep: lockdep_depth vs. debug_locks
> > 
> > lockdep really shouldn't be used when debug_locks == 0!
> > 
> 
> This isn't a very good changelog.

Actually it's not a changelog (it's a joke).

Jarek P.


* Re: [PATCH] lockdep: lockdep_depth vs. debug_locks Re: [2.6.20] BUG: workqueue leaked lock
  2007-03-22  7:22                         ` [PATCH] lockdep: lockdep_depth vs. debug_locks Re: [2.6.20] BUG: workqueue leaked lock Peter Zijlstra
@ 2007-03-22  9:06                           ` Ingo Molnar
  0 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2007-03-22  9:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jarek Poplawski, Oleg Nesterov, Neil Brown, Andrew Morton,
	Folkert van Heusden, linux-kernel, J. Bruce Fields


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2007-03-22 at 07:11 +0100, Jarek Poplawski wrote:
> > Here is some joke:
> > 
> > [PATCH] lockdep: lockdep_depth vs. debug_locks
> > 
> > lockdep really shouldn't be used when debug_locks == 0!
> 
> This happens when lockdep reports a fatal error, at which point
> it will stop tracking locks and leave whatever state was there.
> 
> > Reported-by: Folkert van Heusden <folkert@vanheusden.com>
> > Inspired-by: Oleg Nesterov <oleg@tv-sign.ru>
> > Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
> 
> This looks sane, thanks for figuring this out.
> 
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo


* Re: [PATCH] lockdep: debug_show_all_locks &  debug_show_held_locks vs. debug_locks
  2007-03-22  7:23                           ` Peter Zijlstra
@ 2007-03-22  9:06                             ` Ingo Molnar
  0 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2007-03-22  9:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jarek Poplawski, Oleg Nesterov, Neil Brown, Andrew Morton,
	Folkert van Heusden, linux-kernel, J. Bruce Fields


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2007-03-22 at 07:57 +0100, Jarek Poplawski wrote:
> > And here is some addition.
> > 
> > [PATCH] lockdep: debug_show_all_locks &  debug_show_held_locks vs. debug_locks
> >  
> > lockdep's data shouldn't be used when debug_locks == 0
> > because it's not updated after this, so it's more misleading
> > than helpful.
> > 
> > PS: probably lockdep's current-> fields should be reset after
> > it turns debug_locks off: so, after printing a bug report, but
> > before return from exported functions, but there are really
> > a lot of these possibilities (e.g. after DEBUG_LOCKS_WARN_ON),
> > so, something could be missed. (Of course direct use of these
> > fields isn't recommended either.)
> > 
> > Reported-by: Folkert van Heusden <folkert@vanheusden.com>
> > Inspired-by: Oleg Nesterov <oleg@tv-sign.ru>
> > Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
> 
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo


end of thread, other threads:[~2007-03-22  9:07 UTC | newest]

Thread overview: 35+ messages
-- links below jump to the message on this page --
2007-03-13 16:50 [2.6.20] BUG: workqueue leaked lock Folkert van Heusden
2007-03-15 19:06 ` Andrew Morton
2007-03-15 19:17   ` Folkert van Heusden
2007-03-16 14:49     ` Jarek Poplawski
2007-03-20 11:17     ` dquot.c: possible circular locking " Jarek Poplawski
2007-03-20 11:22       ` Jarek Poplawski
2007-03-20 11:31         ` Jarek Poplawski
2007-03-20 12:19           ` Jan Kara
2007-03-20 13:35             ` Arjan van de Ven
2007-03-20 14:21               ` Jan Kara
2007-03-20 14:18                 ` Arjan van de Ven
2007-03-20 13:44             ` Jarek Poplawski
2007-03-20 14:00               ` Jan Kara
2007-03-16  8:41   ` Peter Zijlstra
2007-03-16 11:39     ` Andrew Morton
2007-03-19  6:24       ` Neil Brown
2007-03-20  9:37         ` [PATCH] " Jarek Poplawski
2007-03-20 16:07           ` Oleg Nesterov
2007-03-21  8:05             ` Jarek Poplawski
2007-03-21 14:46               ` Oleg Nesterov
2007-03-21 15:16                 ` Jarek Poplawski
2007-03-21 15:17                   ` Folkert van Heusden
2007-03-21 15:29                   ` Oleg Nesterov
2007-03-21 18:16                     ` Oleg Nesterov
2007-03-22  6:11                       ` [PATCH] lockdep: lockdep_depth vs. debug_locks " Jarek Poplawski
2007-03-22  6:28                         ` Andrew Morton
2007-03-22  7:06                           ` Jarek Poplawski
2007-03-22  7:23                             ` Jarek Poplawski
2007-03-22  7:13                           ` Jarek Poplawski
2007-03-22  8:26                           ` Jarek Poplawski
2007-03-22  6:57                         ` [PATCH] lockdep: debug_show_all_locks & debug_show_held_locks vs. debug_locks Jarek Poplawski
2007-03-22  7:23                           ` Peter Zijlstra
2007-03-22  9:06                             ` Ingo Molnar
2007-03-22  7:22                         ` [PATCH] lockdep: lockdep_depth vs. debug_locks Re: [2.6.20] BUG: workqueue leaked lock Peter Zijlstra
2007-03-22  9:06                           ` Ingo Molnar
