* Xfs lockdep warning with for-dave-for-4.6 branch
@ 2016-05-12  5:53 Qu Wenruo
  2016-05-12  5:57 ` Darrick J. Wong
  0 siblings, 1 reply; 37+ messages in thread

From: Qu Wenruo @ 2016-05-12  5:53 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: xfs

Hi Darrick,

First of all, thanks for your reflink work for XFS; it's quite a good
and useful feature, and it also helps with debugging btrfs problems.
(Without it, there is no good reference for reflink behavior.)

But when testing your for-dave-for-4.6 branch, even though I'm only
testing btrfs with xfstests, the kernel reports some strange lockdep
warnings from xfs:

------
run fstests generic/175 at 2016-05-12 13:22:06
BTRFS: device fsid 3d5c9c3b-2d08-4f0b-9663-00a88cd218da devid 1 transid 3 /dev/sdb6
BTRFS info (device sdb6): disk space caching is enabled
BTRFS: has skinny extents
BTRFS: flagging fs with big metadata feature
BTRFS: creating UUID tree
BTRFS: device fsid bb75eb48-4c5f-4b75-a41d-f642d70c7294 devid 1 transid 3 /dev/sdb6
BTRFS info (device sdb6): disk space caching is enabled
BTRFS: has skinny extents
BTRFS: flagging fs with big metadata feature
BTRFS: creating UUID tree

=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
{RECLAIM_FS-ON-R} state was registered at:
  [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
  [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
  [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
  [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
  [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
  [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
  [<ffffffff81244459>] SyS_ioctl+0x79/0x90
  [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
irq event stamp: 510775
hardirqs last enabled at (510775): [<ffffffff812245d0>] __slab_alloc+0x50/0x70
hardirqs last disabled at (510774): [<ffffffff812245ae>] __slab_alloc+0x2e/0x70
softirqs last enabled at (510506): [<ffffffff810c8ea8>] __do_softirq+0x358/0x430
softirqs last disabled at (510489): [<ffffffff810c911d>] irq_exit+0xad/0xb0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&xfs_nondir_ilock_class);
  <Interrupt>
    lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:
 #0: (shrinker_rwsem){++++..}, at: [<ffffffff811e0b78>] shrink_slab.part.63.constprop.79+0x48/0x4e0
 #1: (&type->s_umount_key#26){++++++}, at: [<ffffffff81232ffb>] trylock_super+0x1b/0x50
 #2: (sb_internal){.+.+.?}, at: [<ffffffff812327f4>] __sb_start_write+0xb4/0xf0

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
 ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
 ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
Call Trace:
 [<ffffffff813a14f9>] dump_stack+0x4b/0x72
 [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
 [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
 [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
 [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
 [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [<ffffffff8124d7d5>] evict+0xc5/0x190
 [<ffffffff8124d8d9>] dispose_list+0x39/0x60
 [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
 [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
 [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
 [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
 [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
 [<ffffffff810e6882>] kthread+0xf2/0x110
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
 [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
hrtimer: interrupt took 4824925 ns
BTRFS info (device sdb6): disk space caching is enabled
BTRFS: has skinny extents
------

The test machine uses a normal xfs filesystem (without -m reflink=1) as
its root. As you can see, the test is running generic/175 on *BTRFS*,
not *XFS*, yet the lockdep warning still comes from xfs.

Hope the output helps you.

Thanks,
Qu

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-12  5:53 Xfs lockdep warning with for-dave-for-4.6 branch Qu Wenruo
@ 2016-05-12  5:57 ` Darrick J. Wong
  2016-05-12  8:03   ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread

From: Darrick J. Wong @ 2016-05-12  5:57 UTC (permalink / raw)
To: Qu Wenruo; +Cc: xfs

On Thu, May 12, 2016 at 01:53:44PM +0800, Qu Wenruo wrote:
> Hi Darrick,
>
> First of all, thanks for your reflink work for XFS; it's quite a good
> and useful feature, and it also helps with debugging btrfs problems.
> (Without it, there is no good reference for reflink behavior.)
>
> But when testing your for-dave-for-4.6 branch, even though I'm only
> testing btrfs with xfstests, the kernel reports some strange lockdep
> warnings from xfs:
>
> ------
> run fstests generic/175 at 2016-05-12 13:22:06
> [...]
> =================================
> [ INFO: inconsistent lock state ]
> 4.5.0-rc2+ #4 Tainted: G O
> ---------------------------------
> inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
> kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
>  (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
> {RECLAIM_FS-ON-R} state was registered at:
>   [...]
>   [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
>   [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
>   [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
>   [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
>   [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
> [...]
> ------
>
> The test machine uses a normal xfs filesystem (without -m reflink=1) as
> its root. As you can see, the test is running generic/175 on *BTRFS*,
> not *XFS*, yet the lockdep warning still comes from xfs.

That's ... odd. We shouldn't ever be in the refcountbt code if reflink
isn't enabled.

Welp, I'll have a look in the morning.

--D

> Hope the output helps you.
>
> Thanks,
> Qu
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-12  5:57 ` Darrick J. Wong
@ 2016-05-12  8:03   ` Dave Chinner
  2016-05-13 16:03     ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread

From: Dave Chinner @ 2016-05-12  8:03 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Qu Wenruo, Michal Hocko, xfs

[ cc Michal Hocko, just so he can see a lockdep reclaim state false
  positive ]

On Wed, May 11, 2016 at 10:57:56PM -0700, Darrick J. Wong wrote:
> On Thu, May 12, 2016 at 01:53:44PM +0800, Qu Wenruo wrote:
> > Hi Darrick,
> >
> > First of all, thanks for your reflink work for XFS; it's quite a good
> > and useful feature, and it also helps with debugging btrfs problems.
> > (Without it, there is no good reference for reflink behavior.)
> >
> > But when testing your for-dave-for-4.6 branch, even though I'm only
> > testing btrfs with xfstests, the kernel reports some strange lockdep
> > warnings from xfs:
.....
> > =================================
> > [ INFO: inconsistent lock state ]
> > 4.5.0-rc2+ #4 Tainted: G O
> > ---------------------------------
> > inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
> > kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
> >  (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
> > {RECLAIM_FS-ON-R} state was registered at:
> >   [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
> >   [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
> >   [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
> >   [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
> >   [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
> >   [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
> >   [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
> >   [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
> >   [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
> >   [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
> >   [<ffffffff81244459>] SyS_ioctl+0x79/0x90
> >   [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is that
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

> > irq event stamp: 510775
> > hardirqs last enabled at (510775): [<ffffffff812245d0>] __slab_alloc+0x50/0x70
> > hardirqs last disabled at (510774): [<ffffffff812245ae>] __slab_alloc+0x2e/0x70
> > softirqs last enabled at (510506): [<ffffffff810c8ea8>] __do_softirq+0x358/0x430
> > softirqs last disabled at (510489): [<ffffffff810c911d>] irq_exit+0xad/0xb0
> >
> > other info that might help us debug this:
> >  Possible unsafe locking scenario:
> >
> >        CPU0
> >        ----
> >   lock(&xfs_nondir_ilock_class);
> >   <Interrupt>
> >     lock(&xfs_nondir_ilock_class);

So, because the reclaim annotations overload the interrupt level
detections and it has seen the inode ilock being taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...

> >  *** DEADLOCK ***
> >
> > 3 locks held by kswapd0/543:
> >  #0: (shrinker_rwsem){++++..}, at: [<ffffffff811e0b78>] shrink_slab.part.63.constprop.79+0x48/0x4e0
> >  #1: (&type->s_umount_key#26){++++++}, at: [<ffffffff81232ffb>] trylock_super+0x1b/0x50
> >  #2: (sb_internal){.+.+.?}, at: [<ffffffff812327f4>] __sb_start_write+0xb4/0xf0
> >
> > stack backtrace:
> > CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4
> > Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> >  ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
> >  ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
> >  ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
> > Call Trace:
> >  [<ffffffff813a14f9>] dump_stack+0x4b/0x72
> >  [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
> >  [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
> >  [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
> >  [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
> >  [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
> >  [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
> >  [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
> >  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
> >  [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
> >  [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
> >  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
> >  [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
> >  [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
> >  [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
> >  [<ffffffff8124d7d5>] evict+0xc5/0x190
> >  [<ffffffff8124d8d9>] dispose_list+0x39/0x60
> >  [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
> >  [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
> >  [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
> >  [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
> >  [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
> >  [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
> >  [<ffffffff810e6882>] kthread+0xf2/0x110
> >  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
> >  [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
> >  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
> > hrtimer: interrupt took 4824925 ns

The reclaim -> lock context that it's complaining about here is on an
inode being reclaimed - it has no active references and so, by
definition, cannot deadlock with a context holding an active
reference to an inode ilock. Hence there cannot possibly be a
deadlock here, but we can't tell lockdep that easily in any way
without going back to the bad old ways of creating a new lockdep
class for inode ilocks the moment they enter ->evict. This then
disables "entire lifecycle" lockdep checking on the xfs inode ilock,
which is why we got rid of it in the first place.

Our "traditional" solution to this is to add KM_NOFS to the
allocation to force GFP_NOFS allocation in all contexts so that we
shut up lockdep and people don't keep endlessly reporting the same
non-problem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
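Dave's diagnosis (a lock class first recorded as held across a __GFP_FS allocation, then later taken from reclaim context) can be illustrated with a tiny userspace model. This is a deliberately simplified sketch for illustration only, not the kernel's lockdep implementation; the structure, function names, and flag layout below are all invented:

```c
#include <assert.h>
#include <stdbool.h>

#define GFP_KERNEL 0x1   /* __GFP_FS set: allocation may recurse into fs reclaim */
#define GFP_NOFS   0x0   /* __GFP_FS clear: reclaim will not re-enter the fs */

/* Per-lock-class usage state, loosely mirroring lockdep's usage bits. */
struct lock_class {
    bool held_over_fs_alloc;  /* "RECLAIM_FS-ON": held across a __GFP_FS alloc */
    bool taken_in_reclaim;    /* "IN-RECLAIM_FS": taken from reclaim context */
};

static bool in_reclaim;       /* models kswapd / direct-reclaim context */

/* Taking a lock: when in reclaim context, record that fact and complain
 * if the class was also held across a reclaim-capable allocation. */
static bool lock_acquire(struct lock_class *c)
{
    if (in_reclaim) {
        c->taken_in_reclaim = true;
        if (c->held_over_fs_alloc)
            return false;     /* the "inconsistent lock state" splat */
    }
    return true;
}

/* An allocation while the lock is held: only __GFP_FS allocations
 * register the "held over reclaim-capable allocation" state. */
static void alloc_while_holding(struct lock_class *c, int gfp)
{
    if (gfp & GFP_KERNEL)
        c->held_over_fs_alloc = true;
}
```

In this model, the fiemap path registers the state by allocating with GFP_KERNEL under the ilock, and kswapd taking the same class later trips the check; passing GFP_NOFS (the KM_NOFS workaround Dave describes) never registers the state, so the warning cannot fire, regardless of whether a real deadlock was possible.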
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-12  8:03 ` Dave Chinner
@ 2016-05-13 16:03   ` Michal Hocko
  2016-05-16 10:41     ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread

From: Michal Hocko @ 2016-05-13 16:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-mm, Qu Wenruo, Peter Zijlstra, xfs, Darrick J. Wong

On Thu 12-05-16 18:03:21, Dave Chinner wrote:
> [ cc Michal Hocko, just so he can see a lockdep reclaim state false
>   positive ]

Thank you for CCing me! I am sorry I didn't follow up on the previous
discussion, but I got side-tracked by something else.

I have tried to cook up something really simple. I didn't get to test
it at all and it might be completely broken, but I just wanted to throw
out an idea for discussion. I am CCing Peter as well; he might have a
better idea (the reference to the full email is in the changelog).

Is something like the following correct/acceptable? This is on top of
my scoped GFP_NOFS patch posted recently, but I can reorder them if
this looks OK.
---
From 9e980c2342f355e2bf1e12839c51e8e304e8842e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 13 May 2016 17:47:31 +0200
Subject: [PATCH] lockdep: allow to disable reclaim lockup detection

The current implementation of the reclaim lockup detection can lead to
false positives; these do happen, and they usually lead to tweaking the
code to silence lockdep by using GFP_NOFS even though the context can
use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
{RECLAIM_FS-ON-R} state was registered at:
  [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
  [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
  [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
  [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
  [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
  [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
  [<ffffffff81244459>] SyS_ioctl+0x79/0x90
  [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

       CPU0
       ----
  lock(&xfs_nondir_ilock_class);
  <Interrupt>
    lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
 ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
 ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
Call Trace:
 [<ffffffff813a14f9>] dump_stack+0x4b/0x72
 [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
 [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
 [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
 [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
 [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [<ffffffff8124d7d5>] evict+0xc5/0x190
 [<ffffffff8124d8d9>] dispose_list+0x39/0x60
 [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
 [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
 [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
 [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
 [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
 [<ffffffff810e6882>] kthread+0xf2/0x110
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
 [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is that
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it has seen the inode ilock being taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lockup detection.
It is really impossible to annotate such a special use case IMHO unless
the reclaim lockup detection is reworked completely. Until then it is
much better to provide an "I know what I am doing" flag and mark the
problematic places with it. This would prevent abuse of the GFP_NOFS
flag, which has a runtime effect even on configurations that have
lockdep disabled.

Introduce lockdep_trace_alloc_{disable,enable}, which tell
__lockdep_trace_alloc to skip an allocation request even when it has
__GFP_FS set. This means that the false positive shouldn't be
generated.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/lockdep.h  | 10 ++++++++++
 include/linux/sched.h    |  1 +
 kernel/locking/lockdep.c |  4 ++++
 3 files changed, 15 insertions(+)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 4dca42fd32f5..4b04bf9ab560 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -354,6 +354,14 @@ static inline void lock_set_subclass(struct lockdep_map *lock,
 extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask);
 extern void lockdep_clear_current_reclaim_state(void);
 extern void lockdep_trace_alloc(gfp_t mask);
+static inline void lockdep_trace_alloc_disable(void)
+{
+	current->lockdep_reclaim_disabled = 1;
+}
+static inline void lockdep_trace_alloc_enable(void)
+{
+	current->lockdep_reclaim_disabled = 0;
+}
 extern void lock_pin_lock(struct lockdep_map *lock);
 extern void lock_unpin_lock(struct lockdep_map *lock);
@@ -392,6 +400,8 @@ static inline void lockdep_on(void)
 # define lockdep_set_current_reclaim_state(g)	do { } while (0)
 # define lockdep_clear_current_reclaim_state()	do { } while (0)
 # define lockdep_trace_alloc(g)			do { } while (0)
+# define lockdep_trace_alloc_disable()		do { } while (0)
+# define lockdep_trace_alloc_enable()		do { } while (0)
 # define lockdep_init()				do { } while (0)
 # define lockdep_info()				do { } while (0)
 # define lockdep_init_map(lock, name, key, sub) \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e9521dc0475f..ec643916bad3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1644,6 +1644,7 @@ struct task_struct {
 	unsigned int lockdep_recursion;
 	struct held_lock held_locks[MAX_LOCK_DEPTH];
 	gfp_t lockdep_reclaim_gfp;
+	bool lockdep_reclaim_disabled;
 #endif
 #ifdef CONFIG_UBSAN
 	unsigned int in_ubsan;
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index f60124d0871c..cf6ead3e7014 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2753,6 +2753,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
 		return;
 
+	/* The caller explicitly asked to disable reclaim recursion tracking */
+	if (curr->lockdep_reclaim_disabled)
+		return;
+
 	/*
 	 * Oi! Can't be having __GFP_FS allocations with IRQs disabled.
 	 */
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs
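The intended effect of the patch can be sketched with a userspace mock of the check it adds to __lockdep_trace_alloc(). The surrounding lockdep machinery is stubbed out here; only the patch's new early-return is modeled, and the warning counter standing in for a lockdep report is invented for this sketch:

```c
#include <assert.h>
#include <stdbool.h>

#define __GFP_FS 0x1

/* Minimal stand-in for the task_struct field the patch adds. */
struct task {
    bool lockdep_reclaim_disabled;
};

static struct task current_task;   /* models the kernel's `current` */
static int warnings;               /* counts would-be lockdep reports */

static void lockdep_trace_alloc_disable(void)
{
    current_task.lockdep_reclaim_disabled = true;
}

static void lockdep_trace_alloc_enable(void)
{
    current_task.lockdep_reclaim_disabled = false;
}

/* Mock of __lockdep_trace_alloc(): pretend every tracked __GFP_FS
 * allocation would otherwise register reclaim state / warn. */
static void lockdep_trace_alloc(int gfp_mask)
{
    if (!(gfp_mask & __GFP_FS))
        return;                    /* GFP_NOFS is never tracked */
    if (current_task.lockdep_reclaim_disabled)
        return;                    /* the patch's new early-return */
    warnings++;                    /* would register state / report */
}
```

A caller like the xfs_refcountbt_init_cursor() path would then wrap the known-safe allocation in a disable/enable pair instead of dropping __GFP_FS, so the GFP flags keep their real runtime meaning and only the lockdep bookkeeping is suppressed.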
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-13 16:03 ` Michal Hocko
@ 2016-05-16 10:41   ` Peter Zijlstra
  2016-05-16 13:05     ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread

From: Peter Zijlstra @ 2016-05-16 10:41 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, Qu Wenruo, xfs, Darrick J. Wong

On Fri, May 13, 2016 at 06:03:41PM +0200, Michal Hocko wrote:
> To quote Dave:
> "
> Ignoring whether reflink should be doing anything or not, that's a
> "xfs_refcountbt_init_cursor() gets called both outside and inside
> transactions" lockdep false positive case. The problem here is that
> lockdep has seen this allocation from within a transaction, hence a
> GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
> Also note that we have an active reference to this inode.

So the only thing that distinguishes the good from the bad case is that
reference; how should that then not do anything?

> So, because the reclaim annotations overload the interrupt level
> detections and it has seen the inode ilock being taken in reclaim
> ("interrupt") context, this triggers a reclaim context warning where
> it thinks it is unsafe to do this allocation in GFP_KERNEL context
> holding the inode ilock...
> "
>
> This sounds like a fundamental problem of the reclaim lockup detection.
> It is really impossible to annotate such a special use case IMHO unless
> the reclaim lockup detection is reworked completely.

How would you like to see it done? The interrupt model works well for
reclaim, because how is direct reclaim from a !GFP_NOWAIT allocation not
an 'interrupt'-like thing?

> Until then it is much better to provide an "I know what I am doing"
> flag and mark the problematic places with it. This would prevent abuse
> of the GFP_NOFS flag, which has a runtime effect even on
> configurations that have lockdep disabled.

So without more context; no. The mail you referenced mentions:

"The reclaim -> lock context that it's complaining about here is on
an inode being reclaimed - it has no active references and so, by
definition, cannot deadlock with a context holding an active
reference to an inode ilock. Hence there cannot possibly be a
deadlock here, but we can't tell lockdep that easily in any way
without going back to the bad old ways of creating a new lockdep
class for inode ilocks the moment they enter ->evict. This then
disables "entire lifecycle" lockdep checking on the xfs inode ilock,
which is why we got rid of it in the first place."

But it fails to explain the problems with the 'old' approach.

So clearly this is a 'problem' that has existed for quite a while, so I
don't see any need to rush half-baked solutions either.

Please explain things better.
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-16 10:41 ` Peter Zijlstra
@ 2016-05-16 13:05   ` Michal Hocko
  2016-05-16 13:25     ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread

From: Michal Hocko @ 2016-05-16 13:05 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-mm, Qu Wenruo, xfs, Darrick J. Wong

On Mon 16-05-16 12:41:30, Peter Zijlstra wrote:
> On Fri, May 13, 2016 at 06:03:41PM +0200, Michal Hocko wrote:
[...]
> > So, because the reclaim annotations overload the interrupt level
> > detections and it has seen the inode ilock being taken in reclaim
> > ("interrupt") context, this triggers a reclaim context warning where
> > it thinks it is unsafe to do this allocation in GFP_KERNEL context
> > holding the inode ilock...
> > "
> >
> > This sounds like a fundamental problem of the reclaim lockup detection.
> > It is really impossible to annotate such a special use case IMHO unless
> > the reclaim lockup detection is reworked completely.
>
> How would you like to see it done? The interrupt model works well for
> reclaim, because how is direct reclaim from a !GFP_NOWAIT allocation not
> an 'interrupt'-like thing?

Unfortunately I do not have any good ideas. It would basically require
allowing the lockdep context to be marked transaction-specific somehow,
and telling lockdep that there is no real dependency between the
!GFP_NOWAIT and 'interrupt' contexts. IIRC from Dave's emails, they
have tried that by using lockdep classes, and it turned out to be an
overly complex maze which still doesn't work 100% reliably.

> > Until then it is much better to provide an "I know what I am doing"
> > flag and mark the problematic places with it. This would prevent
> > abuse of the GFP_NOFS flag, which has a runtime effect even on
> > configurations that have lockdep disabled.
>
> So without more context; no. The mail you referenced mentions:
>
> "The reclaim -> lock context that it's complaining about here is on
> an inode being reclaimed - it has no active references and so, by
> definition, cannot deadlock with a context holding an active
> reference to an inode ilock. Hence there cannot possibly be a
> deadlock here, but we can't tell lockdep that easily in any way
> without going back to the bad old ways of creating a new lockdep
> class for inode ilocks the moment they enter ->evict. This then
> disables "entire lifecycle" lockdep checking on the xfs inode ilock,
> which is why we got rid of it in the first place."
>
> But it fails to explain the problems with the 'old' approach.
>
> So clearly this is a 'problem' that has existed for quite a while, so I
> don't see any need to rush half-baked solutions either.

Well, at least my motivation for _some_ solution here is that xfs has
worked around this deficiency by forcing GFP_NOFS even in contexts
which are perfectly fine doing a __GFP_FS allocation. And that in turn
leads to other issues which I would really like to sort out. So the
idea was to give xfs another way to express that workaround, one which
would be a no-op without lockdep configured.

> Please explain things better.
-- 
Michal Hocko
SUSE Labs
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-05-16 13:05 ` Michal Hocko @ 2016-05-16 13:25 ` Peter Zijlstra 2016-05-16 23:10 ` Dave Chinner 0 siblings, 1 reply; 37+ messages in thread From: Peter Zijlstra @ 2016-05-16 13:25 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-mm, Qu Wenruo, xfs, Darrick J. Wong On Mon, May 16, 2016 at 03:05:19PM +0200, Michal Hocko wrote: > On Mon 16-05-16 12:41:30, Peter Zijlstra wrote: > > On Fri, May 13, 2016 at 06:03:41PM +0200, Michal Hocko wrote: > > How would you like to see it done? The interrupt model works well for > > reclaim because how is direct reclaim from a !GFP_NOWAIT allocation not > > an 'interrupt' like thing? > > Unfortunately I do not have any good ideas. It would basically require > to allow marking the lockdep context transaction specific AFAIU somehow > and tell that there is no real dependency between !GFP_NOWAIT and > 'interrupt' context. But here is; direct reclaim is very much an 'interrupt' in the normal program flow. But the problem here appears to be that at some points we 'know' things cannot get reclaimed because stuff we didn't tell lockdep about (its got references), and sure then it don't work right. But that doesn't mean that the 'interrupt' model is wrong. > IIRC Dave's emails they have tried that by using lockdep classes and > that turned out to be an overly complex maze which still doesn't work > 100% reliably. So that would be the: > > but we can't tell lockdep that easily in any way > > without going back to the bad old ways of creating a new lockdep > > class for inode ilocks the moment they enter ->evict. This then > > disables "entire lifecycle" lockdep checking on the xfs inode ilock, > > which is why we got rid of it in the first place." > > > > But fails to explain the problems with the 'old' approach. > > > > So clearly this is a 'problem' that has existed for quite a while, so I > > don't see any need to rush half baked solutions either. 
> > Well, at least my motivation for _some_ solution here is that xfs has > worked around this deficiency by forcing GFP_NOFS also for contexts which > are perfectly OK to do __GFP_FS allocation. And that in turn leads to > other issues which I would really like to sort out. So the idea was to > give xfs another way to express that workaround that would be a noop > without lockdep configured. Right, that's unfortunate. But I would really like to understand the problem with the classes vs lifecycle thing. Is there an email explaining that somewhere?
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-05-16 13:25 ` Peter Zijlstra @ 2016-05-16 23:10 ` Dave Chinner 2016-05-17 14:49 ` Peter Zijlstra 0 siblings, 1 reply; 37+ messages in thread From: Dave Chinner @ 2016-05-16 23:10 UTC (permalink / raw) To: Peter Zijlstra; +Cc: linux-mm, Qu Wenruo, xfs, Darrick J. Wong, Michal Hocko On Mon, May 16, 2016 at 03:25:41PM +0200, Peter Zijlstra wrote: > On Mon, May 16, 2016 at 03:05:19PM +0200, Michal Hocko wrote: > > On Mon 16-05-16 12:41:30, Peter Zijlstra wrote: > > > On Fri, May 13, 2016 at 06:03:41PM +0200, Michal Hocko wrote: > > IIRC Dave's emails they have tried that by using lockdep classes and > > that turned out to be an overly complex maze which still doesn't work > > 100% reliably. > > So that would be the: > > > > but we can't tell lockdep that easily in any way > > > without going back to the bad old ways of creating a new lockdep > > > class for inode ilocks the moment they enter ->evict. This then > > > disables "entire lifecycle" lockdep checking on the xfs inode ilock, > > > which is why we got rid of it in the first place." > > > > > > But fails to explain the problems with the 'old' approach. > > > > > > So clearly this is a 'problem' that has existed for quite a while, so I > > > don't see any need to rush half baked solutions either. > > > > Well, at least my motivation for _some_ solution here is that xfs has > > worked around this deficiency by forcing GFP_NOFS also for contexts which > > are perfectly OK to do __GFP_FS allocation. And that in turn leads to > > other issues which I would really like to sort out. So the idea was to > > give xfs another way to express that workaround that would be a noop > > without lockdep configured. > > Right, that's unfortunate. But I would really like to understand the > problem with the classes vs lifecycle thing. > > Is there an email explaining that somewhere? Years ago (i.e. 
last time I bothered mentioning that lockdep didn't cover these cases) but buggered if I can find a reference. We used to have iolock classes. Added in commit 033da48 ("xfs: reset the i_iolock lock class in the reclaim path") in 2010, removed by commit 4f59af7 ("xfs: remove iolock lock classes") a couple of years later. We needed distinct lock classes above and below reclaim because the same locks were taken above and below memory allocation, and the reclaimed inode recycling made lockdep think it had a loop in its graph:

inode lookup            memory reclaim          inode lookup
------------            --------------          ------------
not found
allocate inode
init inode
take active reference
return to caller
.....
                        inode shrinker
                        find inode with no
                          active reference
                        ->evict
                        start transaction
                        take inode locks
                        .....
                        commit transaction
                        mark for reclaim
                        .....
                                                find inactive inode in
                                                  reclaimable state
                                                remove from reclaim list
                                                re-initialise inode state
                                                take active reference
                                                return to caller

So you can see that an inode can go through the reclaim context without being freed but having taken locks, but then immediately reused in a non-reclaim context. Hence we had to split the two lockdep contexts and re-initialise the lock contexts in the different allocation paths. The reason we don't have lock classes for the ilock is that we aren't supposed to call memory reclaim with that lock held in exclusive mode. This is because reclaim can run transactions, and that may need to flush dirty inodes to make progress. Flushing a dirty inode requires taking the ilock in shared mode. In the code path that was reported, we hold the ilock in /shared/ mode with no transaction context (we are doing a read-only operation). This means we can run transactions in memory reclaim because a) we can't deadlock on the inode we hold locks on, and b) transaction reservations will be able to make progress as we don't hold any locks it can block on.
Sure, we can have multiple lock classes so that this can be done, but that then breaks lock order checking between the two contexts. Both contexts require locking order and transactional behaviour to be identical and, from this perspective, class separation is effectively no different from turning lockdep off. For the ilock, the number of places where the ilock is held over GFP_KERNEL allocations is pretty small. Hence we've simply added GFP_NOFS to those allocations to - effectively - annotate those allocations as "lockdep causes problems here". There are probably 30-35 allocations in XFS that explicitly use KM_NOFS - some of these are masking lockdep false positive reports. The reason why we had lock classes for the *iolock* (not the ilock) was because ->evict processing used to require the iolock for attribute removal. This caused all sorts of problems with GFP_KERNEL allocations in the IO path (e.g. allocate a page cache page under the IO lock, direct reclaim would trigger lockdep warnings - there's another reason why XFS always used GFP_NOFS contexts for page cache allocations) which were all false positives because the inode holding the lock has active references and hence will never be accessed by the reclaim path. The only workable solution we could come up with at the time this was reported was to split the lock classes. In the end, like pretty much all the complex lockdep false positives we've had to deal with in XFS, we've ended up changing the locking or allocation contexts because that's been far easier than trying to make annotations cover everything or convince other people that lockdep annotations are insufficient. Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-05-16 23:10 ` Dave Chinner @ 2016-05-17 14:49 ` Peter Zijlstra 2016-05-17 22:35 ` Dave Chinner 0 siblings, 1 reply; 37+ messages in thread From: Peter Zijlstra @ 2016-05-17 14:49 UTC (permalink / raw) To: Dave Chinner Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Michal Hocko, Ingo Molnar Thanks for writing all that down Dave! On Tue, May 17, 2016 at 09:10:56AM +1000, Dave Chinner wrote: > The reason we don't have lock clases for the ilock is that we aren't > supposed to call memory reclaim with that lock held in exclusive > mode. This is because reclaim can run transactions, and that may > need to flush dirty inodes to make progress. Flushing dirty inode > requires taking the ilock in shared mode. > > In the code path that was reported, we hold the ilock in /shared/ > mode with no transaction context (we are doing a read-only > operation). This means we can run transactions in memory reclaim > because a) we can't deadlock on the inode we hold locks on, and b) > transaction reservations will be able to make progress as we don't > hold any locks it can block on. Just to clarify; I read the above as that we cannot block on recursive shared locks, is this correct? Because we can in fact block on down_read()+down_read() just fine, so if you're assuming that, then something's busted. Otherwise, I'm not quite reading it right, which is, given the complexity of that stuff, entirely possible. The other possible reading is that we cannot deadlock on the inode we hold locks on because we hold a reference on it; and the reference avoids the inode from being reclaimed. But then the whole shared/exclusive thing doesn't seem to make sense. > For the ilock, the number of places where the ilock is held over > GFP_KERNEL allocations is pretty small. Hence we've simply added > GFP_NOFS to those allocations to - effectively - annotate those > allocations as "lockdep causes problems here". 
There are probably > 30-35 allocations in XFS that explicitly use KM_NOFS - some of these > are masking lockdep false positive reports. > In the end, like pretty much all the complex lockdep false positives > we've had to deal in XFS, we've ended up changing the locking or > allocation contexts because that's been far easier than trying to > make annotations cover everything or convince other people that > lockdep annotations are insufficient. Well, I don't mind creating lockdep annotations; but explanations of the exact details always go a long way towards helping me come up with something. While going over the code; I see there's complaining about MAX_SUBCLASSES being too small. Would it help if we doubled it? We cannot grow the thing without limits, but doubling it should be possible I think. In any case; would something like this work for you? It's entirely untested, but the idea is to mark an entire class to skip reclaim validation, instead of marking individual sites. We have to do the subclass loop because, as per the comment with XFS_ILOCK_*, we use all 8 subclasses.

---
 fs/xfs/xfs_super.c       | 13 +++++++++++++
 include/linux/lockdep.h  |  8 +++++++-
 kernel/locking/lockdep.c | 47 +++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 187e14b696c2..ea55f87edad8 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -985,6 +985,19 @@ xfs_fs_inode_init_once(
 			"xfsino", ip->i_ino);
 	mrlock_init(&ip->i_lock, MRLOCK_ALLOW_EQUAL_PRI|MRLOCK_BARRIER,
 		     "xfsino", ip->i_ino);
+
+#ifdef CONFIG_LOCKDEP
+	/*
+	 * Disable reclaim tests for the i_lock; reclaim is guarded
+	 * by a reference count... XXX write coherent comment.
+	 */
+	do {
+		int i;
+
+		for (i = 0; i < MAX_LOCKDEP_SUBCLASSES; i++)
+			lockdep_skip_reclaim(&ip->i_lock.mr_lock, i);
+	} while (0);
+#endif
 }
 
 STATIC void
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index eabe0138eb06..fbaa6c8bcff6 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -80,7 +80,8 @@ struct lock_class {
 	/*
 	 * IRQ/softirq usage tracking bits:
 	 */
-	unsigned long			usage_mask;
+	unsigned int			usage_mask;
+	unsigned int			skip_mask;
 	struct stack_trace		usage_traces[XXX_LOCK_USAGE_STATES];
 
 	/*
@@ -281,6 +282,8 @@ extern void lockdep_on(void);
 extern void lockdep_init_map(struct lockdep_map *lock, const char *name,
 			     struct lock_class_key *key, int subclass);
 
+extern void lock_skip_reclaim(struct lockdep_map *lock, int subclass);
+
 /*
  * To initialize a lockdep_map statically use this macro.
  * Note that _name must not be NULL.
@@ -304,6 +307,9 @@ extern void lockdep_init_map(struct lockdep_map *lock, const char *name,
 	lockdep_init_map(&(lock)->dep_map, #lock, \
 			 (lock)->dep_map.key, sub)
 
+#define lockdep_skip_reclaim(lock, sub) \
+	lock_skip_reclaim(&(lock)->dep_map, sub)
+
 #define lockdep_set_novalidate_class(lock) \
 	lockdep_set_class_and_name(lock, &__lockdep_no_validate__, #lock)
 /*
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 81f1a7107c0e..f3b3b3e7938a 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3022,13 +3022,17 @@ void lockdep_trace_alloc(gfp_t gfp_mask)
 static int mark_lock(struct task_struct *curr, struct held_lock *this,
 		     enum lock_usage_bit new_bit)
 {
+	struct lock_class *class = hlock_class(this);
 	unsigned int new_mask = 1 << new_bit, ret = 1;
 
 	/*
 	 * If already set then do not dirty the cacheline,
 	 * nor do any checks:
 	 */
-	if (likely(hlock_class(this)->usage_mask & new_mask))
+	if (likely(class->usage_mask & new_mask))
+		return 1;
+
+	if (class->skip_mask & (new_mask >> 2))
 		return 1;
 
 	if (!graph_lock())
@@ -3036,14 +3040,14 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
 	/*
 	 * Make sure we didn't race:
 	 */
-	if (unlikely(hlock_class(this)->usage_mask & new_mask)) {
+	if (unlikely(class->usage_mask & new_mask)) {
 		graph_unlock();
 		return 1;
 	}
 
-	hlock_class(this)->usage_mask |= new_mask;
+	class->usage_mask |= new_mask;
 
-	if (!save_trace(hlock_class(this)->usage_traces + new_bit))
+	if (!save_trace(class->usage_traces + new_bit))
 		return 0;
 
 	switch (new_bit) {
@@ -3586,6 +3590,24 @@ static int __lock_is_held(struct lockdep_map *lock)
 	return 0;
 }
 
+static void __lock_skip_reclaim(struct lockdep_map *lock, int subclass)
+{
+	struct lock_class *class = register_lock_class(lock, subclass, 0);
+
+	if (!class)
+		return;
+
+	if (class->skip_mask & (1 << RECLAIM_FS))
+		return;
+
+	if (!graph_lock())
+		return;
+
+	class->skip_mask |= 1 << RECLAIM_FS;
+
+	graph_unlock();
+}
+
 static struct pin_cookie __lock_pin_lock(struct lockdep_map *lock)
 {
 	struct pin_cookie cookie = NIL_COOKIE;
@@ -3784,6 +3806,23 @@ int lock_is_held(struct lockdep_map *lock)
 }
 EXPORT_SYMBOL_GPL(lock_is_held);
 
+void lock_skip_reclaim(struct lockdep_map *lock, int subclass)
+{
+	unsigned long flags;
+
+	if (unlikely(current->lockdep_recursion))
+		return;
+
+	raw_local_irq_save(flags);
+	check_flags(flags);
+
+	current->lockdep_recursion = 1;
+	__lock_skip_reclaim(lock, subclass);
+	current->lockdep_recursion = 0;
+	raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lock_skip_reclaim);
+
 struct pin_cookie lock_pin_lock(struct lockdep_map *lock)
 {
 	struct pin_cookie cookie = NIL_COOKIE;
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-05-17 14:49 ` Peter Zijlstra @ 2016-05-17 22:35 ` Dave Chinner 2016-05-18 7:20 ` Peter Zijlstra 2016-05-19 8:11 ` Peter Zijlstra 0 siblings, 2 replies; 37+ messages in thread From: Dave Chinner @ 2016-05-17 22:35 UTC (permalink / raw) To: Peter Zijlstra Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Michal Hocko, Ingo Molnar On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote: > > Thanks for writing all that down Dave! > > On Tue, May 17, 2016 at 09:10:56AM +1000, Dave Chinner wrote: > > > The reason we don't have lock clases for the ilock is that we aren't > > supposed to call memory reclaim with that lock held in exclusive > > mode. This is because reclaim can run transactions, and that may > > need to flush dirty inodes to make progress. Flushing dirty inode > > requires taking the ilock in shared mode. > > > > In the code path that was reported, we hold the ilock in /shared/ > > mode with no transaction context (we are doing a read-only > > operation). This means we can run transactions in memory reclaim > > because a) we can't deadlock on the inode we hold locks on, and b) > > transaction reservations will be able to make progress as we don't > > hold any locks it can block on. > > Just to clarify; I read the above as that we cannot block on recursive > shared locks, is this correct? > > Because we can in fact block on down_read()+down_read() just fine, so if > you're assuming that, then something's busted. The transaction reservation path will run down_read_trylock() on the inode, not down_read(). Hence if there are no pending writers, it will happily take the lock twice and make progress, otherwise it will skip the inode and there's no deadlock. If there's a pending writer, then we have another context that is already in a transaction context and has already pushed the item, hence it is only in the scope of the current push because IO hasn't completed yet and removed it from the list. 
> Otherwise, I'm not quite reading it right, which is, given the > complexity of that stuff, entirely possible. There's a maze of dark, grue-filled twisty passages here... > The other possible reading is that we cannot deadlock on the inode we > hold locks on because we hold a reference on it; and the reference > avoids the inode from being reclaimed. But then the whole > shared/exclusive thing doesn't seem to make sense. Right, because that's not the problem. The issue has to do with transaction contexts and what locks are safe to hold when calling xfs_trans_reserve(). Direct reclaim is putting xfs_trans_reserve() behind memory allocation, which means it is unsafe for XFS to hold the ilock exclusive or be in an existing transaction context when doing GFP_KERNEL allocation. > > For the ilock, the number of places where the ilock is held over > > GFP_KERNEL allocations is pretty small. Hence we've simply added > > GFP_NOFS to those allocations to - effectively - annotate those > > allocations as "lockdep causes problems here". There are probably > > 30-35 allocations in XFS that explicitly use KM_NOFS - some of these > > are masking lockdep false positive reports. > > > > In the end, like pretty much all the complex lockdep false positives > > we've had to deal in XFS, we've ended up changing the locking or > > allocation contexts because that's been far easier than trying to > > make annotations cover everything or convince other people that > > lockdep annotations are insufficient. > > Well, I don't mind creating lockdep annotations; but explanations of the > exact details always go a long way towards helping me come up with > something. > > While going over the code; I see there's complaining about > MAX_SUBCLASSES being too small. Would it help if we doubled it? We > cannot grow the thing without limits, but doubling it should be possible > I think. Last time I asked if we could increase MAX_SUBCLASSES I was told no.
So we've just had to try to fit about 30 different inode lock contexts into 8 subclasses split across multiple class types (i.e. xfs_[non]dir_ilock_class). I wasted an entire week on getting those annotations to fit the limitations of lockdep and still work. > In any case; would something like this work for you? Its entirely > untested, but the idea is to mark an entire class to skip reclaim > validation, instead of marking individual sites. Probably would, but it seems like swatting a fly with a runaway train. I'd much prefer a per-site annotation (e.g. as a GFP_ flag) so that we don't turn off something that will tell us we've made a mistake while developing new code... Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-05-17 22:35 ` Dave Chinner @ 2016-05-18 7:20 ` Peter Zijlstra 2016-05-18 8:25 ` Michal Hocko 2016-05-19 8:11 ` Peter Zijlstra 1 sibling, 1 reply; 37+ messages in thread From: Peter Zijlstra @ 2016-05-18 7:20 UTC (permalink / raw) To: Dave Chinner Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Michal Hocko, Ingo Molnar On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote: > On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote: > > > > Thanks for writing all that down Dave! > > > > On Tue, May 17, 2016 at 09:10:56AM +1000, Dave Chinner wrote: > > > > > The reason we don't have lock clases for the ilock is that we aren't > > > supposed to call memory reclaim with that lock held in exclusive > > > mode. This is because reclaim can run transactions, and that may > > > need to flush dirty inodes to make progress. Flushing dirty inode > > > requires taking the ilock in shared mode. > > > > > > In the code path that was reported, we hold the ilock in /shared/ > > > mode with no transaction context (we are doing a read-only > > > operation). This means we can run transactions in memory reclaim > > > because a) we can't deadlock on the inode we hold locks on, and b) > > > transaction reservations will be able to make progress as we don't > > > hold any locks it can block on. > > > > Just to clarify; I read the above as that we cannot block on recursive > > shared locks, is this correct? > > > > Because we can in fact block on down_read()+down_read() just fine, so if > > you're assuming that, then something's busted. > > The transaction reservation path will run down_read_trylock() on the > inode, not down_read(). Hence if there are no pending writers, it > will happily take the lock twice and make progress, otherwise it > will skip the inode and there's no deadlock. 
If there's a pending > writer, then we have another context that is already in a > transaction context and has already pushed the item, hence it is > only in the scope of the current push because IO hasn't completed > yet and removed it from the list. > > > Otherwise, I'm not quite reading it right, which is, given the > > complexity of that stuff, entirely possible. > > There's a maze of dark, grue-filled twisty passages here... I'm sure. Thanks for the extra detail; yes the down_read_trylock() thing will work. > > In any case; would something like this work for you? Its entirely > > untested, but the idea is to mark an entire class to skip reclaim > > validation, instead of marking individual sites. > > Probably would, but it seems like swatting a fly with runaway > train. I'd much prefer a per-site annotation (e.g. as a GFP_ flag) > so that we don't turn off something that will tell us we've made a > mistake while developing new code... Fair enough; if the mm folks don't object to 'wasting' a GFP flag on this the below ought to do I think.

---
 include/linux/gfp.h      | 9 ++++++++-
 kernel/locking/lockdep.c | 3 +++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..7b5b9db4c821 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -41,6 +41,7 @@ struct vm_area_struct;
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
 #define ___GFP_KSWAPD_RECLAIM	0x2000000u
+#define ___GFP_NOVALIDATE	0x4000000u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -146,6 +147,11 @@ struct vm_area_struct;
  * return NULL when direct reclaim and memory compaction have failed to allow
  * the allocation to succeed. The OOM killer is not called with the current
  * implementation.
+ *
+ * __GFP_NOVALIDATE: lockdep annotation; do not use this allocation to set
+ * LOCK_USED_IN_RECLAIM_FS and thereby avoid possible false positives.
+ * Use of this flag should have a comment explaining the false positive
+ * in detail.
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)
@@ -155,6 +161,7 @@ struct vm_area_struct;
 #define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)
 #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
 #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)
+#define __GFP_NOVALIDATE ((__force gfp_t)___GFP_NOVALIDATE)
 
 /*
  * Action modifiers
@@ -188,7 +195,7 @@ struct vm_area_struct;
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 26
+#define __GFP_BITS_SHIFT 27
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 81f1a7107c0e..88bf43c3bf76 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2853,6 +2853,9 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (unlikely(!debug_locks))
 		return;
 
+	if (gfp_mask & __GFP_NOVALIDATE)
+		return;
+
 	/* no reclaim without waiting on it */
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
 		return;
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-05-18 7:20 ` Peter Zijlstra @ 2016-05-18 8:25 ` Michal Hocko 2016-05-18 9:49 ` Peter Zijlstra 0 siblings, 1 reply; 37+ messages in thread From: Michal Hocko @ 2016-05-18 8:25 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Ingo Molnar On Wed 18-05-16 09:20:05, Peter Zijlstra wrote: > On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote: > > On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote: [...] > > > In any case; would something like this work for you? Its entirely > > > untested, but the idea is to mark an entire class to skip reclaim > > > validation, instead of marking individual sites. > > > > Probably would, but it seems like swatting a fly with runaway > > train. I'd much prefer a per-site annotation (e.g. as a GFP_ flag) > > so that we don't turn off something that will tell us we've made a > > mistake while developing new code... > > Fair enough; if the mm folks don't object to 'wasting' a GFP flag on > this the below ought to do I think. GFP flag space is quite scarce. Especially when it would be used only for lockdep configurations which are mostly disabled. Why cannot we go with an explicit disable/enable API I have proposed? It would be lockdep-contained and quite easy to grep for, and git blame would tell us (hopefully) why the lockdep had to be put out of the way for the particular path. -- Michal Hocko SUSE Labs
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-05-18 8:25 ` Michal Hocko @ 2016-05-18 9:49 ` Peter Zijlstra 2016-05-18 11:31 ` Michal Hocko 0 siblings, 1 reply; 37+ messages in thread From: Peter Zijlstra @ 2016-05-18 9:49 UTC (permalink / raw) To: Michal Hocko; +Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Ingo Molnar On Wed, May 18, 2016 at 10:25:39AM +0200, Michal Hocko wrote: > On Wed 18-05-16 09:20:05, Peter Zijlstra wrote: > > On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote: > > > On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote: > [...] > > > > In any case; would something like this work for you? Its entirely > > > > untested, but the idea is to mark an entire class to skip reclaim > > > > validation, instead of marking individual sites. > > > > > > Probably would, but it seems like swatting a fly with runaway > > > train. I'd much prefer a per-site annotation (e.g. as a GFP_ flag) > > > so that we don't turn off something that will tell us we've made a > > > mistake while developing new code... > > > > Fair enough; if the mm folks don't object to 'wasting' a GFP flag on > > this the below ought to do I think. > > GFP flag space is quite scarse. There's still 5 or so bits available, and you could always make gfp_t u64. > Especially when it would be used only > for lockdep configurations which are mostly disabled. Why cannot we go > with an explicit disable/enable API I have proposed? It has unbounded scope. And in that respect the GFP flag thingy is wider than I'd like too, it avoids setting the state for all held locks, even though we'd only like to avoid setting it for one class. So ideally we'd combine the GFP flag with the previously proposed skip flag to only avoid marking the one class while keeping everything working for all other held locks. 
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-05-18 9:49 ` Peter Zijlstra @ 2016-05-18 11:31 ` Michal Hocko 0 siblings, 0 replies; 37+ messages in thread From: Michal Hocko @ 2016-05-18 11:31 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Ingo Molnar On Wed 18-05-16 11:49:52, Peter Zijlstra wrote: > On Wed, May 18, 2016 at 10:25:39AM +0200, Michal Hocko wrote: > > On Wed 18-05-16 09:20:05, Peter Zijlstra wrote: > > > On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote: > > > > On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote: > > [...] > > > > > In any case; would something like this work for you? Its entirely > > > > > untested, but the idea is to mark an entire class to skip reclaim > > > > > validation, instead of marking individual sites. > > > > > > > > Probably would, but it seems like swatting a fly with runaway > > > > train. I'd much prefer a per-site annotation (e.g. as a GFP_ flag) > > > > so that we don't turn off something that will tell us we've made a > > > > mistake while developing new code... > > > > > > Fair enough; if the mm folks don't object to 'wasting' a GFP flag on > > > this the below ought to do I think. > > > > GFP flag space is quite scarse. > > There's still 5 or so bits available, and you could always make gfp_t > u64. It seems we have some places where we encode further data into the same word as gfp_mask (radix tree tags and mapping_flags). From a quick glance they should be OK even with __GFP_BITS_SHIFT increased to 27 but this tells us that we shouldn't consume them without a good reason. > > Especially when it would be used only > > for lockdep configurations which are mostly disabled. Why cannot we go > > with an explicit disable/enable API I have proposed? > > It has unbounded scope. 
And in that respect the GFP flag thingy is wider > than I'd like too, it avoids setting the state for all held locks, even > though we'd only like to avoid setting it for one class. > > So ideally we'd combine the GFP flag with the previously proposed skip > flag to only avoid marking the one class while keeping everything > working for all other held locks. This is definitely your call but I would prefer starting with something simple and extending it when we find out that the scope/gfp opt-out hides real bugs or it is insufficient for other reasons. I do not expect this opt-out to be used much, quite the contrary. We do not hear about false positive reclaim lockdep lockups very often - except for very complex reclaim implementations which are quite uncommon. Thanks! -- Michal Hocko SUSE Labs
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-17 22:35       ` Dave Chinner
  2016-05-18  7:20         ` Peter Zijlstra
@ 2016-05-19  8:11         ` Peter Zijlstra
  2016-05-20  0:17           ` Dave Chinner
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2016-05-19 8:11 UTC (permalink / raw)
To: Dave Chinner
Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Michal Hocko, Ingo Molnar

On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote:
> On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote:
> > On Tue, May 17, 2016 at 09:10:56AM +1000, Dave Chinner wrote:
> > >
> > > The reason we don't have lock classes for the ilock is that we
> > > aren't supposed to call memory reclaim with that lock held in
> > > exclusive mode. This is because reclaim can run transactions, and
> > > that may need to flush dirty inodes to make progress. Flushing a
> > > dirty inode requires taking the ilock in shared mode.
> > >
> > > In the code path that was reported, we hold the ilock in /shared/
> > > mode with no transaction context (we are doing a read-only
> > > operation). This means we can run transactions in memory reclaim
> > > because a) we can't deadlock on the inode we hold locks on, and b)
> > > transaction reservations will be able to make progress as we don't
> > > hold any locks they can block on.
> >
> > Just to clarify; I read the above as saying that we cannot block on
> > recursive shared locks, is this correct?
> >
> > Because we can in fact block on down_read()+down_read() just fine, so
> > if you're assuming that, then something's busted.
>
> The transaction reservation path will run down_read_trylock() on the
> inode, not down_read(). Hence if there are no pending writers, it
> will happily take the lock twice and make progress; otherwise it
> will skip the inode and there's no deadlock. If there's a pending
> writer, then we have another context that is already in a
> transaction context and has already pushed the item, hence it is
> only in the scope of the current push because IO hasn't completed
> yet and removed it from the list.
>
> > Otherwise, I'm not quite reading it right, which is, given the
> > complexity of that stuff, entirely possible.
>
> There's a maze of dark, grue-filled twisty passages here...

OK; I might need a bit more again.

So now the code does something like:

	down_read(&i_lock);	-- lockdep marks lock as held
	kmalloc(GFP_KERNEL);	-- lockdep marks held locks as ENABLED_RECLAIM_FS
	  --> reclaim()
	    down_read_trylock(&i_lock); -- lockdep does _NOT_ mark as USED_IN_RECLAIM_FS

Right?

My 'problem' is that lockdep doesn't consider a trylock for the USED_IN
annotation, so the i_lock class will only get the ENABLED tag but not
the USED_IN tag, and therefore _should_ not trigger the inversion.

So what exactly is triggering the inversion?

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 37+ messages in thread
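[Editor's note: the following is a toy model of the lockdep bookkeeping
Peter describes, added for illustration; it is a simplification, not the
kernel implementation - real lockdep tracks many more usage states per
lock class. It only shows why a trylock taken from reclaim context does
not, by itself, produce the ENABLED/USED_IN inversion.]

```python
# Toy model of lockdep's RECLAIM_FS usage tracking.
ENABLED_RECLAIM_FS = "ENABLED_RECLAIM_FS"   # held while a GFP_KERNEL alloc ran
USED_IN_RECLAIM_FS = "USED_IN_RECLAIM_FS"   # acquired from reclaim context

class LockClass:
    def __init__(self, name):
        self.name = name
        self.usage = set()

class Task:
    def __init__(self):
        self.held = []
        self.in_reclaim = False

    def acquire(self, cls, trylock=False):
        if self.in_reclaim and not trylock:
            # Only blocking acquisitions in reclaim set USED_IN_*;
            # a trylock cannot deadlock, so lockdep skips marking it.
            cls.usage.add(USED_IN_RECLAIM_FS)
        self.held.append(cls)

    def kmalloc_gfp_kernel(self):
        # A GFP_KERNEL allocation may enter reclaim: mark all held locks.
        for cls in self.held:
            cls.usage.add(ENABLED_RECLAIM_FS)

def inversion(cls):
    # Report when a class is both held across reclaim and taken in reclaim.
    return {ENABLED_RECLAIM_FS, USED_IN_RECLAIM_FS} <= cls.usage

i_lock = LockClass("xfs_nondir_ilock_class")
t = Task()
t.acquire(i_lock)                 # down_read(&i_lock)
t.kmalloc_gfp_kernel()            # i_lock becomes ENABLED_RECLAIM_FS
t.in_reclaim = True
t.acquire(i_lock, trylock=True)   # down_read_trylock() in reclaim: no USED_IN
assert not inversion(i_lock)      # this pattern alone cannot fire the splat
```

In this model, something else must take the class with a blocking
acquisition from reclaim context for the report to fire - which is
exactly Peter's question about what actually triggered it.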
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-19  8:11         ` Peter Zijlstra
@ 2016-05-20  0:17           ` Dave Chinner
  2016-06-01 13:17             ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2016-05-20 0:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Michal Hocko, Ingo Molnar

On Thu, May 19, 2016 at 10:11:46AM +0200, Peter Zijlstra wrote:
> On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote:
> > On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 17, 2016 at 09:10:56AM +1000, Dave Chinner wrote:
> > > >
> > > > The reason we don't have lock classes for the ilock is that we
> > > > aren't supposed to call memory reclaim with that lock held in
> > > > exclusive mode. This is because reclaim can run transactions, and
> > > > that may need to flush dirty inodes to make progress. Flushing a
> > > > dirty inode requires taking the ilock in shared mode.
> > > >
> > > > In the code path that was reported, we hold the ilock in /shared/
> > > > mode with no transaction context (we are doing a read-only
> > > > operation). This means we can run transactions in memory reclaim
> > > > because a) we can't deadlock on the inode we hold locks on, and
> > > > b) transaction reservations will be able to make progress as we
> > > > don't hold any locks they can block on.
> > >
> > > Just to clarify; I read the above as saying that we cannot block on
> > > recursive shared locks, is this correct?
> > >
> > > Because we can in fact block on down_read()+down_read() just fine,
> > > so if you're assuming that, then something's busted.
> >
> > The transaction reservation path will run down_read_trylock() on the
> > inode, not down_read(). Hence if there are no pending writers, it
> > will happily take the lock twice and make progress; otherwise it
> > will skip the inode and there's no deadlock. If there's a pending
> > writer, then we have another context that is already in a
> > transaction context and has already pushed the item, hence it is
> > only in the scope of the current push because IO hasn't completed
> > yet and removed it from the list.
> >
> > > Otherwise, I'm not quite reading it right, which is, given the
> > > complexity of that stuff, entirely possible.
> >
> > There's a maze of dark, grue-filled twisty passages here...
>
> OK; I might need a bit more again.
>
> So now the code does something like:
>
> 	down_read(&i_lock);	-- lockdep marks lock as held
> 	kmalloc(GFP_KERNEL);	-- lockdep marks held locks as ENABLED_RECLAIM_FS
> 	  --> reclaim()
> 	    down_read_trylock(&i_lock); -- lockdep does _NOT_ mark as USED_IN_RECLAIM_FS
>
> Right?

In the path that can deadlock the log, yes. It's actually way more
complex than the above, because the down_read_trylock(&i_lock) that
matters is run in a completely separate, async kthread that
xfs_trans_reserve() will block waiting for.

process context                      xfsaild kthread(*)
---------------                      ------------------
down_read(&i_lock);  -- lockdep marks lock as held
kmalloc(GFP_KERNEL); -- lockdep marks held locks as ENABLED_RECLAIM_FS
  --> reclaim()
      xfs_trans_reserve()
        ....
        xfs_trans_push_ail()   ---- called if no space in the log
                                    to kick the xfsaild into action
        ....
        xlog_grant_head_wait() ---- blocks waiting for log space
        .....
                                     xfsaild_push()  ----- iterates AIL
                                       grabs log item
                                       lock log item
      >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> down_read_trylock(&i_lock);
                                       format item into buffer
                                       add to dirty buffer list
                                       ....
                                       submit dirty buffer list for IO
                                       buffer IO started
                                       .....
                                     <async IO completion context>
                                       buffer callbacks
                                         mark inode clean
                                         remove inode from AIL
                                         move tail of log forwards
                                         wake grant head waiters
        <woken by log tail moving>
        <log space available>
        transaction reservation granted
      .....
      down_write(some other inode ilock)
      <modify some other inode>
      xfs_trans_commit
      .....

(*) xfsaild runs with PF_MEMALLOC context.

The problem is that if the ilock is held exclusively at GFP_KERNEL
time, the xfsaild cannot lock the inode to flush it, so if that inode
pins the tail of the log then we can't make space available for
xfs_trans_reserve() and there is the deadlock.

Once xfs_trans_reserve() completes, however, we'll take the ilock on
*some other inode*, and that's where the "it can't be the inode we
currently hold locked because we have references to it" reasoning
applies, and hence it's safe to have a pattern like:

	down_read(&i_lock);	-- lockdep marks lock as held
	kmalloc(GFP_KERNEL);	-- lockdep marks held locks as ENABLED_RECLAIM_FS
	  --> reclaim()
	    down_write(&ilock)

because the lock within reclaim context is completely unrelated to the
lock we already hold.

Lockdep can't possibly know about this because the deadlock involves
locking contexts that *aren't doing anything wrong within their own
contexts*. It's only when you add the dependency of the log space
reservation requirements needed to make forward progress that there's
an issue with locking and reclaim.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 37+ messages in thread
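[Editor's note: a small simulation of the dependency Dave describes,
added for illustration. It is a toy model, not XFS code - the names
(`xfsaild_flush`, `trans_reserve`) are stand-ins, and the reservation
wait is collapsed into a single push attempt.]

```python
# Toy model: a transaction reservation can only make progress if the AIL
# thread can flush the inode that pins the tail of the log, and the AIL
# thread only ever trylocks the ilock.

class RWLock:
    def __init__(self):
        self.readers = 0
        self.writer = False

    def try_read(self):
        if self.writer:
            return False
        self.readers += 1
        return True

class Inode:
    def __init__(self):
        self.ilock = RWLock()
        self.dirty = True          # this inode pins the tail of the log

def xfsaild_flush(inode):
    # The AIL thread only ever trylocks; on failure it must skip the inode.
    if not inode.ilock.try_read():
        return False
    inode.dirty = False            # flushed: the log tail can move forward
    inode.ilock.readers -= 1
    return True

def trans_reserve(tail_inode):
    # The reservation blocks until the tail inode is flushed; model that
    # as one push attempt, where False means "would block forever".
    return xfsaild_flush(tail_inode)

# Case 1: ilock held SHARED across a GFP_KERNEL allocation, reclaim runs
# a transaction. The trylock still succeeds, so progress is made.
inode = Inode()
inode.ilock.try_read()                 # down_read(&ilock)
assert trans_reserve(inode) is True    # reclaim can get log space

# Case 2: ilock held EXCLUSIVE. The AIL cannot flush the inode, the log
# tail never moves, and the reservation would block forever: deadlock.
inode2 = Inode()
inode2.ilock.writer = True             # down_write(&ilock)
assert trans_reserve(inode2) is False
```

This is why the shared-hold pattern is safe while the exclusive-hold one
is not, even though lockdep sees the same lock class in both cases.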
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-05-20 0:17 ` Dave Chinner @ 2016-06-01 13:17 ` Michal Hocko 2016-06-01 18:16 ` Peter Zijlstra 0 siblings, 1 reply; 37+ messages in thread From: Michal Hocko @ 2016-06-01 13:17 UTC (permalink / raw) To: Peter Zijlstra, Dave Chinner Cc: linux-mm, Qu Wenruo, xfs, Ingo Molnar, Darrick J. Wong Thanks Dave for your detailed explanation again! Peter do you have any other idea how to deal with these situations other than opt out from lockdep reclaim machinery? If not I would rather go with an annotation than a gfp flag to be honest but if you absolutely hate that approach then I will try to check wheter a CONFIG_LOCKDEP GFP_FOO doesn't break something else. Otherwise I would steal the description from Dave's email and repost my patch. I plan to repost my scope gfp patches in few days and it would be good to have some mechanism to drop those GFP_NOFS to paper over lockdep false positives for that. [keeping Dave's explanation for reference] On Fri 20-05-16 10:17:14, Dave Chinner wrote: > On Thu, May 19, 2016 at 10:11:46AM +0200, Peter Zijlstra wrote: > > On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote: [...] > > > There's a maze of dark, grue-filled twisty passages here... > > > > OK; I might need a bit more again. > > > > So now the code does something like: > > > > down_read(&i_lock); -- lockdep marks lock as held > > kmalloc(GFP_KERNEL); -- lockdep marks held locks as ENABLED_RECLAIM_FS > > --> reclaim() > > down_read_trylock(&i_lock); -- lockdep does _NOT_ mark as USED_IN_RECLAIM_FS > > > > Right? > > In the path that can deadlock the log, yes. It's actually way more > complex than the above, because the down_read_trylock(&i_lock) that > matters is run in a completely separate, async kthread that > xfs_trans_reserve() will block waiting for. 
> > process context xfsaild kthread(*) > --------------- ------------------ > down_read(&i_lock); -- lockdep marks lock as held > kmalloc(GFP_KERNEL); -- lockdep marks held locks as ENABLED_RECLAIM_FS > --> reclaim() > xfs_trans_reserve() > .... > xfs_trans_push_ail() ---- called if no space in the log to kick the xfsaild into action > .... > xlog_grant_head_wait() ---- blocks waiting for log space > ..... > > xfsaild_push() ----- iterates AIL > grabs log item > lock log item > >>>>>>>>>>>>>>>>>>>>> down_read_trylock(&i_lock); > format item into buffer > add to dirty buffer list > .... > submit dirty buffer list for IO > buffer IO started > ..... > <async IO completion context> > buffer callbacks > mark inode clean > remove inode from AIL > move tail of log forwards > wake grant head waiters > <woken by log tail moving> > <log space available> > transaction reservation granted > ..... > down_write(some other inode ilock) > <modify some other inode> > xfs_trans_commit > ..... > > (*) xfsaild runs with PF_MEMALLOC context. > > The problem is that if the ilock is held exclusively at GFP_KERNEL > time, the xfsaild cannot lock the inode to flush it, so if that > inode pins the tail of the log then we can't make space available > for xfs_trans_reserve and there is the deadlock. > > Once xfs_trans_reserve completes, however, we'll take the ilock on > *some other inode*, and that's where the "it can't be the inode we > currently hold locked because we have references to it" and > henceit's safe to have a pattern like: > > down_read(&i_lock); -- lockdep marks lock as held > kmalloc(GFP_KERNEL); -- lockdep marks held locks as ENABLED_RECLAIM_FS > --> reclaim() > down_write(&ilock) > > because the lock within reclaim context is completely unrelated to > the lock we already hold. > > Lockdep can't possibly know about this because the deadlock involves > locking contexts that *aren't doing anything wrong within their own > contexts*. 
It's only when you add the dependency of log space > reservation requirements needed to make forwards progress that > there's then an issue with locking and reclaim. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com -- Michal Hocko SUSE Labs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-01 13:17             ` Michal Hocko
@ 2016-06-01 18:16               ` Peter Zijlstra
  2016-06-02 14:50                 ` Michal Hocko
  2016-10-06 13:04                 ` Michal Hocko
  0 siblings, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2016-06-01 18:16 UTC (permalink / raw)
To: Michal Hocko; +Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Ingo Molnar

On Wed, Jun 01, 2016 at 03:17:58PM +0200, Michal Hocko wrote:
> Thanks Dave for your detailed explanation again! Peter do you have any
> other idea how to deal with these situations other than opting out from
> the lockdep reclaim machinery?
>
> If not I would rather go with an annotation than a gfp flag to be
> honest, but if you absolutely hate that approach then I will try to
> check whether a CONFIG_LOCKDEP GFP_FOO doesn't break something else.
> Otherwise I would steal the description from Dave's email and repost my
> patch.
>
> I plan to repost my scope gfp patches in a few days and it would be
> good to have some mechanism to drop those GFP_NOFS to paper over
> lockdep false positives for that.

Right; sorry, I got side-tracked by other things again.

So my favourite is the dedicated GFP flag, but if that's unpalatable for
the mm folks then something like the below might work. It should be
similar in effect to your proposal, except it is more limited in scope.

---
 include/linux/gfp.h      |  5 ++++-
 include/linux/lockdep.h  |  2 ++
 kernel/locking/lockdep.c | 36 ++++++++++++++++++++++++++++++++++++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..d6be35643ee7 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -187,7 +187,10 @@ struct vm_area_struct;
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)
 
-/* Room for N __GFP_FOO bits */
+/*
+ * Room for N __GFP_FOO bits.
+ * Fix lockdep's __GFP_SKIP_ALLOC if this ever hits 32.
+ */
 #define __GFP_BITS_SHIFT 26
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index eabe0138eb06..08a021b1e275 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -354,6 +354,7 @@ static inline void lock_set_subclass(struct lockdep_map *lock,
 
 extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask);
 extern void lockdep_clear_current_reclaim_state(void);
+extern void lockdep_skip_alloc(void);
 extern void lockdep_trace_alloc(gfp_t mask);
 
 struct pin_cookie { unsigned int val; };
@@ -398,6 +399,7 @@ static inline void lockdep_on(void)
 # define lock_set_subclass(l, s, i)		do { } while (0)
 # define lockdep_set_current_reclaim_state(g)	do { } while (0)
 # define lockdep_clear_current_reclaim_state()	do { } while (0)
+# define lockdep_skip_alloc()			do { } while (0)
 # define lockdep_trace_alloc(g)			do { } while (0)
 # define lockdep_info()				do { } while (0)
 # define lockdep_init_map(lock, name, key, sub) \
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 589d763a49b3..aa3ccbadc74e 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2851,6 +2851,13 @@ void trace_softirqs_off(unsigned long ip)
 		debug_atomic_inc(redundant_softirqs_off);
 }
 
+#define __GFP_SKIP_ALLOC	(1UL << __GFP_BITS_SHIFT)
+
+static void __lockdep_skip_alloc(void)
+{
+	current->lockdep_reclaim_gfp |= __GFP_SKIP_ALLOC;
+}
+
 static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 {
 	struct task_struct *curr = current;
@@ -2876,11 +2883,36 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
 		return;
 
+	/*
+	 * Skip _one_ allocation as per the lockdep_skip_alloc() request.
+	 * Must be done last so that we don't lose the annotation for
+	 * GFP_ATOMIC like things from IRQ or other nesting contexts.
+	 */
+	if (current->lockdep_reclaim_gfp & __GFP_SKIP_ALLOC) {
+		current->lockdep_reclaim_gfp &= ~__GFP_SKIP_ALLOC;
+		return;
+	}
+
 	mark_held_locks(curr, RECLAIM_FS);
 }
 
 static void check_flags(unsigned long flags);
 
+void lockdep_skip_alloc(void)
+{
+	unsigned long flags;
+
+	if (unlikely(current->lockdep_recursion))
+		return;
+
+	raw_local_irq_save(flags);
+	check_flags(flags);
+	current->lockdep_recursion = 1;
+	__lockdep_skip_alloc();
+	current->lockdep_recursion = 0;
+	raw_local_irq_restore(flags);
+}
+
 void lockdep_trace_alloc(gfp_t gfp_mask)
 {
 	unsigned long flags;
@@ -3015,6 +3047,10 @@ static inline int separate_irq_context(struct task_struct *curr,
 	return 0;
 }
 
+void lockdep_skip_alloc(void)
+{
+}
+
 void lockdep_trace_alloc(gfp_t gfp_mask)
 {
 }

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-06-01 18:16 ` Peter Zijlstra @ 2016-06-02 14:50 ` Michal Hocko 2016-06-02 15:11 ` Peter Zijlstra 2016-10-06 13:04 ` Michal Hocko 1 sibling, 1 reply; 37+ messages in thread From: Michal Hocko @ 2016-06-02 14:50 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Ingo Molnar On Wed 01-06-16 20:16:17, Peter Zijlstra wrote: > On Wed, Jun 01, 2016 at 03:17:58PM +0200, Michal Hocko wrote: > > Thanks Dave for your detailed explanation again! Peter do you have any > > other idea how to deal with these situations other than opt out from > > lockdep reclaim machinery? > > > > If not I would rather go with an annotation than a gfp flag to be honest > > but if you absolutely hate that approach then I will try to check wheter > > a CONFIG_LOCKDEP GFP_FOO doesn't break something else. Otherwise I would > > steal the description from Dave's email and repost my patch. > > > > I plan to repost my scope gfp patches in few days and it would be good > > to have some mechanism to drop those GFP_NOFS to paper over lockdep > > false positives for that. > > Right; sorry I got side-tracked in other things again. > > So my favourite is the dedicated GFP flag, but if that's unpalatable for > the mm folks then something like the below might work. It should be > similar in effect to your proposal, except its more limited in scope. [...] > @@ -2876,11 +2883,36 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags) > if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))) > return; > > + /* > + * Skip _one_ allocation as per the lockdep_skip_alloc() request. > + * Must be done last so that we don't loose the annotation for > + * GFP_ATOMIC like things from IRQ or other nesting contexts. 
> +	 */
> +	if (current->lockdep_reclaim_gfp & __GFP_SKIP_ALLOC) {
> +		current->lockdep_reclaim_gfp &= ~__GFP_SKIP_ALLOC;
> +		return;
> +	}
> +
> 	mark_held_locks(curr, RECLAIM_FS);
> }

I might be missing something, but does this actually work? Say you want
a kmalloc(size); it would call

	slab_alloc_node
	  slab_pre_alloc_hook
	    lockdep_trace_alloc
	  [...]
	  ____cache_alloc_node
	    cache_grow_begin
	      kmem_getpages
	        __alloc_pages_node
	          __alloc_pages_nodemask
	            lockdep_trace_alloc

so lockdep_trace_alloc() can run twice for a single allocation, and the
skip flag would be consumed by the first call.

I understand your concerns about the scope, but usually all allocations
have to be __GFP_NOFS or none in the same scope, so I would not see it
as a huge deal.
-- 
Michal Hocko
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 37+ messages in thread
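[Editor's note: a small illustration of Michal's objection, added for
clarity. It is a toy model of the "skip one allocation" flag, not the
kernel code; `kmalloc_with_empty_slab` is a hypothetical name for the
path where the slab must grow a new page.]

```python
# A single kmalloc() can invoke the lockdep allocation hook more than
# once: the slab pre-alloc hook first, then the page allocator when the
# slab cache needs a fresh page. A flag that skips exactly one hook
# invocation is consumed by the first call; the second still marks the
# held locks, so the annotation fails to suppress the report.

SKIP_ALLOC = 1 << 26              # stand-in for __GFP_SKIP_ALLOC

class TaskState:
    def __init__(self):
        self.reclaim_gfp = 0
        self.marks = 0            # times mark_held_locks() ran

    def skip_alloc(self):         # lockdep_skip_alloc()
        self.reclaim_gfp |= SKIP_ALLOC

    def trace_alloc(self):        # lockdep_trace_alloc()
        if self.reclaim_gfp & SKIP_ALLOC:
            self.reclaim_gfp &= ~SKIP_ALLOC   # consumed, skip once
            return
        self.marks += 1

def kmalloc_with_empty_slab(task):
    task.trace_alloc()    # slab_pre_alloc_hook()
    task.trace_alloc()    # __alloc_pages_nodemask() for the new slab page

t = TaskState()
t.skip_alloc()                    # annotate the next allocation
kmalloc_with_empty_slab(t)
assert t.marks == 1               # inner hook still marked: annotation lost
```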
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-06-02 14:50 ` Michal Hocko @ 2016-06-02 15:11 ` Peter Zijlstra 2016-06-02 15:46 ` Michal Hocko 0 siblings, 1 reply; 37+ messages in thread From: Peter Zijlstra @ 2016-06-02 15:11 UTC (permalink / raw) To: Michal Hocko; +Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Ingo Molnar On Thu, Jun 02, 2016 at 04:50:49PM +0200, Michal Hocko wrote: > On Wed 01-06-16 20:16:17, Peter Zijlstra wrote: > > So my favourite is the dedicated GFP flag, but if that's unpalatable for > > the mm folks then something like the below might work. It should be > > similar in effect to your proposal, except its more limited in scope. > [...] > > @@ -2876,11 +2883,36 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags) > > if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))) > > return; > > > > + /* > > + * Skip _one_ allocation as per the lockdep_skip_alloc() request. > > + * Must be done last so that we don't loose the annotation for > > + * GFP_ATOMIC like things from IRQ or other nesting contexts. > > + */ > > + if (current->lockdep_reclaim_gfp & __GFP_SKIP_ALLOC) { > > + current->lockdep_reclaim_gfp &= ~__GFP_SKIP_ALLOC; > > + return; > > + } > > + > > mark_held_locks(curr, RECLAIM_FS); > > } > > I might be missing something but does this work actually? Say you would > want a kmalloc(size), it would call > slab_alloc_node > slab_pre_alloc_hook > lockdep_trace_alloc > [...] > ____cache_alloc_node > cache_grow_begin > kmem_getpages > __alloc_pages_node > __alloc_pages_nodemask > lockdep_trace_alloc Bugger :/ You're right, that would fail. So how about doing: #define __GFP_NOLOCKDEP (1u << __GFP_BITS_SHIFT) this means it cannot be part of address_space::flags or radix_tree_root::gfp_mask, but that might not be a bad thing. And this solves the scarcity thing, because per pagemap we need to have 5 'spare' bits anyway. 
> I understand your concerns about the scope but usually all allocations
> have to be __GFP_NOFS or none in the same scope so I would not see it
> as a huge deal.

With scope I mostly meant the fact that you have two calls that you need
to pair up. That's not really nice, as you can 'annotate' a _lot_ of
code in between. I prefer the narrower annotations, where you annotate a
single specific site.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-06-02 15:11 ` Peter Zijlstra @ 2016-06-02 15:46 ` Michal Hocko 2016-06-02 23:22 ` Dave Chinner 0 siblings, 1 reply; 37+ messages in thread From: Michal Hocko @ 2016-06-02 15:46 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Ingo Molnar On Thu 02-06-16 17:11:16, Peter Zijlstra wrote: > On Thu, Jun 02, 2016 at 04:50:49PM +0200, Michal Hocko wrote: > > On Wed 01-06-16 20:16:17, Peter Zijlstra wrote: > > > > So my favourite is the dedicated GFP flag, but if that's unpalatable for > > > the mm folks then something like the below might work. It should be > > > similar in effect to your proposal, except its more limited in scope. > > [...] > > > @@ -2876,11 +2883,36 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags) > > > if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))) > > > return; > > > > > > + /* > > > + * Skip _one_ allocation as per the lockdep_skip_alloc() request. > > > + * Must be done last so that we don't loose the annotation for > > > + * GFP_ATOMIC like things from IRQ or other nesting contexts. > > > + */ > > > + if (current->lockdep_reclaim_gfp & __GFP_SKIP_ALLOC) { > > > + current->lockdep_reclaim_gfp &= ~__GFP_SKIP_ALLOC; > > > + return; > > > + } > > > + > > > mark_held_locks(curr, RECLAIM_FS); > > > } > > > > I might be missing something but does this work actually? Say you would > > want a kmalloc(size), it would call > > slab_alloc_node > > slab_pre_alloc_hook > > lockdep_trace_alloc > > [...] > > ____cache_alloc_node > > cache_grow_begin > > kmem_getpages > > __alloc_pages_node > > __alloc_pages_nodemask > > lockdep_trace_alloc > > Bugger :/ You're right, that would fail. 
>
> So how about doing:
>
> #define __GFP_NOLOCKDEP (1u << __GFP_BITS_SHIFT)

Hmm, now that I looked closer, this would break GFP_SLAB_BUG_MASK :/
The whole thing is a bit hysterical, because I really do not see any
reason to blow up just because somebody has used an incorrect gfp mask
(we have users who pass combinations without any sense in the tree...).
We can fix that either by dropping the whole GFP_SLAB_BUG_MASK thingy
or by updating it with __GFP_NOLOCKDEP. It just shows how this might
get really tricky and subtle.

> this means it cannot be part of address_space::flags or
> radix_tree_root::gfp_mask, but that might not be a bad thing.

True, those shouldn't really care.

> And this solves the scarcity thing, because per pagemap we need to
> have 5 'spare' bits anyway.
>
> > I understand your concerns about the scope but usually all
> > allocations have to be __GFP_NOFS or none in the same scope so I
> > would not see it as a huge deal.
>
> With scope I mostly meant the fact that you have two calls that you
> need to pair up. That's not really nice as you can 'annotate' a _lot_
> of code in between. I prefer the narrower annotations where you
> annotate a single specific site.

Yes, I can see your point. What I meant to say is that we would most
probably end up with the following pattern

	lockdep_trace_alloc_enable()
	some_foo_with_alloc(gfp_mask);
	lockdep_trace_alloc_disable()

and some_foo_with_alloc might be a lot of code. But at the same time we
know that _any_ allocation done from that context is safe from the
reclaim recursion POV. If not, then the annotation is buggy and needs
to be done at a different level - but that would be exactly the same if
we did some_foo_with_alloc(gfp_mask|__GFP_NOLOCKDEP), because all the
allocations down that road would reuse the same gfp mask anyway.

That being said, I completely agree that a single entry point is much
less error prone, but it also is tricky, as we can see. So I would
rather go with something less tricky. It's not like people are not used
to the enable/disable pattern. Anyway, I will leave the decision to
you. If you really insist on __GFP_NOLOCKDEP, which doesn't consume a
new flag bit, then I can review the resulting patch.
-- 
Michal Hocko
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 37+ messages in thread
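[Editor's note: a short arithmetic illustration of why "1u <<
__GFP_BITS_SHIFT" collides with GFP_SLAB_BUG_MASK, added for clarity.
The values mirror the 4.6-era definitions quoted earlier in the thread;
the bug mask here is an approximation (the real one also included
__GFP_DMA32 and __GFP_HIGHMEM), and this is plain arithmetic, not
kernel code.]

```python
# The proposed flag deliberately sits one bit above __GFP_BITS_MASK, so
# any slab-side sanity check that rejects unknown bits would trip on it.

GFP_BITS_SHIFT = 26
GFP_BITS_MASK = (1 << GFP_BITS_SHIFT) - 1
GFP_NOLOCKDEP = 1 << GFP_BITS_SHIFT

# Approximation: slab BUGs on any bit outside the known gfp bits.
GFP_SLAB_BUG_MASK = ~GFP_BITS_MASK

assert GFP_NOLOCKDEP & GFP_BITS_MASK == 0       # not a "real" gfp bit
assert GFP_NOLOCKDEP & GFP_SLAB_BUG_MASK != 0   # so slab would blow up on it

# The fix Michal mentions: either drop the check, or teach it to mask
# the new flag out before testing.
flags = GFP_NOLOCKDEP | 0xC0                    # some ordinary gfp bits + flag
assert (flags & ~GFP_NOLOCKDEP) & GFP_SLAB_BUG_MASK == 0
```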
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-06-02 15:46 ` Michal Hocko @ 2016-06-02 23:22 ` Dave Chinner 2016-06-06 12:20 ` Michal Hocko 0 siblings, 1 reply; 37+ messages in thread From: Dave Chinner @ 2016-06-02 23:22 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Peter Zijlstra, Darrick J. Wong, xfs, Qu Wenruo, Ingo Molnar On Thu, Jun 02, 2016 at 05:46:19PM +0200, Michal Hocko wrote: > On Thu 02-06-16 17:11:16, Peter Zijlstra wrote: > > With scope I mostly meant the fact that you have two calls that you need > > to pair up. That's not really nice as you can 'annotate' a _lot_ of code > > in between. I prefer the narrower annotations where you annotate a > > single specific site. > > Yes, I can see you point. What I meant to say is that we would most > probably end up with the following pattern > lockdep_trace_alloc_enable() > some_foo_with_alloc(gfp_mask); > lockdep_trace_alloc_disable() > > and some_foo_with_alloc might be a lot of code. That's the problem I see with this - the only way to make it maintainable is to precede each enable/disable() pair with a comment explaining *exactly* what those calls are protecting. And that, in itself, becomes a maintenance problem, because then code several layers deep has no idea what context it is being called from and we are likely to disable warnings in contexts where we probably shouldn't be. I think such an annotation approach really requires per-alloc site annotation, the reason for it should be more obvious from the context. e.g. any function that does memory alloc and takes an optional transaction context needs annotation. Hence, from an XFS perspective, I think it makes more sense to add a new KM_ flag to indicate this call site requirement, then jump through whatever lockdep hoop is required within the kmem_* allocation wrappers. 
e.g, we can ignore the new KM_* flag if we are in a transaction context and so the flag is only activated in the situations were we currently enforce an external GFP_NOFS context from the call site..... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-02 23:22                     ` Dave Chinner
@ 2016-06-06 12:20                       ` Michal Hocko
  2016-06-15  7:21                         ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2016-06-06 12:20 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-mm, Peter Zijlstra, Darrick J. Wong, xfs, Qu Wenruo, Ingo Molnar

On Fri 03-06-16 09:22:54, Dave Chinner wrote:
> On Thu, Jun 02, 2016 at 05:46:19PM +0200, Michal Hocko wrote:
> > On Thu 02-06-16 17:11:16, Peter Zijlstra wrote:
> > > With scope I mostly meant the fact that you have two calls that you
> > > need to pair up. That's not really nice as you can 'annotate' a
> > > _lot_ of code in between. I prefer the narrower annotations where
> > > you annotate a single specific site.
> >
> > Yes, I can see your point. What I meant to say is that we would most
> > probably end up with the following pattern
> >
> > 	lockdep_trace_alloc_enable()
> > 	some_foo_with_alloc(gfp_mask);
> > 	lockdep_trace_alloc_disable()
> >
> > and some_foo_with_alloc might be a lot of code.
>
> That's the problem I see with this - the only way to make it
> maintainable is to precede each enable/disable() pair with a comment
> explaining *exactly* what those calls are protecting. And that, in
> itself, becomes a maintenance problem, because then code several
> layers deep has no idea what context it is being called from and we
> are likely to disable warnings in contexts where we probably
> shouldn't be.

I am not sure I understand what you mean here. I thought the problem is
that:

	func_A (!trans. context)	func_B (trans. context)
	  foo1()			  foo2()
	    bar(inode, GFP_KERNEL)	    bar(inode, GFP_NOFS)

so bar(inode, gfp) can be called from two different contexts, which
would confuse lockdep. And the workaround would be annotating bar()
depending on the context it is called from - either pass a special gfp
flag or do the disable/enable thing. In both cases that annotation
should be global for the whole func_A, no?
Or is it possible that something in that path would really need the
reclaim lockdep detection?

> I think such an annotation approach really requires per-alloc-site
> annotation; the reason for it should be more obvious from the
> context. e.g. any function that does memory allocation and takes an
> optional transaction context needs annotation. Hence, from an XFS
> perspective, I think it makes more sense to add a new KM_ flag to
> indicate this call-site requirement, then jump through whatever
> lockdep hoop is required within the kmem_* allocation wrappers.
> e.g. we can ignore the new KM_* flag if we are in a transaction
> context, and so the flag is only activated in the situations where
> we currently enforce an external GFP_NOFS context from the call
> site.....

Hmm, I thought we would achieve this by using the scope GFP_NOFS usage,
which would mark those transaction-related contexts, and no lockdep
specific workarounds would be needed...
-- 
Michal Hocko
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-06 12:20                       ` Michal Hocko
@ 2016-06-15  7:21                         ` Dave Chinner
  2016-06-21 14:26                           ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2016-06-15 7:21 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Peter Zijlstra, Darrick J. Wong, xfs, Qu Wenruo, Ingo Molnar

[sorry for the slow response - been on holidays]

On Mon, Jun 06, 2016 at 02:20:22PM +0200, Michal Hocko wrote:
> On Fri 03-06-16 09:22:54, Dave Chinner wrote:
> > On Thu, Jun 02, 2016 at 05:46:19PM +0200, Michal Hocko wrote:
> > > On Thu 02-06-16 17:11:16, Peter Zijlstra wrote:
> > > > With scope I mostly meant the fact that you have two calls that
> > > > you need to pair up. That's not really nice as you can 'annotate'
> > > > a _lot_ of code in between. I prefer the narrower annotations
> > > > where you annotate a single specific site.
> > >
> > > Yes, I can see your point. What I meant to say is that we would
> > > most probably end up with the following pattern
> > >
> > > 	lockdep_trace_alloc_enable()
> > > 	some_foo_with_alloc(gfp_mask);
> > > 	lockdep_trace_alloc_disable()
> > >
> > > and some_foo_with_alloc might be a lot of code.
> >
> > That's the problem I see with this - the only way to make it
> > maintainable is to precede each enable/disable() pair with a comment
> > explaining *exactly* what those calls are protecting. And that, in
> > itself, becomes a maintenance problem, because then code several
> > layers deep has no idea what context it is being called from and we
> > are likely to disable warnings in contexts where we probably
> > shouldn't be.
>
> I am not sure I understand what you mean here. I thought the problem
> is that:
>
> 	func_A (!trans. context)	func_B (trans. context)
> 	  foo1()			  foo2()
> 	    bar(inode, GFP_KERNEL)	    bar(inode, GFP_NOFS)
>
> so bar(inode, gfp) can be called from two different contexts, which
> would confuse lockdep.

Yes, that's the core of the problem.
What I think you are missing is the scale of the problem.

> And the workaround would be annotating bar depending on the context it
> is called from - either pass a special gfp flag or do the
> disable/enable thing. In both cases that annotation should be global
> for the whole func_A, no? Or is it possible that something in that
> path would really need a reclaim lockdep detection?

The problem is that there are cases where the call stack that leads to
bar() has many different entry points. See, for example, the
xfs_bmapi*() interfaces. They all end up in the same low-level btree
traversal code (and hence memory allocation points).

xfs_bmapi_read() can be called from both inside and outside transaction
context, and there are ~30 callers we'd have to audit and annotate.
Then there are ~10 callers of xfs_bmapi_write(), which are all within
transaction context. And then there's xfs_bmapi_delay(), which can end
up in the same low-level code outside transaction context. Then there
are 10 callers of xfs_bunmapi(), which runs both inside and outside
transaction context, too. Add to that all the miscellaneous points that
can read extents off disk, and you get another ~12 entry points.

Hopefully you can see the complexity of the issue - for an allocation
in the bmap btree code that could occur both inside and outside of a
transaction context, we've got to work out which of those ~60
high-level entry points would need to be annotated. And then we have to
ensure that in future we don't miss adding or removing an annotation as
we change the code deep inside the btree implementation. It's the
latter that is the long-term maintenance problem the high-level
annotation approach introduces.

> > I think such an annotation approach really requires per-alloc site
> > annotation, the reason for it should be more obvious from the
> > context. e.g. any function that does memory alloc and takes an
> > optional transaction context needs annotation.
Hence, from an XFS > > perspective, I think it makes more sense to add a new KM_ flag to > > indicate this call site requirement, then jump through whatever > > lockdep hoop is required within the kmem_* allocation wrappers. > > e.g, we can ignore the new KM_* flag if we are in a transaction > > context and so the flag is only activated in the situations where > > we currently enforce an external GFP_NOFS context from the call > > site..... > > Hmm, I thought we would achieve this by using the scope GFP_NOFS usage > which would mark those transaction related contexts and no lockdep > specific workarounds would be needed... There are allocations outside transaction context which need to be GFP_NOFS - this is what KM_NOFS was originally intended for. We need to disambiguate the two cases where we use KM_NOFS to shut up lockdep vs the cases where it is necessary to prevent reclaim deadlocks. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 37+ messages in thread
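[Editorial aside: Dave's proposal might look roughly like the sketch below. KM_NOLOCKDEP and GFP_NOLOCKDEP are invented placeholder names — nothing with these names exists in fs/xfs/kmem.h at this point in the thread — and transaction context is reduced to a boolean where the real code would test something like PF_FSTRANS on the current task.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of a kmem flag-conversion helper that honours a new lockdep
 * opt-out flag only where it matters: outside transaction context,
 * i.e. exactly the call sites that currently pass KM_NOFS just to
 * silence lockdep.
 */
#define KM_SLEEP	0x0001u
#define KM_NOFS		0x0004u		/* force GFP_NOFS for reclaim safety */
#define KM_NOLOCKDEP	0x0010u		/* hypothetical: silence lockdep only */

#define GFP_KERNEL	0x1u		/* simplified stand-ins for gfp bits */
#define GFP_NOFS	0x2u
#define GFP_NOLOCKDEP	0x4u

static unsigned int kmem_flags_convert(unsigned int km_flags, bool in_transaction)
{
	unsigned int gfp;

	/* inside a transaction every allocation is already GFP_NOFS */
	if (in_transaction || (km_flags & KM_NOFS))
		gfp = GFP_NOFS;
	else
		gfp = GFP_KERNEL;

	/*
	 * The opt-out is ignored in transaction context: the NOFS context
	 * already keeps lockdep's reclaim tracking quiet there.
	 */
	if ((km_flags & KM_NOLOCKDEP) && !in_transaction)
		gfp |= GFP_NOLOCKDEP;

	return gfp;
}
```

Under this scheme a call site like the refcount btree cursor allocation would ask for KM_NOLOCKDEP instead of KM_NOFS, keeping __GFP_FS reclaim available while suppressing only the false positive.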
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-06-15 7:21 ` Dave Chinner @ 2016-06-21 14:26 ` Michal Hocko 2016-06-22 1:03 ` Dave Chinner 0 siblings, 1 reply; 37+ messages in thread From: Michal Hocko @ 2016-06-21 14:26 UTC (permalink / raw) To: Dave Chinner Cc: linux-mm, Peter Zijlstra, Darrick J. Wong, xfs, Qu Wenruo, Ingo Molnar On Wed 15-06-16 17:21:54, Dave Chinner wrote: [...] > Hopefully you can see the complexity of the issue - for an allocation > in the bmap btree code that could occur outside both inside and > outside of a transaction context, we've got to work out which of > those ~60 high level entry points would need to be annotated. And > then we have to ensure that in future we don't miss adding or > removing an annotation as we change the code deep inside the btree > implementation. It's the latter that is the long term maintainence > problem the hihg-level annotation approach introduces. Sure I can see the complexity here. I might still see this over simplified but I originally thought that the annotation would be used at the highest level which never gets called from the transaction or other NOFS context. So all the layers down would inherit that automatically. I guess that such a place can be identified from the lockdep report by a trained eye. > > > I think such an annotation approach really requires per-alloc site > > > annotation, the reason for it should be more obvious from the > > > context. e.g. any function that does memory alloc and takes an > > > optional transaction context needs annotation. Hence, from an XFS > > > perspective, I think it makes more sense to add a new KM_ flag to > > > indicate this call site requirement, then jump through whatever > > > lockdep hoop is required within the kmem_* allocation wrappers. 
> > > e.g, we can ignore the new KM_* flag if we are in a transaction > > > context and so the flag is only activated in the situations were > > > we currently enforce an external GFP_NOFS context from the call > > > site..... > > > > Hmm, I thought we would achive this by using the scope GFP_NOFS usage > > which would mark those transaction related conctexts and no lockdep > > specific workarounds would be needed... > > There are allocations outside transaction context which need to be > GFP_NOFS - this is what KM_NOFS was originally intended for. Is it feasible to mark those by the scope NOFS api as well and drop the direct KM_NOFS usage? This should help to identify those that are lockdep only and use the annotation to prevent from the false positives. -- Michal Hocko SUSE Labs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-06-21 14:26 ` Michal Hocko @ 2016-06-22 1:03 ` Dave Chinner 2016-06-22 12:38 ` Michal Hocko 0 siblings, 1 reply; 37+ messages in thread From: Dave Chinner @ 2016-06-22 1:03 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Peter Zijlstra, Darrick J. Wong, xfs, Qu Wenruo, Ingo Molnar On Tue, Jun 21, 2016 at 04:26:28PM +0200, Michal Hocko wrote: > On Wed 15-06-16 17:21:54, Dave Chinner wrote: > [...] > > Hopefully you can see the complexity of the issue - for an allocation > > in the bmap btree code that could occur outside both inside and > > outside of a transaction context, we've got to work out which of > > those ~60 high level entry points would need to be annotated. And > > then we have to ensure that in future we don't miss adding or > > removing an annotation as we change the code deep inside the btree > > implementation. It's the latter that is the long term maintainence > > problem the hihg-level annotation approach introduces. > > Sure I can see the complexity here. I might still see this over > simplified but I originally thought that the annotation would be used at > the highest level which never gets called from the transaction or other > NOFS context. So all the layers down would inherit that automatically. I > guess that such a place can be identified from the lockdep report by a > trained eye. Which, as I said before, effectively becomes "turn off lockdep reclaim context checking at all XFS entry points". Yes, we could do that, but it's a "big hammer" solution and there are more entry points than there are memory allocations that need annotations.... > > > > I think such an annotation approach really requires per-alloc site > > > > annotation, the reason for it should be more obvious from the > > > > context. e.g. any function that does memory alloc and takes an > > > > optional transaction context needs annotation. 
Hence, from an XFS > > > > perspective, I think it makes more sense to add a new KM_ flag to > > > > indicate this call site requirement, then jump through whatever > > > > lockdep hoop is required within the kmem_* allocation wrappers. > > > > e.g, we can ignore the new KM_* flag if we are in a transaction > > > > context and so the flag is only activated in the situations were > > > > we currently enforce an external GFP_NOFS context from the call > > > > site..... > > > > > > Hmm, I thought we would achive this by using the scope GFP_NOFS usage > > > which would mark those transaction related conctexts and no lockdep > > > specific workarounds would be needed... > > > > There are allocations outside transaction context which need to be > > GFP_NOFS - this is what KM_NOFS was originally intended for. > > Is it feasible to mark those by the scope NOFS api as well and drop > the direct KM_NOFS usage? This should help to identify those that are > lockdep only and use the annotation to prevent from the false positives. I don't understand what you are suggesting here. This all started because we use GFP_NOFS in a handful of places to shut up lockdep and you didn't want us to use GFP_NOFS like that. Now it sounds to me like you are advocating setting unconditional GFP_NOFS allocation contexts for entire XFS code paths - whether it's necessary or not - to avoid problems with lockdep false positives. I'm clearly not understanding something here.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-06-22 1:03 ` Dave Chinner @ 2016-06-22 12:38 ` Michal Hocko 2016-06-22 22:58 ` Dave Chinner 0 siblings, 1 reply; 37+ messages in thread From: Michal Hocko @ 2016-06-22 12:38 UTC (permalink / raw) To: Dave Chinner Cc: linux-mm, Peter Zijlstra, Darrick J. Wong, xfs, Qu Wenruo, Ingo Molnar On Wed 22-06-16 11:03:20, Dave Chinner wrote: > On Tue, Jun 21, 2016 at 04:26:28PM +0200, Michal Hocko wrote: > > On Wed 15-06-16 17:21:54, Dave Chinner wrote: [...] > > > There are allocations outside transaction context which need to be > > > GFP_NOFS - this is what KM_NOFS was originally intended for. > > > > Is it feasible to mark those by the scope NOFS api as well and drop > > the direct KM_NOFS usage? This should help to identify those that are > > lockdep only and use the annotation to prevent from the false positives. > > I don't understand what you are suggesting here. This all started > because we use GFP_NOFS in a handful of places to shut up lockdep > and you didn't want us to use GFP_NOFS like that. Now it sounds to > me like you are advocating setting unconditional GFP_NOFS allocation > contexts for entire XFS code paths - whether it's necessary or > not - to avoid problems with lockdep false positives. No, I meant that only those paths which need GFP_NOFS for other than lockdep purposes would use the scope API. Anyway, it seems that we are not getting closer to a desired solution here. Or I am not following it, at least... It seems that we have effectively two possibilities (from the MM/lockdep POV): either add an explicit API to disable the reclaim lockdep machinery for all allocations in a certain scope, or a GFP mask to achieve the same for a particular allocation. Which one would work better for the XFS use case? 
-- Michal Hocko SUSE Labs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-06-22 12:38 ` Michal Hocko @ 2016-06-22 22:58 ` Dave Chinner 2016-06-23 11:35 ` Michal Hocko 0 siblings, 1 reply; 37+ messages in thread From: Dave Chinner @ 2016-06-22 22:58 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Peter Zijlstra, Darrick J. Wong, xfs, Qu Wenruo, Ingo Molnar On Wed, Jun 22, 2016 at 02:38:22PM +0200, Michal Hocko wrote: > On Wed 22-06-16 11:03:20, Dave Chinner wrote: > > On Tue, Jun 21, 2016 at 04:26:28PM +0200, Michal Hocko wrote: > > > On Wed 15-06-16 17:21:54, Dave Chinner wrote: > [...] > > > > There are allocations outside transaction context which need to be > > > > GFP_NOFS - this is what KM_NOFS was originally intended for. > > > > > > Is it feasible to mark those by the scope NOFS api as well and drop > > > the direct KM_NOFS usage? This should help to identify those that are > > > lockdep only and use the annotation to prevent from the false positives. > > > > I don't understand what you are suggesting here. This all started > > because we use GFP_NOFS in a handful of places to shut up lockdep > > and you didn't want us to use GFP_NOFS like that. Now it sounds to > > me like you are advocating setting unconditional GFP_NOFS allocation > > contexts for entire XFS code paths - whether it's necessary or > > not - to avoid problems with lockdep false positives. > > No, I meant only those paths which need GFP_NOFS for other than lockdep > purposes would use the scope api. > > Anyway, it seems that we are not getting closer to a desired solution > here. Or I am not following it at least... > > It seems that we have effectively two possibilities (from the > MM/lockdep) POV. Either add an explicit API to disable the reclaim > lockdep machinery for all allocation in a certain scope or a GFP mask > to to achieve the same for a particular allocation. Which one would work > better for the xfs usecase? As I've said - if we annotate the XFS call sites appropriately (e.g. 
KM_NOLOCKDEP rather than KM_NOFS), we don't care what lockdep mechanism is used to turn off warnings as it will be wholly encapsulated inside kmem_alloc() and friends. This will end up similar to how we currently encapsulate the memalloc_noio_save() wrappers in kmem_zalloc_large(). IOWs, it doesn't matter to XFS whether it be a GFP flag or a PF flag here, because it's not going to be exposed to the higher level code. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 37+ messages in thread
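[Editorial aside: the encapsulation Dave describes can be sketched as follows. All names are placeholders; the save/restore shape mirrors the memalloc_noio_save() pattern he mentions, with a global flags word standing in for current->flags and a counter standing in for lockdep's reclaim tracking. The point of the sketch is that callers only pass KM_NOLOCKDEP — whether a PF_ or GFP_ mechanism backs it stays hidden inside the wrapper.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define KM_NOLOCKDEP	0x0010u		/* hypothetical kmem flag */
#define PF_NOLOCKDEP	0x1u		/* hypothetical task flag */

static unsigned long current_flags;	/* stand-in for current->flags */
static int reclaim_marks;		/* lockdep reclaim-tracking events */

/* Mirrors the memalloc_noio_save()/memalloc_noio_restore() pattern */
static unsigned long memalloc_nolockdep_save(void)
{
	unsigned long old = current_flags & PF_NOLOCKDEP;

	current_flags |= PF_NOLOCKDEP;
	return old;
}

static void memalloc_nolockdep_restore(unsigned long old)
{
	current_flags = (current_flags & ~PF_NOLOCKDEP) | old;
}

/* Stand-in for __lockdep_trace_alloc(): skip tracking when opted out */
static void lockdep_trace_alloc(void)
{
	if (current_flags & PF_NOLOCKDEP)
		return;
	reclaim_marks++;
}

/* The opt-out lives entirely inside the allocation wrapper */
static void *kmem_alloc(size_t size, unsigned int km_flags)
{
	unsigned long saved = 0;
	void *p;

	if (km_flags & KM_NOLOCKDEP)
		saved = memalloc_nolockdep_save();
	p = malloc(size);
	lockdep_trace_alloc();
	if (km_flags & KM_NOLOCKDEP)
		memalloc_nolockdep_restore(saved);
	return p;
}
```

Swapping the backend for a gfp bit would only change the wrapper body (OR __GFP_NOLOCKDEP into the mask instead of flipping the task flag); no caller would notice.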
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-06-22 22:58 ` Dave Chinner @ 2016-06-23 11:35 ` Michal Hocko 0 siblings, 0 replies; 37+ messages in thread From: Michal Hocko @ 2016-06-23 11:35 UTC (permalink / raw) To: Peter Zijlstra, Dave Chinner Cc: linux-mm, Qu Wenruo, xfs, Ingo Molnar, Darrick J. Wong On Thu 23-06-16 08:58:16, Dave Chinner wrote: > On Wed, Jun 22, 2016 at 02:38:22PM +0200, Michal Hocko wrote: > > On Wed 22-06-16 11:03:20, Dave Chinner wrote: > > > On Tue, Jun 21, 2016 at 04:26:28PM +0200, Michal Hocko wrote: > > > > On Wed 15-06-16 17:21:54, Dave Chinner wrote: > > [...] > > > > > There are allocations outside transaction context which need to be > > > > > GFP_NOFS - this is what KM_NOFS was originally intended for. > > > > > > > > Is it feasible to mark those by the scope NOFS api as well and drop > > > > the direct KM_NOFS usage? This should help to identify those that are > > > > lockdep only and use the annotation to prevent from the false positives. > > > > > > I don't understand what you are suggesting here. This all started > > > because we use GFP_NOFS in a handful of places to shut up lockdep > > > and you didn't want us to use GFP_NOFS like that. Now it sounds to > > > me like you are advocating setting unconditional GFP_NOFS allocation > > > contexts for entire XFS code paths - whether it's necessary or > > > not - to avoid problems with lockdep false positives. > > > > No, I meant only those paths which need GFP_NOFS for other than lockdep > > purposes would use the scope api. > > > > Anyway, it seems that we are not getting closer to a desired solution > > here. Or I am not following it at least... > > > > It seems that we have effectively two possibilities (from the > > MM/lockdep) POV. Either add an explicit API to disable the reclaim > > lockdep machinery for all allocation in a certain scope or a GFP mask > > to to achieve the same for a particular allocation. Which one would work > > better for the xfs usecase? 
> > As I've said - if we annotate the XFS call sites appropriately (e.g. > KM_NOLOCKDEP rather than KM_NOFS), we don't care what lockdep > mechanism is used to turn off warnings as it will be wholly > encapsulated inside kmem_alloc() and friends. This will end up > similar to how we are currently encapsulate the memalloc_noio_save() > wrappers in kmem_zalloc_large(). OK, I see. So which way do we go Peter? Are you going to send the GFP one or is there a way to bribe you to go with a thread flag? -- Michal Hocko SUSE Labs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-06-01 18:16 ` Peter Zijlstra 2016-06-02 14:50 ` Michal Hocko @ 2016-10-06 13:04 ` Michal Hocko 2016-10-17 13:49 ` Michal Hocko ` (2 more replies) 1 sibling, 3 replies; 37+ messages in thread From: Michal Hocko @ 2016-10-06 13:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Chinner, Darrick J. Wong, Qu Wenruo, xfs, linux-mm, Ingo Molnar [Let me resurrect this thread] On Wed 01-06-16 20:16:17, Peter Zijlstra wrote: > On Wed, Jun 01, 2016 at 03:17:58PM +0200, Michal Hocko wrote: > > Thanks Dave for your detailed explanation again! Peter do you have any > > other idea how to deal with these situations other than opt out from > > lockdep reclaim machinery? > > > > If not I would rather go with an annotation than a gfp flag to be honest > > but if you absolutely hate that approach then I will try to check whether > > a CONFIG_LOCKDEP GFP_FOO doesn't break something else. Otherwise I would > > steal the description from Dave's email and repost my patch. > > > > I plan to repost my scope gfp patches in a few days and it would be good > > to have some mechanism to drop those GFP_NOFS to paper over lockdep > > false positives for that. > > Right; sorry I got side-tracked in other things again. > > So my favourite is the dedicated GFP flag, but if that's unpalatable for > the mm folks then something like the below might work. It should be > similar in effect to your proposal, except it's more limited in scope. OK, so the situation with the GFP flags is somewhat relieved after http://lkml.kernel.org/r/20160912114852.GI14524@dhcp22.suse.cz and with the root radix tree remaining the last user which mangles gfp_mask and tags together we have a few bits left there. As you apparently hate any scoped API and Dave thinks that a per-allocation flag is the only maintainable way for xfs what do you think about the following? 
--- >From 04b3923e5b12f0eb3859f0718881fa0f40e60164 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.com> Date: Fri, 13 May 2016 17:47:31 +0200 Subject: [PATCH] lockdep: allow to disable reclaim lockup detection The current implementation of the reclaim lockup detection can lead to false positives and those even happen and usually lead to tweak the code to silence the lockdep by using GFP_NOFS even though the context can use __GFP_FS just fine. See http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example. ================================= [ INFO: inconsistent lock state ] 4.5.0-rc2+ #4 Tainted: G O --------------------------------- inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage. kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes: (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs] {RECLAIM_FS-ON-R} state was registered at: [<ffffffff8110f369>] mark_held_locks+0x79/0xa0 [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100 [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230 [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs] [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs] [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs] [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs] [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs] [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs] [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670 [<ffffffff81244459>] SyS_ioctl+0x79/0x90 [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f CPU0 ---- lock(&xfs_nondir_ilock_class); <Interrupt> lock(&xfs_nondir_ilock_class); *** DEADLOCK *** 3 locks held by kswapd0/543: stack backtrace: CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0 ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001 ffff880000000001 000000000000000b 
0000000000000008 ffff88003d855aa0 Call Trace: [<ffffffff813a14f9>] dump_stack+0x4b/0x72 [<ffffffff8110ec65>] print_usage_bug+0x215/0x240 [<ffffffff8110ee85>] mark_lock+0x1f5/0x660 [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0 [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50 [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230 [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs] [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs] [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs] [<ffffffff8110aace>] down_write_nested+0x5e/0xc0 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs] [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs] [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs] [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs] [<ffffffff8124d7d5>] evict+0xc5/0x190 [<ffffffff8124d8d9>] dispose_list+0x39/0x60 [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60 [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0 [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0 [<ffffffff811e50ee>] shrink_zone+0x15e/0x170 [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80 [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230 [<ffffffff810e6882>] kthread+0xf2/0x110 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220 [<ffffffff8184803f>] ret_from_fork+0x3f/0x70 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220 To quote Dave: " Ignoring whether reflink should be doing anything or not, that's a "xfs_refcountbt_init_cursor() gets called both outside and inside transactions" lockdep false positive case. The problem here is lockdep has seen this allocation from within a transaction, hence a GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context. Also note that we have an active reference to this inode. 
So, because the reclaim annotations overload the interrupt level detections and it's seen the inode ilock been taken in reclaim ("interrupt") context, this triggers a reclaim context warning where it thinks it is unsafe to do this allocation in GFP_KERNEL context holding the inode ilock... " This sounds like a fundamental problem of the reclaim lock detection. It is really impossible to annotate such a special usecase IMHO unless the reclaim lockup detection is reworked completely. Until then it is much better to provide a way to add "I know what I am doing flag" and mark problematic places. This would prevent from abusing GFP_NOFS flag which has a runtime effect even on configurations which have lockdep disabled. Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to skip the current allocation request. While we are at it also make sure that the radix tree doesn't accidentaly override tags stored in the upper part of the gfp_mask. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Michal Hocko <mhocko@suse.com> --- include/linux/gfp.h | 10 +++++++++- kernel/locking/lockdep.c | 4 ++++ lib/radix-tree.c | 1 + 3 files changed, 14 insertions(+), 1 deletion(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 3e6c48dbe6b9..cee3d5fa3821 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -41,6 +41,11 @@ struct vm_area_struct; #define ___GFP_OTHER_NODE 0x800000u #define ___GFP_WRITE 0x1000000u #define ___GFP_KSWAPD_RECLAIM 0x2000000u +#ifdef CONFIG_LOCKDEP +#define ___GFP_NOLOCKDEP 0x4000000u +#else +#define ___GFP_NOLOCKDEP 0 +#endif /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -186,8 +191,11 @@ struct vm_area_struct; #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) +/* Disable lockdep for GFP context tracking */ +#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP) + /* Room for N __GFP_FOO bits */ -#define 
__GFP_BITS_SHIFT 26 +#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP)) #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /* diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index d96c6e058467..a652ac8b3cfa 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -2871,6 +2871,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags) if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))) return; + /* Disable lockdep if explicitly requested */ + if (gfp_mask & __GFP_NOLOCKDEP) + return; + mark_held_locks(curr, RECLAIM_FS); } diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 1b7bf7314141..3154403d30e8 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -1672,6 +1672,7 @@ static int radix_tree_callback(struct notifier_block *nfb, void __init radix_tree_init(void) { + BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32); radix_tree_node_cachep = kmem_cache_create("radix_tree_node", sizeof(struct radix_tree_node), 0, SLAB_PANIC | SLAB_RECLAIM_ACCOUNT, -- 2.9.3 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 37+ messages in thread
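[Editorial aside: the bit arithmetic in the patch can be sanity-checked in isolation. The sketch below hard-codes CONFIG_LOCKDEP=y (so IS_ENABLED() is 1) and assumes RADIX_TREE_MAX_TAGS is 3 — its value in kernels of that era, not stated in the patch itself — which is what makes the new BUILD_BUG_ON hold: the radix tree stores its per-slot tags in the gfp bits above __GFP_BITS_SHIFT, so 3 + 27 must not exceed 32.]

```c
#include <assert.h>

/* Bit layout from the patch above, with CONFIG_LOCKDEP=y */
#define ___GFP_KSWAPD_RECLAIM	0x2000000u
#define ___GFP_NOLOCKDEP	0x4000000u	/* the new, highest gfp bit */
#define __GFP_BITS_SHIFT	(26 + 1)	/* IS_ENABLED(CONFIG_LOCKDEP) == 1 */
#define __GFP_BITS_MASK		((1u << __GFP_BITS_SHIFT) - 1)
#define RADIX_TREE_MAX_TAGS	3		/* assumed value at the time */

/* Returns nonzero when the layout is internally consistent */
static int gfp_layout_ok(void)
{
	return (___GFP_NOLOCKDEP == 1u << 26) &&	  /* next free bit */
	       (___GFP_NOLOCKDEP & __GFP_BITS_MASK) &&	  /* inside the mask */
	       !(___GFP_NOLOCKDEP & ___GFP_KSWAPD_RECLAIM) && /* no overlap */
	       (RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT <= 32); /* BUILD_BUG_ON */
}
```

With CONFIG_LOCKDEP=n the patch keeps ___GFP_NOLOCKDEP defined as 0 and __GFP_BITS_SHIFT at 26, so the flag costs nothing on production configurations.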
* Re: Xfs lockdep warning with for-dave-for-4.6 branch 2016-10-06 13:04 ` Michal Hocko @ 2016-10-17 13:49 ` Michal Hocko 2016-10-19 0:33 ` Dave Chinner 2016-10-19 8:33 ` Peter Zijlstra 2 siblings, 0 replies; 37+ messages in thread From: Michal Hocko @ 2016-10-17 13:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Chinner, Darrick J. Wong, Qu Wenruo, xfs, linux-mm, Ingo Molnar On Thu 06-10-16 15:04:54, Michal Hocko wrote: > [Let me ressurect this thread] ping > On Wed 01-06-16 20:16:17, Peter Zijlstra wrote: > > On Wed, Jun 01, 2016 at 03:17:58PM +0200, Michal Hocko wrote: > > > Thanks Dave for your detailed explanation again! Peter do you have any > > > other idea how to deal with these situations other than opt out from > > > lockdep reclaim machinery? > > > > > > If not I would rather go with an annotation than a gfp flag to be honest > > > but if you absolutely hate that approach then I will try to check wheter > > > a CONFIG_LOCKDEP GFP_FOO doesn't break something else. Otherwise I would > > > steal the description from Dave's email and repost my patch. > > > > > > I plan to repost my scope gfp patches in few days and it would be good > > > to have some mechanism to drop those GFP_NOFS to paper over lockdep > > > false positives for that. > > > > Right; sorry I got side-tracked in other things again. > > > > So my favourite is the dedicated GFP flag, but if that's unpalatable for > > the mm folks then something like the below might work. It should be > > similar in effect to your proposal, except its more limited in scope. > > OK, so the situation with the GFP flags is somehow relieved after > http://lkml.kernel.org/r/20160912114852.GI14524@dhcp22.suse.cz and with > the root radix tree remaining the last user which mangles gfp_mask and > tags together we have some few bits left there. As you apparently hate > any scoped API and Dave thinks that per allocation flag is the only > maintainable way for xfs what do you think about the following? 
> --- > From 04b3923e5b12f0eb3859f0718881fa0f40e60164 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.com> > Date: Fri, 13 May 2016 17:47:31 +0200 > Subject: [PATCH] lockdep: allow to disable reclaim lockup detection > > The current implementation of the reclaim lockup detection can lead to > false positives and those even happen and usually lead to tweak the > code to silence the lockdep by using GFP_NOFS even though the context > can use __GFP_FS just fine. See > http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example. > > ================================= > [ INFO: inconsistent lock state ] > 4.5.0-rc2+ #4 Tainted: G O > --------------------------------- > inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage. > kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes: > > (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs] > > {RECLAIM_FS-ON-R} state was registered at: > [<ffffffff8110f369>] mark_held_locks+0x79/0xa0 > [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100 > [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230 > [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs] > [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs] > [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs] > [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs] > [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs] > [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs] > [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670 > [<ffffffff81244459>] SyS_ioctl+0x79/0x90 > [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f > > CPU0 > ---- > lock(&xfs_nondir_ilock_class); > <Interrupt> > lock(&xfs_nondir_ilock_class); > > *** DEADLOCK *** > > 3 locks held by kswapd0/543: > > stack backtrace: > CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4 > > Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 > > ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0 > 
ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001 > ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0 > Call Trace: > [<ffffffff813a14f9>] dump_stack+0x4b/0x72 > [<ffffffff8110ec65>] print_usage_bug+0x215/0x240 > [<ffffffff8110ee85>] mark_lock+0x1f5/0x660 > [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0 > [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50 > [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230 > [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs] > [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0 > [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs] > [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs] > [<ffffffff8110aace>] down_write_nested+0x5e/0xc0 > [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs] > [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs] > [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs] > [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs] > [<ffffffff8124d7d5>] evict+0xc5/0x190 > [<ffffffff8124d8d9>] dispose_list+0x39/0x60 > [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60 > [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0 > [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0 > [<ffffffff811e50ee>] shrink_zone+0x15e/0x170 > [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80 > [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230 > [<ffffffff810e6882>] kthread+0xf2/0x110 > [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220 > [<ffffffff8184803f>] ret_from_fork+0x3f/0x70 > [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220 > > To quote Dave: > " > Ignoring whether reflink should be doing anything or not, that's a > "xfs_refcountbt_init_cursor() gets called both outside and inside > transactions" lockdep false positive case. The problem here is > lockdep has seen this allocation from within a transaction, hence a > GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context. 
> Also note that we have an active reference to this inode. > > So, because the reclaim annotations overload the interrupt level > detections and it's seen the inode ilock been taken in reclaim > ("interrupt") context, this triggers a reclaim context warning where > it thinks it is unsafe to do this allocation in GFP_KERNEL context > holding the inode ilock... > " > > This sounds like a fundamental problem of the reclaim lock detection. > It is really impossible to annotate such a special usecase IMHO unless > the reclaim lockup detection is reworked completely. Until then it > is much better to provide a way to add "I know what I am doing flag" > and mark problematic places. This would prevent from abusing GFP_NOFS > flag which has a runtime effect even on configurations which have > lockdep disabled. > > Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to > skip the current allocation request. > > While we are at it also make sure that the radix tree doesn't > accidentaly override tags stored in the upper part of the gfp_mask. 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/gfp.h      | 10 +++++++++-
>  kernel/locking/lockdep.c |  4 ++++
>  lib/radix-tree.c         |  1 +
>  3 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 3e6c48dbe6b9..cee3d5fa3821 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -41,6 +41,11 @@ struct vm_area_struct;
>  #define ___GFP_OTHER_NODE	0x800000u
>  #define ___GFP_WRITE		0x1000000u
>  #define ___GFP_KSWAPD_RECLAIM	0x2000000u
> +#ifdef CONFIG_LOCKDEP
> +#define ___GFP_NOLOCKDEP	0x4000000u
> +#else
> +#define ___GFP_NOLOCKDEP	0
> +#endif
>  /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>
>  /*
> @@ -186,8 +191,11 @@ struct vm_area_struct;
>  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)
>
> +/* Disable lockdep for GFP context tracking */
> +#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
> +
>  /* Room for N __GFP_FOO bits */
> -#define __GFP_BITS_SHIFT 26
> +#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>
>  /*
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index d96c6e058467..a652ac8b3cfa 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -2871,6 +2871,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
>  	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
>  		return;
>
> +	/* Disable lockdep if explicitly requested */
> +	if (gfp_mask & __GFP_NOLOCKDEP)
> +		return;
> +
>  	mark_held_locks(curr, RECLAIM_FS);
>  }
>
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 1b7bf7314141..3154403d30e8 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -1672,6 +1672,7 @@ static int radix_tree_callback(struct notifier_block *nfb,
>
>  void __init radix_tree_init(void)
>  {
> +	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
>  	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
>  			sizeof(struct radix_tree_node), 0,
>  			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
> --
> 2.9.3
>
> --
> Michal Hocko
> SUSE Labs

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-10-06 13:04 ` Michal Hocko
  2016-10-17 13:49 ` Michal Hocko
@ 2016-10-19  0:33 ` Dave Chinner
  2016-10-19  5:30 ` Dave Chinner
  2016-10-19  8:33 ` Peter Zijlstra
  2 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2016-10-19 0:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Peter Zijlstra, Darrick J. Wong, Qu Wenruo, xfs, linux-mm,
	Ingo Molnar

On Thu, Oct 06, 2016 at 03:04:54PM +0200, Michal Hocko wrote:
> [Let me resurrect this thread]
>
> On Wed 01-06-16 20:16:17, Peter Zijlstra wrote:
> > On Wed, Jun 01, 2016 at 03:17:58PM +0200, Michal Hocko wrote:
> > > Thanks Dave for your detailed explanation again! Peter, do you have
> > > any other idea how to deal with these situations, other than opting
> > > out of the lockdep reclaim machinery?
> > >
> > > If not, I would rather go with an annotation than a gfp flag, to be
> > > honest, but if you absolutely hate that approach then I will try to
> > > check whether a CONFIG_LOCKDEP GFP_FOO doesn't break something else.
> > > Otherwise I would steal the description from Dave's email and repost
> > > my patch.
> > >
> > > I plan to repost my scope gfp patches in a few days and it would be
> > > good to have some mechanism to drop those GFP_NOFS to paper over
> > > lockdep false positives for that.
> >
> > Right; sorry, I got side-tracked by other things again.
> >
> > So my favourite is the dedicated GFP flag, but if that's unpalatable
> > for the mm folks then something like the below might work. It should
> > be similar in effect to your proposal, except it's more limited in
> > scope.
>
> OK, so the situation with the GFP flags is somewhat relieved after
> http://lkml.kernel.org/r/20160912114852.GI14524@dhcp22.suse.cz and, with
> the root radix tree remaining the last user which mangles gfp_mask and
> tags together, we have a few bits left there. As you apparently hate
> any scoped API, and Dave thinks that a per-allocation flag is the only
> maintainable way for xfs, what do you think about the following?

It's a workable solution to allow XFS to play whack-a-mole games with
lockdep again. As to the implementation - that's for other people to
decide....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-10-19  0:33 ` Dave Chinner
@ 2016-10-19  5:30 ` Dave Chinner
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Chinner @ 2016-10-19 5:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Peter Zijlstra, Darrick J. Wong, Qu Wenruo, linux-xfs, linux-mm,
	Ingo Molnar

[resend with the xfs list corrected.]

On Thu, Oct 06, 2016 at 03:04:54PM +0200, Michal Hocko wrote:
> [Let me resurrect this thread]
>
> On Wed 01-06-16 20:16:17, Peter Zijlstra wrote:
> > On Wed, Jun 01, 2016 at 03:17:58PM +0200, Michal Hocko wrote:
> > > Thanks Dave for your detailed explanation again! Peter, do you have
> > > any other idea how to deal with these situations, other than opting
> > > out of the lockdep reclaim machinery?
> > >
> > > If not, I would rather go with an annotation than a gfp flag, to be
> > > honest, but if you absolutely hate that approach then I will try to
> > > check whether a CONFIG_LOCKDEP GFP_FOO doesn't break something else.
> > > Otherwise I would steal the description from Dave's email and repost
> > > my patch.
> > >
> > > I plan to repost my scope gfp patches in a few days and it would be
> > > good to have some mechanism to drop those GFP_NOFS to paper over
> > > lockdep false positives for that.
> >
> > Right; sorry, I got side-tracked by other things again.
> >
> > So my favourite is the dedicated GFP flag, but if that's unpalatable
> > for the mm folks then something like the below might work. It should
> > be similar in effect to your proposal, except it's more limited in
> > scope.
>
> OK, so the situation with the GFP flags is somewhat relieved after
> http://lkml.kernel.org/r/20160912114852.GI14524@dhcp22.suse.cz and, with
> the root radix tree remaining the last user which mangles gfp_mask and
> tags together, we have a few bits left there. As you apparently hate
> any scoped API, and Dave thinks that a per-allocation flag is the only
> maintainable way for xfs, what do you think about the following?

It's a workable solution to allow XFS to play whack-a-mole games with
lockdep again. As to the implementation - that's for other people to
decide....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-10-06 13:04 ` Michal Hocko
  2016-10-17 13:49 ` Michal Hocko
  2016-10-19  0:33 ` Dave Chinner
@ 2016-10-19  8:33 ` Peter Zijlstra
  2016-10-19 12:06 ` Michal Hocko
  2 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2016-10-19 8:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Chinner, Darrick J. Wong, Qu Wenruo, xfs, linux-mm,
	Ingo Molnar

On Thu, Oct 06, 2016 at 03:04:54PM +0200, Michal Hocko wrote:
> From 04b3923e5b12f0eb3859f0718881fa0f40e60164 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 13 May 2016 17:47:31 +0200
> Subject: [PATCH] lockdep: allow to disable reclaim lockup detection
>
> The current implementation of the reclaim lockup detection can lead to
> false positives, and those do happen and usually lead to tweaking the
> code to silence lockdep by using GFP_NOFS even though the context can
> use __GFP_FS just fine. See
> http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.
>
> =================================
> [ INFO: inconsistent lock state ]
> 4.5.0-rc2+ #4 Tainted: G           O
> ---------------------------------
> inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
> kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
>
> (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
>
> {RECLAIM_FS-ON-R} state was registered at:
>   [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
>   [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
>   [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
>   [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
>   [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
>   [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
>   [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
>   [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
>   [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
>   [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
>   [<ffffffff81244459>] SyS_ioctl+0x79/0x90
>   [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
>
>        CPU0
>        ----
>   lock(&xfs_nondir_ilock_class);
>   <Interrupt>
>     lock(&xfs_nondir_ilock_class);
>
>  *** DEADLOCK ***
>
> 3 locks held by kswapd0/543:
>
> stack backtrace:
> CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
> Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
>  ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
>  ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
>  ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
> Call Trace:
>  [<ffffffff813a14f9>] dump_stack+0x4b/0x72
>  [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
>  [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
>  [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
>  [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
>  [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
>  [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
>  [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
>  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
>  [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
>  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
>  [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
>  [<ffffffff8124d7d5>] evict+0xc5/0x190
>  [<ffffffff8124d8d9>] dispose_list+0x39/0x60
>  [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
>  [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
>  [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
>  [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
>  [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
>  [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
>  [<ffffffff810e6882>] kthread+0xf2/0x110
>  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
>  [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
>  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
>
> To quote Dave:
> "
> Ignoring whether reflink should be doing anything or not, that's a
> "xfs_refcountbt_init_cursor() gets called both outside and inside
> transactions" lockdep false positive case. The problem here is that
> lockdep has seen this allocation from within a transaction, hence a
> GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
> Also note that we have an active reference to this inode.
>
> So, because the reclaim annotations overload the interrupt level
> detections and it's seen the inode ilock being taken in reclaim
> ("interrupt") context, this triggers a reclaim context warning where
> it thinks it is unsafe to do this allocation in GFP_KERNEL context
> holding the inode ilock...
> "
>
> This sounds like a fundamental problem of the reclaim lockup detection.
> It is really impossible to annotate such a special usecase IMHO unless
> the reclaim lockup detection is reworked completely. Until then it is
> much better to provide a way to add an "I know what I am doing" flag
> and mark problematic places. This would prevent abusing the GFP_NOFS
> flag, which has a runtime effect even on configurations which have
> lockdep disabled.
>
> Introduce a __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking
> to skip the current allocation request.
>
> While we are at it, also make sure that the radix tree doesn't
> accidentally override tags stored in the upper part of the gfp_mask.
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

So I'm all for this if this works for Dave.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Please take it through the XFS tree, which would introduce its first
user etc..

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-10-19  8:33 ` Peter Zijlstra
@ 2016-10-19 12:06 ` Michal Hocko
  2016-10-19 21:49 ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2016-10-19 12:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Peter Zijlstra, Darrick J. Wong, Qu Wenruo, xfs, linux-mm,
	Ingo Molnar

On Wed 19-10-16 10:33:04, Peter Zijlstra wrote:
[...]
> So I'm all for this if this works for Dave.
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks Peter!

> Please take it through the XFS tree which would introduce its first user
> etc..

Dave, does that work for you? I agree that having this followed by a
first user would be really preferable. Maybe to turn some of those added
by b17cb364dbbb? I wish I could help here but, as you've said earlier,
each such annotation should be accompanied by an explanation which I am
not qualified to provide.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-10-19 12:06 ` Michal Hocko
@ 2016-10-19 21:49 ` Dave Chinner
  2016-10-20  7:15 ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2016-10-19 21:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Peter Zijlstra, Darrick J. Wong, Qu Wenruo, linux-xfs, linux-mm,
	Ingo Molnar

On Wed, Oct 19, 2016 at 02:06:27PM +0200, Michal Hocko wrote:
> On Wed 19-10-16 10:33:04, Peter Zijlstra wrote:
> [...]
> > So I'm all for this if this works for Dave.
> >
> > Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> Thanks Peter!
>
> > Please take it through the XFS tree which would introduce its first user
> > etc..
>
> Dave, does that work for you? I agree that having this followed by a
> first user would be really preferable. Maybe to turn some of those added
> by b17cb364dbbb? I wish I could help here but as you've said earlier
> each such annotation should be accompanied by an explanation which I am
> not qualified to provide.

I've got my hands full right now, so I'm not going to try to page all
this stuff back into my brain right now. Try reminding me as the merge
window gets closer...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-10-19 21:49 ` Dave Chinner
@ 2016-10-20  7:15 ` Michal Hocko
  0 siblings, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2016-10-20 7:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Peter Zijlstra, Darrick J. Wong, Qu Wenruo, linux-xfs, linux-mm,
	Ingo Molnar

On Thu 20-10-16 08:49:23, Dave Chinner wrote:
> On Wed, Oct 19, 2016 at 02:06:27PM +0200, Michal Hocko wrote:
> > On Wed 19-10-16 10:33:04, Peter Zijlstra wrote:
> > [...]
> > > So I'm all for this if this works for Dave.
> > >
> > > Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >
> > Thanks Peter!
> >
> > > Please take it through the XFS tree which would introduce its first user
> > > etc..
> >
> > Dave, does that work for you? I agree that having this followed by a
> > first user would be really preferable. Maybe to turn some of those added
> > by b17cb364dbbb? I wish I could help here but as you've said earlier
> > each such annotation should be accompanied by an explanation which I am
> > not qualified to provide.
>
> I've got my hands full right now, so I'm not going to try to page
> all this stuff back into my brain right now. Try reminding me as
> the merge window gets closer...

Sure, I do not think we are in a hurry. Thanks!
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread
end of thread, other threads:[~2016-10-20  7:15 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-12  5:53 Xfs lockdep warning with for-dave-for-4.6 branch Qu Wenruo
2016-05-12  5:57 ` Darrick J. Wong
2016-05-12  8:03 ` Dave Chinner
2016-05-13 16:03 ` Michal Hocko
2016-05-16 10:41 ` Peter Zijlstra
2016-05-16 13:05 ` Michal Hocko
2016-05-16 13:25 ` Peter Zijlstra
2016-05-16 23:10 ` Dave Chinner
2016-05-17 14:49 ` Peter Zijlstra
2016-05-17 22:35 ` Dave Chinner
2016-05-18  7:20 ` Peter Zijlstra
2016-05-18  8:25 ` Michal Hocko
2016-05-18  9:49 ` Peter Zijlstra
2016-05-18 11:31 ` Michal Hocko
2016-05-19  8:11 ` Peter Zijlstra
2016-05-20  0:17 ` Dave Chinner
2016-06-01 13:17 ` Michal Hocko
2016-06-01 18:16 ` Peter Zijlstra
2016-06-02 14:50 ` Michal Hocko
2016-06-02 15:11 ` Peter Zijlstra
2016-06-02 15:46 ` Michal Hocko
2016-06-02 23:22 ` Dave Chinner
2016-06-06 12:20 ` Michal Hocko
2016-06-15  7:21 ` Dave Chinner
2016-06-21 14:26 ` Michal Hocko
2016-06-22  1:03 ` Dave Chinner
2016-06-22 12:38 ` Michal Hocko
2016-06-22 22:58 ` Dave Chinner
2016-06-23 11:35 ` Michal Hocko
2016-10-06 13:04 ` Michal Hocko
2016-10-17 13:49 ` Michal Hocko
2016-10-19  0:33 ` Dave Chinner
2016-10-19  5:30 ` Dave Chinner
2016-10-19  8:33 ` Peter Zijlstra
2016-10-19 12:06 ` Michal Hocko
2016-10-19 21:49 ` Dave Chinner
2016-10-20  7:15 ` Michal Hocko