linux-xfs.vger.kernel.org archive mirror
* Xfs lockdep warning with for-dave-for-4.6 branch
@ 2016-05-12  5:53 Qu Wenruo
  2016-05-12  5:57 ` Darrick J. Wong
  0 siblings, 1 reply; 37+ messages in thread
From: Qu Wenruo @ 2016-05-12  5:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

Hi Darrick,

Thanks first for your reflink work for xfs - it's a very good and 
useful feature, and it also helps with debugging btrfs problems.
(Without it, there is no good reference for reflink behaviour.)

But when testing your for-dave-for-4.6 branch, even though I'm only 
testing btrfs with xfstests, the kernel reports a strange lockdep warning from xfs:

------
run fstests generic/175 at 2016-05-12 13:22:06
BTRFS: device fsid 3d5c9c3b-2d08-4f0b-9663-00a88cd218da devid 1 transid 
3 /dev/sdb6
BTRFS info (device sdb6): disk space caching is enabled
BTRFS: has skinny extents
BTRFS: flagging fs with big metadata feature
BTRFS: creating UUID tree
BTRFS: device fsid bb75eb48-4c5f-4b75-a41d-f642d70c7294 devid 1 transid 
3 /dev/sdb6
BTRFS info (device sdb6): disk space caching is enabled
BTRFS: has skinny extents
BTRFS: flagging fs with big metadata feature
BTRFS: creating UUID tree


=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G           O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
  (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] 
xfs_ilock+0x177/0x200 [xfs]
{RECLAIM_FS-ON-R} state was registered at:
   [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
   [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
   [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
   [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
   [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
   [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
   [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
   [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
   [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
   [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
   [<ffffffff81244459>] SyS_ioctl+0x79/0x90
   [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
irq event stamp: 510775
hardirqs last  enabled at (510775): [<ffffffff812245d0>] 
__slab_alloc+0x50/0x70
hardirqs last disabled at (510774): [<ffffffff812245ae>] 
__slab_alloc+0x2e/0x70
softirqs last  enabled at (510506): [<ffffffff810c8ea8>] 
__do_softirq+0x358/0x430
softirqs last disabled at (510489): [<ffffffff810c911d>] irq_exit+0xad/0xb0

other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&xfs_nondir_ilock_class);
   <Interrupt>
     lock(&xfs_nondir_ilock_class);

  *** DEADLOCK ***

3 locks held by kswapd0/543:
  #0:  (shrinker_rwsem){++++..}, at: [<ffffffff811e0b78>] 
shrink_slab.part.63.constprop.79+0x48/0x4e0
  #1:  (&type->s_umount_key#26){++++++}, at: [<ffffffff81232ffb>] 
trylock_super+0x1b/0x50
  #2:  (sb_internal){.+.+.?}, at: [<ffffffff812327f4>] 
__sb_start_write+0xb4/0xf0

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 
12/01/2006
  ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
  ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
  ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
Call Trace:
  [<ffffffff813a14f9>] dump_stack+0x4b/0x72
  [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
  [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
  [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
  [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
  [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
  [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
  [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
  [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
  [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
  [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
  [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
  [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
  [<ffffffff8124d7d5>] evict+0xc5/0x190
  [<ffffffff8124d8d9>] dispose_list+0x39/0x60
  [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
  [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
  [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
  [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
  [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
  [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
  [<ffffffff810e6882>] kthread+0xf2/0x110
  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
  [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
hrtimer: interrupt took 4824925 ns

BTRFS info (device sdb6): disk space caching is enabled
BTRFS: has skinny extents
------

The test machine is using normal xfs (without -m reflink=1) as its root.
As you can see, it's running generic/175 on *BTRFS*, not *XFS*, but 
there is still a lockdep warning from xfs.

Hope the output helps you.

Thanks,
Qu



* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-12  5:53 Xfs lockdep warning with for-dave-for-4.6 branch Qu Wenruo
@ 2016-05-12  5:57 ` Darrick J. Wong
  2016-05-12  8:03   ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2016-05-12  5:57 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: xfs

On Thu, May 12, 2016 at 01:53:44PM +0800, Qu Wenruo wrote:
> Hi Darrick,
> 
> Thanks for your reflink work for xfs first, that's quite a good and useful
> feature, also helps to debug btrfs problems.
> (Without that, there is no good reference for reflink behavior)
> 
> But when testing your for-dave-for-4.6 branch, even I'm just testing btrfs
> with xfstests, kernel report some strange lockdep from xfs:
> 
> ------
> run fstests generic/175 at 2016-05-12 13:22:06
> BTRFS: device fsid 3d5c9c3b-2d08-4f0b-9663-00a88cd218da devid 1 transid 3
> /dev/sdb6
> BTRFS info (device sdb6): disk space caching is enabled
> BTRFS: has skinny extents
> BTRFS: flagging fs with big metadata feature
> BTRFS: creating UUID tree
> BTRFS: device fsid bb75eb48-4c5f-4b75-a41d-f642d70c7294 devid 1 transid 3
> /dev/sdb6
> BTRFS info (device sdb6): disk space caching is enabled
> BTRFS: has skinny extents
> BTRFS: flagging fs with big metadata feature
> BTRFS: creating UUID tree
> 
> 
> =================================
> [ INFO: inconsistent lock state ]
> 4.5.0-rc2+ #4 Tainted: G           O
> ---------------------------------
> inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
> kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
>  (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>]
> xfs_ilock+0x177/0x200 [xfs]
> {RECLAIM_FS-ON-R} state was registered at:
>   [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
>   [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
>   [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
>   [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
>   [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
>   [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
>   [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
>   [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
>   [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
>   [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
>   [<ffffffff81244459>] SyS_ioctl+0x79/0x90
>   [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
> irq event stamp: 510775
> hardirqs last  enabled at (510775): [<ffffffff812245d0>]
> __slab_alloc+0x50/0x70
> hardirqs last disabled at (510774): [<ffffffff812245ae>]
> __slab_alloc+0x2e/0x70
> softirqs last  enabled at (510506): [<ffffffff810c8ea8>]
> __do_softirq+0x358/0x430
> softirqs last disabled at (510489): [<ffffffff810c911d>] irq_exit+0xad/0xb0
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>        CPU0
>        ----
>   lock(&xfs_nondir_ilock_class);
>   <Interrupt>
>     lock(&xfs_nondir_ilock_class);
> 
>  *** DEADLOCK ***
> 
> 3 locks held by kswapd0/543:
>  #0:  (shrinker_rwsem){++++..}, at: [<ffffffff811e0b78>]
> shrink_slab.part.63.constprop.79+0x48/0x4e0
>  #1:  (&type->s_umount_key#26){++++++}, at: [<ffffffff81232ffb>]
> trylock_super+0x1b/0x50
>  #2:  (sb_internal){.+.+.?}, at: [<ffffffff812327f4>]
> __sb_start_write+0xb4/0xf0
> 
> stack backtrace:
> CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
> Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox
> 12/01/2006
>  ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
>  ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
>  ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
> Call Trace:
>  [<ffffffff813a14f9>] dump_stack+0x4b/0x72
>  [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
>  [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
>  [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
>  [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
>  [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
>  [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
>  [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
>  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
>  [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
>  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
>  [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
>  [<ffffffff8124d7d5>] evict+0xc5/0x190
>  [<ffffffff8124d8d9>] dispose_list+0x39/0x60
>  [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
>  [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
>  [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
>  [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
>  [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
>  [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
>  [<ffffffff810e6882>] kthread+0xf2/0x110
>  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
>  [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
>  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
> hrtimer: interrupt took 4824925 ns
> 
> BTRFS info (device sdb6): disk space caching is enabled
> BTRFS: has skinny extents
> ------
> 
> The test machine is using normal xfs (without -m reflink=1) as its root.
> As you can see, it's running generic/175 on *BTRFS*, not *XFS*, but still
> lockdep warning from xfs.

That's ... odd.  We shouldn't ever be in the refcountbt code if reflink
isn't enabled.  Welp, I'll have a look in the morning.

--D

> 
> Hopes the output could help you.
> 
> Thanks,
> Qu
> 
> 


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-12  5:57 ` Darrick J. Wong
@ 2016-05-12  8:03   ` Dave Chinner
  2016-05-13 16:03     ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2016-05-12  8:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Qu Wenruo, Michal Hocko, xfs

[ cc Michal Hocko, just so he can see a lockdep reclaim state false
positive ]

On Wed, May 11, 2016 at 10:57:56PM -0700, Darrick J. Wong wrote:
> On Thu, May 12, 2016 at 01:53:44PM +0800, Qu Wenruo wrote:
> > Hi Darrick,
> > 
> > Thanks for your reflink work for xfs first, that's quite a good and useful
> > feature, also helps to debug btrfs problems.
> > (Without that, there is no good reference for reflink behavior)
> > 
> > But when testing your for-dave-for-4.6 branch, even I'm just testing btrfs
> > with xfstests, kernel report some strange lockdep from xfs:
.....
> > =================================
> > [ INFO: inconsistent lock state ]
> > 4.5.0-rc2+ #4 Tainted: G           O
> > ---------------------------------
> > inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
> > kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
> >  (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>]
> > xfs_ilock+0x177/0x200 [xfs]
> > {RECLAIM_FS-ON-R} state was registered at:
> >   [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
> >   [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
> >   [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
> >   [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
> >   [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
> >   [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
> >   [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
> >   [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
> >   [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
> >   [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
> >   [<ffffffff81244459>] SyS_ioctl+0x79/0x90
> >   [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.
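
In code, the pattern lockdep has recorded is roughly the following. This
is a hypothetical, self-contained sketch - none of these symbols are the
actual XFS functions; demo_ilock stands in for xfs_nondir_ilock_class and
demo_cache for the btree cursor zone:

#include <linux/rwsem.h>
#include <linux/slab.h>

static DECLARE_RWSEM(demo_ilock);	/* stands in for the inode ilock */
static struct kmem_cache *demo_cache;	/* stands in for the cursor zone */

/* fiemap-like path: __GFP_FS allocation while the lock is held shared;
 * lockdep records RECLAIM_FS-ON-R for the lock class here */
static void *demo_alloc_outside_transaction(void)
{
	void *p;

	down_read(&demo_ilock);
	p = kmem_cache_alloc(demo_cache, GFP_KERNEL);
	up_read(&demo_ilock);
	return p;
}

/* transaction path: the same allocation site, but effectively GFP_NOFS
 * (XFS gets this implicitly from the transaction context) */
static void *demo_alloc_inside_transaction(void)
{
	void *p;

	down_write(&demo_ilock);
	p = kmem_cache_alloc(demo_cache, GFP_NOFS);
	up_write(&demo_ilock);
	return p;
}

/* evict-like path: the same lock class taken for write from kswapd's
 * reclaim context; combined with the RECLAIM_FS-ON-R above, this is
 * what lockdep reports as an inconsistent lock state */
static void demo_evict_from_reclaim(void)
{
	down_write(&demo_ilock);
	/* cancel COW extents, tear down the inode, ... */
	up_write(&demo_ilock);
}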

> > irq event stamp: 510775
> > hardirqs last  enabled at (510775): [<ffffffff812245d0>]
> > __slab_alloc+0x50/0x70
> > hardirqs last disabled at (510774): [<ffffffff812245ae>]
> > __slab_alloc+0x2e/0x70
> > softirqs last  enabled at (510506): [<ffffffff810c8ea8>]
> > __do_softirq+0x358/0x430
> > softirqs last disabled at (510489): [<ffffffff810c911d>] irq_exit+0xad/0xb0
> > 
> > other info that might help us debug this:
> >  Possible unsafe locking scenario:
> > 
> >        CPU0
> >        ----
> >   lock(&xfs_nondir_ilock_class);
> >   <Interrupt>
> >     lock(&xfs_nondir_ilock_class);

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...

> >  *** DEADLOCK ***
> > 
> > 3 locks held by kswapd0/543:
> >  #0:  (shrinker_rwsem){++++..}, at: [<ffffffff811e0b78>]
> > shrink_slab.part.63.constprop.79+0x48/0x4e0
> >  #1:  (&type->s_umount_key#26){++++++}, at: [<ffffffff81232ffb>]
> > trylock_super+0x1b/0x50
> >  #2:  (sb_internal){.+.+.?}, at: [<ffffffff812327f4>]
> > __sb_start_write+0xb4/0xf0
> > 
> > stack backtrace:
> > CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
> > Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox
> > 12/01/2006
> >  ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
> >  ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
> >  ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
> > Call Trace:
> >  [<ffffffff813a14f9>] dump_stack+0x4b/0x72
> >  [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
> >  [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
> >  [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
> >  [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
> >  [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
> >  [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
> >  [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
> >  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
> >  [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
> >  [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
> >  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
> >  [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
> >  [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
> >  [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
> >  [<ffffffff8124d7d5>] evict+0xc5/0x190
> >  [<ffffffff8124d8d9>] dispose_list+0x39/0x60
> >  [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
> >  [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
> >  [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
> >  [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
> >  [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
> >  [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
> >  [<ffffffff810e6882>] kthread+0xf2/0x110
> >  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
> >  [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
> >  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
> > hrtimer: interrupt took 4824925 ns

The reclaim -> lock context that it's complaining about here is on
an inode being reclaimed - it has no active references and so, by
definition, cannot deadlock with a context holding an active
reference to an inode ilock. Hence there cannot possibly be a
deadlock here, but we can't tell lockdep that easily in any way
without going back to the bad old ways of creating a new lockdep
class for inode ilocks the moment they enter ->evict. This then
disables "entire lifecycle" lockdep checking on the xfs inode ilock,
which is why we got rid of it in the first place.

Our "traditional" solution to this is to add KM_NOFS to the
allocation to force GFP_NOFS allocation in all contexts so that we
shut up lockdep and people don't keep endlessly reporting the same
non-problem.
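
The change is usually a one-liner at the allocation site, something like
the following (hypothetical - the exact zone and flag used at that call
site in the branch may differ):

	/* force NOFS so lockdep never sees a __GFP_FS allocation under
	 * the ilock from this site; previously KM_SLEEP */
	cur = kmem_zone_alloc(xfs_btree_cur_zone, KM_NOFS);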

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-12  8:03   ` Dave Chinner
@ 2016-05-13 16:03     ` Michal Hocko
  2016-05-16 10:41       ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2016-05-13 16:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, Qu Wenruo, Peter Zijlstra, xfs, Darrick J. Wong

On Thu 12-05-16 18:03:21, Dave Chinner wrote:
> [ cc Michal Hocko, just so he can see a lockdep reclaim state false
> positive ]

Thank you for CCing me!

I am sorry I didn't follow up on the previous discussion, but I got
sidetracked by something else. I have tried to cook up something really
simple. I didn't get to test it at all and it might be completely broken,
but I just wanted to throw an idea out for discussion. I am CCing Peter
as well - he might have a better idea (the reference to the full email
is in the changelog). Is something like the following correct/acceptable?

This is on top of my scope gfp_nofs patch I have posted recently but I
can reorder them if this looks ok.
---
>From 9e980c2342f355e2bf1e12839c51e8e304e8842e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 13 May 2016 17:47:31 +0200
Subject: [PATCH] lockdep: allow to disable reclaim lockup detection

The current implementation of the reclaim lockup detection can lead to
false positives, and those do happen in practice; they usually lead to
tweaking the code to silence lockdep by using GFP_NOFS even though the
context can use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard for an example.

=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G           O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
  [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
  [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
  [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
  [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
  [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
  [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
  [<ffffffff81244459>] SyS_ioctl+0x79/0x90
  [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

       CPU0
       ----
  lock(&xfs_nondir_ilock_class);
  <Interrupt>
    lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4

Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

 ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
 ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
 ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
Call Trace:
 [<ffffffff813a14f9>] dump_stack+0x4b/0x72
 [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
 [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
 [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
 [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
 [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [<ffffffff8124d7d5>] evict+0xc5/0x190
 [<ffffffff8124d8d9>] dispose_list+0x39/0x60
 [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
 [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
 [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
 [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
 [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
 [<ffffffff810e6882>] kthread+0xf2/0x110
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
 [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lockup detection.
It is really impossible to annotate such a special use case IMHO unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide an "I know what I am doing" flag and use it
to mark the problematic places. This would prevent abuse of the GFP_NOFS
flag, which has a runtime effect even on configurations which have
lockdep disabled.

Introduce lockdep_trace_alloc_{disable,enable} which would tell
__lockdep_trace_alloc to skip an allocation request even when it has
__GFP_FS enabled. This means that the false positive shouldn't be
generated.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/lockdep.h  | 10 ++++++++++
 include/linux/sched.h    |  1 +
 kernel/locking/lockdep.c |  4 ++++
 3 files changed, 15 insertions(+)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 4dca42fd32f5..4b04bf9ab560 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -354,6 +354,14 @@ static inline void lock_set_subclass(struct lockdep_map *lock,
 extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask);
 extern void lockdep_clear_current_reclaim_state(void);
 extern void lockdep_trace_alloc(gfp_t mask);
+static inline void lockdep_trace_alloc_disable(void)
+{
+	current->lockdep_reclaim_disabled = 1;
+}
+static inline void lockdep_trace_alloc_enable(void)
+{
+	current->lockdep_reclaim_disabled = 0;
+}
 
 extern void lock_pin_lock(struct lockdep_map *lock);
 extern void lock_unpin_lock(struct lockdep_map *lock);
@@ -392,6 +400,8 @@ static inline void lockdep_on(void)
 # define lockdep_set_current_reclaim_state(g)	do { } while (0)
 # define lockdep_clear_current_reclaim_state()	do { } while (0)
 # define lockdep_trace_alloc(g)			do { } while (0)
+# define lockdep_trace_alloc_disable()		do { } while (0)
+# define lockdep_trace_alloc_enable()		do { } while (0)
 # define lockdep_init()				do { } while (0)
 # define lockdep_info()				do { } while (0)
 # define lockdep_init_map(lock, name, key, sub) \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e9521dc0475f..ec643916bad3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1644,6 +1644,7 @@ struct task_struct {
 	unsigned int lockdep_recursion;
 	struct held_lock held_locks[MAX_LOCK_DEPTH];
 	gfp_t lockdep_reclaim_gfp;
+	bool lockdep_reclaim_disabled;
 #endif
 #ifdef CONFIG_UBSAN
 	unsigned int in_ubsan;
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index f60124d0871c..cf6ead3e7014 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2753,6 +2753,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
 		return;
 
+	/* The caller explicitly asked to disable reclaim recursion tracking */
+	if (curr->lockdep_reclaim_disabled)
+		return;
+
 	/*
 	 * Oi! Can't be having __GFP_FS allocations with IRQs disabled.
 	 */
-- 
2.8.1
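
With the helpers above, a call site that is known not to deadlock could
bracket just the offending allocation rather than forcing GFP_NOFS, e.g.
(hypothetical XFS call site, not part of the patch):

	/* hypothetical: keep the allocation __GFP_FS capable, but tell
	 * lockdep not to use this site for reclaim recursion tracking */
	lockdep_trace_alloc_disable();
	cur = kmem_zone_alloc(xfs_btree_cur_zone, KM_SLEEP);
	lockdep_trace_alloc_enable();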

-- 
Michal Hocko
SUSE Labs


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-13 16:03     ` Michal Hocko
@ 2016-05-16 10:41       ` Peter Zijlstra
  2016-05-16 13:05         ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2016-05-16 10:41 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Qu Wenruo, xfs, Darrick J. Wong

On Fri, May 13, 2016 at 06:03:41PM +0200, Michal Hocko wrote:

> To quote Dave:
> "
> Ignoring whether reflink should be doing anything or not, that's a
> "xfs_refcountbt_init_cursor() gets called both outside and inside
> transactions" lockdep false positive case. The problem here is
> lockdep has seen this allocation from within a transaction, hence a
> GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
> Also note that we have an active reference to this inode.

So the only thing that distinguishes the good from the bad case is that
reference; how should that then not do anything?

> So, because the reclaim annotations overload the interrupt level
> detections and it's seen the inode ilock been taken in reclaim
> ("interrupt") context, this triggers a reclaim context warning where
> it thinks it is unsafe to do this allocation in GFP_KERNEL context
> holding the inode ilock...
> "
> 
> This sounds like a fundamental problem of the reclaim lock detection.
> It is really impossible to annotate such a special usecase IMHO unless
> the reclaim lockup detection is reworked completely.

How would you like to see it done? The interrupt model works well for
reclaim because how is direct reclaim from a !GFP_NOWAIT allocation not
an 'interrupt' like thing?

> Until then it
> is much better to provide a way to add "I know what I am doing flag"
> and mark problematic places. This would prevent from abusing GFP_NOFS
> flag which has a runtime effect even on configurations which have
> lockdep disabled.

So without more context; no. The mail you referenced mentions:

"The reclaim -> lock context that it's complaining about here is on
an inode being reclaimed - it has no active references and so, by
definition, cannot deadlock with a context holding an active
reference to an inode ilock. Hence there cannot possibly be a
deadlock here, but we can't tell lockdep that easily in any way
without going back to the bad old ways of creating a new lockdep
class for inode ilocks the moment they enter ->evict. This then
disables "entire lifecycle" lockdep checking on the xfs inode ilock,
which is why we got rid of it in the first place."

But fails to explain the problems with the 'old' approach.

So clearly this is a 'problem' that has existed for quite a while, so I
don't see any need to rush half baked solutions either.

Please better explain things.


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-16 10:41       ` Peter Zijlstra
@ 2016-05-16 13:05         ` Michal Hocko
  2016-05-16 13:25           ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2016-05-16 13:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-mm, Qu Wenruo, xfs, Darrick J. Wong

On Mon 16-05-16 12:41:30, Peter Zijlstra wrote:
> On Fri, May 13, 2016 at 06:03:41PM +0200, Michal Hocko wrote:
[...]
> > So, because the reclaim annotations overload the interrupt level
> > detections and it's seen the inode ilock been taken in reclaim
> > ("interrupt") context, this triggers a reclaim context warning where
> > it thinks it is unsafe to do this allocation in GFP_KERNEL context
> > holding the inode ilock...
> > "
> > 
> > This sounds like a fundamental problem of the reclaim lock detection.
> > It is really impossible to annotate such a special usecase IMHO unless
> > the reclaim lockup detection is reworked completely.
> 
> How would you like to see it done? The interrupt model works well for
> reclaim because how is direct reclaim from a !GFP_NOWAIT allocation not
> an 'interrupt' like thing?

Unfortunately I do not have any good ideas. AFAIU it would basically
require a way to mark the lockdep context as transaction specific and to
tell lockdep that there is no real dependency between the !GFP_NOWAIT and
'interrupt' contexts.
IIRC from Dave's emails, they have tried that by using lockdep classes,
and it turned out to be an overly complex maze which still doesn't work
100% reliably.

> > Until then it
> > is much better to provide a way to add "I know what I am doing flag"
> > and mark problematic places. This would prevent from abusing GFP_NOFS
> > flag which has a runtime effect even on configurations which have
> > lockdep disabled.
> 
> So without more context; no. The mail you referenced mentions:
> 
> "The reclaim -> lock context that it's complaining about here is on
> an inode being reclaimed - it has no active references and so, by
> definition, cannot deadlock with a context holding an active
> reference to an inode ilock. Hence there cannot possibly be a
> deadlock here, but we can't tell lockdep that easily in any way
> without going back to the bad old ways of creating a new lockdep
> class for inode ilocks the moment they enter ->evict. This then
> disables "entire lifecycle" lockdep checking on the xfs inode ilock,
> which is why we got rid of it in the first place."
> 
> But fails to explain the problems with the 'old' approach.
> 
> So clearly this is a 'problem' that has existed for quite a while, so I
> don't see any need to rush half baked solutions either.

Well, at least my motivation for _some_ solution here is that xfs has
worked around this deficiency by forcing GFP_NOFS also for contexts which
are perfectly OK to do __GFP_FS allocation. And that in turn leads to
other issues which I would really like to sort out. So the idea was to
give xfs another way to express that workaround that would be a noop
without lockdep configured.
 
> Please better explain things.
-- 
Michal Hocko
SUSE Labs


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-16 13:05         ` Michal Hocko
@ 2016-05-16 13:25           ` Peter Zijlstra
  2016-05-16 23:10             ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2016-05-16 13:25 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Qu Wenruo, xfs, Darrick J. Wong

On Mon, May 16, 2016 at 03:05:19PM +0200, Michal Hocko wrote:
> On Mon 16-05-16 12:41:30, Peter Zijlstra wrote:
> > On Fri, May 13, 2016 at 06:03:41PM +0200, Michal Hocko wrote:

> > How would you like to see it done? The interrupt model works well for
> > reclaim because how is direct reclaim from a !GFP_NOWAIT allocation not
> > an 'interrupt' like thing?
> 
> Unfortunately I do not have any good ideas. It would basically require
> to allow marking the lockdep context transaction specific AFAIU somehow
> and tell that there is no real dependency between !GFP_NOWAIT and
> 'interrupt' context.

But here is; direct reclaim is very much an 'interrupt' in the normal
program flow.

But the problem here appears to be that at some points we 'know' things
cannot get reclaimed because of stuff we didn't tell lockdep about (it's
got references), and sure, then it doesn't work right.

But that doesn't mean that the 'interrupt' model is wrong.

> IIRC Dave's emails they have tried that by using lockdep classes and
> that turned out to be an overly complex maze which still doesn't work
> 100% reliably.

So that would be the: 

> >  but we can't tell lockdep that easily in any way
> > without going back to the bad old ways of creating a new lockdep
> > class for inode ilocks the moment they enter ->evict. This then
> > disables "entire lifecycle" lockdep checking on the xfs inode ilock,
> > which is why we got rid of it in the first place."
> > 
> > But fails to explain the problems with the 'old' approach.
> > 
> > So clearly this is a 'problem' that has existed for quite a while, so I
> > don't see any need to rush half baked solutions either.
> 
> Well, at least my motivation for _some_ solution here is that xfs has
> worked around this deficiency by forcing GFP_NOFS also for contexts which
> are perfectly OK to do __GFP_FS allocation. And that in turn leads to
> other issues which I would really like to sort out. So the idea was to
> give xfs another way to express that workaround that would be a noop
> without lockdep configured.

Right, that's unfortunate. But I would really like to understand the
problem with the classes vs lifecycle thing.

Is there an email explaining that somewhere?


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-16 13:25           ` Peter Zijlstra
@ 2016-05-16 23:10             ` Dave Chinner
  2016-05-17 14:49               ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2016-05-16 23:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-mm, Qu Wenruo, xfs, Darrick J. Wong, Michal Hocko

On Mon, May 16, 2016 at 03:25:41PM +0200, Peter Zijlstra wrote:
> On Mon, May 16, 2016 at 03:05:19PM +0200, Michal Hocko wrote:
> > On Mon 16-05-16 12:41:30, Peter Zijlstra wrote:
> > > On Fri, May 13, 2016 at 06:03:41PM +0200, Michal Hocko wrote:
> > IIRC Dave's emails they have tried that by using lockdep classes and
> > that turned out to be an overly complex maze which still doesn't work
> > 100% reliably.
> 
> So that would be the: 
> 
> > >  but we can't tell lockdep that easily in any way
> > > without going back to the bad old ways of creating a new lockdep
> > > class for inode ilocks the moment they enter ->evict. This then
> > > disables "entire lifecycle" lockdep checking on the xfs inode ilock,
> > > which is why we got rid of it in the first place."
> > > 
> > > But fails to explain the problems with the 'old' approach.
> > > 
> > > So clearly this is a 'problem' that has existed for quite a while, so I
> > > don't see any need to rush half baked solutions either.
> > 
> > Well, at least my motivation for _some_ solution here is that xfs has
> > worked around this deficiency by forcing GFP_NOFS also for contexts which
> > are perfectly OK to do __GFP_FS allocation. And that in turn leads to
> > other issues which I would really like to sort out. So the idea was to
> > give xfs another way to express that workaround that would be a noop
> > without lockdep configured.
> 
> Right, that's unfortunate. But I would really like to understand the
> problem with the classes vs lifecycle thing.
> 
> Is there an email explaining that somewhere?

Years ago (i.e. last time I bothered mentioning that lockdep didn't
cover these cases) but buggered if I can find a reference.

We used to have iolock classes. Added in commit 033da48 ("xfs: reset
the i_iolock lock class in the reclaim path") in 2010, removed by
commit 4f59af7 ("xfs: remove iolock lock classes") a couple of years
later.

We needed distinct lock classes above and below reclaim because the
same locks were taken above and below memory allocation, and the
reclaimed inode recycling made lockdep think it had a loop in its
graph:


inode lookup	memory reclaim		inode lookup
------------	--------------		------------
not found
allocate inode
init inode
take active reference
return to caller
.....
		  inode shrinker
		  find inode with no active reference
		   ->evict
		     start transaction
		       take inode locks
		     .....
		     commit transaction
		     mark for reclaim
		     .....

					find inactive inode in reclaimable state
					remove from reclaim list
					re-initialise inode state
					take active reference
					return to caller

So you can see that an inode can go through the reclaim context
without being freed but having taken locks, but then immediately
reused in a non-reclaim context. Hence we had to split the two
lockdep contexts and re-initialise the lock contexts in the
different allocation paths.
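
The shape of that old approach was roughly the following (a hypothetical
sketch with assumed names, not the code that was actually removed):

#include <linux/lockdep.h>
#include <linux/rwsem.h>

/* hypothetical per-lifecycle-phase lock classes for the iolock */
static struct lock_class_key demo_iolock_active;	/* has references  */
static struct lock_class_key demo_iolock_reclaimable;	/* being reclaimed */

/* entering ->evict: flip the lock to the "reclaimable" class */
static void demo_enter_reclaim(struct rw_semaphore *iolock)
{
	lockdep_set_class_and_name(iolock, &demo_iolock_reclaimable,
				   "demo_iolock_reclaimable");
}

/* recycling a reclaimable inode at lookup: flip back to "active" */
static void demo_recycle_inode(struct rw_semaphore *iolock)
{
	lockdep_set_class_and_name(iolock, &demo_iolock_active,
				   "demo_iolock_active");
}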

The reason we don't have lock classes for the ilock is that we aren't
supposed to call memory reclaim with that lock held in exclusive
mode. This is because reclaim can run transactions, and that may
need to flush dirty inodes to make progress. Flushing dirty inode
requires taking the ilock in shared mode.

In the code path that was reported, we hold the ilock in /shared/
mode with no transaction context (we are doing a read-only
operation). This means we can run transactions in memory reclaim
because a) we can't deadlock on the inode we hold locks on, and b)
transaction reservations will be able to make progress as we don't
hold any locks it can block on.
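
Spelled out as call chains, the two cases look roughly like this (pseudo
call chains, not actual XFS paths):

/*
 * Unsafe pattern (what the rule above guards against):
 *
 *	xfs_ilock(ip, XFS_ILOCK_EXCL)
 *	  allocation with __GFP_FS		-> may enter direct reclaim
 *	    reclaim runs a transaction
 *	      flushing a dirty inode wants xfs_ilock(ip, XFS_ILOCK_SHARED)
 *	      on the inode we already hold exclusively	=> real deadlock risk
 *
 * Reported (safe) pattern:
 *
 *	xfs_ilock(ip, XFS_ILOCK_SHARED)		-> read-only fiemap path
 *	  allocation with __GFP_FS		-> may enter direct reclaim
 *	    reclaim may run transactions; the shared ilock we hold does not
 *	    block the inode flush, and we hold no transaction reservation
 */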

Sure, we can have multiple lock classes so that this can be done,
but that then breaks lock order checking between the two contexts.
Both contexts require locking order and transactional behaviour to
be identical and, from this perspective, class separation is
effectively no different from turning lockdep off.

For the ilock, the number of places where the ilock is held over
GFP_KERNEL allocations is pretty small. Hence we've simply added
GFP_NOFS to those allocations to - effectively - annotate those
allocations as "lockdep causes problems here". There are probably
30-35 allocations in XFS that explicitly use KM_NOFS - some of these
are masking lockdep false positive reports.

The reason why we had lock classes for the *iolock* (not the ilock)
was because ->evict processing used to require the iolock for
attribute removal. This caused all sorts of problems with
GFP_KERNEL allocations in the IO path (e.g. allocating a page cache
page under the IO lock could trigger lockdep warnings from direct
reclaim - that's another reason why XFS always used GFP_NOFS
contexts for page cache allocations), which were all false positives
because the inode holding the lock has active references and hence
will never be accessed by the reclaim path. The only workable
solution we could come up with at the time this was reported was
to split the lock classes.

In the end, like pretty much all the complex lockdep false positives
we've had to deal in XFS, we've ended up changing the locking or
allocation contexts because that's been far easier than trying to
make annotations cover everything or convince other people that
lockdep annotations are insufficient.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-16 23:10             ` Dave Chinner
@ 2016-05-17 14:49               ` Peter Zijlstra
  2016-05-17 22:35                 ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2016-05-17 14:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Michal Hocko, Ingo Molnar


Thanks for writing all that down Dave!

On Tue, May 17, 2016 at 09:10:56AM +1000, Dave Chinner wrote:

> The reason we don't have lock clases for the ilock is that we aren't
> supposed to call memory reclaim with that lock held in exclusive
> mode. This is because reclaim can run transactions, and that may
> need to flush dirty inodes to make progress. Flushing dirty inode
> requires taking the ilock in shared mode.
> 
> In the code path that was reported, we hold the ilock in /shared/
> mode with no transaction context (we are doing a read-only
> operation). This means we can run transactions in memory reclaim
> because a) we can't deadlock on the inode we hold locks on, and b)
> transaction reservations will be able to make progress as we don't
> hold any locks it can block on.

Just to clarify; I read the above as saying that we cannot block on
recursive shared locks, is this correct?

Because we can in fact block on down_read()+down_read() just fine, so if
you're assuming that, then something's busted.

Otherwise, I'm not quite reading it right, which is, given the
complexity of that stuff, entirely possible.

The other possible reading is that we cannot deadlock on the inode we
hold locks on because we hold a reference on it; and the reference
avoids the inode from being reclaimed. But then the whole
shared/exclusive thing doesn't seem to make sense.

> For the ilock, the number of places where the ilock is held over
> GFP_KERNEL allocations is pretty small. Hence we've simply added
> GFP_NOFS to those allocations to - effectively - annotate those
> allocations as "lockdep causes problems here". There are probably
> 30-35 allocations in XFS that explicitly use KM_NOFS - some of these
> are masking lockdep false positive reports.


> In the end, like pretty much all the complex lockdep false positives
> we've had to deal in XFS, we've ended up changing the locking or
> allocation contexts because that's been far easier than trying to
> make annotations cover everything or convince other people that
> lockdep annotations are insufficient.

Well, I don't mind creating lockdep annotations; but explanations of the
exact details always go a long way towards helping me come up with
something.

While going over the code; I see there's complaining about
MAX_SUBCLASSES being too small. Would it help if we doubled it? We
cannot grow the thing without limits, but doubling it should be possible
I think.


In any case; would something like this work for you? It's entirely
untested, but the idea is to mark an entire class to skip reclaim
validation, instead of marking individual sites.

We have to do the subclass loop because; as per the comment with
XFS_ILOCK_* we use all 8 subclasses.

---
 fs/xfs/xfs_super.c       | 13 +++++++++++++
 include/linux/lockdep.h  |  8 +++++++-
 kernel/locking/lockdep.c | 47 +++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 187e14b696c2..ea55f87edad8 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -985,6 +985,19 @@ xfs_fs_inode_init_once(
 		     "xfsino", ip->i_ino);
 	mrlock_init(&ip->i_lock, MRLOCK_ALLOW_EQUAL_PRI|MRLOCK_BARRIER,
 		     "xfsino", ip->i_ino);
+
+#ifdef CONFIG_LOCKDEP
+	/*
+	 * Disable reclaim tests for the i_lock; reclaim is guarded
+	 * by a reference count... XXX write coherent comment.
+	 */
+	do {
+		int i;
+
+		for (i = 0; i < MAX_LOCKDEP_SUBCLASSES; i++)
+			lockdep_skip_reclaim(&ip->i_lock.mr_lock, i);
+	} while (0);
+#endif
 }
 
 STATIC void
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index eabe0138eb06..fbaa6c8bcff6 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -80,7 +80,8 @@ struct lock_class {
 	/*
 	 * IRQ/softirq usage tracking bits:
 	 */
-	unsigned long			usage_mask;
+	unsigned int			usage_mask;
+	unsigned int			skip_mask;
 	struct stack_trace		usage_traces[XXX_LOCK_USAGE_STATES];
 
 	/*
@@ -281,6 +282,8 @@ extern void lockdep_on(void);
 extern void lockdep_init_map(struct lockdep_map *lock, const char *name,
 			     struct lock_class_key *key, int subclass);
 
+extern void lock_skip_reclaim(struct lockdep_map *lock, int subclass);
+
 /*
  * To initialize a lockdep_map statically use this macro.
  * Note that _name must not be NULL.
@@ -304,6 +307,9 @@ extern void lockdep_init_map(struct lockdep_map *lock, const char *name,
 		lockdep_init_map(&(lock)->dep_map, #lock, \
 				 (lock)->dep_map.key, sub)
 
+#define lockdep_skip_reclaim(lock, sub) \
+		lock_skip_reclaim(&(lock)->dep_map, sub)
+
 #define lockdep_set_novalidate_class(lock) \
 	lockdep_set_class_and_name(lock, &__lockdep_no_validate__, #lock)
 /*
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 81f1a7107c0e..f3b3b3e7938a 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3022,13 +3022,17 @@ void lockdep_trace_alloc(gfp_t gfp_mask)
 static int mark_lock(struct task_struct *curr, struct held_lock *this,
 			     enum lock_usage_bit new_bit)
 {
+	struct lock_class *class = hlock_class(this);
 	unsigned int new_mask = 1 << new_bit, ret = 1;
 
 	/*
 	 * If already set then do not dirty the cacheline,
 	 * nor do any checks:
 	 */
-	if (likely(hlock_class(this)->usage_mask & new_mask))
+	if (likely(class->usage_mask & new_mask))
+		return 1;
+
+	if (class->skip_mask & (new_mask >> 2))
 		return 1;
 
 	if (!graph_lock())
@@ -3036,14 +3040,14 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
 	/*
 	 * Make sure we didn't race:
 	 */
-	if (unlikely(hlock_class(this)->usage_mask & new_mask)) {
+	if (unlikely(class->usage_mask & new_mask)) {
 		graph_unlock();
 		return 1;
 	}
 
-	hlock_class(this)->usage_mask |= new_mask;
+	class->usage_mask |= new_mask;
 
-	if (!save_trace(hlock_class(this)->usage_traces + new_bit))
+	if (!save_trace(class->usage_traces + new_bit))
 		return 0;
 
 	switch (new_bit) {
@@ -3586,6 +3590,24 @@ static int __lock_is_held(struct lockdep_map *lock)
 	return 0;
 }
 
+static void __lock_skip_reclaim(struct lockdep_map *lock, int subclass)
+{
+	struct lock_class *class = register_lock_class(lock, subclass, 0);
+
+	if (!class)
+		return;
+
+	if (class->skip_mask & (1 << RECLAIM_FS))
+		return;
+
+	if (!graph_lock())
+		return;
+
+	class->skip_mask |= 1 << RECLAIM_FS;
+
+	graph_unlock();
+}
+
 static struct pin_cookie __lock_pin_lock(struct lockdep_map *lock)
 {
 	struct pin_cookie cookie = NIL_COOKIE;
@@ -3784,6 +3806,23 @@ int lock_is_held(struct lockdep_map *lock)
 }
 EXPORT_SYMBOL_GPL(lock_is_held);
 
+void lock_skip_reclaim(struct lockdep_map *lock, int subclass)
+{
+	unsigned long flags;
+
+	if (unlikely(current->lockdep_recursion))
+		return;
+
+	raw_local_irq_save(flags);
+	check_flags(flags);
+
+	current->lockdep_recursion = 1;
+	__lock_skip_reclaim(lock, subclass);
+	current->lockdep_recursion = 0;
+	raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lock_skip_reclaim);
+
 struct pin_cookie lock_pin_lock(struct lockdep_map *lock)
 {
 	struct pin_cookie cookie = NIL_COOKIE;


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-17 14:49               ` Peter Zijlstra
@ 2016-05-17 22:35                 ` Dave Chinner
  2016-05-18  7:20                   ` Peter Zijlstra
  2016-05-19  8:11                   ` Peter Zijlstra
  0 siblings, 2 replies; 37+ messages in thread
From: Dave Chinner @ 2016-05-17 22:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Michal Hocko, Ingo Molnar

On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote:
> 
> Thanks for writing all that down Dave!
> 
> On Tue, May 17, 2016 at 09:10:56AM +1000, Dave Chinner wrote:
> 
> > The reason we don't have lock clases for the ilock is that we aren't
> > supposed to call memory reclaim with that lock held in exclusive
> > mode. This is because reclaim can run transactions, and that may
> > need to flush dirty inodes to make progress. Flushing dirty inode
> > requires taking the ilock in shared mode.
> > 
> > In the code path that was reported, we hold the ilock in /shared/
> > mode with no transaction context (we are doing a read-only
> > operation). This means we can run transactions in memory reclaim
> > because a) we can't deadlock on the inode we hold locks on, and b)
> > transaction reservations will be able to make progress as we don't
> > hold any locks it can block on.
> 
> Just to clarify; I read the above as that we cannot block on recursive
> shared locks, is this correct?
> 
> Because we can in fact block on down_read()+down_read() just fine, so if
> you're assuming that, then something's busted.

The transaction reservation path will run down_read_trylock() on the
inode, not down_read(). Hence if there are no pending writers, it
will happily take the lock twice and make progress, otherwise it
will skip the inode and there's no deadlock.  If there's a pending
writer, then we have another context that is already in a
transaction context and has already pushed the item, hence it is
only in the scope of the current push because IO hasn't completed
yet and removed it from the list.
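
In other words, the flush side is shaped something like this (a
simplified sketch, not the actual inode item push code):

#include <linux/errno.h>
#include <linux/rwsem.h>

/* a trylock-based flush cannot deadlock on an ilock that the current
 * context already holds shared higher up the stack */
static int demo_flush_inode(struct rw_semaphore *ilock)
{
	if (!down_read_trylock(ilock))
		return -EAGAIN;		/* pending writer: skip this inode */

	/* ... write back the dirty inode ... */

	up_read(ilock);
	return 0;
}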

> Otherwise, I'm not quite reading it right, which is, given the
> complexity of that stuff, entirely possible.

There's a maze of dark, grue-filled twisty passages here...

> The other possible reading is that we cannot deadlock on the inode we
> hold locks on because we hold a reference on it; and the reference
> avoids the inode from being reclaimed. But then the whole
> shared/exclusive thing doesn't seem to make sense.

Right, because that's not the problem. The issue has to do with
transaction contexts and what locks are safe to hold when calling
xfs_trans_reserve(). Direct reclaim is putting xfs_trans_reserve()
behind memory allocation, which means it is unsafe for XFS to hold
the ilock exclusive or be in an existing transaction context when
doing GFP_KERNEL allocation.

> > For the ilock, the number of places where the ilock is held over
> > GFP_KERNEL allocations is pretty small. Hence we've simply added
> > GFP_NOFS to those allocations to - effectively - annotate those
> > allocations as "lockdep causes problems here". There are probably
> > 30-35 allocations in XFS that explicitly use KM_NOFS - some of these
> > are masking lockdep false positive reports.
> 
> 
> > In the end, like pretty much all the complex lockdep false positives
> > we've had to deal in XFS, we've ended up changing the locking or
> > allocation contexts because that's been far easier than trying to
> > make annotations cover everything or convince other people that
> > lockdep annotations are insufficient.
> 
> Well, I don't mind creating lockdep annotations; but explanations of the
> exact details always go a long way towards helping me come up with
> something.
> 
> While going over the code; I see there's complaining about
> MAX_SUBCLASSES being too small. Would it help if we doubled it? We
> cannot grow the thing without limits, but doubling it should be possible
> I think.

Last time I asked if we could increase MAX_SUBCLASSES I was told
no. So we've just had to try to fit about 30 different
inode lock contexts into 8 subclasses split across multiple class
types (i.e. xfs_[non]dir_ilock_class). I wasted an entire week on
getting those annotations to fit the limitations of lockdep and
still work.

> In any case; would something like this work for you? Its entirely
> untested, but the idea is to mark an entire class to skip reclaim
> validation, instead of marking individual sites.

Probably would, but it seems like swatting a fly with a runaway
train. I'd much prefer a per-site annotation (e.g. as a GFP_ flag)
so that we don't turn off something that will tell us we've made a
mistake while developing new code...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-17 22:35                 ` Dave Chinner
@ 2016-05-18  7:20                   ` Peter Zijlstra
  2016-05-18  8:25                     ` Michal Hocko
  2016-05-19  8:11                   ` Peter Zijlstra
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2016-05-18  7:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Michal Hocko, Ingo Molnar

On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote:
> On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote:
> > 
> > Thanks for writing all that down Dave!
> > 
> > On Tue, May 17, 2016 at 09:10:56AM +1000, Dave Chinner wrote:
> > 
> > > The reason we don't have lock clases for the ilock is that we aren't
> > > supposed to call memory reclaim with that lock held in exclusive
> > > mode. This is because reclaim can run transactions, and that may
> > > need to flush dirty inodes to make progress. Flushing dirty inode
> > > requires taking the ilock in shared mode.
> > > 
> > > In the code path that was reported, we hold the ilock in /shared/
> > > mode with no transaction context (we are doing a read-only
> > > operation). This means we can run transactions in memory reclaim
> > > because a) we can't deadlock on the inode we hold locks on, and b)
> > > transaction reservations will be able to make progress as we don't
> > > hold any locks it can block on.
> > 
> > Just to clarify; I read the above as that we cannot block on recursive
> > shared locks, is this correct?
> > 
> > Because we can in fact block on down_read()+down_read() just fine, so if
> > you're assuming that, then something's busted.
> 
> The transaction reservation path will run down_read_trylock() on the
> inode, not down_read(). Hence if there are no pending writers, it
> will happily take the lock twice and make progress, otherwise it
> will skip the inode and there's no deadlock.  If there's a pending
> writer, then we have another context that is already in a
> transaction context and has already pushed the item, hence it is
> only in the scope of the current push because IO hasn't completed
> yet and removed it from the list.
> 
> > Otherwise, I'm not quite reading it right, which is, given the
> > complexity of that stuff, entirely possible.
> 
> There's a maze of dark, grue-filled twisty passages here...

I'm sure. Thanks for the extra detail; yes the down_read_trylock() thing
will work.

> > In any case; would something like this work for you? Its entirely
> > untested, but the idea is to mark an entire class to skip reclaim
> > validation, instead of marking individual sites.
> 
> Probably would, but it seems like swatting a fly with runaway
> train. I'd much prefer a per-site annotation (e.g. as a GFP_ flag)
> so that we don't turn off something that will tell us we've made a
> mistake while developing new code...

Fair enough; if the mm folks don't object to 'wasting' a GFP flag on
this, the below ought to do, I think.

---
 include/linux/gfp.h      | 9 ++++++++-
 kernel/locking/lockdep.c | 3 +++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..7b5b9db4c821 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -41,6 +41,7 @@ struct vm_area_struct;
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
 #define ___GFP_KSWAPD_RECLAIM	0x2000000u
+#define ___GFP_NOVALIDATE	0x4000000u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -146,6 +147,11 @@ struct vm_area_struct;
  *   return NULL when direct reclaim and memory compaction have failed to allow
  *   the allocation to succeed.  The OOM killer is not called with the current
  *   implementation.
+ *
+ * __GFP_NOVALIDATE: lockdep annotation; do not use this allocation to set
+ *   LOCK_USED_IN_RECLAIM_FS and thereby avoid possible false positives.
+ *   Use of this flag should have a comment explaining the false positive
+ *   in detail.
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)
@@ -155,6 +161,7 @@ struct vm_area_struct;
 #define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)
 #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
 #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)
+#define __GFP_NOVALIDATE ((__force gfp_t)___GFP_NOVALIDATE)
 
 /*
  * Action modifiers
@@ -188,7 +195,7 @@ struct vm_area_struct;
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 26
+#define __GFP_BITS_SHIFT 27
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 81f1a7107c0e..88bf43c3bf76 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2853,6 +2853,9 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (unlikely(!debug_locks))
 		return;
 
+	if (gfp_mask & __GFP_NOVALIDATE)
+		return;
+
 	/* no reclaim without waiting on it */
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
 		return;


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-18  7:20                   ` Peter Zijlstra
@ 2016-05-18  8:25                     ` Michal Hocko
  2016-05-18  9:49                       ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2016-05-18  8:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Ingo Molnar

On Wed 18-05-16 09:20:05, Peter Zijlstra wrote:
> On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote:
> > On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote:
[...]
> > > In any case; would something like this work for you? Its entirely
> > > untested, but the idea is to mark an entire class to skip reclaim
> > > validation, instead of marking individual sites.
> > 
> > Probably would, but it seems like swatting a fly with runaway
> > train. I'd much prefer a per-site annotation (e.g. as a GFP_ flag)
> > so that we don't turn off something that will tell us we've made a
> > mistake while developing new code...
> 
> Fair enough; if the mm folks don't object to 'wasting' a GFP flag on
> this the below ought to do I think.

GFP flag space is quite scarce, especially when the flag would be used
only for lockdep configurations, which are mostly disabled. Why can't we
go with the explicit disable/enable API I have proposed? It would be
contained within lockdep, quite easy to grep for, and git blame would
tell us (hopefully) why lockdep had to be put out of the way for the
particular path.
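
To make that concrete, the usage I have in mind would look roughly like
this (untested sketch; the lockdep_trace_alloc_{disable,enable}() names
are placeholders for the proposed API, not something that exists today,
and the XFS call site is only an example):

	/*
	 * ilock held shared with no transaction context: reclaim
	 * recursion into the fs cannot deadlock here, so tell lockdep
	 * to ignore allocations in this window.
	 */
	lockdep_trace_alloc_disable();
	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_SLEEP);
	lockdep_trace_alloc_enable();
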
-- 
Michal Hocko
SUSE Labs


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-18  8:25                     ` Michal Hocko
@ 2016-05-18  9:49                       ` Peter Zijlstra
  2016-05-18 11:31                         ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2016-05-18  9:49 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Ingo Molnar

On Wed, May 18, 2016 at 10:25:39AM +0200, Michal Hocko wrote:
> On Wed 18-05-16 09:20:05, Peter Zijlstra wrote:
> > On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote:
> > > On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote:
> [...]
> > > > In any case; would something like this work for you? Its entirely
> > > > untested, but the idea is to mark an entire class to skip reclaim
> > > > validation, instead of marking individual sites.
> > > 
> > > Probably would, but it seems like swatting a fly with runaway
> > > train. I'd much prefer a per-site annotation (e.g. as a GFP_ flag)
> > > so that we don't turn off something that will tell us we've made a
> > > mistake while developing new code...
> > 
> > Fair enough; if the mm folks don't object to 'wasting' a GFP flag on
> > this the below ought to do I think.
> 
> GFP flag space is quite scarse. 

There's still 5 or so bits available, and you could always make gfp_t
u64.

> Especially when it would be used only
> for lockdep configurations which are mostly disabled. Why cannot we go
> with an explicit disable/enable API I have proposed? 

It has unbounded scope. And in that respect the GFP flag thingy is wider
than I'd like too; it avoids setting the state for all held locks, even
though we'd only like to avoid setting it for one class.

So ideally we'd combine the GFP flag with the previously proposed skip
flag to only avoid marking the one class while keeping everything
working for all other held locks.



* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-18  9:49                       ` Peter Zijlstra
@ 2016-05-18 11:31                         ` Michal Hocko
  0 siblings, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2016-05-18 11:31 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Ingo Molnar

On Wed 18-05-16 11:49:52, Peter Zijlstra wrote:
> On Wed, May 18, 2016 at 10:25:39AM +0200, Michal Hocko wrote:
> > On Wed 18-05-16 09:20:05, Peter Zijlstra wrote:
> > > On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote:
> > > > On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote:
> > [...]
> > > > > In any case; would something like this work for you? Its entirely
> > > > > untested, but the idea is to mark an entire class to skip reclaim
> > > > > validation, instead of marking individual sites.
> > > > 
> > > > Probably would, but it seems like swatting a fly with runaway
> > > > train. I'd much prefer a per-site annotation (e.g. as a GFP_ flag)
> > > > so that we don't turn off something that will tell us we've made a
> > > > mistake while developing new code...
> > > 
> > > Fair enough; if the mm folks don't object to 'wasting' a GFP flag on
> > > this the below ought to do I think.
> > 
> > GFP flag space is quite scarse. 
> 
> There's still 5 or so bits available, and you could always make gfp_t
> u64.

It seems we have some places where we encode further data into the same
word as gfp_mask (radix tree tags and mapping_flags). From a quick
glance they should be OK even with __GFP_BITS_SHIFT increased to 27, but
this tells us that we shouldn't consume these bits without a good reason.
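
For reference, this is the kind of packing I mean (paraphrased from the
current mm code, so take the exact details with a grain of salt):

/* pagemap.h: the low bits of mapping->flags hold a gfp mask ... */
static inline gfp_t mapping_gfp_mask(struct address_space *mapping)
{
	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
}

/* ... while the AS_* state bits start right above __GFP_BITS_SHIFT */
enum mapping_flags {
	AS_EIO		= __GFP_BITS_SHIFT + 0,	/* IO error on async write */
	AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* ENOSPC on async write */
	/* ... */
};

/* radix-tree.c: the root tags are packed above the gfp bits as well */
static inline void root_tag_set(struct radix_tree_root *root, unsigned tag)
{
	root->gfp_mask |= (__force gfp_t)(1 << (tag + __GFP_BITS_SHIFT));
}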
 
> > Especially when it would be used only
> > for lockdep configurations which are mostly disabled. Why cannot we go
> > with an explicit disable/enable API I have proposed? 
> 
> It has unbounded scope. And in that respect the GFP flag thingy is wider
> than I'd like too, it avoids setting the state for all held locks, even
> though we'd only like to avoid setting it for one class.
>
> So ideally we'd combine the GFP flag with the previously proposed skip
> flag to only avoid marking the one class while keeping everything
> working for all other held locks.

This is definitely your call, but I would prefer starting with something
simple and extending it when we find out that the scope/gfp opt-out
hides real bugs or is insufficient for other reasons. I do not expect
this opt-out to be used much, quite the contrary. We do not hear about
false-positive reclaim lockdep reports very often - except for very
complex reclaim implementations, which are quite uncommon.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-17 22:35                 ` Dave Chinner
  2016-05-18  7:20                   ` Peter Zijlstra
@ 2016-05-19  8:11                   ` Peter Zijlstra
  2016-05-20  0:17                     ` Dave Chinner
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2016-05-19  8:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Michal Hocko, Ingo Molnar

On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote:
> On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote:
> > On Tue, May 17, 2016 at 09:10:56AM +1000, Dave Chinner wrote:
> > 
> > > The reason we don't have lock clases for the ilock is that we aren't
> > > supposed to call memory reclaim with that lock held in exclusive
> > > mode. This is because reclaim can run transactions, and that may
> > > need to flush dirty inodes to make progress. Flushing dirty inode
> > > requires taking the ilock in shared mode.
> > > 
> > > In the code path that was reported, we hold the ilock in /shared/
> > > mode with no transaction context (we are doing a read-only
> > > operation). This means we can run transactions in memory reclaim
> > > because a) we can't deadlock on the inode we hold locks on, and b)
> > > transaction reservations will be able to make progress as we don't
> > > hold any locks it can block on.
> > 
> > Just to clarify; I read the above as that we cannot block on recursive
> > shared locks, is this correct?
> > 
> > Because we can in fact block on down_read()+down_read() just fine, so if
> > you're assuming that, then something's busted.
> 
> The transaction reservation path will run down_read_trylock() on the
> inode, not down_read(). Hence if there are no pending writers, it
> will happily take the lock twice and make progress, otherwise it
> will skip the inode and there's no deadlock.  If there's a pending
> writer, then we have another context that is already in a
> transaction context and has already pushed the item, hence it is
> only in the scope of the current push because IO hasn't completed
> yet and removed it from the list.
> 
> > Otherwise, I'm not quite reading it right, which is, given the
> > complexity of that stuff, entirely possible.
> 
> There's a maze of dark, grue-filled twisty passages here...

OK; I might need a bit more again.

So now the code does something like:

	down_read(&i_lock);		-- lockdep marks lock as held
	kmalloc(GFP_KERNEL);		-- lockdep marks held locks as ENABLED_RECLAIM_FS
	  --> reclaim()
	     down_read_trylock(&i_lock); -- lockdep does _NOT_ mark as USED_IN_RECLAIM_FS

Right?

My 'problem' is that lockdep doesn't consider a trylock for the USED_IN
annotation, so the i_lock class will only get the ENABLED tag but not
get the USED_IN tag, and therefore _should_ not trigger the inversion.


So what exactly is triggering the inversion?


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-19  8:11                   ` Peter Zijlstra
@ 2016-05-20  0:17                     ` Dave Chinner
  2016-06-01 13:17                       ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2016-05-20  0:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Michal Hocko, Ingo Molnar

On Thu, May 19, 2016 at 10:11:46AM +0200, Peter Zijlstra wrote:
> On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote:
> > On Tue, May 17, 2016 at 04:49:12PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 17, 2016 at 09:10:56AM +1000, Dave Chinner wrote:
> > > 
> > > > The reason we don't have lock clases for the ilock is that we aren't
> > > > supposed to call memory reclaim with that lock held in exclusive
> > > > mode. This is because reclaim can run transactions, and that may
> > > > need to flush dirty inodes to make progress. Flushing dirty inode
> > > > requires taking the ilock in shared mode.
> > > > 
> > > > In the code path that was reported, we hold the ilock in /shared/
> > > > mode with no transaction context (we are doing a read-only
> > > > operation). This means we can run transactions in memory reclaim
> > > > because a) we can't deadlock on the inode we hold locks on, and b)
> > > > transaction reservations will be able to make progress as we don't
> > > > hold any locks it can block on.
> > > 
> > > Just to clarify; I read the above as that we cannot block on recursive
> > > shared locks, is this correct?
> > > 
> > > Because we can in fact block on down_read()+down_read() just fine, so if
> > > you're assuming that, then something's busted.
> > 
> > The transaction reservation path will run down_read_trylock() on the
> > inode, not down_read(). Hence if there are no pending writers, it
> > will happily take the lock twice and make progress, otherwise it
> > will skip the inode and there's no deadlock.  If there's a pending
> > writer, then we have another context that is already in a
> > transaction context and has already pushed the item, hence it is
> > only in the scope of the current push because IO hasn't completed
> > yet and removed it from the list.
> > 
> > > Otherwise, I'm not quite reading it right, which is, given the
> > > complexity of that stuff, entirely possible.
> > 
> > There's a maze of dark, grue-filled twisty passages here...
> 
> OK; I might need a bit more again.
> 
> So now the code does something like:
> 
> 	down_read(&i_lock);		-- lockdep marks lock as held
> 	kmalloc(GFP_KERNEL);		-- lockdep marks held locks as ENABLED_RECLAIM_FS
> 	  --> reclaim()
> 	     down_read_trylock(&i_lock); -- lockdep does _NOT_ mark as USED_IN_RECLAIM_FS
> 
> Right?

In the path that can deadlock the log, yes. It's actually way more
complex than the above, because the down_read_trylock(&i_lock) that
matters is run in a completely separate, async kthread that
xfs_trans_reserve() will block waiting for.

process context				xfsaild kthread(*)
---------------				------------------
down_read(&i_lock);		-- lockdep marks lock as held
kmalloc(GFP_KERNEL);		-- lockdep marks held locks as ENABLED_RECLAIM_FS
  --> reclaim()
     xfs_trans_reserve()
     ....
	  xfs_trans_push_ail()	---- called if no space in the log to kick the xfsaild into action
	  ....
       xlog_grant_head_wait()	---- blocks waiting for log space
       .....

					xfsaild_push()   ----- iterates AIL
					  grabs log item
					    lock log item
	>>>>>>>>>>>>>>>>>>>>>		      down_read_trylock(&i_lock);
					      format item into buffer
					      add to dirty buffer list
					  ....
					  submit dirty buffer list for IO
					    buffer IO started
					.....
					<async IO completion context>
					buffer callbacks
					  mark inode clean
					  remove inode from AIL
					  move tail of log forwards
					    wake grant head waiters
	<woken by log tail moving>
	<log space available>
	transaction reservation granted
     .....
     down_write(some other inode ilock)
     <modify some other inode>
     xfs_trans_commit
     .....

(*) xfsaild runs with PF_MEMALLOC context.

The problem is that if the ilock is held exclusively at GFP_KERNEL
time, the xfsaild cannot lock the inode to flush it, so if that
inode pins the tail of the log then we can't make space available
for xfs_trans_reserve and there is the deadlock.

Once xfs_trans_reserve completes, however, we'll take the ilock on
*some other inode*, and that's where the "it can't be the inode we
currently hold locked because we have references to it" reasoning
applies, and hence it's safe to have a pattern like:

down_read(&i_lock);		-- lockdep marks lock as held
kmalloc(GFP_KERNEL);		-- lockdep marks held locks as ENABLED_RECLAIM_FS
  --> reclaim()
    down_write(&ilock)

because the lock within reclaim context is completely unrelated to
the lock we already hold.

Lockdep can't possibly know about this because the deadlock involves
locking contexts that *aren't doing anything wrong within their own
contexts*. It's only when you add the dependency on the log space
reservation needed to make forwards progress that locking and reclaim
become an issue.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-05-20  0:17                     ` Dave Chinner
@ 2016-06-01 13:17                       ` Michal Hocko
  2016-06-01 18:16                         ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2016-06-01 13:17 UTC (permalink / raw)
  To: Peter Zijlstra, Dave Chinner
  Cc: linux-mm, Qu Wenruo, xfs, Ingo Molnar, Darrick J. Wong

Thanks, Dave, for your detailed explanation again! Peter, do you have
any other ideas on how to deal with these situations, other than opting
out of the lockdep reclaim machinery?

If not, I would rather go with an annotation than a gfp flag, to be
honest, but if you absolutely hate that approach then I will try to
check whether a CONFIG_LOCKDEP GFP_FOO doesn't break something else.
Otherwise I would steal the description from Dave's email and repost my
patch.

I plan to repost my scope gfp patches in a few days, and it would be
good to have some mechanism for dropping those GFP_NOFS uses that exist
only to paper over lockdep false positives.

[keeping Dave's explanation for reference]

On Fri 20-05-16 10:17:14, Dave Chinner wrote:
> On Thu, May 19, 2016 at 10:11:46AM +0200, Peter Zijlstra wrote:
> > On Wed, May 18, 2016 at 08:35:49AM +1000, Dave Chinner wrote:
[...]
> > > There's a maze of dark, grue-filled twisty passages here...
> > 
> > OK; I might need a bit more again.
> > 
> > So now the code does something like:
> > 
> > 	down_read(&i_lock);		-- lockdep marks lock as held
> > 	kmalloc(GFP_KERNEL);		-- lockdep marks held locks as ENABLED_RECLAIM_FS
> > 	  --> reclaim()
> > 	     down_read_trylock(&i_lock); -- lockdep does _NOT_ mark as USED_IN_RECLAIM_FS
> > 
> > Right?
> 
> In the path that can deadlock the log, yes. It's actually way more
> complex than the above, because the down_read_trylock(&i_lock) that
> matters is run in a completely separate, async kthread that
> xfs_trans_reserve() will block waiting for.
> 
> process context				xfsaild kthread(*)
> ---------------				------------------
> down_read(&i_lock);		-- lockdep marks lock as held
> kmalloc(GFP_KERNEL);		-- lockdep marks held locks as ENABLED_RECLAIM_FS
>   --> reclaim()
>      xfs_trans_reserve()
>      ....
> 	  xfs_trans_push_ail()	---- called if no space in the log to kick the xfsaild into action
> 	  ....
>        xlog_grant_head_wait()	---- blocks waiting for log space
>        .....
> 
> 					xfsaild_push()   ----- iterates AIL
> 					  grabs log item
> 					    lock log item
> 	>>>>>>>>>>>>>>>>>>>>>		      down_read_trylock(&i_lock);
> 					      format item into buffer
> 					      add to dirty buffer list
> 					  ....
> 					  submit dirty buffer list for IO
> 					    buffer IO started
> 					.....
> 					<async IO completion context>
> 					buffer callbacks
> 					  mark inode clean
> 					  remove inode from AIL
> 					  move tail of log forwards
> 					    wake grant head waiters
> 	<woken by log tail moving>
> 	<log space available>
> 	transaction reservation granted
>      .....
>      down_write(some other inode ilock)
>      <modify some other inode>
>      xfs_trans_commit
>      .....
> 
> (*) xfsaild runs with PF_MEMALLOC context.
> 
> The problem is that if the ilock is held exclusively at GFP_KERNEL
> time, the xfsaild cannot lock the inode to flush it, so if that
> inode pins the tail of the log then we can't make space available
> for xfs_trans_reserve and there is the deadlock.
> 
> Once xfs_trans_reserve completes, however, we'll take the ilock on
> *some other inode*, and that's where the "it can't be the inode we
> currently hold locked because we have references to it" and
> henceit's safe to have a pattern like:
> 
> down_read(&i_lock);		-- lockdep marks lock as held
> kmalloc(GFP_KERNEL);		-- lockdep marks held locks as ENABLED_RECLAIM_FS
>   --> reclaim()
>     down_write(&ilock)
> 
> because the lock within reclaim context is completely unrelated to
> the lock we already hold.
> 
> Lockdep can't possibly know about this because the deadlock involves
> locking contexts that *aren't doing anything wrong within their own
> contexts*. It's only when you add the dependency of log space
> reservation requirements needed to make forwards progress that
> there's then an issue with locking and reclaim.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Michal Hocko
SUSE Labs


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-01 13:17                       ` Michal Hocko
@ 2016-06-01 18:16                         ` Peter Zijlstra
  2016-06-02 14:50                           ` Michal Hocko
  2016-10-06 13:04                           ` Michal Hocko
  0 siblings, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2016-06-01 18:16 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Ingo Molnar

On Wed, Jun 01, 2016 at 03:17:58PM +0200, Michal Hocko wrote:
> Thanks Dave for your detailed explanation again! Peter do you have any
> other idea how to deal with these situations other than opt out from
> lockdep reclaim machinery?
> 
> If not I would rather go with an annotation than a gfp flag to be honest
> but if you absolutely hate that approach then I will try to check wheter
> a CONFIG_LOCKDEP GFP_FOO doesn't break something else. Otherwise I would
> steal the description from Dave's email and repost my patch.
> 
> I plan to repost my scope gfp patches in few days and it would be good
> to have some mechanism to drop those GFP_NOFS to paper over lockdep
> false positives for that.

Right; sorry, I got side-tracked by other things again.

So my favourite is the dedicated GFP flag, but if that's unpalatable for
the mm folks then something like the below might work. It should be
similar in effect to your proposal, except it's more limited in scope.

---
 include/linux/gfp.h      |  5 ++++-
 include/linux/lockdep.h  |  2 ++
 kernel/locking/lockdep.c | 36 ++++++++++++++++++++++++++++++++++++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..d6be35643ee7 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -187,7 +187,10 @@ struct vm_area_struct;
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)
 
-/* Room for N __GFP_FOO bits */
+/*
+ * Room for N __GFP_FOO bits.
+ * Fix lockdep's __GFP_SKIP_ALLOC if this ever hits 32.
+ */
 #define __GFP_BITS_SHIFT 26
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index eabe0138eb06..08a021b1e275 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -354,6 +354,7 @@ static inline void lock_set_subclass(struct lockdep_map *lock,
 
 extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask);
 extern void lockdep_clear_current_reclaim_state(void);
+extern void lockdep_skip_alloc(void);
 extern void lockdep_trace_alloc(gfp_t mask);
 
 struct pin_cookie { unsigned int val; };
@@ -398,6 +399,7 @@ static inline void lockdep_on(void)
 # define lock_set_subclass(l, s, i)		do { } while (0)
 # define lockdep_set_current_reclaim_state(g)	do { } while (0)
 # define lockdep_clear_current_reclaim_state()	do { } while (0)
+# define lockdep_skip_alloc()			do { } while (0)
 # define lockdep_trace_alloc(g)			do { } while (0)
 # define lockdep_info()				do { } while (0)
 # define lockdep_init_map(lock, name, key, sub) \
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 589d763a49b3..aa3ccbadc74e 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2851,6 +2851,13 @@ void trace_softirqs_off(unsigned long ip)
 		debug_atomic_inc(redundant_softirqs_off);
 }
 
+#define __GFP_SKIP_ALLOC (1UL << __GFP_BITS_SHIFT)
+
+static void __lockdep_skip_alloc(void)
+{
+	current->lockdep_reclaim_gfp |= __GFP_SKIP_ALLOC;
+}
+
 static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 {
 	struct task_struct *curr = current;
@@ -2876,11 +2883,36 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
 		return;
 
+	/*
+	 * Skip _one_ allocation as per the lockdep_skip_alloc() request.
+	 * Must be done last so that we don't lose the annotation for
+	 * GFP_ATOMIC-like things from IRQ or other nesting contexts.
+	 */
+	if (current->lockdep_reclaim_gfp & __GFP_SKIP_ALLOC) {
+		current->lockdep_reclaim_gfp &= ~__GFP_SKIP_ALLOC;
+		return;
+	}
+
 	mark_held_locks(curr, RECLAIM_FS);
 }
 
 static void check_flags(unsigned long flags);
 
+void lockdep_skip_alloc(void)
+{
+	unsigned long flags;
+
+	if (unlikely(current->lockdep_recursion))
+		return;
+
+	raw_local_irq_save(flags);
+	check_flags(flags);
+	current->lockdep_recursion = 1;
+	__lockdep_skip_alloc();
+	current->lockdep_recursion = 0;
+	raw_local_irq_restore(flags);
+}
+
 void lockdep_trace_alloc(gfp_t gfp_mask)
 {
 	unsigned long flags;
@@ -3015,6 +3047,10 @@ static inline int separate_irq_context(struct task_struct *curr,
 	return 0;
 }
 
+void lockdep_skip_alloc(void)
+{
+}
+
 void lockdep_trace_alloc(gfp_t gfp_mask)
 {
 }
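
A call site would then look something like this (untested; the XFS call
site is purely illustrative):

	/*
	 * ilock held shared with no transaction context; reclaim
	 * recursion is safe here, so skip reclaim validation for this
	 * one allocation.
	 */
	lockdep_skip_alloc();
	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_SLEEP);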


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-01 18:16                         ` Peter Zijlstra
@ 2016-06-02 14:50                           ` Michal Hocko
  2016-06-02 15:11                             ` Peter Zijlstra
  2016-10-06 13:04                           ` Michal Hocko
  1 sibling, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2016-06-02 14:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Ingo Molnar

On Wed 01-06-16 20:16:17, Peter Zijlstra wrote:
> On Wed, Jun 01, 2016 at 03:17:58PM +0200, Michal Hocko wrote:
> > Thanks Dave for your detailed explanation again! Peter do you have any
> > other idea how to deal with these situations other than opt out from
> > lockdep reclaim machinery?
> > 
> > If not I would rather go with an annotation than a gfp flag to be honest
> > but if you absolutely hate that approach then I will try to check wheter
> > a CONFIG_LOCKDEP GFP_FOO doesn't break something else. Otherwise I would
> > steal the description from Dave's email and repost my patch.
> > 
> > I plan to repost my scope gfp patches in few days and it would be good
> > to have some mechanism to drop those GFP_NOFS to paper over lockdep
> > false positives for that.
> 
> Right; sorry I got side-tracked in other things again.
> 
> So my favourite is the dedicated GFP flag, but if that's unpalatable for
> the mm folks then something like the below might work. It should be
> similar in effect to your proposal, except its more limited in scope.
[...]
> @@ -2876,11 +2883,36 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
>  	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
>  		return;
>  
> +	/*
> +	 * Skip _one_ allocation as per the lockdep_skip_alloc() request.
> +	 * Must be done last so that we don't loose the annotation for
> +	 * GFP_ATOMIC like things from IRQ or other nesting contexts.
> +	 */
> +	if (current->lockdep_reclaim_gfp & __GFP_SKIP_ALLOC) {
> +		current->lockdep_reclaim_gfp &= ~__GFP_SKIP_ALLOC;
> +		return;
> +	}
> +
>  	mark_held_locks(curr, RECLAIM_FS);
>  }

I might be missing something, but does this actually work? Say you
want a kmalloc(size); it would call
slab_alloc_node
  slab_pre_alloc_hook
    lockdep_trace_alloc
[...]
  ____cache_alloc_node
    cache_grow_begin
      kmem_getpages
        __alloc_pages_node
	  __alloc_pages_nodemask
	    lockdep_trace_alloc

I understand your concerns about the scope, but usually either all
allocations in the same scope have to be __GFP_NOFS or none of them do,
so I wouldn't see it as a huge deal.
-- 
Michal Hocko
SUSE Labs


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-02 14:50                           ` Michal Hocko
@ 2016-06-02 15:11                             ` Peter Zijlstra
  2016-06-02 15:46                               ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2016-06-02 15:11 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Ingo Molnar

On Thu, Jun 02, 2016 at 04:50:49PM +0200, Michal Hocko wrote:
> On Wed 01-06-16 20:16:17, Peter Zijlstra wrote:

> > So my favourite is the dedicated GFP flag, but if that's unpalatable for
> > the mm folks then something like the below might work. It should be
> > similar in effect to your proposal, except its more limited in scope.
> [...]
> > @@ -2876,11 +2883,36 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
> >  	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
> >  		return;
> >  
> > +	/*
> > +	 * Skip _one_ allocation as per the lockdep_skip_alloc() request.
> > +	 * Must be done last so that we don't loose the annotation for
> > +	 * GFP_ATOMIC like things from IRQ or other nesting contexts.
> > +	 */
> > +	if (current->lockdep_reclaim_gfp & __GFP_SKIP_ALLOC) {
> > +		current->lockdep_reclaim_gfp &= ~__GFP_SKIP_ALLOC;
> > +		return;
> > +	}
> > +
> >  	mark_held_locks(curr, RECLAIM_FS);
> >  }
> 
> I might be missing something but does this work actually? Say you would
> want a kmalloc(size), it would call
> slab_alloc_node
>   slab_pre_alloc_hook
>     lockdep_trace_alloc
> [...]
>   ____cache_alloc_node
>     cache_grow_begin
>       kmem_getpages
>         __alloc_pages_node
> 	  __alloc_pages_nodemask
> 	    lockdep_trace_alloc

Bugger :/ You're right, that would fail.

So how about doing:

#define __GFP_NOLOCKDEP	(1u << __GFP_BITS_SHIFT)

this means it cannot be part of address_space::flags or
radix_tree_root::gfp_mask, but that might not be a bad thing.

And this solves the scarcity thing, because per pagemap we need to have
5 'spare' bits anyway.
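
Roughly, the consumer side would then be (untested sketch, mirroring the
__GFP_NOVALIDATE hunk from earlier in the thread, just with the bit
placed above __GFP_BITS_MASK):

/* lives above __GFP_BITS_MASK, so it cannot collide with real gfp bits */
#define __GFP_NOLOCKDEP	((__force gfp_t)(1u << __GFP_BITS_SHIFT))

static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
{
	if (unlikely(!debug_locks))
		return;

	/* this allocation explicitly opted out of reclaim validation */
	if (gfp_mask & __GFP_NOLOCKDEP)
		return;

	/* ... existing checks, then mark_held_locks(curr, RECLAIM_FS) ... */
}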

> I understand your concerns about the scope but usually all allocations
> have to be __GFP_NOFS or none in the same scope so I would see it as a
> huge deal.

With scope I mostly meant the fact that you have two calls that you need
to pair up. That's not really nice as you can 'annotate' a _lot_ of code
in between. I prefer the narrower annotations where you annotate a
single specific site.


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-02 15:11                             ` Peter Zijlstra
@ 2016-06-02 15:46                               ` Michal Hocko
  2016-06-02 23:22                                 ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2016-06-02 15:46 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Qu Wenruo, Darrick J. Wong, xfs, linux-mm, Ingo Molnar

On Thu 02-06-16 17:11:16, Peter Zijlstra wrote:
> On Thu, Jun 02, 2016 at 04:50:49PM +0200, Michal Hocko wrote:
> > On Wed 01-06-16 20:16:17, Peter Zijlstra wrote:
> 
> > > So my favourite is the dedicated GFP flag, but if that's unpalatable for
> > > the mm folks then something like the below might work. It should be
> > > similar in effect to your proposal, except its more limited in scope.
> > [...]
> > > @@ -2876,11 +2883,36 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
> > >  	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
> > >  		return;
> > >  
> > > +	/*
> > > +	 * Skip _one_ allocation as per the lockdep_skip_alloc() request.
> > > +	 * Must be done last so that we don't loose the annotation for
> > > +	 * GFP_ATOMIC like things from IRQ or other nesting contexts.
> > > +	 */
> > > +	if (current->lockdep_reclaim_gfp & __GFP_SKIP_ALLOC) {
> > > +		current->lockdep_reclaim_gfp &= ~__GFP_SKIP_ALLOC;
> > > +		return;
> > > +	}
> > > +
> > >  	mark_held_locks(curr, RECLAIM_FS);
> > >  }
> > 
> > I might be missing something but does this work actually? Say you would
> > want a kmalloc(size), it would call
> > slab_alloc_node
> >   slab_pre_alloc_hook
> >     lockdep_trace_alloc
> > [...]
> >   ____cache_alloc_node
> >     cache_grow_begin
> >       kmem_getpages
> >         __alloc_pages_node
> > 	  __alloc_pages_nodemask
> > 	    lockdep_trace_alloc
> 
> Bugger :/ You're right, that would fail.
> 
> So how about doing:
> 
> #define __GFP_NOLOCKDEP	(1u << __GFP_BITS_SHIFT)

Hmm, now that I have looked closer, this would break GFP_SLAB_BUG_MASK :/
The whole thing is a bit hysterical because I really do not see any
reason to blow up just because somebody has used an incorrect gfp mask
(we have users in the tree who pass combinations that make no sense...)

We can fix that either by dropping the whole GFP_SLAB_BUG_MASK thingy
or by updating it to cope with __GFP_NOLOCKDEP. It just shows how this
might get really tricky and subtle.
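
For context, the check in question is roughly this (paraphrased; the
exact definition and behaviour differ between slab allocators and kernel
versions). A bit at 1u << __GFP_BITS_SHIFT falls into ~__GFP_BITS_MASK
and so trips it:

/* gfp.h: flags the slab allocators refuse to see */
#define GFP_SLAB_BUG_MASK	(__GFP_DMA32 | __GFP_HIGHMEM | ~__GFP_BITS_MASK)

	/* in the slab page allocation path */
	if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
		pr_emerg("Unexpected gfp: %#x\n", flags & GFP_SLAB_BUG_MASK);
		BUG();
	}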

> this means it cannot be part of address_space::flags or
> radix_tree_root::gfp_mask, but that might not be a bad thing.

True, those shouldn't really care.

> And this solves the scarcity thing, because per pagemap we need to have
> 5 'spare' bits anyway.
> 
> > I understand your concerns about the scope but usually all allocations
> > have to be __GFP_NOFS or none in the same scope so I would see it as a
> > huge deal.
> 
> With scope I mostly meant the fact that you have two calls that you need
> to pair up. That's not really nice as you can 'annotate' a _lot_ of code
> in between. I prefer the narrower annotations where you annotate a
> single specific site.

Yes, I can see your point. What I meant to say is that we would most
probably end up with the following pattern:
	lockdep_trace_alloc_enable()
	some_foo_with_alloc(gfp_mask);
	lockdep_trace_alloc_disable()

and some_foo_with_alloc might be a lot of code. But at the same time we
know that _any_ allocation done from that context is safe from the
reclaim recursiveness POV. If not, the annotation is buggy and needs to
be done at a different level, but that would be exactly the same if we
did some_foo_with_alloc(gfp_mask|__GFP_NOLOCKDEP), because all the
allocations down that road would reuse the same gfp mask anyway.

That being said, I completely agree that a single entry point is much
less error prone, but it is also tricky, as we can see. So I would
rather go with something less tricky. It's not like people are not used
to the enable/disable pattern.

Anyway, I will leave the decision to you. If you really insist on
__GFP_NOLOCKDEP, which doesn't consume a new flag, then I can review the
resulting patch.
-- 
Michal Hocko
SUSE Labs


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-02 15:46                               ` Michal Hocko
@ 2016-06-02 23:22                                 ` Dave Chinner
  2016-06-06 12:20                                   ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2016-06-02 23:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Peter Zijlstra, Darrick J. Wong, xfs, Qu Wenruo, Ingo Molnar

On Thu, Jun 02, 2016 at 05:46:19PM +0200, Michal Hocko wrote:
> On Thu 02-06-16 17:11:16, Peter Zijlstra wrote:
> > With scope I mostly meant the fact that you have two calls that you need
> > to pair up. That's not really nice as you can 'annotate' a _lot_ of code
> > in between. I prefer the narrower annotations where you annotate a
> > single specific site.
> 
> Yes, I can see you point. What I meant to say is that we would most
> probably end up with the following pattern
> 	lockdep_trace_alloc_enable()
> 	some_foo_with_alloc(gfp_mask);
> 	lockdep_trace_alloc_disable()
>
> and some_foo_with_alloc might be a lot of code.

That's the problem I see with this - the only way to make it
maintainable is to precede each enable/disable() pair with a comment
explaining *exactly* what those calls are protecting.  And that, in
itself, becomes a maintenance problem, because then code several
layers deep has no idea what context it is being called from and we
are likely to disable warnings in contexts where we probably
shouldn't be.

I think such an annotation approach really requires per-allocation-site
annotation, so that the reason for it is obvious from the context.
e.g. any function that does a memory allocation and takes an optional
transaction context needs annotation. Hence, from an XFS perspective, I
think it makes more sense to add a new KM_ flag to indicate this call
site requirement, then jump through whatever lockdep hoop is required
within the kmem_* allocation wrappers. e.g., we can ignore the new KM_*
flag if we are in a transaction context, and so the flag is only
activated in the situations where we currently enforce an external
GFP_NOFS context from the call site.....
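
Something like this, perhaps (entirely untested; KM_NOLOCKDEP is a
hypothetical flag, and __GFP_NOLOCKDEP stands in for whatever lockdep
mechanism ends up being merged):

/* fs/xfs/kmem.h: hypothetical new flag, value chosen for illustration only */
#define KM_NOLOCKDEP	((__force xfs_km_flags_t)0x0010u)

/* annotated call site; the comment documents why reclaim recursion is safe */
cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_NOFS | KM_NOLOCKDEP);

/* fs/xfs/kmem.c: inside the allocation wrappers */
gfp_t	lflags = kmem_flags_convert(flags);

/*
 * Only suppress reclaim validation outside transaction context; inside
 * a transaction the normal GFP_NOFS rules already apply.
 */
if ((flags & KM_NOLOCKDEP) && !(current->flags & PF_FSTRANS))
	lflags |= __GFP_NOLOCKDEP;

ptr = kmalloc(size, lflags);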

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-02 23:22                                 ` Dave Chinner
@ 2016-06-06 12:20                                   ` Michal Hocko
  2016-06-15  7:21                                     ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2016-06-06 12:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, Peter Zijlstra, Darrick J. Wong, xfs, Qu Wenruo, Ingo Molnar

On Fri 03-06-16 09:22:54, Dave Chinner wrote:
> On Thu, Jun 02, 2016 at 05:46:19PM +0200, Michal Hocko wrote:
> > On Thu 02-06-16 17:11:16, Peter Zijlstra wrote:
> > > With scope I mostly meant the fact that you have two calls that you need
> > > to pair up. That's not really nice as you can 'annotate' a _lot_ of code
> > > in between. I prefer the narrower annotations where you annotate a
> > > single specific site.
> > 
> > Yes, I can see you point. What I meant to say is that we would most
> > probably end up with the following pattern
> > 	lockdep_trace_alloc_enable()
> > 	some_foo_with_alloc(gfp_mask);
> > 	lockdep_trace_alloc_disable()
> >
> > and some_foo_with_alloc might be a lot of code.
> 
> That's the problem I see with this - the only way to make it
> maintainable is to precede each enable/disable() pair with a comment
> explaining *exactly* what those calls are protecting.  And that, in
> itself, becomes a maintenance problem, because then code several
> layers deep has no idea what context it is being called from and we
> are likely to disable warnings in contexts where we probably
> shouldn't be.

I am not sure I understand what you mean here. I thought the problem is
that:

func_A (!trans. context)		func_B (trans. context)
 foo1()					  foo2()
   bar(inode, GFP_KERNEL)		    bar(inode, GFP_NOFS)

so bar(inode, gfp) can be called from two different contexts, which
would confuse lockdep. And the workaround would be annotating bar
depending on the context it is called from - either pass a special gfp
flag or do the disable/enable thing. In both cases that annotation
should be global for the whole func_A, no? Or is it possible that
something in that path would really need reclaim lockdep detection?

> I think such an annotation approach really requires per-alloc site
> annotation, the reason for it should be more obvious from the
> context. e.g. any function that does memory alloc and takes an
> optional transaction context needs annotation. Hence, from an XFS
> perspective, I think it makes more sense to add a new KM_ flag to
> indicate this call site requirement, then jump through whatever
> lockdep hoop is required within the kmem_* allocation wrappers.
> e.g, we can ignore the new KM_* flag if we are in a transaction
> context and so the flag is only activated in the situations were
> we currently enforce an external GFP_NOFS context from the call
> site.....

Hmm, I thought we would achieve this by using the scope GFP_NOFS usage,
which would mark those transaction-related contexts, and no
lockdep-specific workarounds would be needed...

-- 
Michal Hocko
SUSE Labs


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-06 12:20                                   ` Michal Hocko
@ 2016-06-15  7:21                                     ` Dave Chinner
  2016-06-21 14:26                                       ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2016-06-15  7:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Peter Zijlstra, Darrick J. Wong, xfs, Qu Wenruo, Ingo Molnar

[sorry for the slow response - been on holidays]

On Mon, Jun 06, 2016 at 02:20:22PM +0200, Michal Hocko wrote:
> On Fri 03-06-16 09:22:54, Dave Chinner wrote:
> > On Thu, Jun 02, 2016 at 05:46:19PM +0200, Michal Hocko wrote:
> > > On Thu 02-06-16 17:11:16, Peter Zijlstra wrote:
> > > > With scope I mostly meant the fact that you have two calls that you need
> > > > to pair up. That's not really nice as you can 'annotate' a _lot_ of code
> > > > in between. I prefer the narrower annotations where you annotate a
> > > > single specific site.
> > > 
> > > Yes, I can see you point. What I meant to say is that we would most
> > > probably end up with the following pattern
> > > 	lockdep_trace_alloc_enable()
> > > 	some_foo_with_alloc(gfp_mask);
> > > 	lockdep_trace_alloc_disable()
> > >
> > > and some_foo_with_alloc might be a lot of code.
> > 
> > That's the problem I see with this - the only way to make it
> > maintainable is to precede each enable/disable() pair with a comment
> > explaining *exactly* what those calls are protecting.  And that, in
> > itself, becomes a maintenance problem, because then code several
> > layers deep has no idea what context it is being called from and we
> > are likely to disable warnings in contexts where we probably
> > shouldn't be.
> 
> I am not sure I understand what you mean here. I thought the problem is
> that:
> 
> func_A (!trans. context)		func_B (trans. context)
>  foo1()					  foo2()
>    bar(inode, GFP_KERNEL)		    bar(inode, GFP_NOFS)
> 
> so bar(inode, gfp) can be called from two different contexts which
> would confuse the lockdep.

Yes, that's the core of the problem. What I think you are missing is
the scale of the problem.

> And the workaround would be annotating bar
> depending on the context it is called from - either pass a special gfp
> flag or do disable/enable thing. In both cases that anotation should be
> global for the whole func_A, no? Or is it possible that something in
> that path would really need a reclaim lockdep detection?

The problem is that there are cases where the call stack that leads
to bar() has many different entry points. See, for example, the
xfs_bmapi*() interfaces.  They all end up in the same low level
btree traversal code (and hence memory allocation points).
xfs_bmapi_read() can be called from both inside and outside
transaction context and there's ~30 callers we'd have to audit and
annotate. Then there's ~10 callers of xfs_bmapi_write which are all
within transaction context.

And then there's xfs_bmapi_delay(), which can end up in the same
low level code outside transaction context. Then there's
10 callers of xfs_bunmapi(), which runs both inside and outside
transaction context, too. Add to that all the miscellaneous points
that can read extents off disk, and you get another ~12 entry
points.

Hopefully you can see the complexity of the issue - for an allocation
in the bmap btree code that could occur both inside and outside of a
transaction context, we've got to work out which of those ~60 high
level entry points would need to be annotated. And then we have to
ensure that in future we don't miss adding or removing an annotation
as we change the code deep inside the btree implementation. It's the
latter that is the long term maintenance problem the high-level
annotation approach introduces.

> > I think such an annotation approach really requires per-alloc site
> > annotation, the reason for it should be more obvious from the
> > context. e.g. any function that does memory alloc and takes an
> > optional transaction context needs annotation. Hence, from an XFS
> > perspective, I think it makes more sense to add a new KM_ flag to
> > indicate this call site requirement, then jump through whatever
> > lockdep hoop is required within the kmem_* allocation wrappers.
> > e.g, we can ignore the new KM_* flag if we are in a transaction
> > context and so the flag is only activated in the situations were
> > we currently enforce an external GFP_NOFS context from the call
> > site.....
> 
> Hmm, I thought we would achive this by using the scope GFP_NOFS usage
> which would mark those transaction related conctexts and no lockdep
> specific workarounds would be needed...

There are allocations outside transaction context which need to be
GFP_NOFS - this is what KM_NOFS was originally intended for. We need to
disambiguate the cases where we use KM_NOFS to shut up lockdep from the
cases where it is necessary to prevent reclaim deadlocks.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-15  7:21                                     ` Dave Chinner
@ 2016-06-21 14:26                                       ` Michal Hocko
  2016-06-22  1:03                                         ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2016-06-21 14:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, Peter Zijlstra, Darrick J. Wong, xfs, Qu Wenruo, Ingo Molnar

On Wed 15-06-16 17:21:54, Dave Chinner wrote:
[...]
> Hopefully you can see the complexity of the issue - for an allocation
> in the bmap btree code that could occur outside both inside and
> outside of a transaction context, we've got to work out which of
> those ~60 high level entry points would need to be annotated. And
> then we have to ensure that in future we don't miss adding or
> removing an annotation as we change the code deep inside the btree
> implementation. It's the latter that is the long term maintainence
> problem the hihg-level annotation approach introduces.

Sure, I can see the complexity here. I might still be oversimplifying
this, but I originally thought that the annotation would be used at the
highest level which never gets called from a transaction or other NOFS
context. So all the layers down would inherit it automatically. I guess
that such a place can be identified from the lockdep report by a
trained eye.
 
> > > I think such an annotation approach really requires per-alloc site
> > > annotation, the reason for it should be more obvious from the
> > > context. e.g. any function that does memory alloc and takes an
> > > optional transaction context needs annotation. Hence, from an XFS
> > > perspective, I think it makes more sense to add a new KM_ flag to
> > > indicate this call site requirement, then jump through whatever
> > > lockdep hoop is required within the kmem_* allocation wrappers.
> > > e.g, we can ignore the new KM_* flag if we are in a transaction
> > > context and so the flag is only activated in the situations were
> > > we currently enforce an external GFP_NOFS context from the call
> > > site.....
> > 
> > Hmm, I thought we would achive this by using the scope GFP_NOFS usage
> > which would mark those transaction related conctexts and no lockdep
> > specific workarounds would be needed...
> 
> There are allocations outside transaction context which need to be
> GFP_NOFS - this is what KM_NOFS was originally intended for.

Is it feasible to mark those with the scope NOFS api as well and drop
the direct KM_NOFS usage? This should help to identify those that are
lockdep-only and use the annotation to prevent the false positives.
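
For reference, the scope NOFS api I keep referring to is a per-task pair
modelled on the existing memalloc_noio_save()/restore(); a rough,
untested sketch (the _nofs_ names and the PF_MEMALLOC_NOFS flag are
hypothetical at this point):

static inline unsigned int memalloc_nofs_save(void)
{
	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;

	current->flags |= PF_MEMALLOC_NOFS;
	return flags;
}

static inline void memalloc_nofs_restore(unsigned int flags)
{
	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
}

/* a top-level entry point known to need NOFS behaviour */
	flags = memalloc_nofs_save();
	/* ... every allocation below this point behaves as if GFP_NOFS ... */
	memalloc_nofs_restore(flags);
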
-- 
Michal Hocko
SUSE Labs


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-21 14:26                                       ` Michal Hocko
@ 2016-06-22  1:03                                         ` Dave Chinner
  2016-06-22 12:38                                           ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2016-06-22  1:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Peter Zijlstra, Darrick J. Wong, xfs, Qu Wenruo, Ingo Molnar

On Tue, Jun 21, 2016 at 04:26:28PM +0200, Michal Hocko wrote:
> On Wed 15-06-16 17:21:54, Dave Chinner wrote:
> [...]
> > Hopefully you can see the complexity of the issue - for an allocation
> > in the bmap btree code that could occur outside both inside and
> > outside of a transaction context, we've got to work out which of
> > those ~60 high level entry points would need to be annotated. And
> > then we have to ensure that in future we don't miss adding or
> > removing an annotation as we change the code deep inside the btree
> > implementation. It's the latter that is the long term maintainence
> > problem the hihg-level annotation approach introduces.
> 
> Sure I can see the complexity here. I might still see this over
> simplified but I originally thought that the annotation would be used at
> the highest level which never gets called from the transaction or other
> NOFS context. So all the layers down would inherit that automatically. I
> guess that such a place can be identified from the lockdep report by a
> trained eye.

Which, as I said before, effectively becomes "turn off lockdep
reclaim context checking at all XFS entry points". Yes, we could do
that, but it's a "big hammer" solution and there are more entry
points than there are memory allocations that need annotations....

> > > > I think such an annotation approach really requires per-alloc site
> > > > annotation, the reason for it should be more obvious from the
> > > > context. e.g. any function that does memory alloc and takes an
> > > > optional transaction context needs annotation. Hence, from an XFS
> > > > perspective, I think it makes more sense to add a new KM_ flag to
> > > > indicate this call site requirement, then jump through whatever
> > > > lockdep hoop is required within the kmem_* allocation wrappers.
> > > > e.g, we can ignore the new KM_* flag if we are in a transaction
> > > > context and so the flag is only activated in the situations were
> > > > we currently enforce an external GFP_NOFS context from the call
> > > > site.....
> > > 
> > > Hmm, I thought we would achive this by using the scope GFP_NOFS usage
> > > which would mark those transaction related conctexts and no lockdep
> > > specific workarounds would be needed...
> > 
> > There are allocations outside transaction context which need to be
> > GFP_NOFS - this is what KM_NOFS was originally intended for.
> 
> Is it feasible to mark those by the scope NOFS api as well and drop
> the direct KM_NOFS usage? This should help to identify those that are
> lockdep only and use the annotation to prevent from the false positives.

I don't understand what you are suggesting here. This all started
because we use GFP_NOFS in a handful of places to shut up lockdep
and you didn't want us to use GFP_NOFS like that. Now it sounds to
me like you are advocating setting unconditional GFP_NOFS allocation
contexts for entire XFS code paths - whether it's necessary or
not - to avoid problems with lockdep false positives.

I'm clearly not understanding something here....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-22  1:03                                         ` Dave Chinner
@ 2016-06-22 12:38                                           ` Michal Hocko
  2016-06-22 22:58                                             ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2016-06-22 12:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, Peter Zijlstra, Darrick J. Wong, xfs, Qu Wenruo, Ingo Molnar

On Wed 22-06-16 11:03:20, Dave Chinner wrote:
> On Tue, Jun 21, 2016 at 04:26:28PM +0200, Michal Hocko wrote:
> > On Wed 15-06-16 17:21:54, Dave Chinner wrote:
[...]
> > > There are allocations outside transaction context which need to be
> > > GFP_NOFS - this is what KM_NOFS was originally intended for.
> > 
> > Is it feasible to mark those by the scope NOFS api as well and drop
> > the direct KM_NOFS usage? This should help to identify those that are
> > lockdep only and use the annotation to prevent from the false positives.
> 
> I don't understand what you are suggesting here. This all started
> because we use GFP_NOFS in a handful of places to shut up lockdep
> and you didn't want us to use GFP_NOFS like that. Now it sounds to
> me like you are advocating setting unconditional GFP_NOFS allocation
> contexts for entire XFS code paths - whether it's necessary or
> not - to avoid problems with lockdep false positives.

No, I meant that only those paths which need GFP_NOFS for reasons other
than lockdep would use the scope api.

Anyway, it seems that we are not getting closer to a desired solution
here. Or I am not following it at least...

It seems that we have effectively two possibilities (from the MM/lockdep
POV): either add an explicit API to disable the reclaim lockdep
machinery for all allocations in a certain scope, or a GFP flag to
achieve the same for a particular allocation. Which one would work
better for the XFS use case?

-- 
Michal Hocko
SUSE Labs


* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-22 12:38                                           ` Michal Hocko
@ 2016-06-22 22:58                                             ` Dave Chinner
  2016-06-23 11:35                                               ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2016-06-22 22:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Peter Zijlstra, Darrick J. Wong, xfs, Qu Wenruo, Ingo Molnar

On Wed, Jun 22, 2016 at 02:38:22PM +0200, Michal Hocko wrote:
> On Wed 22-06-16 11:03:20, Dave Chinner wrote:
> > On Tue, Jun 21, 2016 at 04:26:28PM +0200, Michal Hocko wrote:
> > > On Wed 15-06-16 17:21:54, Dave Chinner wrote:
> [...]
> > > > There are allocations outside transaction context which need to be
> > > > GFP_NOFS - this is what KM_NOFS was originally intended for.
> > > 
> > > Is it feasible to mark those with the scope NOFS API as well and drop
> > > the direct KM_NOFS usage? This should help to identify those that are
> > > lockdep-only and use the annotation to prevent the false positives.
> > 
> > I don't understand what you are suggesting here. This all started
> > because we use GFP_NOFS in a handful of places to shut up lockdep
> > and you didn't want us to use GFP_NOFS like that. Now it sounds to
> > me like you are advocating setting unconditional GFP_NOFS allocation
> > contexts for entire XFS code paths - whether it's necessary or
> > not - to avoid problems with lockdep false positives.
> 
> No, I meant that only those paths which need GFP_NOFS for reasons other
> than lockdep would use the scope API.
> 
> Anyway, it seems that we are not getting closer to a desired solution
> here, or at least I am not following it...
> 
> It seems that we effectively have two possibilities from the MM/lockdep
> POV: either add an explicit API to disable the reclaim lockdep machinery
> for all allocations in a certain scope, or add a GFP mask to achieve the
> same for a particular allocation. Which one would work better for the
> xfs use case?

As I've said - if we annotate the XFS call sites appropriately (e.g.
KM_NOLOCKDEP rather than KM_NOFS), we don't care what lockdep
mechanism is used to turn off warnings as it will be wholly
encapsulated inside kmem_alloc() and friends.  This will end up
similar to how we currently encapsulate the memalloc_noio_save()
wrappers in kmem_zalloc_large().
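
As a rough sketch of that encapsulation, with a hypothetical KM_NOLOCKDEP
flag mapped onto an equally hypothetical __GFP_NOLOCKDEP bit (the KM_*
values and function bodies below are simplified illustrations, not the
actual fs/xfs/kmem.[ch] contents):

#include <linux/gfp.h>
#include <linux/sched.h>
#include <linux/vmalloc.h>

typedef unsigned int xfs_km_flags_t;	/* simplified */
#define KM_NOSLEEP	0x0002u
#define KM_NOFS		0x0004u
#define KM_MAYFAIL	0x0008u
#define KM_NOLOCKDEP	0x0010u		/* hypothetical new flag */

/*
 * Callers only ever see KM_* flags, so the lockdep-specific gfp bit
 * never leaks above kmem_alloc() and friends.
 */
static inline gfp_t
kmem_flags_convert(xfs_km_flags_t flags)
{
	gfp_t	lflags;

	if (flags & KM_NOSLEEP)
		lflags = GFP_ATOMIC | __GFP_NOWARN;
	else
		lflags = GFP_KERNEL | __GFP_NOWARN;

	if (flags & KM_NOFS)
		lflags &= ~__GFP_FS;
	if (flags & KM_NOLOCKDEP)
		lflags |= __GFP_NOLOCKDEP;	/* lockdep-only annotation */

	return lflags;
}

/*
 * The memalloc_noio_save() pattern referred to above: the process flag
 * is set and restored entirely inside the wrapper, invisible to callers.
 * vmalloc can allocate page table pages with GFP_KERNEL internally, so
 * a NOFS request has to be enforced via the scope API rather than the
 * gfp mask alone.
 */
static void *
kmem_zalloc_large_sketch(size_t size, xfs_km_flags_t flags)
{
	unsigned int noio_flag = 0;
	void *ptr;

	if (flags & KM_NOFS)
		noio_flag = memalloc_noio_save();

	ptr = __vmalloc(size, kmem_flags_convert(flags) | __GFP_ZERO,
			PAGE_KERNEL);

	if (flags & KM_NOFS)
		memalloc_noio_restore(noio_flag);
	return ptr;
}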

IOWs, it doesn't matter to XFS whether it be a GFP flag or a PF flag
here, because it's not going to be exposed to the higher level code.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-22 22:58                                             ` Dave Chinner
@ 2016-06-23 11:35                                               ` Michal Hocko
  0 siblings, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2016-06-23 11:35 UTC (permalink / raw)
  To: Peter Zijlstra, Dave Chinner
  Cc: linux-mm, Qu Wenruo, xfs, Ingo Molnar, Darrick J. Wong

On Thu 23-06-16 08:58:16, Dave Chinner wrote:
> On Wed, Jun 22, 2016 at 02:38:22PM +0200, Michal Hocko wrote:
> > On Wed 22-06-16 11:03:20, Dave Chinner wrote:
> > > On Tue, Jun 21, 2016 at 04:26:28PM +0200, Michal Hocko wrote:
> > > > On Wed 15-06-16 17:21:54, Dave Chinner wrote:
> > [...]
> > > > > There are allocations outside transaction context which need to be
> > > > > GFP_NOFS - this is what KM_NOFS was originally intended for.
> > > > 
> > > > Is it feasible to mark those with the scope NOFS API as well and drop
> > > > the direct KM_NOFS usage? This should help to identify those that are
> > > > lockdep-only and use the annotation to prevent the false positives.
> > > 
> > > I don't understand what you are suggesting here. This all started
> > > because we use GFP_NOFS in a handful of places to shut up lockdep
> > > and you didn't want us to use GFP_NOFS like that. Now it sounds to
> > > me like you are advocating setting unconditional GFP_NOFS allocation
> > > contexts for entire XFS code paths - whether it's necessary or
> > > not - to avoid problems with lockdep false positives.
> > 
> > No, I meant that only those paths which need GFP_NOFS for reasons other
> > than lockdep would use the scope API.
> > 
> > Anyway, it seems that we are not getting closer to a desired solution
> > here, or at least I am not following it...
> > 
> > It seems that we effectively have two possibilities from the MM/lockdep
> > POV: either add an explicit API to disable the reclaim lockdep machinery
> > for all allocations in a certain scope, or add a GFP mask to achieve the
> > same for a particular allocation. Which one would work better for the
> > xfs use case?
> 
> As I've said - if we annotate the XFS call sites appropriately (e.g.
> KM_NOLOCKDEP rather than KM_NOFS), we don't care what lockdep
> mechanism is used to turn off warnings as it will be wholly
> encapsulated inside kmem_alloc() and friends.  This will end up
> similar to how we currently encapsulate the memalloc_noio_save()
> wrappers in kmem_zalloc_large().

OK, I see. So which way do we go, Peter? Are you going to send the GFP
one, or is there a way to bribe you into going with a thread flag?

-- 
Michal Hocko
SUSE Labs

* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-06-01 18:16                         ` Peter Zijlstra
  2016-06-02 14:50                           ` Michal Hocko
@ 2016-10-06 13:04                           ` Michal Hocko
  2016-10-17 13:49                             ` Michal Hocko
                                               ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Michal Hocko @ 2016-10-06 13:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Chinner, Darrick J. Wong, Qu Wenruo, xfs, linux-mm, Ingo Molnar

[Let me resurrect this thread]

On Wed 01-06-16 20:16:17, Peter Zijlstra wrote:
> On Wed, Jun 01, 2016 at 03:17:58PM +0200, Michal Hocko wrote:
> > Thanks Dave for your detailed explanation again! Peter do you have any
> > other idea how to deal with these situations other than opt out from
> > lockdep reclaim machinery?
> > 
> > If not I would rather go with an annotation than a gfp flag to be honest
> > but if you absolutely hate that approach then I will try to check whether
> > a CONFIG_LOCKDEP GFP_FOO doesn't break something else. Otherwise I would
> > steal the description from Dave's email and repost my patch.
> > 
> > I plan to repost my scope gfp patches in a few days and it would be good
> > to have some mechanism to drop those GFP_NOFS to paper over lockdep
> > false positives for that.
> 
> Right; sorry I got side-tracked in other things again.
> 
> So my favourite is the dedicated GFP flag, but if that's unpalatable for
> the mm folks then something like the below might work. It should be
> similar in effect to your proposal, except it's more limited in scope.

OK, so the situation with the GFP flags has eased somewhat after
http://lkml.kernel.org/r/20160912114852.GI14524@dhcp22.suse.cz; with the
root radix tree remaining the last user which mangles gfp_mask and tags
together, we have a few bits left there. As you apparently hate any
scoped API, and Dave thinks that a per-allocation flag is the only
maintainable way for xfs, what do you think about the following?
---
From 04b3923e5b12f0eb3859f0718881fa0f40e60164 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 13 May 2016 17:47:31 +0200
Subject: [PATCH] lockdep: allow to disable reclaim lockup detection

The current implementation of the reclaim lockup detection can lead to
false positives, and those do happen in practice; the usual response is
to tweak the code to silence lockdep by using GFP_NOFS even though the
context could use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G           O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
  [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
  [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
  [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
  [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
  [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
  [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
  [<ffffffff81244459>] SyS_ioctl+0x79/0x90
  [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

       CPU0
       ----
  lock(&xfs_nondir_ilock_class);
  <Interrupt>
    lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4

Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

 ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
 ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
 ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
Call Trace:
 [<ffffffff813a14f9>] dump_stack+0x4b/0x72
 [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
 [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
 [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
 [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
 [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [<ffffffff8124d7d5>] evict+0xc5/0x190
 [<ffffffff8124d8d9>] dispose_list+0x39/0x60
 [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
 [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
 [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
 [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
 [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
 [<ffffffff810e6882>] kthread+0xf2/0x110
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
 [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special use case IMHO unless
the reclaim lockup detection is reworked completely. Until then it is
much better to provide an "I know what I am doing" flag and mark the
problematic places with it. This would prevent abuse of the GFP_NOFS
flag, which has a runtime effect even on configurations that have
lockdep disabled.

Introduce the __GFP_NOLOCKDEP flag, which tells the lockdep gfp tracking
to skip the current allocation request.

While we are at it, also make sure that the radix tree doesn't
accidentally override tags stored in the upper part of the gfp_mask.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h      | 10 +++++++++-
 kernel/locking/lockdep.c |  4 ++++
 lib/radix-tree.c         |  1 +
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3e6c48dbe6b9..cee3d5fa3821 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -41,6 +41,11 @@ struct vm_area_struct;
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
 #define ___GFP_KSWAPD_RECLAIM	0x2000000u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP	0x4000000u
+#else
+#define ___GFP_NOLOCKDEP	0
+#endif
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -186,8 +191,11 @@ struct vm_area_struct;
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)
 
+/* Disable lockdep for GFP context tracking */
+#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
+
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 26
+#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index d96c6e058467..a652ac8b3cfa 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2871,6 +2871,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
 		return;
 
+	/* Disable lockdep if explicitly requested */
+	if (gfp_mask & __GFP_NOLOCKDEP)
+		return;
+
 	mark_held_locks(curr, RECLAIM_FS);
 }
 
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 1b7bf7314141..3154403d30e8 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1672,6 +1672,7 @@ static int radix_tree_callback(struct notifier_block *nfb,
 
 void __init radix_tree_init(void)
 {
+	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
-- 
2.9.3
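
For illustration, here is the shape of a conversion at a call site that
currently uses GFP_NOFS only to keep lockdep quiet. Everything except the
gfp flags is a placeholder; picking the real first user (for example via
a KM_NOLOCKDEP wrapper in the XFS kmem code) is up to the xfs side:

#include <linux/gfp.h>
#include <linux/slab.h>

struct ctx;	/* stands in for the object the real call site allocates */

static struct ctx *alloc_ctx_old(struct kmem_cache *zone)
{
	/*
	 * GFP_NOFS chosen only to keep lockdep quiet: it also restricts
	 * reclaim at runtime for every caller, even with lockdep off.
	 */
	return kmem_cache_alloc(zone, GFP_NOFS);
}

static struct ctx *alloc_ctx_new(struct kmem_cache *zone)
{
	/*
	 * With the patch above the allocation keeps its __GFP_FS reclaim
	 * behaviour and only the lockdep gfp tracking is skipped.
	 */
	return kmem_cache_alloc(zone, GFP_KERNEL | __GFP_NOLOCKDEP);
}

On the bit budget: with CONFIG_LOCKDEP=y the patch grows __GFP_BITS_SHIFT
from 26 to 27, and the BUILD_BUG_ON documents the headroom - 27 gfp bits
plus the RADIX_TREE_MAX_TAGS (3) tag bits packed into the root gfp_mask
still total 30, comfortably below 32.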

-- 
Michal Hocko
SUSE Labs

* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-10-06 13:04                           ` Michal Hocko
@ 2016-10-17 13:49                             ` Michal Hocko
  2016-10-19  0:33                             ` Dave Chinner
  2016-10-19  8:33                             ` Peter Zijlstra
  2 siblings, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2016-10-17 13:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Chinner, Darrick J. Wong, Qu Wenruo, xfs, linux-mm, Ingo Molnar

On Thu 06-10-16 15:04:54, Michal Hocko wrote:
> [Let me resurrect this thread]

ping

> On Wed 01-06-16 20:16:17, Peter Zijlstra wrote:
> > On Wed, Jun 01, 2016 at 03:17:58PM +0200, Michal Hocko wrote:
> > > Thanks Dave for your detailed explanation again! Peter do you have any
> > > other idea how to deal with these situations other than opt out from
> > > lockdep reclaim machinery?
> > > 
> > > If not I would rather go with an annotation than a gfp flag to be honest
> > > but if you absolutely hate that approach then I will try to check whether
> > > a CONFIG_LOCKDEP GFP_FOO doesn't break something else. Otherwise I would
> > > steal the description from Dave's email and repost my patch.
> > > 
> > > I plan to repost my scope gfp patches in a few days and it would be good
> > > to have some mechanism to drop those GFP_NOFS to paper over lockdep
> > > false positives for that.
> > 
> > Right; sorry I got side-tracked in other things again.
> > 
> > So my favourite is the dedicated GFP flag, but if that's unpalatable for
> > the mm folks then something like the below might work. It should be
> > similar in effect to your proposal, except it's more limited in scope.
> 
> OK, so the situation with the GFP flags has eased somewhat after
> http://lkml.kernel.org/r/20160912114852.GI14524@dhcp22.suse.cz; with the
> root radix tree remaining the last user which mangles gfp_mask and tags
> together, we have a few bits left there. As you apparently hate any
> scoped API, and Dave thinks that a per-allocation flag is the only
> maintainable way for xfs, what do you think about the following?
> ---
> From 04b3923e5b12f0eb3859f0718881fa0f40e60164 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 13 May 2016 17:47:31 +0200
> Subject: [PATCH] lockdep: allow to disable reclaim lockup detection
> 
> The current implementation of the reclaim lockup detection can lead to
> false positives, and those do happen in practice; the usual response is
> to tweak the code to silence lockdep by using GFP_NOFS even though the
> context could use __GFP_FS just fine. See
> http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.
> 
> =================================
> [ INFO: inconsistent lock state ]
> 4.5.0-rc2+ #4 Tainted: G           O
> ---------------------------------
> inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
> kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
> 
> (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
> 
> {RECLAIM_FS-ON-R} state was registered at:
>   [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
>   [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
>   [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
>   [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
>   [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
>   [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
>   [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
>   [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
>   [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
>   [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
>   [<ffffffff81244459>] SyS_ioctl+0x79/0x90
>   [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
> 
>        CPU0
>        ----
>   lock(&xfs_nondir_ilock_class);
>   <Interrupt>
>     lock(&xfs_nondir_ilock_class);
> 
>  *** DEADLOCK ***
> 
> 3 locks held by kswapd0/543:
> 
> stack backtrace:
> CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
> 
> Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> 
>  ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
>  ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
>  ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
> Call Trace:
>  [<ffffffff813a14f9>] dump_stack+0x4b/0x72
>  [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
>  [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
>  [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
>  [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
>  [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
>  [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
>  [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
>  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
>  [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
>  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
>  [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
>  [<ffffffff8124d7d5>] evict+0xc5/0x190
>  [<ffffffff8124d8d9>] dispose_list+0x39/0x60
>  [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
>  [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
>  [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
>  [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
>  [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
>  [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
>  [<ffffffff810e6882>] kthread+0xf2/0x110
>  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
>  [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
>  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
> 
> To quote Dave:
> "
> Ignoring whether reflink should be doing anything or not, that's a
> "xfs_refcountbt_init_cursor() gets called both outside and inside
> transactions" lockdep false positive case. The problem here is
> lockdep has seen this allocation from within a transaction, hence a
> GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
> Also note that we have an active reference to this inode.
> 
> So, because the reclaim annotations overload the interrupt level
> detections and it's seen the inode ilock been taken in reclaim
> ("interrupt") context, this triggers a reclaim context warning where
> it thinks it is unsafe to do this allocation in GFP_KERNEL context
> holding the inode ilock...
> "
> 
> This sounds like a fundamental problem of the reclaim lock detection.
> It is really impossible to annotate such a special use case IMHO unless
> the reclaim lockup detection is reworked completely. Until then it is
> much better to provide an "I know what I am doing" flag and mark the
> problematic places with it. This would prevent abuse of the GFP_NOFS
> flag, which has a runtime effect even on configurations that have
> lockdep disabled.
>
> Introduce the __GFP_NOLOCKDEP flag, which tells the lockdep gfp tracking
> to skip the current allocation request.
>
> While we are at it, also make sure that the radix tree doesn't
> accidentally override tags stored in the upper part of the gfp_mask.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/gfp.h      | 10 +++++++++-
>  kernel/locking/lockdep.c |  4 ++++
>  lib/radix-tree.c         |  1 +
>  3 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 3e6c48dbe6b9..cee3d5fa3821 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -41,6 +41,11 @@ struct vm_area_struct;
>  #define ___GFP_OTHER_NODE	0x800000u
>  #define ___GFP_WRITE		0x1000000u
>  #define ___GFP_KSWAPD_RECLAIM	0x2000000u
> +#ifdef CONFIG_LOCKDEP
> +#define ___GFP_NOLOCKDEP	0x4000000u
> +#else
> +#define ___GFP_NOLOCKDEP	0
> +#endif
>  /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>  
>  /*
> @@ -186,8 +191,11 @@ struct vm_area_struct;
>  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)
>  
> +/* Disable lockdep for GFP context tracking */
> +#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
> +
>  /* Room for N __GFP_FOO bits */
> -#define __GFP_BITS_SHIFT 26
> +#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>  
>  /*
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index d96c6e058467..a652ac8b3cfa 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -2871,6 +2871,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
>  	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
>  		return;
>  
> +	/* Disable lockdep if explicitly requested */
> +	if (gfp_mask & __GFP_NOLOCKDEP)
> +		return;
> +
>  	mark_held_locks(curr, RECLAIM_FS);
>  }
>  
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 1b7bf7314141..3154403d30e8 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -1672,6 +1672,7 @@ static int radix_tree_callback(struct notifier_block *nfb,
>  
>  void __init radix_tree_init(void)
>  {
> +	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
>  	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
>  			sizeof(struct radix_tree_node), 0,
>  			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
> -- 
> 2.9.3
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-10-06 13:04                           ` Michal Hocko
  2016-10-17 13:49                             ` Michal Hocko
@ 2016-10-19  0:33                             ` Dave Chinner
  2016-10-19  5:30                               ` Dave Chinner
  2016-10-19  8:33                             ` Peter Zijlstra
  2 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2016-10-19  0:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Peter Zijlstra, Darrick J. Wong, Qu Wenruo, xfs, linux-mm, Ingo Molnar

On Thu, Oct 06, 2016 at 03:04:54PM +0200, Michal Hocko wrote:
> [Let me resurrect this thread]
> 
> On Wed 01-06-16 20:16:17, Peter Zijlstra wrote:
> > On Wed, Jun 01, 2016 at 03:17:58PM +0200, Michal Hocko wrote:
> > > Thanks Dave for your detailed explanation again! Peter do you have any
> > > other idea how to deal with these situations other than opt out from
> > > lockdep reclaim machinery?
> > > 
> > > If not I would rather go with an annotation than a gfp flag to be honest
> > > but if you absolutely hate that approach then I will try to check whether
> > > a CONFIG_LOCKDEP GFP_FOO doesn't break something else. Otherwise I would
> > > steal the description from Dave's email and repost my patch.
> > > 
> > > I plan to repost my scope gfp patches in a few days and it would be good
> > > to have some mechanism to drop those GFP_NOFS to paper over lockdep
> > > false positives for that.
> > 
> > Right; sorry I got side-tracked in other things again.
> > 
> > So my favourite is the dedicated GFP flag, but if that's unpalatable for
> > the mm folks then something like the below might work. It should be
> > similar in effect to your proposal, except it's more limited in scope.
> 
> OK, so the situation with the GFP flags has eased somewhat after
> http://lkml.kernel.org/r/20160912114852.GI14524@dhcp22.suse.cz; with the
> root radix tree remaining the last user which mangles gfp_mask and tags
> together, we have a few bits left there. As you apparently hate any
> scoped API, and Dave thinks that a per-allocation flag is the only
> maintainable way for xfs, what do you think about the following?

It's a workable solution to allow XFS to play whack-a-mole games
with lockdep again. As to the implementation - that's for other
people to decide....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-10-19  0:33                             ` Dave Chinner
@ 2016-10-19  5:30                               ` Dave Chinner
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Chinner @ 2016-10-19  5:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Peter Zijlstra, Darrick J. Wong, Qu Wenruo, linux-xfs, linux-mm,
	Ingo Molnar

[resend with the xfs list corrected.]

On Thu, Oct 06, 2016 at 03:04:54PM +0200, Michal Hocko wrote:
> [Let me resurrect this thread]
> 
> On Wed 01-06-16 20:16:17, Peter Zijlstra wrote:
> > On Wed, Jun 01, 2016 at 03:17:58PM +0200, Michal Hocko wrote:
> > > Thanks Dave for your detailed explanation again! Peter do you have any
> > > other idea how to deal with these situations other than opt out from
> > > lockdep reclaim machinery?
> > > 
> > > If not I would rather go with an annotation than a gfp flag to be honest
> > > but if you absolutely hate that approach then I will try to check whether
> > > a CONFIG_LOCKDEP GFP_FOO doesn't break something else. Otherwise I would
> > > steal the description from Dave's email and repost my patch.
> > > 
> > > I plan to repost my scope gfp patches in a few days and it would be good
> > > to have some mechanism to drop those GFP_NOFS to paper over lockdep
> > > false positives for that.
> > 
> > Right; sorry I got side-tracked in other things again.
> > 
> > So my favourite is the dedicated GFP flag, but if that's unpalatable for
> > the mm folks then something like the below might work. It should be
> > similar in effect to your proposal, except it's more limited in scope.
> 
> OK, so the situation with the GFP flags has eased somewhat after
> http://lkml.kernel.org/r/20160912114852.GI14524@dhcp22.suse.cz; with the
> root radix tree remaining the last user which mangles gfp_mask and tags
> together, we have a few bits left there. As you apparently hate any
> scoped API, and Dave thinks that a per-allocation flag is the only
> maintainable way for xfs, what do you think about the following?

It's a workable solution to allow XFS to play whack-a-mole games
with lockdep again. As to the implementation - that's for other
people to decide....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-10-06 13:04                           ` Michal Hocko
  2016-10-17 13:49                             ` Michal Hocko
  2016-10-19  0:33                             ` Dave Chinner
@ 2016-10-19  8:33                             ` Peter Zijlstra
  2016-10-19 12:06                               ` Michal Hocko
  2 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2016-10-19  8:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Chinner, Darrick J. Wong, Qu Wenruo, xfs, linux-mm, Ingo Molnar

On Thu, Oct 06, 2016 at 03:04:54PM +0200, Michal Hocko wrote:
> From 04b3923e5b12f0eb3859f0718881fa0f40e60164 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 13 May 2016 17:47:31 +0200
> Subject: [PATCH] lockdep: allow to disable reclaim lockup detection
> 
> The current implementation of the reclaim lockup detection can lead to
> false positives, and those do happen in practice; the usual response is
> to tweak the code to silence lockdep by using GFP_NOFS even though the
> context could use __GFP_FS just fine. See
> http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.
> 
> =================================
> [ INFO: inconsistent lock state ]
> 4.5.0-rc2+ #4 Tainted: G           O
> ---------------------------------
> inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
> kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
> 
> (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
> 
> {RECLAIM_FS-ON-R} state was registered at:
>   [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
>   [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
>   [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
>   [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
>   [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
>   [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
>   [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
>   [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
>   [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
>   [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
>   [<ffffffff81244459>] SyS_ioctl+0x79/0x90
>   [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
> 
>        CPU0
>        ----
>   lock(&xfs_nondir_ilock_class);
>   <Interrupt>
>     lock(&xfs_nondir_ilock_class);
> 
>  *** DEADLOCK ***
> 
> 3 locks held by kswapd0/543:
> 
> stack backtrace:
> CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
> 
> Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> 
>  ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
>  ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
>  ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
> Call Trace:
>  [<ffffffff813a14f9>] dump_stack+0x4b/0x72
>  [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
>  [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
>  [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
>  [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
>  [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
>  [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
>  [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
>  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
>  [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
>  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
>  [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
>  [<ffffffff8124d7d5>] evict+0xc5/0x190
>  [<ffffffff8124d8d9>] dispose_list+0x39/0x60
>  [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
>  [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
>  [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
>  [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
>  [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
>  [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
>  [<ffffffff810e6882>] kthread+0xf2/0x110
>  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
>  [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
>  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
> 
> To quote Dave:
> "
> Ignoring whether reflink should be doing anything or not, that's a
> "xfs_refcountbt_init_cursor() gets called both outside and inside
> transactions" lockdep false positive case. The problem here is
> lockdep has seen this allocation from within a transaction, hence a
> GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
> Also note that we have an active reference to this inode.
> 
> So, because the reclaim annotations overload the interrupt level
> detections and it's seen the inode ilock been taken in reclaim
> ("interrupt") context, this triggers a reclaim context warning where
> it thinks it is unsafe to do this allocation in GFP_KERNEL context
> holding the inode ilock...
> "
> 
> This sounds like a fundamental problem of the reclaim lock detection.
> It is really impossible to annotate such a special use case IMHO unless
> the reclaim lockup detection is reworked completely. Until then it is
> much better to provide an "I know what I am doing" flag and mark the
> problematic places with it. This would prevent abuse of the GFP_NOFS
> flag, which has a runtime effect even on configurations that have
> lockdep disabled.
>
> Introduce the __GFP_NOLOCKDEP flag, which tells the lockdep gfp tracking
> to skip the current allocation request.
>
> While we are at it, also make sure that the radix tree doesn't
> accidentally override tags stored in the upper part of the gfp_mask.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

So I'm all for this if this works for Dave.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Please take it through the XFS tree, which would introduce its first
user, etc.

* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-10-19  8:33                             ` Peter Zijlstra
@ 2016-10-19 12:06                               ` Michal Hocko
  2016-10-19 21:49                                 ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2016-10-19 12:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Peter Zijlstra, Darrick J. Wong, Qu Wenruo, xfs, linux-mm, Ingo Molnar

On Wed 19-10-16 10:33:04, Peter Zijlstra wrote:
[...]
> So I'm all for this if this works for Dave.
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks Peter!

> Please take it through the XFS tree which would introduce its first user
> etc..

Dave, does that work for you? I agree that having this followed by a
first user would be really preferable. Maybe convert some of those added
by b17cb364dbbb? I wish I could help here, but as you've said earlier,
each such annotation should be accompanied by an explanation which I am
not qualified to provide.
-- 
Michal Hocko
SUSE Labs

* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-10-19 12:06                               ` Michal Hocko
@ 2016-10-19 21:49                                 ` Dave Chinner
  2016-10-20  7:15                                   ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2016-10-19 21:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Peter Zijlstra, Darrick J. Wong, Qu Wenruo, linux-xfs, linux-mm,
	Ingo Molnar

On Wed, Oct 19, 2016 at 02:06:27PM +0200, Michal Hocko wrote:
> On Wed 19-10-16 10:33:04, Peter Zijlstra wrote:
> [...]
> > So I'm all for this if this works for Dave.
> > 
> > Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> Thanks Peter!
> 
> > Please take it through the XFS tree, which would introduce its first
> > user, etc.
> 
> Dave, does that work for you? I agree that having this followed by a
> first user would be really preferable. Maybe to turn some of those added
> by b17cb364dbbb? I wish I could help here but as you've said earlier
> each such annotation should be accompanied by an explanation which I am
> not qualified to provide.

I've got my hands full right now, so I'm not going to try to page
all this stuff back into my brain right now.  Try reminding me as
the merge window gets closer...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Xfs lockdep warning with for-dave-for-4.6 branch
  2016-10-19 21:49                                 ` Dave Chinner
@ 2016-10-20  7:15                                   ` Michal Hocko
  0 siblings, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2016-10-20  7:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Peter Zijlstra, Darrick J. Wong, Qu Wenruo, linux-xfs, linux-mm,
	Ingo Molnar

On Thu 20-10-16 08:49:23, Dave Chinner wrote:
> On Wed, Oct 19, 2016 at 02:06:27PM +0200, Michal Hocko wrote:
> > On Wed 19-10-16 10:33:04, Peter Zijlstra wrote:
> > [...]
> > > So I'm all for this if this works for Dave.
> > > 
> > > Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > 
> > Thanks Peter!
> > 
> > > Please take it through the XFS tree, which would introduce its first
> > > user, etc.
> > 
> > Dave, does that work for you? I agree that having this followed by a
> > first user would be really preferable. Maybe convert some of those added
> > by b17cb364dbbb? I wish I could help here, but as you've said earlier,
> > each such annotation should be accompanied by an explanation which I am
> > not qualified to provide.
> 
> I've got my hands full right now, so I'm not going to try to page
> all this stuff back into my brain right now.  Try reminding me as
> the merge window gets closer...

Sure, I do not think we are in a hurry.

Thanks!
-- 
Michal Hocko
SUSE Labs

end of thread

Thread overview: 37+ messages
2016-05-12  5:53 Xfs lockdep warning with for-dave-for-4.6 branch Qu Wenruo
2016-05-12  5:57 ` Darrick J. Wong
2016-05-12  8:03   ` Dave Chinner
2016-05-13 16:03     ` Michal Hocko
2016-05-16 10:41       ` Peter Zijlstra
2016-05-16 13:05         ` Michal Hocko
2016-05-16 13:25           ` Peter Zijlstra
2016-05-16 23:10             ` Dave Chinner
2016-05-17 14:49               ` Peter Zijlstra
2016-05-17 22:35                 ` Dave Chinner
2016-05-18  7:20                   ` Peter Zijlstra
2016-05-18  8:25                     ` Michal Hocko
2016-05-18  9:49                       ` Peter Zijlstra
2016-05-18 11:31                         ` Michal Hocko
2016-05-19  8:11                   ` Peter Zijlstra
2016-05-20  0:17                     ` Dave Chinner
2016-06-01 13:17                       ` Michal Hocko
2016-06-01 18:16                         ` Peter Zijlstra
2016-06-02 14:50                           ` Michal Hocko
2016-06-02 15:11                             ` Peter Zijlstra
2016-06-02 15:46                               ` Michal Hocko
2016-06-02 23:22                                 ` Dave Chinner
2016-06-06 12:20                                   ` Michal Hocko
2016-06-15  7:21                                     ` Dave Chinner
2016-06-21 14:26                                       ` Michal Hocko
2016-06-22  1:03                                         ` Dave Chinner
2016-06-22 12:38                                           ` Michal Hocko
2016-06-22 22:58                                             ` Dave Chinner
2016-06-23 11:35                                               ` Michal Hocko
2016-10-06 13:04                           ` Michal Hocko
2016-10-17 13:49                             ` Michal Hocko
2016-10-19  0:33                             ` Dave Chinner
2016-10-19  5:30                               ` Dave Chinner
2016-10-19  8:33                             ` Peter Zijlstra
2016-10-19 12:06                               ` Michal Hocko
2016-10-19 21:49                                 ` Dave Chinner
2016-10-20  7:15                                   ` Michal Hocko
