xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim
diff mbox series

Message ID 20200102155208.8977-1-longman@redhat.com
State New
Headers show
Series
  • xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim
Related show

Commit Message

Waiman Long Jan. 2, 2020, 3:52 p.m. UTC
Depending on the workloads, the following circular locking dependency
warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
lock) may show up:

======================================================
WARNING: possible circular locking dependency detected
5.0.0-rc1+ #60 Tainted: G        W
------------------------------------------------------
fsfreeze/4346 is trying to acquire lock:
0000000026f1d784 (fs_reclaim){+.+.}, at:
fs_reclaim_acquire.part.19+0x5/0x30

but task is already holding lock:
0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650

which lock already depends on the new lock.
  :
 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sb_internal);
                               lock(fs_reclaim);
                               lock(sb_internal);
  lock(fs_reclaim);

 *** DEADLOCK ***

4 locks held by fsfreeze/4346:
 #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
 #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
 #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
 #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650

stack backtrace:
Call Trace:
 dump_stack+0xe0/0x19a
 print_circular_bug.isra.10.cold.34+0x2f4/0x435
 check_prev_add.constprop.19+0xca1/0x15f0
 validate_chain.isra.14+0x11af/0x3b50
 __lock_acquire+0x728/0x1200
 lock_acquire+0x269/0x5a0
 fs_reclaim_acquire.part.19+0x29/0x30
 fs_reclaim_acquire+0x19/0x20
 kmem_cache_alloc+0x3e/0x3f0
 kmem_zone_alloc+0x79/0x150
 xfs_trans_alloc+0xfa/0x9d0
 xfs_sync_sb+0x86/0x170
 xfs_log_sbcount+0x10f/0x140
 xfs_quiesce_attr+0x134/0x270
 xfs_fs_freeze+0x4a/0x70
 freeze_super+0x1af/0x290
 do_vfs_ioctl+0xedc/0x16c0
 ksys_ioctl+0x41/0x80
 __x64_sys_ioctl+0x73/0xa9
 do_syscall_64+0x18f/0xd23
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

According to Dave Chinner:

  Freezing the filesystem, after all the data has been cleaned. IOWs
  memory reclaim will never run the above writeback path when
  the freeze process is trying to allocate a transaction here because
  there are no dirty data pages in the filesystem at this point.

  Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that
  it /doesn't deadlock/ by taking freeze references for the
  transaction. We've just drained all the transactions
  in progress and written back all the dirty metadata, too, and so the
  filesystem is completely clean and only needs the superblock to be
  updated to complete the freeze process. And to do that, it does not
  take a freeze reference because calling sb_start_intwrite() here
  would deadlock.

  IOWs, this is a false positive, caused by the fact that
  xfs_trans_alloc() is called from both above and below memory reclaim
  as well as within /every level/ of freeze processing. Lockdep is
  unable to describe the staged flush logic in the freeze process that
  prevents deadlocks from occurring, and hence we will pretty much
  always see false positives in the freeze path....

Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
may fix the issue. However, that will greatly complicate the logic and
may not be worth it.

Another way to fix it is to disable the taking of the fs_reclaim
pseudo lock when in the freezing code path as a reclaim on the freezed
filesystem is not possible as stated above. This patch takes this
approach by setting the __GFP_NOLOCKDEP flag in the slab memory
allocation calls when the filesystem has been freezed.

Without this patch, the command sequence below will show that the lock
dependency chain sb_internal -> fs_reclaim exists.

 # fsfreeze -f /home
 # fsfreeze --unfreeze /home
 # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal

After applying the patch, such sb_internal -> fs_reclaim lock dependency
chain can no longer be found. Because of that, the locking dependency
warning will not be shown.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/xfs/kmem.h      | 11 ++++++++++-
 fs/xfs/xfs_log.c   |  3 ++-
 fs/xfs/xfs_trans.c |  8 +++++++-
 3 files changed, 19 insertions(+), 3 deletions(-)

Comments

Qian Cai Jan. 2, 2020, 4:19 p.m. UTC | #1
> On Jan 2, 2020, at 10:52 AM, Waiman Long <longman@redhat.com> wrote:
> 
> Depending on the workloads, the following circular locking dependency
> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
> lock) may show up:
> 
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.0.0-rc1+ #60 Tainted: G        W
> ------------------------------------------------------
> fsfreeze/4346 is trying to acquire lock:
> 0000000026f1d784 (fs_reclaim){+.+.}, at:
> fs_reclaim_acquire.part.19+0x5/0x30
> 
> but task is already holding lock:
> 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
> 
> which lock already depends on the new lock.
>  :
> Possible unsafe locking scenario:
> 
>       CPU0                    CPU1
>       ----                    ----
>  lock(sb_internal);
>                               lock(fs_reclaim);
>                               lock(sb_internal);
>  lock(fs_reclaim);
> 
> *** DEADLOCK ***
> 
> 4 locks held by fsfreeze/4346:
> #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
> #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
> #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
> #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
> 
> stack backtrace:
> Call Trace:
> dump_stack+0xe0/0x19a
> print_circular_bug.isra.10.cold.34+0x2f4/0x435
> check_prev_add.constprop.19+0xca1/0x15f0
> validate_chain.isra.14+0x11af/0x3b50
> __lock_acquire+0x728/0x1200
> lock_acquire+0x269/0x5a0
> fs_reclaim_acquire.part.19+0x29/0x30
> fs_reclaim_acquire+0x19/0x20
> kmem_cache_alloc+0x3e/0x3f0
> kmem_zone_alloc+0x79/0x150
> xfs_trans_alloc+0xfa/0x9d0
> xfs_sync_sb+0x86/0x170
> xfs_log_sbcount+0x10f/0x140
> xfs_quiesce_attr+0x134/0x270
> xfs_fs_freeze+0x4a/0x70
> freeze_super+0x1af/0x290
> do_vfs_ioctl+0xedc/0x16c0
> ksys_ioctl+0x41/0x80
> __x64_sys_ioctl+0x73/0xa9
> do_syscall_64+0x18f/0xd23
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> According to Dave Chinner:
> 
>  Freezing the filesystem, after all the data has been cleaned. IOWs
>  memory reclaim will never run the above writeback path when
>  the freeze process is trying to allocate a transaction here because
>  there are no dirty data pages in the filesystem at this point.
> 
>  Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that
>  it /doesn't deadlock/ by taking freeze references for the
>  transaction. We've just drained all the transactions
>  in progress and written back all the dirty metadata, too, and so the
>  filesystem is completely clean and only needs the superblock to be
>  updated to complete the freeze process. And to do that, it does not
>  take a freeze reference because calling sb_start_intwrite() here
>  would deadlock.
> 
>  IOWs, this is a false positive, caused by the fact that
>  xfs_trans_alloc() is called from both above and below memory reclaim
>  as well as within /every level/ of freeze processing. Lockdep is
>  unable to describe the staged flush logic in the freeze process that
>  prevents deadlocks from occurring, and hence we will pretty much
>  always see false positives in the freeze path....
> 
> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
> may fix the issue. However, that will greatly complicate the logic and
> may not be worth it.
> 
> Another way to fix it is to disable the taking of the fs_reclaim
> pseudo lock when in the freezing code path as a reclaim on the freezed
> filesystem is not possible as stated above. This patch takes this
> approach by setting the __GFP_NOLOCKDEP flag in the slab memory
> allocation calls when the filesystem has been freezed.
> 
> Without this patch, the command sequence below will show that the lock
> dependency chain sb_internal -> fs_reclaim exists.
> 
> # fsfreeze -f /home
> # fsfreeze --unfreeze /home
> # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
> 
> After applying the patch, such sb_internal -> fs_reclaim lock dependency
> chain can no longer be found. Because of that, the locking dependency
> warning will not be shown.

There was an attempt to fix this in the past, but Dave rejected right away for any workaround in xfs and insisted to make lockdep smarter instead. No sure your approach will make any difference this time. Good luck.
Darrick J. Wong Jan. 2, 2020, 6:24 p.m. UTC | #2
On Thu, Jan 02, 2020 at 11:19:51AM -0500, Qian Cai wrote:
> 
> 
> > On Jan 2, 2020, at 10:52 AM, Waiman Long <longman@redhat.com> wrote:
> > 
> > Depending on the workloads, the following circular locking dependency
> > warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
> > lock) may show up:
> > 
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 5.0.0-rc1+ #60 Tainted: G        W
> > ------------------------------------------------------
> > fsfreeze/4346 is trying to acquire lock:
> > 0000000026f1d784 (fs_reclaim){+.+.}, at:
> > fs_reclaim_acquire.part.19+0x5/0x30
> > 
> > but task is already holding lock:
> > 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
> > 
> > which lock already depends on the new lock.
> >  :
> > Possible unsafe locking scenario:
> > 
> >       CPU0                    CPU1
> >       ----                    ----
> >  lock(sb_internal);
> >                               lock(fs_reclaim);
> >                               lock(sb_internal);
> >  lock(fs_reclaim);
> > 
> > *** DEADLOCK ***
> > 
> > 4 locks held by fsfreeze/4346:
> > #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
> > #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
> > #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
> > #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
> > 
> > stack backtrace:
> > Call Trace:
> > dump_stack+0xe0/0x19a
> > print_circular_bug.isra.10.cold.34+0x2f4/0x435
> > check_prev_add.constprop.19+0xca1/0x15f0
> > validate_chain.isra.14+0x11af/0x3b50
> > __lock_acquire+0x728/0x1200
> > lock_acquire+0x269/0x5a0
> > fs_reclaim_acquire.part.19+0x29/0x30
> > fs_reclaim_acquire+0x19/0x20
> > kmem_cache_alloc+0x3e/0x3f0
> > kmem_zone_alloc+0x79/0x150
> > xfs_trans_alloc+0xfa/0x9d0
> > xfs_sync_sb+0x86/0x170
> > xfs_log_sbcount+0x10f/0x140
> > xfs_quiesce_attr+0x134/0x270
> > xfs_fs_freeze+0x4a/0x70
> > freeze_super+0x1af/0x290
> > do_vfs_ioctl+0xedc/0x16c0
> > ksys_ioctl+0x41/0x80
> > __x64_sys_ioctl+0x73/0xa9
> > do_syscall_64+0x18f/0xd23
> > entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > 
> > According to Dave Chinner:
> > 
> >  Freezing the filesystem, after all the data has been cleaned. IOWs
> >  memory reclaim will never run the above writeback path when
> >  the freeze process is trying to allocate a transaction here because
> >  there are no dirty data pages in the filesystem at this point.
> > 
> >  Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that
> >  it /doesn't deadlock/ by taking freeze references for the
> >  transaction. We've just drained all the transactions
> >  in progress and written back all the dirty metadata, too, and so the
> >  filesystem is completely clean and only needs the superblock to be
> >  updated to complete the freeze process. And to do that, it does not
> >  take a freeze reference because calling sb_start_intwrite() here
> >  would deadlock.
> > 
> >  IOWs, this is a false positive, caused by the fact that
> >  xfs_trans_alloc() is called from both above and below memory reclaim
> >  as well as within /every level/ of freeze processing. Lockdep is
> >  unable to describe the staged flush logic in the freeze process that
> >  prevents deadlocks from occurring, and hence we will pretty much
> >  always see false positives in the freeze path....
> > 
> > Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
> > may fix the issue. However, that will greatly complicate the logic and
> > may not be worth it.
> > 
> > Another way to fix it is to disable the taking of the fs_reclaim
> > pseudo lock when in the freezing code path as a reclaim on the freezed
> > filesystem is not possible as stated above. This patch takes this
> > approach by setting the __GFP_NOLOCKDEP flag in the slab memory
> > allocation calls when the filesystem has been freezed.
> > 
> > Without this patch, the command sequence below will show that the lock
> > dependency chain sb_internal -> fs_reclaim exists.
> > 
> > # fsfreeze -f /home
> > # fsfreeze --unfreeze /home
> > # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
> > 
> > After applying the patch, such sb_internal -> fs_reclaim lock dependency
> > chain can no longer be found. Because of that, the locking dependency
> > warning will not be shown.
> 
> There was an attempt to fix this in the past, but Dave rejected right
> away for any workaround in xfs and insisted to make lockdep smarter
> instead. No sure your approach will make any difference this time.
> Good luck.

/me wonders if you can fix this by having the freeze path call
memalloc_nofs_save() since we probably don't want to be recursing into
the fs for reclaim while freezing it?  Probably not, because that's a
bigger hammer than we really need here.  We can certainly steal memory
from other filesystems that aren't frozen.

It doesn't solve the key issue that lockdep isn't smart enough to know
that we can't recurse into the fs that's being frozen and therefore
there's no chance of deadlock.

--D
Waiman Long Jan. 2, 2020, 6:41 p.m. UTC | #3
On 1/2/20 1:24 PM, Darrick J. Wong wrote:
> On Thu, Jan 02, 2020 at 11:19:51AM -0500, Qian Cai wrote:
>>
>>> On Jan 2, 2020, at 10:52 AM, Waiman Long <longman@redhat.com> wrote:
>>>
>>> Depending on the workloads, the following circular locking dependency
>>> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
>>> lock) may show up:
>>>
>>> ======================================================
>>> WARNING: possible circular locking dependency detected
>>> 5.0.0-rc1+ #60 Tainted: G        W
>>> ------------------------------------------------------
>>> fsfreeze/4346 is trying to acquire lock:
>>> 0000000026f1d784 (fs_reclaim){+.+.}, at:
>>> fs_reclaim_acquire.part.19+0x5/0x30
>>>
>>> but task is already holding lock:
>>> 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
>>>
>>> which lock already depends on the new lock.
>>>  :
>>> Possible unsafe locking scenario:
>>>
>>>       CPU0                    CPU1
>>>       ----                    ----
>>>  lock(sb_internal);
>>>                               lock(fs_reclaim);
>>>                               lock(sb_internal);
>>>  lock(fs_reclaim);
>>>
>>> *** DEADLOCK ***
>>>
>>> 4 locks held by fsfreeze/4346:
>>> #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
>>> #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
>>> #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
>>> #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
>>>
>>> stack backtrace:
>>> Call Trace:
>>> dump_stack+0xe0/0x19a
>>> print_circular_bug.isra.10.cold.34+0x2f4/0x435
>>> check_prev_add.constprop.19+0xca1/0x15f0
>>> validate_chain.isra.14+0x11af/0x3b50
>>> __lock_acquire+0x728/0x1200
>>> lock_acquire+0x269/0x5a0
>>> fs_reclaim_acquire.part.19+0x29/0x30
>>> fs_reclaim_acquire+0x19/0x20
>>> kmem_cache_alloc+0x3e/0x3f0
>>> kmem_zone_alloc+0x79/0x150
>>> xfs_trans_alloc+0xfa/0x9d0
>>> xfs_sync_sb+0x86/0x170
>>> xfs_log_sbcount+0x10f/0x140
>>> xfs_quiesce_attr+0x134/0x270
>>> xfs_fs_freeze+0x4a/0x70
>>> freeze_super+0x1af/0x290
>>> do_vfs_ioctl+0xedc/0x16c0
>>> ksys_ioctl+0x41/0x80
>>> __x64_sys_ioctl+0x73/0xa9
>>> do_syscall_64+0x18f/0xd23
>>> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>
>>> According to Dave Chinner:
>>>
>>>  Freezing the filesystem, after all the data has been cleaned. IOWs
>>>  memory reclaim will never run the above writeback path when
>>>  the freeze process is trying to allocate a transaction here because
>>>  there are no dirty data pages in the filesystem at this point.
>>>
>>>  Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that
>>>  it /doesn't deadlock/ by taking freeze references for the
>>>  transaction. We've just drained all the transactions
>>>  in progress and written back all the dirty metadata, too, and so the
>>>  filesystem is completely clean and only needs the superblock to be
>>>  updated to complete the freeze process. And to do that, it does not
>>>  take a freeze reference because calling sb_start_intwrite() here
>>>  would deadlock.
>>>
>>>  IOWs, this is a false positive, caused by the fact that
>>>  xfs_trans_alloc() is called from both above and below memory reclaim
>>>  as well as within /every level/ of freeze processing. Lockdep is
>>>  unable to describe the staged flush logic in the freeze process that
>>>  prevents deadlocks from occurring, and hence we will pretty much
>>>  always see false positives in the freeze path....
>>>
>>> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
>>> may fix the issue. However, that will greatly complicate the logic and
>>> may not be worth it.
>>>
>>> Another way to fix it is to disable the taking of the fs_reclaim
>>> pseudo lock when in the freezing code path as a reclaim on the freezed
>>> filesystem is not possible as stated above. This patch takes this
>>> approach by setting the __GFP_NOLOCKDEP flag in the slab memory
>>> allocation calls when the filesystem has been freezed.
>>>
>>> Without this patch, the command sequence below will show that the lock
>>> dependency chain sb_internal -> fs_reclaim exists.
>>>
>>> # fsfreeze -f /home
>>> # fsfreeze --unfreeze /home
>>> # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
>>>
>>> After applying the patch, such sb_internal -> fs_reclaim lock dependency
>>> chain can no longer be found. Because of that, the locking dependency
>>> warning will not be shown.
>> There was an attempt to fix this in the past, but Dave rejected right
>> away for any workaround in xfs and insisted to make lockdep smarter
>> instead. No sure your approach will make any difference this time.
>> Good luck.
> /me wonders if you can fix this by having the freeze path call
> memalloc_nofs_save() since we probably don't want to be recursing into
> the fs for reclaim while freezing it?  Probably not, because that's a
> bigger hammer than we really need here.  We can certainly steal memory
> from other filesystems that aren't frozen.
>
> It doesn't solve the key issue that lockdep isn't smart enough to know
> that we can't recurse into the fs that's being frozen and therefore
> there's no chance of deadlock.

Lockdep only looks at all the possible locking chains to see if a
circular deadlock is possible. It doesn't have the smart to understand
filesystem internals. The problem here is caused by the fact that
fs_reclaim is a global pseudo lock that is acquired whenever there is a
chance that FS reclaim can happen. As I said in the commit log, it may
be possible to fix that by breaking up fs_reclaim into a set of
per-filesystem pseudo locks, but that will add quite a bit of complexity
to the code. That is why I don't want to go this route. This patch is
the least invasive that I can think of to address the problem without
inhibiting other valid lockdep checking.

Cheers,
Longman
Dave Chinner Jan. 4, 2020, 2:36 a.m. UTC | #4
On Thu, Jan 02, 2020 at 10:52:08AM -0500, Waiman Long wrote:
> Depending on the workloads, the following circular locking dependency
> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
> lock) may show up:
> 
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.0.0-rc1+ #60 Tainted: G        W
> ------------------------------------------------------
> fsfreeze/4346 is trying to acquire lock:
> 0000000026f1d784 (fs_reclaim){+.+.}, at:
> fs_reclaim_acquire.part.19+0x5/0x30
> 
> but task is already holding lock:
> 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
> 
> which lock already depends on the new lock.
>   :
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(sb_internal);
>                                lock(fs_reclaim);
>                                lock(sb_internal);
>   lock(fs_reclaim);
> 
>  *** DEADLOCK ***
> 
> 4 locks held by fsfreeze/4346:
>  #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
>  #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
>  #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
>  #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
> 
> stack backtrace:
> Call Trace:
>  dump_stack+0xe0/0x19a
>  print_circular_bug.isra.10.cold.34+0x2f4/0x435
>  check_prev_add.constprop.19+0xca1/0x15f0
>  validate_chain.isra.14+0x11af/0x3b50
>  __lock_acquire+0x728/0x1200
>  lock_acquire+0x269/0x5a0
>  fs_reclaim_acquire.part.19+0x29/0x30
>  fs_reclaim_acquire+0x19/0x20
>  kmem_cache_alloc+0x3e/0x3f0
>  kmem_zone_alloc+0x79/0x150
>  xfs_trans_alloc+0xfa/0x9d0
>  xfs_sync_sb+0x86/0x170
>  xfs_log_sbcount+0x10f/0x140
>  xfs_quiesce_attr+0x134/0x270
>  xfs_fs_freeze+0x4a/0x70
>  freeze_super+0x1af/0x290
>  do_vfs_ioctl+0xedc/0x16c0
>  ksys_ioctl+0x41/0x80
>  __x64_sys_ioctl+0x73/0xa9
>  do_syscall_64+0x18f/0xd23
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> According to Dave Chinner:
> 
>   Freezing the filesystem, after all the data has been cleaned. IOWs
>   memory reclaim will never run the above writeback path when
>   the freeze process is trying to allocate a transaction here because
>   there are no dirty data pages in the filesystem at this point.
> 
>   Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that
>   it /doesn't deadlock/ by taking freeze references for the
>   transaction. We've just drained all the transactions
>   in progress and written back all the dirty metadata, too, and so the
>   filesystem is completely clean and only needs the superblock to be
>   updated to complete the freeze process. And to do that, it does not
>   take a freeze reference because calling sb_start_intwrite() here
>   would deadlock.
> 
>   IOWs, this is a false positive, caused by the fact that
>   xfs_trans_alloc() is called from both above and below memory reclaim
>   as well as within /every level/ of freeze processing. Lockdep is
>   unable to describe the staged flush logic in the freeze process that
>   prevents deadlocks from occurring, and hence we will pretty much
>   always see false positives in the freeze path....
> 
> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
> may fix the issue. However, that will greatly complicate the logic and
> may not be worth it.

ANd it won't work, because now we'll just get lockedp warnings on
the per-fs reclaim lockdep context.

> Another way to fix it is to disable the taking of the fs_reclaim
> pseudo lock when in the freezing code path as a reclaim on the freezed
> filesystem is not possible as stated above. This patch takes this
> approach by setting the __GFP_NOLOCKDEP flag in the slab memory
> allocation calls when the filesystem has been freezed.

IOWs, "fix" it by stating that "lockdep can't track freeze
dependencies correctly"?

In the past we have just used KM_NOFS for that, because
__GFP_NOLOCKDEP didn't exist. But that has just been a nasty hack
because lockdep isn't capable of understanding allocation context
constraints because allocation contexts are much more complex than a
"lock"....


> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -20,6 +20,12 @@ typedef unsigned __bitwise xfs_km_flags_t;
>  #define KM_MAYFAIL	((__force xfs_km_flags_t)0x0008u)
>  #define KM_ZERO		((__force xfs_km_flags_t)0x0010u)
>  
> +#ifdef CONFIG_LOCKDEP
> +#define KM_NOLOCKDEP	((__force xfs_km_flags_t)0x0020u)
> +#else
> +#define KM_NOLOCKDEP	((__force xfs_km_flags_t)0)
> +#endif

Nope. We are getting rid of kmem_alloc wrappers and all the
associated flags, not adding new ones. Part of that process is
identifying all the places we currently use KM_NOFS to "shut up
lockdep" and converting them to explicit __GFP_NOLOCKDEP flags.

So right now, this change needs to be queued up behind the API
changes that are currently in progress...

> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index f6006d94a581..b1997649ecd8 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -454,7 +454,8 @@ xfs_log_reserve(
>  	XFS_STATS_INC(mp, xs_try_logspace);
>  
>  	ASSERT(*ticp == NULL);
> -	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent, 0);
> +	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent,
> +			mp->m_super->s_writers.frozen ? KM_NOLOCKDEP : 0);

This is pretty nasty. Having to spew conditional code like this
across every allocation that could be done in freeze conditions is
a non-starter.

We already have a flag to tell us we are doing a transaction in a
freeze state, so use that to turn off lockdep. That is:

>  	*ticp = tic;
>  
>  	xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 3b208f9a865c..c0e42e4f5b77 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -262,8 +262,14 @@ xfs_trans_alloc(
>  	 * Allocate the handle before we do our freeze accounting and setting up
>  	 * GFP_NOFS allocation context so that we avoid lockdep false positives
>  	 * by doing GFP_KERNEL allocations inside sb_start_intwrite().
> +	 *
> +	 * To prevent false positive lockdep warning of circular locking
> +	 * dependency between sb_internal and fs_reclaim, disable the
> +	 * acquisition of the fs_reclaim pseudo-lock when the superblock
> +	 * has been frozen or in the process of being frozen.
>  	 */
> -	tp = kmem_zone_zalloc(xfs_trans_zone, 0);
> +	tp = kmem_zone_zalloc(xfs_trans_zone,
> +		mp->m_super->s_writers.frozen ? KM_NOLOCKDEP : 0);
>  	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
>  		sb_start_intwrite(mp->m_super);

This code here should be setting PF_GFP_NOLOCKDEP state to turn off
lockdep for all allocations in this context, similar to the way we
use memalloc_nofs_save/restore so that all nested allocations
inherit GFP_NOFS state...

-Dave.
Waiman Long Jan. 6, 2020, 4:12 p.m. UTC | #5
On 1/3/20 9:36 PM, Dave Chinner wrote:
> On Thu, Jan 02, 2020 at 10:52:08AM -0500, Waiman Long wrote:
>> Depending on the workloads, the following circular locking dependency
>> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
>> lock) may show up:
>>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 5.0.0-rc1+ #60 Tainted: G        W
>> ------------------------------------------------------
>> fsfreeze/4346 is trying to acquire lock:
>> 0000000026f1d784 (fs_reclaim){+.+.}, at:
>> fs_reclaim_acquire.part.19+0x5/0x30
>>
>> but task is already holding lock:
>> 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
>>
>> which lock already depends on the new lock.
>>   :
>>  Possible unsafe locking scenario:
>>
>>        CPU0                    CPU1
>>        ----                    ----
>>   lock(sb_internal);
>>                                lock(fs_reclaim);
>>                                lock(sb_internal);
>>   lock(fs_reclaim);
>>
>>  *** DEADLOCK ***
>>
>> 4 locks held by fsfreeze/4346:
>>  #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
>>  #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
>>  #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
>>  #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
>>
>> stack backtrace:
>> Call Trace:
>>  dump_stack+0xe0/0x19a
>>  print_circular_bug.isra.10.cold.34+0x2f4/0x435
>>  check_prev_add.constprop.19+0xca1/0x15f0
>>  validate_chain.isra.14+0x11af/0x3b50
>>  __lock_acquire+0x728/0x1200
>>  lock_acquire+0x269/0x5a0
>>  fs_reclaim_acquire.part.19+0x29/0x30
>>  fs_reclaim_acquire+0x19/0x20
>>  kmem_cache_alloc+0x3e/0x3f0
>>  kmem_zone_alloc+0x79/0x150
>>  xfs_trans_alloc+0xfa/0x9d0
>>  xfs_sync_sb+0x86/0x170
>>  xfs_log_sbcount+0x10f/0x140
>>  xfs_quiesce_attr+0x134/0x270
>>  xfs_fs_freeze+0x4a/0x70
>>  freeze_super+0x1af/0x290
>>  do_vfs_ioctl+0xedc/0x16c0
>>  ksys_ioctl+0x41/0x80
>>  __x64_sys_ioctl+0x73/0xa9
>>  do_syscall_64+0x18f/0xd23
>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> According to Dave Chinner:
>>
>>   Freezing the filesystem, after all the data has been cleaned. IOWs
>>   memory reclaim will never run the above writeback path when
>>   the freeze process is trying to allocate a transaction here because
>>   there are no dirty data pages in the filesystem at this point.
>>
>>   Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that
>>   it /doesn't deadlock/ by taking freeze references for the
>>   transaction. We've just drained all the transactions
>>   in progress and written back all the dirty metadata, too, and so the
>>   filesystem is completely clean and only needs the superblock to be
>>   updated to complete the freeze process. And to do that, it does not
>>   take a freeze reference because calling sb_start_intwrite() here
>>   would deadlock.
>>
>>   IOWs, this is a false positive, caused by the fact that
>>   xfs_trans_alloc() is called from both above and below memory reclaim
>>   as well as within /every level/ of freeze processing. Lockdep is
>>   unable to describe the staged flush logic in the freeze process that
>>   prevents deadlocks from occurring, and hence we will pretty much
>>   always see false positives in the freeze path....
>>
>> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
>> may fix the issue. However, that will greatly complicate the logic and
>> may not be worth it.
> ANd it won't work, because now we'll just get lockedp warnings on
> the per-fs reclaim lockdep context.

It may or may not work depending on how we are going to break it up. I
haven't thought through that alternative yet as I am expecting that it
will be a bigger change if we decide to go this route.


>
>> Another way to fix it is to disable the taking of the fs_reclaim
>> pseudo lock when in the freezing code path as a reclaim on the freezed
>> filesystem is not possible as stated above. This patch takes this
>> approach by setting the __GFP_NOLOCKDEP flag in the slab memory
>> allocation calls when the filesystem has been freezed.
> IOWs, "fix" it by stating that "lockdep can't track freeze
> dependencies correctly"?
The lockdep code has a singular focus on spotting possible deadlock
scenarios from a locking point of view. The freeze dependency has to be
properly translated into appropriate locking sequences to make lockdep
work correctly. I would say that the use of a global fs_reclaim pseudo
lock is not a perfect translation and so it has exception cases we need
to handle.
>
> In the past we have just used KM_NOFS for that, because
> __GFP_NOLOCKDEP didn't exist. But that has just been a nasty hack
> because lockdep isn't capable of understanding allocation context
> constraints because allocation contexts are much more complex than a
> "lock"....
>
>
>> --- a/fs/xfs/kmem.h
>> +++ b/fs/xfs/kmem.h
>> @@ -20,6 +20,12 @@ typedef unsigned __bitwise xfs_km_flags_t;
>>  #define KM_MAYFAIL	((__force xfs_km_flags_t)0x0008u)
>>  #define KM_ZERO		((__force xfs_km_flags_t)0x0010u)
>>  
>> +#ifdef CONFIG_LOCKDEP
>> +#define KM_NOLOCKDEP	((__force xfs_km_flags_t)0x0020u)
>> +#else
>> +#define KM_NOLOCKDEP	((__force xfs_km_flags_t)0)
>> +#endif
> Nope. We are getting rid of kmem_alloc wrappers and all the
> associated flags, not adding new ones. Part of that process is
> identifying all the places we currently use KM_NOFS to "shut up
> lockdep" and converting them to explicit __GFP_NOLOCKDEP flags.
>
> So right now, this change needs to be queued up behind the API
> changes that are currently in progress...

That is fine. I can wait and post a revised patch after that. Who is
going to make these changes?


>> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
>> index f6006d94a581..b1997649ecd8 100644
>> --- a/fs/xfs/xfs_log.c
>> +++ b/fs/xfs/xfs_log.c
>> @@ -454,7 +454,8 @@ xfs_log_reserve(
>>  	XFS_STATS_INC(mp, xs_try_logspace);
>>  
>>  	ASSERT(*ticp == NULL);
>> -	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent, 0);
>> +	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent,
>> +			mp->m_super->s_writers.frozen ? KM_NOLOCKDEP : 0);
> This is pretty nasty. Having to spew conditional code like this
> across every allocation that could be done in freeze conditions is
> a non-starter.
>
> We already have a flag to tell us we are doing a transaction in a
> freeze state, so use that to turn off lockdep. That is:

OK.


>>  	*ticp = tic;
>>  
>>  	xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
>> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
>> index 3b208f9a865c..c0e42e4f5b77 100644
>> --- a/fs/xfs/xfs_trans.c
>> +++ b/fs/xfs/xfs_trans.c
>> @@ -262,8 +262,14 @@ xfs_trans_alloc(
>>  	 * Allocate the handle before we do our freeze accounting and setting up
>>  	 * GFP_NOFS allocation context so that we avoid lockdep false positives
>>  	 * by doing GFP_KERNEL allocations inside sb_start_intwrite().
>> +	 *
>> +	 * To prevent false positive lockdep warning of circular locking
>> +	 * dependency between sb_internal and fs_reclaim, disable the
>> +	 * acquisition of the fs_reclaim pseudo-lock when the superblock
>> +	 * has been frozen or in the process of being frozen.
>>  	 */
>> -	tp = kmem_zone_zalloc(xfs_trans_zone, 0);
>> +	tp = kmem_zone_zalloc(xfs_trans_zone,
>> +		mp->m_super->s_writers.frozen ? KM_NOLOCKDEP : 0);
>>  	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
>>  		sb_start_intwrite(mp->m_super);
> This code here should be setting PF_GFP_NOLOCKDEP state to turn off
> lockdep for all allocations in this context, similar to the way we
> use memalloc_nofs_save/restore so that all nested allocations
> inherit GFP_NOFS state...

Sure.

Thanks for the comments.

Cheers,
Longman
Dave Chinner Jan. 6, 2020, 9:01 p.m. UTC | #6
On Mon, Jan 06, 2020 at 11:12:32AM -0500, Waiman Long wrote:
> On 1/3/20 9:36 PM, Dave Chinner wrote:
> > On Thu, Jan 02, 2020 at 10:52:08AM -0500, Waiman Long wrote:
> >> Depending on the workloads, the following circular locking dependency
> >> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
> >> lock) may show up:
....
> >>   IOWs, this is a false positive, caused by the fact that
> >>   xfs_trans_alloc() is called from both above and below memory reclaim
> >>   as well as within /every level/ of freeze processing. Lockdep is
> >>   unable to describe the staged flush logic in the freeze process that
> >>   prevents deadlocks from occurring, and hence we will pretty much
> >>   always see false positives in the freeze path....
> >>
> >> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
> >> may fix the issue. However, that will greatly complicate the logic and
> >> may not be worth it.
> > ANd it won't work, because now we'll just get lockedp warnings on
> > the per-fs reclaim lockdep context.
> 
> It may or may not work depending on how we are going to break it up. I
> haven't thought through that alternative yet as I am expecting that it
> will be a bigger change if we decide to go this route.

A per-filesystem lock will not work because a single XFS filesystem
can trigger this "locking" inversion -both contexts that lockdep
warns about occur in the normal operation of that single filesystem.

The only way to avoid this is to have multiple context specific
reclaim lockdep contexts per filesystem, and that becomes a mess
really quickly. The biggest problem with this is that the "lock
context" is no longer validated consistently across the entire
filesystem - we con only internally validate the order or each lock
context against itself, and not against operations in the other lock
contexts. Hence there's no global validation of lock orders -
lockdep allows different lock orders in different contexts and so
that defeats the purpose of using lockdep for this validation.

Indeed, we've been down this path before with lockdep for XFS inode
locking vs inode reclaim(*), and we removed it years ago because
multiple lock contexts for different operations and/or life-cycle
stages just hasn't been reliable or maintainable. We still get false
positives because we haven't solved the "lockdep can't express the
dependencies fully" problem, yet we've reduced the lock order
validation scope of lockdep considerably....

> >> Another way to fix it is to disable the taking of the fs_reclaim
> >> pseudo lock when in the freezing code path as a reclaim on the freezed
> >> filesystem is not possible as stated above. This patch takes this
> >> approach by setting the __GFP_NOLOCKDEP flag in the slab memory
> >> allocation calls when the filesystem has been freezed.
> > IOWs, "fix" it by stating that "lockdep can't track freeze
> > dependencies correctly"?
> The lockdep code has a singular focus on spotting possible deadlock
> scenarios from a locking point of view.

IMO, lockdep only works for very simplistic locking strategies.
Anything complex requires more work to get lockdep annotations
"correct enough" to prevent false positives than it does to actually
read the code and very the locking is correct.

Have you noticed we do runs of nested trylock-and-back-out-on-
failure because we lock objects in an order that might deadlock
because of cross-object state dependencies that can't be covered by
lockdep?  e.g. xfs_lock_inodes() which nests up to 5 inodes deep,
can nest 3 separate locks per inode and has to take into account
journal flushing depenedencies when locking multiple inodes?

Lockdep is very simplisitic and the complex annotations we need to
handle situations like the above are very difficult to design,
use and maintainr. It's so much simpler just to use big hammers like
GFP_NOFS to shut up all the different types of false positives
lockdep throws up for reclaim context false positives because after
all these years there still isn't a viable mechanism to easily
express this sort of complex dependency chain.

> The freeze dependency has to be
> properly translated into appropriate locking sequences to make lockdep
> work correctly.i

This is part of the problem - freeze context is not actually a lock
but it's close enough that freezing on simple filesystems can be
validated with lockdep annotations. i.e. same basic concept as the
lockdep reclaim context. However, it just doesn't work reliably for
more complex subsystems where there are much more subtle and complex
behavioural interactions and dependencies that a single lock context
cannot express or be annotated to express. That's where lockdep
falls down....

> I would say that the use of a global fs_reclaim pseudo
> lock is not a perfect translation and so it has exception cases we need
> to handle.

Exactly the problem.

> > Nope. We are getting rid of kmem_alloc wrappers and all the
> > associated flags, not adding new ones. Part of that process is
> > identifying all the places we currently use KM_NOFS to "shut up
> > lockdep" and converting them to explicit __GFP_NOLOCKDEP flags.
> >
> > So right now, this change needs to be queued up behind the API
> > changes that are currently in progress...
> 
> That is fine. I can wait and post a revised patch after that. Who is
> going to make these changes?

https://lore.kernel.org/linux-xfs/20191120104425.407213-1-cmaiolino@redhat.com/

Cheers,

Dave.
Waiman Long Jan. 7, 2020, 3:40 p.m. UTC | #7
On 1/6/20 4:01 PM, Dave Chinner wrote:
> On Mon, Jan 06, 2020 at 11:12:32AM -0500, Waiman Long wrote:
>> On 1/3/20 9:36 PM, Dave Chinner wrote:
>>> On Thu, Jan 02, 2020 at 10:52:08AM -0500, Waiman Long wrote:
>>>> Depending on the workloads, the following circular locking dependency
>>>> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
>>>> lock) may show up:
> ....
>>>>   IOWs, this is a false positive, caused by the fact that
>>>>   xfs_trans_alloc() is called from both above and below memory reclaim
>>>>   as well as within /every level/ of freeze processing. Lockdep is
>>>>   unable to describe the staged flush logic in the freeze process that
>>>>   prevents deadlocks from occurring, and hence we will pretty much
>>>>   always see false positives in the freeze path....
>>>>
>>>> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
>>>> may fix the issue. However, that will greatly complicate the logic and
>>>> may not be worth it.
>>> ANd it won't work, because now we'll just get lockedp warnings on
>>> the per-fs reclaim lockdep context.
>> It may or may not work depending on how we are going to break it up. I
>> haven't thought through that alternative yet as I am expecting that it
>> will be a bigger change if we decide to go this route.
> A per-filesystem lock will not work because a single XFS filesystem
> can trigger this "locking" inversion -both contexts that lockdep
> warns about occur in the normal operation of that single filesystem.
>
> The only way to avoid this is to have multiple context specific
> reclaim lockdep contexts per filesystem, and that becomes a mess
> really quickly. The biggest problem with this is that the "lock
> context" is no longer validated consistently across the entire
> filesystem - we con only internally validate the order or each lock
> context against itself, and not against operations in the other lock
> contexts. Hence there's no global validation of lock orders -
> lockdep allows different lock orders in different contexts and so
> that defeats the purpose of using lockdep for this validation.
>
> Indeed, we've been down this path before with lockdep for XFS inode
> locking vs inode reclaim(*), and we removed it years ago because
> multiple lock contexts for different operations and/or life-cycle
> stages just hasn't been reliable or maintainable. We still get false
> positives because we haven't solved the "lockdep can't express the
> dependencies fully" problem, yet we've reduced the lock order
> validation scope of lockdep considerably....
>
That is true. You can make lockdep to check these kind of complicated
dependency accurately.


>>>> Another way to fix it is to disable the taking of the fs_reclaim
>>>> pseudo lock when in the freezing code path as a reclaim on the freezed
>>>> filesystem is not possible as stated above. This patch takes this
>>>> approach by setting the __GFP_NOLOCKDEP flag in the slab memory
>>>> allocation calls when the filesystem has been freezed.
>>> IOWs, "fix" it by stating that "lockdep can't track freeze
>>> dependencies correctly"?
>> The lockdep code has a singular focus on spotting possible deadlock
>> scenarios from a locking point of view.
> IMO, lockdep only works for very simplistic locking strategies.
> Anything complex requires more work to get lockdep annotations
> "correct enough" to prevent false positives than it does to actually
> read the code and very the locking is correct.
>
> Have you noticed we do runs of nested trylock-and-back-out-on-
> failure because we lock objects in an order that might deadlock
> because of cross-object state dependencies that can't be covered by
> lockdep?  e.g. xfs_lock_inodes() which nests up to 5 inodes deep,
> can nest 3 separate locks per inode and has to take into account
> journal flushing depenedencies when locking multiple inodes?
>
> Lockdep is very simplisitic and the complex annotations we need to
> handle situations like the above are very difficult to design,
> use and maintainr. It's so much simpler just to use big hammers like
> GFP_NOFS to shut up all the different types of false positives
> lockdep throws up for reclaim context false positives because after
> all these years there still isn't a viable mechanism to easily
> express this sort of complex dependency chain.
Regarding the trylock-and-back-out-on_failure code, do you think adding
new APIs with timeout for mutex and rwsem and may be spin count for
spinlock will help to at least reduce the number of failures that can
happen in the code. RT mutex does have a rt_mutex_timed_lock(), but
there is no equivalent for mutex and rwsem.
>> The freeze dependency has to be
>> properly translated into appropriate locking sequences to make lockdep
>> work correctly.i
> This is part of the problem - freeze context is not actually a lock
> but it's close enough that freezing on simple filesystems can be
> validated with lockdep annotations. i.e. same basic concept as the
> lockdep reclaim context. However, it just doesn't work reliably for
> more complex subsystems where there are much more subtle and complex
> behavioural interactions and dependencies that a single lock context
> cannot express or be annotated to express. That's where lockdep
> falls down....
>
>> I would say that the use of a global fs_reclaim pseudo
>> lock is not a perfect translation and so it has exception cases we need
>> to handle.
> Exactly the problem.
>
>>> Nope. We are getting rid of kmem_alloc wrappers and all the
>>> associated flags, not adding new ones. Part of that process is
>>> identifying all the places we currently use KM_NOFS to "shut up
>>> lockdep" and converting them to explicit __GFP_NOLOCKDEP flags.
>>>
>>> So right now, this change needs to be queued up behind the API
>>> changes that are currently in progress...
>> That is fine. I can wait and post a revised patch after that. Who is
>> going to make these changes?
> https://lore.kernel.org/linux-xfs/20191120104425.407213-1-cmaiolino@redhat.com/

Thanks for the pointer. I will rebase my patch on top of that.

Cheers,
Longman
Dave Chinner Jan. 7, 2020, 9:31 p.m. UTC | #8
On Tue, Jan 07, 2020 at 10:40:12AM -0500, Waiman Long wrote:
> On 1/6/20 4:01 PM, Dave Chinner wrote:
> > On Mon, Jan 06, 2020 at 11:12:32AM -0500, Waiman Long wrote:
> >> On 1/3/20 9:36 PM, Dave Chinner wrote:
> >>> On Thu, Jan 02, 2020 at 10:52:08AM -0500, Waiman Long wrote:
> >>> IOWs, "fix" it by stating that "lockdep can't track freeze
> >>> dependencies correctly"?
> >> The lockdep code has a singular focus on spotting possible deadlock
> >> scenarios from a locking point of view.
> > IMO, lockdep only works for very simplistic locking strategies.
> > Anything complex requires more work to get lockdep annotations
> > "correct enough" to prevent false positives than it does to actually
> > read the code and very the locking is correct.
> >
> > Have you noticed we do runs of nested trylock-and-back-out-on-
> > failure because we lock objects in an order that might deadlock
> > because of cross-object state dependencies that can't be covered by
> > lockdep?  e.g. xfs_lock_inodes() which nests up to 5 inodes deep,
> > can nest 3 separate locks per inode and has to take into account
> > journal flushing depenedencies when locking multiple inodes?
> >
> > Lockdep is very simplisitic and the complex annotations we need to
> > handle situations like the above are very difficult to design,
> > use and maintainr. It's so much simpler just to use big hammers like
> > GFP_NOFS to shut up all the different types of false positives
> > lockdep throws up for reclaim context false positives because after
> > all these years there still isn't a viable mechanism to easily
> > express this sort of complex dependency chain.
>
> Regarding the trylock-and-back-out-on_failure code, do you think adding
> new APIs with timeout for mutex and rwsem and may be spin count for

Timeouts have no place in generic locking APIs. Indeed, in these
cases, timeouts do nothing to avoid the issues that require us to
use trylock-and-back-out algorithms, so timeouts do nothing but
hold off racing inode locks for unnecessarily long time periods
while we wait for something to happen somewhere else that we have no
direct control over....

> spinlock will help to at least reduce the number of failures that can
> happen in the code. RT mutex does have a rt_mutex_timed_lock(), but
> there is no equivalent for mutex and rwsem.

Realtime has all sorts of problems with blocking where normal
kernels don't (e.g. by turning spinlocks into mutexes) and so it
forces rt code to jump through all sorts of crazy hoops to prevent
priority inversions and deadlocks. If lock timeouts are necessary
to avoid deadlocks and/or priority inversions in your code, then
that indicates that there are locking algorithm problems that need
fixing.

Cheers,

Dave.

Patch
diff mbox series

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 6143117770e9..901189dcc474 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -20,6 +20,12 @@  typedef unsigned __bitwise xfs_km_flags_t;
 #define KM_MAYFAIL	((__force xfs_km_flags_t)0x0008u)
 #define KM_ZERO		((__force xfs_km_flags_t)0x0010u)
 
+#ifdef CONFIG_LOCKDEP
+#define KM_NOLOCKDEP	((__force xfs_km_flags_t)0x0020u)
+#else
+#define KM_NOLOCKDEP	((__force xfs_km_flags_t)0)
+#endif
+
 /*
  * We use a special process flag to avoid recursive callbacks into
  * the filesystem during transactions.  We will also issue our own
@@ -30,7 +36,7 @@  kmem_flags_convert(xfs_km_flags_t flags)
 {
 	gfp_t	lflags;
 
-	BUG_ON(flags & ~(KM_NOFS|KM_MAYFAIL|KM_ZERO));
+	BUG_ON(flags & ~(KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP));
 
 	lflags = GFP_KERNEL | __GFP_NOWARN;
 	if (flags & KM_NOFS)
@@ -49,6 +55,9 @@  kmem_flags_convert(xfs_km_flags_t flags)
 	if (flags & KM_ZERO)
 		lflags |= __GFP_ZERO;
 
+	if (flags & KM_NOLOCKDEP)
+		lflags |= __GFP_NOLOCKDEP;
+
 	return lflags;
 }
 
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index f6006d94a581..b1997649ecd8 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -454,7 +454,8 @@  xfs_log_reserve(
 	XFS_STATS_INC(mp, xs_try_logspace);
 
 	ASSERT(*ticp == NULL);
-	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent, 0);
+	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent,
+			mp->m_super->s_writers.frozen ? KM_NOLOCKDEP : 0);
 	*ticp = tic;
 
 	xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 3b208f9a865c..c0e42e4f5b77 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -262,8 +262,14 @@  xfs_trans_alloc(
 	 * Allocate the handle before we do our freeze accounting and setting up
 	 * GFP_NOFS allocation context so that we avoid lockdep false positives
 	 * by doing GFP_KERNEL allocations inside sb_start_intwrite().
+	 *
+	 * To prevent false positive lockdep warning of circular locking
+	 * dependency between sb_internal and fs_reclaim, disable the
+	 * acquisition of the fs_reclaim pseudo-lock when the superblock
+	 * has been frozen or in the process of being frozen.
 	 */
-	tp = kmem_zone_zalloc(xfs_trans_zone, 0);
+	tp = kmem_zone_zalloc(xfs_trans_zone,
+		mp->m_super->s_writers.frozen ? KM_NOLOCKDEP : 0);
 	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
 		sb_start_intwrite(mp->m_super);