* [PATCH 0/2] sched, xfs: Add PF_MEMALLOC_NOLOCKDEP to fix lockdep problem in xfs @ 2020-06-15 16:08 Waiman Long 2020-06-15 16:08 ` [PATCH 1/2] sched: Add PF_MEMALLOC_NOLOCKDEP flag Waiman Long 2020-06-15 16:08 ` [PATCH 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim Waiman Long 0 siblings, 2 replies; 7+ messages in thread From: Waiman Long @ 2020-06-15 16:08 UTC (permalink / raw) To: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot Cc: linux-xfs, linux-kernel, Dave Chinner, Qian Cai, Eric Sandeen, Andrew Morton, Waiman Long There is a false positive lockdep warning in how the xfs code handle filesystem freeze: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sb_internal); lock(fs_reclaim); lock(sb_internal); lock(fs_reclaim); *** DEADLOCK *** This patch series works around this problem by adding a PF_MEMALLOC_NOLOCKDEP flag and set during filesystem freeze to avoid the lockdep splat. Waiman Long (2): sched: Add PF_MEMALLOC_NOLOCKDEP flag xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim fs/xfs/xfs_log.c | 9 +++++++++ fs/xfs/xfs_trans.c | 30 ++++++++++++++++++++++++++---- include/linux/sched.h | 7 +++++++ include/linux/sched/mm.h | 15 ++++++++++----- 4 files changed, 52 insertions(+), 9 deletions(-) -- 2.18.1 ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 1/2] sched: Add PF_MEMALLOC_NOLOCKDEP flag 2020-06-15 16:08 [PATCH 0/2] sched, xfs: Add PF_MEMALLOC_NOLOCKDEP to fix lockdep problem in xfs Waiman Long @ 2020-06-15 16:08 ` Waiman Long 2020-06-15 16:08 ` [PATCH 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim Waiman Long 1 sibling, 0 replies; 7+ messages in thread From: Waiman Long @ 2020-06-15 16:08 UTC (permalink / raw) To: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot Cc: linux-xfs, linux-kernel, Dave Chinner, Qian Cai, Eric Sandeen, Andrew Morton, Waiman Long There are cases where calling kmalloc() can lead to false positive lockdep splat. One notable example that can happen in the freezing of the xfs filesystem is as follows: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sb_internal); lock(fs_reclaim); lock(sb_internal); lock(fs_reclaim); *** DEADLOCK *** This is a false positive as all the dirty pages are flushed out before the filesystem can be frozen. However, there is no easy way to modify lockdep to handle this situation properly. One possible workaround is to disable lockdep by setting __GFP_NOLOCKDEP in the appropriate kmalloc() calls. However, it will be cumbersome to locate all the right kmalloc() calls to insert __GFP_NOLOCKDEP and it is easy to miss some especially when the code is updated in the future. Another alternative is to have a per-process global state that indicates the equivalent of __GFP_NOLOCKDEP without the need to set the gfp_t flag individually. To allow the latter case, a new PF_MEMALLOC_NOLOCKDEP per-process flag is now added. After adding this new bit, there are still 2 free bits left. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Waiman Long <longman@redhat.com> --- include/linux/sched.h | 7 +++++++ include/linux/sched/mm.h | 15 ++++++++++----- 2 files changed, 17 insertions(+), 5 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index b62e6aaf28f0..44247cbc9073 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1508,6 +1508,7 @@ extern struct pid *cad_pid; #define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */ #define PF_LOCAL_THROTTLE 0x00100000 /* Throttle writes only against the bdi I write to, * I am cleaning dirty pages from some other bdi. */ +#define __PF_MEMALLOC_NOLOCKDEP 0x00100000 /* All allocation requests will inherit __GFP_NOLOCKDEP */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ @@ -1519,6 +1520,12 @@ extern struct pid *cad_pid; #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */ #define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */ +#ifdef CONFIG_LOCKDEP +#define PF_MEMALLOC_NOLOCKDEP __PF_MEMALLOC_NOLOCKDEP +#else +#define PF_MEMALLOC_NOLOCKDEP 0 +#endif + /* * Only the _current_ task can read/write to tsk->flags, but other * tasks can access tsk->flags in readonly mode for example diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 480a4d1b7dd8..4a076a148568 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -177,22 +177,27 @@ static inline bool in_vfork(struct task_struct *tsk) * Applies per-task gfp context to the given allocation flags. * PF_MEMALLOC_NOIO implies GFP_NOIO * PF_MEMALLOC_NOFS implies GFP_NOFS + * PF_MEMALLOC_NOLOCKDEP implies __GFP_NOLOCKDEP * PF_MEMALLOC_NOCMA implies no allocation from CMA region. */ static inline gfp_t current_gfp_context(gfp_t flags) { - if (unlikely(current->flags & - (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) { + unsigned int pflags = current->flags; + + if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | + PF_MEMALLOC_NOCMA | PF_MEMALLOC_NOLOCKDEP))) { /* * NOIO implies both NOIO and NOFS and it is a weaker context * so always make sure it makes precedence */ - if (current->flags & PF_MEMALLOC_NOIO) + if (pflags & PF_MEMALLOC_NOIO) flags &= ~(__GFP_IO | __GFP_FS); - else if (current->flags & PF_MEMALLOC_NOFS) + else if (pflags & PF_MEMALLOC_NOFS) flags &= ~__GFP_FS; + if (pflags & PF_MEMALLOC_NOLOCKDEP) + flags |= __GFP_NOLOCKDEP; #ifdef CONFIG_CMA - if (current->flags & PF_MEMALLOC_NOCMA) + if (pflags & PF_MEMALLOC_NOCMA) flags &= ~__GFP_MOVABLE; #endif } -- 2.18.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim 2020-06-15 16:08 [PATCH 0/2] sched, xfs: Add PF_MEMALLOC_NOLOCKDEP to fix lockdep problem in xfs Waiman Long 2020-06-15 16:08 ` [PATCH 1/2] sched: Add PF_MEMALLOC_NOLOCKDEP flag Waiman Long @ 2020-06-15 16:08 ` Waiman Long 2020-06-15 16:43 ` Darrick J. Wong 1 sibling, 1 reply; 7+ messages in thread From: Waiman Long @ 2020-06-15 16:08 UTC (permalink / raw) To: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot Cc: linux-xfs, linux-kernel, Dave Chinner, Qian Cai, Eric Sandeen, Andrew Morton, Waiman Long Depending on the workloads, the following circular locking dependency warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo lock) may show up: ====================================================== WARNING: possible circular locking dependency detected 5.0.0-rc1+ #60 Tainted: G W ------------------------------------------------------ fsfreeze/4346 is trying to acquire lock: 0000000026f1d784 (fs_reclaim){+.+.}, at: fs_reclaim_acquire.part.19+0x5/0x30 but task is already holding lock: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 which lock already depends on the new lock. : Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sb_internal); lock(fs_reclaim); lock(sb_internal); lock(fs_reclaim); *** DEADLOCK *** 4 locks held by fsfreeze/4346: #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650 #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290 #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650 #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 stack backtrace: Call Trace: dump_stack+0xe0/0x19a print_circular_bug.isra.10.cold.34+0x2f4/0x435 check_prev_add.constprop.19+0xca1/0x15f0 validate_chain.isra.14+0x11af/0x3b50 __lock_acquire+0x728/0x1200 lock_acquire+0x269/0x5a0 fs_reclaim_acquire.part.19+0x29/0x30 fs_reclaim_acquire+0x19/0x20 kmem_cache_alloc+0x3e/0x3f0 kmem_zone_alloc+0x79/0x150 xfs_trans_alloc+0xfa/0x9d0 xfs_sync_sb+0x86/0x170 xfs_log_sbcount+0x10f/0x140 xfs_quiesce_attr+0x134/0x270 xfs_fs_freeze+0x4a/0x70 freeze_super+0x1af/0x290 do_vfs_ioctl+0xedc/0x16c0 ksys_ioctl+0x41/0x80 __x64_sys_ioctl+0x73/0xa9 do_syscall_64+0x18f/0xd23 entry_SYSCALL_64_after_hwframe+0x49/0xbe This is a false positive as all the dirty pages are flushed out before the filesystem can be frozen. Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock may fix the issue. However, that will greatly complicate the logic and may not be worth it. Another way to fix it is to disable the taking of the fs_reclaim pseudo lock when in the freezing code path as a reclaim on the freezed filesystem is not possible. By using the newly introduced PF_MEMALLOC_NOLOCKDEP flag, lockdep checking is disabled in xfs_trans_alloc() if XFS_TRANS_NO_WRITECOUNT flag is set. In the freezing path, there is another path where memory allocation is being done without the XFS_TRANS_NO_WRITECOUNT flag: xfs_fs_freeze() => xfs_quiesce_attr() => xfs_log_quiesce() => xfs_log_unmount_write() => xlog_unmount_write() => xfs_log_reserve() => xlog_ticket_alloc() In this case, we just disable fs reclaim for this particular 600 bytes memory allocation. Without this patch, the command sequence below will show that the lock dependency chain sb_internal -> fs_reclaim exists. # fsfreeze -f /home # fsfreeze --unfreeze /home # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal After applying the patch, such sb_internal -> fs_reclaim lock dependency chain can no longer be found. Because of that, the locking dependency warning will not be shown. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Waiman Long <longman@redhat.com> --- fs/xfs/xfs_log.c | 9 +++++++++ fs/xfs/xfs_trans.c | 30 ++++++++++++++++++++++++++---- 2 files changed, 35 insertions(+), 4 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 00fda2e8e738..33244680d0d4 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -830,8 +830,17 @@ xlog_unmount_write( xfs_lsn_t lsn; uint flags = XLOG_UNMOUNT_TRANS; int error; + unsigned long pflags; + /* + * xfs_log_reserve() allocates memory. This can lead to fs reclaim + * which may conflicts with the unmount process. To avoid that, + * disable fs reclaim for this allocation. + */ + current_set_flags_nested(&pflags, PF_MEMALLOC_NOFS); error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0); + current_restore_flags_nested(&pflags, PF_MEMALLOC_NOFS); + if (error) goto out_err; diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 3c94e5ff4316..ddb10ad3f51f 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -255,7 +255,27 @@ xfs_trans_alloc( struct xfs_trans **tpp) { struct xfs_trans *tp; - int error; + int error = 0; + unsigned long pflags = -1; + + /* + * When XFS_TRANS_NO_WRITECOUNT is set, it means there are no dirty + * data pages in the filesystem at this point. So even if fs reclaim + * is being done, it won't happen to this filesystem. In this case, + * PF_MEMALLOC_NOLOCKDEP should be set to avoid false positive + * lockdep splat like: + * + * CPU0 CPU1 + * ---- ---- + * lock(sb_internal); + * lock(fs_reclaim); + * lock(sb_internal); + * lock(fs_reclaim); + * + * *** DEADLOCK *** + */ + if (PF_MEMALLOC_NOLOCKDEP && (flags & XFS_TRANS_NO_WRITECOUNT)) + current_set_flags_nested(&pflags, PF_MEMALLOC_NOLOCKDEP); /* * Allocate the handle before we do our freeze accounting and setting up @@ -284,13 +304,15 @@ xfs_trans_alloc( error = xfs_trans_reserve(tp, resp, blocks, rtextents); if (error) { xfs_trans_cancel(tp); - return error; + goto out; } trace_xfs_trans_alloc(tp, _RET_IP_); - *tpp = tp; - return 0; +out: + if (PF_MEMALLOC_NOLOCKDEP && (pflags != -1)) + current_restore_flags_nested(&pflags, PF_MEMALLOC_NOLOCKDEP); + return error; } /* -- 2.18.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim 2020-06-15 16:08 ` [PATCH 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim Waiman Long @ 2020-06-15 16:43 ` Darrick J. Wong 2020-06-15 20:53 ` Waiman Long 2020-06-15 23:28 ` Dave Chinner 0 siblings, 2 replies; 7+ messages in thread From: Darrick J. Wong @ 2020-06-15 16:43 UTC (permalink / raw) To: Waiman Long Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, linux-xfs, linux-kernel, Dave Chinner, Qian Cai, Eric Sandeen, Andrew Morton On Mon, Jun 15, 2020 at 12:08:30PM -0400, Waiman Long wrote: > Depending on the workloads, the following circular locking dependency > warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo > lock) may show up: > > ====================================================== > WARNING: possible circular locking dependency detected > 5.0.0-rc1+ #60 Tainted: G W > ------------------------------------------------------ > fsfreeze/4346 is trying to acquire lock: > 0000000026f1d784 (fs_reclaim){+.+.}, at: > fs_reclaim_acquire.part.19+0x5/0x30 > > but task is already holding lock: > 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 > > which lock already depends on the new lock. > : > Possible unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(sb_internal); > lock(fs_reclaim); > lock(sb_internal); > lock(fs_reclaim); > > *** DEADLOCK *** > > 4 locks held by fsfreeze/4346: > #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650 > #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290 > #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650 > #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 > > stack backtrace: > Call Trace: > dump_stack+0xe0/0x19a > print_circular_bug.isra.10.cold.34+0x2f4/0x435 > check_prev_add.constprop.19+0xca1/0x15f0 > validate_chain.isra.14+0x11af/0x3b50 > __lock_acquire+0x728/0x1200 > lock_acquire+0x269/0x5a0 > fs_reclaim_acquire.part.19+0x29/0x30 > fs_reclaim_acquire+0x19/0x20 > kmem_cache_alloc+0x3e/0x3f0 > kmem_zone_alloc+0x79/0x150 > xfs_trans_alloc+0xfa/0x9d0 > xfs_sync_sb+0x86/0x170 > xfs_log_sbcount+0x10f/0x140 > xfs_quiesce_attr+0x134/0x270 > xfs_fs_freeze+0x4a/0x70 > freeze_super+0x1af/0x290 > do_vfs_ioctl+0xedc/0x16c0 > ksys_ioctl+0x41/0x80 > __x64_sys_ioctl+0x73/0xa9 > do_syscall_64+0x18f/0xd23 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > This is a false positive as all the dirty pages are flushed out before > the filesystem can be frozen. > > Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock > may fix the issue. However, that will greatly complicate the logic and > may not be worth it. > > Another way to fix it is to disable the taking of the fs_reclaim > pseudo lock when in the freezing code path as a reclaim on the > freezed filesystem is not possible. By using the newly introduced > PF_MEMALLOC_NOLOCKDEP flag, lockdep checking is disabled in > xfs_trans_alloc() if XFS_TRANS_NO_WRITECOUNT flag is set. > > In the freezing path, there is another path where memory allocation > is being done without the XFS_TRANS_NO_WRITECOUNT flag: > > xfs_fs_freeze() > => xfs_quiesce_attr() > => xfs_log_quiesce() > => xfs_log_unmount_write() > => xlog_unmount_write() > => xfs_log_reserve() > => xlog_ticket_alloc() > > In this case, we just disable fs reclaim for this particular 600 bytes > memory allocation. > > Without this patch, the command sequence below will show that the lock > dependency chain sb_internal -> fs_reclaim exists. > > # fsfreeze -f /home > # fsfreeze --unfreeze /home > # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal > > After applying the patch, such sb_internal -> fs_reclaim lock dependency > chain can no longer be found. Because of that, the locking dependency > warning will not be shown. > > Suggested-by: Dave Chinner <david@fromorbit.com> > Signed-off-by: Waiman Long <longman@redhat.com> > --- > fs/xfs/xfs_log.c | 9 +++++++++ > fs/xfs/xfs_trans.c | 30 ++++++++++++++++++++++++++---- > 2 files changed, 35 insertions(+), 4 deletions(-) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index 00fda2e8e738..33244680d0d4 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -830,8 +830,17 @@ xlog_unmount_write( > xfs_lsn_t lsn; > uint flags = XLOG_UNMOUNT_TRANS; > int error; > + unsigned long pflags; > > + /* > + * xfs_log_reserve() allocates memory. This can lead to fs reclaim > + * which may conflicts with the unmount process. To avoid that, > + * disable fs reclaim for this allocation. > + */ > + current_set_flags_nested(&pflags, PF_MEMALLOC_NOFS); > error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0); > + current_restore_flags_nested(&pflags, PF_MEMALLOC_NOFS); > + > if (error) > goto out_err; > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > index 3c94e5ff4316..ddb10ad3f51f 100644 > --- a/fs/xfs/xfs_trans.c > +++ b/fs/xfs/xfs_trans.c > @@ -255,7 +255,27 @@ xfs_trans_alloc( > struct xfs_trans **tpp) > { > struct xfs_trans *tp; > - int error; > + int error = 0; > + unsigned long pflags = -1; > + > + /* > + * When XFS_TRANS_NO_WRITECOUNT is set, it means there are no dirty > + * data pages in the filesystem at this point. That's not true. Look at the other callers of xfs_trans_alloc_empty. Also: Why not set PF_MEMALLOC_NOFS at the start of the freeze call chain? --D > + * So even if fs reclaim > + * is being done, it won't happen to this filesystem. In this case, > + * PF_MEMALLOC_NOLOCKDEP should be set to avoid false positive > + * lockdep splat like: > + * > + * CPU0 CPU1 > + * ---- ---- > + * lock(sb_internal); > + * lock(fs_reclaim); > + * lock(sb_internal); > + * lock(fs_reclaim); > + * > + * *** DEADLOCK *** > + */ > + if (PF_MEMALLOC_NOLOCKDEP && (flags & XFS_TRANS_NO_WRITECOUNT)) > + current_set_flags_nested(&pflags, PF_MEMALLOC_NOLOCKDEP); > > /* > * Allocate the handle before we do our freeze accounting and setting up > @@ -284,13 +304,15 @@ xfs_trans_alloc( > error = xfs_trans_reserve(tp, resp, blocks, rtextents); > if (error) { > xfs_trans_cancel(tp); > - return error; > + goto out; > } > > trace_xfs_trans_alloc(tp, _RET_IP_); > - > *tpp = tp; > - return 0; > +out: > + if (PF_MEMALLOC_NOLOCKDEP && (pflags != -1)) > + current_restore_flags_nested(&pflags, PF_MEMALLOC_NOLOCKDEP); > + return error; > } > > /* > -- > 2.18.1 > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim 2020-06-15 16:43 ` Darrick J. Wong @ 2020-06-15 20:53 ` Waiman Long 2020-06-16 16:29 ` Darrick J. Wong 2020-06-15 23:28 ` Dave Chinner 1 sibling, 1 reply; 7+ messages in thread From: Waiman Long @ 2020-06-15 20:53 UTC (permalink / raw) To: Darrick J. Wong Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, linux-xfs, linux-kernel, Dave Chinner, Qian Cai, Eric Sandeen, Andrew Morton On 6/15/20 12:43 PM, Darrick J. Wong wrote: > On Mon, Jun 15, 2020 at 12:08:30PM -0400, Waiman Long wrote: >> Depending on the workloads, the following circular locking dependency >> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo >> lock) may show up: >> >> ====================================================== >> WARNING: possible circular locking dependency detected >> 5.0.0-rc1+ #60 Tainted: G W >> ------------------------------------------------------ >> fsfreeze/4346 is trying to acquire lock: >> 0000000026f1d784 (fs_reclaim){+.+.}, at: >> fs_reclaim_acquire.part.19+0x5/0x30 >> >> but task is already holding lock: >> 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 >> >> which lock already depends on the new lock. >> : >> Possible unsafe locking scenario: >> >> CPU0 CPU1 >> ---- ---- >> lock(sb_internal); >> lock(fs_reclaim); >> lock(sb_internal); >> lock(fs_reclaim); >> >> *** DEADLOCK *** >> >> 4 locks held by fsfreeze/4346: >> #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650 >> #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290 >> #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650 >> #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 >> >> stack backtrace: >> Call Trace: >> dump_stack+0xe0/0x19a >> print_circular_bug.isra.10.cold.34+0x2f4/0x435 >> check_prev_add.constprop.19+0xca1/0x15f0 >> validate_chain.isra.14+0x11af/0x3b50 >> __lock_acquire+0x728/0x1200 >> lock_acquire+0x269/0x5a0 >> fs_reclaim_acquire.part.19+0x29/0x30 >> fs_reclaim_acquire+0x19/0x20 >> kmem_cache_alloc+0x3e/0x3f0 >> kmem_zone_alloc+0x79/0x150 >> xfs_trans_alloc+0xfa/0x9d0 >> xfs_sync_sb+0x86/0x170 >> xfs_log_sbcount+0x10f/0x140 >> xfs_quiesce_attr+0x134/0x270 >> xfs_fs_freeze+0x4a/0x70 >> freeze_super+0x1af/0x290 >> do_vfs_ioctl+0xedc/0x16c0 >> ksys_ioctl+0x41/0x80 >> __x64_sys_ioctl+0x73/0xa9 >> do_syscall_64+0x18f/0xd23 >> entry_SYSCALL_64_after_hwframe+0x49/0xbe >> >> This is a false positive as all the dirty pages are flushed out before >> the filesystem can be frozen. >> >> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock >> may fix the issue. However, that will greatly complicate the logic and >> may not be worth it. >> >> Another way to fix it is to disable the taking of the fs_reclaim >> pseudo lock when in the freezing code path as a reclaim on the >> freezed filesystem is not possible. By using the newly introduced >> PF_MEMALLOC_NOLOCKDEP flag, lockdep checking is disabled in >> xfs_trans_alloc() if XFS_TRANS_NO_WRITECOUNT flag is set. >> >> In the freezing path, there is another path where memory allocation >> is being done without the XFS_TRANS_NO_WRITECOUNT flag: >> >> xfs_fs_freeze() >> => xfs_quiesce_attr() >> => xfs_log_quiesce() >> => xfs_log_unmount_write() >> => xlog_unmount_write() >> => xfs_log_reserve() >> => xlog_ticket_alloc() >> >> In this case, we just disable fs reclaim for this particular 600 bytes >> memory allocation. >> >> Without this patch, the command sequence below will show that the lock >> dependency chain sb_internal -> fs_reclaim exists. >> >> # fsfreeze -f /home >> # fsfreeze --unfreeze /home >> # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal >> >> After applying the patch, such sb_internal -> fs_reclaim lock dependency >> chain can no longer be found. Because of that, the locking dependency >> warning will not be shown. >> >> Suggested-by: Dave Chinner <david@fromorbit.com> >> Signed-off-by: Waiman Long <longman@redhat.com> >> --- >> fs/xfs/xfs_log.c | 9 +++++++++ >> fs/xfs/xfs_trans.c | 30 ++++++++++++++++++++++++++---- >> 2 files changed, 35 insertions(+), 4 deletions(-) >> >> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c >> index 00fda2e8e738..33244680d0d4 100644 >> --- a/fs/xfs/xfs_log.c >> +++ b/fs/xfs/xfs_log.c >> @@ -830,8 +830,17 @@ xlog_unmount_write( >> xfs_lsn_t lsn; >> uint flags = XLOG_UNMOUNT_TRANS; >> int error; >> + unsigned long pflags; >> >> + /* >> + * xfs_log_reserve() allocates memory. This can lead to fs reclaim >> + * which may conflicts with the unmount process. To avoid that, >> + * disable fs reclaim for this allocation. >> + */ >> + current_set_flags_nested(&pflags, PF_MEMALLOC_NOFS); >> error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0); >> + current_restore_flags_nested(&pflags, PF_MEMALLOC_NOFS); >> + >> if (error) >> goto out_err; >> >> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c >> index 3c94e5ff4316..ddb10ad3f51f 100644 >> --- a/fs/xfs/xfs_trans.c >> +++ b/fs/xfs/xfs_trans.c >> @@ -255,7 +255,27 @@ xfs_trans_alloc( >> struct xfs_trans **tpp) >> { >> struct xfs_trans *tp; >> - int error; >> + int error = 0; >> + unsigned long pflags = -1; >> + >> + /* >> + * When XFS_TRANS_NO_WRITECOUNT is set, it means there are no dirty >> + * data pages in the filesystem at this point. > That's not true. Look at the other callers of xfs_trans_alloc_empty. Yes, I am aware of that. I can change it to check the freeze state. > > Also: Why not set PF_MEMALLOC_NOFS at the start of the freeze call > chain? I guess we can do that, but it eliminates a potential source for memory reclaim leading to freeze error when not much free memory is left. We can go this route if you think this is not a problem. Cheers, Longman ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim 2020-06-15 20:53 ` Waiman Long @ 2020-06-16 16:29 ` Darrick J. Wong 0 siblings, 0 replies; 7+ messages in thread From: Darrick J. Wong @ 2020-06-16 16:29 UTC (permalink / raw) To: Waiman Long Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, linux-xfs, linux-kernel, Dave Chinner, Qian Cai, Eric Sandeen, Andrew Morton On Mon, Jun 15, 2020 at 04:53:38PM -0400, Waiman Long wrote: > On 6/15/20 12:43 PM, Darrick J. Wong wrote: > > On Mon, Jun 15, 2020 at 12:08:30PM -0400, Waiman Long wrote: > > > Depending on the workloads, the following circular locking dependency > > > warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo > > > lock) may show up: > > > > > > ====================================================== > > > WARNING: possible circular locking dependency detected > > > 5.0.0-rc1+ #60 Tainted: G W > > > ------------------------------------------------------ > > > fsfreeze/4346 is trying to acquire lock: > > > 0000000026f1d784 (fs_reclaim){+.+.}, at: > > > fs_reclaim_acquire.part.19+0x5/0x30 > > > > > > but task is already holding lock: > > > 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 > > > > > > which lock already depends on the new lock. > > > : > > > Possible unsafe locking scenario: > > > > > > CPU0 CPU1 > > > ---- ---- > > > lock(sb_internal); > > > lock(fs_reclaim); > > > lock(sb_internal); > > > lock(fs_reclaim); > > > > > > *** DEADLOCK *** > > > > > > 4 locks held by fsfreeze/4346: > > > #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650 > > > #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290 > > > #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650 > > > #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 > > > > > > stack backtrace: > > > Call Trace: > > > dump_stack+0xe0/0x19a > > > print_circular_bug.isra.10.cold.34+0x2f4/0x435 > > > check_prev_add.constprop.19+0xca1/0x15f0 > > > validate_chain.isra.14+0x11af/0x3b50 > > > __lock_acquire+0x728/0x1200 > > > lock_acquire+0x269/0x5a0 > > > fs_reclaim_acquire.part.19+0x29/0x30 > > > fs_reclaim_acquire+0x19/0x20 > > > kmem_cache_alloc+0x3e/0x3f0 > > > kmem_zone_alloc+0x79/0x150 > > > xfs_trans_alloc+0xfa/0x9d0 > > > xfs_sync_sb+0x86/0x170 > > > xfs_log_sbcount+0x10f/0x140 > > > xfs_quiesce_attr+0x134/0x270 > > > xfs_fs_freeze+0x4a/0x70 > > > freeze_super+0x1af/0x290 > > > do_vfs_ioctl+0xedc/0x16c0 > > > ksys_ioctl+0x41/0x80 > > > __x64_sys_ioctl+0x73/0xa9 > > > do_syscall_64+0x18f/0xd23 > > > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > > > > > This is a false positive as all the dirty pages are flushed out before > > > the filesystem can be frozen. > > > > > > Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock > > > may fix the issue. However, that will greatly complicate the logic and > > > may not be worth it. > > > > > > Another way to fix it is to disable the taking of the fs_reclaim > > > pseudo lock when in the freezing code path as a reclaim on the > > > freezed filesystem is not possible. By using the newly introduced > > > PF_MEMALLOC_NOLOCKDEP flag, lockdep checking is disabled in > > > xfs_trans_alloc() if XFS_TRANS_NO_WRITECOUNT flag is set. > > > > > > In the freezing path, there is another path where memory allocation > > > is being done without the XFS_TRANS_NO_WRITECOUNT flag: > > > > > > xfs_fs_freeze() > > > => xfs_quiesce_attr() > > > => xfs_log_quiesce() > > > => xfs_log_unmount_write() > > > => xlog_unmount_write() > > > => xfs_log_reserve() > > > => xlog_ticket_alloc() > > > > > > In this case, we just disable fs reclaim for this particular 600 bytes > > > memory allocation. > > > > > > Without this patch, the command sequence below will show that the lock > > > dependency chain sb_internal -> fs_reclaim exists. > > > > > > # fsfreeze -f /home > > > # fsfreeze --unfreeze /home > > > # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal > > > > > > After applying the patch, such sb_internal -> fs_reclaim lock dependency > > > chain can no longer be found. Because of that, the locking dependency > > > warning will not be shown. > > > > > > Suggested-by: Dave Chinner <david@fromorbit.com> > > > Signed-off-by: Waiman Long <longman@redhat.com> > > > --- > > > fs/xfs/xfs_log.c | 9 +++++++++ > > > fs/xfs/xfs_trans.c | 30 ++++++++++++++++++++++++++---- > > > 2 files changed, 35 insertions(+), 4 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > > > index 00fda2e8e738..33244680d0d4 100644 > > > --- a/fs/xfs/xfs_log.c > > > +++ b/fs/xfs/xfs_log.c > > > @@ -830,8 +830,17 @@ xlog_unmount_write( > > > xfs_lsn_t lsn; > > > uint flags = XLOG_UNMOUNT_TRANS; > > > int error; > > > + unsigned long pflags; > > > + /* > > > + * xfs_log_reserve() allocates memory. This can lead to fs reclaim > > > + * which may conflicts with the unmount process. To avoid that, > > > + * disable fs reclaim for this allocation. > > > + */ > > > + current_set_flags_nested(&pflags, PF_MEMALLOC_NOFS); > > > error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0); > > > + current_restore_flags_nested(&pflags, PF_MEMALLOC_NOFS); > > > + > > > if (error) > > > goto out_err; > > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > > > index 3c94e5ff4316..ddb10ad3f51f 100644 > > > --- a/fs/xfs/xfs_trans.c > > > +++ b/fs/xfs/xfs_trans.c > > > @@ -255,7 +255,27 @@ xfs_trans_alloc( > > > struct xfs_trans **tpp) > > > { > > > struct xfs_trans *tp; > > > - int error; > > > + int error = 0; > > > + unsigned long pflags = -1; > > > + > > > + /* > > > + * When XFS_TRANS_NO_WRITECOUNT is set, it means there are no dirty > > > + * data pages in the filesystem at this point. > > That's not true. Look at the other callers of xfs_trans_alloc_empty. > Yes, I am aware of that. I can change it to check the freeze state. <nod> > > Also: Why not set PF_MEMALLOC_NOFS at the start of the freeze call > > chain? > > I guess we can do that, but it eliminates a potential source for memory > reclaim leading to freeze error when not much free memory is left. We can go > this route if you think this is not a problem. <shrug> It sounds like you & Dave had already worked that out, so we can leave this as it is. I saw the untrue claim in the code comment and started asking more questions. ;) (Says me who has been checked out the last few days, not following the various lockdep shuttup patch threads...) --D > Cheers, > Longman > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim 2020-06-15 16:43 ` Darrick J. Wong 2020-06-15 20:53 ` Waiman Long @ 2020-06-15 23:28 ` Dave Chinner 1 sibling, 0 replies; 7+ messages in thread From: Dave Chinner @ 2020-06-15 23:28 UTC (permalink / raw) To: Darrick J. Wong Cc: Waiman Long, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, linux-xfs, linux-kernel, Qian Cai, Eric Sandeen, Andrew Morton On Mon, Jun 15, 2020 at 09:43:51AM -0700, Darrick J. Wong wrote: > Also: Why not set PF_MEMALLOC_NOFS at the start of the freeze call > chain? Because there's no guarantee that we are always going to do this final work in the freeze syscall context? i.e. the state is specific to the context in which we are running the transaction, not the task context it is running in... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2020-06-16 16:32 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-06-15 16:08 [PATCH 0/2] sched, xfs: Add PF_MEMALLOC_NOLOCKDEP to fix lockdep problem in xfs Waiman Long 2020-06-15 16:08 ` [PATCH 1/2] sched: Add PF_MEMALLOC_NOLOCKDEP flag Waiman Long 2020-06-15 16:08 ` [PATCH 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim Waiman Long 2020-06-15 16:43 ` Darrick J. Wong 2020-06-15 20:53 ` Waiman Long 2020-06-16 16:29 ` Darrick J. Wong 2020-06-15 23:28 ` Dave Chinner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).