All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/2] sched, xfs: Add PF_MEMALLOC_NOLOCKDEP to fix lockdep problem in xfs
@ 2020-06-17 17:53 Waiman Long
  2020-06-17 17:53 ` [PATCH v2 1/2] sched: Add PF_MEMALLOC_NOLOCKDEP flag Waiman Long
  2020-06-17 17:53 ` [PATCH v2 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim Waiman Long
  0 siblings, 2 replies; 11+ messages in thread
From: Waiman Long @ 2020-06-17 17:53 UTC (permalink / raw)
  To: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot
  Cc: linux-xfs, linux-kernel, Dave Chinner, Qian Cai, Eric Sandeen,
	Andrew Morton, Waiman Long

 v2:
  - Update patch to add the frozen flag check as the XFS_TRANS_NO_WRITECOUNT
    check alone is insufficient.

There is a false positive lockdep warning in how the xfs code handle
filesystem freeze:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sb_internal);
                               lock(fs_reclaim);
                               lock(sb_internal);
  lock(fs_reclaim);

 *** DEADLOCK ***

This patch series works around this problem by adding a
PF_MEMALLOC_NOLOCKDEP flag and set during filesystem freeze to avoid
the lockdep splat.

Waiman Long (2):
  sched: Add PF_MEMALLOC_NOLOCKDEP flag
  xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim

 fs/xfs/xfs_log.c         |  9 +++++++++
 fs/xfs/xfs_trans.c       | 31 +++++++++++++++++++++++++++----
 include/linux/sched.h    |  7 +++++++
 include/linux/sched/mm.h | 15 ++++++++++-----
 4 files changed, 53 insertions(+), 9 deletions(-)

-- 
2.18.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/2] sched: Add PF_MEMALLOC_NOLOCKDEP flag
  2020-06-17 17:53 [PATCH v2 0/2] sched, xfs: Add PF_MEMALLOC_NOLOCKDEP to fix lockdep problem in xfs Waiman Long
@ 2020-06-17 17:53 ` Waiman Long
  2020-06-18  0:01   ` Dave Chinner
  2020-06-17 17:53 ` [PATCH v2 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim Waiman Long
  1 sibling, 1 reply; 11+ messages in thread
From: Waiman Long @ 2020-06-17 17:53 UTC (permalink / raw)
  To: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot
  Cc: linux-xfs, linux-kernel, Dave Chinner, Qian Cai, Eric Sandeen,
	Andrew Morton, Waiman Long

There are cases where calling kmalloc() can lead to false positive
lockdep splat. One notable example that can happen in the freezing of
the xfs filesystem is as follows:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sb_internal);
                               lock(fs_reclaim);
                               lock(sb_internal);
  lock(fs_reclaim);

 *** DEADLOCK ***

This is a false positive as all the dirty pages are flushed out before
the filesystem can be frozen. However, there is no easy way to modify
lockdep to handle this situation properly.

One possible workaround is to disable lockdep by setting __GFP_NOLOCKDEP
in the appropriate kmalloc() calls.  However, it will be cumbersome to
locate all the right kmalloc() calls to insert __GFP_NOLOCKDEP and it
is easy to miss some especially when the code is updated in the future.

Another alternative is to have a per-process global state that indicates
the equivalent of __GFP_NOLOCKDEP without the need to set the gfp_t flag
individually. To allow the latter case, a new PF_MEMALLOC_NOLOCKDEP
per-process flag is now added. After adding this new bit, there are
still 2 free bits left.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sched.h    |  7 +++++++
 include/linux/sched/mm.h | 15 ++++++++++-----
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b62e6aaf28f0..44247cbc9073 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1508,6 +1508,7 @@ extern struct pid *cad_pid;
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LOCAL_THROTTLE	0x00100000	/* Throttle writes only against the bdi I write to,
 						 * I am cleaning dirty pages from some other bdi. */
+#define __PF_MEMALLOC_NOLOCKDEP	0x00100000	/* All allocation requests will inherit __GFP_NOLOCKDEP */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
@@ -1519,6 +1520,12 @@ extern struct pid *cad_pid;
 #define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
+#ifdef CONFIG_LOCKDEP
+#define PF_MEMALLOC_NOLOCKDEP	__PF_MEMALLOC_NOLOCKDEP
+#else
+#define PF_MEMALLOC_NOLOCKDEP	0
+#endif
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 480a4d1b7dd8..4a076a148568 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -177,22 +177,27 @@ static inline bool in_vfork(struct task_struct *tsk)
  * Applies per-task gfp context to the given allocation flags.
  * PF_MEMALLOC_NOIO implies GFP_NOIO
  * PF_MEMALLOC_NOFS implies GFP_NOFS
+ * PF_MEMALLOC_NOLOCKDEP implies __GFP_NOLOCKDEP
  * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
  */
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
-	if (unlikely(current->flags &
-		     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
+	unsigned int pflags = current->flags;
+
+	if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS |
+			       PF_MEMALLOC_NOCMA | PF_MEMALLOC_NOLOCKDEP))) {
 		/*
 		 * NOIO implies both NOIO and NOFS and it is a weaker context
 		 * so always make sure it makes precedence
 		 */
-		if (current->flags & PF_MEMALLOC_NOIO)
+		if (pflags & PF_MEMALLOC_NOIO)
 			flags &= ~(__GFP_IO | __GFP_FS);
-		else if (current->flags & PF_MEMALLOC_NOFS)
+		else if (pflags & PF_MEMALLOC_NOFS)
 			flags &= ~__GFP_FS;
+		if (pflags & PF_MEMALLOC_NOLOCKDEP)
+			flags |= __GFP_NOLOCKDEP;
 #ifdef CONFIG_CMA
-		if (current->flags & PF_MEMALLOC_NOCMA)
+		if (pflags & PF_MEMALLOC_NOCMA)
 			flags &= ~__GFP_MOVABLE;
 #endif
 	}
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim
  2020-06-17 17:53 [PATCH v2 0/2] sched, xfs: Add PF_MEMALLOC_NOLOCKDEP to fix lockdep problem in xfs Waiman Long
  2020-06-17 17:53 ` [PATCH v2 1/2] sched: Add PF_MEMALLOC_NOLOCKDEP flag Waiman Long
@ 2020-06-17 17:53 ` Waiman Long
  2020-06-18  0:45   ` Dave Chinner
  2020-06-19 13:21   ` Christoph Hellwig
  1 sibling, 2 replies; 11+ messages in thread
From: Waiman Long @ 2020-06-17 17:53 UTC (permalink / raw)
  To: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot
  Cc: linux-xfs, linux-kernel, Dave Chinner, Qian Cai, Eric Sandeen,
	Andrew Morton, Waiman Long

Depending on the workloads, the following circular locking dependency
warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
lock) may show up:

======================================================
WARNING: possible circular locking dependency detected
5.0.0-rc1+ #60 Tainted: G        W
------------------------------------------------------
fsfreeze/4346 is trying to acquire lock:
0000000026f1d784 (fs_reclaim){+.+.}, at:
fs_reclaim_acquire.part.19+0x5/0x30

but task is already holding lock:
0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650

which lock already depends on the new lock.
  :
 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sb_internal);
                               lock(fs_reclaim);
                               lock(sb_internal);
  lock(fs_reclaim);

 *** DEADLOCK ***

4 locks held by fsfreeze/4346:
 #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
 #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
 #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
 #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650

stack backtrace:
Call Trace:
 dump_stack+0xe0/0x19a
 print_circular_bug.isra.10.cold.34+0x2f4/0x435
 check_prev_add.constprop.19+0xca1/0x15f0
 validate_chain.isra.14+0x11af/0x3b50
 __lock_acquire+0x728/0x1200
 lock_acquire+0x269/0x5a0
 fs_reclaim_acquire.part.19+0x29/0x30
 fs_reclaim_acquire+0x19/0x20
 kmem_cache_alloc+0x3e/0x3f0
 kmem_zone_alloc+0x79/0x150
 xfs_trans_alloc+0xfa/0x9d0
 xfs_sync_sb+0x86/0x170
 xfs_log_sbcount+0x10f/0x140
 xfs_quiesce_attr+0x134/0x270
 xfs_fs_freeze+0x4a/0x70
 freeze_super+0x1af/0x290
 do_vfs_ioctl+0xedc/0x16c0
 ksys_ioctl+0x41/0x80
 __x64_sys_ioctl+0x73/0xa9
 do_syscall_64+0x18f/0xd23
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is a false positive as all the dirty pages are flushed out before
the filesystem can be frozen.

Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
may fix the issue. However, that will greatly complicate the logic and
may not be worth it.

Another way to fix it is to disable the taking of the fs_reclaim
pseudo lock when in the freezing code path as a reclaim on the
freezed filesystem is not possible. By using the newly introduced
PF_MEMALLOC_NOLOCKDEP flag, lockdep checking is disabled in
xfs_trans_alloc() if both XFS_TRANS_NO_WRITECOUNT flag and the
frozen flag are set.

In the freezing path, there is another path where memory allocation
is being done without the XFS_TRANS_NO_WRITECOUNT flag:

  xfs_fs_freeze()
  => xfs_quiesce_attr()
     => xfs_log_quiesce()
        => xfs_log_unmount_write()
           => xlog_unmount_write()
              => xfs_log_reserve()
	         => xlog_ticket_alloc()

In this case, we just disable fs reclaim for this particular 600 bytes
memory allocation.

Without this patch, the command sequence below will show that the lock
dependency chain sb_internal -> fs_reclaim exists.

 # fsfreeze -f /home
 # fsfreeze --unfreeze /home
 # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal

After applying the patch, such sb_internal -> fs_reclaim lock dependency
chain can no longer be found. Because of that, the locking dependency
warning will not be shown.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/xfs/xfs_log.c   |  9 +++++++++
 fs/xfs/xfs_trans.c | 31 +++++++++++++++++++++++++++----
 2 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 00fda2e8e738..33244680d0d4 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -830,8 +830,17 @@ xlog_unmount_write(
 	xfs_lsn_t		lsn;
 	uint			flags = XLOG_UNMOUNT_TRANS;
 	int			error;
+	unsigned long		pflags;
 
+	/*
+	 * xfs_log_reserve() allocates memory. This can lead to fs reclaim
+	 * which may conflicts with the unmount process. To avoid that,
+	 * disable fs reclaim for this allocation.
+	 */
+	current_set_flags_nested(&pflags, PF_MEMALLOC_NOFS);
 	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
+	current_restore_flags_nested(&pflags, PF_MEMALLOC_NOFS);
+
 	if (error)
 		goto out_err;
 
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 3c94e5ff4316..ecb46939b276 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -255,7 +255,28 @@ xfs_trans_alloc(
 	struct xfs_trans	**tpp)
 {
 	struct xfs_trans	*tp;
-	int			error;
+	int			error = 0;
+	unsigned long		pflags = -1;
+
+	/*
+	 * When both XFS_TRANS_NO_WRITECOUNT and the frozen flag are set,
+	 * it means there are no dirty data pages in the filesystem at this
+	 * point. So even if fs reclaim is being done, it won't happen to
+	 * this filesystem. In this case, PF_MEMALLOC_NOLOCKDEP should be
+	 * set to avoid false positive lockdep splat like:
+	 *
+	 *       CPU0                    CPU1
+	 *       ----                    ----
+	 *  lock(sb_internal);
+	 *                               lock(fs_reclaim);
+	 *                               lock(sb_internal);
+	 *  lock(fs_reclaim);
+	 *
+	 *  *** DEADLOCK ***
+	 */
+	if (PF_MEMALLOC_NOLOCKDEP && (flags & XFS_TRANS_NO_WRITECOUNT) &&
+	    mp->m_super->s_writers.frozen)
+		current_set_flags_nested(&pflags, PF_MEMALLOC_NOLOCKDEP);
 
 	/*
 	 * Allocate the handle before we do our freeze accounting and setting up
@@ -284,13 +305,15 @@ xfs_trans_alloc(
 	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
 	if (error) {
 		xfs_trans_cancel(tp);
-		return error;
+		goto out;
 	}
 
 	trace_xfs_trans_alloc(tp, _RET_IP_);
-
 	*tpp = tp;
-	return 0;
+out:
+	if (PF_MEMALLOC_NOLOCKDEP && (pflags != -1))
+		current_restore_flags_nested(&pflags, PF_MEMALLOC_NOLOCKDEP);
+	return error;
 }
 
 /*
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] sched: Add PF_MEMALLOC_NOLOCKDEP flag
  2020-06-17 17:53 ` [PATCH v2 1/2] sched: Add PF_MEMALLOC_NOLOCKDEP flag Waiman Long
@ 2020-06-18  0:01   ` Dave Chinner
  2020-06-18  1:32     ` Waiman Long
  2020-06-22 19:16     ` Peter Zijlstra
  0 siblings, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2020-06-18  0:01 UTC (permalink / raw)
  To: Waiman Long
  Cc: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, linux-xfs, linux-kernel, Qian Cai, Eric Sandeen,
	Andrew Morton

On Wed, Jun 17, 2020 at 01:53:09PM -0400, Waiman Long wrote:
> There are cases where calling kmalloc() can lead to false positive
> lockdep splat. One notable example that can happen in the freezing of
> the xfs filesystem is as follows:
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(sb_internal);
>                                lock(fs_reclaim);
>                                lock(sb_internal);
>   lock(fs_reclaim);
> 
>  *** DEADLOCK ***
> 
> This is a false positive as all the dirty pages are flushed out before
> the filesystem can be frozen. However, there is no easy way to modify
> lockdep to handle this situation properly.
> 
> One possible workaround is to disable lockdep by setting __GFP_NOLOCKDEP
> in the appropriate kmalloc() calls.  However, it will be cumbersome to
> locate all the right kmalloc() calls to insert __GFP_NOLOCKDEP and it
> is easy to miss some especially when the code is updated in the future.
> 
> Another alternative is to have a per-process global state that indicates
> the equivalent of __GFP_NOLOCKDEP without the need to set the gfp_t flag
> individually. To allow the latter case, a new PF_MEMALLOC_NOLOCKDEP
> per-process flag is now added. After adding this new bit, there are
> still 2 free bits left.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  include/linux/sched.h    |  7 +++++++
>  include/linux/sched/mm.h | 15 ++++++++++-----
>  2 files changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b62e6aaf28f0..44247cbc9073 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1508,6 +1508,7 @@ extern struct pid *cad_pid;
>  #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
>  #define PF_LOCAL_THROTTLE	0x00100000	/* Throttle writes only against the bdi I write to,
>  						 * I am cleaning dirty pages from some other bdi. */
> +#define __PF_MEMALLOC_NOLOCKDEP	0x00100000	/* All allocation requests will inherit __GFP_NOLOCKDEP */

Why is this considered a safe thing to do? Any context that sets
__PF_MEMALLOC_NOLOCKDEP will now behave differently in memory
reclaim as it will think that PF_LOCAL_THROTTLE is set when lockdep
is enabled.

>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
>  #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
> @@ -1519,6 +1520,12 @@ extern struct pid *cad_pid;
>  #define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
>  #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
>  
> +#ifdef CONFIG_LOCKDEP
> +#define PF_MEMALLOC_NOLOCKDEP	__PF_MEMALLOC_NOLOCKDEP
> +#else
> +#define PF_MEMALLOC_NOLOCKDEP	0
> +#endif
> +
>  /*
>   * Only the _current_ task can read/write to tsk->flags, but other
>   * tasks can access tsk->flags in readonly mode for example
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 480a4d1b7dd8..4a076a148568 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -177,22 +177,27 @@ static inline bool in_vfork(struct task_struct *tsk)
>   * Applies per-task gfp context to the given allocation flags.
>   * PF_MEMALLOC_NOIO implies GFP_NOIO
>   * PF_MEMALLOC_NOFS implies GFP_NOFS
> + * PF_MEMALLOC_NOLOCKDEP implies __GFP_NOLOCKDEP
>   * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
>   */
>  static inline gfp_t current_gfp_context(gfp_t flags)
>  {
> -	if (unlikely(current->flags &
> -		     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
> +	unsigned int pflags = current->flags;
> +
> +	if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS |
> +			       PF_MEMALLOC_NOCMA | PF_MEMALLOC_NOLOCKDEP))) {

That needs a PF_MEMALLOC_MASK.

And, really, if we are playing "re-use existing bits" games because
we've run out of process flags, all these memalloc flags should be
moved to a new field in the task, say current->memalloc_flags. You
could also move PF_SWAPWRITE, PF_LOCAL_THROTTLE, and PF_KSWAPD into
that field as well as they are all memory allocation context process
flags...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim
  2020-06-17 17:53 ` [PATCH v2 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim Waiman Long
@ 2020-06-18  0:45   ` Dave Chinner
  2020-06-18  1:35     ` Waiman Long
  2020-06-18  1:36     ` Darrick J. Wong
  2020-06-19 13:21   ` Christoph Hellwig
  1 sibling, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2020-06-18  0:45 UTC (permalink / raw)
  To: Waiman Long
  Cc: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, linux-xfs, linux-kernel, Qian Cai, Eric Sandeen,
	Andrew Morton

On Wed, Jun 17, 2020 at 01:53:10PM -0400, Waiman Long wrote:
>  fs/xfs/xfs_log.c   |  9 +++++++++
>  fs/xfs/xfs_trans.c | 31 +++++++++++++++++++++++++++----
>  2 files changed, 36 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 00fda2e8e738..33244680d0d4 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -830,8 +830,17 @@ xlog_unmount_write(
>  	xfs_lsn_t		lsn;
>  	uint			flags = XLOG_UNMOUNT_TRANS;
>  	int			error;
> +	unsigned long		pflags;
>  
> +	/*
> +	 * xfs_log_reserve() allocates memory. This can lead to fs reclaim
> +	 * which may conflicts with the unmount process. To avoid that,
> +	 * disable fs reclaim for this allocation.
> +	 */
> +	current_set_flags_nested(&pflags, PF_MEMALLOC_NOFS);
>  	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
> +	current_restore_flags_nested(&pflags, PF_MEMALLOC_NOFS);
> +
>  	if (error)
>  		goto out_err;

The more I look at this, the more I think Darrick is right and I
somewhat misinterpretted what he meant by "the top of the freeze
path".

i.e. setting PF_MEMALLOC_NOFS here is out of place - only one caller
of xlog_unmount_write requires PF_MEMALLOC_NOFS
context. That context should be set in the caller that requires this
context, and in this case it is xfs_fs_freeze(). This is top of the
final freeze state processing (what I think Darrick meant), not the
top of the freeze syscall call chain (what I thought he meant).

So if set PF_MEMALLOC_NOFS setting in xfs_fs_freeze(), it covers all
the allocations in this problematic path, and it should obliviates
the need for the first patch in the series altogether.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] sched: Add PF_MEMALLOC_NOLOCKDEP flag
  2020-06-18  0:01   ` Dave Chinner
@ 2020-06-18  1:32     ` Waiman Long
  2020-06-22 19:16     ` Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: Waiman Long @ 2020-06-18  1:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, linux-xfs, linux-kernel, Qian Cai, Eric Sandeen,
	Andrew Morton

On 6/17/20 8:01 PM, Dave Chinner wrote:
> On Wed, Jun 17, 2020 at 01:53:09PM -0400, Waiman Long wrote:
>> There are cases where calling kmalloc() can lead to false positive
>> lockdep splat. One notable example that can happen in the freezing of
>> the xfs filesystem is as follows:
>>
>>   Possible unsafe locking scenario:
>>
>>         CPU0                    CPU1
>>         ----                    ----
>>    lock(sb_internal);
>>                                 lock(fs_reclaim);
>>                                 lock(sb_internal);
>>    lock(fs_reclaim);
>>
>>   *** DEADLOCK ***
>>
>> This is a false positive as all the dirty pages are flushed out before
>> the filesystem can be frozen. However, there is no easy way to modify
>> lockdep to handle this situation properly.
>>
>> One possible workaround is to disable lockdep by setting __GFP_NOLOCKDEP
>> in the appropriate kmalloc() calls.  However, it will be cumbersome to
>> locate all the right kmalloc() calls to insert __GFP_NOLOCKDEP and it
>> is easy to miss some especially when the code is updated in the future.
>>
>> Another alternative is to have a per-process global state that indicates
>> the equivalent of __GFP_NOLOCKDEP without the need to set the gfp_t flag
>> individually. To allow the latter case, a new PF_MEMALLOC_NOLOCKDEP
>> per-process flag is now added. After adding this new bit, there are
>> still 2 free bits left.
>>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   include/linux/sched.h    |  7 +++++++
>>   include/linux/sched/mm.h | 15 ++++++++++-----
>>   2 files changed, 17 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index b62e6aaf28f0..44247cbc9073 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1508,6 +1508,7 @@ extern struct pid *cad_pid;
>>   #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
>>   #define PF_LOCAL_THROTTLE	0x00100000	/* Throttle writes only against the bdi I write to,
>>   						 * I am cleaning dirty pages from some other bdi. */
>> +#define __PF_MEMALLOC_NOLOCKDEP	0x00100000	/* All allocation requests will inherit __GFP_NOLOCKDEP */
> Why is this considered a safe thing to do? Any context that sets
> __PF_MEMALLOC_NOLOCKDEP will now behave differently in memory
> reclaim as it will think that PF_LOCAL_THROTTLE is set when lockdep
> is enabled.

Oh, my mistake, it should be 0x01000000 which is not currently being 
used. Thank for catching that. I will repost a new version. I have no 
intention to reuse any existing bit. As said in the commit log, there 
are actually 2 more free bits left.


>
>>   #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>>   #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
>>   #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
>> @@ -1519,6 +1520,12 @@ extern struct pid *cad_pid;
>>   #define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
>>   #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
>>   
>> +#ifdef CONFIG_LOCKDEP
>> +#define PF_MEMALLOC_NOLOCKDEP	__PF_MEMALLOC_NOLOCKDEP
>> +#else
>> +#define PF_MEMALLOC_NOLOCKDEP	0
>> +#endif
>> +
>>   /*
>>    * Only the _current_ task can read/write to tsk->flags, but other
>>    * tasks can access tsk->flags in readonly mode for example
>> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
>> index 480a4d1b7dd8..4a076a148568 100644
>> --- a/include/linux/sched/mm.h
>> +++ b/include/linux/sched/mm.h
>> @@ -177,22 +177,27 @@ static inline bool in_vfork(struct task_struct *tsk)
>>    * Applies per-task gfp context to the given allocation flags.
>>    * PF_MEMALLOC_NOIO implies GFP_NOIO
>>    * PF_MEMALLOC_NOFS implies GFP_NOFS
>> + * PF_MEMALLOC_NOLOCKDEP implies __GFP_NOLOCKDEP
>>    * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
>>    */
>>   static inline gfp_t current_gfp_context(gfp_t flags)
>>   {
>> -	if (unlikely(current->flags &
>> -		     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
>> +	unsigned int pflags = current->flags;
>> +
>> +	if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS |
>> +			       PF_MEMALLOC_NOCMA | PF_MEMALLOC_NOLOCKDEP))) {
> That needs a PF_MEMALLOC_MASK.

Will add that in the next version.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim
  2020-06-18  0:45   ` Dave Chinner
@ 2020-06-18  1:35     ` Waiman Long
  2020-06-18  1:36     ` Darrick J. Wong
  1 sibling, 0 replies; 11+ messages in thread
From: Waiman Long @ 2020-06-18  1:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, linux-xfs, linux-kernel, Qian Cai, Eric Sandeen,
	Andrew Morton

On 6/17/20 8:45 PM, Dave Chinner wrote:
> On Wed, Jun 17, 2020 at 01:53:10PM -0400, Waiman Long wrote:
>>   fs/xfs/xfs_log.c   |  9 +++++++++
>>   fs/xfs/xfs_trans.c | 31 +++++++++++++++++++++++++++----
>>   2 files changed, 36 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
>> index 00fda2e8e738..33244680d0d4 100644
>> --- a/fs/xfs/xfs_log.c
>> +++ b/fs/xfs/xfs_log.c
>> @@ -830,8 +830,17 @@ xlog_unmount_write(
>>   	xfs_lsn_t		lsn;
>>   	uint			flags = XLOG_UNMOUNT_TRANS;
>>   	int			error;
>> +	unsigned long		pflags;
>>   
>> +	/*
>> +	 * xfs_log_reserve() allocates memory. This can lead to fs reclaim
>> +	 * which may conflicts with the unmount process. To avoid that,
>> +	 * disable fs reclaim for this allocation.
>> +	 */
>> +	current_set_flags_nested(&pflags, PF_MEMALLOC_NOFS);
>>   	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
>> +	current_restore_flags_nested(&pflags, PF_MEMALLOC_NOFS);
>> +
>>   	if (error)
>>   		goto out_err;
> The more I look at this, the more I think Darrick is right and I
> somewhat misinterpretted what he meant by "the top of the freeze
> path".
>
> i.e. setting PF_MEMALLOC_NOFS here is out of place - only one caller
> of xlog_unmount_write requires PF_MEMALLOC_NOFS
> context. That context should be set in the caller that requires this
> context, and in this case it is xfs_fs_freeze(). This is top of the
> final freeze state processing (what I think Darrick meant), not the
> top of the freeze syscall call chain (what I thought he meant).
>
> So if set PF_MEMALLOC_NOFS setting in xfs_fs_freeze(), it covers all
> the allocations in this problematic path, and it should obliviates
> the need for the first patch in the series altogether.
>
OK, I will try that and run my test. If it pass, I will post a new patch 
with the suggested change.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim
  2020-06-18  0:45   ` Dave Chinner
  2020-06-18  1:35     ` Waiman Long
@ 2020-06-18  1:36     ` Darrick J. Wong
  1 sibling, 0 replies; 11+ messages in thread
From: Darrick J. Wong @ 2020-06-18  1:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Waiman Long, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, linux-xfs, linux-kernel, Qian Cai, Eric Sandeen,
	Andrew Morton

On Thu, Jun 18, 2020 at 10:45:05AM +1000, Dave Chinner wrote:
> On Wed, Jun 17, 2020 at 01:53:10PM -0400, Waiman Long wrote:
> >  fs/xfs/xfs_log.c   |  9 +++++++++
> >  fs/xfs/xfs_trans.c | 31 +++++++++++++++++++++++++++----
> >  2 files changed, 36 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 00fda2e8e738..33244680d0d4 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -830,8 +830,17 @@ xlog_unmount_write(
> >  	xfs_lsn_t		lsn;
> >  	uint			flags = XLOG_UNMOUNT_TRANS;
> >  	int			error;
> > +	unsigned long		pflags;
> >  
> > +	/*
> > +	 * xfs_log_reserve() allocates memory. This can lead to fs reclaim
> > +	 * which may conflicts with the unmount process. To avoid that,
> > +	 * disable fs reclaim for this allocation.
> > +	 */
> > +	current_set_flags_nested(&pflags, PF_MEMALLOC_NOFS);
> >  	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
> > +	current_restore_flags_nested(&pflags, PF_MEMALLOC_NOFS);
> > +
> >  	if (error)
> >  		goto out_err;
> 
> The more I look at this, the more I think Darrick is right and I
> somewhat misinterpretted what he meant by "the top of the freeze
> path".
> 
> i.e. setting PF_MEMALLOC_NOFS here is out of place - only one caller
> of xlog_unmount_write requires PF_MEMALLOC_NOFS
> context. That context should be set in the caller that requires this
> context, and in this case it is xfs_fs_freeze(). This is top of the
> final freeze state processing (what I think Darrick meant), not the
> top of the freeze syscall call chain (what I thought he meant).

Aha!  Yes, that's exactly what I meant.  Sorry we all kinda muddled
around for a few days. :/

--D

> So if set PF_MEMALLOC_NOFS setting in xfs_fs_freeze(), it covers all
> the allocations in this problematic path, and it should obliviates
> the need for the first patch in the series altogether.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim
  2020-06-17 17:53 ` [PATCH v2 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim Waiman Long
  2020-06-18  0:45   ` Dave Chinner
@ 2020-06-19 13:21   ` Christoph Hellwig
  2020-06-19 15:08     ` Waiman Long
  1 sibling, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2020-06-19 13:21 UTC (permalink / raw)
  To: Waiman Long
  Cc: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, linux-xfs, linux-kernel, Dave Chinner, Qian Cai,
	Eric Sandeen, Andrew Morton

I find it really confusing that we record this in current->flags.
per-thread state makes total sense for not dipping into fs reclaim.
But for annotating something related to memory allocation passing flags
seems a lot more descriptive to me, as it is about particular locks.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim
  2020-06-19 13:21   ` Christoph Hellwig
@ 2020-06-19 15:08     ` Waiman Long
  0 siblings, 0 replies; 11+ messages in thread
From: Waiman Long @ 2020-06-19 15:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, linux-xfs, linux-kernel, Dave Chinner, Qian Cai,
	Eric Sandeen, Andrew Morton

On 6/19/20 9:21 AM, Christoph Hellwig wrote:
> I find it really confusing that we record this in current->flags.
> per-thread state makes total sense for not dipping into fs reclaim.
> But for annotating something related to memory allocation passing flags
> seems a lot more descriptive to me, as it is about particular locks.
>
I am dropping this patchset as just using PF_MEMALLOC_NOFS is good enough.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] sched: Add PF_MEMALLOC_NOLOCKDEP flag
  2020-06-18  0:01   ` Dave Chinner
  2020-06-18  1:32     ` Waiman Long
@ 2020-06-22 19:16     ` Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2020-06-22 19:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Waiman Long, Darrick J. Wong, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-xfs, linux-kernel, Qian Cai, Eric Sandeen,
	Andrew Morton

On Thu, Jun 18, 2020 at 10:01:10AM +1000, Dave Chinner wrote:

> And, really, if we are playing "re-use existing bits" games because
> we've run out of process flags, all these memalloc flags should be
> moved to a new field in the task, say current->memalloc_flags. You
> could also move PF_SWAPWRITE, PF_LOCAL_THROTTLE, and PF_KSWAPD into
> that field as well as they are all memory allocation context process
> flags...

FWIW

There's still 23 bits free after task_struct::in_memstall. That word has
'current only' semantics, just like PF.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-06-22 19:16 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-17 17:53 [PATCH v2 0/2] sched, xfs: Add PF_MEMALLOC_NOLOCKDEP to fix lockdep problem in xfs Waiman Long
2020-06-17 17:53 ` [PATCH v2 1/2] sched: Add PF_MEMALLOC_NOLOCKDEP flag Waiman Long
2020-06-18  0:01   ` Dave Chinner
2020-06-18  1:32     ` Waiman Long
2020-06-22 19:16     ` Peter Zijlstra
2020-06-17 17:53 ` [PATCH v2 2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim Waiman Long
2020-06-18  0:45   ` Dave Chinner
2020-06-18  1:35     ` Waiman Long
2020-06-18  1:36     ` Darrick J. Wong
2020-06-19 13:21   ` Christoph Hellwig
2020-06-19 15:08     ` Waiman Long

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.