* [PATCH 0/2] scope GFP_NOFS API
@ 2016-04-26 11:56 ` Michal Hocko
From: Michal Hocko @ 2016-04-26 11:56 UTC
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, Theodore Ts'o, Chris Mason,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

Hi,
we discussed this topic at LSF/MM this year and there was general
interest in a scoped GFP_NOFS allocation context among some FS
developers. For those who are not aware of the discussion or the issue
I am trying to sort out (or at least start in that direction), please
have a look at patch 1, which adds the memalloc_nofs_{save,restore}
API and basically copies what we already have for the scoped GFP_NOIO
allocation context. I haven't converted any of the filesystems myself
because that is way beyond my area of expertise, but I would be happy
to help with further changes on the MM front as well as in some more
generic code paths.
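
To give an idea of the intended usage, a converted code path could look
roughly like the sketch below (foo_fs_transaction() and foo_do_work()
are made-up names; the save/restore calls are the API from patch 1):

	static int foo_fs_transaction(struct foo_fs *fs)
	{
		unsigned int nofs_flags;
		int error;

		/* reclaim must not recurse into the FS from here on */
		nofs_flags = memalloc_nofs_save();
		error = foo_do_work(fs);	/* may use plain GFP_KERNEL */
		memalloc_nofs_restore(nofs_flags);

		return error;
	}

Every allocation between the save/restore pair inherits GFP_NOFS
implicitly, so the callees do not have to pass NOFS flags around.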

Dave had an idea on how to further improve the reclaim context and make
it less all-or-nothing wrt. GFP_NOFS. In short, he was suggesting an
opaque, FS-specific cookie set in the FS allocation context and consumed
by the FS reclaim context to allow some provably safe actions that would
normally be skipped due to GFP_NOFS. I like this idea and I believe we
can go in that direction regardless of the approach taken here. Many
filesystems simply need to clean up their NOFS usage first before diving
into more complex changes.
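
Purely as an illustration of that cookie idea (nothing like this exists
yet and all foo_* names are invented), the FS shrinker could consult a
cookie published by the allocating task instead of backing off on
GFP_NOFS entirely:

	struct foo_reclaim_cookie {
		struct foo_fs	*fs;
		u64		locked_ino;	/* inode held locked, if any */
	};

	static unsigned long foo_shrink_scan(struct shrinker *shrink,
					     struct shrink_control *sc)
	{
		/* the cookie would be set by the FS allocation context */
		struct foo_reclaim_cookie *cookie = foo_current_cookie();

		if (cookie)
			/* only touch objects independent of locked_ino */
			return foo_reclaim_safe_objects(cookie, sc);

		return foo_reclaim_all(shrink, sc);
	}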

Patch 2 is a debugging aid which warns about explicit allocation
requests from within the scope context. This should help reduce the
direct usage of the NOFS flags to a bare minimum in favor of the scope
API. It is not aimed to be merged upstream; I would hope Andrew takes it
into the mmotm tree to give it linux-next exposure and allow developers
to do further cleanups. There is a new kernel command line parameter
which has to be used for the debugging to be enabled.

I think GFP_NOIO should see the same cleanup.

Any feedback is highly appreciated.


* [PATCH 1/2] mm: add PF_MEMALLOC_NOFS
@ 2016-04-26 11:56   ` Michal Hocko
From: Michal Hocko @ 2016-04-26 11:56 UTC
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, Theodore Ts'o, Chris Mason,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

GFP_NOFS context is currently used for the following four reasons:
	- to prevent deadlocks when a lock held by the allocation
	  context would be needed during memory reclaim
	- to prevent stack overflows during reclaim because the
	  allocation is already performed from a deep context
	- to prevent lockups when the allocation context indirectly
	  depends on other reclaimers to make forward progress
	- just in case, because it would be safe from the fs POV

Unfortunately, overuse of this allocation context brings problems for
the MM: memory reclaim is much weaker (especially during heavy FS
metadata workloads), and the OOM killer cannot be invoked because the
MM layer doesn't have enough information about how much memory is
freeable by the FS layer.

In many cases it is far from clear why the weaker context is used at
all, so it may well be used unnecessarily. We would like to get rid of
such uses as much as possible. One way to do that is to use the flag
for whole scopes rather than for isolated allocations. Such a scope is
declared only when really necessary, is tracked per task, and all
allocation requests from within the context simply inherit the
GFP_NOFS semantic.

Not only is this easier to understand and maintain, because there are
far fewer problematic contexts than specific allocation requests, it
also helps code paths where the FS layer interacts with other layers
(e.g. crypto, security modules, MM, etc.) and there is no easy way to
convey the allocation context between the layers.
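
As an illustration of the cross-layer case (a sketch only; the foo_*
and crypto_do_work() names are invented), a single scope is enough to
restrict allocations made deep inside another layer that takes no
gfp_t parameter:

	static int foo_fs_encrypt_block(struct foo_fs *fs, void *buf, size_t len)
	{
		unsigned int nofs_flags;
		int error;

		nofs_flags = memalloc_nofs_save();
		/*
		 * Any GFP_KERNEL allocation inside crypto_do_work() now
		 * behaves as GFP_NOFS because the page allocator consults
		 * current->flags.
		 */
		error = crypto_do_work(buf, len);
		memalloc_nofs_restore(nofs_flags);

		return error;
	}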

Introduce a PF_MEMALLOC_NOFS task-specific flag and a
memalloc_nofs_{save,restore} API to control the scope. This basically
copies the memalloc_noio_{save,restore} API we have for the other
restricted allocation context, GFP_NOIO.
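
Like the NOIO variant, the pair nests safely because save returns the
previous state of the flag and restore puts it back:

	unsigned int outer = memalloc_nofs_save();
	/* ... */
	unsigned int inner = memalloc_nofs_save();	/* flag already set */
	/* ... */
	memalloc_nofs_restore(inner);	/* PF_MEMALLOC_NOFS stays set */
	/* ... */
	memalloc_nofs_restore(outer);	/* flag cleared again */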

Xfs already has similar functionality in PF_FSTRANS, so let's just give
it a more generic name, make it usable by others as well, and move the
GFP_NOFS context tracking to the page allocator. Xfs has its own
accessor functions but let's keep them for now to keep this patch
minimal.

This patch shouldn't introduce any functional changes; xfs code paths
preserve their semantics. kmem_flags_convert() doesn't need to evaluate
the flag anymore because it is now the page allocator that cares about
it. memalloc_noio_flags is renamed to current_gfp_context because it
now handles both the PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use properly documented
memalloc_nofs_{save,restore} checkpoints where appropriate.
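
A typical call site conversion would then be as simple as (hypothetical
example):

	/* before: an undocumented, call site specific restriction */
	ptr = kmalloc(size, GFP_NOFS);

	/* after: plain GFP_KERNEL inside a documented NOFS scope */
	ptr = kmalloc(size, GFP_KERNEL);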

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c             |  4 ++--
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/gfp.h       |  8 ++++++++
 include/linux/sched.h     | 32 ++++++++++++++++++++++++++------
 mm/page_alloc.c           |  8 +++++---
 mm/vmscan.c               |  4 ++--
 9 files changed, 54 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 686ba6fb20dd..73f6ab59c664 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		noio_flag = memalloc_noio_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		memalloc_noio_restore(noio_flag);
 
 	return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d1c66e465ca5..b35688a54c9a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+		if (flags & KM_NOFS)
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index a0eb18ce3ad3..326566f4a131 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2540,7 +2540,7 @@ xfs_btree_split_worker(
 	struct xfs_btree_split_args	*args = container_of(work,
 						struct xfs_btree_split_args, work);
 	unsigned long		pflags;
-	unsigned long		new_pflags = PF_FSTRANS;
+	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
 
 	/*
 	 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d12dfcfd0cc8..6d816ff0b763 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	return 0;
 }
 
@@ -169,7 +169,7 @@ xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
@@ -979,7 +979,7 @@ xfs_vm_writepage(
 	 * Given that we do not allow direct reclaim to call us, we should
 	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		goto redirty;
 
 	/* Is this page beyond the end of the file? */
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 748b16aff45a..1d247366c733 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -176,7 +176,7 @@ xfs_trans_reserve(
 	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
@@ -186,7 +186,7 @@ xfs_trans_reserve(
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
 		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
@@ -263,7 +263,7 @@ xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	return error;
 }
@@ -921,7 +921,7 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
 	/*
@@ -951,7 +951,7 @@ __xfs_trans_commit(
 		if (commit_lsn == -1 && !error)
 			error = -EIO;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
 	xfs_trans_free(tp);
 
@@ -1005,7 +1005,7 @@ xfs_trans_cancel(
 		xfs_log_done(mp, tp->t_ticket, NULL, false);
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
 	xfs_trans_free(tp);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..3ebdbdff44b4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index acfc32b30704..e9521dc0475f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2102,9 +2102,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC	0x00004000	/* used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
-#define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
-#define PF_KSWAPD	0x00040000	/* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
+#define PF_KSWAPD	0x00020000	/* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000	/* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -2140,13 +2140,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it takes precedence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }
 
@@ -2162,6 +2170,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bb8f4ec0b..86bb5d6ddd7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3801,10 +3801,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/*
-	 * Runtime PM, block IO and its error handling path can deadlock
-	 * because I/O on the device might not complete.
+	 * Apply scoped allocation constraints. This is mainly about
+	 * GFP_NOFS resp. GFP_NOIO which have to be inherited by all
+	 * allocation requests from a particular context which has
+	 * been marked by memalloc_no{fs,io}_{save,restore}
 	 */
-	alloc_mask = memalloc_noio_flags(gfp_mask);
+	alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;
 
 	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..cfb74de1efa3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2808,7 +2808,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.nodemask = nodemask,
 		.priority = DEF_PRIORITY,
@@ -3656,7 +3656,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	struct reclaim_state reclaim_state;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.priority = ZONE_RECLAIM_PRIORITY,
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
-- 
2.8.0.rc3


+			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
@@ -263,7 +263,7 @@ xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	return error;
 }
@@ -921,7 +921,7 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
 	/*
@@ -951,7 +951,7 @@ __xfs_trans_commit(
 		if (commit_lsn == -1 && !error)
 			error = -EIO;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
 	xfs_trans_free(tp);
 
@@ -1005,7 +1005,7 @@ xfs_trans_cancel(
 		xfs_log_done(mp, tp->t_ticket, NULL, false);
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
 	xfs_trans_free(tp);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..3ebdbdff44b4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly; instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO, with a short explanation why. All allocation
+ *   requests will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly; instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which
+ *   cannot/shouldn't recurse into the FS layer, with a short
+ *   explanation why. All allocation requests will inherit GFP_NOFS
+ *   implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index acfc32b30704..e9521dc0475f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2102,9 +2102,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC	0x00004000	/* used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
-#define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
-#define PF_KSWAPD	0x00040000	/* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
+#define PF_KSWAPD	0x00020000	/* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000	/* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -2140,13 +2140,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it takes precedence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }
 
@@ -2162,6 +2170,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bb8f4ec0b..86bb5d6ddd7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3801,10 +3801,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/*
-	 * Runtime PM, block IO and its error handling path can deadlock
-	 * because I/O on the device might not complete.
+	 * Apply scoped allocation constraints. This is mainly about
+	 * GFP_NOFS resp. GFP_NOIO which have to be inherited by all
+	 * allocation requests from a context which has been marked by
+	 * memalloc_no{fs,io}_{save,restore}.
 	 */
-	alloc_mask = memalloc_noio_flags(gfp_mask);
+	alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;
 
 	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..cfb74de1efa3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2808,7 +2808,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.nodemask = nodemask,
 		.priority = DEF_PRIORITY,
@@ -3656,7 +3656,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	struct reclaim_state reclaim_state;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.priority = ZONE_RECLAIM_PRIORITY,
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
-- 
2.8.0.rc3


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context
  2016-04-26 11:56 ` Michal Hocko
                     ` (3 preceding siblings ...)
  (?)
@ 2016-04-26 11:56   ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-26 11:56 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, Theodore Ts'o, Chris Mason,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE

It is desirable to reduce direct GFP_NO{FS,IO} usage to a minimum and
prefer the scope usage defined by the memalloc_no{fs,io}_{save,restore}
API.

Let's help this process along and add a debugging tool to catch cases
where an explicit allocation request for GFP_NO{FS,IO} is made from
within a scope context. The printed stack trace should help to identify
the caller and evaluate whether it can be changed to use a wider
context or whether it is called from another potentially dangerous
context which needs scope protection as well.

The checks have to be enabled explicitly by the debug_scope_gfp kernel
command line parameter.
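
As an example, with debug_scope_gfp enabled a pattern like the
following would trigger the report (a minimal sketch; the allocation
site is made up):

	unsigned int flags = memalloc_nofs_save();
	...
	/* reported: the explicit NOFS is redundant inside the scope */
	page = alloc_page(GFP_NOFS);
	...
	memalloc_nofs_restore(flags);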

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 86bb5d6ddd7d..085d00280496 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3750,6 +3750,61 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
+static bool debug_scope_gfp;
+
+static int __init enable_debug_scope_gfp(char *unused)
+{
+	debug_scope_gfp = true;
+	return 0;
+}
+
+/*
+ * Print the stack trace if the given gfp_mask clears flags which are
+ * already cleared context-wide. Such a caller can remove the special
+ * flag clearing and rely on the context-wide mask.
+ */
+static inline void debug_scope_gfp_context(gfp_t gfp_mask)
+{
+	gfp_t restrict_mask;
+
+	if (likely(!debug_scope_gfp))
+		return;
+
+	/* both NOFS, NOIO are irrelevant when direct reclaim is disabled */
+	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
+		return;
+
+	if (current->flags & PF_MEMALLOC_NOIO)
+		restrict_mask = __GFP_IO;
+	else if ((current->flags & PF_MEMALLOC_NOFS) && (gfp_mask & __GFP_IO))
+		restrict_mask = __GFP_FS;
+	else
+		return;
+
+	if ((gfp_mask & restrict_mask) != restrict_mask) {
+		/*
+		 * If you see this warning then the code does:
+		 * memalloc_no{fs,io}_save()
+		 * ...
+		 *    foo()
+		 *      alloc_page(GFP_NO{FS,IO})
+		 * ...
+		 * memalloc_no{fs,io}_restore()
+		 *
+		 * The explicit GFP_NO{FS,IO} is unnecessary because the scope
+		 * gfp context will apply it to all allocation requests already.
+		 * If foo() is called from multiple contexts then make sure other
+		 * contexts are safe wrt. GFP_NO{FS,IO} semantic and either add
+		 * scope protection into particular paths or change the gfp mask
+		 * to GFP_KERNEL.
+		 */
+		pr_info("Unnecessarily specific gfp mask:%#x(%pGg) for the %s task wide context\n", gfp_mask, &gfp_mask,
+				(current->flags & PF_MEMALLOC_NOIO) ? "NOIO" : "NOFS");
+		dump_stack();
+	}
+}
+early_param("debug_scope_gfp", enable_debug_scope_gfp);
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -3796,6 +3851,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 				ac.nodemask);
 
 	/* First allocation attempt */
+	debug_scope_gfp_context(gfp_mask);
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
 	if (likely(page))
 		goto out;
-- 
2.8.0.rc3


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context
  2016-04-26 11:56   ` [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context Michal Hocko
  (?)
  (?)
@ 2016-04-26 22:58     ` Dave Chinner
  -1 siblings, 0 replies; 144+ messages in thread
From: Dave Chinner @ 2016-04-26 22:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Theodore Ts'o,
	Chris Mason, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, xfs, linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

On Tue, Apr 26, 2016 at 01:56:12PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE
> 
> It is desirable to reduce direct GFP_NO{FS,IO} usage to a minimum and
> prefer the scope usage defined by the memalloc_no{fs,io}_{save,restore}
> API.
> 
> Let's help this process along and add a debugging tool to catch cases
> where an explicit allocation request for GFP_NO{FS,IO} is made from
> within a scope context. The printed stack trace should help to identify
> the caller and evaluate whether it can be changed to use a wider
> context or whether it is called from another potentially dangerous
> context which needs scope protection as well.

You're going to get a large number of these from XFS. There are call
paths in XFS that get called both inside and outside transaction
context, and many of them are marked with GFP_NOFS to prevent issues
that have cropped up in the past.

Often these are there to silence lockdep warnings (e.g. commit b17cb36
("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
lockdep gets very unhappy about the same functions being called with
different reclaim contexts. For example, directory block mapping might
occur from readdir (no transaction context) or within transactions
(create/unlink). Hence paths like this are tagged with GFP_NOFS to
stop lockdep from emitting false positive warnings.
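
The pattern here looks roughly like the following (a hypothetical
sketch; the helper name is made up):

	/*
	 * Called from readdir (no transaction context) as well as from
	 * create/unlink (inside a transaction).
	 */
	STATIC int
	xfs_dir_map_block(...)
	{
		/* KM_NOFS only to keep lockdep quiet about transactional callers */
		bp = kmem_alloc(len, KM_NOFS);
		...
	}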

Removing the GFP_NOFS flags in situations like this is simply going
to restart the flood of false positive lockdep warnings we've
silenced over the years, so perhaps lockdep needs to be made smarter
as well...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 1/2] mm: add PF_MEMALLOC_NOFS
  2016-04-26 11:56   ` Michal Hocko
  (?)
  (?)
@ 2016-04-26 23:07     ` Dave Chinner
  -1 siblings, 0 replies; 144+ messages in thread
From: Dave Chinner @ 2016-04-26 23:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Theodore Ts'o,
	Chris Mason, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, xfs, linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

On Tue, Apr 26, 2016 at 01:56:11PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> GFP_NOFS context is currently used for the following four reasons:
> 	- to prevent deadlocks when a lock held by the allocation
> 	  context would be needed during memory reclaim
> 	- to prevent stack overflows during reclaim because the
> 	  allocation is performed from an already deep call chain
> 	- to prevent lockups when the allocation context indirectly
> 	  depends on other reclaimers to make forward progress
> 	- "just in case" because it is assumed safe from the fs POV

- silencing lockdep false positives

> Introduce a PF_MEMALLOC_NOFS task-specific flag and a
> memalloc_nofs_{save,restore} API to control the scope. This basically
> copies the memalloc_noio_{save,restore} API we have for the other
> restricted allocation context, GFP_NOIO.
> 
> XFS has already had similar functionality in PF_FSTRANS, so let's just
> give it a more generic name, make it usable for others as well, and
> move the GFP_NOFS context tracking to the page allocator. XFS has its
> own accessor functions, but let's keep them for now to keep this patch
> to a minimum.

Can you split this into two patches? The first simply does this:

#define PF_MEMALLOC_NOFS PF_FSTRANS

and changes only the XFS code to use PF_MEMALLOC_NOFS.

The second patch can then do the rest of the mm API changes that we
don't actually care about in XFS at all.  That way I can carry all
the XFS changes in the XFS tree and not have to worry about when
this stuff gets merged or conflicts with the rest of the work that
is being done to the mm/ code and whatever tree that eventually
lands in...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 1/2] mm: add PF_MEMALLOC_NOFS
  2016-04-26 23:07     ` Dave Chinner
@ 2016-04-27  7:51       ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27  7:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Theodore Ts'o,
	Chris Mason, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, xfs, linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Wed 27-04-16 09:07:02, Dave Chinner wrote:
> On Tue, Apr 26, 2016 at 01:56:11PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > GFP_NOFS context is currently used for the following 4 reasons:
> > 	- to prevent deadlocks when a lock held by the allocation
> > 	  context would be needed during memory reclaim
> > 	- to prevent stack overflows during reclaim because the
> > 	  allocation is performed from an already deep context
> > 	- to prevent lockups when the allocation context indirectly
> > 	  depends on other reclaimers to make forward progress
> > 	- just in case, because this would be safe from the fs POV
> 
> - silencing lockdep false positives
> 
> > Introduce a PF_MEMALLOC_NOFS task-specific flag and a
> > memalloc_nofs_{save,restore} API to control the scope. This basically
> > copies the memalloc_noio_{save,restore} API we have for the other
> > restricted allocation context, GFP_NOIO.
> > 
> > XFS has already had similar functionality in PF_FSTRANS, so let's just
> > give it a more generic name, make it usable for others as well, and
> > move the GFP_NOFS context tracking to the page allocator. XFS has its
> > own accessor functions, but let's keep them for now to keep this patch
> > to a minimum.
> 
> Can you split this into two patches? The first simply does this:
> 
> #define PF_MEMALLOC_NOFS PF_FSTRANS
> 
> and changes only the XFS code to use PF_MEMALLOC_NOFS.
> 
> The second patch can then do the rest of the mm API changes that we
> don't actually care about in XFS at all.  That way I can carry all
> the XFS changes in the XFS tree and not have to worry about when
> this stuff gets merged or conflicts with the rest of the work that
> is being done to the mm/ code and whatever tree that eventually
> lands in...

Sure, I will do that.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context
  2016-04-26 22:58     ` Dave Chinner
@ 2016-04-27  8:03       ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27  8:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Theodore Ts'o,
	Chris Mason, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, xfs, linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Wed 27-04-16 08:58:45, Dave Chinner wrote:
> On Tue, Apr 26, 2016 at 01:56:12PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE
> > 
> > It is desirable to reduce direct GFP_NO{FS,IO} usage to a minimum and
> > prefer the scope usage defined by the memalloc_no{fs,io}_{save,restore} API.
> > 
> > Let's help this process and add a debugging tool to catch when an
> > explicit allocation request for GFP_NO{FS,IO} is done from the scope
> > context. The printed stacktrace should help to identify the caller
> > and evaluate whether it can be changed to use a wider context or whether
> > it is called from another potentially dangerous context which needs
> > scope protection as well.
> 
> You're going to get a large number of these from XFS. There are call
> paths in XFS that get called both inside and outside transaction
> context, and many of them are marked with GFP_NOFS to prevent issues
> that have cropped up in the past.
> 
> Often these are to silence lockdep warnings (e.g. commit b17cb36
> ("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
> lockdep gets very unhappy about the same functions being called with
> different reclaim contexts. For example, directory block mapping might
> occur from readdir (no transaction context) or within transactions
> (create/unlink). Hence paths like this are tagged with GFP_NOFS to
> stop lockdep from emitting false positive warnings.

I would much rather see lockdep fixed than GFP_NOFS abused to work
around its limitations. GFP_NOFS has real consequences for memory
reclaim. I will go and check the commit you mentioned and try to
understand why it is a problem. From what you described above, I would
like to get rid of exactly this kind of usage, which is not really
needed for recursion protection.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 1/2] mm: add PF_MEMALLOC_NOFS
  2016-04-26 11:56   ` Michal Hocko
@ 2016-04-27 10:53     ` Tetsuo Handa
  -1 siblings, 0 replies; 144+ messages in thread
From: Tetsuo Handa @ 2016-04-27 10:53 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Dave Chinner, Theodore Ts'o, Chris Mason,
	Jan Kara, LKML, Michal Hocko

On 2016/04/26 20:56, Michal Hocko wrote:
> Not only is this easier to understand and maintain because there are
> far fewer problematic contexts than specific allocation requests, it
> also helps code paths where the FS layer interacts with other layers
> (e.g. crypto, security modules, MM etc.) and there is no easy way to
> convey the allocation context between the layers.
> 

You arrived at what I wished for at
http://lkml.kernel.org/r/201503172305.DIH52162.FOFMFOVJHLOtQS@I-love.SAKURA.ne.jp
(i.e., not CONFIG_DEBUG_* but always enabled).

> Introduce a PF_MEMALLOC_NOFS task-specific flag and a
> memalloc_nofs_{save,restore} API to control the scope. This basically
> copies the memalloc_noio_{save,restore} API we have for the other
> restricted allocation context, GFP_NOIO.
> 
> XFS has already had similar functionality in PF_FSTRANS, so let's just
> give it a more generic name, make it usable for others as well, and
> move the GFP_NOFS context tracking to the page allocator. XFS has its
> own accessor functions, but let's keep them for now to keep this patch
> to a minimum.
> 
> This patch shouldn't introduce any functional changes. XFS code paths
> preserve their semantics. kmem_flags_convert() doesn't need to evaluate
> the flag anymore because it is now the page allocator that cares about
> the flag. memalloc_noio_flags is renamed to current_gfp_context because
> it now cares about both PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.
> 
> Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
> usage as much as possible and only use properly documented
> memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Is the story simple enough to mechanically replace GFP_NOFS/GFP_NOIO
with GFP_KERNEL once memalloc_no{fs,io}_{save,restore} are inserted?
We sometimes delegate operations to other threads. Don't we need to
convey the PF_MEMALLOC_NOFS/PF_MEMALLOC_NOIO flags to APIs which
interact with other threads?
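
To be concrete about the replacement question: is the intended end
state simply the following (a sketch, assuming the proposed API names)?

	unsigned int nofs_flags;
	void *ptr;

	nofs_flags = memalloc_nofs_save();
	/* every allocation in this scope is implicitly GFP_NOFS */
	ptr = kmalloc(size, GFP_KERNEL);
	/* ... */
	memalloc_nofs_restore(nofs_flags);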

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 1/2] mm: add PF_MEMALLOC_NOFS
  2016-04-27 10:53     ` Tetsuo Handa
@ 2016-04-27 11:15       ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27 11:15 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Andrew Morton, Dave Chinner, Theodore Ts'o,
	Chris Mason, Jan Kara, LKML

On Wed 27-04-16 19:53:21, Tetsuo Handa wrote:
[...]
> > Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
> > usage as much as possible and only use properly documented
> > memalloc_nofs_{save,restore} checkpoints where they are appropriate.
> 
> Is the story simple enough to mechanically replace GFP_NOFS/GFP_NOIO
> with GFP_KERNEL once memalloc_no{fs,io}_{save,restore} are inserted?
> We sometimes delegate operations to other threads. Don't we need to
> convey the PF_MEMALLOC_NOFS/PF_MEMALLOC_NOIO flags to APIs which
> interact with other threads?

We can add an API to do that if it is really needed.
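
For the delegation case, the pattern xfs_btree_split_worker already
uses would work: the worker explicitly re-applies the submitter's
scope flag around the delegated work. A sketch (the worker name is
illustrative):

	static void my_nofs_worker(struct work_struct *work)
	{
		unsigned long pflags;

		/* re-establish the submitting context's NOFS scope */
		current_set_flags_nested(&pflags, PF_MEMALLOC_NOFS);

		/* ... delegated work; allocations are implicitly NOFS ... */

		current_restore_flags_nested(&pflags, PF_MEMALLOC_NOFS);
	}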

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
  2016-04-26 11:56   ` Michal Hocko
@ 2016-04-27 11:54     ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27 11:54 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, Theodore Ts'o, Chris Mason,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

xfs defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite some
time ago. We would like to make this concept more generic and use it
for other filesystems as well. Let's start by giving the flag a more
generic name, PF_MEMALLOC_NOFS, which is in line with the existing
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. As a first step, replace all PF_FSTRANS usage in the xfs code
before we introduce a full API for it, since xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
as suggested by Dave, I have split up [1] into two parts. The first one
adds a new PF flag which is just an alias to the existing PF_FSTRANS
and does all the renaming, and the second one introduces the generic
API, which only changes the bare minimum in the xfs proper.
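
For reference, the generic API in the second part will look roughly
like the existing noio helpers (a sketch only, not the final
interface):

	static inline unsigned int memalloc_nofs_save(void)
	{
		unsigned int flags = current->flags & PF_MEMALLOC_NOFS;

		current->flags |= PF_MEMALLOC_NOFS;
		return flags;
	}

	static inline void memalloc_nofs_restore(unsigned int flags)
	{
		current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
	}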

 fs/xfs/kmem.c             |  4 ++--
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/sched.h     |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 686ba6fb20dd..73f6ab59c664 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		noio_flag = memalloc_noio_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		memalloc_noio_restore(noio_flag);
 
 	return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d1c66e465ca5..0d83f332e5c2 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index a0eb18ce3ad3..326566f4a131 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2540,7 +2540,7 @@ xfs_btree_split_worker(
 	struct xfs_btree_split_args	*args = container_of(work,
 						struct xfs_btree_split_args, work);
 	unsigned long		pflags;
-	unsigned long		new_pflags = PF_FSTRANS;
+	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
 
 	/*
 	 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d12dfcfd0cc8..6d816ff0b763 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	return 0;
 }
 
@@ -169,7 +169,7 @@ xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
@@ -979,7 +979,7 @@ xfs_vm_writepage(
 	 * Given that we do not allow direct reclaim to call us, we should
 	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		goto redirty;
 
 	/* Is this page beyond the end of the file? */
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 748b16aff45a..1d247366c733 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -176,7 +176,7 @@ xfs_trans_reserve(
 	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
@@ -186,7 +186,7 @@ xfs_trans_reserve(
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
 		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
@@ -263,7 +263,7 @@ xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	return error;
 }
@@ -921,7 +921,7 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
 	/*
@@ -951,7 +951,7 @@ __xfs_trans_commit(
 		if (commit_lsn == -1 && !error)
 			error = -EIO;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
 	xfs_trans_free(tp);
 
@@ -1005,7 +1005,7 @@ xfs_trans_cancel(
 		xfs_log_done(mp, tp->t_ticket, NULL, false);
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
 	xfs_trans_free(tp);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index acfc32b30704..820db8f98bfc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2115,6 +2115,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
+#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
-- 
2.8.0.rc3


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
@ 2016-04-27 11:54     ` Michal Hocko
  0 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27 11:54 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, Theodore Ts'o, Chris Mason,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more genric name PF_MEMALLOC_NOFS which is in line with an exiting
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
as suggested by Dave, I have split up [1] into two parts. The first one
addes a new PF flag which is just an alias to the existing PF_FSTRANS
and does all the renaming and the second one to introduce the generic
API which only changes the bare minimum in the xfs proper.

 fs/xfs/kmem.c             |  4 ++--
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/sched.h     |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 686ba6fb20dd..73f6ab59c664 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		noio_flag = memalloc_noio_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		memalloc_noio_restore(noio_flag);
 
 	return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d1c66e465ca5..0d83f332e5c2 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index a0eb18ce3ad3..326566f4a131 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2540,7 +2540,7 @@ xfs_btree_split_worker(
 	struct xfs_btree_split_args	*args = container_of(work,
 						struct xfs_btree_split_args, work);
 	unsigned long		pflags;
-	unsigned long		new_pflags = PF_FSTRANS;
+	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
 
 	/*
 	 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d12dfcfd0cc8..6d816ff0b763 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	return 0;
 }
 
@@ -169,7 +169,7 @@ xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
@@ -979,7 +979,7 @@ xfs_vm_writepage(
 	 * Given that we do not allow direct reclaim to call us, we should
 	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		goto redirty;
 
 	/* Is this page beyond the end of the file? */
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 748b16aff45a..1d247366c733 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -176,7 +176,7 @@ xfs_trans_reserve(
 	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
@@ -186,7 +186,7 @@ xfs_trans_reserve(
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
 		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
@@ -263,7 +263,7 @@ xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	return error;
 }
@@ -921,7 +921,7 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
 	/*
@@ -951,7 +951,7 @@ __xfs_trans_commit(
 		if (commit_lsn == -1 && !error)
 			error = -EIO;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
 	xfs_trans_free(tp);
 
@@ -1005,7 +1005,7 @@ xfs_trans_cancel(
 		xfs_log_done(mp, tp->t_ticket, NULL, false);
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
 	xfs_trans_free(tp);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index acfc32b30704..820db8f98bfc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2115,6 +2115,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
+#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
-- 
2.8.0.rc3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
@ 2016-04-27 11:54     ` Michal Hocko
  0 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27 11:54 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: linux-nfs, linux-ext4, Theodore Ts'o, linux-ntfs-dev, LKML,
	reiserfs-devel, linux-f2fs-devel, logfs, cluster-devel,
	Chris Mason, Michal Hocko, linux-mtd, Jan Kara, Andrew Morton,
	xfs, ceph-devel, linux-btrfs, linux-afs

From: Michal Hocko <mhocko@suse.com>

xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more genric name PF_MEMALLOC_NOFS which is in line with an exiting
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
as suggested by Dave, I have split up [1] into two parts. The first one
addes a new PF flag which is just an alias to the existing PF_FSTRANS
and does all the renaming and the second one to introduce the generic
API which only changes the bare minimum in the xfs proper.

 fs/xfs/kmem.c             |  4 ++--
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/sched.h     |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 686ba6fb20dd..73f6ab59c664 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		noio_flag = memalloc_noio_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		memalloc_noio_restore(noio_flag);
 
 	return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d1c66e465ca5..0d83f332e5c2 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index a0eb18ce3ad3..326566f4a131 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2540,7 +2540,7 @@ xfs_btree_split_worker(
 	struct xfs_btree_split_args	*args = container_of(work,
 						struct xfs_btree_split_args, work);
 	unsigned long		pflags;
-	unsigned long		new_pflags = PF_FSTRANS;
+	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
 
 	/*
 	 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d12dfcfd0cc8..6d816ff0b763 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	return 0;
 }
 
@@ -169,7 +169,7 @@ xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
@@ -979,7 +979,7 @@ xfs_vm_writepage(
 	 * Given that we do not allow direct reclaim to call us, we should
 	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		goto redirty;
 
 	/* Is this page beyond the end of the file? */
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 748b16aff45a..1d247366c733 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -176,7 +176,7 @@ xfs_trans_reserve(
 	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
@@ -186,7 +186,7 @@ xfs_trans_reserve(
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
 		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
@@ -263,7 +263,7 @@ xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	return error;
 }
@@ -921,7 +921,7 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
 	/*
@@ -951,7 +951,7 @@ __xfs_trans_commit(
 		if (commit_lsn == -1 && !error)
 			error = -EIO;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
 	xfs_trans_free(tp);
 
@@ -1005,7 +1005,7 @@ xfs_trans_cancel(
 		xfs_log_done(mp, tp->t_ticket, NULL, false);
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
 	xfs_trans_free(tp);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index acfc32b30704..820db8f98bfc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2115,6 +2115,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
+#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
-- 
2.8.0.rc3


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
@ 2016-04-27 11:54     ` Michal Hocko
  0 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27 11:54 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, Theodore Ts'o, Chris Mason,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

xfs defined PF_FSTRANS to declare a scoped GFP_NOFS semantic quite some
time ago. We would like to make this concept more generic and use it
for other filesystems as well. Let's start by giving the flag the more
generic name PF_MEMALLOC_NOFS, in line with the existing
PF_MEMALLOC_NOIO already used for the same purpose in GFP_NOIO
contexts. Replace all PF_FSTRANS usage in the xfs code as a first step,
before we introduce a full API for it, since xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
as suggested by Dave, I have split up [1] into two parts. The first one
adds a new PF flag which is just an alias for the existing PF_FSTRANS
and does all the renaming, and the second one introduces the generic
API, changing only the bare minimum in xfs proper.
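
For readers who don't know the helpers touched by the renaming, here is
a rough sketch (mine, not part of the patch) of the nested flag
save/restore semantics from include/linux/sched.h; the rename from
PF_FSTRANS to PF_MEMALLOC_NOFS keeps these semantics intact:

	/*
	 * Approximate semantics of the nested helpers: the save side
	 * remembers the previous task flags, the restore side puts back
	 * only the bits named in 'flags', so set/restore pairs can nest
	 * safely.
	 */
	#define current_set_flags_nested(flagsp, flags)		\
		do { *(flagsp) = current->flags;		\
		     current->flags |= (flags); } while (0)

	#define current_restore_flags_nested(flagsp, flags)	\
		do { current->flags &= ~(flags);		\
		     current->flags |= *(flagsp) & (flags); } while (0)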

 fs/xfs/kmem.c             |  4 ++--
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/sched.h     |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 686ba6fb20dd..73f6ab59c664 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		noio_flag = memalloc_noio_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		memalloc_noio_restore(noio_flag);
 
 	return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d1c66e465ca5..0d83f332e5c2 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index a0eb18ce3ad3..326566f4a131 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2540,7 +2540,7 @@ xfs_btree_split_worker(
 	struct xfs_btree_split_args	*args = container_of(work,
 						struct xfs_btree_split_args, work);
 	unsigned long		pflags;
-	unsigned long		new_pflags = PF_FSTRANS;
+	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
 
 	/*
 	 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d12dfcfd0cc8..6d816ff0b763 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	return 0;
 }
 
@@ -169,7 +169,7 @@ xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
@@ -979,7 +979,7 @@ xfs_vm_writepage(
 	 * Given that we do not allow direct reclaim to call us, we should
 	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		goto redirty;
 
 	/* Is this page beyond the end of the file? */
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 748b16aff45a..1d247366c733 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -176,7 +176,7 @@ xfs_trans_reserve(
 	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
@@ -186,7 +186,7 @@ xfs_trans_reserve(
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
 		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
@@ -263,7 +263,7 @@ xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	return error;
 }
@@ -921,7 +921,7 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
 	/*
@@ -951,7 +951,7 @@ __xfs_trans_commit(
 		if (commit_lsn == -1 && !error)
 			error = -EIO;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
 	xfs_trans_free(tp);
 
@@ -1005,7 +1005,7 @@ xfs_trans_cancel(
 		xfs_log_done(mp, tp->t_ticket, NULL, false);
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
 	xfs_trans_free(tp);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index acfc32b30704..820db8f98bfc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2115,6 +2115,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
+#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
-- 
2.8.0.rc3


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API
  2016-04-27 11:54     ` Michal Hocko
@ 2016-04-27 11:54       ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27 11:54 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, Theodore Ts'o, Chris Mason,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

GFP_NOFS context is currently used for the following 4 reasons:
	- to prevent deadlocks when a lock held by the allocation
	  context would be needed during memory reclaim
	- to prevent stack overflows during reclaim because the
	  allocation is performed from an already deep context
	- to prevent lockups when the allocation context depends on
	  other reclaimers to make forward progress indirectly
	- just in case, because this would be safe from the fs POV

Unfortunately, overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), and the OOM killer cannot be invoked because the
MM layer doesn't have enough information about how much memory is
freeable by the FS layer.

In many cases it is far from clear why the weaker context is even used,
so it might be used unnecessarily. We would like to get rid of such
uses as much as possible. One way to do that is to apply the flag to
scopes rather than to isolated allocations. Such a scope is declared
when really necessary and tracked per task, and all allocation requests
from within the context simply inherit the GFP_NOFS semantic.

Not only is this easier to understand and maintain, because there are
far fewer problematic contexts than specific allocation requests, it
also helps code paths where the FS layer interacts with other layers
(e.g. crypto, security modules, MM, etc.) and there is no easy way to
convey the allocation context between the layers.

Introduce a memalloc_nofs_{save,restore} API to control the scope of
the GFP_NOFS allocation context. This basically copies the
memalloc_noio_{save,restore} API we have for the other restricted
allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists
as an alias for PF_FSTRANS, which had been xfs specific until recently.
There are no PF_FSTRANS users left, so let's just drop it.
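
As an illustration (mine, not part of the patch), a filesystem path
that must not recurse into the FS would bracket its critical section
with the new pair; note that the allocation site itself can keep asking
for GFP_KERNEL:

	/* sketch only; fs_critical_alloc() is a made-up name */
	static void *fs_critical_alloc(size_t size)
	{
		unsigned int nofs_flags;
		void *ptr;

		/* open a GFP_NOFS scope, e.g. around a transaction */
		nofs_flags = memalloc_nofs_save();

		/*
		 * Allocations in this scope implicitly behave as
		 * GFP_NOFS: the page allocator strips __GFP_FS via
		 * current_gfp_context().
		 */
		ptr = kmalloc(size, GFP_KERNEL);

		/* close the scope; nests safely via the saved flags */
		memalloc_nofs_restore(nofs_flags);
		return ptr;
	}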

PF_MEMALLOC_NOFS is now checked in the MM layer and implicitly drops
__GFP_FS, the same way PF_MEMALLOC_NOIO drops __GFP_IO.
memalloc_noio_flags is renamed to current_gfp_context because it now
cares about both the PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.
Xfs code paths preserve their semantics, and kmem_flags_convert()
doesn't need to evaluate the flag anymore.
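
To make the precedence between the two scopes concrete, consider this
hypothetical nesting (mine, not from the patch); NOIO is the stronger
restriction, so it wins while both flags are set:

	unsigned int nofs, noio;

	nofs = memalloc_nofs_save();	/* scope is GFP_NOFS */
	noio = memalloc_noio_save();	/* scope tightens to GFP_NOIO */
	/* here current_gfp_context(GFP_KERNEL) == GFP_NOIO */
	memalloc_noio_restore(noio);	/* back to GFP_NOFS */
	memalloc_nofs_restore(nofs);	/* unrestricted again */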

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h         |  2 +-
 include/linux/gfp.h   |  8 ++++++++
 include/linux/sched.h | 34 ++++++++++++++++++++++++++--------
 mm/page_alloc.c       |  8 +++++---
 mm/vmscan.c           |  4 ++--
 5 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 0d83f332e5c2..b35688a54c9a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+		if (flags & KM_NOFS)
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..3ebdbdff44b4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 820db8f98bfc..e9521dc0475f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2102,9 +2102,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC	0x00004000	/* used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
-#define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
-#define PF_KSWAPD	0x00040000	/* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
+#define PF_KSWAPD	0x00020000	/* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000	/* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -2115,8 +2115,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
-#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
-
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
@@ -2142,13 +2140,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it takes precedence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }
 
@@ -2164,6 +2170,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bb8f4ec0b..86bb5d6ddd7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3801,10 +3801,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/*
-	 * Runtime PM, block IO and its error handling path can deadlock
-	 * because I/O on the device might not complete.
+	 * Apply scoped allocation constraints. This is mainly about
+	 * GFP_NOFS resp. GFP_NOIO which has to be inherited for all
+	 * allocation requests from a particular context which has
+	 * been marked by memalloc_no{fs,io}_{save,restore}
 	 */
-	alloc_mask = memalloc_noio_flags(gfp_mask);
+	alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;
 
 	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..cfb74de1efa3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2808,7 +2808,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.nodemask = nodemask,
 		.priority = DEF_PRIORITY,
@@ -3656,7 +3656,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	struct reclaim_state reclaim_state;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.priority = ZONE_RECLAIM_PRIORITY,
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
-- 
2.8.0.rc3


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API
@ 2016-04-27 11:54       ` Michal Hocko
  0 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27 11:54 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, Theodore Ts'o, Chris Mason,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

GFP_NOFS context is currently used for the following 4 reasons:
	- to prevent deadlocks when a lock held by the allocation
	  context would be needed during memory reclaim
	- to prevent stack overflows during reclaim because the
	  allocation is performed from an already deep context
	- to prevent lockups when the allocation context depends on
	  other reclaimers to make forward progress indirectly
	- just in case, because this would be safe from the fs POV

Unfortunately, overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), and the OOM killer cannot be invoked because the
MM layer doesn't have enough information about how much memory is
freeable by the FS layer.

In many cases it is far from clear why the weaker context is even used,
so it might be used unnecessarily. We would like to get rid of such
uses as much as possible. One way to do that is to apply the flag to
scopes rather than to isolated allocations. Such a scope is declared
when really necessary and tracked per task, and all allocation requests
from within the context simply inherit the GFP_NOFS semantic.

Not only is this easier to understand and maintain, because there are
far fewer problematic contexts than specific allocation requests, it
also helps code paths where the FS layer interacts with other layers
(e.g. crypto, security modules, MM, etc.) and there is no easy way to
convey the allocation context between the layers.

Introduce a memalloc_nofs_{save,restore} API to control the scope
of the GFP_NOFS allocation context. This basically copies the
memalloc_noio_{save,restore} API we have for the other restricted
allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists
as an alias for PF_FSTRANS, which was xfs specific until recently.
There are no PF_FSTRANS users left, so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly, the same way PF_MEMALLOC_NOIO drops __GFP_IO.
memalloc_noio_flags is renamed to current_gfp_context because it now
cares about both the PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.
Xfs code paths preserve their semantics. kmem_flags_convert() doesn't
need to evaluate the flag anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.
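
As a minimal usage sketch (hypothetical filesystem code, not part of
this patch; the helper name is made up for illustration), a NOFS scope
would bracket its critical section like this:

	#include <linux/sched.h>	/* memalloc_nofs_{save,restore} */
	#include <linux/slab.h>

	static void *fs_alloc_in_transaction(size_t size)
	{
		unsigned int nofs_flags;
		void *ptr;

		/* declare the scope; returns the previous NOFS state */
		nofs_flags = memalloc_nofs_save();
		/* every allocation in this scope implicitly drops __GFP_FS */
		ptr = kmalloc(size, GFP_KERNEL); /* effectively GFP_NOFS */
		/* restore the previous state so the pairs can nest */
		memalloc_nofs_restore(nofs_flags);
		return ptr;
	}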

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h         |  2 +-
 include/linux/gfp.h   |  8 ++++++++
 include/linux/sched.h | 34 ++++++++++++++++++++++++++--------
 mm/page_alloc.c       |  8 +++++---
 mm/vmscan.c           |  4 ++--
 5 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 0d83f332e5c2..b35688a54c9a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+		if (flags & KM_NOFS)
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..3ebdbdff44b4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 820db8f98bfc..e9521dc0475f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2102,9 +2102,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC	0x00004000	/* used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
-#define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
-#define PF_KSWAPD	0x00040000	/* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
+#define PF_KSWAPD	0x00020000	/* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000	/* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -2115,8 +2115,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
-#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
-
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
@@ -2142,13 +2140,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it takes precedence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }
 
@@ -2164,6 +2170,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bb8f4ec0b..86bb5d6ddd7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3801,10 +3801,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/*
-	 * Runtime PM, block IO and its error handling path can deadlock
-	 * because I/O on the device might not complete.
+	 * Apply scoped allocation constraints. This is mainly about
+	 * GFP_NOFS resp. GFP_NOIO which have to be inherited by all
+	 * allocation requests from a particular context which has
+	 * been marked by memalloc_no{fs,io}_{save,restore}
 	 */
-	alloc_mask = memalloc_noio_flags(gfp_mask);
+	alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;
 
 	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..cfb74de1efa3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2808,7 +2808,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.nodemask = nodemask,
 		.priority = DEF_PRIORITY,
@@ -3656,7 +3656,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	struct reclaim_state reclaim_state;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.priority = ZONE_RECLAIM_PRIORITY,
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
-- 
2.8.0.rc3


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* Re: [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API
  2016-04-27 11:54       ` Michal Hocko
@ 2016-04-27 13:07         ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27 13:07 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, Theodore Ts'o, Chris Mason,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Wed 27-04-16 13:54:35, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 

Oops, I missed Dave's note about:

> GFP_NOFS context is currently used for the following 4 reasons
> 	- to prevent deadlocks when a lock held by the allocation
> 	  context would be needed during memory reclaim
> 	- to prevent stack overflows during reclaim because the
> 	  allocation is performed from an already deep context
> 	- to prevent lockups when the allocation context indirectly
> 	  depends on other reclaimers to make forward progress
> 	- just in case, because this would be safe from the fs POV

	- to silence lockdep false positives
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 1/2] mm: add PF_MEMALLOC_NOFS
  2016-04-27 11:15       ` Michal Hocko
@ 2016-04-27 14:44         ` Tetsuo Handa
  -1 siblings, 0 replies; 144+ messages in thread
From: Tetsuo Handa @ 2016-04-27 14:44 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, akpm, david, tytso, clm, jack, linux-kernel, zohar

Michal Hocko wrote:
> On Wed 27-04-16 19:53:21, Tetsuo Handa wrote:
> [...]
> > > Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
> > > usage as much as possible and only use properly documented
> > > memalloc_nofs_{save,restore} checkpoints where they are appropriate.
> > 
> > Is the story simple enough to monotonically replace GFP_NOFS/GFP_NOIO
> > with GFP_KERNEL after memalloc_no{fs,io}_{save,restore} are inserted?
> > We sometimes delegate some operations to somebody else. Don't we need to
> > convey PF_MEMALLOC_NOFS/PF_MEMALLOC_NOIO flags to APIs which interact with
> > other threads?
> 
> We can add an api to do that if that is really needed.
> 

I'm not familiar with the integrity subsystem.
But if the call trace shown below is possible and evm_verify_hmac() is
called from a genuine GFP_NOFS context, we are currently using
GFP_KERNEL incorrectly. Therefore, inserting memalloc_nofs_{save,restore}
would avoid a possible memory reclaim deadlock caused by __GFP_FS.

----------
static enum integrity_status evm_verify_hmac(struct dentry *dentry, const char *xattr_name, char *xattr_value, size_t xattr_value_len, struct integrity_iint_cache *iint) {
  rc = vfs_getxattr_alloc(dentry, XATTR_NAME_EVM, (char **)&xattr_data, 0, GFP_NOFS); /***** GFP_NOFS is used here. *****/
  rc = integrity_digsig_verify(INTEGRITY_KEYRING_EVM, (const char *)xattr_data, xattr_len, calc.digest, sizeof(calc.digest)) {
    keyring[id] = request_key(&key_type_keyring, keyring_name[id], NULL) {
      key = request_key_and_link(type, description, callout_info, callout_len, NULL, NULL, KEY_ALLOC_IN_QUOTA) {
        key = construct_key_and_link(&ctx, callout_info, callout_len, aux, dest_keyring, flags) {
          ret = construct_alloc_key(ctx, dest_keyring, flags, user, &key) {
            key = key_alloc(ctx->index_key.type, ctx->index_key.description, ctx->cred->fsuid, ctx->cred->fsgid, ctx->cred, perm, flags) {
              key = kmem_cache_zalloc(key_jar, GFP_KERNEL); /***** Needs to use GFP_NOFS here if above GFP_NOFS usage is correct. *****/
            }
          }
          ret = construct_key(key, callout_info, callout_len, aux, dest_keyring) {
            cons = kmalloc(sizeof(*cons), GFP_KERNEL); /***** Ditto. *****/
            actor = call_sbin_request_key;
            ret = actor(cons, "create", aux) {
              ret = call_usermodehelper_keys(argv[0], argv, envp, keyring, UMH_WAIT_PROC) {
                info = call_usermodehelper_setup(path, argv, envp, GFP_KERNEL, umh_keys_init, umh_keys_cleanup, session_keyring); /***** Ditto. *****/
                return call_usermodehelper_exec(info, wait) {
                  queue_work(system_unbound_wq, &sub_info->work); /***** Queuing a GFP_NOFS work item here if above GFP_NOFS usage is correct. *****/
                  wait_for_completion(&done); /***** But kworker uses GFP_KERNEL to create process for executing userspace program. *****/
                }
              }
            }
          }
        }
      }
    }
  }
}
----------

But there is a path where evm_verify_hmac() calls the usermode helper.
If evm_verify_hmac() calls the usermode helper from a genuine GFP_NOFS
context, we will still be failing to tell the kworker to use GFP_NOFS.

A more problematic thing might be that we queue both GFP_KERNEL work
items and GFP_NOFS work items into the same workqueue. This means that
the kworker may try __GFP_FS reclaim while processing the current
GFP_KERNEL work item and get blocked on a fs lock held by the next
GFP_NOFS work item. Then, simply conveying the PF_MEMALLOC_NOFS /
PF_MEMALLOC_NOIO flags to other threads is not sufficient; do we need
to create separate workqueues (and respective consumers) for GFP_KERNEL
work items and GFP_NOFS work items? (Or do we have no such problem
because khelper_wq was replaced with system_unbound_wq?)

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
  2016-04-27 11:54     ` Michal Hocko
@ 2016-04-27 17:41       ` Andreas Dilger
  -1 siblings, 0 replies; 144+ messages in thread
From: Andreas Dilger @ 2016-04-27 17:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux MM, linux-fsdevel, Andrew Morton, Dave Chinner,
	Theodore Ts'o, Chris Mason, Jan Kara, ceph-devel,
	cluster-devel, Linux NFS Mailing List, logfs, XFS Developers,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

On Apr 27, 2016, at 5:54 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> From: Michal Hocko <mhocko@suse.com>
> 
> xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> some time ago. We would like to make this concept more generic and use
> it for other filesystems as well. Let's start by giving the flag a
> more generic name PF_MEMALLOC_NOFS which is in line with the existing
> PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> step before we introduce a full API for it as xfs uses the flag directly
> anyway.
> 
> This patch doesn't introduce any functional change.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> Hi,
> as suggested by Dave, I have split up [1] into two parts. The first one
> adds a new PF flag which is just an alias to the existing PF_FSTRANS
> and does all the renaming and the second one to introduce the generic
> API which only changes the bare minimum in the xfs proper.
> 
> fs/xfs/kmem.c             |  4 ++--
> fs/xfs/kmem.h             |  2 +-
> fs/xfs/libxfs/xfs_btree.c |  2 +-
> fs/xfs/xfs_aops.c         |  6 +++---
> fs/xfs/xfs_trans.c        | 12 ++++++------
> include/linux/sched.h     |  2 ++
> 6 files changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 686ba6fb20dd..73f6ab59c664 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
> 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
> 	 * the filesystem here and potentially deadlocking.
> 	 */
> -	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> +	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> 		noio_flag = memalloc_noio_save();
> 
> 	lflags = kmem_flags_convert(flags);
> 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
> 
> -	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> +	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> 		memalloc_noio_restore(noio_flag);

Not really the fault of this patch, but it brings this nasty bit of code into
the light.  Is all of this machinery still needed given that __vmalloc() can
accept GFP flags?  If yes, wouldn't it be better to fix __vmalloc() to honor
the GFP flags instead of working around it in the filesystem code?

Cheers, Andreas

> 	return ptr;
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index d1c66e465ca5..0d83f332e5c2 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
> 		lflags = GFP_ATOMIC | __GFP_NOWARN;
> 	} else {
> 		lflags = GFP_KERNEL | __GFP_NOWARN;
> -		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> +		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> 			lflags &= ~__GFP_FS;
> 	}
> 
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index a0eb18ce3ad3..326566f4a131 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2540,7 +2540,7 @@ xfs_btree_split_worker(
> 	struct xfs_btree_split_args	*args = container_of(work,
> 						struct xfs_btree_split_args, work);
> 	unsigned long		pflags;
> -	unsigned long		new_pflags = PF_FSTRANS;
> +	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
> 
> 	/*
> 	 * we are in a transaction context here, but may also be doing work
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d12dfcfd0cc8..6d816ff0b763 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc(
> 	 * We hand off the transaction to the completion thread now, so
> 	 * clear the flag here.
> 	 */
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> 	return 0;
> }
> 
> @@ -169,7 +169,7 @@ xfs_setfilesize_ioend(
> 	 * thus we need to mark ourselves as being in a transaction manually.
> 	 * Similarly for freeze protection.
> 	 */
> -	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
> 
> 	/* we abort the update if there was an IO error */
> @@ -979,7 +979,7 @@ xfs_vm_writepage(
> 	 * Given that we do not allow direct reclaim to call us, we should
> 	 * never be called while in a filesystem transaction.
> 	 */
> -	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
> +	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> 		goto redirty;
> 
> 	/* Is this page beyond the end of the file? */
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 748b16aff45a..1d247366c733 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -176,7 +176,7 @@ xfs_trans_reserve(
> 	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
> 
> 	/* Mark this thread as being in a transaction */
> -	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> 
> 	/*
> 	 * Attempt to reserve the needed disk blocks by decrementing
> @@ -186,7 +186,7 @@ xfs_trans_reserve(
> 	if (blocks > 0) {
> 		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
> 		if (error != 0) {
> -			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> 			return -ENOSPC;
> 		}
> 		tp->t_blk_res += blocks;
> @@ -263,7 +263,7 @@ xfs_trans_reserve(
> 		tp->t_blk_res = 0;
> 	}
> 
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> 
> 	return error;
> }
> @@ -921,7 +921,7 @@ __xfs_trans_commit(
> 
> 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
> 
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> 	xfs_trans_free(tp);
> 
> 	/*
> @@ -951,7 +951,7 @@ __xfs_trans_commit(
> 		if (commit_lsn == -1 && !error)
> 			error = -EIO;
> 	}
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> 	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
> 	xfs_trans_free(tp);
> 
> @@ -1005,7 +1005,7 @@ xfs_trans_cancel(
> 		xfs_log_done(mp, tp->t_ticket, NULL, false);
> 
> 	/* mark this thread as no longer being in a transaction */
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> 
> 	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
> 	xfs_trans_free(tp);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index acfc32b30704..820db8f98bfc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2115,6 +2115,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
> #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
> #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
> 
> +#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
> +
> /*
>  * Only the _current_ task can read/write to tsk->flags, but other
>  * tasks can access tsk->flags in readonly mode for example
> --
> 2.8.0.rc3
> 


Cheers, Andreas







^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
  2016-04-27 17:41       ` Andreas Dilger
  (?)
  (?)
@ 2016-04-27 19:43         ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27 19:43 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Linux MM, linux-fsdevel, Andrew Morton, Dave Chinner,
	Theodore Ts'o, Chris Mason, Jan Kara, ceph-devel,
	cluster-devel, Linux NFS Mailing List, logfs, XFS Developers,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Wed 27-04-16 11:41:51, Andreas Dilger wrote:
> On Apr 27, 2016, at 5:54 AM, Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
> > 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
> > 	 * the filesystem here and potentially deadlocking.
> > 	 */
> > -	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> > +	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> > 		noio_flag = memalloc_noio_save();
> > 
> > 	lflags = kmem_flags_convert(flags);
> > 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
> > 
> > -	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> > +	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> > 		memalloc_noio_restore(noio_flag);
> 
> Not really the fault of this patch, but it brings this nasty bit of code into
> the light.  Is all of this machinery still needed given that __vmalloc() can
> accept GFP flags?  If yes, wouldn't it be better to fix __vmalloc() to honor
> the GFP flags instead of working around it in the filesystem code?

This is not that easy. __vmalloc() can accept gfp flags, but it doesn't
honor __GFP_IO 100%. IIRC some paths, like the page table allocations,
are hardcoded to GFP_KERNEL. Besides that, I would like to see GFP_NOIO
used via the memalloc_noio_{save,restore} API as well, for similar
reasons as GFP_NOFS - it is just easier to reason about a scope than
about particular code paths which might be shared.
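
To illustrate what I mean by scope, here is a minimal sketch using the
existing NOIO helpers (the function name and size parameter are made up
for illustration):

	/*
	 * Everything between save and restore implicitly behaves as
	 * GFP_NOIO - even allocations buried in shared helpers which
	 * never see our gfp_mask.
	 */
	static void *noio_scope_alloc(size_t size)
	{
		unsigned int noio_flag;
		void *ptr;

		noio_flag = memalloc_noio_save();
		ptr = kmalloc(size, GFP_KERNEL);	/* effectively GFP_NOIO */
		memalloc_noio_restore(noio_flag);

		return ptr;
	}

The point is that the NOIO semantic follows the task rather than the
callsite, which is exactly what shared code paths need.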
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 1/2] mm: add PF_MEMALLOC_NOFS
  2016-04-27 14:44         ` Tetsuo Handa
@ 2016-04-27 20:05           ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27 20:05 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, david, tytso, clm, jack, linux-kernel, zohar

On Wed 27-04-16 23:44:35, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 27-04-16 19:53:21, Tetsuo Handa wrote:
> > [...]
> > > > Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
> > > > usage as much and possible and only use a properly documented
> > > > memalloc_nofs_{save,restore} checkpoints where they are appropriate.
> > > 
> > > Is the story simple enough to monotonically replace GFP_NOFS/GFP_NOIO
> > > with GFP_KERNEL after memalloc_no{fs,io}_{save,restore} are inserted?
> > > We sometimes delegate some operations to somebody else. Don't we need to
> > > convey PF_MEMALLOC_NOFS/PF_MEMALLOC_NOIO flags to APIs which interact with
> > > other threads?
> > 
> > We can add an api to do that if that is really needed.
> > 
> 
> I'm not familiar with the integrity subsystem.
> But if the call trace shown below is possible and evm_verify_hmac() is called
> from a genuine GFP_NOFS context, we are currently using GFP_KERNEL incorrectly.
> Therefore, inserting memalloc_nofs_{save,restore} would avoid a possible memory
> reclaim deadlock caused by __GFP_FS.

I am not familiar with this code either, but you are definitely right
that a scope GFP_NOFS would be better. I suspect that NOFS is used
here improperly and simply copies the gfp mask used for all the other
allocations in the same file without any good reason.

This would be a question for Mimi. Could you clarify please?

> ----------
> static enum integrity_status evm_verify_hmac(struct dentry *dentry, const char *xattr_name, char *xattr_value, size_t xattr_value_len, struct integrity_iint_cache *iint) {
>   rc = vfs_getxattr_alloc(dentry, XATTR_NAME_EVM, (char **)&xattr_data, 0, GFP_NOFS); /***** GFP_NOFS is used here. *****/
>   rc = integrity_digsig_verify(INTEGRITY_KEYRING_EVM, (const char *)xattr_data, xattr_len, calc.digest, sizeof(calc.digest)) {
>     keyring[id] = request_key(&key_type_keyring, keyring_name[id], NULL) {
>       key = request_key_and_link(type, description, callout_info, callout_len, NULL, NULL, KEY_ALLOC_IN_QUOTA) {
>         key = construct_key_and_link(&ctx, callout_info, callout_len, aux, dest_keyring, flags) {
>           ret = construct_alloc_key(ctx, dest_keyring, flags, user, &key) {
>             key = key_alloc(ctx->index_key.type, ctx->index_key.description, ctx->cred->fsuid, ctx->cred->fsgid, ctx->cred, perm, flags) {
>               key = kmem_cache_zalloc(key_jar, GFP_KERNEL); /***** Needs to use GFP_NOFS here if above GFP_NOFS usage is correct. *****/
>             }
>           }
>           ret = construct_key(key, callout_info, callout_len, aux, dest_keyring) {
>             cons = kmalloc(sizeof(*cons), GFP_KERNEL); /***** Ditto. *****/
>             actor = call_sbin_request_key;
>             ret = actor(cons, "create", aux) {
>               ret = call_usermodehelper_keys(argv[0], argv, envp, keyring, UMH_WAIT_PROC) {
>                 info = call_usermodehelper_setup(path, argv, envp, GFP_KERNEL, umh_keys_init, umh_keys_cleanup, session_keyring); /***** Ditto. *****/
>                 return call_usermodehelper_exec(info, wait) {
>                   queue_work(system_unbound_wq, &sub_info->work); /***** Queuing a GFP_NOFS work item here if above GFP_NOFS usage is correct. *****/
>                   wait_for_completion(&done); /***** But kworker uses GFP_KERNEL to create process for executing userspace program. *****/
>                 }
>               }
>             }
>           }
>         }
>       }
>     }
>   }
> }
> ----------
> 
> But there is a path where evm_verify_hmac() calls the usermode helper.
> If evm_verify_hmac() calls the usermode helper from a genuine GFP_NOFS
> context, we are still failing to tell the kworker to use GFP_NOFS.

This would be a terrible thing to do. Because ...

> A more problematic thing might be that we queue both GFP_KERNEL work items
> and GFP_NOFS work items into the same workqueue. This means that the
> kworker may try __GFP_FS reclaim while processing the current GFP_KERNEL
> work item and get blocked on a fs lock held by the next GFP_NOFS work
> item. Then, simply conveying the PF_MEMALLOC_NOFS/PF_MEMALLOC_NOIO flags
> to other threads is not sufficient, and we need to create separate
> workqueues (and respective consumers) for GFP_KERNEL work items and
> GFP_NOFS work items?

... of this very reason. If some GFP_NOFS code path relies on kworkers
and waits for the work synchronously, then it really has to make sure
that the WQ has a rescuer and that there are no __GFP_FS allocation
requests enqueued on the same WQ.
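
For illustration, a minimal sketch of what such a caller would have to
set up (the queue and work item names are made up):

	/*
	 * WQ_MEM_RECLAIM is what guarantees a rescuer thread, so queued
	 * work can make forward progress even when no new worker can be
	 * forked under memory pressure.
	 */
	static struct workqueue_struct *fs_nofs_wq;
	static struct work_struct fs_nofs_work;

	fs_nofs_wq = alloc_workqueue("fs_nofs", WQ_MEM_RECLAIM, 0);

	queue_work(fs_nofs_wq, &fs_nofs_work);
	/* waiting is safe only because no __GFP_FS item shares this queue */
	flush_work(&fs_nofs_work);

Mixing GFP_KERNEL items into the same dedicated queue would reintroduce
exactly the dependency you describe.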

> (Or we have no such problem because khelper_wq was replaced with
> system_unbound_wq ?)

I do not think so. system_unbound_wq still depends on having some workers
available, and that might not be true under memory pressure.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API
  2016-04-27 11:54       ` Michal Hocko
@ 2016-04-27 20:09         ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27 20:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Theodore Ts'o,
	Chris Mason, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, xfs, linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

Hi Dave,

On Wed 27-04-16 13:54:35, Michal Hocko wrote:
[...]
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 0d83f332e5c2..b35688a54c9a 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  		lflags = GFP_ATOMIC | __GFP_NOWARN;
>  	} else {
>  		lflags = GFP_KERNEL | __GFP_NOWARN;
> -		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> +		if (flags & KM_NOFS)
>  			lflags &= ~__GFP_FS;
>  	}
>  

I was trying to reproduce the false positives you mentioned in the other
email by reverting b17cb364dbbb ("xfs: fix missing KM_NOFS tags to keep
lockdep happy"). I could indeed hit the one below, but then it turned out
that it is the hunk above which causes the lockdep warning. Now I am
trying to understand how this is possible, because ~__GFP_FS is applied
at the page allocation level, so we should never leak the __GFP_FS
context while the PF flag is set.

One possible explanation would be that some code path hands work over to
a kworker which then loses the PF flag; with the original code it would
at least get a gfp_mask with the FS flag cleared.
xfs_btree_split_worker uses PF_MEMALLOC_NOFS directly, so it should
be OK, while xfs_reclaim_worker and xfs_eofblocks_worker use struct
xfs_mount as their context, which doesn't carry a gfp_mask. The stack
trace also doesn't indicate any kworker involvement. Do you have an idea
what else might go wrong, or what am I missing?
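
For reference, the scope masking I am talking about boils down to a
single helper at the allocator entry - a sketch of what patch 1.2
proposes (the exact form in the patch may differ):

	static inline gfp_t current_gfp_context(gfp_t flags)
	{
		/*
		 * The page allocator applies this on entry, which is why a
		 * raw gfp_mask with __GFP_FS set should never leak past it
		 * while PF_MEMALLOC_NOFS is set on the task.
		 */
		if (unlikely(current->flags & PF_MEMALLOC_NOFS))
			flags &= ~__GFP_FS;
		return flags;
	}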

---
[   53.990821] =================================
[   53.991672] [ INFO: inconsistent lock state ]
[   53.992419] 4.5.0-nofs5-00004-g35ad69a8eb83-dirty #902 Not tainted
[   53.993458] ---------------------------------
[   53.993480] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
[   53.993480] kswapd0/467 [HC0[0]:SC0[0]:HE1:SE1] takes:
[   53.993480]  (&xfs_nondir_ilock_class){+++++-}, at: [<ffffffffa0066897>] xfs_ilock+0x18a/0x205 [xfs]
[   53.993480] {RECLAIM_FS-ON-W} state was registered at:
[   53.993480]   [<ffffffff810945e3>] mark_held_locks+0x5e/0x74
[   53.993480]   [<ffffffff8109722c>] lockdep_trace_alloc+0xb2/0xb5
[   53.993480]   [<ffffffff81174e56>] kmem_cache_alloc+0x36/0x2b0
[   53.993480]   [<ffffffffa0073422>] kmem_zone_alloc+0x65/0xc1 [xfs]
[   54.003925]   [<ffffffffa007a701>] xfs_buf_item_init+0x40/0x147 [xfs]
[   54.003925]   [<ffffffffa0084451>] _xfs_trans_bjoin+0x23/0x53 [xfs]
[   54.003925]   [<ffffffffa0084ea5>] xfs_trans_read_buf_map+0x2e9/0x5b3 [xfs]
[   54.003925]   [<ffffffffa001427c>] xfs_read_agf+0x141/0x1d4 [xfs]
[   54.003925]   [<ffffffffa0014413>] xfs_alloc_read_agf+0x104/0x223 [xfs]
[   54.003925]   [<ffffffffa001482f>] xfs_alloc_pagf_init+0x1a/0x3a [xfs]
[   54.003925]   [<ffffffffa001e1f2>] xfs_bmap_longest_free_extent+0x4c/0x9c [xfs]
[   54.003925]   [<ffffffffa0026225>] xfs_bmap_btalloc_nullfb+0x7a/0xc6 [xfs]
[   54.003925]   [<ffffffffa00289fc>] xfs_bmap_btalloc+0x21a/0x59d [xfs]
[   54.003925]   [<ffffffffa0028d8d>] xfs_bmap_alloc+0xe/0x10 [xfs]
[   54.003925]   [<ffffffffa0029612>] xfs_bmapi_write+0x401/0x80f [xfs]
[   54.003925]   [<ffffffffa0064209>] xfs_iomap_write_allocate+0x1bf/0x2b3 [xfs]
[   54.003925]   [<ffffffffa004be66>] xfs_map_blocks+0x141/0x3c9 [xfs]
[   54.003925]   [<ffffffffa004d8f1>] xfs_vm_writepage+0x3f6/0x612 [xfs]
[   54.003925]   [<ffffffff8112e6b4>] __writepage+0x16/0x34
[   54.003925]   [<ffffffff8112ecdb>] write_cache_pages+0x35d/0x4b4
[   54.003925]   [<ffffffff8112ee82>] generic_writepages+0x50/0x6f
[   54.003925]   [<ffffffffa004bbc2>] xfs_vm_writepages+0x44/0x4c [xfs]
[   54.003925]   [<ffffffff81130b7c>] do_writepages+0x23/0x2c
[   54.003925]   [<ffffffff81124395>] __filemap_fdatawrite_range+0x84/0x8b
[   54.003925]   [<ffffffff8112445f>] filemap_write_and_wait_range+0x2d/0x5b
[   54.003925]   [<ffffffffa0059512>] xfs_file_fsync+0x113/0x29a [xfs]
[   54.003925]   [<ffffffff811bd269>] vfs_fsync_range+0x8c/0x9e
[   54.003925]   [<ffffffff811bd297>] vfs_fsync+0x1c/0x1e
[   54.003925]   [<ffffffff811bd2ca>] do_fsync+0x31/0x4a
[   54.003925]   [<ffffffff811bd50c>] SyS_fsync+0x10/0x14
[   54.003925]   [<ffffffff81618197>] entry_SYSCALL_64_fastpath+0x12/0x6b
[   54.003925] irq event stamp: 2998549
[   54.003925] hardirqs last  enabled at (2998549): [<ffffffff81617997>] _raw_spin_unlock_irq+0x2c/0x4a
[   54.003925] hardirqs last disabled at (2998548): [<ffffffff81617805>] _raw_spin_lock_irq+0x13/0x47
[   54.003925] softirqs last  enabled at (2994888): [<ffffffff8161afbf>] __do_softirq+0x38f/0x4d5
[   54.003925] softirqs last disabled at (2994867): [<ffffffff810542ea>] irq_exit+0x6f/0xd1
[   54.003925] 
[   54.003925] other info that might help us debug this:
[   54.003925]  Possible unsafe locking scenario:
[   54.003925] 
[   54.003925]        CPU0
[   54.003925]        ----
[   54.003925]   lock(&xfs_nondir_ilock_class);
[   54.003925]   <Interrupt>
[   54.003925]     lock(&xfs_nondir_ilock_class);
[   54.003925] 
[   54.003925]  *** DEADLOCK ***
[   54.003925] 
[   54.003925] 2 locks held by kswapd0/467:
[   54.003925]  #0:  (shrinker_rwsem){++++..}, at: [<ffffffff81136350>] shrink_slab+0x7a/0x518
[   54.003925]  #1:  (&type->s_umount_key#25){.+.+..}, at: [<ffffffff81192b74>] trylock_super+0x1b/0x4b
[   54.003925] 
[   54.003925] stack backtrace:
[   54.003925] CPU: 0 PID: 467 Comm: kswapd0 Not tainted 4.5.0-nofs5-00004-g35ad69a8eb83-dirty #902
[   54.003925] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[   54.003925]  0000000000000000 ffff8800065678b8 ffffffff81308ded ffffffff825caa70
[   54.003925]  ffff880007328000 ffff8800065678f0 ffffffff81121489 0000000000000009
[   54.003925]  ffff8800073288a0 ffff880007328000 ffffffff810936a7 0000000000000009
[   54.003925] Call Trace:
[   54.003925]  [<ffffffff81308ded>] dump_stack+0x67/0x90
[   54.003925]  [<ffffffff81121489>] print_usage_bug.part.24+0x259/0x268
[   54.003925]  [<ffffffff810936a7>] ? print_shortest_lock_dependencies+0x180/0x180
[   54.003925]  [<ffffffff8109439f>] mark_lock+0x381/0x567
[   54.003925]  [<ffffffff810953ff>] __lock_acquire+0x9f7/0x190c
[   54.003925]  [<ffffffffa0066897>] ? xfs_ilock+0x18a/0x205 [xfs]
[   54.003925]  [<ffffffff81093f2c>] ? check_irq_usage+0x99/0xaa
[   54.003925]  [<ffffffff81092c77>] ? add_lock_to_list.isra.8.constprop.25+0x82/0x8d
[   54.003925]  [<ffffffff810961b1>] ? __lock_acquire+0x17a9/0x190c
[   54.003925]  [<ffffffff81096ae2>] lock_acquire+0x139/0x1e1
[   54.003925]  [<ffffffff81096ae2>] ? lock_acquire+0x139/0x1e1
[   54.003925]  [<ffffffffa0066897>] ? xfs_ilock+0x18a/0x205 [xfs]
[   54.003925]  [<ffffffffa0050df2>] ? xfs_free_eofblocks+0x84/0x1ce [xfs]
[   54.003925]  [<ffffffff81090e34>] down_read_nested+0x29/0x3e
[   54.003925]  [<ffffffffa0066897>] ? xfs_ilock+0x18a/0x205 [xfs]
[   54.003925]  [<ffffffffa0066897>] xfs_ilock+0x18a/0x205 [xfs]
[   54.003925]  [<ffffffffa0050df2>] xfs_free_eofblocks+0x84/0x1ce [xfs]
[   54.003925]  [<ffffffff81094765>] ? trace_hardirqs_on_caller+0x16c/0x188
[   54.003925]  [<ffffffffa0069ef7>] xfs_inactive+0x55/0xc6 [xfs]
[   54.003925]  [<ffffffffa006ef1b>] xfs_fs_evict_inode+0x14b/0x1bd [xfs]
[   54.003925]  [<ffffffff811a7db9>] evict+0xb0/0x165
[   54.003925]  [<ffffffff811a7eaa>] dispose_list+0x3c/0x4a
[   54.003925]  [<ffffffff811a9241>] prune_icache_sb+0x4a/0x55
[   54.003925]  [<ffffffff81192cd3>] super_cache_scan+0x12f/0x179
[   54.003925]  [<ffffffff811365a1>] shrink_slab+0x2cb/0x518
[   54.003925]  [<ffffffff81139b97>] shrink_zone+0x175/0x263
[   54.003925]  [<ffffffff8113b175>] kswapd+0x7dc/0x948
[   54.003925]  [<ffffffff8113a999>] ? mem_cgroup_shrink_node_zone+0x305/0x305
[   54.003925]  [<ffffffff8106c911>] kthread+0xed/0xf5
[   54.003925]  [<ffffffff81617997>] ? _raw_spin_unlock_irq+0x2c/0x4a
[   54.003925]  [<ffffffff8106c824>] ? kthread_create_on_node+0x1bd/0x1bd
[   54.003925]  [<ffffffff816184ef>] ret_from_fork+0x3f/0x70
[   54.003925]  [<ffffffff8106c824>] ? kthread_create_on_node+0x1bd/0x1bd
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API
  2016-04-27 20:09         ` Michal Hocko
@ 2016-04-27 20:30           ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27 20:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Theodore Ts'o,
	Chris Mason, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, xfs, linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Wed 27-04-16 22:09:27, Michal Hocko wrote:
[...]
> [   53.993480]   [<ffffffff810945e3>] mark_held_locks+0x5e/0x74
> [   53.993480]   [<ffffffff8109722c>] lockdep_trace_alloc+0xb2/0xb5
> [   53.993480]   [<ffffffff81174e56>] kmem_cache_alloc+0x36/0x2b0

Scratch that - I got it. It is the lockdep annotation which I got wrong
in my patch; I thought this was done much later, in the slow path.
My head is burnt, so I will get back to it tomorrow. Patch 1.1 should
still be OK to go for XFS though, because it doesn't really introduce
anything new.

Sorry about the noise!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API
  2016-04-27 11:54       ` Michal Hocko
                           ` (3 preceding siblings ...)
  (?)
@ 2016-04-27 21:14         ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-27 21:14 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, Theodore Ts'o, Chris Mason,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

OK, so the lockdep splats I was seeing [1] were much easier to fix than
I originally thought. The following should be folded into the original
patch; I will send the full patch later on.

[1] http://lkml.kernel.org/r/20160427200927.GC22544@dhcp22.suse.cz
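
For context, a minimal sketch of what a filesystem-side caller of the
scope API looks like (assuming the memalloc_nofs_{save,restore} helpers
and PF_MEMALLOC_NOFS from patch 1; the caller below is made up):

	/* Illustrative caller only -- not from any real filesystem. */
	void fs_transaction_scope_example(void)
	{
		unsigned int nofs_flags;
		void *buf;

		nofs_flags = memalloc_nofs_save();	/* enter the NOFS scope */
		/*
		 * Any __GFP_FS allocation below is implicitly degraded to
		 * GFP_NOFS via current_gfp_context(); with the hunk below,
		 * __lockdep_trace_alloc skips it as well.
		 */
		buf = kmalloc(128, GFP_KERNEL);
		kfree(buf);
		memalloc_nofs_restore(nofs_flags);	/* leave the scope */
	}
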
---
From 1968c0a8ace4090a9deca8f4c1a206ee546e595a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 27 Apr 2016 22:32:57 +0200
Subject: [PATCH] fold me "mm: introduce memalloc_nofs_{save,restore} API"

Lockdep infrastructure is hooked into early hot paths of the allocator,
so __lockdep_trace_alloc has to check for PF_MEMALLOC_NOFS directly
rather than rely on current_gfp_context.

Signed-off-by: Michal Hocko <mhocko@suse.com>

---
 kernel/locking/lockdep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 716547fdb873..f60124d0871c 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2750,7 +2750,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 		return;
 
 	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
+	if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
 		return;
 
 	/*
-- 
2.8.0.rc3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 144+ messages in thread

* Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context
  2016-04-27  8:03       ` Michal Hocko
  (?)
  (?)
@ 2016-04-27 22:55         ` Dave Chinner
  -1 siblings, 0 replies; 144+ messages in thread
From: Dave Chinner @ 2016-04-27 22:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Theodore Ts'o,
	Chris Mason, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, xfs, linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Wed, Apr 27, 2016 at 10:03:11AM +0200, Michal Hocko wrote:
> On Wed 27-04-16 08:58:45, Dave Chinner wrote:
> > On Tue, Apr 26, 2016 at 01:56:12PM +0200, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE
> > > 
> > > It is desirable to reduce the direct GFP_NO{FS,IO} usage at minimum and
> > > prefer scope usage defined by memalloc_no{fs,io}_{save,restore} API.
> > > 
> > > Let's help this process and add a debugging tool to catch when an
> > > explicit allocation request for GFP_NO{FS,IO} is done from the scope
> > > context. The printed stacktrace should help to identify the caller
> > > and evaluate whether it can be changed to use a wider context or whether
> > > it is called from another potentially dangerous context which needs
> > > a scope protection as well.
> > 
> > You're going to get a large number of these from XFS. There are call
> > paths in XFS that get called both inside and outside transaction
> > context, and many of them are marked with GFP_NOFS to prevent issues
> > that have cropped up in the past.
> > 
> > Often these are to silence lockdep warnings (e.g. commit b17cb36
> > ("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
> > lockdep gets very unhappy about the same functions being called with
> > different reclaim contexts. e.g.  directory block mapping might
> > occur from readdir (no transaction context) or within transactions
> > (create/unlink). Hence paths like this are tagged with GFP_NOFS to
> > stop lockdep emitting false positive warnings....
> 
> I would much rather see lockdep being fixed than abusing GFP_NOFS to
> work around its limitations. GFP_NOFS has real consequences for
> memory reclaim. I will go and check the commit you mentioned and try
> to understand why that is a problem. From what you described above
> I would like to get rid of exactly this kind of usage which is not
> really needed for the recursion protection.

The problem is that every time we come across this, the answer is
"use lockdep annotations". Our lockdep annotations are freakin'
complex because of this, and more often than not lockdep false
positives occur due to bugs in the annotations. e.g. see
fs/xfs/xfs_inode.h for all the inode locking annotations we have to
use and the hoops we have to jump through because we are limited to
8 subclasses and we have to be able to annotate nested inode locks
5 deep in places (RENAME_WHITEOUT, thanks).

At one point, we had to reset lockdep classes for inodes in reclaim
so that they didn't throw lockdep false positives the moment an
inode was locked in a memory reclaim context. We had to change
locking to remove that problem (commit 4f59af7 ("xfs: remove iolock
lock classes")). Then there were all the problems with reclaim
triggering lockdep warnings on directory inodes - we had to add a
separate directory inode class for them, and even then we still need
GFP_NOFS in places to minimise reclaim noise (as per the above
commit).

Put simply: we've had to resort to designing locking and allocation
strategies around the limitations of lockdep annotations, as opposed
to what is actually possible or even optimal. i.e. when the choice
is a 2 minute fix to add GFP_NOFS in cases like this, versus another
week long effort to rewrite the inode annotations (again) like this
one last year:

commit 0952c8183c1575a78dc416b5e168987ff98728bb
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 19 10:32:49 2015 +1000

    xfs: clean up inode lockdep annotations
    
    Lockdep annotations are a maintenance nightmare. Locking has to be
    modified to suit the limitations of the annotations, and we're
    always having to fix the annotations because they are unable to
    express the complexity of locking heirarchies correctly.
.....

It's a no-brainer to see why GFP_NOFS will be added to the
allocation in question. I've been saying for years that I consider
lockdep harmful - if you want to get rid of GFP_NOFS, then you're
going to need to sort out the lockdep reclaim annotation mess at the
same time...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context
  2016-04-26 22:58     ` Dave Chinner
  (?)
@ 2016-04-28  8:17       ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-28  8:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, Andrew Morton, Jan Kara, xfs, LKML

[Trim the CC list]
On Wed 27-04-16 08:58:45, Dave Chinner wrote:
[...]
> Often these are to silence lockdep warnings (e.g. commit b17cb36
> ("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
> lockdep gets very unhappy about the same functions being called with
> different reclaim contexts. e.g.  directory block mapping might
> occur from readdir (no transaction context) or within transactions
> (create/unlink). hence paths like this are tagged with GFP_NOFS to
> stop lockdep emitting false positive warnings....

As already said in another email, I have tried to revert the above
commit and run it with some fs workloads, but didn't manage to hit any
lockdep splats (after I fixed my bug in patch 1.2). I have also tried
to find the reports which led to this commit but didn't succeed;
everything I found is from much earlier or later. Do you happen to
remember which loads triggered them, what they looked like, or have an
idea how to reproduce them? So far I have been running heavy parallel
fs_mark and kernbench inside a tiny virtual machine, so all of those
triggered direct reclaim all the time.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context
  2016-04-28  8:17       ` Michal Hocko
  (?)
@ 2016-04-28 21:51         ` Dave Chinner
  -1 siblings, 0 replies; 144+ messages in thread
From: Dave Chinner @ 2016-04-28 21:51 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-fsdevel, Andrew Morton, Jan Kara, xfs, LKML

On Thu, Apr 28, 2016 at 10:17:59AM +0200, Michal Hocko wrote:
> [Trim the CC list]
> On Wed 27-04-16 08:58:45, Dave Chinner wrote:
> [...]
> > Often these are to silence lockdep warnings (e.g. commit b17cb36
> > ("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
> > lockdep gets very unhappy about the same functions being called with
> > different reclaim contexts. e.g.  directory block mapping might
> > occur from readdir (no transaction context) or within transactions
> > (create/unlink). hence paths like this are tagged with GFP_NOFS to
> > stop lockdep emitting false positive warnings....
> 
> As already said in other email, I have tried to revert the above
> commit and tried to run it with some fs workloads but didn't manage
> to hit any lockdep splats (after I fixed my bug in the patch 1.2). I
> have tried to find reports which led to this commit but didn't succeed
> much. Everything is from much earlier or later. Do you happen to
> remember which loads triggered them, what they looked like or have an
> idea what to try to reproduce them? So far I was trying heavy parallel
> fs_mark, kernbench inside a tiny virtual machine so any of those have
> triggered direct reclaim all the time.

Most of those issues were reported by users and not reproducible by
any obvious means. They may have been fixed since, but I'm sceptical
of that because, generally speaking, developer testing only catches
the obvious lockdep issues. i.e. it's users that report all the
really twisty issues, and they are generally not reproducible except
under their production workloads...

IOWs, the absence of reports in your testing does not mean there
isn't a problem, and that is one of the biggest problems with
lockdep annotations - we have no way of ever knowing if they are
still necessary or not without exposing users to regressions and
potential deadlocks.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 0/2] scop GFP_NOFS api
  2016-04-26 11:56 ` Michal Hocko
  (?)
@ 2016-04-29  5:35   ` NeilBrown
  -1 siblings, 0 replies; 144+ messages in thread
From: NeilBrown @ 2016-04-29  5:35 UTC (permalink / raw)
  To: Michal Hocko, linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, Theodore Ts'o, Chris Mason,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Tue, Apr 26 2016, Michal Hocko wrote:

> Hi,
> we have discussed this topic at LSF/MM this year. There was a general
> interest in the scope GFP_NOFS allocation context among some FS
> developers. For those who are not aware of the discussion or the issue
> I am trying to sort out (or at least start in that direction) please
> have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> which basically copies what we have for the scope GFP_NOIO allocation
> context. I haven't converted any of the FS myself because that is way
> beyond my area of expertise but I would be happy to help with further
> changes on the MM front as well as in some more generic code paths.
>
> Dave had an idea on how to further improve the reclaim context to be
> less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> and FS specific cookie set in the FS allocation context and consumed
> by the FS reclaim context to allow doing some provably safe actions
> that would be skipped due to GFP_NOFS normally.  I like this idea and
> I believe we can go that direction regardless of the approach taken here.
> Many filesystems simply need to clean up their NOFS usage first before
> diving into more complex changes.

This strikes me as over-engineering to work around an unnecessarily
burdensome interface.... but without details it is hard to be certain.

Exactly what things happen in "FS reclaim context" which may, or may
not, be safe depending on the specific FS allocation context?  Do they
need to happen at all?

My research suggests that for most filesystems the only thing that
happens in reclaim context that is at all troublesome is the final
'evict()' on an inode.  This needs to flush out dirty pages and sync the
inode to storage.  Some time ago we moved most dirty-page writeout out
of the reclaim context and into kswapd.  I think this was an excellent
advance in simplicity.
If we could similarly move evict() into kswapd (and I believe we can)
then most file systems would do nothing in reclaim context that
interferes with allocation context.

The exceptions include:
 - nfs and any filesystem using fscache can block for up to 1 second
   in ->releasepage().  They used to block waiting for some IO, but that
   caused deadlocks and wasn't really needed.  I left the timeout because
   it seemed likely that some throttling would help.  I suspect that a
   careful analysis will show that there is sufficient throttling
   elsewhere.

 - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
   for IO so it can free some quotainfo things.  If it could be changed
   to just schedule the IO without waiting for it then I think this
   would be safe to be called in any FS allocation context.  It already
   uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
   if the lock is held (see the sketch after this list).
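
Illustratively, "schedule the IO without waiting" could look like the
sketch below (xfs_dqlock_nowait() and xfs_dqunlock() are the real
helpers; the dirty-check predicate and the writeback call are
hypothetical stand-ins):

	/* Sketch of one scan step; dqp is a struct xfs_dquot *. */
	if (!xfs_dqlock_nowait(dqp))
		continue;			/* contended: skip, don't block */
	if (dquot_needs_flush(dqp)) {		/* hypothetical predicate */
		queue_dquot_writeback(dqp);	/* hypothetical: start IO only */
		xfs_dqunlock(dqp);
		continue;			/* reclaim it on a later scan */
	}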

I think you/we would end up with a much simpler system if instead of
focussing on the places where GFP_NOFS is used, we focus on places where
__GFP_FS is tested, and try to remove them.  If we get rid of enough of
them the remainder could just use __GFP_IO.
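
For example, the sort of test I mean currently has this shape in a
shrinker (modelled loosely on the superblock shrinker; the names here
are generic):

	static unsigned long fs_cache_scan(struct shrinker *shrink,
					   struct shrink_control *sc)
	{
		unsigned long freed = 0;

		/* the __GFP_FS guard we could try to remove */
		if (!(sc->gfp_mask & __GFP_FS))
			return SHRINK_STOP;

		/* ... walk the cache, freeing objects and counting them ... */
		return freed;
	}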

> The patch 2 is a debugging aid which warns about explicit allocation
> requests from the scope context. This is should help to reduce the
> direct usage of the NOFS flags to bare minimum in favor of the scope
> API. It is not aimed to be merged upstream. I would hope Andrew took it
> into mmotm tree to give it linux-next exposure and allow developers to
> do further cleanups.  There is a new kernel command line parameter which
> has to be used for the debugging to be enabled.
>
> I think the GFP_NOIO should be seeing the same clean up.

I think you are suggesting that use of GFP_NOIO should (largely) be
deprecated in favour of memalloc_noio_save().  I think I agree.
Could we go a step further and deprecate GFP_ATOMIC in favour of some
in_atomic() test?  Maybe that is going too far.
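
To be concrete about the NOIO half of that: the bracketing pattern
already exists today (memalloc_noio_save/restore are in the tree; the
resume helper below is hypothetical):

	unsigned int noio_flags = memalloc_noio_save();
	/* any GFP_KERNEL allocation in here now behaves as GFP_NOIO */
	resume_block_device();		/* hypothetical NOIO-critical work */
	memalloc_noio_restore(noio_flags);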

Thanks,
NeilBrown

>
> Any feedback is highly appreciated.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api
  2016-04-29  5:35   ` NeilBrown
@ 2016-04-29 10:20     ` Steven Whitehouse
  -1 siblings, 0 replies; 144+ messages in thread
From: Steven Whitehouse @ 2016-04-29 10:20 UTC (permalink / raw)
  To: NeilBrown, Michal Hocko, linux-mm, linux-fsdevel
  Cc: linux-nfs, linux-ext4, Theodore Ts'o, linux-ntfs-dev, LKML,
	Dave Chinner, reiserfs-devel, linux-f2fs-devel, logfs,
	cluster-devel, Chris Mason, linux-mtd, Jan Kara, Andrew Morton,
	xfs, ceph-devel, linux-btrfs, linux-afs

Hi,

On 29/04/16 06:35, NeilBrown wrote:
> On Tue, Apr 26 2016, Michal Hocko wrote:
>
>> Hi,
>> we have discussed this topic at LSF/MM this year. There was a general
>> interest in the scope GFP_NOFS allocation context among some FS
>> developers. For those who are not aware of the discussion or the issue
>> I am trying to sort out (or at least start in that direction) please
>> have a look at patch 1 which adds memalloc_nofs_{save,restore} api
>> which basically copies what we have for the scope GFP_NOIO allocation
>> context. I haven't converted any of the FS myself because that is way
>> beyond my area of expertise but I would be happy to help with further
>> changes on the MM front as well as in some more generic code paths.
>>
>> Dave had an idea on how to further improve the reclaim context to be
>> less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
>> and FS specific cookie set in the FS allocation context and consumed
>> by the FS reclaim context to allow doing some provably safe actions
>> that would be skipped due to GFP_NOFS normally.  I like this idea and
>> I believe we can go that direction regardless of the approach taken here.
>> Many filesystems simply need to clean up their NOFS usage first before
>> diving into more complex changes.
> This strikes me as over-engineering to work around an unnecessarily
> burdensome interface.... but without details it is hard to be certain.
>
> Exactly what things happen in "FS reclaim context" which may, or may
> not, be safe depending on the specific FS allocation context?  Do they
> need to happen at all?
>
> My research suggests that for most filesystems the only thing that
> happens in reclaim context that is at all troublesome is the final
> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
> inode to storage.  Some time ago we moved most dirty-page writeout out
> of the reclaim context and into kswapd.  I think this was an excellent
> advance in simplicity.
> If we could similarly move evict() into kswapd (and I believe we can)
> then most file systems would do nothing in reclaim context that
> interferes with allocation context.
evict() is an issue, but moving it into kswapd would be a potential 
problem for GFS2. We already have a memory allocation issue when 
recovery is taking place and memory is short. The code path is as follows:

  1. Inode is scheduled for eviction (which requires deallocation)
  2. The glock is required in order to perform the deallocation, which 
implies getting a DLM lock
  3. Another node in the cluster fails, so needs recovery
  4. When the DLM lock is requested, it gets blocked until recovery is 
complete (for the failed node)
  5. Recovery is performed using a userland fencing utility
  6. Fencing requires memory and then blocks on the eviction
  7. Deadlock (Fencing waiting on memory alloc, memory alloc waiting on 
DLM lock, DLM lock waiting on fencing)

It doesn't happen often, but we've been looking at the best place to
break that cycle, and one of the things we've been wondering is whether
we could avoid deallocation evictions from memory-related contexts, or
at least make them async somehow.

The issue is that it is not possible to know in advance whether an
eviction will result in merely writing things back to disk (because the
inode is being dropped from cache, but still resides on disk), which is
easy to do, or whether it requires a full deallocation (nlink == 0), which
may require significant resources and time.
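
To illustrate the split (a sketch only - the real gfs2 evict path is
more involved, and the deferral helper is hypothetical):

static void evict_sketch(struct inode *inode)
{
	if (inode->i_nlink) {
		/* inode persists on disk: write back and drop the cache */
		truncate_inode_pages_final(&inode->i_data);
		clear_inode(inode);
		return;
	}
	/*
	 * nlink == 0: full deallocation.  Needs the glock and thus a
	 * DLM lock, which can block behind recovery, so (hypothetically)
	 * punt it to a worker rather than run it from a reclaim-tainted
	 * context.
	 */
	queue_deferred_dealloc(inode);	/* hypothetical helper */
}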

Steve.


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 0/2] scop GFP_NOFS api
  2016-04-29  5:35   ` NeilBrown
@ 2016-04-29 12:04     ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-29 12:04 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	Theodore Ts'o, Chris Mason, Jan Kara, ceph-devel,
	cluster-devel, linux-nfs, logfs, xfs, linux-ext4, linux-btrfs,
	linux-mtd, reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel,
	linux-afs, LKML

On Fri 29-04-16 15:35:42, NeilBrown wrote:
> On Tue, Apr 26 2016, Michal Hocko wrote:
> 
> > Hi,
> > we have discussed this topic at LSF/MM this year. There was a general
> > interest in the scope GFP_NOFS allocation context among some FS
> > developers. For those who are not aware of the discussion or the issue
> > I am trying to sort out (or at least start in that direction) please
> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> > which basically copies what we have for the scope GFP_NOIO allocation
> > context. I haven't converted any of the FS myself because that is way
> > beyond my area of expertise but I would be happy to help with further
> > changes on the MM front as well as in some more generic code paths.
> >
> > Dave had an idea on how to further improve the reclaim context to be
> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> > and FS specific cookie set in the FS allocation context and consumed
> > by the FS reclaim context to allow doing some provably safe actions
> > that would be skipped due to GFP_NOFS normally.  I like this idea and
> > I believe we can go that direction regardless of the approach taken here.
> > Many filesystems simply need to clean up their NOFS usage first before
> > diving into more complex changes.
> 
> This strikes me as over-engineering to work around an unnecessarily
> burdensome interface.... but without details it is hard to be certain.
>
> Exactly what things happen in "FS reclaim context" which may, or may
> not, be safe depending on the specific FS allocation context?  Do they
> need to happen at all?

Let me quote Dave Chinner from one of the emails discussed on the
LSF/MM mailing list:
: IMO, making GFP_NOFS "better" cannot be done with context-less flags
: being passed through reclaim. If we want to prevent the recursive
: self-deadlock case in an optimal manner, then we need to be able to
: pass state down to reclaim so that page writeback and the shrinkers
: can determine if they are likely to deadlock.
: 
: IOWs, I think we should stop thinking of GFP_NOFS as a *global*
: directive to avoid recursion under any circumstance and instead
: start thinking about it as a mechanism to avoid recursion in
: specific reclaim contexts.
: 
: Something as simple as adding an opaque cookie (e.g. can hold a
: superblock or inode pointer) to check against in writeback and
: subsystem shrinkers would result in the vast majority of GFP_NOFS
: contexts being able to reclaim from everything but the one context
: that we *might* deadlock against.
: 
: e.g, if we then also check the PF_FSTRANS flag in XFS, we'll
: still be able to reclaim clean inodes, buffers and write back
: dirty pages that don't require transactions to complete under "don't
: recurse" situations because we know it's transactions that we could
: deadlock on in the direct reclaim context.
: 
: Note that this information could be added to the writeback_control
: for page writeback, and it could be passed directly to shrinkers
: in the shrink_control structures. The allocation paths might be a
: little harder, but I suspect using the task struct for passing this
: information into direct reclaim might be the easiest approach...
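
To make that idea more concrete, a rough sketch (nothing below exists
today; the cookie field and the helpers are entirely hypothetical):

static unsigned long fs_cache_shrink_scan(struct shrinker *shrink,
					  struct shrink_control *sc)
{
	/*
	 * Hypothetical: sc->reclaim_cookie identifies the one context
	 * (e.g. a superblock mid-transaction) we could deadlock
	 * against.  Skip only that context instead of bailing out of
	 * FS reclaim altogether.
	 */
	if (sc->reclaim_cookie && sc->reclaim_cookie == my_sb)
		return SHRINK_STOP;

	return prune_fs_caches(sc->nr_to_scan);	/* hypothetical */
}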

> My research suggests that for most filesystems the only thing that
> happens in reclaim context that is at all troublesome is the final
> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
> inode to storage.  Some time ago we moved most dirty-page writeout out
> of the reclaim context and into kswapd.  I think this was an excellent
> advance in simplicity.
> If we could similarly move evict() into kswapd (and I believe we can)
> then most file systems would do nothing in reclaim context that
> interferes with allocation context.
> 
> The exceptions include:
>  - nfs and any filesystem using fscache can block for up to 1 second
>    in ->releasepage().  They used to block waiting for some IO, but that
>    caused deadlocks and wasn't really needed.  I left the timeout because
>    it seemed likely that some throttling would help.  I suspect that a
>    careful analysis will show that there is sufficient throttling
>    elsewhere.
> 
>  - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
>    for IO so it can free some quotainfo things.  If it could be changed
>    to just schedule the IO without waiting for it then I think this
>    would be safe to be called in any FS allocation context.  It already
>    uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
>    if the lock is held.
> 
> I think you/we would end up with a much simpler system if instead of
> focussing on the places where GFP_NOFS is used, we focus on places where
> __GFP_FS is tested, and try to remove them.

One thing I have learned is that shrinkers can be really complex and
getting rid of GFP_NOFS will be really hard, so I would really like to
start the easiest way possible: remove the direct usage and replace it
with the scope API, which would at least _explain_ why it is needed. I
think this is a reasonable _first_ step and a large step ahead because
we have a good chance to get rid of a large number of those which were
used "just because I wasn't sure and this should be safe, right?". I
wouldn't be surprised if we end up with a very small number of both
scope and direct usage in the end.
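
For illustration, the conversion I have in mind looks like this minimal
sketch (API names as in patch 1; the FS callee is made up):

	unsigned int nofs_flags;

	nofs_flags = memalloc_nofs_save();
	/*
	 * The whole scope is implicitly GFP_NOFS: callees can use
	 * plain GFP_KERNEL and the allocator strips __GFP_FS for them.
	 */
	error = fs_run_transaction(sb);		/* hypothetical callee */
	memalloc_nofs_restore(nofs_flags);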

I would also like to revisit the generic inode/dentry shrinker and see
whether it could be more __GFP_FS friendly. As you say, many FS might
not even depend on FS-internal locks there, so pushing the __GFP_FS
check down the layers might make a lot of sense and allow cleaning some
of the [id]cache even from contexts without __GFP_FS.

> If we get rid of enough of them the remainder could just use __GFP_IO.
> 
> > The patch 2 is a debugging aid which warns about explicit allocation
> > requests from the scope context. This should help to reduce the
> > direct usage of the NOFS flags to a bare minimum in favor of the scope
> > API. It is not aimed to be merged upstream. I would hope Andrew took it
> > into mmotm tree to give it linux-next exposure and allow developers to
> > do further cleanups.  There is a new kernel command line parameter which
> > has to be used for the debugging to be enabled.
> >
> > I think the GFP_NOIO should be seeing the same clean up.
> 
> I think you are suggesting that use of GFP_NOIO should (largely) be
> deprecated in favour of memalloc_noio_save().  I think I agree.

Yes that was the idea.

> Could we go a step further and deprecate GFP_ATOMIC in favour of some
> in_atomic() test?  Maybe that is going too far.

I am not really sure we need that; some GFP_NOWAIT usage is deliberate,
to perform an optimistic allocation with a fallback (e.g. a higher-order
allocation for performance reasons with a single-page fallback). So I
think that nowait is a slightly different thing.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context
  2016-04-28 21:51         ` Dave Chinner
@ 2016-04-29 12:12           ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-04-29 12:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, Andrew Morton, Jan Kara, xfs, LKML

On Fri 29-04-16 07:51:45, Dave Chinner wrote:
> On Thu, Apr 28, 2016 at 10:17:59AM +0200, Michal Hocko wrote:
> > [Trim the CC list]
> > On Wed 27-04-16 08:58:45, Dave Chinner wrote:
> > [...]
> > > Often these are to silence lockdep warnings (e.g. commit b17cb36
> > > ("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
> > > lockdep gets very unhappy about the same functions being called with
> > > different reclaim contexts. e.g.  directory block mapping might
> > > occur from readdir (no transaction context) or within transactions
> > > (create/unlink). hence paths like this are tagged with GFP_NOFS to
> > > stop lockdep emitting false positive warnings....
> > 
> > As already said in other email, I have tried to revert the above
> > commit and tried to run it with some fs workloads but didn't manage
> > to hit any lockdep splats (after I fixed my bug in the patch 1.2). I
> > have tried to find reports which led to this commit but didn't succeed
> > much. Everything is from much earlier or later. Do you happen to
> > remember which loads triggered them, what they looked like or have an
> > idea what to try to reproduce them? So far I was trying heavy parallel
> > fs_mark, kernbench inside a tiny virtual machine so any of those have
> > triggered direct reclaim all the time.
> 
> Most of those issues were reported by users and not reproducable by
> any obvious means.

I would really appreciate a reference to some of those (my google-fu has
failed me) or at least a pattern of those splats - was it 
"inconsistent {RECLAIM_FS-ON-[RW]} -> {IN-RECLAIM_FS-[WR]} usage"
or a different class reports?

> They may have been fixed since, but I'm sceptical
> of that because, generally speaking, developer testing only catches
> the obvious lockdep issues. i.e. it's users that report all the
> really twisty issues, and they are generally not reproducable except
> under their production workloads...
> 
> IOWs, the absence of reports in your testing does not mean there
> isn't a problem, and that is one of the biggest problems with
> lockdep annotations - we have no way of ever knowing if they are
> still necessary or not without exposing users to regressions and
> potential deadlocks.....

I understand your points here, but if we are sure that those lockdep
reports are just false positives then we should rather provide an API to
silence lockdep for those paths than abuse GFP_NOFS, which a) hurts
the overall health of reclaim and b) works around a problem that does
not even exist when lockdep is disabled, which is the vast majority of
configurations.
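
To make the alternative concrete: instead of over-restricting the
allocation just to keep lockdep quiet, an annotation could keep full
reclaim available.  An XFS-style sketch (the annotation helpers below
are purely hypothetical; nothing like them exists yet):

	/* today: NOFS solely to silence a lockdep false positive */
	bp = kmem_alloc(len, KM_NOFS);

	/* hypothetical: full reclaim allowed, lockdep told it is safe */
	lockdep_reclaim_trust_begin();		/* does not exist (yet) */
	bp = kmem_alloc(len, KM_SLEEP);
	lockdep_reclaim_trust_end();		/* does not exist (yet) */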

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context
  2016-04-29 12:12           ` Michal Hocko
  (?)
@ 2016-04-29 23:40             ` Dave Chinner
  -1 siblings, 0 replies; 144+ messages in thread
From: Dave Chinner @ 2016-04-29 23:40 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-fsdevel, Andrew Morton, Jan Kara, xfs, LKML

On Fri, Apr 29, 2016 at 02:12:20PM +0200, Michal Hocko wrote:
> On Fri 29-04-16 07:51:45, Dave Chinner wrote:
> > On Thu, Apr 28, 2016 at 10:17:59AM +0200, Michal Hocko wrote:
> > > [Trim the CC list]
> > > On Wed 27-04-16 08:58:45, Dave Chinner wrote:
> > > [...]
> > > > Often these are to silence lockdep warnings (e.g. commit b17cb36
> > > > ("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
> > > > lockdep gets very unhappy about the same functions being called with
> > > > different reclaim contexts. e.g.  directory block mapping might
> > > > occur from readdir (no transaction context) or within transactions
> > > > (create/unlink). hence paths like this are tagged with GFP_NOFS to
> > > > stop lockdep emitting false positive warnings....
> > > 
> > > As already said in other email, I have tried to revert the above
> > > commit and tried to run it with some fs workloads but didn't manage
> > > to hit any lockdep splats (after I fixed my bug in the patch 1.2). I
> > > have tried to find reports which led to this commit but didn't succeed
> > > much. Everything is from much earlier or later. Do you happen to
> > > remember which loads triggered them, what they looked like or have an
> > > idea what to try to reproduce them? So far I was trying heavy parallel
> > > fs_mark, kernbench inside a tiny virtual machine so any of those have
> > > triggered direct reclaim all the time.
> > 
> > Most of those issues were reported by users and not reproducable by
> > any obvious means.
> 
> I would really appreciate a reference to some of those (my google-fu has
> failed me) or at least a pattern of those splats

If you can't find them with google, then I won't. Google is mostly
useless as a patch/mailing list search tool these days. You can try
looking through this list:

https://www.google.com.au/search?q=XFS+lockdep+site:oss.sgi.com+-splice

but I'm not seeing anything particularly relevant in that list -
there isn't a single reclaim related lockdep report in that...

> - was it 
> "inconsistent {RECLAIM_FS-ON-[RW]} -> {IN-RECLAIM_FS-[WR]} usage"
> or a different class reports?

Typically that was involved, but quite often there'd be a number
of locks and sometimes even interrupt stacks in an interaction
between 5 or 6 different processes. Lockdep covers all sorts of
stuff now (like fs freeze annotations as well as locks and memory
reclaim) so sometimes the only thing we can do is remove the
reclaim context from the stack and see if that makes it go away...
> 
> > They may have been fixed since, but I'm sceptical
> > of that because, generally speaking, developer testing only catches
> > the obvious lockdep issues. i.e. it's users that report all the
> > really twisty issues, and they are generally not reproducable except
> > under their production workloads...
> > 
> > IOWs, the absence of reports in your testing does not mean there
> > isn't a problem, and that is one of the biggest problems with
> > lockdep annotations - we have no way of ever knowing if they are
> > still necessary or not without exposing users to regressions and
> > potential deadlocks.....
> 
> I understand your points here but if we are sure that those lockdep
> reports are just false positives then we should rather provide an api to
> silence lockdep for those paths

I agree with this - please provide such infrastructure before we
need it...

> than abusing GFP_NOFS which a) hurts
> the overall reclaim health

Which doesn't actually seem to be a problem for the vast majority of
users.

> and b) works around a non-existing
> problem with lockdep disabled which is the vast majority of
> configurations.

But the moment we have a lockdep problem, we get bug reports from
all over the place and people complaining about it, so we are
*required* to silence them one way or another. And, like I said,
when the choice is simply adding GFP_NOFS or spending a week or two
completely reworking complex code that has functioned correctly for
15 years, the risk/reward *always* falls on the side of "just add
GFP_NOFS".

Please keep in mind that there is as much code in fs/xfs as there is
in the mm/ subsystem, and XFS has twice that in userspace as well.
I say this because we only have 3-4 full-time developers to do
all the work required on this code base, unlike the mm/ subsystem
which had 30-40 full-time MM developers attending LSFMM. This is why
I push back on suggestions that require significant redesign of
subsystem code to handle memory allocation/reclaim quirks - most
subsystems simply don't have the resources available to do such
work, and so will always look for the quick 2 minute fix when it is
available....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 0/2] scop GFP_NOFS api
  2016-04-29  5:35   ` NeilBrown
  (?)
  (?)
@ 2016-04-30  0:11     ` Dave Chinner
  -1 siblings, 0 replies; 144+ messages in thread
From: Dave Chinner @ 2016-04-30  0:11 UTC (permalink / raw)
  To: NeilBrown
  Cc: Michal Hocko, linux-mm, linux-fsdevel, Andrew Morton,
	Theodore Ts'o, Chris Mason, Jan Kara, ceph-devel,
	cluster-devel, linux-nfs, logfs, xfs, linux-ext4, linux-btrfs,
	linux-mtd, reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel,
	linux-afs, LKML

On Fri, Apr 29, 2016 at 03:35:42PM +1000, NeilBrown wrote:
> On Tue, Apr 26 2016, Michal Hocko wrote:
> 
> > Hi,
> > we have discussed this topic at LSF/MM this year. There was a general
> > interest in the scope GFP_NOFS allocation context among some FS
> > developers. For those who are not aware of the discussion or the issue
> > I am trying to sort out (or at least start in that direction) please
> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> > which basically copies what we have for the scope GFP_NOIO allocation
> > context. I haven't converted any of the FS myself because that is way
> > beyond my area of expertise but I would be happy to help with further
> > changes on the MM front as well as in some more generic code paths.
> >
> > Dave had an idea on how to further improve the reclaim context to be
> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> > and FS specific cookie set in the FS allocation context and consumed
> > by the FS reclaim context to allow doing some provably save actions
> > that would be skipped due to GFP_NOFS normally.  I like this idea and
> > I believe we can go that direction regardless of the approach taken here.
> > Many filesystems simply need to cleanup their NOFS usage first before
> > diving into a more complex changes.>
> 
> This strikes me as over-engineering to work around an unnecessarily
> burdensome interface.... but without details it is hard to be certain.
> 
> Exactly what things happen in "FS reclaim context" which may, or may
> not, be safe depending on the specific FS allocation context?  Do they
> need to happen at all?
> 
> My research suggests that for most filesystems the only thing that
> happens in reclaim context that is at all troublesome is the final
> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
> inode to storage.  Some time ago we moved most dirty-page writeout out
> of the reclaim context and into kswapd.  I think this was an excellent
> advance in simplicity.

No, we didn't move dirty page writeout to kswapd - we moved it back
to the background writeback threads where it can be done
efficiently.  kswapd should almost never do single page writeback
because of how inefficient it is from an IO perspective, even though
it can. i.e. if we are doing any significant amount of dirty page
writeback from memory reclaim (direct, kswapd or otherwise) then
we've screwed something up.

> If we could similarly move evict() into kswapd (and I believe we can)
> then most file systems would do nothing in reclaim context that
> interferes with allocation context.

When lots of GFP_NOFS allocation is being done, this already
happens. The shrinkers that can't run due to context accumulate the
work on the shrinker structure, and when the shrinker can next run
(e.g. run from kswapd) it runs all the deferred work from GFP_NOFS
reclaim contexts.

IOWs, we already move shrinker work from direct reclaim to kswapd
when appropriate.
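
From the shrinker side, that deferral pattern looks roughly like the
sketch below (demo_* names are made up). Returning SHRINK_STOP does
not lose the work; vmscan accumulates the skipped scan count in the
shrinker's nr_deferred and hands it to the next caller that scans
from a safe context, such as kswapd:

#include <linux/shrinker.h>

static unsigned long demo_reclaim_objects(unsigned long nr)
{
	return nr;	/* stand-in for actually freeing nr objects */
}

static unsigned long demo_scan_objects(struct shrinker *shrink,
				       struct shrink_control *sc)
{
	/* We cannot recurse into the fs from this allocation context. */
	if (!(sc->gfp_mask & __GFP_FS))
		return SHRINK_STOP;	/* defer; the work is accumulated */

	return demo_reclaim_objects(sc->nr_to_scan);
}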

> The exceptions include:
>  - nfs and any filesystem using fscache can block for up to 1 second
>    in ->releasepage().  They used to block waiting for some IO, but that
>    caused deadlocks and wasn't really needed.  I left the timeout because
>    it seemed likely that some throttling would help.  I suspect that a
>    careful analysis will show that there is sufficient throttling
>    elsewhere.
> 
>  - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
>    for IO so it can free some quotainfo things. 

No it's not. evict() can block on IO - waiting for data or inode
writeback to complete, or even for filesystems to run transactions
on the inode. Hence the superblock shrinker can and does block in
inode cache reclaim.

Indeed, blocking the superblock shrinker in reclaim is a key part of
balancing inode cache pressure in XFS. If the shrinker starts
hitting dirty inodes, it blocks on cleaning them, thereby slowing
the rate of allocation to that which inodes can be cleaned and
reclaimed. There are also background threads that walk ahead freeing
clean inodes, but we have to throttle direct reclaim in this manner
otherwise the allocation pressure vastly outweighs the ability to
reclaim inodes. If we don't balance this, inode allocation triggers
the OOM killer because reclaim keeps reporting "no progress being
made" as dirty inodes are skipped. By blocking on such inodes,
the shrinker makes progress (slowly) and reclaim sees that memory is
being freed and so continues without invoking the OOM killer...

>    If it could be changed
>    to just schedule the IO without waiting for it then I think this
>    would be safe to be called in any FS allocation context.  It already
>    uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
>    if the lock is held.

We could, but then we have the same problem as the inode cache -
there's no indication of progress going back to the memory reclaim
subsystem, nor is reclaim able to throttle memory allocation back to
the rate at which reclaim is making progress.

There's feedback loops all throughout the XFS reclaim code - it's
designed specifically that way - I made changes to the shrinker
infrastructure years ago to enable this. It's no different to the
dirty page throttling that was done at roughly the same time -
that's also one big feedback loop controlled by the rate at which
pages can be cleaned. Indeed, it was designed on the same
premise as all the XFS shrinker code: in steady state conditions
we can't allocate a resource faster than we can reclaim it, so we
need to make reclaim as efficient as possible...

> I think you/we would end up with a much simpler system if instead of
> focussing on the places where GFP_NOFS is used, we focus on places where
> __GFP_FS is tested, and try to remove them.  If we get rid of enough of
> them the remainder could just use __GFP_IO.

The problem with this is that a single kswapd thread can't keep up
with all of the allocation pressure that occurs. e.g. a 20-core
intel CPU with local memory will be seen as a single node and so
will have a single kswapd thread to do reclaim. There's a massive
imbalance between maximum reclaim rate and maximum allocation rate
in situations like this. If we want memory reclaim to run faster,
we need to be able to do more work *now*, not defer it to a context with
limited execution resources.

i.e. IMO deferring more work to a single reclaim thread per node is
going to limit memory reclaim scalability and performance, not
improve it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 0/2] scop GFP_NOFS api
  2016-04-29 12:04     ` Michal Hocko
  (?)
  (?)
@ 2016-04-30  0:24       ` Dave Chinner
  -1 siblings, 0 replies; 144+ messages in thread
From: Dave Chinner @ 2016-04-30  0:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: NeilBrown, linux-mm, linux-fsdevel, Andrew Morton,
	Theodore Ts'o, Chris Mason, Jan Kara, ceph-devel,
	cluster-devel, linux-nfs, logfs, xfs, linux-ext4, linux-btrfs,
	linux-mtd, reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel,
	linux-afs, LKML

On Fri, Apr 29, 2016 at 02:04:18PM +0200, Michal Hocko wrote:
> I would also like to revisit generic inode/dentry shrinker and see
> whether it could be more __GFP_FS friendly. As you say many FS might
> even not depend on some FS internal locks so pushing GFP_FS check down
> the layers might make a lot of sense and allow to clean some [id]cache
> even for __GFP_FS context.

That's precisely my point about passing a context to the shrinker.
It's recursion within a single superblock context that makes up the
majority of cases GFP_NOFS is used for, so passing the superblock
immediately allows for reclaim to run the superblock shrinker on
every other superblock.

We can refine it further from there, but I strongly suspect that
further refinement is going to require filesystems to specifically
configure the superblock shrinker.

e.g. in XFS, we can't allow evict() even on clean VFS inodes in a
PF_FSTRANS context, because we may run a transaction on a clean
VFS inode to prepare it for reclaim.  We can, however,
allow the fs-specific shrinker callouts to run (i.e. call into
.free_cached_objects) so that it can reclaim clean XFS inodes
because that doesn't require transactions....
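
For reference, those callouts hang off super_operations; a minimal
sketch of the shape (the demo_* bodies are placeholders):

#include <linux/fs.h>

static long demo_nr_cached_objects(struct super_block *sb,
				   struct shrink_control *sc)
{
	return 0;	/* stand-in: count of internally reclaimable objects */
}

static long demo_free_cached_objects(struct super_block *sb,
				     struct shrink_control *sc)
{
	return 0;	/* stand-in: free up to sc->nr_to_scan objects */
}

static const struct super_operations demo_sops = {
	.nr_cached_objects	= demo_nr_cached_objects,
	.free_cached_objects	= demo_free_cached_objects,
};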

i.e. the infrastructure I suggested we use is aimed directly at
providing the mechanism required for finer-grained inode/dentry
cache reclaim in contexts that it is currently disallowed
completely. I was also implying that once the infrastructure to pass
contexts is in place, I'd then make the changes to the generic
superblock shrinker code to enable finer grained reclaim and
optimise the XFS shrinkers to make use of it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api
@ 2016-04-30 21:17       ` NeilBrown
  0 siblings, 0 replies; 144+ messages in thread
From: NeilBrown @ 2016-04-30 21:17 UTC (permalink / raw)
  To: Steven Whitehouse, Michal Hocko, linux-mm, linux-fsdevel
  Cc: linux-nfs, linux-ext4, Theodore Ts'o, linux-ntfs-dev, LKML,
	Dave Chinner, reiserfs-devel, linux-f2fs-devel, logfs,
	cluster-devel, Chris Mason, linux-mtd, Jan Kara, Andrew Morton,
	xfs, ceph-devel, linux-btrfs, linux-afs, cluster-devel

[-- Attachment #1: Type: text/plain, Size: 2973 bytes --]

On Fri, Apr 29 2016, Steven Whitehouse wrote:

> Hi,
>
> On 29/04/16 06:35, NeilBrown wrote:
>> If we could similarly move evict() into kswapd (and I believe we can)
>> then most file systems would do nothing in reclaim context that
>> interferes with allocation context.
> evict() is an issue, but moving it into kswapd would be a potential 
> problem for GFS2. We already have a memory allocation issue when 
> recovery is taking place and memory is short. The code path is as follows:
>
>   1. Inode is scheduled for eviction (which requires deallocation)
>   2. The glock is required in order to perform the deallocation, which 
> implies getting a DLM lock
>   3. Another node in the cluster fails, so needs recovery
>   4. When the DLM lock is requested, it gets blocked until recovery is 
> complete (for the failed node)
>   5. Recovery is performed using a userland fencing utility
>   6. Fencing requires memory and then blocks on the eviction
>   7. Deadlock (Fencing waiting on memory alloc, memory alloc waiting on 
> DLM lock, DLM lock waiting on fencing)

You even have user-space in the loop there - impressive!  You can't
really pass GFP_NOFS to a user-space thread, can you :-?

>
> It doesn't happen often, but we've been looking at the best place to 
> break that cycle, and one of the things we've been wondering is whether 
> we could avoid deallocation evictions from memory-related contexts, or 
> at least make it async somehow.

I think "async" is definitely the answer and I think
evict()/evict_inode() is the best place to focus attention.

I can see now (thanks) that just moving the evict() call to kswapd isn't
really a solution, as it would just block kswapd and so prevent lots of
other memory from being freed.

I'm now imagining giving ->evict_inode() a "don't sleep" flag and
allowing it to return -EAGAIN.  In that case evict would queue the inode
to kswapd (or maybe another thread) for periodic retry.

The flag would only get set when prune_icache_sb() calls dispose_list()
to call evict().  Other uses (e.g. unmount, iput) would still be
synchronous.

How difficult would it be to change GFS2's evict_inode() to optionally
never block?

For this to work we would need to add a way for
deactivate_locked_super() to wait for all the async evictions to
complete.  Currently prune_icache_sb() is called under s_umount.  If we
moved part of the eviction out of that lock some other synchronization
would be needed.  Possibly a per-superblock list of "inodes being
evicted" would suffice.

Thanks,
NeilBrown


>
> The issue is that it is not possible to know in advance whether an 
> eviction will result in merely writing things back to disk (because the 
> inode is being dropped from cache, but still resides on disk), which is 
> easy to do, or whether it requires a full deallocation (i_nlink == 0), which 
> may require significant resources and time.
>
> Steve.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 0/2] scop GFP_NOFS api
  2016-04-29 12:04     ` Michal Hocko
@ 2016-04-30 21:55       ` NeilBrown
  0 siblings, 0 replies; 144+ messages in thread
From: NeilBrown @ 2016-04-30 21:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	Theodore Ts'o, Chris Mason, Jan Kara, ceph-devel,
	cluster-devel, linux-nfs, logfs, xfs, linux-ext4, linux-btrfs,
	linux-mtd, reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel,
	linux-afs, LKML

[-- Attachment #1: Type: text/plain, Size: 3249 bytes --]

On Fri, Apr 29 2016, Michal Hocko wrote:

>
> One thing I have learned is that shrinkers can be really complex and
> getting rid of GFP_NOFS will be really hard, so I would really like to
> start the easiest way possible and remove the direct usage, replacing
> it by the scope one, which would at least _explain_ why it is needed. I think
> this is a reasonable _first_ step and a large step ahead, because we have
> a good chance to get rid of a large number of those which were used
> "just because I wasn't sure and this should be safe, right?". I wouldn't
> be surprised if we end up with a very small number of both scope and
> direct usage in the end.

Yes, shrinkers can be complex.  About two of them are.  We could fix
lots and lots of call sites, or fix two shrinkers.
OK, that's a bit unfair as fixing one of the shrinkers involves changing
many ->evict_inode() functions.  But that would be a very focused
change.

I think your proposal is little more than re-arranging deck chairs on
the Titanic.  Yes, it might give everybody a better view of the iceberg,
but the iceberg is still there, and in reality we can already see it.

The main iceberg is evict_inode.  It appears in both examples given so
far: xfs and gfs.  There are other little icebergs but they won't last
long after evict_inode is dealt with.

One particular problem with your process-context idea is that it isn't
inherited across threads.
Steve Whitehouse's example in GFS2 shows how allocation dependencies can
even cross into user space.

A more localized one that I have seen is that NFSv4 sometimes needs to
start up a state-management thread (particularly if the server
restarted).
It uses kthread_run(), which doesn't actually create the thread but asks
kthreadd to do it.  If NFS writeout is waiting for state management, it
would need to make sure that kthreadd runs in allocation context to
avoid deadlock.
I feel that I've forgotten some important detail here and this might
have been fixed somehow, but the point still stands: the allocation
context can cross from thread to thread and can effectively become
anything and everything.
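
To illustrate (using the memalloc_nofs_{save,restore} API proposed in
patch 1; the NFSv4 details here are from memory and the wrapper
function is invented, so treat this as a sketch of the shape of the
problem rather than the actual code):

int write_back_needing_recovery(struct nfs_client *clp)
{
        unsigned int flags = memalloc_nofs_save();      /* patch 1 API */
        struct task_struct *t;

        /*
         * kthread_run() doesn't create the thread here -- it queues a
         * request to kthreadd.  Neither kthreadd nor the new thread
         * inherits our NOFS scope, so their allocations are plain
         * GFP_KERNEL and can recurse into the very filesystem whose
         * writeout is waiting on this recovery.
         */
        t = kthread_run(nfs4_run_state_manager, clp, "%s-manager",
                        clp->cl_hostname);

        memalloc_nofs_restore(flags);
        return PTR_ERR_OR_ZERO(t);
}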

It is OK to wait for memory to be freed.  It is not OK to wait for any
particular piece of memory to be freed because you don't always know who
is waiting for you, or who you really are waiting on to free that
memory.

Whenever trying to free memory I think you need to do best-effort
without blocking.
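
In shrinker terms that rule looks something like this (struct my_cache,
struct my_obj and obj_is_busy() are invented for illustration):

struct my_obj {
        struct list_head lru;
        /* ... payload ... */
};

struct my_cache {
        spinlock_t       lock;
        struct list_head lru;
};

static unsigned long prune_cache_nowait(struct my_cache *c)
{
        struct my_obj *obj, *tmp;
        unsigned long freed = 0;

        if (!spin_trylock(&c->lock))
                return 0;               /* contended: skip, never wait */

        list_for_each_entry_safe(obj, tmp, &c->lru, lru) {
                if (obj_is_busy(obj))   /* assumed helper */
                        continue;       /* busy: skip, never wait */
                list_del(&obj->lru);
                kfree(obj);
                freed++;
        }
        spin_unlock(&c->lock);
        return freed;
}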

>
> I would also like to revisit the generic inode/dentry shrinker and see
> whether it could be more __GFP_FS friendly. As you say, many filesystems might
> not even depend on some FS-internal locks, so pushing the GFP_FS check down
> the layers might make a lot of sense and allow cleaning some [id]cache
> even for __GFP_FS context.

I think the only part of prune_dcache_sb() that might need care is
iput(), which boils down to evict().  The final unlink for an NFS
silly-rename might happen in there too (in d_iput).
Shrinking the dcache seems rather late to be performing that unlink,
though, so I've probably missed some key detail.

If we find a way for evict(), when called from the shrinker, to be
non-blocking, and generally require all shrinkers to be non-blocking,
then many of these allocation problems evaporate.

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 0/2] scop GFP_NOFS api
  2016-04-30  0:11     ` Dave Chinner
@ 2016-04-30 22:19       ` NeilBrown
  0 siblings, 0 replies; 144+ messages in thread
From: NeilBrown @ 2016-04-30 22:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Michal Hocko, linux-mm, linux-fsdevel, Andrew Morton,
	Theodore Ts'o, Chris Mason, Jan Kara, ceph-devel,
	cluster-devel, linux-nfs, logfs, xfs, linux-ext4, linux-btrfs,
	linux-mtd, reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel,
	linux-afs, LKML

[-- Attachment #1: Type: text/plain, Size: 8598 bytes --]

On Sat, Apr 30 2016, Dave Chinner wrote:

> On Fri, Apr 29, 2016 at 03:35:42PM +1000, NeilBrown wrote:
>> On Tue, Apr 26 2016, Michal Hocko wrote:
>> 
>> > Hi,
>> > we have discussed this topic at LSF/MM this year. There was a general
>> > interest in the scope GFP_NOFS allocation context among some FS
>> > developers. For those who are not aware of the discussion or the issue
>> > I am trying to sort out (or at least start in that direction) please
>> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
>> > which basically copies what we have for the scope GFP_NOIO allocation
>> > context. I haven't converted any of the FS myself because that is way
>> > beyond my area of expertise but I would be happy to help with further
>> > changes on the MM front as well as in some more generic code paths.
>> >
>> > Dave had an idea on how to further improve the reclaim context to be
>> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
>> > and FS specific cookie set in the FS allocation context and consumed
>> > by the FS reclaim context to allow doing some provably safe actions
>> > that would be skipped due to GFP_NOFS normally.  I like this idea and
>> > I believe we can go that direction regardless of the approach taken here.
>> > Many filesystems simply need to clean up their NOFS usage first before
>> > diving into more complex changes.
>> 
>> This strikes me as over-engineering to work around an unnecessarily
>> burdensome interface.... but without details it is hard to be certain.
>> 
>> Exactly what things happen in "FS reclaim context" which may, or may
>> not, be safe depending on the specific FS allocation context?  Do they
>> need to happen at all?
>> 
>> My research suggests that for most filesystems the only thing that
>> happens in reclaim context that is at all troublesome is the final
>> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
>> inode to storage.  Some time ago we moved most dirty-page writeout out
>> of the reclaim context and into kswapd.  I think this was an excellent
>> advance in simplicity.
>
> No, we didn't move dirty page writeout to kswapd - we moved it back
> to the background writeback threads where it can be done
> efficiently.  kswapd should almost never do single page writeback
> because of how inefficient it is from an IO perspective, even though
> it can. i.e. if we are doing any significant amount of dirty page
> writeback from memory reclaim (direct, kswapd or otherwise) then
> we've screwed something up.
>
>> If we could similarly move evict() into kswapd (and I believe we can)
>> then most file systems would do nothing in reclaim context that
>> interferes with allocation context.
>
> When lots of GFP_NOFS allocation is being done, this already
> happens. The shrinkers that can't run due to context accumulate the
> work on the shrinker structure, and when the shrinker can next run
> (e.g. run from kswapd) it runs all the deferred work from GFP_NOFS
> reclaim contexts.
>
> IOWs, we already move shrinker work from direct reclaim to kswapd
> when appropriate.
>
>> The exceptions include:
>>  - nfs and any filesystem using fscache can block for up to 1 second
>>    in ->releasepage().  They used to block waiting for some IO, but that
>>    caused deadlocks and wasn't really needed.  I left the timeout because
>>    it seemed likely that some throttling would help.  I suspect that a
>>    careful analysis will show that there is sufficient throttling
>>    elsewhere.
>> 
>>  - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
>>    for IO so it can free some quotainfo things. 
>
> No it's not. evict() can block on IO - waiting for data or inode
> writeback to complete, or even for filesystems to run transactions
> on the inode. Hence the superblock shrinker can and does block in
> inode cache reclaim.

That is why I said "nearly" :-)

>
> Indeed, blocking the superblock shrinker in reclaim is a key part of
> balancing inode cache pressure in XFS. If the shrinker starts
> hitting dirty inodes, it blocks on cleaning them, thereby slowing
> the rate of allocation to that which inodes can be cleaned and
> reclaimed. There are also background threads that walk ahead freeing
> clean inodes, but we have to throttle direct reclaim in this manner
> otherwise the allocation pressure vastly outweighs the ability to
> reclaim inodes. If we don't balance this, inode allocation triggers
> the OOM killer because reclaim keeps reporting "no progress being
> made", since dirty inodes are skipped. By blocking on such inodes,
> the shrinker makes progress (slowly) and reclaim sees that memory is
> being freed and so continues without invoking the OOM killer...

I'm very aware of the need to throttle allocation based on IO.  I
remember when NFS didn't quite get this right and filled up memory :-)

balance_dirty_pages() used to force threads to wait on the write-out of
one page for every page that they dirtied (or wait on 128 pages for every 128
dirtied or whatever).  This was exactly to provide the sort of
throttling you are talking about.

We don't do that any more.  It was problematic.  I don't recall all the
reasons but I think that different backing devices having different
clearance rates was part of the problem.
So now we monitor clearance rates and wait for some number of blocks to
be written, rather than waiting for some specific blocks to be written.

We should be able to do the same thing to balance dirty inodes as we do
to balance dirty pages.
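
By analogy, something like the following could throttle on the global
rate of inode reclaim (every identifier here -- the counter, the
waitqueue, the batch size -- is invented for illustration):

static atomic_long_t reclaimed_inodes;
static DECLARE_WAIT_QUEUE_HEAD(inode_reclaim_wq);

#define INODE_THROTTLE_BATCH    128

/* Called by an allocator that needs to wait for inode reclaim. */
static void balance_reclaimed_inodes(void)
{
        long target = atomic_long_read(&reclaimed_inodes) +
                      INODE_THROTTLE_BATCH;

        /* Wait for *some* inodes to be reclaimed, never a specific one. */
        wait_event(inode_reclaim_wq,
                   atomic_long_read(&reclaimed_inodes) >= target);
}

/* Called by any reclaim path as it frees an inode. */
static void note_inode_reclaimed(void)
{
        atomic_long_inc(&reclaimed_inodes);
        wake_up(&inode_reclaim_wq);
}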


>
>>    If it could be changed
>>    to just schedule the IO without waiting for it then I think this
>>    would be safe to be called in any FS allocation context.  It already
>>    uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
>>    if the lock is held.
>
> We could, but then we have the same problem as the inode cache -
> there's no indication of progress going back to the memory reclaim
> subsystem, nor is reclaim able to throttle memory allocation back to
> the rate at which reclaim is making progress.
>
> There's feedback loops all throughout the XFS reclaim code - it's
> designed specifically that way - I made changes to the shrinker
> infrastructure years ago to enable this. It's no different to the
> dirty page throttling that was done at roughly the same time -
> that's also one big feedback loop controlled by the rate at which
> pages can be cleaned. Indeed, it was designed based on the same
> premise as all the XFS shrinker code: in steady-state conditions
> we can't allocate a resource faster than we can reclaim it, so we
> need to make reclaim as efficient as possible...

You seem to be referring here to the same change that I referred to
above, but to be seeing it from a different perspective.

Waiting for inodes to be freed is important.  Waiting for any one
specific inode to be freed is dangerous.

>
>> I think you/we would end up with a much simpler system if instead of
>> focussing on the places where GFP_NOFS is used, we focus on places where
>> __GFP_FS is tested, and try to remove them.  If we get rid of enough of
>> them the remainder could just use __GFP_IO.
>
> The problem with this is that a single kswapd thread can't keep up
> with all of the allocation pressure that occurs. e.g. a 20-core
> intel CPU with local memory will be seen as a single node and so
> will have a single kswapd thread to do reclaim. There's a massive
> imbalance between maximum reclaim rate and maximum allocation rate
> in situations like this. If we want memory reclaim to run faster,
> we to be able to do more work *now*, not defer it to a context with
> limited execution resources.

I agree it makes sense to do *work* locally on the core that sees the
need.  What I want to avoid is *sleeping* locally.
How would it be to make evict_inode() non-blocking?  It would do as much
work as it can, which in many cases would presumably complete the whole
task.  But if it cannot make progress for some reason, it returns -EAGAIN
and the rest gets queued for later.
Might that work, or is there often still lots of CPU work to be done
after things have been waited for?

Thanks,
NeilBrown


>
> i.e. IMO deferring more work to a single reclaim thread per node is
> going to limit memory reclaim scalability and performance, not
> improve it.
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 0/2] scop GFP_NOFS api
@ 2016-04-30 22:19       ` NeilBrown
  0 siblings, 0 replies; 144+ messages in thread
From: NeilBrown @ 2016-04-30 22:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nfs, linux-ext4, Theodore Ts'o, Chris Mason,
	linux-ntfs-dev, LKML, xfs, reiserfs-devel, linux-f2fs-devel,
	Michal Hocko, cluster-devel, linux-mm, linux-mtd, linux-fsdevel,
	Jan Kara, Andrew Morton, logfs, ceph-devel, linux-btrfs,
	linux-afs


[-- Attachment #1.1: Type: text/plain, Size: 8598 bytes --]

On Sat, Apr 30 2016, Dave Chinner wrote:

> On Fri, Apr 29, 2016 at 03:35:42PM +1000, NeilBrown wrote:
>> On Tue, Apr 26 2016, Michal Hocko wrote:
>> 
>> > Hi,
>> > we have discussed this topic at LSF/MM this year. There was a general
>> > interest in the scope GFP_NOFS allocation context among some FS
>> > developers. For those who are not aware of the discussion or the issue
>> > I am trying to sort out (or at least start in that direction) please
>> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
>> > which basically copies what we have for the scope GFP_NOIO allocation
>> > context. I haven't converted any of the FS myself because that is way
>> > beyond my area of expertise but I would be happy to help with further
>> > changes on the MM front as well as in some more generic code paths.
>> >
>> > Dave had an idea on how to further improve the reclaim context to be
>> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
>> > and FS specific cookie set in the FS allocation context and consumed
>> > by the FS reclaim context to allow doing some provably save actions
>> > that would be skipped due to GFP_NOFS normally.  I like this idea and
>> > I believe we can go that direction regardless of the approach taken here.
>> > Many filesystems simply need to cleanup their NOFS usage first before
>> > diving into a more complex changes.>
>> 
>> This strikes me as over-engineering to work around an unnecessarily
>> burdensome interface.... but without details it is hard to be certain.
>> 
>> Exactly what things happen in "FS reclaim context" which may, or may
>> not, be safe depending on the specific FS allocation context?  Do they
>> need to happen at all?
>> 
>> My research suggests that for most filesystems the only thing that
>> happens in reclaim context that is at all troublesome is the final
>> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
>> inode to storage.  Some time ago we moved most dirty-page writeout out
>> of the reclaim context and into kswapd.  I think this was an excellent
>> advance in simplicity.
>
> No, we didn't move dirty page writeout to kswapd - we moved it back
> to the background writeback threads where it can be done
> efficiently.  kswapd should almost never do single page writeback
> because of how inefficient it is from an IO perspective, even though
> it can. i.e. if we are doing any significant amount of dirty page
> writeback from memory reclaim (direct, kswapd or otherwise) then
> we've screwed something up.
>
>> If we could similarly move evict() into kswapd (and I believe we can)
>> then most file systems would do nothing in reclaim context that
>> interferes with allocation context.
>
> When lots of GFP_NOFS allocation is being done, this already
> happens. The shrinkers that can't run due to context accumulate the
> work on the shrinker structure, and when the shrinker can next run
> (e.g. run from kswapd) it runs all the deferred work from GFP_NOFS
> reclaim contexts.
>
> IOWs, we already move shrinker work from direct reclaim to kswapd
> when appropriate.
>
>> The exceptions include:
>>  - nfs and any filesystem using fscache can block for up to 1 second
>>    in ->releasepage().  They used to block waiting for some IO, but that
>>    caused deadlocks and wasn't really needed.  I left the timeout because
>>    it seemed likely that some throttling would help.  I suspect that a
>>    careful analysis will show that there is sufficient throttling
>>    elsewhere.
>> 
>>  - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
>>    for IO so it can free some quotainfo things. 
>
> No it's not. evict() can block on IO - waiting for data or inode
> writeback to complete, or even for filesystems to run transactions
> on the inode. Hence the superblock shrinker can and does block in
> inode cache reclaim.

That is why I said "nearly" :-)

>
> Indeed, blocking the superblock shrinker in reclaim is a key part of
> balancing inode cache pressure in XFS. If the shrinker starts
> hitting dirty inodes, it blocks on cleaning them, thereby slowing
> the rate of allocation to that which inodes can be cleaned and
> reclaimed. There are also background threads that walk ahead freeing
> clean inodes, but we have to throttle direct reclaim in this manner
> otherwise the allocation pressure vastly outweighs the ability to
> reclaim inodes. if we don't balance this, inode allocation triggers
> the OOM killer because reclaim keeps reporting "no progress being
> made" because dirty inodes are skipped. BY blocking on such inodes,
> the shrinker makes progress (slowly) and reclaim sees that memory is
> being freed and so continues without invoking the OOM killer...

I'm very aware of the need to throttle allocation based on IO.  I
remember when NFS didn't quite get this right and filled up memory :-)

balance_dirty_pages() used to force threads to wait on the write-out of
one page for every page that they dirtied (or wait on 128 pages for every 128
dirtied or whatever).  This was exactly to provide the sort of
throttling you are talking about.

We don't do that any more.  It was problematic.  I don't recall all the
reasons but I think that different backing devices having different
clearance rates was part of the problem.
So now we monitor clearance rates and wait for some number of blocks to
be written, rather than waiting for some specific blocks to be written.

We should be able to do the same thing to balance dirty inodes as we do
to balance dirty pages.


>
>>    If it could be changed
>>    to just schedule the IO without waiting for it then I think this
>>    would be safe to be called in any FS allocation context.  It already
>>    uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
>>    if the lock is held.
>
> We could, but then we have the same problem as the inode cache -
> there's no indication of progress going back to the memory reclaim
> subsystem, nor is reclaim able to throttle memory allocation back to
> the rate at which reclaim is making progress.
>
> There's feedback loops all throughout the XFS reclaim code - it's
> designed specifically that way - I made changes to the shrinker
> infrastructure years ago to enable this. It's no different to the
> dirty page throttling that was done at roughly the same time -
> that's also one big feedback loop controlled by the rate at which
> pages can be cleaned. Indeed, it was designed was based on the same
> premise as all the XFS shrinker code: in steady state conditions
> we can't allocate a resource faster than we can reclaim it, so we
> need to make reclaim as efficient at possible...

You seem to be referring here to the same change that I was referred to
above, but seem to be seeing it from a different perspective.

Waiting for inodes to be freed in important.  Waiting for any one
specific inode to be freed is dangerous.

>
>> I think you/we would end up with a much simpler system if instead of
>> focussing on the places where GFP_NOFS is used, we focus on places where
>> __GFP_FS is tested, and try to remove them.  If we get rid of enough of
>> them the remainder could just use __GFP_IO.
>
> The problem with this is that a single kswapd thread can't keep up
> with all of the allocation pressure that occurs. e.g. a 20-core
> intel CPU with local memory will be seen as a single node and so
> will have a single kswapd thread to do reclaim. There's a massive
> imbalance between maximum reclaim rate and maximum allocation rate
> in situations like this. If we want memory reclaim to run faster,
> we to be able to do more work *now*, not defer it to a context with
> limited execution resources.

I agree it makes sense to do *work* locally on the core that sees the
need.  What I want to avoid is *sleeping* locally.
How would it be to make evict_inode non-blocking?  It would do as much work
as it can, which in many cases would presumably complete the whole task.
But if it cannot progress for some reason, it returns -EAGAIN and then
the rest gets queued for later.
Might that work, or is there often still lots of CPU work to be done
after things have been waited for?

Thanks,
NeilBrown


>
> i.e. IMO deferring more work to a single reclaim thread per node is
> going to limit memory reclaim scalability and performance, not
> improve it.
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api
@ 2016-04-30 22:19       ` NeilBrown
  0 siblings, 0 replies; 144+ messages in thread
From: NeilBrown @ 2016-04-30 22:19 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Sat, Apr 30 2016, Dave Chinner wrote:

> On Fri, Apr 29, 2016 at 03:35:42PM +1000, NeilBrown wrote:
>> On Tue, Apr 26 2016, Michal Hocko wrote:
>> 
>> > Hi,
>> > we have discussed this topic at LSF/MM this year. There was a general
>> > interest in the scope GFP_NOFS allocation context among some FS
>> > developers. For those who are not aware of the discussion or the issue
>> > I am trying to sort out (or at least start in that direction) please
>> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
>> > which basically copies what we have for the scope GFP_NOIO allocation
>> > context. I haven't converted any of the FS myself because that is way
>> > beyond my area of expertise but I would be happy to help with further
>> > changes on the MM front as well as in some more generic code paths.
>> >
>> > Dave had an idea on how to further improve the reclaim context to be
>> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
>> > and FS specific cookie set in the FS allocation context and consumed
>> > by the FS reclaim context to allow doing some provably save actions
>> > that would be skipped due to GFP_NOFS normally.  I like this idea and
>> > I believe we can go that direction regardless of the approach taken here.
>> > Many filesystems simply need to cleanup their NOFS usage first before
>> > diving into a more complex changes.>
>> 
>> This strikes me as over-engineering to work around an unnecessarily
>> burdensome interface.... but without details it is hard to be certain.
>> 
>> Exactly what things happen in "FS reclaim context" which may, or may
>> not, be safe depending on the specific FS allocation context?  Do they
>> need to happen at all?
>> 
>> My research suggests that for most filesystems the only thing that
>> happens in reclaim context that is at all troublesome is the final
>> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
>> inode to storage.  Some time ago we moved most dirty-page writeout out
>> of the reclaim context and into kswapd.  I think this was an excellent
>> advance in simplicity.
>
> No, we didn't move dirty page writeout to kswapd - we moved it back
> to the background writeback threads where it can be done
> efficiently.  kswapd should almost never do single page writeback
> because of how inefficient it is from an IO perspective, even though
> it can. i.e. if we are doing any significant amount of dirty page
> writeback from memory reclaim (direct, kswapd or otherwise) then
> we've screwed something up.
>
>> If we could similarly move evict() into kswapd (and I believe we can)
>> then most file systems would do nothing in reclaim context that
>> interferes with allocation context.
>
> When lots of GFP_NOFS allocation is being done, this already
> happens. The shrinkers that can't run due to context accumulate the
> work on the shrinker structure, and when the shrinker can next run
> (e.g. run from kswapd) it runs all the deferred work from GFP_NOFS
> reclaim contexts.
>
> IOWs, we already move shrinker work from direct reclaim to kswapd
> when appropriate.
>
>> The exceptions include:
>>  - nfs and any filesystem using fscache can block for up to 1 second
>>    in ->releasepage().  They used to block waiting for some IO, but that
>>    caused deadlocks and wasn't really needed.  I left the timeout because
>>    it seemed likely that some throttling would help.  I suspect that a
>>    careful analysis will show that there is sufficient throttling
>>    elsewhere.
>> 
>>  - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
>>    for IO so it can free some quotainfo things. 
>
> No it's not. evict() can block on IO - waiting for data or inode
> writeback to complete, or even for filesystems to run transactions
> on the inode. Hence the superblock shrinker can and does block in
> inode cache reclaim.

That is why I said "nearly" :-)

>
> Indeed, blocking the superblock shrinker in reclaim is a key part of
> balancing inode cache pressure in XFS. If the shrinker starts
> hitting dirty inodes, it blocks on cleaning them, thereby slowing
> the rate of allocation to that which inodes can be cleaned and
> reclaimed. There are also background threads that walk ahead freeing
> clean inodes, but we have to throttle direct reclaim in this manner
> otherwise the allocation pressure vastly outweighs the ability to
> reclaim inodes. If we don't balance this, inode allocation triggers
> the OOM killer because reclaim keeps reporting "no progress being
> made" because dirty inodes are skipped. By blocking on such inodes,
> the shrinker makes progress (slowly) and reclaim sees that memory is
> being freed and so continues without invoking the OOM killer...

I'm very aware of the need to throttle allocation based on IO.  I
remember when NFS didn't quite get this right and filled up memory :-)

balance_dirty_pages() used to force threads to wait on the write-out of
one page for every page that they dirtied (or wait on 128 pages for every 128
dirtied or whatever).  This was exactly to provide the sort of
throttling you are talking about.

We don't do that any more.  It was problematic.  I don't recall all the
reasons but I think that different backing devices having different
clearance rates was part of the problem.
So now we monitor clearance rates and wait for some number of blocks to
be written, rather than waiting for some specific blocks to be written.

We should be able to do the same thing to balance dirty inodes as we do
to balance dirty pages.
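
Roughly, I mean something like the following (a hypothetical sketch,
modelled on the IO-less dirty page throttling -- none of these names
exist in the kernel):

#define MAX_INODE_PAUSE_MS      200     /* illustrative cap */

/* Hypothetical: throttle a task that dirties inodes by sleeping for
 * a pause derived from the measured cleaning rate, rather than
 * waiting on any specific inode. */
static void balance_dirty_inodes(struct example_sb_info *sbi, long dirtied)
{
        /* updated by the writeback/reclaim completion path */
        long rate = READ_ONCE(sbi->inodes_cleaned_per_sec);
        long pause_ms;

        if (rate <= 0)
                rate = 1;       /* idle device: assume the worst */

        pause_ms = min(dirtied * MSEC_PER_SEC / rate,
                       (long)MAX_INODE_PAUSE_MS);
        msleep(pause_ms);
}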


>
>>    If it could be changed
>>    to just schedule the IO without waiting for it then I think this
>>    would be safe to be called in any FS allocation context.  It already
>>    uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
>>    if the lock is held.
>
> We could, but then we have the same problem as the inode cache -
> there's no indication of progress going back to the memory reclaim
> subsystem, nor is reclaim able to throttle memory allocation back to
> the rate at which reclaim is making progress.
>
> There's feedback loops all throughout the XFS reclaim code - it's
> designed specifically that way - I made changes to the shrinker
> infrastructure years ago to enable this. It's no different to the
> dirty page throttling that was done at roughly the same time -
> that's also one big feedback loop controlled by the rate at which
> pages can be cleaned. Indeed, its design was based on the same
> premise as all the XFS shrinker code: in steady state conditions
> we can't allocate a resource faster than we can reclaim it, so we
> need to make reclaim as efficient as possible...

You seem to be referring here to the same change that I was referred to
above, but seem to be seeing it from a different perspective.

Waiting for inodes to be freed is important.  Waiting for any one
specific inode to be freed is dangerous.

>
>> I think you/we would end up with a much simpler system if instead of
>> focussing on the places where GFP_NOFS is used, we focus on places where
>> __GFP_FS is tested, and try to remove them.  If we get rid of enough of
>> them the remainder could just use __GFP_IO.
>
> The problem with this is that a single kswapd thread can't keep up
> with all of the allocation pressure that occurs. e.g. a 20-core
> intel CPU with local memory will be seen as a single node and so
> will have a single kswapd thread to do reclaim. There's a massive
> imbalance between maximum reclaim rate and maximum allocation rate
> in situations like this. If we want memory reclaim to run faster,
> we need to be able to do more work *now*, not defer it to a context with
> limited execution resources.

I agree it makes sense to do *work* locally on the core that sees the
need.  What I want to avoid is *sleeping* locally.
How would it be to make evict_inode non-blocking?  It would do as much work
as it can, which in many cases would presumably complete the whole task.
But if it cannot progress for some reason, it returns -EAGAIN and then
the rest gets queued for later.
Might that work, or is there often still lots of CPU work to be done
after things have been waited for?
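
The shape I have in mind is something like this (a hypothetical
sketch; inode_has_dirty_state() and the deferral are made up for
illustration):

/* Hypothetical non-blocking evict: do whatever can be done without
 * sleeping; if we would have to wait, return -EAGAIN so the caller
 * can queue the inode for a context that is allowed to block. */
static int example_evict_inode_nonblock(struct inode *inode)
{
        if (inode_has_dirty_state(inode))       /* hypothetical test */
                return -EAGAIN;                 /* caller defers to a worker */

        truncate_inode_pages_final(&inode->i_data);
        clear_inode(inode);
        return 0;
}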

Thanks,
NeilBrown


>
> i.e. IMO deferring more work to a single reclaim thread per node is
> going to limit memory reclaim scalability and performance, not
> improve it.
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 0/2] scop GFP_NOFS api
  2016-04-30 21:55       ` NeilBrown
  (?)
  (?)
@ 2016-05-03 15:13         ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-05-03 15:13 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	Theodore Ts'o, Chris Mason, Jan Kara, ceph-devel,
	cluster-devel, linux-nfs, logfs, xfs, linux-ext4, linux-btrfs,
	linux-mtd, reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel,
	linux-afs, LKML

Hi,

On Sun 01-05-16 07:55:31, NeilBrown wrote:
[...]
> One particular problem with your process-context idea is that it isn't
> inherited across threads.
> Steve Whitehouse's example in gfs shows how allocation dependencies can
> even cross into user space.

Hmm, I am still not sure I understand that example completely but making
a dependency between direct reclaim and userspace can hardly work.
Especially when the direct reclaim might be sitting on top of a
hard-to-guess pile of locks. So unless I've missed anything, what Steve
has described is a clear NOFS context.

> A more localized one that I have seen is that NFSv4 sometimes needs to
> start up a state-management thread (particularly if the server
> restarted).
> It uses kthread_run(), which doesn't actually create the thread but asks
> kthreadd to do it.  If NFS writeout is waiting for state management it
> would need to make sure that kthreadd runs in allocation context to
> avoid deadlock.
> I feel that I've forgotten some important detail here and this might
> have been fixed somehow, but the point still stands that the allocation
> context can cross from thread to thread and can effectively become
> anything and everything.

Not sure I understand your point here, but relying on kthread_run
from a GFP_NOFS context has always been deadlock prone, with or without
the scope GFP_NOFS semantic. Similarly, relying on a work item which
doesn't have a dedicated WQ_MEM_RECLAIM WQ is deadlock prone.  You
simply shouldn't do that.
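
(For completeness: any work item that reclaim may end up waiting on
needs to be queued on a workqueue created with WQ_MEM_RECLAIM, which
guarantees a rescuer thread, e.g.:)

/* "fs_reclaim_wq" is an illustrative name; the flag is what matters.
 * The rescuer guarantees forward progress even when no new worker
 * threads can be created under memory pressure. */
struct workqueue_struct *wq = alloc_workqueue("fs_reclaim_wq",
                                              WQ_MEM_RECLAIM, 0);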

> It is OK to wait for memory to be freed.  It is not OK to wait for any
> particular piece of memory to be freed because you don't always know who
> is waiting for you, or who you really are waiting on to free that
> memory.
> 
> Whenever trying to free memory I think you need to do best-effort
> without blocking.

I agree with that. Or at least you have to wait on something that is
_guaranteed_ to make forward progress. I am not really sure this
is easy to achieve with the current code base.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context
  2016-04-29 23:40             ` Dave Chinner
  (?)
@ 2016-05-03 15:38               ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2016-05-03 15:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, Andrew Morton, Jan Kara, xfs, LKML

On Sat 30-04-16 09:40:08, Dave Chinner wrote:
> On Fri, Apr 29, 2016 at 02:12:20PM +0200, Michal Hocko wrote:
[...]
> > - was it 
> > "inconsistent {RECLAIM_FS-ON-[RW]} -> {IN-RECLAIM_FS-[WR]} usage"
> > or a different class reports?
> 
> Typically that was involved, but quite often there'd be a number
> of locks and sometimes even interrupt stacks in an interaction
> between 5 or 6 different processes. Lockdep covers all sorts of
> stuff now (like fs freeze annotations as well as locks and memory
> reclaim) so sometimes the only thing we can do is remove the
> reclaim context from the stack and see if that makes it go away...

That is what I was thinking of. lockdep_reclaim_{disable,enable} or
something like that to tell __lockdep_trace_alloc to not skip
mark_held_locks(). This would effectively help to get rid of the
reclaim-specific reports. It is hard to tell whether there would be others,
though.
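
(To be concrete, the usage I have in mind would look something like
the following; lockdep_reclaim_{disable,enable} are proposed names
only, not an existing API:)

static void *example_alloc(size_t size)
{
        void *ptr;

        lockdep_reclaim_disable();      /* hypothetical annotation */
        ptr = kmalloc(size, GFP_KERNEL); /* known safe wrt reclaim recursion */
        lockdep_reclaim_enable();       /* hypothetical annotation */
        return ptr;
}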

> > > They may have been fixed since, but I'm sceptical
> > > of that because, generally speaking, developer testing only catches
> > > the obvious lockdep issues. i.e. it's users that report all the
> > > really twisty issues, and they are generally not reproducable except
> > > under their production workloads...
> > > 
> > > IOWs, the absence of reports in your testing does not mean there
> > > isn't a problem, and that is one of the biggest problems with
> > > lockdep annotations - we have no way of ever knowing if they are
> > > still necessary or not without exposing users to regressions and
> > > potential deadlocks.....
> > 
> > I understand your points here but if we are sure that those lockdep
> > reports are just false positives then we should rather provide an api to
> > silence lockdep for those paths
> 
> I agree with this - please provide such infrastructure before we
> need it...

Do you think a reclaim specific lockdep annotation would be sufficient?

> > than abusing GFP_NOFS which a) hurts
> > the overal reclaim healthiness
> 
> Which doesn't actually seem to be a problem for the vast majority of
> users.

Yes, most users are OK. But those allocations can be triggered from
userspace (read: by a malicious user) quite easily and be harmful without
a good way to contain them.
 
> > and b) works around a non-existing
> > problem with lockdep disabled which is the vast majority of
> > configurations.
> 
> But the moment we have a lockdep problem, we get bug reports from
> all over the place and people complaining about it, so we are
> *required* to silence them one way or another. And, like I said,
> when the choice is simply adding GFP_NOFS or spending a week or two
> completely reworking complex code that has functioned correctly for
> 15 years, the risk/reward *always* falls on the side of "just add
> GFP_NOFS".
> 
> Please keep in mind that there is as much code in fs/xfs as there is
> in the mm/ subsystem, and XFS has twice that in userspace as well.
> I say this because we only have 3-4 full time developers to do
> all the work required on this code base, unlike the mm/ subsystem
> which had 30-40 full time MM developers attending LSFMM. This is why
> I push back on suggestions that require significant redesign of
> subsystem code to handle memory allocation/reclaim quirks - most
> subsystems simply don't have the resources available to do such
> work, and so will always look for the quick 2 minute fix when it is
> available....

I do understand your concerns and I really do not ask you to redesign
your code. I would like to make the code more maintainable, and reducing
the number of (undocumented) GFP_NOFS usages to the minimum seems like a
first step. Now the direct usage of GFP_NOFS (resp. KM_NOFS) in xfs is
not that large. If we can reduce the few instances which are using the
flag to silence the lockdep and replace them by a better annotation then
I think this would be an improvement as well. If we can go one step
further and can get rid of mapping_set_gfp_mask(inode->i_mapping,
(gfp_mask & ~(__GFP_FS))) then I would be even happier.
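
(For reference, the scope api from patch 1 would make the restriction
explicit at the transaction boundary instead -- a minimal sketch:)

static void *example_transaction_alloc(size_t size)
{
        unsigned int flags;
        void *ptr;

        /* all allocations inside this scope are implicitly GFP_NOFS,
         * so callees can simply use GFP_KERNEL */
        flags = memalloc_nofs_save();
        ptr = kmalloc(size, GFP_KERNEL);        /* behaves as GFP_NOFS here */
        memalloc_nofs_restore(flags);
        return ptr;
}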

I think other fs and code which interacts with FS layer needs much more
changes than xfs to be honest.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 0/2] scop GFP_NOFS api
  2016-05-03 15:13         ` Michal Hocko
  (?)
@ 2016-05-03 23:26           ` NeilBrown
  -1 siblings, 0 replies; 144+ messages in thread
From: NeilBrown @ 2016-05-03 23:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	Theodore Ts'o, Chris Mason, Jan Kara, ceph-devel,
	cluster-devel, linux-nfs, logfs, xfs, linux-ext4, linux-btrfs,
	linux-mtd, reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel,
	linux-afs, LKML

On Wed, May 04 2016, Michal Hocko wrote:

> Hi,
>
> On Sun 01-05-16 07:55:31, NeilBrown wrote:
> [...]
>> One particular problem with your process-context idea is that it isn't
>> inherited across threads.
>> Steve Whitehouse's example in gfs shows how allocation dependencies can
>> even cross into user space.
>
> Hmm, I am still not sure I understand that example completely but making
> a dependency between direct reclaim and userspace can hardly work.

No it can't.  Specifically: if direct reclaim blocks on user-space that
must be a problem.
I think the point of this example is that some filesystem things can
block on user-space in ways that are very hard to encode with flags
as they are multi-level indirect.
So the conclusion (my conclusion) is that direct reclaim mustn't block.

When I was working on deadlock avoidance in loop-back NFS I went down
the path of adding GFP flags and extending the PF_FSTRANS flag, and got it
working (I think), but no-one liked it.  It was way too intrusive.

Somehow I landed on the idea of making nfs_release_page non-blocking
and everything suddenly became much much simpler.  Problems evaporated.
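
The shape of the fix was roughly this (a simplified sketch, not the
actual nfs_release_page):

/* If freeing this page would require waiting on anything, just say
 * "no" and let reclaim move on to other pages. */
static int example_release_page(struct page *page, gfp_t gfp)
{
        if (PagePrivate(page))  /* fs state still attached: would block */
                return 0;       /* refuse; reclaim tries elsewhere */
        return 1;               /* safe to free without blocking */
}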

NFS has a distinct advantage here.  The "Close-to-open" cache semantic
means that all dirty pages must be flushed on last close.  So when
->evict_inode is finally called there is nothing much to do - just free
everything up.  So I could fix NFS without worrying about (or even
noticing) evict_inode.


> Especially when the direct reclaim might be sitting on top of a
> hard-to-guess pile of locks. So unless I've missed anything, what Steve
> has described is a clear NOFS context.
>
>> A more localized one that I have seen is that NFSv4 sometimes needs to
>> start up a state-management thread (particularly if the server
>> restarted).
>> It uses kthread_run(), which doesn't actually create the thread but asks
>> kthreadd to do it.  If NFS writeout is waiting for state management it
>> would need to make sure that kthreadd runs in allocation context to
>> avoid deadlock.
>> I feel that I've forgotten some important detail here and this might
>> have been fixed somehow, but the point still stands that the allocation
>> context can cross from thread to thread and can effectively become
>> anything and everything.
>
> Not sure I understand your point here, but relying on kthread_run
> from a GFP_NOFS context has always been deadlock prone, with or without
> the scope GFP_NOFS semantic. Similarly, relying on a work item which
> doesn't have a dedicated WQ_MEM_RECLAIM WQ is deadlock prone.  You
> simply shouldn't do that.

The point is really that saying "You shouldn't do that" isn't much good
when "that" is exactly what the fs developer wants to do and it seems to
work and never triggers a warning.

You can create lots of rules about what is or is not allowed, or
you can make everything that is not explicitly forbidden (ideally at
compile time but possibly at runtime with might_sleep or lockdep)
permitted.

I prefer the latter.

>
>> It is OK to wait for memory to be freed.  It is not OK to wait for any
>> particular piece of memory to be freed because you don't always know who
>> is waiting for you, or who you really are waiting on to free that
>> memory.
>> 
>> Whenever trying to free memory I think you need to do best-effort
>> without blocking.
>
> I agree with that. Or at least you have to wait on something that is
> _guaranteed_ to make forward progress. I am not really sure this
> is easy to achieve with the current code base.

I accept that it isn't "easy".  But I claim that it isn't particularly
difficult either.

NeilBrown

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context
  2016-05-03 15:38               ` Michal Hocko
  (?)
@ 2016-05-04  0:07                 ` Dave Chinner
  -1 siblings, 0 replies; 144+ messages in thread
From: Dave Chinner @ 2016-05-04  0:07 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-fsdevel, Andrew Morton, Jan Kara, xfs, LKML

On Tue, May 03, 2016 at 05:38:23PM +0200, Michal Hocko wrote:
> On Sat 30-04-16 09:40:08, Dave Chinner wrote:
> > On Fri, Apr 29, 2016 at 02:12:20PM +0200, Michal Hocko wrote:
> [...]
> > > - was it 
> > > "inconsistent {RECLAIM_FS-ON-[RW]} -> {IN-RECLAIM_FS-[WR]} usage"
> > > or a different class reports?
> > 
> > Typically that was involved, but quite often there'd be a number
> > of locks and sometimes even interrupt stacks in an interaction
> > between 5 or 6 different processes. Lockdep covers all sorts of
> > stuff now (like fs freeze annotations as well as locks and memory
> > reclaim) so sometimes the only thing we can do is remove the
> > reclaim context from the stack and see if that makes it go away...
> 
> That is what I was thinking of. lockdep_reclaim_{disable,enable} or
> something like that to tell __lockdep_trace_alloc to not skip
> mark_held_locks(). This would effectively help to get rid of the
> reclaim-specific reports. It is hard to tell whether there would be others,
> though.

Yeah, though I suspect this would get messy having to scatter it
around the code. I can encapsulate it via internal XFS KM flags,
though, so I don't think that will be a real issue.
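
(Something along these lines -- a simplified sketch of the
fs/xfs/kmem.h conversion, with a hypothetical KM_NOLOCKDEP flag
standing in for the annotation being discussed:)

#define KM_NOFS         0x0004u         /* as in fs/xfs/kmem.h */
#define KM_NOLOCKDEP    0x0010u         /* hypothetical */

static inline gfp_t kmem_flags_convert(unsigned int km_flags)
{
        gfp_t gfp = GFP_KERNEL;

        if (km_flags & KM_NOFS)         /* a real reclaim restriction */
                gfp &= ~__GFP_FS;
        /* KM_NOLOCKDEP would only adjust lockdep state, making
         * "silence a false positive" greppable and distinct from a
         * genuine NOFS requirement */
        return gfp;
}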

> > > > They may have been fixed since, but I'm sceptical
> > > > of that because, generally speaking, developer testing only catches
> > > > the obvious lockdep issues. i.e. it's users that report all the
> > > > really twisty issues, and they are generally not reproducable except
> > > > under their production workloads...
> > > > 
> > > > IOWs, the absence of reports in your testing does not mean there
> > > > isn't a problem, and that is one of the biggest problems with
> > > > lockdep annotations - we have no way of ever knowing if they are
> > > > still necessary or not without exposing users to regressions and
> > > > potential deadlocks.....
> > > 
> > > I understand your points here but if we are sure that those lockdep
> > > reports are just false positives then we should rather provide an api to
> > > silence lockdep for those paths
> > 
> > I agree with this - please provide such infrastructure before we
> > need it...
> 
> Do you think a reclaim specific lockdep annotation would be sufficient?

It will help - it'll take some time to work through all the explicit
KM_NOFS calls in XFS, though, to determine if they are just working
around lockdep false positives or some other potential problem....

> I do understand your concerns and I really do not ask you to redesign
> your code. I would like to make the code more maintainable, and reducing
> the number of (undocumented) GFP_NOFS usages to the minimum seems like a
> first step. Now the direct usage of GFP_NOFS (resp. KM_NOFS) in xfs is
> not that large.

That's true, and if we can reduce them to real cases of GFP_NOFS
being needed vs annotations to silence lockdep false positives we'll
then know what problems we really need to fix...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 0/2] scop GFP_NOFS api
  2016-04-30 22:19       ` NeilBrown
  (?)
  (?)
@ 2016-05-04  1:00         ` Dave Chinner
  -1 siblings, 0 replies; 144+ messages in thread
From: Dave Chinner @ 2016-05-04  1:00 UTC (permalink / raw)
  To: NeilBrown
  Cc: Michal Hocko, linux-mm, linux-fsdevel, Andrew Morton,
	Theodore Ts'o, Chris Mason, Jan Kara, ceph-devel,
	cluster-devel, linux-nfs, logfs, xfs, linux-ext4, linux-btrfs,
	linux-mtd, reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel,
	linux-afs, LKML

On Sun, May 01, 2016 at 08:19:44AM +1000, NeilBrown wrote:
> On Sat, Apr 30 2016, Dave Chinner wrote:
> > Indeed, blocking the superblock shrinker in reclaim is a key part of
> > balancing inode cache pressure in XFS. If the shrinker starts
> > hitting dirty inodes, it blocks on cleaning them, thereby slowing
> > the rate of allocation to that which inodes can be cleaned and
> > reclaimed. There are also background threads that walk ahead freeing
> > clean inodes, but we have to throttle direct reclaim in this manner
> > otherwise the allocation pressure vastly outweighs the ability to
> > reclaim inodes. If we don't balance this, inode allocation triggers
> > the OOM killer because reclaim keeps reporting "no progress being
> > made" because dirty inodes are skipped. By blocking on such inodes,
> > the shrinker makes progress (slowly) and reclaim sees that memory is
> > being freed and so continues without invoking the OOM killer...
> 
> I'm very aware of the need to throttle allocation based on IO.  I
> remember when NFS didn't quite get this right and filled up memory :-)
> 
> balance_dirty_pages() used to force threads to wait on the write-out of
> one page for every page that they dirtied (or wait on 128 pages for every 128
> dirtied or whatever).  This was exactly to provide the sort of
> throttling you are talking about.
> 
> We don't do that any more.  It was problematic.  I don't recall all the
> reasons but I think that different backing devices having different
> clearance rates was part of the problem.

As the original architect of those changes (the IO-less dirty page
throttling was an evolution of the algorithm I developed for Irix
years before it was done in linux) I remember the reasons very well.

Mostly it was to prevent what we termed "IO breakdown" where so many
different foreground threads were dispatching writeback that it
turned dirty page writeback into random IO instead of nice,
large sequential IOs.

> So now we monitor clearance rates and wait for some number of blocks to
> be written, rather than waiting for some specific blocks to be written.

Right - we have a limited pool of flusher threads dispatching IO
efficiently as possible, and foreground dirtying processes wait for
that to do the work of cleaning pages.

> We should be able to do the same thing to balance dirty inodes as
> we do to balance dirty pages.

No, we can't. Dirty inode state is deeply tied into filesystem
implementations - it's intertwined with the journal operation in
many filesystems and can't be separated easily.  Indeed, some
> filesystems don't even use the VFS for tracking dirty inode state,
> nor do they implement the ->write_inode method. Hence neither the VFS
> nor the inode cache shrinkers are able to determine if an inode is
> dirty or, if it is, trigger writeback of it.

IOWs, inode caches are not unified in allocation, behaviour, design
or implementation like the page cache is, and so balancing dirty
inodes is likely not to be possible....

> >>    If it could be changed
> >>    to just schedule the IO without waiting for it then I think this
> >>    would be safe to be called in any FS allocation context.  It already
> >>    uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
> >>    if the lock is held.
> >
> > We could, but then we have the same problem as the inode cache -
> > there's no indication of progress going back to the memory reclaim
> > subsystem, nor is reclaim able to throttle memory allocation back to
> > the rate at which reclaim is making progress.
> >
> > There's feedback loops all throughout the XFS reclaim code - it's
> > designed specifically that way - I made changes to the shrinker
> > infrastructure years ago to enable this. It's no different to the
> > dirty page throttling that was done at roughly the same time -
> > that's also one big feedback loop controlled by the rate at which
> > pages can be cleaned. Indeed, it was designed based on the same
> > premise as all the XFS shrinker code: in steady state conditions
> > we can't allocate a resource faster than we can reclaim it, so we
> > need to make reclaim as efficient as possible...
> 
> You seem to be referring here to the same change that I was referred to
> above, but seem to be seeing it from a different perspective.

Sure, not many people have the same viewpoint as the person who had
to convince everyone else that it was a sound idea even before
prototypes were written...

> Waiting for inodes to be freed is important.  Waiting for any one
> specific inode to be freed is dangerous.

Sure. But there's a difference between waiting on an inode with no
idea of when you'll be able to reclaim it, versus waiting on IO
completion for an inode you've already guaranteed can complete
reclaim and to which you are the only context with access....

Not to mention that XFS also triggers an async inode reclaim worker
thread to do non-blocking reclaim of inodes in the background, so
while direct reclaim throttles (and hence prevents excessive
concurrent direct reclaim from trashing the working set of clean
inodes whilst ignoring dirty inodes) we still get appropriate
amounts of inode cache reclaim occurring...

> > The problem with this is that a single kswapd thread can't keep
> > up with all of the allocation pressure that occurs. e.g. a
> > 20-core intel CPU with local memory will be seen as a single
> > node and so will have a single kswapd thread to do reclaim.
> > There's a massive imbalance between maximum reclaim rate and
> > maximum allocation rate in situations like this. If we want
> > memory reclaim to run faster, we need to be able to do more work
> > *now*, not defer it to a context with limited execution
> > resources.
> 
> I agree it makes sense to do *work* locally on the core that sees
> the need.  What I want to avoid is *sleeping* locally.

With direct reclaim, sleeping locally is the only way we can
throttle it. If we want to avoid reclaim doing this, then we
need to move to a reclaim model similar to the IO-less dirty page
throttling mechanism which directly limits reclaim concurrency and
so individual subsystem shrinkers don't need to do this to prevent
reclaim from trashing the working set.

Indeed, I've proposed this in the past, and been told by various
developers that it can't possibly work and, besides, fs developers
know nothing about mm/ design.... I've got better things to do than
bash my head against brick walls like this, even though I know such
algorithms both perform and scale for memory reclaim because that's
exactly where I got the original inspiration for IO-less dirty page
throttling from: the Irix memory reclaim algorithms....

> How would
> it be to make evict_inode non-blocking?

We can't without completely redesigning how iput_final() and evict()
work and order operations. Indeed, by the time we call
->evict_inode(), we've already marked the inode as I_FREEING,
removed the inode from the superblock inode list, dirty writeback
lists, the LRU lists and blocked waiting for any inode writeback in
progress to complete.
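
In outline (a simplified sketch from memory, not the verbatim
fs/inode.c code):

	static void evict(struct inode *inode)
	{
		const struct super_operations *op = inode->i_sb->s_op;

		/* callers (iput_final()/dispose_list()) have already
		 * set I_FREEING and taken the inode off the LRU */
		BUG_ON(!(inode->i_state & I_FREEING));

		inode_io_list_del(inode);	 /* off writeback lists */
		inode_sb_list_del(inode);	 /* off the sb inode list */
		inode_wait_for_writeback(inode); /* may block here */

		if (op->evict_inode)
			op->evict_inode(inode);	 /* FS sees it only now */
		else
			truncate_inode_pages_final(&inode->i_data);
		/* ... wake bit waiters and free the inode ... */
	}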

We'd also have to redesign the shrinker code, because it does bulk
operations to isolate inodes from the LRU onto temporary lists after
marking them I_FREEING, and then calls evict() on them from a
separate context. We'd then have to change all that code to handle
inodes that can't be reclaimed and re-insert them into the LRUs
appropriately, and that would have to be managed in a way that we
don't end up live-locking the shrinker LRU scanning on unreferenced
inodes that cannot be evicted. I'm also not sure whether the
dispose_list() function can add inodes back onto the LRU safely,
because contexts calling evict() make the assumption that the inode
is no longer on the LRU and don't need to access LRU locks at all...

Also, evict() is called when unmounting the filesystem, and so in
those cases we actually want it to block and do everything that is
necessary to free the inode, so this would introduce more complex
interactions depending on caller context....

> It would do as much work
> as it can, which in many cases would presumably complete the whole
> task.  But if it cannot progress for some reason, it returns
> -EAGAIN and then the rest gets queued for later.  Might that work,
> or is there often still lots of CPU work to be done after things
> have been waited for?

Freeing the inode usually occurs after it's been waited for, i.e.
no wait, no inode freed, and no feedback for throttling the
inode reclaim rate...

FWIW, I don't think making evict() non-blocking is going to be worth
the effort here. Making memory reclaim wait on a priority ordered
queue while asynchronous reclaim threads run reclaim as efficiently
as possible and wake waiters as they free the memory the waiters
require is a model that has been proven to work in the past, and
this appears to me to be the model you are advocating for. I agree
that direct reclaim needs to die and be replaced with something
entirely more predictable, controllable and less prone to deadlock
contexts - you just need to convince the mm developers that it will
perform and scale better than what we have now.
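
The shape of that model, as a toy sketch (mine, not kernel code;
free_pages_available() and shrink_caches_efficiently() are
hypothetical stand-ins):

	/* needs <linux/wait.h>, <linux/workqueue.h> */
	static DECLARE_WAIT_QUEUE_HEAD(reclaim_waiters);

	/* allocation slow path: sleep on the queue instead of doing
	 * direct reclaim (priority ordering elided for brevity) */
	static void wait_for_reclaim(void)
	{
		wait_event(reclaim_waiters, free_pages_available());
	}

	/* async reclaim worker: reclaim as efficiently as possible,
	 * then wake the waiters whose memory has been freed */
	static void reclaim_worker(struct work_struct *work)
	{
		while (!free_pages_available())
			shrink_caches_efficiently();
		wake_up_all(&reclaim_waiters);
	}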

In the meantime, having a slightly more fine-grained GFP_NOFS
equivalent context will allow us to avoid the worst of the current
GFP_NOFS problems with very little extra code.
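
As a minimal sketch of such a scoped context, assuming the
memalloc_nofs_{save,restore} API from patch 1 of this series (the
struct example_fs and do_fs_work_that_allocates() names are
hypothetical placeholders):

	static int example_fs_start_transaction(struct example_fs *fs)
	{
		unsigned int nofs_flags;
		int err;

		/* enter the scope: every allocation below behaves as
		 * GFP_NOFS, so callees can use plain GFP_KERNEL without
		 * risking recursion back into the filesystem */
		nofs_flags = memalloc_nofs_save();
		err = do_fs_work_that_allocates(fs);
		memalloc_nofs_restore(nofs_flags);

		return err;
	}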

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 0/2] scop GFP_NOFS api
  2016-05-04  1:00         ` Dave Chinner
@ 2016-05-06  3:20           ` NeilBrown
  -1 siblings, 0 replies; 144+ messages in thread
From: NeilBrown @ 2016-05-06  3:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Michal Hocko, linux-mm, linux-fsdevel, Andrew Morton,
	Theodore Ts'o, Chris Mason, Jan Kara, ceph-devel,
	cluster-devel, linux-nfs, logfs, xfs, linux-ext4, linux-btrfs,
	linux-mtd, reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel,
	linux-afs, LKML

[-- Attachment #1: Type: text/plain, Size: 4610 bytes --]

On Wed, May 04 2016, Dave Chinner wrote:

> FWIW, I don't think making evict() non-blocking is going to be worth
> the effort here. Making memory reclaim wait on a priority ordered
> queue while asynchronous reclaim threads run reclaim as efficiently
> as possible and wakes waiters as it frees the memory the waiters
> require is a model that has been proven to work in the past, and
> this appears to me to be the model you are advocating for. I agree
> that direct reclaim needs to die and be replaced with something
> entirely more predictable, controllable and less prone to deadlock
> contexts - you just need to convince the mm developers that it will
> perform and scale better than what we have now.
>
> In the meantime, having a slightly more fine-grained GFP_NOFS
> equivalent context will allow us to avoid the worst of the current
> GFP_NOFS problems with very little extra code.

You have painted two pictures here.  The first is an ideal which does
look a lot like the sort of outcome I was aiming for, but is more than a
small step away.
The second is a band-aid which would take us in exactly the wrong
direction.  It takes an interface which people apparently find hard to
use (or easy to misuse) - the setting of __GFP_FS - and makes it more
complex.  Certainly it would be more powerful, but I think it would also
be more misused.

So I ask myself:  can we take some small steps towards 'A' and thereby
enable at least the functionality enabled by 'B'?

A core design principle for me is to enable filesystems to take control
of their own destiny.   They should have the information available to
make the decisions they need to make, and the opportunity to carry them
out.

All the places where direct reclaim currently calls into filesystems
carry the 'gfp' flags so the file system can decide what to do, with one
exception: evict_inode.  So my first proposal would be to rectify that.

 - redefine .nr_cached_objects and .free_cached_objects so that, if they
   are defined, they are responsible for s_dentry_lru and s_inode_lru.
   e.g. super_cache_count *either* calls ->nr_cached_objects *or* makes
   two calls to list_lru_shrink_count.  This would require exporting
   prune_dcache_sb and prune_icache_sb but otherwise should be a fairly
   straightforward change.
   If nr_cached_objects were defined, super_cache_scan would no longer
   abort without __GFP_FS - that test would be left to the filesystem.

 - Now any filesystem that wants to can stash its super_block pointer
   in current->journal_info while doing memory allocations, and abort
   any reclaim attempts (release_page, shrinker, nr_cached_objects) if
   and only if current->journal_info == "my superblock" - see the
   sketch after this list.  This can be done without the core mm code
   knowing any more than it already does.

 - A more sophisticated filesystem might import much of the code for
   prune_icache_sb() - either by copy/paste or by exporting some vfs
   internals - and then store an inode pointer in current->journal_info
   and only abort reclaim which touches that inode.

 - if a filesystem happens to know that it will never block in any of
   these reclaim calls, it can always allow prune_dcache_sb to run, and
   never needs to use GFP_NOFS.  I think NFS might be close to being
   able to do this as it flushes everything on last-close.  But that is
   something that NFS developers can care about (or not) quite
   independently from mm people.

 - Maybe some fs developer will try to enable free_cached_objects to do
   as much work as possible for every inode, but never deadlock.  It
   could do its own fs-specific deadlock detection, or could queue work
   to a work queue and wait a limited time for it.  Or something.
   If some filesystem developer comes up with something that works
   really well, developers of other filesystems might copy it - or not
   as they choose.
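
For concreteness, a sketch of the journal_info pattern from the second
item above (the example_* names are hypothetical, and the reclaim side
assumes the redefined ->free_cached_objects from the first item):

	/* needs <linux/fs.h>, <linux/slab.h>, <linux/sched.h>,
	 * <linux/shrinker.h> */

	/* allocation side: tag the task as "inside this filesystem" */
	static void *example_fs_alloc(struct super_block *sb, size_t size)
	{
		void *old = current->journal_info;
		void *p;

		current->journal_info = sb;
		p = kmalloc(size, GFP_KERNEL);	/* no GFP_NOFS needed */
		current->journal_info = old;
		return p;
	}

	/* reclaim side: back off only for our own superblock; note we
	 * only ever compare the pointer, never dereference it */
	static long example_fs_free_cached_objects(struct super_block *sb,
						   struct shrink_control *sc)
	{
		if (current->journal_info == sb)
			return SHRINK_STOP;	/* would recurse into us */
		/* ... otherwise prune the dentry/inode LRUs here ... */
		return 0;
	}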

Maybe ->journal_info isn't perfect for this.  It is currently only safe
for reclaim code to compare it against a known value.  It is not safe to
dereference it to see if it points to a known value.  That could possibly
be cleaned up, or another task_struct field could be provided for
filesystems to track their state.  Or do you find a task_struct field
unacceptable, and is there some reason that an explicitly passed cookie
is superior?

My key point is that we shouldn't try to plumb some new abstraction
through the MM code so there is a new pattern for all filesystems to
follow.  Rather, the mm/vfs should get out of the filesystems' way as much
as possible and let them innovate independently.

Thanks for your time,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 144+ messages in thread

end of thread, other threads:[~2016-05-06  3:21 UTC | newest]

Thread overview: 144+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-26 11:56 [PATCH 0/2] scop GFP_NOFS api Michal Hocko
2016-04-26 11:56 ` [Cluster-devel] " Michal Hocko
2016-04-26 11:56 ` Michal Hocko
2016-04-26 11:56 ` Michal Hocko
2016-04-26 11:56 ` Michal Hocko
2016-04-26 11:56 ` [PATCH 1/2] mm: add PF_MEMALLOC_NOFS Michal Hocko
2016-04-26 11:56   ` [Cluster-devel] " Michal Hocko
2016-04-26 11:56   ` Michal Hocko
2016-04-26 11:56   ` Michal Hocko
2016-04-26 11:56   ` Michal Hocko
2016-04-26 23:07   ` Dave Chinner
2016-04-26 23:07     ` [Cluster-devel] " Dave Chinner
2016-04-26 23:07     ` Dave Chinner
2016-04-26 23:07     ` Dave Chinner
2016-04-27  7:51     ` Michal Hocko
2016-04-27  7:51       ` [Cluster-devel] " Michal Hocko
2016-04-27  7:51       ` Michal Hocko
2016-04-27  7:51       ` Michal Hocko
2016-04-27 10:53   ` Tetsuo Handa
2016-04-27 10:53     ` Tetsuo Handa
2016-04-27 11:15     ` Michal Hocko
2016-04-27 11:15       ` Michal Hocko
2016-04-27 14:44       ` Tetsuo Handa
2016-04-27 14:44         ` Tetsuo Handa
2016-04-27 20:05         ` Michal Hocko
2016-04-27 20:05           ` Michal Hocko
2016-04-27 11:54   ` [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS Michal Hocko
2016-04-27 11:54     ` [Cluster-devel] " Michal Hocko
2016-04-27 11:54     ` Michal Hocko
2016-04-27 11:54     ` Michal Hocko
2016-04-27 11:54     ` Michal Hocko
2016-04-27 11:54     ` [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API Michal Hocko
2016-04-27 11:54       ` [Cluster-devel] [PATCH 1.2/2] mm: introduce memalloc_nofs_{save, restore} API Michal Hocko
2016-04-27 11:54       ` [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API Michal Hocko
2016-04-27 11:54       ` Michal Hocko
2016-04-27 11:54       ` Michal Hocko
2016-04-27 11:54       ` Michal Hocko
2016-04-27 13:07       ` Michal Hocko
2016-04-27 13:07         ` [Cluster-devel] [PATCH 1.2/2] mm: introduce memalloc_nofs_{save, restore} API Michal Hocko
2016-04-27 13:07         ` [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API Michal Hocko
2016-04-27 13:07         ` Michal Hocko
2016-04-27 20:09       ` Michal Hocko
2016-04-27 20:09         ` [Cluster-devel] [PATCH 1.2/2] mm: introduce memalloc_nofs_{save, restore} API Michal Hocko
2016-04-27 20:09         ` [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API Michal Hocko
2016-04-27 20:09         ` Michal Hocko
2016-04-27 20:30         ` Michal Hocko
2016-04-27 20:30           ` [Cluster-devel] [PATCH 1.2/2] mm: introduce memalloc_nofs_{save, restore} API Michal Hocko
2016-04-27 20:30           ` [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API Michal Hocko
2016-04-27 20:30           ` Michal Hocko
2016-04-27 21:14       ` Michal Hocko
2016-04-27 21:14         ` [Cluster-devel] [PATCH 1.2/2] mm: introduce memalloc_nofs_{save, restore} API Michal Hocko
2016-04-27 21:14         ` [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API Michal Hocko
2016-04-27 21:14         ` Michal Hocko
2016-04-27 21:14         ` Michal Hocko
2016-04-27 21:14         ` Michal Hocko
2016-04-27 17:41     ` [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS Andreas Dilger
2016-04-27 17:41       ` [Cluster-devel] " Andreas Dilger
2016-04-27 17:41       ` Andreas Dilger
2016-04-27 19:43       ` Michal Hocko
2016-04-27 19:43         ` [Cluster-devel] " Michal Hocko
2016-04-27 19:43         ` Michal Hocko
2016-04-27 19:43         ` Michal Hocko
2016-04-26 11:56 ` [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context Michal Hocko
2016-04-26 11:56   ` [Cluster-devel] [PATCH 2/2] mm, debug: report when GFP_NO{FS, IO} is used explicitly from memalloc_no{fs, io}_{save, restore} context Michal Hocko
2016-04-26 11:56   ` Michal Hocko
2016-04-26 11:56   ` [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context Michal Hocko
2016-04-26 11:56   ` [PATCH 2/2] mm, debug: report when GFP_NO{FS, IO} is used explicitly from memalloc_no{fs, io}_{save, restore} context Michal Hocko
2016-04-26 11:56   ` [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context Michal Hocko
2016-04-26 22:58   ` Dave Chinner
2016-04-27  8:03     ` Michal Hocko
2016-04-27 22:55       ` Dave Chinner
2016-04-28  8:17     ` Michal Hocko
2016-04-28 21:51       ` Dave Chinner
2016-04-29 12:12         ` Michal Hocko
2016-04-29 23:40           ` Dave Chinner
2016-05-03 15:38             ` Michal Hocko
2016-05-04  0:07               ` Dave Chinner
2016-04-29  5:35 ` [PATCH 0/2] scop GFP_NOFS api NeilBrown
2016-04-29 10:20   ` Steven Whitehouse
2016-04-30 21:17     ` NeilBrown
2016-04-29 12:04   ` Michal Hocko
2016-04-30  0:24     ` Dave Chinner
2016-04-30 21:55     ` NeilBrown
2016-05-03 15:13       ` Michal Hocko
2016-05-03 23:26         ` NeilBrown
2016-04-30  0:11   ` Dave Chinner
2016-04-30 22:19     ` NeilBrown
2016-05-04  1:00       ` Dave Chinner
2016-05-06  3:20         ` NeilBrown
