linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] Overhaul memalloc_no*
@ 2020-06-25 11:31 Matthew Wilcox (Oracle)
  2020-06-25 11:31 ` [PATCH 1/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_noio Matthew Wilcox (Oracle)
                   ` (7 more replies)
  0 siblings, 8 replies; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2020-06-25 11:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, dm-devel, Mikulas Patocka, Jens Axboe, NeilBrown

I want a memalloc_nowait like we have memalloc_noio and memalloc_nofs
for an upcoming patch series, and Jens also wants it for non-blocking
io_uring.  It turns out we already have dm-bufio which could benefit
from memalloc_nowait, so it may as well go into the tree now.

The biggest problem is that we're basically out of PF_ flags, so we need
to find somewhere else to store the PF_MEMALLOC_NOWAIT flag.  It turns
out the PF_ flags are really supposed to be used for flags which are
accessed from other tasks, and the MEMALLOC flags are only going to
be used by this task.  So shuffling everything around frees up some PF
flags and generally makes the world a better place.

Patch series also available from
http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc

Matthew Wilcox (Oracle) (6):
  mm: Replace PF_MEMALLOC_NOIO with memalloc_noio
  mm: Add become_kswapd and restore_kswapd
  xfs: Convert to memalloc_nofs_save
  mm: Replace PF_MEMALLOC_NOFS with memalloc_nofs
  mm: Replace PF_MEMALLOC_NOIO with memalloc_nocma
  mm: Add memalloc_nowait

 drivers/block/loop.c           |  3 +-
 drivers/md/dm-bufio.c          | 30 ++++--------
 drivers/md/dm-zoned-metadata.c |  5 +-
 fs/iomap/buffered-io.c         |  2 +-
 fs/xfs/kmem.c                  |  2 +-
 fs/xfs/libxfs/xfs_btree.c      | 14 +++---
 fs/xfs/xfs_aops.c              |  4 +-
 fs/xfs/xfs_buf.c               |  2 +-
 fs/xfs/xfs_linux.h             |  6 ---
 fs/xfs/xfs_trans.c             | 14 +++---
 fs/xfs/xfs_trans.h             |  2 +-
 include/linux/sched.h          |  7 +--
 include/linux/sched/mm.h       | 84 ++++++++++++++++++++++++++--------
 kernel/sys.c                   |  8 ++--
 mm/vmscan.c                    | 16 +------
 15 files changed, 105 insertions(+), 94 deletions(-)

-- 
2.27.0



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_noio
  2020-06-25 11:31 [PATCH 0/6] Overhaul memalloc_no* Matthew Wilcox (Oracle)
@ 2020-06-25 11:31 ` Matthew Wilcox (Oracle)
  2020-06-25 12:22   ` Michal Hocko
  2020-06-25 11:31 ` [PATCH 2/6] mm: Add become_kswapd and restore_kswapd Matthew Wilcox (Oracle)
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2020-06-25 11:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, dm-devel, Mikulas Patocka, Jens Axboe, NeilBrown

We're short on PF_* flags, so make memalloc_noio its own bit where we
have plenty of space.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 drivers/block/loop.c           |  3 ++-
 drivers/md/dm-zoned-metadata.c |  5 ++---
 include/linux/sched.h          |  2 +-
 include/linux/sched/mm.h       | 30 +++++++++++++++++++++++-------
 kernel/sys.c                   |  8 +++-----
 5 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 475e1a738560..c8742e25e58a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -52,6 +52,7 @@
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/sched.h>
+#include <linux/sched/mm.h>
 #include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/stat.h>
@@ -929,7 +930,7 @@ static void loop_unprepare_queue(struct loop_device *lo)
 
 static int loop_kthread_worker_fn(void *worker_ptr)
 {
-	current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
+	set_current_io_flusher();
 	return kthread_worker_fn(worker_ptr);
 }
 
diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c
index 130b5a6d9f12..1c5ae674ba20 100644
--- a/drivers/md/dm-zoned-metadata.c
+++ b/drivers/md/dm-zoned-metadata.c
@@ -1599,9 +1599,8 @@ static int dmz_update_zone(struct dmz_metadata *zmd, struct dm_zone *zone)
 
 	/*
 	 * Get zone information from disk. Since blkdev_report_zones() uses
-	 * GFP_KERNEL by default for memory allocations, set the per-task
-	 * PF_MEMALLOC_NOIO flag so that all allocations are done as if
-	 * GFP_NOIO was specified.
+	 * GFP_KERNEL by default for memory allocations, use
+	 * memalloc_noio_save() to prevent recursion into the driver.
 	 */
 	noio_flag = memalloc_noio_save();
 	ret = blkdev_report_zones(dev->bdev, dmz_start_sect(zmd, zone), 1,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b62e6aaf28f0..cf18a3d2bc4c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -801,6 +801,7 @@ struct task_struct {
 	/* Stalled due to lack of memory */
 	unsigned			in_memstall:1;
 #endif
+	unsigned			memalloc_noio:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
@@ -1505,7 +1506,6 @@ extern struct pid *cad_pid;
 #define PF_FROZEN		0x00010000	/* Frozen for system suspend */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
-#define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LOCAL_THROTTLE	0x00100000	/* Throttle writes only against the bdi I write to,
 						 * I am cleaning dirty pages from some other bdi. */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 480a4d1b7dd8..1a7e1ab1be85 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -175,19 +175,18 @@ static inline bool in_vfork(struct task_struct *tsk)
 
 /*
  * Applies per-task gfp context to the given allocation flags.
- * PF_MEMALLOC_NOIO implies GFP_NOIO
  * PF_MEMALLOC_NOFS implies GFP_NOFS
  * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
  */
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
-	if (unlikely(current->flags &
-		     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
+	if (unlikely(current->flags & (PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA) ||
+		     current->memalloc_noio)) {
 		/*
 		 * NOIO implies both NOIO and NOFS and it is a weaker context
 		 * so always make sure it makes precedence
 		 */
-		if (current->flags & PF_MEMALLOC_NOIO)
+		if (current->memalloc_noio)
 			flags &= ~(__GFP_IO | __GFP_FS);
 		else if (current->flags & PF_MEMALLOC_NOFS)
 			flags &= ~__GFP_FS;
@@ -224,8 +223,8 @@ static inline void fs_reclaim_release(gfp_t gfp_mask) { }
  */
 static inline unsigned int memalloc_noio_save(void)
 {
-	unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
-	current->flags |= PF_MEMALLOC_NOIO;
+	unsigned int flags = current->memalloc_noio;
+	current->memalloc_noio = 1;
 	return flags;
 }
 
@@ -239,7 +238,7 @@ static inline unsigned int memalloc_noio_save(void)
  */
 static inline void memalloc_noio_restore(unsigned int flags)
 {
-	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
+	current->memalloc_noio = flags ? 1 : 0;
 }
 
 /**
@@ -309,6 +308,23 @@ static inline void memalloc_nocma_restore(unsigned int flags)
 }
 #endif
 
+static inline void set_current_io_flusher(void)
+{
+	current->flags |= PF_LOCAL_THROTTLE;
+	current->memalloc_noio = 1;
+}
+
+static inline void clear_current_io_flusher(void)
+{
+	current->flags &= ~PF_LOCAL_THROTTLE;
+	current->memalloc_noio = 0;
+}
+
+static inline bool get_current_io_flusher(void)
+{
+	return current->flags & PF_LOCAL_THROTTLE;
+}
+
 #ifdef CONFIG_MEMCG
 /**
  * memalloc_use_memcg - Starts the remote memcg charging scope.
diff --git a/kernel/sys.c b/kernel/sys.c
index 00a96746e28a..78c90d1e92f4 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2275,8 +2275,6 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
 	return -EINVAL;
 }
 
-#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
-
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2512,9 +2510,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 
 		if (arg2 == 1)
-			current->flags |= PR_IO_FLUSHER;
+			set_current_io_flusher();
 		else if (!arg2)
-			current->flags &= ~PR_IO_FLUSHER;
+			clear_current_io_flusher();
 		else
 			return -EINVAL;
 		break;
@@ -2525,7 +2523,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		if (arg2 || arg3 || arg4 || arg5)
 			return -EINVAL;
 
-		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
+		error = get_current_io_flusher();
 		break;
 	default:
 		error = -EINVAL;
-- 
2.27.0



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 2/6] mm: Add become_kswapd and restore_kswapd
  2020-06-25 11:31 [PATCH 0/6] Overhaul memalloc_no* Matthew Wilcox (Oracle)
  2020-06-25 11:31 ` [PATCH 1/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_noio Matthew Wilcox (Oracle)
@ 2020-06-25 11:31 ` Matthew Wilcox (Oracle)
  2020-06-25 12:31   ` Michal Hocko
  2020-06-25 11:31 ` [PATCH 3/6] xfs: Convert to memalloc_nofs_save Matthew Wilcox (Oracle)
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2020-06-25 11:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, dm-devel, Mikulas Patocka, Jens Axboe, NeilBrown

Since XFS needs to pretend to be kswapd in some of its worker threads,
create methods to save & restore kswapd state.  Don't bother restoring
kswapd state in kswapd -- the only time we reach this code is when we're
exiting and the task_struct is about to be destroyed anyway.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/xfs/libxfs/xfs_btree.c | 14 ++++++++------
 include/linux/sched/mm.h  | 26 ++++++++++++++++++++++++++
 mm/vmscan.c               | 16 +---------------
 3 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 2d25bab68764..a04a44238aab 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2813,8 +2813,9 @@ xfs_btree_split_worker(
 {
 	struct xfs_btree_split_args	*args = container_of(work,
 						struct xfs_btree_split_args, work);
+	bool			is_kswapd = args->kswapd;
 	unsigned long		pflags;
-	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
+	int			memalloc_nofs;
 
 	/*
 	 * we are in a transaction context here, but may also be doing work
@@ -2822,16 +2823,17 @@ xfs_btree_split_worker(
 	 * temporarily to ensure that we don't block waiting for memory reclaim
 	 * in any way.
 	 */
-	if (args->kswapd)
-		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
-
-	current_set_flags_nested(&pflags, new_pflags);
+	if (is_kswapd)
+		pflags = become_kswapd();
+	memalloc_nofs = memalloc_nofs_save();
 
 	args->result = __xfs_btree_split(args->cur, args->level, args->ptrp,
 					 args->key, args->curp, args->stat);
 	complete(args->done);
 
-	current_restore_flags_nested(&pflags, new_pflags);
+	memalloc_nofs_restore(memalloc_nofs);
+	if (is_kswapd)
+		restore_kswapd(pflags);
 }
 
 /*
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 1a7e1ab1be85..b0089eadc367 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -308,6 +308,32 @@ static inline void memalloc_nocma_restore(unsigned int flags)
 }
 #endif
 
+/*
+ * Tell the memory management that we're a "memory allocator",
+ * and that if we need more memory we should get access to it
+ * regardless (see "__alloc_pages()"). "kswapd" should
+ * never get caught in the normal page freeing logic.
+ *
+ * (Kswapd normally doesn't need memory anyway, but sometimes
+ * you need a small amount of memory in order to be able to
+ * page out something else, and this flag essentially protects
+ * us from recursively trying to free more memory as we're
+ * trying to free the first piece of memory in the first place).
+ */
+#define KSWAPD_PF_FLAGS		(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD)
+
+static inline unsigned long become_kswapd(void)
+{
+	unsigned long flags = current->flags & KSWAPD_PF_FLAGS;
+	current->flags |= KSWAPD_PF_FLAGS;
+	return flags;
+}
+
+static inline void restore_kswapd(unsigned long flags)
+{
+	current->flags &= ~(flags ^ KSWAPD_PF_FLAGS);
+}
+
 static inline void set_current_io_flusher(void)
 {
 	current->flags |= PF_LOCAL_THROTTLE;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6d84326bdf2..27ae76699899 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3870,19 +3870,7 @@ static int kswapd(void *p)
 	if (!cpumask_empty(cpumask))
 		set_cpus_allowed_ptr(tsk, cpumask);
 
-	/*
-	 * Tell the memory management that we're a "memory allocator",
-	 * and that if we need more memory we should get access to it
-	 * regardless (see "__alloc_pages()"). "kswapd" should
-	 * never get caught in the normal page freeing logic.
-	 *
-	 * (Kswapd normally doesn't need memory anyway, but sometimes
-	 * you need a small amount of memory in order to be able to
-	 * page out something else, and this flag essentially protects
-	 * us from recursively trying to free more memory as we're
-	 * trying to free the first piece of memory in the first place).
-	 */
-	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+	become_kswapd();
 	set_freezable();
 
 	WRITE_ONCE(pgdat->kswapd_order, 0);
@@ -3932,8 +3920,6 @@ static int kswapd(void *p)
 			goto kswapd_try_sleep;
 	}
 
-	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
-
 	return 0;
 }
 
-- 
2.27.0



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 3/6] xfs: Convert to memalloc_nofs_save
  2020-06-25 11:31 [PATCH 0/6] Overhaul memalloc_no* Matthew Wilcox (Oracle)
  2020-06-25 11:31 ` [PATCH 1/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_noio Matthew Wilcox (Oracle)
  2020-06-25 11:31 ` [PATCH 2/6] mm: Add become_kswapd and restore_kswapd Matthew Wilcox (Oracle)
@ 2020-06-25 11:31 ` Matthew Wilcox (Oracle)
  2020-06-25 11:31 ` [PATCH 4/6] mm: Replace PF_MEMALLOC_NOFS with memalloc_nofs Matthew Wilcox (Oracle)
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2020-06-25 11:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, dm-devel, Mikulas Patocka, Jens Axboe, NeilBrown

Instead of using custom macros to set/restore PF_MEMALLOC_NOFS, use
memalloc_nofs_save() like the rest of the kernel.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/xfs/kmem.c      |  2 +-
 fs/xfs/xfs_aops.c  |  4 ++--
 fs/xfs/xfs_buf.c   |  2 +-
 fs/xfs/xfs_linux.h |  6 ------
 fs/xfs/xfs_trans.c | 14 +++++++-------
 fs/xfs/xfs_trans.h |  2 +-
 6 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index f1366475c389..c2d237159bfc 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -35,7 +35,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
  * __vmalloc() will allocate data pages and auxiliary structures (e.g.
  * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context here. Hence
  * we need to tell memory reclaim that we are in such a context via
- * PF_MEMALLOC_NOFS to prevent memory reclaim re-entering the filesystem here
+ * memalloc_nofs to prevent memory reclaim re-entering the filesystem here
  * and potentially deadlocking.
  */
 static void *
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b35611882ff9..e3a4806e519d 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -62,7 +62,7 @@ xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	memalloc_nofs_restore(tp->t_memalloc);
 	return 0;
 }
 
@@ -125,7 +125,7 @@ xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	tp->t_memalloc = memalloc_nofs_save();
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 20b748f7e186..b2c3d01c690b 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -470,7 +470,7 @@ _xfs_buf_map_pages(
 		 * vm_map_ram() will allocate auxiliary structures (e.g.
 		 * pagetables) with GFP_KERNEL, yet we are likely to be under
 		 * GFP_NOFS context here. Hence we need to tell memory reclaim
-		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
+		 * that we are in such a context via memalloc_nofs to prevent
 		 * memory reclaim re-entering the filesystem here and
 		 * potentially deadlocking.
 		 */
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 9f70d2f68e05..e1daf242a53b 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -104,12 +104,6 @@ typedef __u32			xfs_nlink_t;
 #define current_cpu()		(raw_smp_processor_id())
 #define current_pid()		(current->pid)
 #define current_test_flags(f)	(current->flags & (f))
-#define current_set_flags_nested(sp, f)		\
-		(*(sp) = current->flags, current->flags |= (f))
-#define current_clear_flags_nested(sp, f)	\
-		(*(sp) = current->flags, current->flags &= ~(f))
-#define current_restore_flags_nested(sp, f)	\
-		(current->flags = ((current->flags & ~(f)) | (*(sp) & (f))))
 
 #define NBBY		8		/* number of bits per byte */
 
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 3c94e5ff4316..4ef1a0ff0a11 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -118,7 +118,7 @@ xfs_trans_dup(
 
 	ntp->t_rtx_res = tp->t_rtx_res - tp->t_rtx_res_used;
 	tp->t_rtx_res = tp->t_rtx_res_used;
-	ntp->t_pflags = tp->t_pflags;
+	ntp->t_memalloc = tp->t_memalloc;
 
 	/* move deferred ops over to the new tp */
 	xfs_defer_move(ntp, tp);
@@ -153,7 +153,7 @@ xfs_trans_reserve(
 	bool			rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	tp->t_memalloc = memalloc_nofs_save();
 
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
@@ -163,7 +163,7 @@ xfs_trans_reserve(
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
 		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+			memalloc_nofs_restore(tp->t_memalloc);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
@@ -240,7 +240,7 @@ xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	memalloc_nofs_restore(tp->t_memalloc);
 
 	return error;
 }
@@ -861,7 +861,7 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	memalloc_nofs_restore(tp->t_memalloc);
 	xfs_trans_free(tp);
 
 	/*
@@ -893,7 +893,7 @@ __xfs_trans_commit(
 			xfs_log_ticket_ungrant(mp->m_log, tp->t_ticket);
 		tp->t_ticket = NULL;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	memalloc_nofs_restore(tp->t_memalloc);
 	xfs_trans_free_items(tp, !!error);
 	xfs_trans_free(tp);
 
@@ -954,7 +954,7 @@ xfs_trans_cancel(
 	}
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	memalloc_nofs_restore(tp->t_memalloc);
 
 	xfs_trans_free_items(tp, dirty);
 	xfs_trans_free(tp);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 8308bf6d7e40..7aa2d5ff9245 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -118,6 +118,7 @@ typedef struct xfs_trans {
 	unsigned int		t_rtx_res;	/* # of rt extents resvd */
 	unsigned int		t_rtx_res_used;	/* # of resvd rt extents used */
 	unsigned int		t_flags;	/* misc flags */
+	unsigned int		t_memalloc;	/* saved memalloc state */
 	xfs_fsblock_t		t_firstblock;	/* first block allocated */
 	struct xlog_ticket	*t_ticket;	/* log mgr ticket */
 	struct xfs_mount	*t_mountp;	/* ptr to fs mount struct */
@@ -144,7 +145,6 @@ typedef struct xfs_trans {
 	struct list_head	t_items;	/* log item descriptors */
 	struct list_head	t_busy;		/* list of busy extents */
 	struct list_head	t_dfops;	/* deferred operations */
-	unsigned long		t_pflags;	/* saved process flags state */
 } xfs_trans_t;
 
 /*
-- 
2.27.0



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 4/6] mm: Replace PF_MEMALLOC_NOFS with memalloc_nofs
  2020-06-25 11:31 [PATCH 0/6] Overhaul memalloc_no* Matthew Wilcox (Oracle)
                   ` (2 preceding siblings ...)
  2020-06-25 11:31 ` [PATCH 3/6] xfs: Convert to memalloc_nofs_save Matthew Wilcox (Oracle)
@ 2020-06-25 11:31 ` Matthew Wilcox (Oracle)
  2020-06-25 13:35   ` Michal Hocko
  2020-06-25 11:31 ` [PATCH 5/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_nocma Matthew Wilcox (Oracle)
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2020-06-25 11:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, dm-devel, Mikulas Patocka, Jens Axboe, NeilBrown

We're short on PF_* flags, so make memalloc_nofs its own bit where we
have plenty of space.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c   |  2 +-
 include/linux/sched.h    |  2 +-
 include/linux/sched/mm.h | 13 ++++++-------
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..87d66c13bf5c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1502,7 +1502,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 	 * Given that we do not allow direct reclaim to call us, we should
 	 * never be called in a recursive filesystem reclaim context.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
+	if (WARN_ON_ONCE(current->memalloc_nofs))
 		goto redirty;
 
 	/*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cf18a3d2bc4c..eaf36ae1fde2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -802,6 +802,7 @@ struct task_struct {
 	unsigned			in_memstall:1;
 #endif
 	unsigned			memalloc_noio:1;
+	unsigned			memalloc_nofs:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
@@ -1505,7 +1506,6 @@ extern struct pid *cad_pid;
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
 #define PF_FROZEN		0x00010000	/* Frozen for system suspend */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
-#define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
 #define PF_LOCAL_THROTTLE	0x00100000	/* Throttle writes only against the bdi I write to,
 						 * I am cleaning dirty pages from some other bdi. */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index b0089eadc367..08bc9d0606a8 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -175,20 +175,19 @@ static inline bool in_vfork(struct task_struct *tsk)
 
 /*
  * Applies per-task gfp context to the given allocation flags.
- * PF_MEMALLOC_NOFS implies GFP_NOFS
  * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
  */
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
-	if (unlikely(current->flags & (PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA) ||
-		     current->memalloc_noio)) {
+	if (unlikely((current->flags & PF_MEMALLOC_NOCMA) ||
+		     current->memalloc_noio || current->memalloc_nofs)) {
 		/*
 		 * NOIO implies both NOIO and NOFS and it is a weaker context
 		 * so always make sure it makes precedence
 		 */
 		if (current->memalloc_noio)
 			flags &= ~(__GFP_IO | __GFP_FS);
-		else if (current->flags & PF_MEMALLOC_NOFS)
+		else if (current->memalloc_nofs)
 			flags &= ~__GFP_FS;
 #ifdef CONFIG_CMA
 		if (current->flags & PF_MEMALLOC_NOCMA)
@@ -254,8 +253,8 @@ static inline void memalloc_noio_restore(unsigned int flags)
  */
 static inline unsigned int memalloc_nofs_save(void)
 {
-	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
-	current->flags |= PF_MEMALLOC_NOFS;
+	unsigned int flags = current->memalloc_nofs;
+	current->memalloc_nofs = 1;
 	return flags;
 }
 
@@ -269,7 +268,7 @@ static inline unsigned int memalloc_nofs_save(void)
  */
 static inline void memalloc_nofs_restore(unsigned int flags)
 {
-	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+	current->memalloc_nofs = flags ? 1 : 0;
 }
 
 static inline unsigned int memalloc_noreclaim_save(void)
-- 
2.27.0



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 5/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_nocma
  2020-06-25 11:31 [PATCH 0/6] Overhaul memalloc_no* Matthew Wilcox (Oracle)
                   ` (3 preceding siblings ...)
  2020-06-25 11:31 ` [PATCH 4/6] mm: Replace PF_MEMALLOC_NOFS with memalloc_nofs Matthew Wilcox (Oracle)
@ 2020-06-25 11:31 ` Matthew Wilcox (Oracle)
  2020-06-25 11:31 ` [PATCH 6/6] mm: Add memalloc_nowait Matthew Wilcox (Oracle)
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2020-06-25 11:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, dm-devel, Mikulas Patocka, Jens Axboe, NeilBrown

We're short on PF_* flags, so make memalloc_nocma its own bit where we
have plenty of space.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/sched.h    |  2 +-
 include/linux/sched/mm.h | 15 +++++++--------
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eaf36ae1fde2..90336850e940 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -803,6 +803,7 @@ struct task_struct {
 #endif
 	unsigned			memalloc_noio:1;
 	unsigned			memalloc_nofs:1;
+	unsigned			memalloc_nocma:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
@@ -1514,7 +1515,6 @@ extern struct pid *cad_pid;
 #define PF_UMH			0x02000000	/* I'm an Usermodehelper process */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
-#define PF_MEMALLOC_NOCMA	0x10000000	/* All allocation request will have _GFP_MOVABLE cleared */
 #define PF_IO_WORKER		0x20000000	/* Task is an IO worker */
 #define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 08bc9d0606a8..6f7b59a848a6 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -175,12 +175,11 @@ static inline bool in_vfork(struct task_struct *tsk)
 
 /*
  * Applies per-task gfp context to the given allocation flags.
- * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
  */
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
-	if (unlikely((current->flags & PF_MEMALLOC_NOCMA) ||
-		     current->memalloc_noio || current->memalloc_nofs)) {
+	if (unlikely(current->memalloc_noio || current->memalloc_nofs ||
+		     current->memalloc_nocma)) {
 		/*
 		 * NOIO implies both NOIO and NOFS and it is a weaker context
 		 * so always make sure it makes precedence
@@ -190,7 +189,8 @@ static inline gfp_t current_gfp_context(gfp_t flags)
 		else if (current->memalloc_nofs)
 			flags &= ~__GFP_FS;
 #ifdef CONFIG_CMA
-		if (current->flags & PF_MEMALLOC_NOCMA)
+		/* memalloc_nocma implies no allocation from movable region */
+		if (current->memalloc_nocma)
 			flags &= ~__GFP_MOVABLE;
 #endif
 	}
@@ -286,15 +286,14 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
 #ifdef CONFIG_CMA
 static inline unsigned int memalloc_nocma_save(void)
 {
-	unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
-
-	current->flags |= PF_MEMALLOC_NOCMA;
+	unsigned int flags = current->memalloc_nocma;
+	current->memalloc_nocma = 1;
 	return flags;
 }
 
 static inline void memalloc_nocma_restore(unsigned int flags)
 {
-	current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
+	current->memalloc_nocma = flags ? 1 : 0;
 }
 #else
 static inline unsigned int memalloc_nocma_save(void)
-- 
2.27.0



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-25 11:31 [PATCH 0/6] Overhaul memalloc_no* Matthew Wilcox (Oracle)
                   ` (4 preceding siblings ...)
  2020-06-25 11:31 ` [PATCH 5/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_nocma Matthew Wilcox (Oracle)
@ 2020-06-25 11:31 ` Matthew Wilcox (Oracle)
  2020-06-25 12:40   ` Michal Hocko
                     ` (5 more replies)
  2020-06-25 18:48 ` [PATCH 0/6] Overhaul memalloc_no* Darrick J. Wong
  2020-06-26 15:02 ` Mikulas Patocka
  7 siblings, 6 replies; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2020-06-25 11:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, dm-devel, Mikulas Patocka, Jens Axboe, NeilBrown

Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
guarantees we will not sleep to reclaim memory.  Use it to simplify
dm-bufio's allocations.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 drivers/md/dm-bufio.c    | 30 ++++++++----------------------
 include/linux/sched.h    |  1 +
 include/linux/sched/mm.h | 12 ++++++++----
 3 files changed, 17 insertions(+), 26 deletions(-)

diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 6d1565021d74..140ada9a2c8f 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -412,23 +412,6 @@ static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
 
 	*data_mode = DATA_MODE_VMALLOC;
 
-	/*
-	 * __vmalloc allocates the data pages and auxiliary structures with
-	 * gfp_flags that were specified, but pagetables are always allocated
-	 * with GFP_KERNEL, no matter what was specified as gfp_mask.
-	 *
-	 * Consequently, we must set per-process flag PF_MEMALLOC_NOIO so that
-	 * all allocations done by this process (including pagetables) are done
-	 * as if GFP_NOIO was specified.
-	 */
-	if (gfp_mask & __GFP_NORETRY) {
-		unsigned noio_flag = memalloc_noio_save();
-		void *ptr = __vmalloc(c->block_size, gfp_mask);
-
-		memalloc_noio_restore(noio_flag);
-		return ptr;
-	}
-
 	return __vmalloc(c->block_size, gfp_mask);
 }
 
@@ -866,9 +849,6 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
 	 * dm-bufio is resistant to allocation failures (it just keeps
 	 * one buffer reserved in cases all the allocations fail).
 	 * So set flags to not try too hard:
-	 *	GFP_NOWAIT: don't wait; if we need to sleep we'll release our
-	 *		    mutex and wait ourselves.
-	 *	__GFP_NORETRY: don't retry and rather return failure
 	 *	__GFP_NOMEMALLOC: don't use emergency reserves
 	 *	__GFP_NOWARN: don't print a warning in case of failure
 	 *
@@ -877,7 +857,9 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
 	 */
 	while (1) {
 		if (dm_bufio_cache_size_latch != 1) {
-			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+			unsigned nowait_flag = memalloc_nowait_save();
+			b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
+			memalloc_nowait_restore(nowait_flag);
 			if (b)
 				return b;
 		}
@@ -886,8 +868,12 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
 			return NULL;
 
 		if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
+			unsigned noio_flag;
+
 			dm_bufio_unlock(c);
-			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+			noio_flag = memalloc_noio_save();
+			b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
+			memalloc_noio_restore(noio_flag);
 			dm_bufio_lock(c);
 			if (b)
 				return b;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90336850e940..b1c2cddd366c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -803,6 +803,7 @@ struct task_struct {
 #endif
 	unsigned			memalloc_noio:1;
 	unsigned			memalloc_nofs:1;
+	unsigned			memalloc_nowait:1;
 	unsigned			memalloc_nocma:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 6f7b59a848a6..6484569f50df 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -179,12 +179,16 @@ static inline bool in_vfork(struct task_struct *tsk)
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
 	if (unlikely(current->memalloc_noio || current->memalloc_nofs ||
-		     current->memalloc_nocma)) {
+		     current->memalloc_nocma) || current->memalloc_nowait) {
 		/*
-		 * NOIO implies both NOIO and NOFS and it is a weaker context
-		 * so always make sure it makes precedence
+		 * Clearing DIRECT_RECLAIM means we won't get to the point
+		 * of testing IO or FS, so we don't need to bother clearing
+		 * them.  noio implies neither IO nor FS and it is a weaker
+		 * context so always make sure it takes precedence.
 		 */
-		if (current->memalloc_noio)
+		if (current->memalloc_nowait)
+			flags &= ~__GFP_DIRECT_RECLAIM;
+		else if (current->memalloc_noio)
 			flags &= ~(__GFP_IO | __GFP_FS);
 		else if (current->memalloc_nofs)
 			flags &= ~__GFP_FS;
-- 
2.27.0



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_noio
  2020-06-25 11:31 ` [PATCH 1/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_noio Matthew Wilcox (Oracle)
@ 2020-06-25 12:22   ` Michal Hocko
  2020-06-25 12:34     ` Matthew Wilcox
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2020-06-25 12:22 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown

On Thu 25-06-20 12:31:17, Matthew Wilcox wrote:
> We're short on PF_* flags, so make memalloc_noio its own bit where we
> have plenty of space.

I do not mind moving that outside of the PF_* space. Unless I
misremember all flags in this space were intented to be set only on the
current which rules out any RMW races and therefore they can be
lockless. I am not sure this holds for the bitfield you are adding this
to. At least in_memstall seem to be set on external task as well. But
this would require double checking. Maybe that is not really intended or
just a bug.

> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  drivers/block/loop.c           |  3 ++-
>  drivers/md/dm-zoned-metadata.c |  5 ++---
>  include/linux/sched.h          |  2 +-
>  include/linux/sched/mm.h       | 30 +++++++++++++++++++++++-------
>  kernel/sys.c                   |  8 +++-----
>  5 files changed, 31 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 475e1a738560..c8742e25e58a 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -52,6 +52,7 @@
>  #include <linux/module.h>
>  #include <linux/moduleparam.h>
>  #include <linux/sched.h>
> +#include <linux/sched/mm.h>
>  #include <linux/fs.h>
>  #include <linux/file.h>
>  #include <linux/stat.h>
> @@ -929,7 +930,7 @@ static void loop_unprepare_queue(struct loop_device *lo)
>  
>  static int loop_kthread_worker_fn(void *worker_ptr)
>  {
> -	current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
> +	set_current_io_flusher();
>  	return kthread_worker_fn(worker_ptr);
>  }
>  
> diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c
> index 130b5a6d9f12..1c5ae674ba20 100644
> --- a/drivers/md/dm-zoned-metadata.c
> +++ b/drivers/md/dm-zoned-metadata.c
> @@ -1599,9 +1599,8 @@ static int dmz_update_zone(struct dmz_metadata *zmd, struct dm_zone *zone)
>  
>  	/*
>  	 * Get zone information from disk. Since blkdev_report_zones() uses
> -	 * GFP_KERNEL by default for memory allocations, set the per-task
> -	 * PF_MEMALLOC_NOIO flag so that all allocations are done as if
> -	 * GFP_NOIO was specified.
> +	 * GFP_KERNEL by default for memory allocations, use
> +	 * memalloc_noio_save() to prevent recursion into the driver.
>  	 */
>  	noio_flag = memalloc_noio_save();
>  	ret = blkdev_report_zones(dev->bdev, dmz_start_sect(zmd, zone), 1,
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b62e6aaf28f0..cf18a3d2bc4c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -801,6 +801,7 @@ struct task_struct {
>  	/* Stalled due to lack of memory */
>  	unsigned			in_memstall:1;
>  #endif
> +	unsigned			memalloc_noio:1;
>  
>  	unsigned long			atomic_flags; /* Flags requiring atomic access. */
>  
> @@ -1505,7 +1506,6 @@ extern struct pid *cad_pid;
>  #define PF_FROZEN		0x00010000	/* Frozen for system suspend */
>  #define PF_KSWAPD		0x00020000	/* I am kswapd */
>  #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
> -#define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
>  #define PF_LOCAL_THROTTLE	0x00100000	/* Throttle writes only against the bdi I write to,
>  						 * I am cleaning dirty pages from some other bdi. */
>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 480a4d1b7dd8..1a7e1ab1be85 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -175,19 +175,18 @@ static inline bool in_vfork(struct task_struct *tsk)
>  
>  /*
>   * Applies per-task gfp context to the given allocation flags.
> - * PF_MEMALLOC_NOIO implies GFP_NOIO
>   * PF_MEMALLOC_NOFS implies GFP_NOFS
>   * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
>   */
>  static inline gfp_t current_gfp_context(gfp_t flags)
>  {
> -	if (unlikely(current->flags &
> -		     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
> +	if (unlikely(current->flags & (PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA) ||
> +		     current->memalloc_noio)) {
>  		/*
>  		 * NOIO implies both NOIO and NOFS and it is a weaker context
>  		 * so always make sure it makes precedence
>  		 */
> -		if (current->flags & PF_MEMALLOC_NOIO)
> +		if (current->memalloc_noio)
>  			flags &= ~(__GFP_IO | __GFP_FS);
>  		else if (current->flags & PF_MEMALLOC_NOFS)
>  			flags &= ~__GFP_FS;
> @@ -224,8 +223,8 @@ static inline void fs_reclaim_release(gfp_t gfp_mask) { }
>   */
>  static inline unsigned int memalloc_noio_save(void)
>  {
> -	unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
> -	current->flags |= PF_MEMALLOC_NOIO;
> +	unsigned int flags = current->memalloc_noio;
> +	current->memalloc_noio = 1;
>  	return flags;
>  }
>  
> @@ -239,7 +238,7 @@ static inline unsigned int memalloc_noio_save(void)
>   */
>  static inline void memalloc_noio_restore(unsigned int flags)
>  {
> -	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
> +	current->memalloc_noio = flags ? 1 : 0;
>  }
>  
>  /**
> @@ -309,6 +308,23 @@ static inline void memalloc_nocma_restore(unsigned int flags)
>  }
>  #endif
>  
> +static inline void set_current_io_flusher(void)
> +{
> +	current->flags |= PF_LOCAL_THROTTLE;
> +	current->memalloc_noio = 1;
> +}
> +
> +static inline void clear_current_io_flusher(void)
> +{
> +	current->flags &= ~PF_LOCAL_THROTTLE;
> +	current->memalloc_noio = 0;
> +}
> +
> +static inline bool get_current_io_flusher(void)
> +{
> +	return current->flags & PF_LOCAL_THROTTLE;
> +}
> +
>  #ifdef CONFIG_MEMCG
>  /**
>   * memalloc_use_memcg - Starts the remote memcg charging scope.
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 00a96746e28a..78c90d1e92f4 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2275,8 +2275,6 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
>  	return -EINVAL;
>  }
>  
> -#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
> -
>  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		unsigned long, arg4, unsigned long, arg5)
>  {
> @@ -2512,9 +2510,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  			return -EINVAL;
>  
>  		if (arg2 == 1)
> -			current->flags |= PR_IO_FLUSHER;
> +			set_current_io_flusher();
>  		else if (!arg2)
> -			current->flags &= ~PR_IO_FLUSHER;
> +			clear_current_io_flusher();
>  		else
>  			return -EINVAL;
>  		break;
> @@ -2525,7 +2523,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		if (arg2 || arg3 || arg4 || arg5)
>  			return -EINVAL;
>  
> -		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
> +		error = get_current_io_flusher();
>  		break;
>  	default:
>  		error = -EINVAL;
> -- 
> 2.27.0
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: Add become_kswapd and restore_kswapd
  2020-06-25 11:31 ` [PATCH 2/6] mm: Add become_kswapd and restore_kswapd Matthew Wilcox (Oracle)
@ 2020-06-25 12:31   ` Michal Hocko
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2020-06-25 12:31 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown

On Thu 25-06-20 12:31:18, Matthew Wilcox wrote:
> Since XFS needs to pretend to be kswapd in some of its worker threads,
> create methods to save & restore kswapd state.  Don't bother restoring
> kswapd state in kswapd -- the only time we reach this code is when we're
> exiting and the task_struct is about to be destroyed anyway.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Certainly better than an opencoded PF_$FOO manipulation
Acked-by: Michal Hocko <mhocko@suse.com>

I would just ask for a clarification because this is rellying to have a
good MM knowledge to follow

> +/*
> + * Tell the memory management that we're a "memory allocator",

I would go with.
Tell the memory management that the caller is working on behalf of the
background memory reclaim (aka kswapd) and help it to make a forward
progress. That means that it will get an access to memory reserves
should there be a need to allocate memory in order to make a forward
progress. Note that the caller has to be extremely careful when doing
that.

Or something like that.

> + * and that if we need more memory we should get access to it
> + * regardless (see "__alloc_pages()"). "kswapd" should
> + * never get caught in the normal page freeing logic.
> + *
> + * (Kswapd normally doesn't need memory anyway, but sometimes
> + * you need a small amount of memory in order to be able to
> + * page out something else, and this flag essentially protects
> + * us from recursively trying to free more memory as we're
> + * trying to free the first piece of memory in the first place).
> + */
> +#define KSWAPD_PF_FLAGS		(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD)
> +
> +static inline unsigned long become_kswapd(void)
> +{
> +	unsigned long flags = current->flags & KSWAPD_PF_FLAGS;
> +	current->flags |= KSWAPD_PF_FLAGS;
> +	return flags;
> +}
> +
> +static inline void restore_kswapd(unsigned long flags)
> +{
> +	current->flags &= ~(flags ^ KSWAPD_PF_FLAGS);
> +}
> +
>  static inline void set_current_io_flusher(void)
>  {
>  	current->flags |= PF_LOCAL_THROTTLE;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b6d84326bdf2..27ae76699899 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3870,19 +3870,7 @@ static int kswapd(void *p)
>  	if (!cpumask_empty(cpumask))
>  		set_cpus_allowed_ptr(tsk, cpumask);
>  
> -	/*
> -	 * Tell the memory management that we're a "memory allocator",
> -	 * and that if we need more memory we should get access to it
> -	 * regardless (see "__alloc_pages()"). "kswapd" should
> -	 * never get caught in the normal page freeing logic.
> -	 *
> -	 * (Kswapd normally doesn't need memory anyway, but sometimes
> -	 * you need a small amount of memory in order to be able to
> -	 * page out something else, and this flag essentially protects
> -	 * us from recursively trying to free more memory as we're
> -	 * trying to free the first piece of memory in the first place).
> -	 */
> -	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> +	become_kswapd();
>  	set_freezable();
>  
>  	WRITE_ONCE(pgdat->kswapd_order, 0);
> @@ -3932,8 +3920,6 @@ static int kswapd(void *p)
>  			goto kswapd_try_sleep;
>  	}
>  
> -	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
> -
>  	return 0;
>  }
>  
> -- 
> 2.27.0
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_noio
  2020-06-25 12:22   ` Michal Hocko
@ 2020-06-25 12:34     ` Matthew Wilcox
  2020-06-25 12:42       ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2020-06-25 12:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown

On Thu, Jun 25, 2020 at 02:22:39PM +0200, Michal Hocko wrote:
> On Thu 25-06-20 12:31:17, Matthew Wilcox wrote:
> > We're short on PF_* flags, so make memalloc_noio its own bit where we
> > have plenty of space.
> 
> I do not mind moving that outside of the PF_* space. Unless I
> misremember all flags in this space were intented to be set only on the
> current which rules out any RMW races and therefore they can be
> lockless. I am not sure this holds for the bitfield you are adding this
> to. At least in_memstall seem to be set on external task as well. But
> this would require double checking. Maybe that is not really intended or
> just a bug.

I was going from the comment:

        /* Unserialized, strictly 'current' */
(which you can't see from the context of the diff, but is above the block)

The situation with ->flags is a little more ambiguous:

/*
 * Only the _current_ task can read/write to tsk->flags, but other
 * tasks can access tsk->flags in readonly mode for example
 * with tsk_used_math (like during threaded core dumping).
 * There is however an exception to this rule during ptrace
 * or during fork: the ptracer task is allowed to write to the
 * child->flags of its traced child (same goes for fork, the parent
 * can write to the child->flags), because we're guaranteed the
 * child is not running and in turn not changing child->flags
 * at the same time the parent does it.
 */

but it wasn't unsafe to use the PF_ flags in the way that you were.
It's just crowded.

If in_memstall is set on other tasks, then it should be moved to the
PFA flags, which there are plenty of.

But a quick grep shows it only being read on other tasks and always
set on current:

kernel/sched/psi.c:     *flags = current->in_memstall;
kernel/sched/psi.c:      * in_memstall setting & accounting needs to be atomic wrt
kernel/sched/psi.c:     current->in_memstall = 1;
kernel/sched/psi.c:      * in_memstall clearing & accounting needs to be atomic wrt
kernel/sched/psi.c:     current->in_memstall = 0;
kernel/sched/psi.c:     if (task->in_memstall)
kernel/sched/stats.h:           if (p->in_memstall)
kernel/sched/stats.h:           if (p->in_memstall)
kernel/sched/stats.h:   if (unlikely(p->in_iowait || p->in_memstall)) {
kernel/sched/stats.h:           if (p->in_memstall)
kernel/sched/stats.h:   if (unlikely(rq->curr->in_memstall))

so I think everything is fine.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-25 11:31 ` [PATCH 6/6] mm: Add memalloc_nowait Matthew Wilcox (Oracle)
@ 2020-06-25 12:40   ` Michal Hocko
  2020-06-25 13:10     ` Matthew Wilcox
  2020-06-25 19:05   ` kernel test robot
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2020-06-25 12:40 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown

On Thu 25-06-20 12:31:22, Matthew Wilcox wrote:
> Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
> guarantees we will not sleep to reclaim memory.  Use it to simplify
> dm-bufio's allocations.

memalloc_nowait is a good idea! I suspect the primary usecase would be
vmalloc.

> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

[...]
> @@ -877,7 +857,9 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>  	 */
>  	while (1) {
>  		if (dm_bufio_cache_size_latch != 1) {
> -			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			unsigned nowait_flag = memalloc_nowait_save();
> +			b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			memalloc_nowait_restore(nowait_flag);

This looks confusing though. I am not familiar with alloc_buffer and
there is quite some tweaking around __GFP_NORETRY in alloc_buffer_data
which I do not follow but GFP_KERNEL just struck my eyes. So why cannot
we have 
		alloc_buffer(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_noio
  2020-06-25 12:34     ` Matthew Wilcox
@ 2020-06-25 12:42       ` Michal Hocko
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2020-06-25 12:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown

On Thu 25-06-20 13:34:18, Matthew Wilcox wrote:
> On Thu, Jun 25, 2020 at 02:22:39PM +0200, Michal Hocko wrote:
> > On Thu 25-06-20 12:31:17, Matthew Wilcox wrote:
> > > We're short on PF_* flags, so make memalloc_noio its own bit where we
> > > have plenty of space.
> > 
> > I do not mind moving that outside of the PF_* space. Unless I
> > misremember all flags in this space were intented to be set only on the
> > current which rules out any RMW races and therefore they can be
> > lockless. I am not sure this holds for the bitfield you are adding this
> > to. At least in_memstall seem to be set on external task as well. But
> > this would require double checking. Maybe that is not really intended or
> > just a bug.
> 
> I was going from the comment:
> 
>         /* Unserialized, strictly 'current' */
> (which you can't see from the context of the diff, but is above the block)
> 
> The situation with ->flags is a little more ambiguous:
> 
> /*
>  * Only the _current_ task can read/write to tsk->flags, but other
>  * tasks can access tsk->flags in readonly mode for example
>  * with tsk_used_math (like during threaded core dumping).
>  * There is however an exception to this rule during ptrace
>  * or during fork: the ptracer task is allowed to write to the
>  * child->flags of its traced child (same goes for fork, the parent
>  * can write to the child->flags), because we're guaranteed the
>  * child is not running and in turn not changing child->flags
>  * at the same time the parent does it.
>  */

OK, I have obviously missed that.

> but it wasn't unsafe to use the PF_ flags in the way that you were.
> It's just crowded.
> 
> If in_memstall is set on other tasks, then it should be moved to the
> PFA flags, which there are plenty of.
> 
> But a quick grep shows it only being read on other tasks and always
> set on current:
> 
> kernel/sched/psi.c:     *flags = current->in_memstall;
> kernel/sched/psi.c:      * in_memstall setting & accounting needs to be atomic wrt
> kernel/sched/psi.c:     current->in_memstall = 1;
> kernel/sched/psi.c:      * in_memstall clearing & accounting needs to be atomic wrt
> kernel/sched/psi.c:     current->in_memstall = 0;
> kernel/sched/psi.c:     if (task->in_memstall)

Have a look at cgroup_move_task. So I believe this is something to be
fixed but independent on your change.

Feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

> kernel/sched/stats.h:           if (p->in_memstall)
> kernel/sched/stats.h:           if (p->in_memstall)
> kernel/sched/stats.h:   if (unlikely(p->in_iowait || p->in_memstall)) {
> kernel/sched/stats.h:           if (p->in_memstall)
> kernel/sched/stats.h:   if (unlikely(rq->curr->in_memstall))
> 
> so I think everything is fine.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-25 12:40   ` Michal Hocko
@ 2020-06-25 13:10     ` Matthew Wilcox
  2020-06-25 13:34       ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2020-06-25 13:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown

On Thu, Jun 25, 2020 at 02:40:17PM +0200, Michal Hocko wrote:
> On Thu 25-06-20 12:31:22, Matthew Wilcox wrote:
> > Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
> > guarantees we will not sleep to reclaim memory.  Use it to simplify
> > dm-bufio's allocations.
> 
> memalloc_nowait is a good idea! I suspect the primary usecase would be
> vmalloc.

That's funny.  My use case is allocating page tables in an RCU protected
page fault handler.  Jens' use case is allocating page cache.  This one
is a vmalloc consumer (which is also indirectly page table allocation).

> > @@ -877,7 +857,9 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
> >  	 */
> >  	while (1) {
> >  		if (dm_bufio_cache_size_latch != 1) {
> > -			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > +			unsigned nowait_flag = memalloc_nowait_save();
> > +			b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > +			memalloc_nowait_restore(nowait_flag);
> 
> This looks confusing though. I am not familiar with alloc_buffer and
> there is quite some tweaking around __GFP_NORETRY in alloc_buffer_data
> which I do not follow but GFP_KERNEL just struck my eyes. So why cannot
> we have 
> 		alloc_buffer(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);

Actually, I wanted to ask about the proliferation of __GFP_NOMEMALLOC
in the block layer.  Am I right in thinking it really has no effect
unless GFP_ATOMIC is set?  It seems like a magic flag that some driver
developers are sprinkling around randomly, so we probably need to clarify
the documentation on it.

What I was trying to do was just use the memalloc_nofoo API to control
what was going on and then the driver can just use GFP_KERNEL.  I should
probably have completed that thought before sending the patches out.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-25 13:10     ` Matthew Wilcox
@ 2020-06-25 13:34       ` Michal Hocko
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2020-06-25 13:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown

On Thu 25-06-20 14:10:55, Matthew Wilcox wrote:
> On Thu, Jun 25, 2020 at 02:40:17PM +0200, Michal Hocko wrote:
> > On Thu 25-06-20 12:31:22, Matthew Wilcox wrote:
> > > Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
> > > guarantees we will not sleep to reclaim memory.  Use it to simplify
> > > dm-bufio's allocations.
> > 
> > memalloc_nowait is a good idea! I suspect the primary usecase would be
> > vmalloc.
> 
> That's funny.  My use case is allocating page tables in an RCU protected
> page fault handler.  Jens' use case is allocating page cache.  This one
> is a vmalloc consumer (which is also indirectly page table allocation).
> 
> > > @@ -877,7 +857,9 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
> > >  	 */
> > >  	while (1) {
> > >  		if (dm_bufio_cache_size_latch != 1) {
> > > -			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > > +			unsigned nowait_flag = memalloc_nowait_save();
> > > +			b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > > +			memalloc_nowait_restore(nowait_flag);
> > 
> > This looks confusing though. I am not familiar with alloc_buffer and
> > there is quite some tweaking around __GFP_NORETRY in alloc_buffer_data
> > which I do not follow but GFP_KERNEL just struck my eyes. So why cannot
> > we have 
> > 		alloc_buffer(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> 
> Actually, I wanted to ask about the proliferation of __GFP_NOMEMALLOC
> in the block layer.  Am I right in thinking it really has no effect
> unless GFP_ATOMIC is set?

It does have an effect as an __GFP_MEMALLOC resp PF_MEMALLOC inhibitor.

> It seems like a magic flag that some driver
> developers are sprinkling around randomly, so we probably need to clarify
> the documentation on it.

Would the following make more sense?
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..014aa7a6d36a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -116,8 +116,9 @@ struct vm_area_struct;
  * Usage of a pre-allocated pool (e.g. mempool) should be always considered
  * before using this flag.
  *
- * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
- * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
+ * %__GFP_NOMEMALLOC is used to inhibit __GFP_MEMALLOC resp. process scope
+ * variant of it PF_MEMALLOC. So use this flag if the caller of the allocation
+ * context might contain one or the other.
  */
 #define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)
 #define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)

> What I was trying to do was just use the memalloc_nofoo API to control
> what was going on and then the driver can just use GFP_KERNEL.  I should
> probably have completed that thought before sending the patches out.

Yes the effect will be the same but it just really hit my eyes as this
was in the same diff. IMHO GFP_NOWAIT would be easier to grasp but
nothing I would dare to insist on.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] mm: Replace PF_MEMALLOC_NOFS with memalloc_nofs
  2020-06-25 11:31 ` [PATCH 4/6] mm: Replace PF_MEMALLOC_NOFS with memalloc_nofs Matthew Wilcox (Oracle)
@ 2020-06-25 13:35   ` Michal Hocko
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2020-06-25 13:35 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown

On Thu 25-06-20 12:31:20, Matthew Wilcox wrote:
> We're short on PF_* flags, so make memalloc_nofs its own bit where we
> have plenty of space.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

forgot to add
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  fs/iomap/buffered-io.c   |  2 +-
>  include/linux/sched.h    |  2 +-
>  include/linux/sched/mm.h | 13 ++++++-------
>  3 files changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index bcfc288dba3f..87d66c13bf5c 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1502,7 +1502,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  	 * Given that we do not allow direct reclaim to call us, we should
>  	 * never be called in a recursive filesystem reclaim context.
>  	 */
> -	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> +	if (WARN_ON_ONCE(current->memalloc_nofs))
>  		goto redirty;
>  
>  	/*
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index cf18a3d2bc4c..eaf36ae1fde2 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -802,6 +802,7 @@ struct task_struct {
>  	unsigned			in_memstall:1;
>  #endif
>  	unsigned			memalloc_noio:1;
> +	unsigned			memalloc_nofs:1;
>  
>  	unsigned long			atomic_flags; /* Flags requiring atomic access. */
>  
> @@ -1505,7 +1506,6 @@ extern struct pid *cad_pid;
>  #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
>  #define PF_FROZEN		0x00010000	/* Frozen for system suspend */
>  #define PF_KSWAPD		0x00020000	/* I am kswapd */
> -#define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
>  #define PF_LOCAL_THROTTLE	0x00100000	/* Throttle writes only against the bdi I write to,
>  						 * I am cleaning dirty pages from some other bdi. */
>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index b0089eadc367..08bc9d0606a8 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -175,20 +175,19 @@ static inline bool in_vfork(struct task_struct *tsk)
>  
>  /*
>   * Applies per-task gfp context to the given allocation flags.
> - * PF_MEMALLOC_NOFS implies GFP_NOFS
>   * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
>   */
>  static inline gfp_t current_gfp_context(gfp_t flags)
>  {
> -	if (unlikely(current->flags & (PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA) ||
> -		     current->memalloc_noio)) {
> +	if (unlikely((current->flags & PF_MEMALLOC_NOCMA) ||
> +		     current->memalloc_noio || current->memalloc_nofs)) {
>  		/*
>  		 * NOIO implies both NOIO and NOFS and it is a weaker context
>  		 * so always make sure it makes precedence
>  		 */
>  		if (current->memalloc_noio)
>  			flags &= ~(__GFP_IO | __GFP_FS);
> -		else if (current->flags & PF_MEMALLOC_NOFS)
> +		else if (current->memalloc_nofs)
>  			flags &= ~__GFP_FS;
>  #ifdef CONFIG_CMA
>  		if (current->flags & PF_MEMALLOC_NOCMA)
> @@ -254,8 +253,8 @@ static inline void memalloc_noio_restore(unsigned int flags)
>   */
>  static inline unsigned int memalloc_nofs_save(void)
>  {
> -	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> -	current->flags |= PF_MEMALLOC_NOFS;
> +	unsigned int flags = current->memalloc_nofs;
> +	current->memalloc_nofs = 1;
>  	return flags;
>  }
>  
> @@ -269,7 +268,7 @@ static inline unsigned int memalloc_nofs_save(void)
>   */
>  static inline void memalloc_nofs_restore(unsigned int flags)
>  {
> -	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
> +	current->memalloc_nofs = flags ? 1 : 0;
>  }
>  
>  static inline unsigned int memalloc_noreclaim_save(void)
> -- 
> 2.27.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/6] Overhaul memalloc_no*
  2020-06-25 11:31 [PATCH 0/6] Overhaul memalloc_no* Matthew Wilcox (Oracle)
                   ` (5 preceding siblings ...)
  2020-06-25 11:31 ` [PATCH 6/6] mm: Add memalloc_nowait Matthew Wilcox (Oracle)
@ 2020-06-25 18:48 ` Darrick J. Wong
  2020-06-25 20:34   ` Matthew Wilcox
  2020-06-25 20:36   ` Michal Hocko
  2020-06-26 15:02 ` Mikulas Patocka
  7 siblings, 2 replies; 41+ messages in thread
From: Darrick J. Wong @ 2020-06-25 18:48 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown, Yafang Shao

On Thu, Jun 25, 2020 at 12:31:16PM +0100, Matthew Wilcox (Oracle) wrote:
> I want a memalloc_nowait like we have memalloc_noio and memalloc_nofs
> for an upcoming patch series, and Jens also wants it for non-blocking
> io_uring.  It turns out we already have dm-bufio which could benefit
> from memalloc_nowait, so it may as well go into the tree now.
> 
> The biggest problem is that we're basically out of PF_ flags, so we need
> to find somewhere else to store the PF_MEMALLOC_NOWAIT flag.  It turns
> out the PF_ flags are really supposed to be used for flags which are
> accessed from other tasks, and the MEMALLOC flags are only going to
> be used by this task.  So shuffling everything around frees up some PF
> flags and generally makes the world a better place.

So, uh, how does this intersect with the patch "xfs: reintroduce
PF_FSTRANS for transaction reservation recursion protection" that
re-adds PF_TRANS because uh I guess we lost some subtlety or another at
some point?

I don't have any strong opinion on this series one way or another, but
seeing as that patch was generating discussion about how PF_MEMALLOC_NO*
is not quite the same as PF_TRANS, I kinda want all these competing PF
tweaks and reworks to come together into a single series to review,
rather than the four(?) different people submitting conflicting changes.

[adding Yafang Shao to cc]

--D

> Patch series also available from
> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc
> 
> Matthew Wilcox (Oracle) (6):
>   mm: Replace PF_MEMALLOC_NOIO with memalloc_noio
>   mm: Add become_kswapd and restore_kswapd
>   xfs: Convert to memalloc_nofs_save
>   mm: Replace PF_MEMALLOC_NOFS with memalloc_nofs
>   mm: Replace PF_MEMALLOC_NOIO with memalloc_nocma
>   mm: Add memalloc_nowait
> 
>  drivers/block/loop.c           |  3 +-
>  drivers/md/dm-bufio.c          | 30 ++++--------
>  drivers/md/dm-zoned-metadata.c |  5 +-
>  fs/iomap/buffered-io.c         |  2 +-
>  fs/xfs/kmem.c                  |  2 +-
>  fs/xfs/libxfs/xfs_btree.c      | 14 +++---
>  fs/xfs/xfs_aops.c              |  4 +-
>  fs/xfs/xfs_buf.c               |  2 +-
>  fs/xfs/xfs_linux.h             |  6 ---
>  fs/xfs/xfs_trans.c             | 14 +++---
>  fs/xfs/xfs_trans.h             |  2 +-
>  include/linux/sched.h          |  7 +--
>  include/linux/sched/mm.h       | 84 ++++++++++++++++++++++++++--------
>  kernel/sys.c                   |  8 ++--
>  mm/vmscan.c                    | 16 +------
>  15 files changed, 105 insertions(+), 94 deletions(-)
> 
> -- 
> 2.27.0
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-25 11:31 ` [PATCH 6/6] mm: Add memalloc_nowait Matthew Wilcox (Oracle)
  2020-06-25 12:40   ` Michal Hocko
@ 2020-06-25 19:05   ` kernel test robot
  2020-06-25 23:51   ` kernel test robot
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: kernel test robot @ 2020-06-25 19:05 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), linux-kernel, linux-mm
  Cc: kbuild-all, Matthew Wilcox (Oracle),
	linux-xfs, dm-devel, Mikulas Patocka, Jens Axboe, NeilBrown

[-- Attachment #1: Type: text/plain, Size: 4141 bytes --]

Hi "Matthew,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on dm/for-next linus/master v5.8-rc2]
[cannot apply to xfs-linux/for-next next-20200625]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matthew-Wilcox-Oracle/Overhaul-memalloc_no/20200625-193357
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 87e867b4269f29dac8190bca13912d08163a277f
config: nds32-randconfig-r011-20200624 (attached as .config)
compiler: nds32le-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=nds32 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/md/dm-bufio.c: In function '__alloc_buffer_wait_no_callback':
>> drivers/md/dm-bufio.c:860:27: error: implicit declaration of function 'memalloc_nowait_save'; did you mean 'memalloc_noio_save'? [-Werror=implicit-function-declaration]
     860 |    unsigned nowait_flag = memalloc_nowait_save();
         |                           ^~~~~~~~~~~~~~~~~~~~
         |                           memalloc_noio_save
>> drivers/md/dm-bufio.c:862:4: error: implicit declaration of function 'memalloc_nowait_restore'; did you mean 'memalloc_noio_restore'? [-Werror=implicit-function-declaration]
     862 |    memalloc_nowait_restore(nowait_flag);
         |    ^~~~~~~~~~~~~~~~~~~~~~~
         |    memalloc_noio_restore
   cc1: some warnings being treated as errors

vim +860 drivers/md/dm-bufio.c

   836	
   837	/*
   838	 * Allocate a new buffer. If the allocation is not possible, wait until
   839	 * some other thread frees a buffer.
   840	 *
   841	 * May drop the lock and regain it.
   842	 */
   843	static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client *c, enum new_flag nf)
   844	{
   845		struct dm_buffer *b;
   846		bool tried_noio_alloc = false;
   847	
   848		/*
   849		 * dm-bufio is resistant to allocation failures (it just keeps
   850		 * one buffer reserved in cases all the allocations fail).
   851		 * So set flags to not try too hard:
   852		 *	__GFP_NOMEMALLOC: don't use emergency reserves
   853		 *	__GFP_NOWARN: don't print a warning in case of failure
   854		 *
   855		 * For debugging, if we set the cache size to 1, no new buffers will
   856		 * be allocated.
   857		 */
   858		while (1) {
   859			if (dm_bufio_cache_size_latch != 1) {
 > 860				unsigned nowait_flag = memalloc_nowait_save();
   861				b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
 > 862				memalloc_nowait_restore(nowait_flag);
   863				if (b)
   864					return b;
   865			}
   866	
   867			if (nf == NF_PREFETCH)
   868				return NULL;
   869	
   870			if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
   871				unsigned noio_flag;
   872	
   873				dm_bufio_unlock(c);
   874				noio_flag = memalloc_noio_save();
   875				b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
   876				memalloc_noio_restore(noio_flag);
   877				dm_bufio_lock(c);
   878				if (b)
   879					return b;
   880				tried_noio_alloc = true;
   881			}
   882	
   883			if (!list_empty(&c->reserved_buffers)) {
   884				b = list_entry(c->reserved_buffers.next,
   885					       struct dm_buffer, lru_list);
   886				list_del(&b->lru_list);
   887				c->need_reserved_buffers++;
   888	
   889				return b;
   890			}
   891	
   892			b = __get_unclaimed_buffer(c);
   893			if (b)
   894				return b;
   895	
   896			__wait_for_free_buffer(c);
   897		}
   898	}
   899	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 21838 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/6] Overhaul memalloc_no*
  2020-06-25 18:48 ` [PATCH 0/6] Overhaul memalloc_no* Darrick J. Wong
@ 2020-06-25 20:34   ` Matthew Wilcox
  2020-06-25 20:36   ` Michal Hocko
  1 sibling, 0 replies; 41+ messages in thread
From: Matthew Wilcox @ 2020-06-25 20:34 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown, Yafang Shao

On Thu, Jun 25, 2020 at 11:48:32AM -0700, Darrick J. Wong wrote:
> On Thu, Jun 25, 2020 at 12:31:16PM +0100, Matthew Wilcox (Oracle) wrote:
> > I want a memalloc_nowait like we have memalloc_noio and memalloc_nofs
> > for an upcoming patch series, and Jens also wants it for non-blocking
> > io_uring.  It turns out we already have dm-bufio which could benefit
> > from memalloc_nowait, so it may as well go into the tree now.
> > 
> > The biggest problem is that we're basically out of PF_ flags, so we need
> > to find somewhere else to store the PF_MEMALLOC_NOWAIT flag.  It turns
> > out the PF_ flags are really supposed to be used for flags which are
> > accessed from other tasks, and the MEMALLOC flags are only going to
> > be used by this task.  So shuffling everything around frees up some PF
> > flags and generally makes the world a better place.
> 
> So, uh, how does this intersect with the patch "xfs: reintroduce
> PF_FSTRANS for transaction reservation recursion protection" that
> re-adds PF_TRANS because uh I guess we lost some subtlety or another at
> some point?

I don't know.  I read that thread, but I couldn't follow the argument.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/6] Overhaul memalloc_no*
  2020-06-25 18:48 ` [PATCH 0/6] Overhaul memalloc_no* Darrick J. Wong
  2020-06-25 20:34   ` Matthew Wilcox
@ 2020-06-25 20:36   ` Michal Hocko
  2020-06-25 20:40     ` Matthew Wilcox
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2020-06-25 20:36 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox (Oracle),
	linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown, Yafang Shao

On Thu 25-06-20 11:48:32, Darrick J. Wong wrote:
> On Thu, Jun 25, 2020 at 12:31:16PM +0100, Matthew Wilcox (Oracle) wrote:
> > I want a memalloc_nowait like we have memalloc_noio and memalloc_nofs
> > for an upcoming patch series, and Jens also wants it for non-blocking
> > io_uring.  It turns out we already have dm-bufio which could benefit
> > from memalloc_nowait, so it may as well go into the tree now.
> > 
> > The biggest problem is that we're basically out of PF_ flags, so we need
> > to find somewhere else to store the PF_MEMALLOC_NOWAIT flag.  It turns
> > out the PF_ flags are really supposed to be used for flags which are
> > accessed from other tasks, and the MEMALLOC flags are only going to
> > be used by this task.  So shuffling everything around frees up some PF
> > flags and generally makes the world a better place.
> 
> So, uh, how does this intersect with the patch "xfs: reintroduce
> PF_FSTRANS for transaction reservation recursion protection" that
> re-adds PF_TRANS because uh I guess we lost some subtlety or another at
> some point?

This is independent, really. It just relocates the NOFS flag. PF_TRANS
is reintroduced for a different reason. When I have replaced the
original PF_TRANS by PF_MEMALLOC_NOFS I didn't realized that xfs doesn't
need only the NOFS semantic but also the transaction tracking so this
cannot be a single bit only. So it has to be added back. But
PF_MEMALLOC_NOFS needs to stay for the scoped NOFS semantic.

Hope this clarifies it a bit.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/6] Overhaul memalloc_no*
  2020-06-25 20:36   ` Michal Hocko
@ 2020-06-25 20:40     ` Matthew Wilcox
  0 siblings, 0 replies; 41+ messages in thread
From: Matthew Wilcox @ 2020-06-25 20:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Darrick J. Wong, linux-kernel, linux-mm, linux-xfs, dm-devel,
	Mikulas Patocka, Jens Axboe, NeilBrown, Yafang Shao

On Thu, Jun 25, 2020 at 10:36:11PM +0200, Michal Hocko wrote:
> On Thu 25-06-20 11:48:32, Darrick J. Wong wrote:
> > On Thu, Jun 25, 2020 at 12:31:16PM +0100, Matthew Wilcox (Oracle) wrote:
> > > I want a memalloc_nowait like we have memalloc_noio and memalloc_nofs
> > > for an upcoming patch series, and Jens also wants it for non-blocking
> > > io_uring.  It turns out we already have dm-bufio which could benefit
> > > from memalloc_nowait, so it may as well go into the tree now.
> > > 
> > > The biggest problem is that we're basically out of PF_ flags, so we need
> > > to find somewhere else to store the PF_MEMALLOC_NOWAIT flag.  It turns
> > > out the PF_ flags are really supposed to be used for flags which are
> > > accessed from other tasks, and the MEMALLOC flags are only going to
> > > be used by this task.  So shuffling everything around frees up some PF
> > > flags and generally makes the world a better place.
> > 
> > So, uh, how does this intersect with the patch "xfs: reintroduce
> > PF_FSTRANS for transaction reservation recursion protection" that
> > re-adds PF_TRANS because uh I guess we lost some subtlety or another at
> > some point?
> 
> This is independent, really. It just relocates the NOFS flag. PF_TRANS
> is reintroduced for a different reason. When I have replaced the
> original PF_TRANS by PF_MEMALLOC_NOFS I didn't realized that xfs doesn't
> need only the NOFS semantic but also the transaction tracking so this
> cannot be a single bit only. So it has to be added back. But
> PF_MEMALLOC_NOFS needs to stay for the scoped NOFS semantic.

If XFS needs to track transactions, why doesn't it use
current->journal_info like btrfs/ceph/ext4/gfs2/nilfs2/reiserfs?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-25 11:31 ` [PATCH 6/6] mm: Add memalloc_nowait Matthew Wilcox (Oracle)
  2020-06-25 12:40   ` Michal Hocko
  2020-06-25 19:05   ` kernel test robot
@ 2020-06-25 23:51   ` kernel test robot
  2020-06-29  5:08   ` Mike Rapoport
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: kernel test robot @ 2020-06-25 23:51 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), linux-kernel, linux-mm
  Cc: kbuild-all, clang-built-linux, Matthew Wilcox (Oracle),
	linux-xfs, dm-devel, Mikulas Patocka, Jens Axboe, NeilBrown

[-- Attachment #1: Type: text/plain, Size: 4569 bytes --]

Hi "Matthew,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on dm/for-next linus/master v5.8-rc2]
[cannot apply to xfs-linux/for-next next-20200625]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matthew-Wilcox-Oracle/Overhaul-memalloc_no/20200625-193357
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 87e867b4269f29dac8190bca13912d08163a277f
config: x86_64-allyesconfig (attached as .config)
compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 8911a35180c6777188fefe0954a2451a2b91deaf)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> drivers/md/dm-bufio.c:860:27: error: implicit declaration of function 'memalloc_nowait_save' [-Werror,-Wimplicit-function-declaration]
                           unsigned nowait_flag = memalloc_nowait_save();
                                                  ^
   drivers/md/dm-bufio.c:860:27: note: did you mean 'memalloc_noio_save'?
   include/linux/sched/mm.h:227:28: note: 'memalloc_noio_save' declared here
   static inline unsigned int memalloc_noio_save(void)
                              ^
>> drivers/md/dm-bufio.c:862:4: error: implicit declaration of function 'memalloc_nowait_restore' [-Werror,-Wimplicit-function-declaration]
                           memalloc_nowait_restore(nowait_flag);
                           ^
   drivers/md/dm-bufio.c:862:4: note: did you mean 'memalloc_noio_restore'?
   include/linux/sched/mm.h:242:20: note: 'memalloc_noio_restore' declared here
   static inline void memalloc_noio_restore(unsigned int flags)
                      ^
   2 errors generated.

vim +/memalloc_nowait_save +860 drivers/md/dm-bufio.c

   836	
   837	/*
   838	 * Allocate a new buffer. If the allocation is not possible, wait until
   839	 * some other thread frees a buffer.
   840	 *
   841	 * May drop the lock and regain it.
   842	 */
   843	static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client *c, enum new_flag nf)
   844	{
   845		struct dm_buffer *b;
   846		bool tried_noio_alloc = false;
   847	
   848		/*
   849		 * dm-bufio is resistant to allocation failures (it just keeps
   850		 * one buffer reserved in cases all the allocations fail).
   851		 * So set flags to not try too hard:
   852		 *	__GFP_NOMEMALLOC: don't use emergency reserves
   853		 *	__GFP_NOWARN: don't print a warning in case of failure
   854		 *
   855		 * For debugging, if we set the cache size to 1, no new buffers will
   856		 * be allocated.
   857		 */
   858		while (1) {
   859			if (dm_bufio_cache_size_latch != 1) {
 > 860				unsigned nowait_flag = memalloc_nowait_save();
   861				b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
 > 862				memalloc_nowait_restore(nowait_flag);
   863				if (b)
   864					return b;
   865			}
   866	
   867			if (nf == NF_PREFETCH)
   868				return NULL;
   869	
   870			if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
   871				unsigned noio_flag;
   872	
   873				dm_bufio_unlock(c);
   874				noio_flag = memalloc_noio_save();
   875				b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
   876				memalloc_noio_restore(noio_flag);
   877				dm_bufio_lock(c);
   878				if (b)
   879					return b;
   880				tried_noio_alloc = true;
   881			}
   882	
   883			if (!list_empty(&c->reserved_buffers)) {
   884				b = list_entry(c->reserved_buffers.next,
   885					       struct dm_buffer, lru_list);
   886				list_del(&b->lru_list);
   887				c->need_reserved_buffers++;
   888	
   889				return b;
   890			}
   891	
   892			b = __get_unclaimed_buffer(c);
   893			if (b)
   894				return b;
   895	
   896			__wait_for_free_buffer(c);
   897		}
   898	}
   899	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 75285 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/6] Overhaul memalloc_no*
  2020-06-25 11:31 [PATCH 0/6] Overhaul memalloc_no* Matthew Wilcox (Oracle)
                   ` (6 preceding siblings ...)
  2020-06-25 18:48 ` [PATCH 0/6] Overhaul memalloc_no* Darrick J. Wong
@ 2020-06-26 15:02 ` Mikulas Patocka
  2020-06-26 23:08   ` Dave Chinner
  7 siblings, 1 reply; 41+ messages in thread
From: Mikulas Patocka @ 2020-06-26 15:02 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Jens Axboe, NeilBrown

Hi

I suggest to join memalloc_noio and memalloc_nofs into just one flag that 
prevents both filesystem recursion and i/o recursion.

Note that any I/O can recurse into a filesystem via the loop device, thus 
it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is set 
and PF_MEMALLOC_NOIO is not set.

Mikulas

On Thu, 25 Jun 2020, Matthew Wilcox (Oracle) wrote:

> I want a memalloc_nowait like we have memalloc_noio and memalloc_nofs
> for an upcoming patch series, and Jens also wants it for non-blocking
> io_uring.  It turns out we already have dm-bufio which could benefit
> from memalloc_nowait, so it may as well go into the tree now.
> 
> The biggest problem is that we're basically out of PF_ flags, so we need
> to find somewhere else to store the PF_MEMALLOC_NOWAIT flag.  It turns
> out the PF_ flags are really supposed to be used for flags which are
> accessed from other tasks, and the MEMALLOC flags are only going to
> be used by this task.  So shuffling everything around frees up some PF
> flags and generally makes the world a better place.
> 
> Patch series also available from
> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc
> 
> Matthew Wilcox (Oracle) (6):
>   mm: Replace PF_MEMALLOC_NOIO with memalloc_noio
>   mm: Add become_kswapd and restore_kswapd
>   xfs: Convert to memalloc_nofs_save
>   mm: Replace PF_MEMALLOC_NOFS with memalloc_nofs
>   mm: Replace PF_MEMALLOC_NOIO with memalloc_nocma
>   mm: Add memalloc_nowait
> 
>  drivers/block/loop.c           |  3 +-
>  drivers/md/dm-bufio.c          | 30 ++++--------
>  drivers/md/dm-zoned-metadata.c |  5 +-
>  fs/iomap/buffered-io.c         |  2 +-
>  fs/xfs/kmem.c                  |  2 +-
>  fs/xfs/libxfs/xfs_btree.c      | 14 +++---
>  fs/xfs/xfs_aops.c              |  4 +-
>  fs/xfs/xfs_buf.c               |  2 +-
>  fs/xfs/xfs_linux.h             |  6 ---
>  fs/xfs/xfs_trans.c             | 14 +++---
>  fs/xfs/xfs_trans.h             |  2 +-
>  include/linux/sched.h          |  7 +--
>  include/linux/sched/mm.h       | 84 ++++++++++++++++++++++++++--------
>  kernel/sys.c                   |  8 ++--
>  mm/vmscan.c                    | 16 +------
>  15 files changed, 105 insertions(+), 94 deletions(-)
> 
> -- 
> 2.27.0
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/6] Overhaul memalloc_no*
  2020-06-26 15:02 ` Mikulas Patocka
@ 2020-06-26 23:08   ` Dave Chinner
  2020-06-27 13:09     ` Mikulas Patocka
  2020-06-29  8:22     ` [PATCH 0/6] Overhaul memalloc_no* Michal Hocko
  0 siblings, 2 replies; 41+ messages in thread
From: Dave Chinner @ 2020-06-26 23:08 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Matthew Wilcox (Oracle),
	linux-kernel, linux-mm, linux-xfs, dm-devel, Jens Axboe,
	NeilBrown

On Fri, Jun 26, 2020 at 11:02:19AM -0400, Mikulas Patocka wrote:
> Hi
> 
> I suggest to join memalloc_noio and memalloc_nofs into just one flag that 
> prevents both filesystem recursion and i/o recursion.
> 
> Note that any I/O can recurse into a filesystem via the loop device, thus 
> it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is set 
> and PF_MEMALLOC_NOIO is not set.

Correct me if I'm wrong, but I think that will prevent swapping from
GFP_NOFS memory reclaim contexts. IOWs, this will substantially
change the behaviour of the memory reclaim system under sustained
GFP_NOFS memory pressure. Sustained GFP_NOFS memory pressure is
quite common, so I really don't think we want to telling memory
reclaim "you can't do IO at all" when all we are trying to do is
prevent recursion back into the same filesystem.

Given that the loop device IO path already operates under
memalloc_noio context, (i.e. the recursion restriction is applied in
only the context that needs is) I see no reason for making that a
global reclaim limitation....

In reality, we need to be moving the other way with GFP_NOFS - to
fine grained anti-recursion contexts, not more broad contexts.

That is, GFP_NOFS prevents recursion into any filesystem, not just
the one that we are actively operating on and needing to prevent
recursion back into. We can safely have reclaim do relcaim work on
other filesysetms without fear of recursion deadlocks, but the
memory reclaim infrastructure does not provide that capability.(*)

e.g. if memalloc_nofs_save() took a reclaim context structure that
the filesystem put the superblock, the superblock's nesting depth
(because layering on loop devices can create cross-filesystem
recursion dependencies), and any other filesyetm private data the
fs wanted to add, we could actually have reclaim only avoid reclaim
from filesytsems where there is a deadlock possiblity. e.g:

	- superblock nesting depth is different, apply GFP_NOFS
	  reclaim unconditionally
	- superblock different apply GFP_KERNEL reclaim
	- superblock the same, pass context to filesystem to
	  decide if reclaim from the sueprblock is safe.

At this point, we get memory reclaim able to always be able to
reclaim from filesystems that are not at risk of recursion
deadlocks. Direct reclaim is much more likely to be able to make
progress now because it is much less restricted in what it can
reclaim. That's going to make direct relcaim faster and more
efficient, and taht's the ultimate goal we are aiming to acheive
here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/6] Overhaul memalloc_no*
  2020-06-26 23:08   ` Dave Chinner
@ 2020-06-27 13:09     ` Mikulas Patocka
  2020-06-29  0:35       ` Dave Chinner
  2020-06-29  8:22     ` [PATCH 0/6] Overhaul memalloc_no* Michal Hocko
  1 sibling, 1 reply; 41+ messages in thread
From: Mikulas Patocka @ 2020-06-27 13:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox (Oracle),
	linux-kernel, linux-mm, linux-xfs, dm-devel, Jens Axboe,
	NeilBrown



On Sat, 27 Jun 2020, Dave Chinner wrote:

> On Fri, Jun 26, 2020 at 11:02:19AM -0400, Mikulas Patocka wrote:
> > Hi
> > 
> > I suggest to join memalloc_noio and memalloc_nofs into just one flag that 
> > prevents both filesystem recursion and i/o recursion.
> > 
> > Note that any I/O can recurse into a filesystem via the loop device, thus 
> > it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is set 
> > and PF_MEMALLOC_NOIO is not set.
> 
> Correct me if I'm wrong, but I think that will prevent swapping from
> GFP_NOFS memory reclaim contexts.

Yes.

> IOWs, this will substantially
> change the behaviour of the memory reclaim system under sustained
> GFP_NOFS memory pressure. Sustained GFP_NOFS memory pressure is
> quite common, so I really don't think we want to telling memory
> reclaim "you can't do IO at all" when all we are trying to do is
> prevent recursion back into the same filesystem.

So, we can define __GFP_ONLY_SWAP_IO and __GFP_IO.

> Given that the loop device IO path already operates under
> memalloc_noio context, (i.e. the recursion restriction is applied in
> only the context that needs is) I see no reason for making that a
> global reclaim limitation....

I think this is a problem.

Suppose that a filesystem does GFP_NOFS allocation, the allocation 
triggers an IO and waits for it to finish, the loop device driver 
redirects the IO to the same filesystem that did the GFP_NOFS allocation.

I saw this deadlock in the past in the dm-bufio subsystem - see the commit 
9d28eb12447ee08bb5d1e8bb3195cf20e1ecd1c0 that fixed it.

Other subsystems that do IO in GFP_NOFS context may deadlock just like 
bufio.

Mikulas



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/6] Overhaul memalloc_no*
  2020-06-27 13:09     ` Mikulas Patocka
@ 2020-06-29  0:35       ` Dave Chinner
  2020-06-29 13:43         ` Mikulas Patocka
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2020-06-29  0:35 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Matthew Wilcox (Oracle),
	linux-kernel, linux-mm, linux-xfs, dm-devel, Jens Axboe,
	NeilBrown

On Sat, Jun 27, 2020 at 09:09:09AM -0400, Mikulas Patocka wrote:
> 
> 
> On Sat, 27 Jun 2020, Dave Chinner wrote:
> 
> > On Fri, Jun 26, 2020 at 11:02:19AM -0400, Mikulas Patocka wrote:
> > > Hi
> > > 
> > > I suggest to join memalloc_noio and memalloc_nofs into just one flag that 
> > > prevents both filesystem recursion and i/o recursion.
> > > 
> > > Note that any I/O can recurse into a filesystem via the loop device, thus 
> > > it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is set 
> > > and PF_MEMALLOC_NOIO is not set.
> > 
> > Correct me if I'm wrong, but I think that will prevent swapping from
> > GFP_NOFS memory reclaim contexts.
> 
> Yes.
> 
> > IOWs, this will substantially
> > change the behaviour of the memory reclaim system under sustained
> > GFP_NOFS memory pressure. Sustained GFP_NOFS memory pressure is
> > quite common, so I really don't think we want to telling memory
> > reclaim "you can't do IO at all" when all we are trying to do is
> > prevent recursion back into the same filesystem.
> 
> So, we can define __GFP_ONLY_SWAP_IO and __GFP_IO.

Uh, why?

Exactly what problem are you trying to solve here?

> > Given that the loop device IO path already operates under
> > memalloc_noio context, (i.e. the recursion restriction is applied in
> > only the context that needs is) I see no reason for making that a
> > global reclaim limitation....
> 
> I think this is a problem.
> 
> Suppose that a filesystem does GFP_NOFS allocation, the allocation 
> triggers an IO and waits for it to finish, the loop device driver 
> redirects the IO to the same filesystem that did the GFP_NOFS allocation.

The loop device IO path is under memalloc_noio. By -definition-,
allocations in that context cannot recurse back into filesystem
level reclaim.

So either your aren't explaining the problem you are trying to solve
clearly, or you're talking about allocations in the IO path that are
broken because they don't use GFP_NOIO correctly...

> I saw this deadlock in the past in the dm-bufio subsystem - see the commit 
> 9d28eb12447ee08bb5d1e8bb3195cf20e1ecd1c0 that fixed it.

2014?

/me looks closer.

Hmmm. Only sent to dm-devel, no comments, no review, just merged.
No surprise that nobody else actually knows about this commit. Well,
time to review it ~6 years after it was merged....

| dm-bufio tested for __GFP_IO. However, dm-bufio can run on a loop block
| device that makes calls into the filesystem. If __GFP_IO is present and
| __GFP_FS isn't, dm-bufio could still block on filesystem operations if it
| runs on a loop block device.

OK, so from an architectural POV, this commit is fundamentally
broken - block/device layer allocation should not allow relcaim
recursion into filesystems because filesystems are dependent on
the block layer making forwards progress. This commit is trying to
work around the loop device doing GFP_KERNEL/GFP_NOFS context
allocation back end IO path of the loop device. This part of the
loop device is a block device, so needs to run under GFP_NOIO
context.

IOWs, this commit just papered over the reclaim context layering
violation in the loop device by trying to avoid blocking filesystem
IO in the dm-bufio shrinker context just in case it was IO from a
loop device that was incorrectly tagged as GFP_KERNEL.

So, step forward 5 years to 2019, and this change was made:

commit d0a255e795ab976481565f6ac178314b34fbf891
Author: Mikulas Patocka <mpatocka@redhat.com>
Date:   Thu Aug 8 11:17:01 2019 -0400

    loop: set PF_MEMALLOC_NOIO for the worker thread
    
    A deadlock with this stacktrace was observed.
    
    The loop thread does a GFP_KERNEL allocation, it calls into dm-bufio
    shrinker and the shrinker depends on I/O completion in the dm-bufio
    subsystem.
    
    In order to fix the deadlock (and other similar ones), we set the flag
    PF_MEMALLOC_NOIO at loop thread entry.
    
    PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
       #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
       #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
       #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
       #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
       #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
       #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
       #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
       #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
       #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
       #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
      #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
      #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
      #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
      #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
      #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242
    
      PID: 14127  TASK: ffff881455749c00  CPU: 11  COMMAND: "loop1"
       #0 [ffff88272f5af228] __schedule at ffffffff8173f405
       #1 [ffff88272f5af280] schedule at ffffffff8173fa27
       #2 [ffff88272f5af2a0] schedule_preempt_disabled at ffffffff8173fd5e
       #3 [ffff88272f5af2b0] __mutex_lock_slowpath at ffffffff81741fb5
       #4 [ffff88272f5af330] mutex_lock at ffffffff81742133
       #5 [ffff88272f5af350] dm_bufio_shrink_count at ffffffffa03865f9 [dm_bufio]
       #6 [ffff88272f5af380] shrink_slab at ffffffff811a86bd
       #7 [ffff88272f5af470] shrink_zone at ffffffff811ad778
       #8 [ffff88272f5af500] do_try_to_free_pages at ffffffff811adb34
       #9 [ffff88272f5af590] try_to_free_pages at ffffffff811adef8
      #10 [ffff88272f5af610] __alloc_pages_nodemask at ffffffff811a09c3
      #11 [ffff88272f5af710] alloc_pages_current at ffffffff811e8b71
      #12 [ffff88272f5af760] new_slab at ffffffff811f4523
      #13 [ffff88272f5af7b0] __slab_alloc at ffffffff8173a1b5
      #14 [ffff88272f5af880] kmem_cache_alloc at ffffffff811f484b
      #15 [ffff88272f5af8d0] do_blockdev_direct_IO at ffffffff812535b3
      #16 [ffff88272f5afb00] __blockdev_direct_IO at ffffffff81255dc3
      #17 [ffff88272f5afb30] xfs_vm_direct_IO at ffffffffa01fe3fc [xfs]
      #18 [ffff88272f5afb90] generic_file_read_iter at ffffffff81198994
      #19 [ffff88272f5afc50] __dta_xfs_file_read_iter_2398 at ffffffffa020c970 [xfs]
      #20 [ffff88272f5afcc0] lo_rw_aio at ffffffffa0377042 [loop]
      #21 [ffff88272f5afd70] loop_queue_work at ffffffffa0377c3b [loop]
      #22 [ffff88272f5afe60] kthread_worker_fn at ffffffff810a8a0c
      #23 [ffff88272f5afec0] kthread at ffffffff810a8428
      #24 [ffff88272f5aff50] ret_from_fork at ffffffff81745242
    
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

That's the *same bug* as the 2014 commit was trying to address.  But
because the 2014 commit didn't actually address the underlying
architectural layering issue in the loop device, the problem was
still there.

The 2019 commit corrected the allocation context of the loop device
to unconditionally use GFP_NOIO, and so prevents recursion back into
both the filesystem and the block/device layers from the loop device
IO path. Hence reclaim contexts are layered correctly again, and the
deadlock in the dm-bufio code goes away.

And that means you probably should revert the 2014 commit because
it's a nasty layering violation that never should have been made....

> Other subsystems that do IO in GFP_NOFS context may deadlock just like 
> bufio.

Then they are as buggy as the dm-bufio code. Shrinkers needing to be
aware of reclaim contexts above the layer the shrinker belongs to is
a Big Red Flag that indicates something is violating reclaim
recursion rules. i.e. that an allocation is simply not using
GFP_NOFS/GFP_NOIO correctly. That's not a bug in the shrinker,
that's a bug in the code that is doing the memory allocation.

I still don't see a need for more reclaim recursion layers here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-25 11:31 ` [PATCH 6/6] mm: Add memalloc_nowait Matthew Wilcox (Oracle)
                     ` (2 preceding siblings ...)
  2020-06-25 23:51   ` kernel test robot
@ 2020-06-29  5:08   ` Mike Rapoport
  2020-06-29 12:18     ` Matthew Wilcox
  2020-09-24  0:39   ` Mike Snitzer
  2020-10-23 14:49   ` Daniel Vetter
  5 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2020-06-29  5:08 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown

On Thu, Jun 25, 2020 at 12:31:22PM +0100, Matthew Wilcox (Oracle) wrote:
> Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
> guarantees we will not sleep to reclaim memory.  Use it to simplify
> dm-bufio's allocations.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  drivers/md/dm-bufio.c    | 30 ++++++++----------------------
>  include/linux/sched.h    |  1 +
>  include/linux/sched/mm.h | 12 ++++++++----
>  3 files changed, 17 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 6d1565021d74..140ada9a2c8f 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -412,23 +412,6 @@ static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
>  
>  	*data_mode = DATA_MODE_VMALLOC;
>  
> -	/*
> -	 * __vmalloc allocates the data pages and auxiliary structures with
> -	 * gfp_flags that were specified, but pagetables are always allocated
> -	 * with GFP_KERNEL, no matter what was specified as gfp_mask.
> -	 *
> -	 * Consequently, we must set per-process flag PF_MEMALLOC_NOIO so that
> -	 * all allocations done by this process (including pagetables) are done
> -	 * as if GFP_NOIO was specified.
> -	 */
> -	if (gfp_mask & __GFP_NORETRY) {
> -		unsigned noio_flag = memalloc_noio_save();
> -		void *ptr = __vmalloc(c->block_size, gfp_mask);
> -
> -		memalloc_noio_restore(noio_flag);
> -		return ptr;
> -	}
> -
>  	return __vmalloc(c->block_size, gfp_mask);
>  }
>  
> @@ -866,9 +849,6 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>  	 * dm-bufio is resistant to allocation failures (it just keeps
>  	 * one buffer reserved in cases all the allocations fail).
>  	 * So set flags to not try too hard:
> -	 *	GFP_NOWAIT: don't wait; if we need to sleep we'll release our
> -	 *		    mutex and wait ourselves.
> -	 *	__GFP_NORETRY: don't retry and rather return failure
>  	 *	__GFP_NOMEMALLOC: don't use emergency reserves
>  	 *	__GFP_NOWARN: don't print a warning in case of failure
>  	 *
> @@ -877,7 +857,9 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>  	 */
>  	while (1) {
>  		if (dm_bufio_cache_size_latch != 1) {
> -			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			unsigned nowait_flag = memalloc_nowait_save();
> +			b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			memalloc_nowait_restore(nowait_flag);
>  			if (b)
>  				return b;
>  		}
> @@ -886,8 +868,12 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>  			return NULL;
>  
>  		if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
> +			unsigned noio_flag;
> +
>  			dm_bufio_unlock(c);
> -			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			noio_flag = memalloc_noio_save();

I've read the series twice and I'm still missing the definition of
memalloc_noio_save().

And also it would be nice to have a paragraph about it in
Documentation/core-api/memory-allocation.rst

> +			b = alloc_buffer(c, GFP_KERNEL |
> __GFP_NOMEMALLOC | __GFP_NOWARN); +
> memalloc_noio_restore(noio_flag); dm_bufio_lock(c); if (b)
>  				return b;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 90336850e940..b1c2cddd366c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -803,6 +803,7 @@ struct task_struct {
>  #endif
>  	unsigned			memalloc_noio:1;
>  	unsigned			memalloc_nofs:1;
> +	unsigned			memalloc_nowait:1;
>  	unsigned			memalloc_nocma:1;
>  
>  	unsigned long			atomic_flags; /* Flags requiring atomic access. */
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 6f7b59a848a6..6484569f50df 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -179,12 +179,16 @@ static inline bool in_vfork(struct task_struct *tsk)
>  static inline gfp_t current_gfp_context(gfp_t flags)
>  {
>  	if (unlikely(current->memalloc_noio || current->memalloc_nofs ||
> -		     current->memalloc_nocma)) {
> +		     current->memalloc_nocma) || current->memalloc_nowait) {
>  		/*
> -		 * NOIO implies both NOIO and NOFS and it is a weaker context
> -		 * so always make sure it makes precedence
> +		 * Clearing DIRECT_RECLAIM means we won't get to the point
> +		 * of testing IO or FS, so we don't need to bother clearing
> +		 * them.  noio implies neither IO nor FS and it is a weaker
> +		 * context so always make sure it takes precedence.
>  		 */
> -		if (current->memalloc_noio)
> +		if (current->memalloc_nowait)
> +			flags &= ~__GFP_DIRECT_RECLAIM;
> +		else if (current->memalloc_noio)
>  			flags &= ~(__GFP_IO | __GFP_FS);
>  		else if (current->memalloc_nofs)
>  			flags &= ~__GFP_FS;
> -- 
> 2.27.0
> 
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/6] Overhaul memalloc_no*
  2020-06-26 23:08   ` Dave Chinner
  2020-06-27 13:09     ` Mikulas Patocka
@ 2020-06-29  8:22     ` Michal Hocko
  1 sibling, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2020-06-29  8:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mikulas Patocka, Matthew Wilcox (Oracle),
	linux-kernel, linux-mm, linux-xfs, dm-devel, Jens Axboe,
	NeilBrown

On Sat 27-06-20 09:08:47, Dave Chinner wrote:
> On Fri, Jun 26, 2020 at 11:02:19AM -0400, Mikulas Patocka wrote:
> > Hi
> > 
> > I suggest to join memalloc_noio and memalloc_nofs into just one flag that 
> > prevents both filesystem recursion and i/o recursion.
> > 
> > Note that any I/O can recurse into a filesystem via the loop device, thus 
> > it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is set 
> > and PF_MEMALLOC_NOIO is not set.
> 
> Correct me if I'm wrong, but I think that will prevent swapping from
> GFP_NOFS memory reclaim contexts. IOWs, this will substantially
> change the behaviour of the memory reclaim system under sustained
> GFP_NOFS memory pressure. Sustained GFP_NOFS memory pressure is
> quite common, so I really don't think we want to telling memory
> reclaim "you can't do IO at all" when all we are trying to do is
> prevent recursion back into the same filesystem.
> 
> Given that the loop device IO path already operates under
> memalloc_noio context, (i.e. the recursion restriction is applied in
> only the context that needs is) I see no reason for making that a
> global reclaim limitation....
> 
> In reality, we need to be moving the other way with GFP_NOFS - to
> fine grained anti-recursion contexts, not more broad contexts.

Absolutely agreed! It is not really hard to see system struggling due to
heavy FS metadata workload while there are objects which could be
reclaimed.

> That is, GFP_NOFS prevents recursion into any filesystem, not just
> the one that we are actively operating on and needing to prevent
> recursion back into. We can safely have reclaim do relcaim work on
> other filesysetms without fear of recursion deadlocks, but the
> memory reclaim infrastructure does not provide that capability.(*)
> 
> e.g. if memalloc_nofs_save() took a reclaim context structure that
> the filesystem put the superblock, the superblock's nesting depth
> (because layering on loop devices can create cross-filesystem
> recursion dependencies), and any other filesyetm private data the
> fs wanted to add, we could actually have reclaim only avoid reclaim
> from filesytsems where there is a deadlock possiblity. e.g:
> 
> 	- superblock nesting depth is different, apply GFP_NOFS
> 	  reclaim unconditionally
> 	- superblock different apply GFP_KERNEL reclaim
> 	- superblock the same, pass context to filesystem to
> 	  decide if reclaim from the sueprblock is safe.
> 
> At this point, we get memory reclaim able to always be able to
> reclaim from filesystems that are not at risk of recursion
> deadlocks. Direct reclaim is much more likely to be able to make
> progress now because it is much less restricted in what it can
> reclaim. That's going to make direct relcaim faster and more
> efficient, and taht's the ultimate goal we are aiming to acheive
> here...

Yes, we have discussed something like that few years back at LSFMM IIRC.
The scoped NOFS/NOIO api was just a first step to reduce explicit
NOFS/NOIO usage with a hope that we will get no-recursion entry points
much more well defined and get rid of many instances where "this is a fs
code so it has to use NOFS gfp mask".

Some of that has happened and that is really great. On the other hand
many people still like to use that api as a workaround for an immediate
problem because no-recursion scopes are much harder to recognize unless
you are supper familiar with the specific fs/IO layer implementation.
So this is definitely not a project for somebody to go over all code and
just do the clean up.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-29  5:08   ` Mike Rapoport
@ 2020-06-29 12:18     ` Matthew Wilcox
  2020-06-29 12:52       ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2020-06-29 12:18 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown

On Mon, Jun 29, 2020 at 08:08:51AM +0300, Mike Rapoport wrote:
> > @@ -886,8 +868,12 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
> >  			return NULL;
> >  
> >  		if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
> > +			unsigned noio_flag;
> > +
> >  			dm_bufio_unlock(c);
> > -			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > +			noio_flag = memalloc_noio_save();
> 
> I've read the series twice and I'm still missing the definition of
> memalloc_noio_save().
> 
> And also it would be nice to have a paragraph about it in
> Documentation/core-api/memory-allocation.rst

Documentation/core-api/gfp_mask-from-fs-io.rst:``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
Documentation/core-api/gfp_mask-from-fs-io.rst:   :functions: memalloc_noio_save memalloc_noio_restore
Documentation/core-api/gfp_mask-from-fs-io.rst:allows nesting so it is safe to call ``memalloc_noio_save`` or



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-29 12:18     ` Matthew Wilcox
@ 2020-06-29 12:52       ` Michal Hocko
  2020-06-29 13:45         ` Mike Rapoport
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2020-06-29 12:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mike Rapoport, linux-kernel, linux-mm, linux-xfs, dm-devel,
	Mikulas Patocka, Jens Axboe, NeilBrown

On Mon 29-06-20 13:18:16, Matthew Wilcox wrote:
> On Mon, Jun 29, 2020 at 08:08:51AM +0300, Mike Rapoport wrote:
> > > @@ -886,8 +868,12 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
> > >  			return NULL;
> > >  
> > >  		if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
> > > +			unsigned noio_flag;
> > > +
> > >  			dm_bufio_unlock(c);
> > > -			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > > +			noio_flag = memalloc_noio_save();
> > 
> > I've read the series twice and I'm still missing the definition of
> > memalloc_noio_save().
> > 
> > And also it would be nice to have a paragraph about it in
> > Documentation/core-api/memory-allocation.rst
> 
> Documentation/core-api/gfp_mask-from-fs-io.rst:``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
> Documentation/core-api/gfp_mask-from-fs-io.rst:   :functions: memalloc_noio_save memalloc_noio_restore
> Documentation/core-api/gfp_mask-from-fs-io.rst:allows nesting so it is safe to call ``memalloc_noio_save`` or
 
The patch is adding memalloc_nowait* and I suspect Mike had that in
mind, which would be a fair request. Btw. we are missing memalloc_nocma*
documentation either - I was just reminded of its existence today...

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/6] Overhaul memalloc_no*
  2020-06-29  0:35       ` Dave Chinner
@ 2020-06-29 13:43         ` Mikulas Patocka
  2020-06-29 22:34           ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Mikulas Patocka @ 2020-06-29 13:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox (Oracle),
	linux-kernel, linux-mm, linux-xfs, dm-devel, Jens Axboe,
	NeilBrown



On Mon, 29 Jun 2020, Dave Chinner wrote:

> On Sat, Jun 27, 2020 at 09:09:09AM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Sat, 27 Jun 2020, Dave Chinner wrote:
> > 
> > > On Fri, Jun 26, 2020 at 11:02:19AM -0400, Mikulas Patocka wrote:
> > > > Hi
> > > > 
> > > > I suggest to join memalloc_noio and memalloc_nofs into just one flag that 
> > > > prevents both filesystem recursion and i/o recursion.
> > > > 
> > > > Note that any I/O can recurse into a filesystem via the loop device, thus 
> > > > it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is set 
> > > > and PF_MEMALLOC_NOIO is not set.
> > > 
> > > Correct me if I'm wrong, but I think that will prevent swapping from
> > > GFP_NOFS memory reclaim contexts.
> > 
> > Yes.
> > 
> > > IOWs, this will substantially
> > > change the behaviour of the memory reclaim system under sustained
> > > GFP_NOFS memory pressure. Sustained GFP_NOFS memory pressure is
> > > quite common, so I really don't think we want to telling memory
> > > reclaim "you can't do IO at all" when all we are trying to do is
> > > prevent recursion back into the same filesystem.
> > 
> > So, we can define __GFP_ONLY_SWAP_IO and __GFP_IO.
> 
> Uh, why?
> 
> Exactly what problem are you trying to solve here?

This:

1. The filesystem does a GFP_NOFS allocation.
2. The allocation calls directly a dm-bufio shrinker.
3. The dm-bufio shrinker sees that there is __GFP_IO set, so it assumes 
   that it can do I/O. It selects some dirty buffers, writes them back and 
   waits for the I/O to finish.
4. The dirty buffers belong to a loop device.
5. The loop device thread calls the filesystem that did the GFP_NOFS 
   allocation in step 1 (and that is still waiting for the allocation to 
   succeed).

Note that setting PF_MEMALLOC_NOIO on the loop thread won't help with this 
deadlock.

Do you argue that this is a bug in dm-bufio? Or a bug in the kernel? Or 
that it can't happen?

> > I saw this deadlock in the past in the dm-bufio subsystem - see the commit 
> > 9d28eb12447ee08bb5d1e8bb3195cf20e1ecd1c0 that fixed it.
> 
> 2014?
> 
> /me looks closer.
> 
> Hmmm. Only sent to dm-devel, no comments, no review, just merged.
> No surprise that nobody else actually knows about this commit. Well,
> time to review it ~6 years after it was merged....
> 
> | dm-bufio tested for __GFP_IO. However, dm-bufio can run on a loop block
> | device that makes calls into the filesystem. If __GFP_IO is present and
> | __GFP_FS isn't, dm-bufio could still block on filesystem operations if it
> | runs on a loop block device.
> 
> OK, so from an architectural POV, this commit is fundamentally
> broken - block/device layer allocation should not allow relcaim
> recursion into filesystems because filesystems are dependent on
> the block layer making forwards progress. This commit is trying to
> work around the loop device doing GFP_KERNEL/GFP_NOFS context
> allocation back end IO path of the loop device. This part of the
> loop device is a block device, so needs to run under GFP_NOIO
> context.

I agree that it is broken, but it fixes the above deadlock.

Mikulas



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-29 12:52       ` Michal Hocko
@ 2020-06-29 13:45         ` Mike Rapoport
  2020-06-29 21:28           ` Matthew Wilcox
  0 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2020-06-29 13:45 UTC (permalink / raw)
  To: Michal Hocko, Matthew Wilcox
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown



On June 29, 2020 3:52:31 PM GMT+03:00, Michal Hocko <mhocko@kernel.org> wrote:
>On Mon 29-06-20 13:18:16, Matthew Wilcox wrote:
>> On Mon, Jun 29, 2020 at 08:08:51AM +0300, Mike Rapoport wrote:
>> > > @@ -886,8 +868,12 @@ static struct dm_buffer
>*__alloc_buffer_wait_no_callback(struct dm_bufio_client
>> > >  			return NULL;
>> > >  
>> > >  		if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
>> > > +			unsigned noio_flag;
>> > > +
>> > >  			dm_bufio_unlock(c);
>> > > -			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY |
>__GFP_NOMEMALLOC | __GFP_NOWARN);
>> > > +			noio_flag = memalloc_noio_save();
>> > 
>> > I've read the series twice and I'm still missing the definition of
>> > memalloc_noio_save().
>> > 
>> > And also it would be nice to have a paragraph about it in
>> > Documentation/core-api/memory-allocation.rst
>> 
>>
>Documentation/core-api/gfp_mask-from-fs-io.rst:``memalloc_nofs_save``,
>``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
>> Documentation/core-api/gfp_mask-from-fs-io.rst:   :functions:
>memalloc_noio_save memalloc_noio_restore
>> Documentation/core-api/gfp_mask-from-fs-io.rst:allows nesting so it
>is safe to call ``memalloc_noio_save`` or
> 
>The patch is adding memalloc_nowait* and I suspect Mike had that in
>mind, which would be a fair request. 

Right, sorry misprinted that.

> Btw. we are missing
>memalloc_nocma*
>documentation either - I was just reminded of its existence today..
-- 
Sincerely yours,
Mike


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-29 13:45         ` Mike Rapoport
@ 2020-06-29 21:28           ` Matthew Wilcox
  2020-06-30  6:34             ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2020-06-29 21:28 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Michal Hocko, linux-kernel, linux-mm, linux-xfs, dm-devel,
	Mikulas Patocka, Jens Axboe, NeilBrown

On Mon, Jun 29, 2020 at 04:45:14PM +0300, Mike Rapoport wrote:
> 
> 
> On June 29, 2020 3:52:31 PM GMT+03:00, Michal Hocko <mhocko@kernel.org> wrote:
> >On Mon 29-06-20 13:18:16, Matthew Wilcox wrote:
> >> On Mon, Jun 29, 2020 at 08:08:51AM +0300, Mike Rapoport wrote:
> >> > > @@ -886,8 +868,12 @@ static struct dm_buffer
> >*__alloc_buffer_wait_no_callback(struct dm_bufio_client
> >> > >  			return NULL;
> >> > >  
> >> > >  		if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
> >> > > +			unsigned noio_flag;
> >> > > +
> >> > >  			dm_bufio_unlock(c);
> >> > > -			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY |
> >__GFP_NOMEMALLOC | __GFP_NOWARN);
> >> > > +			noio_flag = memalloc_noio_save();
> >> > 
> >> > I've read the series twice and I'm still missing the definition of
> >> > memalloc_noio_save().
> >> > 
> >> > And also it would be nice to have a paragraph about it in
> >> > Documentation/core-api/memory-allocation.rst
> >> 
> >>
> >Documentation/core-api/gfp_mask-from-fs-io.rst:``memalloc_nofs_save``,
> >``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
> >> Documentation/core-api/gfp_mask-from-fs-io.rst:   :functions:
> >memalloc_noio_save memalloc_noio_restore
> >> Documentation/core-api/gfp_mask-from-fs-io.rst:allows nesting so it
> >is safe to call ``memalloc_noio_save`` or
> > 
> >The patch is adding memalloc_nowait* and I suspect Mike had that in
> >mind, which would be a fair request. 
> 
> Right, sorry misprinted that.
> 
> > Btw. we are missing
> >memalloc_nocma*
> >documentation either - I was just reminded of its existence today..

Heh.  Oops, not sure how those got left out.  I hadn't touched the
documentation either, so -1 points to me.

The documentation is hard to add a new case to, so I rewrote it.  What
do you think?  (Obviously I'll split this out differently for submission;
this is just what I have in my tree right now).

diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
deleted file mode 100644
index e7c32a8de126..000000000000
--- a/Documentation/core-api/gfp_mask-from-fs-io.rst
+++ /dev/null
@@ -1,68 +0,0 @@
-.. _gfp_mask_from_fs_io:
-
-=================================
-GFP masks used from FS/IO context
-=================================
-
-:Date: May, 2018
-:Author: Michal Hocko <mhocko@kernel.org>
-
-Introduction
-============
-
-Code paths in the filesystem and IO stacks must be careful when
-allocating memory to prevent recursion deadlocks caused by direct
-memory reclaim calling back into the FS or IO paths and blocking on
-already held resources (e.g. locks - most commonly those used for the
-transaction context).
-
-The traditional way to avoid this deadlock problem is to clear __GFP_FS
-respectively __GFP_IO (note the latter implies clearing the first as well) in
-the gfp mask when calling an allocator. GFP_NOFS respectively GFP_NOIO can be
-used as shortcut. It turned out though that above approach has led to
-abuses when the restricted gfp mask is used "just in case" without a
-deeper consideration which leads to problems because an excessive use
-of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
-reclaim issues.
-
-New API
-========
-
-Since 4.12 we do have a generic scope API for both NOFS and NOIO context
-``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
-``memalloc_noio_restore`` which allow to mark a scope to be a critical
-section from a filesystem or I/O point of view. Any allocation from that
-scope will inherently drop __GFP_FS respectively __GFP_IO from the given
-mask so no memory allocation can recurse back in the FS/IO.
-
-.. kernel-doc:: include/linux/sched/mm.h
-   :functions: memalloc_nofs_save memalloc_nofs_restore
-.. kernel-doc:: include/linux/sched/mm.h
-   :functions: memalloc_noio_save memalloc_noio_restore
-
-FS/IO code then simply calls the appropriate save function before
-any critical section with respect to the reclaim is started - e.g.
-lock shared with the reclaim context or when a transaction context
-nesting would be possible via reclaim. The restore function should be
-called when the critical section ends. All that ideally along with an
-explanation what is the reclaim context for easier maintenance.
-
-Please note that the proper pairing of save/restore functions
-allows nesting so it is safe to call ``memalloc_noio_save`` or
-``memalloc_noio_restore`` respectively from an existing NOIO or NOFS
-scope.
-
-What about __vmalloc(GFP_NOFS)
-==============================
-
-vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
-GFP_KERNEL allocations deep inside the allocator which are quite non-trivial
-to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
-almost always a bug. The good news is that the NOFS/NOIO semantic can be
-achieved by the scope API.
-
-In the ideal world, upper layers should already mark dangerous contexts
-and so no special care is required and vmalloc should be called without
-any problems. Sometimes if the context is not really clear or there are
-layering violations then the recommended way around that is to wrap ``vmalloc``
-by the scope API with a comment explaining the problem.
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 15ab86112627..55f611e34a1d 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -90,7 +90,6 @@ more memory-management documentation in :doc:`/vm/index`.
    genalloc
    pin_user_pages
    boot-time-mm
-   gfp_mask-from-fs-io
 
 Interfaces for kernel debugging
 ===============================
diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst
index 4aa82ddd01b8..c6287c25ff99 100644
--- a/Documentation/core-api/memory-allocation.rst
+++ b/Documentation/core-api/memory-allocation.rst
@@ -69,13 +69,12 @@ here we briefly outline their recommended usage:
     ``GFP_USER`` means that the allocated memory is not movable and it
     must be directly accessible by the kernel.
 
-You may notice that quite a few allocations in the existing code
-specify ``GFP_NOIO`` or ``GFP_NOFS``. Historically, they were used to
-prevent recursion deadlocks caused by direct memory reclaim calling
-back into the FS or IO paths and blocking on already held
-resources. Since 4.12 the preferred way to address this issue is to
-use new scope APIs described in
-:ref:`Documentation/core-api/gfp_mask-from-fs-io.rst <gfp_mask_from_fs_io>`.
+You may notice that quite a few allocations in the existing code specify
+``GFP_NOIO`` or ``GFP_NOFS``. Historically, they were used to prevent
+recursion deadlocks caused by direct memory reclaim calling back into
+the FS or IO paths and blocking on already held resources. Since 4.12
+the preferred way to address this issue is to use the new scope APIs
+described below.
 
 Other legacy GFP flags are ``GFP_DMA`` and ``GFP_DMA32``. They are
 used to ensure that the allocated memory is accessible by hardware
@@ -84,6 +83,37 @@ driver for a device with such restrictions, avoid using these flags.
 And even with hardware with restrictions it is preferable to use
 `dma_alloc*` APIs.
 
+Memory scoping API
+==================
+
+Traditionally, we have passed GFP flags to functions that we call,
+indicating what kind of actions may be taken to free up memory if none
+is currently available.  This has proved impractical in some places and
+so we are currently transitioning to the calls below which override the
+flags specified by any particular call to allocate memory.
+
+.. kernel-doc:: include/linux/sched/mm.h
+   :functions: memalloc_nofs_save memalloc_nofs_restore
+.. kernel-doc:: include/linux/sched/mm.h
+   :functions: memalloc_noio_save memalloc_noio_restore
+.. kernel-doc:: include/linux/sched/mm.h
+   :functions: memalloc_nocma_save memalloc_nocma_restore
+.. kernel-doc:: include/linux/sched/mm.h
+   :functions: memalloc_nowait_save memalloc_nowait_restore
+
+These functions should be called at the point where any memory allocation
+would start to cause problems.  That is, do not simply wrap individual
+memory allocation calls which currently use ``GFP_NOFS`` with a pair
+of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
+find the lock which is taken that would cause problems if memory reclaim
+reentered the filesystem, place a call to memalloc_nofs_save() before it
+is acquired and a call to memalloc_nofs_restore() after it is released.
+Ideally also add a comment explaining why this lock will be problematic.
+
+Please note that the proper pairing of save/restore functions
+allows nesting so it is safe to call memalloc_noio_save() and
+memalloc_noio_restore() within an existing NOIO or NOFS scope.
+
 Selecting memory allocator
 ==========================
 
@@ -104,16 +134,19 @@ ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a power of two, the
 alignment is also guaranteed to be at least the respective size.
 
 For large allocations you can use vmalloc() and vzalloc(), or directly
-request pages from the page allocator. The memory allocated by `vmalloc`
-and related functions is not physically contiguous.
+request pages from the page allocator.  The memory allocated by `vmalloc`
+and related functions is not physically contiguous.  The `vmalloc`
+family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
+flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
+the allocator which are hard to remove.  However, the scope APIs described
+above can be used to limit the `vmalloc` functions.
 
 If you are not sure whether the allocation size is too large for
 `kmalloc`, it is possible to use kvmalloc() and its derivatives. It will
 try to allocate memory with `kmalloc` and if the allocation fails it
-will be retried with `vmalloc`. There are restrictions on which GFP
-flags can be used with `kvmalloc`; please see kvmalloc_node() reference
-documentation. Note that `kvmalloc` may return memory that is not
-physically contiguous.
+will be retried with `vmalloc`. That means the GFP flags supported by
+`kvmalloc` are the same as those supported by `vmalloc` and `kvmalloc`
+may return memory that is not physically contiguous.
 
 If you need to allocate many identical objects you can use the slab
 cache allocator. The cache should be set up with kmem_cache_create() or
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 6484569f50df..9fc091274d1d 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -186,9 +186,10 @@ static inline gfp_t current_gfp_context(gfp_t flags)
 		 * them.  noio implies neither IO nor FS and it is a weaker
 		 * context so always make sure it takes precedence.
 		 */
-		if (current->memalloc_nowait)
+		if (current->memalloc_nowait) {
 			flags &= ~__GFP_DIRECT_RECLAIM;
-		else if (current->memalloc_noio)
+			flags |= __GFP_NOWARN;
+		} else if (current->memalloc_noio)
 			flags &= ~(__GFP_IO | __GFP_FS);
 		else if (current->memalloc_nofs)
 			flags &= ~__GFP_FS;
@@ -275,6 +276,36 @@ static inline void memalloc_nofs_restore(unsigned int flags)
 	current->memalloc_nofs = flags ? 1 : 0;
 }
 
+/**
+ * memalloc_nowait_save - Marks implicit GFP_NOWAIT allocation scope.
+ *
+ * This functions marks the beginning of the GFP_NOWAIT allocation scope.
+ * All further allocations will implicitly disallow all waiting in the
+ * page allocator.  Use memalloc_nowait_restore() to end the scope with
+ * flags returned by this function.
+ *
+ * This function is safe to be used from any context.
+ */
+static inline unsigned int memalloc_nowait_save(void)
+{
+	unsigned int flags = current->memalloc_nowait;
+	current->memalloc_nowait = 1;
+	return flags;
+}
+
+/**
+ * memalloc_nowait_restore - Ends the implicit GFP_NOWAIT scope.
+ * @flags: Flags to restore.
+ *
+ * Ends the implicit GFP_NOWAIT scope started by memalloc_nowait_save().
+ * Always make sure that that the given flags is the return value from the
+ * pairing memalloc_nowait_save call.
+ */
+static inline void memalloc_nowait_restore(unsigned int flags)
+{
+	current->memalloc_nowait = flags ? 1 : 0;
+}
+
 static inline unsigned int memalloc_noreclaim_save(void)
 {
 	unsigned int flags = current->flags & PF_MEMALLOC;


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/6] Overhaul memalloc_no*
  2020-06-29 13:43         ` Mikulas Patocka
@ 2020-06-29 22:34           ` Dave Chinner
  2020-07-03 14:26             ` [PATCH] dm-bufio: do cleanup from a workqueue Mikulas Patocka
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2020-06-29 22:34 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Matthew Wilcox (Oracle),
	linux-kernel, linux-mm, linux-xfs, dm-devel, Jens Axboe,
	NeilBrown

On Mon, Jun 29, 2020 at 09:43:23AM -0400, Mikulas Patocka wrote:
> On Mon, 29 Jun 2020, Dave Chinner wrote:
> > On Sat, Jun 27, 2020 at 09:09:09AM -0400, Mikulas Patocka wrote:
> > > On Sat, 27 Jun 2020, Dave Chinner wrote:
> > > > On Fri, Jun 26, 2020 at 11:02:19AM -0400, Mikulas Patocka wrote:
> > > > > Hi
> > > > > 
> > > > > I suggest to join memalloc_noio and memalloc_nofs into just one flag that 
> > > > > prevents both filesystem recursion and i/o recursion.
> > > > > 
> > > > > Note that any I/O can recurse into a filesystem via the loop device, thus 
> > > > > it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is set 
> > > > > and PF_MEMALLOC_NOIO is not set.
> > > > 
> > > > Correct me if I'm wrong, but I think that will prevent swapping from
> > > > GFP_NOFS memory reclaim contexts.
> > > 
> > > Yes.
> > > 
> > > > IOWs, this will substantially
> > > > change the behaviour of the memory reclaim system under sustained
> > > > GFP_NOFS memory pressure. Sustained GFP_NOFS memory pressure is
> > > > quite common, so I really don't think we want to telling memory
> > > > reclaim "you can't do IO at all" when all we are trying to do is
> > > > prevent recursion back into the same filesystem.
> > > 
> > > So, we can define __GFP_ONLY_SWAP_IO and __GFP_IO.
> > 
> > Uh, why?
> > 
> > Exactly what problem are you trying to solve here?
> 
> This:
> 
> 1. The filesystem does a GFP_NOFS allocation.
> 2. The allocation calls directly a dm-bufio shrinker.
> 3. The dm-bufio shrinker sees that there is __GFP_IO set, so it assumes 
>    that it can do I/O. It selects some dirty buffers, writes them back and 
>    waits for the I/O to finish.

And so you are doing IO in a GFP_NOFS context because someone thought
the block layer can't recurse back into filesystems? That's a broken
assumption and has been since the loop device was introduced over a
couple of decades ago. I mean, the dm-bufio IO submission path uses
GFP_NOIO for obvious reasons, but once it's in the next device down
it loses all control of the submission context.

This is what I mean about "looking at reclaim contexts above the
current layer is a Big Red Flag"? The fundamental assumption of
dm-bufio that it can issue IO in GFP_NOFS context and not have a
lower layer recurse back into a filesystem has always been
incorrect. Just because the loop device now does GFP_NOIO
allocation, that doesn't mean what dm-bufio is doing in this
shrinker is correct or valid.

Because, as you point out:

> 4. The dirty buffers belong to a loop device.
> 5. The loop device thread calls the filesystem that did the GFP_NOFS 
>    allocation in step 1 (and that is still waiting for the allocation to 
>    succeed).
> Note that setting PF_MEMALLOC_NOIO on the loop thread won't help with this 
> deadlock.

Right, re-entering the filesystem might block on a lock, IO, memory
allocation, journal space reservation, etc. Indeed, it might not
even be able to issue transactions because the allocating context is
using GFP_NOFS because it is already running a transaction.

> Do you argue that this is a bug in dm-bufio? Or a bug in the kernel? Or 
> that it can't happen?

That's a bug in dm-bufio - dm is a layered block device and so has
_always_ been able to have filesystems both above and below
it in the storage stack. i.e. the assumption that there is no
filesystem context under the DM layers has always been wrong.

i.e. the memory reclaim context specfically directed dm-bufio that
whatever the shrinker does, it must not recurse into the filesystem
layer. It is the responsibility of the shrinker to obey the
constraints it was given by memory reclaim, and the dm-bufio
shrinker's assumption that there cannot be a filesystem below the DM
device violates this directive because, quite clearly, there can be
filesystems underneath DM devices.

IOWs, assuming that you can issue and block on IO from a -layered
block device- in GFP_NOFS shrinker context is flawed.  i.e. Anything
that presents as a block device that is layered on top of another
block device can recurse into a filesystem as they can sit on top of
a loop device. This has always been the case, and that means the
assumptions the dm-bufio shrinker is making about what it can do in
GFP_NOFS shrinker context has always been incorrect.

Remember that I explained "you should not block kswapd" in this
shrinker a year ago?

| What follows from that, and is pertinent for in this situation, is
| that if you don't block kswapd, then other reclaim contexts are not
| going to get stuck waiting for it regardless of the reclaim context
| they use.

https://lore.kernel.org/linux-fsdevel/20190809215733.GZ7777@dread.disaster.area/

If you did that when I suggested it, this problem would be solved.
i.e. The only way to fix this problem once adn for all is to stop
using the shrinker as a mechanism to issue and wait on IO. If you
need background writeback of dirty buffers, do it from a
WQ_MEM_RECLAIM workqueue that isn't directly in the memory reclaim
path and so can issue writeback and block safely from a GFP_KERNEL
context. Kick the workqueue from the shrinker context, but get rid
of the IO submission and waiting from the shrinker and all the
GFP_NOFS memory reclaim recursion problems go away.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-29 21:28           ` Matthew Wilcox
@ 2020-06-30  6:34             ` Michal Hocko
  2020-07-01  4:12               ` Matthew Wilcox
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2020-06-30  6:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mike Rapoport, linux-kernel, linux-mm, linux-xfs, dm-devel,
	Mikulas Patocka, Jens Axboe, NeilBrown

On Mon 29-06-20 22:28:30, Matthew Wilcox wrote:
[...]
> The documentation is hard to add a new case to, so I rewrote it.  What
> do you think?  (Obviously I'll split this out differently for submission;
> this is just what I have in my tree right now).

I am fine with your changes. Few notes below.

> -It turned out though that above approach has led to
> -abuses when the restricted gfp mask is used "just in case" without a
> -deeper consideration which leads to problems because an excessive use
> -of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> -reclaim issues.

I believe this is an important part because it shows that new people
coming to the existing code shouldn't take it as correct and rather
question it. Also having a clear indication that overuse is causing real
problems that might be not immediately visible to subsystems outside of
MM.

> -FS/IO code then simply calls the appropriate save function before
> -any critical section with respect to the reclaim is started - e.g.
> -lock shared with the reclaim context or when a transaction context
> -nesting would be possible via reclaim.  

[...]

> +These functions should be called at the point where any memory allocation
> +would start to cause problems.  That is, do not simply wrap individual
> +memory allocation calls which currently use ``GFP_NOFS`` with a pair
> +of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> +find the lock which is taken that would cause problems if memory reclaim
> +reentered the filesystem, place a call to memalloc_nofs_save() before it
> +is acquired and a call to memalloc_nofs_restore() after it is released.
> +Ideally also add a comment explaining why this lock will be problematic.

The above text has mentioned the transaction context nesting as well and
that was a hint by Dave IIRC. It is imho good to have an example of
other reentrant points than just locks. I believe another useful example
would be something like loop device which is mixing IO and FS layers but
I am not familiar with all the details to give you an useful text.

[...]
> @@ -104,16 +134,19 @@ ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a power of two, the
>  alignment is also guaranteed to be at least the respective size.
>  
>  For large allocations you can use vmalloc() and vzalloc(), or directly
> -request pages from the page allocator. The memory allocated by `vmalloc`
> -and related functions is not physically contiguous.
> +request pages from the page allocator.  The memory allocated by `vmalloc`
> +and related functions is not physically contiguous.  The `vmalloc`
> +family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
> +flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
> +the allocator which are hard to remove.  However, the scope APIs described
> +above can be used to limit the `vmalloc` functions.

I would reiterate "Do not just wrap vmalloc by the scope api but rather
rely on the real scope for the NOFS/NOIO context". Maybe we want to
stress out that once a scope is defined it is sticky to _all_
allocations and all allocators within that scope. The text is already
saying that but maybe we want to make it explicit and make it stand out.

[...]
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 6484569f50df..9fc091274d1d 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -186,9 +186,10 @@ static inline gfp_t current_gfp_context(gfp_t flags)
>  		 * them.  noio implies neither IO nor FS and it is a weaker
>  		 * context so always make sure it takes precedence.
>  		 */
> -		if (current->memalloc_nowait)
> +		if (current->memalloc_nowait) {
>  			flags &= ~__GFP_DIRECT_RECLAIM;
> -		else if (current->memalloc_noio)
> +			flags |= __GFP_NOWARN;

I dunno. I wouldn't make nowait implicitly NOWARN as well. At least not
with the initial implementation. Maybe we will learn later that there is
just too much unhelpful noise in the kernel log and will reconsider but
I wouldn't just start with that. Also we might learn that there will be
other modifiers for atomic (or should I say non-sleeping) scopes to be
defined. E.g. access to memory reserves but let's just wait for real
usecases.


Thanks a lot Matthew!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-30  6:34             ` Michal Hocko
@ 2020-07-01  4:12               ` Matthew Wilcox
  2020-07-01  5:53                 ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2020-07-01  4:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, linux-kernel, linux-mm, linux-xfs, dm-devel,
	Mikulas Patocka, Jens Axboe, NeilBrown

On Tue, Jun 30, 2020 at 08:34:36AM +0200, Michal Hocko wrote:
> On Mon 29-06-20 22:28:30, Matthew Wilcox wrote:
> [...]
> > The documentation is hard to add a new case to, so I rewrote it.  What
> > do you think?  (Obviously I'll split this out differently for submission;
> > this is just what I have in my tree right now).
> 
> I am fine with your changes. Few notes below.

Thanks!

> > -It turned out though that above approach has led to
> > -abuses when the restricted gfp mask is used "just in case" without a
> > -deeper consideration which leads to problems because an excessive use
> > -of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> > -reclaim issues.
> 
> I believe this is an important part because it shows that new people
> coming to the existing code shouldn't take it as correct and rather
> question it. Also having a clear indication that overuse is causing real
> problems that might be not immediately visible to subsystems outside of
> MM.

It seemed to say a lot of the same things as this paragraph:

+You may notice that quite a few allocations in the existing code specify
+``GFP_NOIO`` or ``GFP_NOFS``. Historically, they were used to prevent
+recursion deadlocks caused by direct memory reclaim calling back into
+the FS or IO paths and blocking on already held resources. Since 4.12
+the preferred way to address this issue is to use the new scope APIs
+described below.

Since this is in core-api/ rather than vm/, I felt that discussion of
the problems that it causes to the mm was a bit too much detail for the
people who would be reading this document.  Maybe I could move that
information into a new Documentation/vm/reclaim.rst file?

Let's see if Our Grumpy Editor has time to give us his advice on this.

> > -FS/IO code then simply calls the appropriate save function before
> > -any critical section with respect to the reclaim is started - e.g.
> > -lock shared with the reclaim context or when a transaction context
> > -nesting would be possible via reclaim.  
> 
> [...]
> 
> > +These functions should be called at the point where any memory allocation
> > +would start to cause problems.  That is, do not simply wrap individual
> > +memory allocation calls which currently use ``GFP_NOFS`` with a pair
> > +of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> > +find the lock which is taken that would cause problems if memory reclaim
> > +reentered the filesystem, place a call to memalloc_nofs_save() before it
> > +is acquired and a call to memalloc_nofs_restore() after it is released.
> > +Ideally also add a comment explaining why this lock will be problematic.
> 
> The above text has mentioned the transaction context nesting as well and
> that was a hint by Dave IIRC. It is imho good to have an example of
> other reentrant points than just locks. I believe another useful example
> would be something like loop device which is mixing IO and FS layers but
> I am not familiar with all the details to give you an useful text.

I'll let Mikulas & Dave finish fighting about that before I write any
text mentioning the loop driver.  How about this for mentioning the
filesystem transaction possibility?

@@ -103,12 +103,16 @@ flags specified by any particular call to allocate memory.
 
 These functions should be called at the point where any memory allocation
 would start to cause problems.  That is, do not simply wrap individual
-memory allocation calls which currently use ``GFP_NOFS`` with a pair
-of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
-find the lock which is taken that would cause problems if memory reclaim
+memory allocation calls which currently use ``GFP_NOFS`` with a pair of
+calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead, find
+the resource which is acquired that would cause problems if memory reclaim
 reentered the filesystem, place a call to memalloc_nofs_save() before it
 is acquired and a call to memalloc_nofs_restore() after it is released.
 Ideally also add a comment explaining why this lock will be problematic.
+A resource might be a lock which would need to be acquired by an attempt
+to reclaim memory, or it might be starting a transaction that should not
+nest over a memory reclaim transaction.  Deep knowledge of the filesystem
+or driver is often needed to place memory scoping calls correctly.
 
 Please note that the proper pairing of save/restore functions
 allows nesting so it is safe to call memalloc_noio_save() and

> > @@ -104,16 +134,19 @@ ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a power of two, the
> >  alignment is also guaranteed to be at least the respective size.
> >  
> >  For large allocations you can use vmalloc() and vzalloc(), or directly
> > -request pages from the page allocator. The memory allocated by `vmalloc`
> > -and related functions is not physically contiguous.
> > +request pages from the page allocator.  The memory allocated by `vmalloc`
> > +and related functions is not physically contiguous.  The `vmalloc`
> > +family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
> > +flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
> > +the allocator which are hard to remove.  However, the scope APIs described
> > +above can be used to limit the `vmalloc` functions.
> 
> I would reiterate "Do not just wrap vmalloc by the scope api but rather
> rely on the real scope for the NOFS/NOIO context". Maybe we want to
> stress out that once a scope is defined it is sticky to _all_
> allocations and all allocators within that scope. The text is already
> saying that but maybe we want to make it explicit and make it stand out.

yes.  I went with:

@@ -139,7 +143,10 @@ and related functions is not physically contiguous.  The `vmalloc`
 family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
 flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
 the allocator which are hard to remove.  However, the scope APIs described
-above can be used to limit the `vmalloc` functions.
+above can be used to limit the `vmalloc` functions.  As described above,
+do not simply wrap individual calls in the scope APIs, but look for the
+underlying reason why the memory allocation may not call into filesystems
+or block devices.
 
 If you are not sure whether the allocation size is too large for
 `kmalloc`, it is possible to use kvmalloc() and its derivatives. It will


> [...]
> > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > index 6484569f50df..9fc091274d1d 100644
> > --- a/include/linux/sched/mm.h
> > +++ b/include/linux/sched/mm.h
> > @@ -186,9 +186,10 @@ static inline gfp_t current_gfp_context(gfp_t flags)
> >  		 * them.  noio implies neither IO nor FS and it is a weaker
> >  		 * context so always make sure it takes precedence.
> >  		 */
> > -		if (current->memalloc_nowait)
> > +		if (current->memalloc_nowait) {
> >  			flags &= ~__GFP_DIRECT_RECLAIM;
> > -		else if (current->memalloc_noio)
> > +			flags |= __GFP_NOWARN;
> 
> I dunno. I wouldn't make nowait implicitly NOWARN as well. At least not
> with the initial implementation. Maybe we will learn later that there is
> just too much unhelpful noise in the kernel log and will reconsider but
> I wouldn't just start with that. Also we might learn that there will be
> other modifiers for atomic (or should I say non-sleeping) scopes to be
> defined. E.g. access to memory reserves but let's just wait for real
> usecases.

Fair enough.  I'll drop that part.  Thanks!


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-07-01  4:12               ` Matthew Wilcox
@ 2020-07-01  5:53                 ` Michal Hocko
  2020-07-01  7:04                   ` Mike Rapoport
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2020-07-01  5:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mike Rapoport, linux-kernel, linux-mm, linux-xfs, dm-devel,
	Mikulas Patocka, Jens Axboe, NeilBrown

On Wed 01-07-20 05:12:03, Matthew Wilcox wrote:
> On Tue, Jun 30, 2020 at 08:34:36AM +0200, Michal Hocko wrote:
> > On Mon 29-06-20 22:28:30, Matthew Wilcox wrote:
> > [...]
> > > The documentation is hard to add a new case to, so I rewrote it.  What
> > > do you think?  (Obviously I'll split this out differently for submission;
> > > this is just what I have in my tree right now).
> > 
> > I am fine with your changes. Few notes below.
> 
> Thanks!
> 
> > > -It turned out though that above approach has led to
> > > -abuses when the restricted gfp mask is used "just in case" without a
> > > -deeper consideration which leads to problems because an excessive use
> > > -of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> > > -reclaim issues.
> > 
> > I believe this is an important part because it shows that new people
> > coming to the existing code shouldn't take it as correct and rather
> > question it. Also having a clear indication that overuse is causing real
> > problems that might be not immediately visible to subsystems outside of
> > MM.
> 
> It seemed to say a lot of the same things as this paragraph:
> 
> +You may notice that quite a few allocations in the existing code specify
> +``GFP_NOIO`` or ``GFP_NOFS``. Historically, they were used to prevent
> +recursion deadlocks caused by direct memory reclaim calling back into
> +the FS or IO paths and blocking on already held resources. Since 4.12
> +the preferred way to address this issue is to use the new scope APIs
> +described below.
> 
> Since this is in core-api/ rather than vm/, I felt that discussion of
> the problems that it causes to the mm was a bit too much detail for the
> people who would be reading this document.  Maybe I could move that
> information into a new Documentation/vm/reclaim.rst file?

Hmm, my experience is that at least some users of NOFS/NOIO use this
flag just to be sure they do not do something wrong without realizing
that this might have a very negative effect on the whole system
operation. That was the main motivation to have an explicit note there.
I am not sure having that in MM internal documentation will make it
stand out for a general reader.

But I will not insist of course.

> Let's see if Our Grumpy Editor has time to give us his advice on this.
> 
> > > -FS/IO code then simply calls the appropriate save function before
> > > -any critical section with respect to the reclaim is started - e.g.
> > > -lock shared with the reclaim context or when a transaction context
> > > -nesting would be possible via reclaim.  
> > 
> > [...]
> > 
> > > +These functions should be called at the point where any memory allocation
> > > +would start to cause problems.  That is, do not simply wrap individual
> > > +memory allocation calls which currently use ``GFP_NOFS`` with a pair
> > > +of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> > > +find the lock which is taken that would cause problems if memory reclaim
> > > +reentered the filesystem, place a call to memalloc_nofs_save() before it
> > > +is acquired and a call to memalloc_nofs_restore() after it is released.
> > > +Ideally also add a comment explaining why this lock will be problematic.
> > 
> > The above text has mentioned the transaction context nesting as well and
> > that was a hint by Dave IIRC. It is imho good to have an example of
> > other reentrant points than just locks. I believe another useful example
> > would be something like loop device which is mixing IO and FS layers but
> > I am not familiar with all the details to give you an useful text.
> 
> I'll let Mikulas & Dave finish fighting about that before I write any
> text mentioning the loop driver.  How about this for mentioning the
> filesystem transaction possibility?
> 
> @@ -103,12 +103,16 @@ flags specified by any particular call to allocate memory.
>  
>  These functions should be called at the point where any memory allocation
>  would start to cause problems.  That is, do not simply wrap individual
> -memory allocation calls which currently use ``GFP_NOFS`` with a pair
> -of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> -find the lock which is taken that would cause problems if memory reclaim
> +memory allocation calls which currently use ``GFP_NOFS`` with a pair of
> +calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead, find
> +the resource which is acquired that would cause problems if memory reclaim
>  reentered the filesystem, place a call to memalloc_nofs_save() before it
>  is acquired and a call to memalloc_nofs_restore() after it is released.
>  Ideally also add a comment explaining why this lock will be problematic.
> +A resource might be a lock which would need to be acquired by an attempt
> +to reclaim memory, or it might be starting a transaction that should not
> +nest over a memory reclaim transaction.  Deep knowledge of the filesystem
> +or driver is often needed to place memory scoping calls correctly.

Ack

>  Please note that the proper pairing of save/restore functions
>  allows nesting so it is safe to call memalloc_noio_save() and
> 
> > > @@ -104,16 +134,19 @@ ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a power of two, the
> > >  alignment is also guaranteed to be at least the respective size.
> > >  
> > >  For large allocations you can use vmalloc() and vzalloc(), or directly
> > > -request pages from the page allocator. The memory allocated by `vmalloc`
> > > -and related functions is not physically contiguous.
> > > +request pages from the page allocator.  The memory allocated by `vmalloc`
> > > +and related functions is not physically contiguous.  The `vmalloc`
> > > +family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
> > > +flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
> > > +the allocator which are hard to remove.  However, the scope APIs described
> > > +above can be used to limit the `vmalloc` functions.
> > 
> > I would reiterate "Do not just wrap vmalloc by the scope api but rather
> > rely on the real scope for the NOFS/NOIO context". Maybe we want to
> > stress out that once a scope is defined it is sticky to _all_
> > allocations and all allocators within that scope. The text is already
> > saying that but maybe we want to make it explicit and make it stand out.
> 
> yes.  I went with:
> 
> @@ -139,7 +143,10 @@ and related functions is not physically contiguous.  The `vmalloc`
>  family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
>  flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
>  the allocator which are hard to remove.  However, the scope APIs described
> -above can be used to limit the `vmalloc` functions.
> +above can be used to limit the `vmalloc` functions.  As described above,
> +do not simply wrap individual calls in the scope APIs, but look for the
> +underlying reason why the memory allocation may not call into filesystems
> +or block devices.

ack

>  
>  If you are not sure whether the allocation size is too large for
>  `kmalloc`, it is possible to use kvmalloc() and its derivatives. It will
> 
> 
> > [...]
> > > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > > index 6484569f50df..9fc091274d1d 100644
> > > --- a/include/linux/sched/mm.h
> > > +++ b/include/linux/sched/mm.h
> > > @@ -186,9 +186,10 @@ static inline gfp_t current_gfp_context(gfp_t flags)
> > >  		 * them.  noio implies neither IO nor FS and it is a weaker
> > >  		 * context so always make sure it takes precedence.
> > >  		 */
> > > -		if (current->memalloc_nowait)
> > > +		if (current->memalloc_nowait) {
> > >  			flags &= ~__GFP_DIRECT_RECLAIM;
> > > -		else if (current->memalloc_noio)
> > > +			flags |= __GFP_NOWARN;
> > 
> > I dunno. I wouldn't make nowait implicitly NOWARN as well. At least not
> > with the initial implementation. Maybe we will learn later that there is
> > just too much unhelpful noise in the kernel log and will reconsider but
> > I wouldn't just start with that. Also we might learn that there will be
> > other modifiers for atomic (or should I say non-sleeping) scopes to be
> > defined. E.g. access to memory reserves but let's just wait for real
> > usecases.
> 
> Fair enough.  I'll drop that part.  Thanks!

thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-07-01  5:53                 ` Michal Hocko
@ 2020-07-01  7:04                   ` Mike Rapoport
  0 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2020-07-01  7:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-xfs, dm-devel,
	Mikulas Patocka, Jens Axboe, NeilBrown

On Wed, Jul 01, 2020 at 07:53:46AM +0200, Michal Hocko wrote:
> On Wed 01-07-20 05:12:03, Matthew Wilcox wrote:
> > On Tue, Jun 30, 2020 at 08:34:36AM +0200, Michal Hocko wrote:
> > > On Mon 29-06-20 22:28:30, Matthew Wilcox wrote:
> > > [...]
> > > > The documentation is hard to add a new case to, so I rewrote it.  What
> > > > do you think?  (Obviously I'll split this out differently for submission;
> > > > this is just what I have in my tree right now).
> > > 
> > > I am fine with your changes. Few notes below.
> > 
> > Thanks!
> > 
> > > > -It turned out though that above approach has led to
> > > > -abuses when the restricted gfp mask is used "just in case" without a
> > > > -deeper consideration which leads to problems because an excessive use
> > > > -of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> > > > -reclaim issues.
> > > 
> > > I believe this is an important part because it shows that new people
> > > coming to the existing code shouldn't take it as correct and rather
> > > question it. Also having a clear indication that overuse is causing real
> > > problems that might be not immediately visible to subsystems outside of
> > > MM.
> > 
> > It seemed to say a lot of the same things as this paragraph:
> > 
> > +You may notice that quite a few allocations in the existing code specify
> > +``GFP_NOIO`` or ``GFP_NOFS``. Historically, they were used to prevent
> > +recursion deadlocks caused by direct memory reclaim calling back into
> > +the FS or IO paths and blocking on already held resources. Since 4.12
> > +the preferred way to address this issue is to use the new scope APIs
> > +described below.
> > 
> > Since this is in core-api/ rather than vm/, I felt that discussion of
> > the problems that it causes to the mm was a bit too much detail for the
> > people who would be reading this document.  Maybe I could move that
> > information into a new Documentation/vm/reclaim.rst file?

It would be nice to have Documentation/vm/reclaim.rst regardless ;-)

> Hmm, my experience is that at least some users of NOFS/NOIO use this
> flag just to be sure they do not do something wrong without realizing
> that this might have a very negative effect on the whole system
> operation. That was the main motivation to have an explicit note there.
> I am not sure having that in MM internal documentation will make it
> stand out for a general reader.

I'd add an explict note in the "Memory Scoping API" section. Please see
below.

> But I will not insist of course.
> 
> > Let's see if Our Grumpy Editor has time to give us his advice on this.
> > 
> > > > -FS/IO code then simply calls the appropriate save function before
> > > > -any critical section with respect to the reclaim is started - e.g.
> > > > -lock shared with the reclaim context or when a transaction context
> > > > -nesting would be possible via reclaim.  
> > > 
> > > [...]
> > > 
> > > > +These functions should be called at the point where any memory allocation
> > > > +would start to cause problems.  That is, do not simply wrap individual
> > > > +memory allocation calls which currently use ``GFP_NOFS`` with a pair
> > > > +of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> > > > +find the lock which is taken that would cause problems if memory reclaim
> > > > +reentered the filesystem, place a call to memalloc_nofs_save() before it
> > > > +is acquired and a call to memalloc_nofs_restore() after it is released.
> > > > +Ideally also add a comment explaining why this lock will be problematic.
> > > 
> > > The above text has mentioned the transaction context nesting as well and
> > > that was a hint by Dave IIRC. It is imho good to have an example of
> > > other reentrant points than just locks. I believe another useful example
> > > would be something like loop device which is mixing IO and FS
> > > layers but I am not familiar with all the details to give you an
> > > useful text.
> > 
> > I'll let Mikulas & Dave finish fighting about that before I write
> > any text mentioning the loop driver.  How about this for mentioning
> > the filesystem transaction possibility?
> > 
> > @@ -103,12 +103,16 @@ flags specified by any particular call to allocate memory.
> >  
> >  These functions should be called at the point where any memory allocation
> >  would start to cause problems.  That is, do not simply wrap individual
> > -memory allocation calls which currently use ``GFP_NOFS`` with a pair
> > -of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> > -find the lock which is taken that would cause problems if memory reclaim
> > +memory allocation calls which currently use ``GFP_NOFS`` with a pair of
> > +calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead, find
> > +the resource which is acquired that would cause problems if memory reclaim
> >  reentered the filesystem, place a call to memalloc_nofs_save() before it
> >  is acquired and a call to memalloc_nofs_restore() after it is released.
> >  Ideally also add a comment explaining why this lock will be problematic.
> > +A resource might be a lock which would need to be acquired by an attempt
> > +to reclaim memory, or it might be starting a transaction that should not
> > +nest over a memory reclaim transaction.  Deep knowledge of the filesystem
> > +or driver is often needed to place memory scoping calls correctly.

I'd s/often/always/ :)

> Ack

And

+ Using memory scoping APIs "just in case" may lead to problematic
reclaim behaviour and have a very negative effect on the whole system
operation.

> >  Please note that the proper pairing of save/restore functions
> >  allows nesting so it is safe to call memalloc_noio_save() and
> > 
> > > > @@ -104,16 +134,19 @@ ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a power of two, the
> > > >  alignment is also guaranteed to be at least the respective size.
> > > >  
> > > >  For large allocations you can use vmalloc() and vzalloc(), or directly
> > > > -request pages from the page allocator. The memory allocated by `vmalloc`
> > > > -and related functions is not physically contiguous.
> > > > +request pages from the page allocator.  The memory allocated by `vmalloc`
> > > > +and related functions is not physically contiguous.  The `vmalloc`
> > > > +family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
> > > > +flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
> > > > +the allocator which are hard to remove.  However, the scope APIs described
> > > > +above can be used to limit the `vmalloc` functions.
> > > 
> > > I would reiterate "Do not just wrap vmalloc by the scope api but rather
> > > rely on the real scope for the NOFS/NOIO context". Maybe we want to
> > > stress out that once a scope is defined it is sticky to _all_
> > > allocations and all allocators within that scope. The text is already
> > > saying that but maybe we want to make it explicit and make it stand out.
> > 
> > yes.  I went with:
> > 
> > @@ -139,7 +143,10 @@ and related functions is not physically contiguous.  The `vmalloc`
> >  family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
> >  flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
> >  the allocator which are hard to remove.  However, the scope APIs described
> > -above can be used to limit the `vmalloc` functions.
> > +above can be used to limit the `vmalloc` functions.  As described above,
> > +do not simply wrap individual calls in the scope APIs, but look for the
> > +underlying reason why the memory allocation may not call into filesystems
> > +or block devices.
> 
> ack
> 
> >  
> >  If you are not sure whether the allocation size is too large for
> >  `kmalloc`, it is possible to use kvmalloc() and its derivatives. It will
> > 
> > 
> > > [...]
> > > > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > > > index 6484569f50df..9fc091274d1d 100644
> > > > --- a/include/linux/sched/mm.h
> > > > +++ b/include/linux/sched/mm.h
> > > > @@ -186,9 +186,10 @@ static inline gfp_t current_gfp_context(gfp_t flags)
> > > >  		 * them.  noio implies neither IO nor FS and it is a weaker
> > > >  		 * context so always make sure it takes precedence.
> > > >  		 */
> > > > -		if (current->memalloc_nowait)
> > > > +		if (current->memalloc_nowait) {
> > > >  			flags &= ~__GFP_DIRECT_RECLAIM;
> > > > -		else if (current->memalloc_noio)
> > > > +			flags |= __GFP_NOWARN;
> > > 
> > > I dunno. I wouldn't make nowait implicitly NOWARN as well. At least not
> > > with the initial implementation. Maybe we will learn later that there is
> > > just too much unhelpful noise in the kernel log and will reconsider but
> > > I wouldn't just start with that. Also we might learn that there will be
> > > other modifiers for atomic (or should I say non-sleeping) scopes to be
> > > defined. E.g. access to memory reserves but let's just wait for real
> > > usecases.
> > 
> > Fair enough.  I'll drop that part.  Thanks!
> 
> thanks!
> -- 
> Michal Hocko
> SUSE Labs

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH] dm-bufio: do cleanup from a workqueue
  2020-06-29 22:34           ` Dave Chinner
@ 2020-07-03 14:26             ` Mikulas Patocka
  0 siblings, 0 replies; 41+ messages in thread
From: Mikulas Patocka @ 2020-07-03 14:26 UTC (permalink / raw)
  To: Dave Chinner, Mike Snitzer
  Cc: Matthew Wilcox (Oracle),
	linux-kernel, linux-mm, linux-xfs, dm-devel, Jens Axboe,
	NeilBrown



On Tue, 30 Jun 2020, Dave Chinner wrote:

> https://lore.kernel.org/linux-fsdevel/20190809215733.GZ7777@dread.disaster.area/
> 
> If you did that when I suggested it, this problem would be solved.
> i.e. The only way to fix this problem once adn for all is to stop
> using the shrinker as a mechanism to issue and wait on IO. If you
> need background writeback of dirty buffers, do it from a
> WQ_MEM_RECLAIM workqueue that isn't directly in the memory reclaim
> path and so can issue writeback and block safely from a GFP_KERNEL
> context. Kick the workqueue from the shrinker context, but get rid
> of the IO submission and waiting from the shrinker and all the
> GFP_NOFS memory reclaim recursion problems go away.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Hi

This is a patch that moves buffer cleanup to a workqueue. Please review 
it.

Mikulas



From: Mikulas Patocka <mpatocka@redhat.com>

kswapd should not block because it degrades system performance.
So, move reclaim of buffers to a workqueue.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-bufio.c |   60 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 41 insertions(+), 19 deletions(-)

Index: linux-2.6/drivers/md/dm-bufio.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-bufio.c	2020-07-03 14:07:43.000000000 +0200
+++ linux-2.6/drivers/md/dm-bufio.c	2020-07-03 15:35:23.000000000 +0200
@@ -108,7 +108,10 @@ struct dm_bufio_client {
 	int async_write_error;
 
 	struct list_head client_list;
+
 	struct shrinker shrinker;
+	struct work_struct shrink_work;
+	atomic_long_t need_shrink;
 };
 
 /*
@@ -1634,8 +1637,7 @@ static unsigned long get_retain_buffers(
 	return retain_bytes;
 }
 
-static unsigned long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
-			    gfp_t gfp_mask)
+static void __scan(struct dm_bufio_client *c)
 {
 	int l;
 	struct dm_buffer *b, *tmp;
@@ -1646,42 +1648,58 @@ static unsigned long __scan(struct dm_bu
 
 	for (l = 0; l < LIST_SIZE; l++) {
 		list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list) {
-			if (__try_evict_buffer(b, gfp_mask))
+			if (count - freed <= retain_target)
+				atomic_long_set(&c->need_shrink, 0);
+			if (!atomic_long_read(&c->need_shrink))
+				return;
+			if (__try_evict_buffer(b, GFP_KERNEL)) {
+				atomic_long_dec(&c->need_shrink);
 				freed++;
-			if (!--nr_to_scan || ((count - freed) <= retain_target))
-				return freed;
+			}
 			cond_resched();
 		}
 	}
-	return freed;
 }
 
-static unsigned long
-dm_bufio_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+static void shrink_work(struct work_struct *w)
+{
+	struct dm_bufio_client *c = container_of(w, struct dm_bufio_client, shrink_work);
+
+	dm_bufio_lock(c);
+	__scan(c);
+	dm_bufio_unlock(c);
+}
+
+static unsigned long dm_bufio_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct dm_bufio_client *c;
-	unsigned long freed;
 
 	c = container_of(shrink, struct dm_bufio_client, shrinker);
-	if (sc->gfp_mask & __GFP_FS)
-		dm_bufio_lock(c);
-	else if (!dm_bufio_trylock(c))
-		return SHRINK_STOP;
+	atomic_long_add(sc->nr_to_scan, &c->need_shrink);
+	queue_work(dm_bufio_wq, &c->shrink_work);
 
-	freed  = __scan(c, sc->nr_to_scan, sc->gfp_mask);
-	dm_bufio_unlock(c);
-	return freed;
+	return sc->nr_to_scan;
 }
 
-static unsigned long
-dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+static unsigned long dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct dm_bufio_client *c = container_of(shrink, struct dm_bufio_client, shrinker);
 	unsigned long count = READ_ONCE(c->n_buffers[LIST_CLEAN]) +
 			      READ_ONCE(c->n_buffers[LIST_DIRTY]);
 	unsigned long retain_target = get_retain_buffers(c);
+	unsigned long queued_for_cleanup = atomic_long_read(&c->need_shrink);
+
+	if (unlikely(count < retain_target))
+		count = 0;
+	else
+		count -= retain_target;
 
-	return (count < retain_target) ? 0 : (count - retain_target);
+	if (unlikely(count < queued_for_cleanup))
+		count = 0;
+	else
+		count -= queued_for_cleanup;
+
+	return count;
 }
 
 /*
@@ -1772,6 +1790,9 @@ struct dm_bufio_client *dm_bufio_client_
 		__free_buffer_wake(b);
 	}
 
+	INIT_WORK(&c->shrink_work, shrink_work);
+	atomic_long_set(&c->need_shrink, 0);
+
 	c->shrinker.count_objects = dm_bufio_shrink_count;
 	c->shrinker.scan_objects = dm_bufio_shrink_scan;
 	c->shrinker.seeks = 1;
@@ -1817,6 +1838,7 @@ void dm_bufio_client_destroy(struct dm_b
 	drop_buffers(c);
 
 	unregister_shrinker(&c->shrinker);
+	flush_work(&c->shrink_work);
 
 	mutex_lock(&dm_bufio_clients_lock);
 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-25 11:31 ` [PATCH 6/6] mm: Add memalloc_nowait Matthew Wilcox (Oracle)
                     ` (3 preceding siblings ...)
  2020-06-29  5:08   ` Mike Rapoport
@ 2020-09-24  0:39   ` Mike Snitzer
  2020-09-24  1:10     ` Matthew Wilcox
  2020-10-23 14:49   ` Daniel Vetter
  5 siblings, 1 reply; 41+ messages in thread
From: Mike Snitzer @ 2020-09-24  0:39 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-kernel, linux-mm, Jens Axboe, linux-xfs, dm-devel, Mikulas Patocka

On Thu, Jun 25 2020 at  7:31am -0400,
Matthew Wilcox (Oracle) <willy@infradead.org> wrote:

> Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
> guarantees we will not sleep to reclaim memory.  Use it to simplify
> dm-bufio's allocations.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  drivers/md/dm-bufio.c    | 30 ++++++++----------------------
>  include/linux/sched.h    |  1 +
>  include/linux/sched/mm.h | 12 ++++++++----
>  3 files changed, 17 insertions(+), 26 deletions(-)


Hi,

Curious on the state of this patchset?  Not seeing it in next-20200923

The dm-bufio cleanup looks desirable.

Thanks,
Mike



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-09-24  0:39   ` Mike Snitzer
@ 2020-09-24  1:10     ` Matthew Wilcox
  0 siblings, 0 replies; 41+ messages in thread
From: Matthew Wilcox @ 2020-09-24  1:10 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: linux-kernel, linux-mm, Jens Axboe, linux-xfs, dm-devel, Mikulas Patocka

On Wed, Sep 23, 2020 at 08:39:02PM -0400, Mike Snitzer wrote:
> On Thu, Jun 25 2020 at  7:31am -0400,
> Matthew Wilcox (Oracle) <willy@infradead.org> wrote:
> 
> > Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
> > guarantees we will not sleep to reclaim memory.  Use it to simplify
> > dm-bufio's allocations.
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> >  drivers/md/dm-bufio.c    | 30 ++++++++----------------------
> >  include/linux/sched.h    |  1 +
> >  include/linux/sched/mm.h | 12 ++++++++----
> >  3 files changed, 17 insertions(+), 26 deletions(-)
> 
> 
> Hi,
> 
> Curious on the state of this patchset?  Not seeing it in next-20200923
> 
> The dm-bufio cleanup looks desirable.

I've been busy with THPs and haven't pushed this patchset for this window.
It's probably a bit late now we're at rc6, so I'll clean it up and push
it for 5.11?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] mm: Add memalloc_nowait
  2020-06-25 11:31 ` [PATCH 6/6] mm: Add memalloc_nowait Matthew Wilcox (Oracle)
                     ` (4 preceding siblings ...)
  2020-09-24  0:39   ` Mike Snitzer
@ 2020-10-23 14:49   ` Daniel Vetter
  5 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2020-10-23 14:49 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-kernel, linux-mm, linux-xfs, dm-devel, Mikulas Patocka,
	Jens Axboe, NeilBrown

On Thu, Jun 25, 2020 at 12:31:22PM +0100, Matthew Wilcox (Oracle) wrote:
> Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
> guarantees we will not sleep to reclaim memory.  Use it to simplify
> dm-bufio's allocations.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  drivers/md/dm-bufio.c    | 30 ++++++++----------------------
>  include/linux/sched.h    |  1 +
>  include/linux/sched/mm.h | 12 ++++++++----
>  3 files changed, 17 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 6d1565021d74..140ada9a2c8f 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -412,23 +412,6 @@ static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
>  
>  	*data_mode = DATA_MODE_VMALLOC;
>  
> -	/*
> -	 * __vmalloc allocates the data pages and auxiliary structures with
> -	 * gfp_flags that were specified, but pagetables are always allocated
> -	 * with GFP_KERNEL, no matter what was specified as gfp_mask.
> -	 *
> -	 * Consequently, we must set per-process flag PF_MEMALLOC_NOIO so that
> -	 * all allocations done by this process (including pagetables) are done
> -	 * as if GFP_NOIO was specified.
> -	 */
> -	if (gfp_mask & __GFP_NORETRY) {
> -		unsigned noio_flag = memalloc_noio_save();
> -		void *ptr = __vmalloc(c->block_size, gfp_mask);
> -
> -		memalloc_noio_restore(noio_flag);
> -		return ptr;
> -	}
> -
>  	return __vmalloc(c->block_size, gfp_mask);
>  }
>  
> @@ -866,9 +849,6 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>  	 * dm-bufio is resistant to allocation failures (it just keeps
>  	 * one buffer reserved in cases all the allocations fail).
>  	 * So set flags to not try too hard:
> -	 *	GFP_NOWAIT: don't wait; if we need to sleep we'll release our
> -	 *		    mutex and wait ourselves.
> -	 *	__GFP_NORETRY: don't retry and rather return failure
>  	 *	__GFP_NOMEMALLOC: don't use emergency reserves
>  	 *	__GFP_NOWARN: don't print a warning in case of failure
>  	 *
> @@ -877,7 +857,9 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>  	 */
>  	while (1) {
>  		if (dm_bufio_cache_size_latch != 1) {
> -			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			unsigned nowait_flag = memalloc_nowait_save();
> +			b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			memalloc_nowait_restore(nowait_flag);
>  			if (b)
>  				return b;
>  		}
> @@ -886,8 +868,12 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>  			return NULL;
>  
>  		if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
> +			unsigned noio_flag;
> +
>  			dm_bufio_unlock(c);
> -			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			noio_flag = memalloc_noio_save();
> +			b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			memalloc_noio_restore(noio_flag);
>  			dm_bufio_lock(c);
>  			if (b)
>  				return b;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 90336850e940..b1c2cddd366c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -803,6 +803,7 @@ struct task_struct {
>  #endif
>  	unsigned			memalloc_noio:1;
>  	unsigned			memalloc_nofs:1;
> +	unsigned			memalloc_nowait:1;

I think you also need to update gfpflags_allow_blocking() to take your new
flag into account, or some debug checks all over the place might misfire.

Plus I have a patch which rolls this out to a few more places in all the
allocators.
-Daniel

>  	unsigned			memalloc_nocma:1;
>  
>  	unsigned long			atomic_flags; /* Flags requiring atomic access. */
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 6f7b59a848a6..6484569f50df 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -179,12 +179,16 @@ static inline bool in_vfork(struct task_struct *tsk)
>  static inline gfp_t current_gfp_context(gfp_t flags)
>  {
>  	if (unlikely(current->memalloc_noio || current->memalloc_nofs ||
> -		     current->memalloc_nocma)) {
> +		     current->memalloc_nocma) || current->memalloc_nowait) {
>  		/*
> -		 * NOIO implies both NOIO and NOFS and it is a weaker context
> -		 * so always make sure it makes precedence
> +		 * Clearing DIRECT_RECLAIM means we won't get to the point
> +		 * of testing IO or FS, so we don't need to bother clearing
> +		 * them.  noio implies neither IO nor FS and it is a weaker
> +		 * context so always make sure it takes precedence.
>  		 */
> -		if (current->memalloc_noio)
> +		if (current->memalloc_nowait)
> +			flags &= ~__GFP_DIRECT_RECLAIM;
> +		else if (current->memalloc_noio)
>  			flags &= ~(__GFP_IO | __GFP_FS);
>  		else if (current->memalloc_nofs)
>  			flags &= ~__GFP_FS;
> -- 
> 2.27.0
> 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2020-10-23 14:49 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-25 11:31 [PATCH 0/6] Overhaul memalloc_no* Matthew Wilcox (Oracle)
2020-06-25 11:31 ` [PATCH 1/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_noio Matthew Wilcox (Oracle)
2020-06-25 12:22   ` Michal Hocko
2020-06-25 12:34     ` Matthew Wilcox
2020-06-25 12:42       ` Michal Hocko
2020-06-25 11:31 ` [PATCH 2/6] mm: Add become_kswapd and restore_kswapd Matthew Wilcox (Oracle)
2020-06-25 12:31   ` Michal Hocko
2020-06-25 11:31 ` [PATCH 3/6] xfs: Convert to memalloc_nofs_save Matthew Wilcox (Oracle)
2020-06-25 11:31 ` [PATCH 4/6] mm: Replace PF_MEMALLOC_NOFS with memalloc_nofs Matthew Wilcox (Oracle)
2020-06-25 13:35   ` Michal Hocko
2020-06-25 11:31 ` [PATCH 5/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_nocma Matthew Wilcox (Oracle)
2020-06-25 11:31 ` [PATCH 6/6] mm: Add memalloc_nowait Matthew Wilcox (Oracle)
2020-06-25 12:40   ` Michal Hocko
2020-06-25 13:10     ` Matthew Wilcox
2020-06-25 13:34       ` Michal Hocko
2020-06-25 19:05   ` kernel test robot
2020-06-25 23:51   ` kernel test robot
2020-06-29  5:08   ` Mike Rapoport
2020-06-29 12:18     ` Matthew Wilcox
2020-06-29 12:52       ` Michal Hocko
2020-06-29 13:45         ` Mike Rapoport
2020-06-29 21:28           ` Matthew Wilcox
2020-06-30  6:34             ` Michal Hocko
2020-07-01  4:12               ` Matthew Wilcox
2020-07-01  5:53                 ` Michal Hocko
2020-07-01  7:04                   ` Mike Rapoport
2020-09-24  0:39   ` Mike Snitzer
2020-09-24  1:10     ` Matthew Wilcox
2020-10-23 14:49   ` Daniel Vetter
2020-06-25 18:48 ` [PATCH 0/6] Overhaul memalloc_no* Darrick J. Wong
2020-06-25 20:34   ` Matthew Wilcox
2020-06-25 20:36   ` Michal Hocko
2020-06-25 20:40     ` Matthew Wilcox
2020-06-26 15:02 ` Mikulas Patocka
2020-06-26 23:08   ` Dave Chinner
2020-06-27 13:09     ` Mikulas Patocka
2020-06-29  0:35       ` Dave Chinner
2020-06-29 13:43         ` Mikulas Patocka
2020-06-29 22:34           ` Dave Chinner
2020-07-03 14:26             ` [PATCH] dm-bufio: do cleanup from a workqueue Mikulas Patocka
2020-06-29  8:22     ` [PATCH 0/6] Overhaul memalloc_no* Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).