linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] extend vmalloc support for constrained allocations
@ 2021-11-22 15:32 Michal Hocko
  2021-11-22 15:32 ` [PATCH v2 1/4] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc Michal Hocko
                   ` (4 more replies)
  0 siblings, 5 replies; 44+ messages in thread
From: Michal Hocko @ 2021-11-22 15:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

Hi,
The previous version has been posted here [1] 

I hope I have addressed all the feedback. There were some suggestions
for further improvements but I would rather keep this series small as I
cannot really invest more time and I believe further changes can be done
on top.

This version is a rebase on top of the current Linus tree. Apart from
the review feedback and conflicting changes in the area there is only
one new change: filter out __GFP_NOFAIL from the bulk allocator. This is
not strictly necessary AFAICS but I found it less confusing because
vmalloc has its own fallback strategy and the bulk allocator is meant
only for the fast path.

Original cover:
Based on a recent discussion with Dave and Neil [2] I have tried to
implement NOFS, NOIO and NOFAIL support for vmalloc to make the life of
kvmalloc users easier.

The requirement for NOFAIL support in kvmalloc was new to me but this
seems to be really needed by the xfs code.

NOFS/NOIO support was a known and long-standing problem which was hoped
to be handled by the scope API. Those scopes should have been used at
the reclaim recursion boundaries, both to document them and also to
remove the need for NOFS/NOIO constraints on all allocations within the
scope. Instead, workarounds were developed to wrap a single allocation
(like ceph_kvmalloc).

The first patch implements NOFS/NOIO support for vmalloc. The second
one adds NOFAIL support, and the last one bundles it all together into
kvmalloc and drops ceph_kvmalloc, which can now use kvmalloc directly.
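
To make this concrete, here is a minimal sketch (the function names are
illustrative only and not part of the series) of what such a workaround
looks like today versus what becomes possible once kvmalloc understands
the constrained flags directly:

#include <linux/mm.h>           /* kvmalloc() */
#include <linux/sched/mm.h>     /* memalloc_nofs_save/restore() */

/* today: wrap a single allocation with the scope API */
static void *nofs_alloc_workaround(size_t size)
{
        unsigned int flags = memalloc_nofs_save();
        void *p = kvmalloc(size, GFP_KERNEL);

        memalloc_nofs_restore(flags);
        return p;
}

/* with this series: express the constraint directly */
static void *nofs_alloc_direct(size_t size)
{
        return kvmalloc(size, GFP_NOFS);
}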

I hope I haven't missed anything in the vmalloc allocator.

Thanks!

[1] http://lkml.kernel.org/r/20211025150223.13621-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/163184741778.29351.16920832234899124642.stgit@noble.brown



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 1/4] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc
  2021-11-22 15:32 [PATCH v2 0/4] extend vmalloc support for constrained allocations Michal Hocko
@ 2021-11-22 15:32 ` Michal Hocko
  2021-11-23 19:05   ` Uladzislau Rezki
  2021-11-26 15:13   ` Vlastimil Babka
  2021-11-22 15:32 ` [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL Michal Hocko
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 44+ messages in thread
From: Michal Hocko @ 2021-11-22 15:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

vmalloc historically hasn't supported GFP_NO{FS,IO} requests because
page table allocations do not support an externally provided gfp mask
and are performed as GFP_KERNEL-like allocations.

For a few years now we have had scope APIs
(memalloc_no{fs,io}_{save,restore}) to enforce NOFS and NOIO constraints
implicitly on all allocations within the scope. There was a hope that
those scopes would be defined at a higher level, where the reclaim
recursion boundary starts/stops (e.g. when a lock required during memory
reclaim is taken etc.). It seems that not all NOFS/NOIO users have
adopted this approach and instead they have taken to wrapping a single
[k]vmalloc allocation in a scope API as a workaround.

These workarounds do not serve the purpose of better reclaim recursion
documentation and reduction of explicit GFP_NO{FS,IO} usage, so let's
just provide the semantic they are asking for without the need for
workarounds.

Add support for GFP_NOFS and GFP_NOIO to vmalloc directly. All internal
allocations already comply with the given gfp_mask. The only current
exception is vmap_pages_range, which needs to allocate kernel page
tables for the mapping. Infer the proper scope API based on the given
gfp mask.
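
As a reminder of why checking the (__GFP_FS | __GFP_IO) pair is
sufficient, here is a simplified excerpt from include/linux/gfp.h (for
reference only):

#define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_NOFS        (__GFP_RECLAIM | __GFP_IO)
#define GFP_NOIO        (__GFP_RECLAIM)

So a mask with __GFP_IO set but __GFP_FS clear is a NOFS request, and a
mask with both bits clear is a NOIO request, which is what the checks
around vmap_pages_range below test for.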

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmalloc.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d2a00ad4e1dd..17ca7001de1f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2926,6 +2926,8 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	unsigned long array_size;
 	unsigned int nr_small_pages = size >> PAGE_SHIFT;
 	unsigned int page_order;
+	unsigned int flags;
+	int ret;
 
 	array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
 	gfp_mask |= __GFP_NOWARN;
@@ -2967,8 +2969,24 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 		goto fail;
 	}
 
-	if (vmap_pages_range(addr, addr + size, prot, area->pages,
-			page_shift) < 0) {
+	/*
+	 * page tables allocations ignore external gfp mask, enforce it
+	 * by the scope API
+	 */
+	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
+		flags = memalloc_nofs_save();
+	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
+		flags = memalloc_noio_save();
+
+	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
+			page_shift);
+
+	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
+		memalloc_nofs_restore(flags);
+	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
+		memalloc_noio_restore(flags);
+
+	if (ret < 0) {
 		warn_alloc(orig_gfp_mask, NULL,
 			"vmalloc error: size %lu, failed to map pages",
 			area->nr_pages * PAGE_SIZE);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-22 15:32 [PATCH v2 0/4] extend vmalloc support for constrained allocations Michal Hocko
  2021-11-22 15:32 ` [PATCH v2 1/4] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc Michal Hocko
@ 2021-11-22 15:32 ` Michal Hocko
  2021-11-23 19:01   ` Uladzislau Rezki
                     ` (2 more replies)
  2021-11-22 15:32 ` [PATCH v2 3/4] mm/vmalloc: be more explicit about supported gfp flags Michal Hocko
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 44+ messages in thread
From: Michal Hocko @ 2021-11-22 15:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Dave Chinner has mentioned that some of the xfs code would benefit from
kvmalloc support for __GFP_NOFAIL because they have allocations that
cannot fail and they do not fit into a single page.

A large part of the vmalloc implementation already complies with the
given gfp flags, so no additional work is needed there. The vm area and
page table allocations are the exception. Implement a retry loop for
those.

Add a short sleep before retrying. 1 jiffy is a completely arbitrary
timeout. Ideally the retry would wait for an explicit event - e.g. a
change to the vmalloc space if the failure was caused by space
fragmentation or depletion. But there are multiple different reasons to
retry and this could become much more complex. Keep the retry simple for
now and just sleep to prevent hogging CPUs.
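
For context, this is the kind of caller the new semantic is aimed at (a
minimal, illustrative sketch; the function name and size are made up):

#include <linux/gfp.h>
#include <linux/vmalloc.h>

/* a must-succeed allocation that is too large for a single page */
static void *alloc_big_table(unsigned long size)
{
        /*
         * With __GFP_NOFAIL vmalloc retries internally and only returns
         * once the allocation has succeeded.
         */
        return __vmalloc(size, GFP_KERNEL | __GFP_NOFAIL);
}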

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmalloc.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 17ca7001de1f..b6aed4f94a85 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2844,6 +2844,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 	 * more permissive.
 	 */
 	if (!order) {
+		gfp_t bulk_gfp = gfp & ~__GFP_NOFAIL;
+
 		while (nr_allocated < nr_pages) {
 			unsigned int nr, nr_pages_request;
 
@@ -2861,12 +2863,12 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 			 * but mempolcy want to alloc memory by interleaving.
 			 */
 			if (IS_ENABLED(CONFIG_NUMA) && nid == NUMA_NO_NODE)
-				nr = alloc_pages_bulk_array_mempolicy(gfp,
+				nr = alloc_pages_bulk_array_mempolicy(bulk_gfp,
 							nr_pages_request,
 							pages + nr_allocated);
 
 			else
-				nr = alloc_pages_bulk_array_node(gfp, nid,
+				nr = alloc_pages_bulk_array_node(bulk_gfp, nid,
 							nr_pages_request,
 							pages + nr_allocated);
 
@@ -2921,6 +2923,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 {
 	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
 	const gfp_t orig_gfp_mask = gfp_mask;
+	bool nofail = gfp_mask & __GFP_NOFAIL;
 	unsigned long addr = (unsigned long)area->addr;
 	unsigned long size = get_vm_area_size(area);
 	unsigned long array_size;
@@ -2978,8 +2981,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
 		flags = memalloc_noio_save();
 
-	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
+	do {
+		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
 			page_shift);
+		if (nofail && (ret < 0))
+			schedule_timeout_uninterruptible(1);
+	} while (nofail && (ret < 0));
 
 	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
 		memalloc_nofs_restore(flags);
@@ -3074,9 +3081,14 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 				  VM_UNINITIALIZED | vm_flags, start, end, node,
 				  gfp_mask, caller);
 	if (!area) {
+		bool nofail = gfp_mask & __GFP_NOFAIL;
 		warn_alloc(gfp_mask, NULL,
-			"vmalloc error: size %lu, vm_struct allocation failed",
-			real_size);
+			"vmalloc error: size %lu, vm_struct allocation failed%s",
+			real_size, (nofail) ? ". Retrying." : "");
+		if (nofail) {
+			schedule_timeout_uninterruptible(1);
+			goto again;
+		}
 		goto fail;
 	}
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 3/4] mm/vmalloc: be more explicit about supported gfp flags.
  2021-11-22 15:32 [PATCH v2 0/4] extend vmalloc support for constrained allocations Michal Hocko
  2021-11-22 15:32 ` [PATCH v2 1/4] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc Michal Hocko
  2021-11-22 15:32 ` [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL Michal Hocko
@ 2021-11-22 15:32 ` Michal Hocko
  2021-11-23 18:58   ` Uladzislau Rezki
  2021-11-26 15:39   ` Vlastimil Babka
  2021-11-22 15:32 ` [PATCH v2 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc Michal Hocko
  2021-11-24 22:55 ` [PATCH v2 0/4] extend vmalloc support for constrained allocations Dave Chinner
  4 siblings, 2 replies; 44+ messages in thread
From: Michal Hocko @ 2021-11-22 15:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

b7d90e7a5ea8 ("mm/vmalloc: be more explicit about supported gfp flags")
has been merged prematurely without the rest of the series and without
the review feedback from Neil having been addressed. Fix that up now.
Only the wording is changed slightly.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmalloc.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b6aed4f94a85..b1c115ec13be 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3021,12 +3021,14 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
  *
  * Allocate enough pages to cover @size from the page level
  * allocator with @gfp_mask flags. Please note that the full set of gfp
- * flags are not supported. GFP_KERNEL would be a preferred allocation mode
- * but GFP_NOFS and GFP_NOIO are supported as well. Zone modifiers are not
- * supported. From the reclaim modifiers__GFP_DIRECT_RECLAIM is required (aka
- * GFP_NOWAIT is not supported) and only __GFP_NOFAIL is supported (aka
- * __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported).
- * __GFP_NOWARN can be used to suppress error messages about failures.
+ * flags are not supported. GFP_KERNEL, GFP_NOFS and GFP_NOIO are all
+ * supported.
+ * Zone modifiers are not supported. From the reclaim modifiers
+ * __GFP_DIRECT_RECLAIM is required (aka GFP_NOWAIT is not supported)
+ * and only __GFP_NOFAIL is supported (i.e. __GFP_NORETRY and
+ * __GFP_RETRY_MAYFAIL are not supported).
+ *
+ * __GFP_NOWARN can be used to suppress failure messages.
  *
  * Map them into contiguous kernel virtual space, using a pagetable
  * protection of @prot.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc
  2021-11-22 15:32 [PATCH v2 0/4] extend vmalloc support for constrained allocations Michal Hocko
                   ` (2 preceding siblings ...)
  2021-11-22 15:32 ` [PATCH v2 3/4] mm/vmalloc: be more explicit about supported gfp flags Michal Hocko
@ 2021-11-22 15:32 ` Michal Hocko
  2021-11-23 18:57   ` Uladzislau Rezki
                     ` (2 more replies)
  2021-11-24 22:55 ` [PATCH v2 0/4] extend vmalloc support for constrained allocations Dave Chinner
  4 siblings, 3 replies; 44+ messages in thread
From: Michal Hocko @ 2021-11-22 15:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented by the
previous patches so we can now allow it for kvmalloc as well. This will
allow some external users to simplify or completely remove their
helpers.

The GFP_NOWAIT semantic has never been supported but this hasn't been
explicitly documented so far, so add a note about it.

ceph_kvmalloc is the first helper to be dropped and its users converted
to kvmalloc.
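
In other words (illustrative calls only, not taken from the patch):

#include <linux/mm.h>

static void kvmalloc_gfp_examples(size_t size)
{
        void *p;

        /* now supported: the vmalloc fallback honours these constraints */
        p = kvmalloc(size, GFP_NOFS);
        kvfree(p);
        p = kvmalloc(size, GFP_KERNEL | __GFP_NOFAIL);
        kvfree(p);

        /*
         * Still not supported: the vmalloc fallback has to be able to
         * sleep, so GFP_NOWAIT and GFP_ATOMIC remain out of scope.
         */
}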

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/ceph/libceph.h |  1 -
 mm/util.c                    | 15 ++++-----------
 net/ceph/buffer.c            |  4 ++--
 net/ceph/ceph_common.c       | 27 ---------------------------
 net/ceph/crypto.c            |  2 +-
 net/ceph/messenger.c         |  2 +-
 net/ceph/messenger_v2.c      |  2 +-
 net/ceph/osdmap.c            | 12 ++++++------
 8 files changed, 15 insertions(+), 50 deletions(-)

diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
index 409d8c29bc4f..309acbcb5a8a 100644
--- a/include/linux/ceph/libceph.h
+++ b/include/linux/ceph/libceph.h
@@ -295,7 +295,6 @@ extern bool libceph_compatible(void *data);
 
 extern const char *ceph_msg_type_name(int type);
 extern int ceph_check_fsid(struct ceph_client *client, struct ceph_fsid *fsid);
-extern void *ceph_kvmalloc(size_t size, gfp_t flags);
 
 struct fs_parameter;
 struct fc_log;
diff --git a/mm/util.c b/mm/util.c
index e58151a61255..7275f2829e3f 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -549,13 +549,10 @@ EXPORT_SYMBOL(vm_mmap);
  * Uses kmalloc to get the memory but if the allocation fails then falls back
  * to the vmalloc allocator. Use kvfree for freeing the memory.
  *
- * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
+ * GFP_NOWAIT and GFP_ATOMIC are not supported, neither is the __GFP_NORETRY modifier.
  * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
  * preferable to the vmalloc fallback, due to visible performance drawbacks.
  *
- * Please note that any use of gfp flags outside of GFP_KERNEL is careful to not
- * fall back to vmalloc.
- *
  * Return: pointer to the allocated memory of %NULL in case of failure
  */
 void *kvmalloc_node(size_t size, gfp_t flags, int node)
@@ -563,13 +560,6 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	gfp_t kmalloc_flags = flags;
 	void *ret;
 
-	/*
-	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
-	 * so the given set of flags has to be compatible.
-	 */
-	if ((flags & GFP_KERNEL) != GFP_KERNEL)
-		return kmalloc_node(size, flags, node);
-
 	/*
 	 * We want to attempt a large physically contiguous block first because
 	 * it is less likely to fragment multiple larger blocks and therefore
@@ -582,6 +572,9 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 
 		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL))
 			kmalloc_flags |= __GFP_NORETRY;
+
+		/* nofail semantic is implemented by the vmalloc fallback */
+		kmalloc_flags &= ~__GFP_NOFAIL;
 	}
 
 	ret = kmalloc_node(size, kmalloc_flags, node);
diff --git a/net/ceph/buffer.c b/net/ceph/buffer.c
index 5622763ad402..7e51f128045d 100644
--- a/net/ceph/buffer.c
+++ b/net/ceph/buffer.c
@@ -7,7 +7,7 @@
 
 #include <linux/ceph/buffer.h>
 #include <linux/ceph/decode.h>
-#include <linux/ceph/libceph.h> /* for ceph_kvmalloc */
+#include <linux/ceph/libceph.h> /* for kvmalloc */
 
 struct ceph_buffer *ceph_buffer_new(size_t len, gfp_t gfp)
 {
@@ -17,7 +17,7 @@ struct ceph_buffer *ceph_buffer_new(size_t len, gfp_t gfp)
 	if (!b)
 		return NULL;
 
-	b->vec.iov_base = ceph_kvmalloc(len, gfp);
+	b->vec.iov_base = kvmalloc(len, gfp);
 	if (!b->vec.iov_base) {
 		kfree(b);
 		return NULL;
diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
index 97d6ea763e32..9441b4a4912b 100644
--- a/net/ceph/ceph_common.c
+++ b/net/ceph/ceph_common.c
@@ -190,33 +190,6 @@ int ceph_compare_options(struct ceph_options *new_opt,
 }
 EXPORT_SYMBOL(ceph_compare_options);
 
-/*
- * kvmalloc() doesn't fall back to the vmalloc allocator unless flags are
- * compatible with (a superset of) GFP_KERNEL.  This is because while the
- * actual pages are allocated with the specified flags, the page table pages
- * are always allocated with GFP_KERNEL.
- *
- * ceph_kvmalloc() may be called with GFP_KERNEL, GFP_NOFS or GFP_NOIO.
- */
-void *ceph_kvmalloc(size_t size, gfp_t flags)
-{
-	void *p;
-
-	if ((flags & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) {
-		p = kvmalloc(size, flags);
-	} else if ((flags & (__GFP_IO | __GFP_FS)) == __GFP_IO) {
-		unsigned int nofs_flag = memalloc_nofs_save();
-		p = kvmalloc(size, GFP_KERNEL);
-		memalloc_nofs_restore(nofs_flag);
-	} else {
-		unsigned int noio_flag = memalloc_noio_save();
-		p = kvmalloc(size, GFP_KERNEL);
-		memalloc_noio_restore(noio_flag);
-	}
-
-	return p;
-}
-
 static int parse_fsid(const char *str, struct ceph_fsid *fsid)
 {
 	int i = 0;
diff --git a/net/ceph/crypto.c b/net/ceph/crypto.c
index 92d89b331645..051d22c0e4ad 100644
--- a/net/ceph/crypto.c
+++ b/net/ceph/crypto.c
@@ -147,7 +147,7 @@ void ceph_crypto_key_destroy(struct ceph_crypto_key *key)
 static const u8 *aes_iv = (u8 *)CEPH_AES_IV;
 
 /*
- * Should be used for buffers allocated with ceph_kvmalloc().
+ * Should be used for buffers allocated with kvmalloc().
  * Currently these are encrypt out-buffer (ceph_buffer) and decrypt
  * in-buffer (msg front).
  *
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 57d043b382ed..7b891be799d2 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1920,7 +1920,7 @@ struct ceph_msg *ceph_msg_new2(int type, int front_len, int max_data_items,
 
 	/* front */
 	if (front_len) {
-		m->front.iov_base = ceph_kvmalloc(front_len, flags);
+		m->front.iov_base = kvmalloc(front_len, flags);
 		if (m->front.iov_base == NULL) {
 			dout("ceph_msg_new can't allocate %d bytes\n",
 			     front_len);
diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
index cc40ce4e02fb..c4099b641b38 100644
--- a/net/ceph/messenger_v2.c
+++ b/net/ceph/messenger_v2.c
@@ -308,7 +308,7 @@ static void *alloc_conn_buf(struct ceph_connection *con, int len)
 	if (WARN_ON(con->v2.conn_buf_cnt >= ARRAY_SIZE(con->v2.conn_bufs)))
 		return NULL;
 
-	buf = ceph_kvmalloc(len, GFP_NOIO);
+	buf = kvmalloc(len, GFP_NOIO);
 	if (!buf)
 		return NULL;
 
diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 75b738083523..2823bb3cff55 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -980,7 +980,7 @@ static struct crush_work *alloc_workspace(const struct crush_map *c)
 	work_size = crush_work_size(c, CEPH_PG_MAX_SIZE);
 	dout("%s work_size %zu bytes\n", __func__, work_size);
 
-	work = ceph_kvmalloc(work_size, GFP_NOIO);
+	work = kvmalloc(work_size, GFP_NOIO);
 	if (!work)
 		return NULL;
 
@@ -1190,9 +1190,9 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, u32 max)
 	if (max == map->max_osd)
 		return 0;
 
-	state = ceph_kvmalloc(array_size(max, sizeof(*state)), GFP_NOFS);
-	weight = ceph_kvmalloc(array_size(max, sizeof(*weight)), GFP_NOFS);
-	addr = ceph_kvmalloc(array_size(max, sizeof(*addr)), GFP_NOFS);
+	state = kvmalloc(array_size(max, sizeof(*state)), GFP_NOFS);
+	weight = kvmalloc(array_size(max, sizeof(*weight)), GFP_NOFS);
+	addr = kvmalloc(array_size(max, sizeof(*addr)), GFP_NOFS);
 	if (!state || !weight || !addr) {
 		kvfree(state);
 		kvfree(weight);
@@ -1222,7 +1222,7 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, u32 max)
 	if (map->osd_primary_affinity) {
 		u32 *affinity;
 
-		affinity = ceph_kvmalloc(array_size(max, sizeof(*affinity)),
+		affinity = kvmalloc(array_size(max, sizeof(*affinity)),
 					 GFP_NOFS);
 		if (!affinity)
 			return -ENOMEM;
@@ -1503,7 +1503,7 @@ static int set_primary_affinity(struct ceph_osdmap *map, int osd, u32 aff)
 	if (!map->osd_primary_affinity) {
 		int i;
 
-		map->osd_primary_affinity = ceph_kvmalloc(
+		map->osd_primary_affinity = kvmalloc(
 		    array_size(map->max_osd, sizeof(*map->osd_primary_affinity)),
 		    GFP_NOFS);
 		if (!map->osd_primary_affinity)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc
  2021-11-22 15:32 ` [PATCH v2 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc Michal Hocko
@ 2021-11-23 18:57   ` Uladzislau Rezki
  2021-11-23 19:02   ` Uladzislau Rezki
  2021-11-26 15:50   ` Vlastimil Babka
  2 siblings, 0 replies; 44+ messages in thread
From: Uladzislau Rezki @ 2021-11-23 18:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Chinner, Neil Brown, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, linux-mm, LKML, Ilya Dryomov,
	Jeff Layton, Michal Hocko

On Mon, Nov 22, 2021 at 04:32:33PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented by the
> previous patches so we can now allow it for kvmalloc as well. This will
> allow some external users to simplify or completely remove their
> helpers.
> 
> The GFP_NOWAIT semantic has never been supported but this hasn't been
> explicitly documented so far, so add a note about it.
> 
> ceph_kvmalloc is the first helper to be dropped and its users converted
> to kvmalloc.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/ceph/libceph.h |  1 -
>  mm/util.c                    | 15 ++++-----------
>  net/ceph/buffer.c            |  4 ++--
>  net/ceph/ceph_common.c       | 27 ---------------------------
>  net/ceph/crypto.c            |  2 +-
>  net/ceph/messenger.c         |  2 +-
>  net/ceph/messenger_v2.c      |  2 +-
>  net/ceph/osdmap.c            | 12 ++++++------
>  8 files changed, 15 insertions(+), 50 deletions(-)
> 
> diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
> index 409d8c29bc4f..309acbcb5a8a 100644
> --- a/include/linux/ceph/libceph.h
> +++ b/include/linux/ceph/libceph.h
> @@ -295,7 +295,6 @@ extern bool libceph_compatible(void *data);
>  
>  extern const char *ceph_msg_type_name(int type);
>  extern int ceph_check_fsid(struct ceph_client *client, struct ceph_fsid *fsid);
> -extern void *ceph_kvmalloc(size_t size, gfp_t flags);
>  
>  struct fs_parameter;
>  struct fc_log;
> diff --git a/mm/util.c b/mm/util.c
> index e58151a61255..7275f2829e3f 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -549,13 +549,10 @@ EXPORT_SYMBOL(vm_mmap);
>   * Uses kmalloc to get the memory but if the allocation fails then falls back
>   * to the vmalloc allocator. Use kvfree for freeing the memory.
>   *
> - * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
> + * GFP_NOWAIT and GFP_ATOMIC are not supported, neither is the __GFP_NORETRY modifier.
>   * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
>   * preferable to the vmalloc fallback, due to visible performance drawbacks.
>   *
> - * Please note that any use of gfp flags outside of GFP_KERNEL is careful to not
> - * fall back to vmalloc.
> - *
>   * Return: pointer to the allocated memory of %NULL in case of failure
>   */
>  void *kvmalloc_node(size_t size, gfp_t flags, int node)
> @@ -563,13 +560,6 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  	gfp_t kmalloc_flags = flags;
>  	void *ret;
>  
> -	/*
> -	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
> -	 * so the given set of flags has to be compatible.
> -	 */
> -	if ((flags & GFP_KERNEL) != GFP_KERNEL)
> -		return kmalloc_node(size, flags, node);
> -
>  	/*
>  	 * We want to attempt a large physically contiguous block first because
>  	 * it is less likely to fragment multiple larger blocks and therefore
> @@ -582,6 +572,9 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  
>  		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL))
>  			kmalloc_flags |= __GFP_NORETRY;
> +
> +		/* nofail semantic is implemented by the vmalloc fallback */
> +		kmalloc_flags &= ~__GFP_NOFAIL;
>  	}
>  
>  	ret = kmalloc_node(size, kmalloc_flags, node);
> diff --git a/net/ceph/buffer.c b/net/ceph/buffer.c
> index 5622763ad402..7e51f128045d 100644
> --- a/net/ceph/buffer.c
> +++ b/net/ceph/buffer.c
> @@ -7,7 +7,7 @@
>  
>  #include <linux/ceph/buffer.h>
>  #include <linux/ceph/decode.h>
> -#include <linux/ceph/libceph.h> /* for ceph_kvmalloc */
> +#include <linux/ceph/libceph.h> /* for kvmalloc */
>  
>  struct ceph_buffer *ceph_buffer_new(size_t len, gfp_t gfp)
>  {
> @@ -17,7 +17,7 @@ struct ceph_buffer *ceph_buffer_new(size_t len, gfp_t gfp)
>  	if (!b)
>  		return NULL;
>  
> -	b->vec.iov_base = ceph_kvmalloc(len, gfp);
> +	b->vec.iov_base = kvmalloc(len, gfp);
>  	if (!b->vec.iov_base) {
>  		kfree(b);
>  		return NULL;
> diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
> index 97d6ea763e32..9441b4a4912b 100644
> --- a/net/ceph/ceph_common.c
> +++ b/net/ceph/ceph_common.c
> @@ -190,33 +190,6 @@ int ceph_compare_options(struct ceph_options *new_opt,
>  }
>  EXPORT_SYMBOL(ceph_compare_options);
>  
> -/*
> - * kvmalloc() doesn't fall back to the vmalloc allocator unless flags are
> - * compatible with (a superset of) GFP_KERNEL.  This is because while the
> - * actual pages are allocated with the specified flags, the page table pages
> - * are always allocated with GFP_KERNEL.
> - *
> - * ceph_kvmalloc() may be called with GFP_KERNEL, GFP_NOFS or GFP_NOIO.
> - */
> -void *ceph_kvmalloc(size_t size, gfp_t flags)
> -{
> -	void *p;
> -
> -	if ((flags & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) {
> -		p = kvmalloc(size, flags);
> -	} else if ((flags & (__GFP_IO | __GFP_FS)) == __GFP_IO) {
> -		unsigned int nofs_flag = memalloc_nofs_save();
> -		p = kvmalloc(size, GFP_KERNEL);
> -		memalloc_nofs_restore(nofs_flag);
> -	} else {
> -		unsigned int noio_flag = memalloc_noio_save();
> -		p = kvmalloc(size, GFP_KERNEL);
> -		memalloc_noio_restore(noio_flag);
> -	}
> -
> -	return p;
> -}
> -
>  static int parse_fsid(const char *str, struct ceph_fsid *fsid)
>  {
>  	int i = 0;
> diff --git a/net/ceph/crypto.c b/net/ceph/crypto.c
> index 92d89b331645..051d22c0e4ad 100644
> --- a/net/ceph/crypto.c
> +++ b/net/ceph/crypto.c
> @@ -147,7 +147,7 @@ void ceph_crypto_key_destroy(struct ceph_crypto_key *key)
>  static const u8 *aes_iv = (u8 *)CEPH_AES_IV;
>  
>  /*
> - * Should be used for buffers allocated with ceph_kvmalloc().
> + * Should be used for buffers allocated with kvmalloc().
>   * Currently these are encrypt out-buffer (ceph_buffer) and decrypt
>   * in-buffer (msg front).
>   *
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 57d043b382ed..7b891be799d2 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1920,7 +1920,7 @@ struct ceph_msg *ceph_msg_new2(int type, int front_len, int max_data_items,
>  
>  	/* front */
>  	if (front_len) {
> -		m->front.iov_base = ceph_kvmalloc(front_len, flags);
> +		m->front.iov_base = kvmalloc(front_len, flags);
>  		if (m->front.iov_base == NULL) {
>  			dout("ceph_msg_new can't allocate %d bytes\n",
>  			     front_len);
> diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
> index cc40ce4e02fb..c4099b641b38 100644
> --- a/net/ceph/messenger_v2.c
> +++ b/net/ceph/messenger_v2.c
> @@ -308,7 +308,7 @@ static void *alloc_conn_buf(struct ceph_connection *con, int len)
>  	if (WARN_ON(con->v2.conn_buf_cnt >= ARRAY_SIZE(con->v2.conn_bufs)))
>  		return NULL;
>  
> -	buf = ceph_kvmalloc(len, GFP_NOIO);
> +	buf = kvmalloc(len, GFP_NOIO);
>  	if (!buf)
>  		return NULL;
>  
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 75b738083523..2823bb3cff55 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -980,7 +980,7 @@ static struct crush_work *alloc_workspace(const struct crush_map *c)
>  	work_size = crush_work_size(c, CEPH_PG_MAX_SIZE);
>  	dout("%s work_size %zu bytes\n", __func__, work_size);
>  
> -	work = ceph_kvmalloc(work_size, GFP_NOIO);
> +	work = kvmalloc(work_size, GFP_NOIO);
>  	if (!work)
>  		return NULL;
>  
> @@ -1190,9 +1190,9 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, u32 max)
>  	if (max == map->max_osd)
>  		return 0;
>  
> -	state = ceph_kvmalloc(array_size(max, sizeof(*state)), GFP_NOFS);
> -	weight = ceph_kvmalloc(array_size(max, sizeof(*weight)), GFP_NOFS);
> -	addr = ceph_kvmalloc(array_size(max, sizeof(*addr)), GFP_NOFS);
> +	state = kvmalloc(array_size(max, sizeof(*state)), GFP_NOFS);
> +	weight = kvmalloc(array_size(max, sizeof(*weight)), GFP_NOFS);
> +	addr = kvmalloc(array_size(max, sizeof(*addr)), GFP_NOFS);
>  	if (!state || !weight || !addr) {
>  		kvfree(state);
>  		kvfree(weight);
> @@ -1222,7 +1222,7 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, u32 max)
>  	if (map->osd_primary_affinity) {
>  		u32 *affinity;
>  
> -		affinity = ceph_kvmalloc(array_size(max, sizeof(*affinity)),
> +		affinity = kvmalloc(array_size(max, sizeof(*affinity)),
>  					 GFP_NOFS);
>  		if (!affinity)
>  			return -ENOMEM;
> @@ -1503,7 +1503,7 @@ static int set_primary_affinity(struct ceph_osdmap *map, int osd, u32 aff)
>  	if (!map->osd_primary_affinity) {
>  		int i;
>  
> -		map->osd_primary_affinity = ceph_kvmalloc(
> +		map->osd_primary_affinity = kvmalloc(
>  		    array_size(map->max_osd, sizeof(*map->osd_primary_affinity)),
>  		    GFP_NOFS);
>  		if (!map->osd_primary_affinity)
> -- 
> 2.30.2
> 
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 3/4] mm/vmalloc: be more explicit about supported gfp flags.
  2021-11-22 15:32 ` [PATCH v2 3/4] mm/vmalloc: be more explicit about supported gfp flags Michal Hocko
@ 2021-11-23 18:58   ` Uladzislau Rezki
  2021-11-26 15:39   ` Vlastimil Babka
  1 sibling, 0 replies; 44+ messages in thread
From: Uladzislau Rezki @ 2021-11-23 18:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Chinner, Neil Brown, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, linux-mm, LKML, Ilya Dryomov,
	Jeff Layton, Michal Hocko

On Mon, Nov 22, 2021 at 04:32:32PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> b7d90e7a5ea8 ("mm/vmalloc: be more explicit about supported gfp flags")
> has been merged prematurely without the rest of the series and without
> the review feedback from Neil having been addressed. Fix that up now.
> Only the wording is changed slightly.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmalloc.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index b6aed4f94a85..b1c115ec13be 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3021,12 +3021,14 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>   *
>   * Allocate enough pages to cover @size from the page level
>   * allocator with @gfp_mask flags. Please note that the full set of gfp
> - * flags are not supported. GFP_KERNEL would be a preferred allocation mode
> - * but GFP_NOFS and GFP_NOIO are supported as well. Zone modifiers are not
> - * supported. From the reclaim modifiers__GFP_DIRECT_RECLAIM is required (aka
> - * GFP_NOWAIT is not supported) and only __GFP_NOFAIL is supported (aka
> - * __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported).
> - * __GFP_NOWARN can be used to suppress error messages about failures.
> + * flags are not supported. GFP_KERNEL, GFP_NOFS and GFP_NOIO are all
> + * supported.
> + * Zone modifiers are not supported. From the reclaim modifiers
> + * __GFP_DIRECT_RECLAIM is required (aka GFP_NOWAIT is not supported)
> + * and only __GFP_NOFAIL is supported (i.e. __GFP_NORETRY and
> + * __GFP_RETRY_MAYFAIL are not supported).
> + *
> + * __GFP_NOWARN can be used to suppress failure messages.
>   *
>   * Map them into contiguous kernel virtual space, using a pagetable
>   * protection of @prot.
> -- 
> 2.30.2
> 
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-22 15:32 ` [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL Michal Hocko
@ 2021-11-23 19:01   ` Uladzislau Rezki
  2021-11-23 20:09     ` Michal Hocko
  2021-11-24  1:02     ` Andrew Morton
  2021-11-26 10:48   ` Michal Hocko
  2021-11-26 15:32   ` Vlastimil Babka
  2 siblings, 2 replies; 44+ messages in thread
From: Uladzislau Rezki @ 2021-11-23 19:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Chinner, Neil Brown, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, linux-mm, LKML, Ilya Dryomov,
	Jeff Layton, Michal Hocko

On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Dave Chinner has mentioned that some of the xfs code would benefit from
> kvmalloc support for __GFP_NOFAIL because they have allocations that
> cannot fail and they do not fit into a single page.
> 
> A large part of the vmalloc implementation already complies with the
> given gfp flags, so no additional work is needed there. The vm area and
> page table allocations are the exception. Implement a retry loop for
> those.
> 
> Add a short sleep before retrying. 1 jiffy is a completely arbitrary
> timeout. Ideally the retry would wait for an explicit event - e.g. a
> change to the vmalloc space if the failure was caused by space
> fragmentation or depletion. But there are multiple different reasons to
> retry and this could become much more complex. Keep the retry simple for
> now and just sleep to prevent hogging CPUs.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmalloc.c | 22 +++++++++++++++++-----
>  1 file changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 17ca7001de1f..b6aed4f94a85 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2844,6 +2844,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  	 * more permissive.
>  	 */
>  	if (!order) {
> +		gfp_t bulk_gfp = gfp & ~__GFP_NOFAIL;
> +
>  		while (nr_allocated < nr_pages) {
>  			unsigned int nr, nr_pages_request;
>  
> @@ -2861,12 +2863,12 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  			 * but mempolcy want to alloc memory by interleaving.
>  			 */
>  			if (IS_ENABLED(CONFIG_NUMA) && nid == NUMA_NO_NODE)
> -				nr = alloc_pages_bulk_array_mempolicy(gfp,
> +				nr = alloc_pages_bulk_array_mempolicy(bulk_gfp,
>  							nr_pages_request,
>  							pages + nr_allocated);
>  
>  			else
> -				nr = alloc_pages_bulk_array_node(gfp, nid,
> +				nr = alloc_pages_bulk_array_node(bulk_gfp, nid,
>  							nr_pages_request,
>  							pages + nr_allocated);
>  
> @@ -2921,6 +2923,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  {
>  	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
>  	const gfp_t orig_gfp_mask = gfp_mask;
> +	bool nofail = gfp_mask & __GFP_NOFAIL;
>  	unsigned long addr = (unsigned long)area->addr;
>  	unsigned long size = get_vm_area_size(area);
>  	unsigned long array_size;
> @@ -2978,8 +2981,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
>  		flags = memalloc_noio_save();
>  
> -	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> +	do {
> +		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
>  			page_shift);
> +		if (nofail && (ret < 0))
> +			schedule_timeout_uninterruptible(1);
> +	} while (nofail && (ret < 0));
>  
>  	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
>  		memalloc_nofs_restore(flags);
> @@ -3074,9 +3081,14 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  				  VM_UNINITIALIZED | vm_flags, start, end, node,
>  				  gfp_mask, caller);
>  	if (!area) {
> +		bool nofail = gfp_mask & __GFP_NOFAIL;
>  		warn_alloc(gfp_mask, NULL,
> -			"vmalloc error: size %lu, vm_struct allocation failed",
> -			real_size);
> +			"vmalloc error: size %lu, vm_struct allocation failed%s",
> +			real_size, (nofail) ? ". Retrying." : "");
> +		if (nofail) {
> +			schedule_timeout_uninterruptible(1);
> +			goto again;
> +		}
>  		goto fail;
>  	}
>  
> -- 
> 2.30.2
> 
I have raised two concerns in our previous discussion about this change,
well that is sad...

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc
  2021-11-22 15:32 ` [PATCH v2 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc Michal Hocko
  2021-11-23 18:57   ` Uladzislau Rezki
@ 2021-11-23 19:02   ` Uladzislau Rezki
  2021-11-26 15:50   ` Vlastimil Babka
  2 siblings, 0 replies; 44+ messages in thread
From: Uladzislau Rezki @ 2021-11-23 19:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Chinner, Neil Brown, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, linux-mm, LKML, Ilya Dryomov,
	Jeff Layton, Michal Hocko

On Mon, Nov 22, 2021 at 04:32:33PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented by the
> previous patches so we can now allow it for kvmalloc as well. This will
> allow some external users to simplify or completely remove their
> helpers.
> 
> The GFP_NOWAIT semantic has never been supported but this hasn't been
> explicitly documented so far, so add a note about it.
> 
> ceph_kvmalloc is the first helper to be dropped and its users converted
> to kvmalloc.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/ceph/libceph.h |  1 -
>  mm/util.c                    | 15 ++++-----------
>  net/ceph/buffer.c            |  4 ++--
>  net/ceph/ceph_common.c       | 27 ---------------------------
>  net/ceph/crypto.c            |  2 +-
>  net/ceph/messenger.c         |  2 +-
>  net/ceph/messenger_v2.c      |  2 +-
>  net/ceph/osdmap.c            | 12 ++++++------
>  8 files changed, 15 insertions(+), 50 deletions(-)
> 
> diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
> index 409d8c29bc4f..309acbcb5a8a 100644
> --- a/include/linux/ceph/libceph.h
> +++ b/include/linux/ceph/libceph.h
> @@ -295,7 +295,6 @@ extern bool libceph_compatible(void *data);
>  
>  extern const char *ceph_msg_type_name(int type);
>  extern int ceph_check_fsid(struct ceph_client *client, struct ceph_fsid *fsid);
> -extern void *ceph_kvmalloc(size_t size, gfp_t flags);
>  
>  struct fs_parameter;
>  struct fc_log;
> diff --git a/mm/util.c b/mm/util.c
> index e58151a61255..7275f2829e3f 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -549,13 +549,10 @@ EXPORT_SYMBOL(vm_mmap);
>   * Uses kmalloc to get the memory but if the allocation fails then falls back
>   * to the vmalloc allocator. Use kvfree for freeing the memory.
>   *
> - * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
> + * GFP_NOWAIT and GFP_ATOMIC are not supported, neither is the __GFP_NORETRY modifier.
>   * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
>   * preferable to the vmalloc fallback, due to visible performance drawbacks.
>   *
> - * Please note that any use of gfp flags outside of GFP_KERNEL is careful to not
> - * fall back to vmalloc.
> - *
>   * Return: pointer to the allocated memory of %NULL in case of failure
>   */
>  void *kvmalloc_node(size_t size, gfp_t flags, int node)
> @@ -563,13 +560,6 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  	gfp_t kmalloc_flags = flags;
>  	void *ret;
>  
> -	/*
> -	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
> -	 * so the given set of flags has to be compatible.
> -	 */
> -	if ((flags & GFP_KERNEL) != GFP_KERNEL)
> -		return kmalloc_node(size, flags, node);
> -
>  	/*
>  	 * We want to attempt a large physically contiguous block first because
>  	 * it is less likely to fragment multiple larger blocks and therefore
> @@ -582,6 +572,9 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  
>  		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL))
>  			kmalloc_flags |= __GFP_NORETRY;
> +
> +		/* nofail semantic is implemented by the vmalloc fallback */
> +		kmalloc_flags &= ~__GFP_NOFAIL;
>  	}
>  
>  	ret = kmalloc_node(size, kmalloc_flags, node);
> diff --git a/net/ceph/buffer.c b/net/ceph/buffer.c
> index 5622763ad402..7e51f128045d 100644
> --- a/net/ceph/buffer.c
> +++ b/net/ceph/buffer.c
> @@ -7,7 +7,7 @@
>  
>  #include <linux/ceph/buffer.h>
>  #include <linux/ceph/decode.h>
> -#include <linux/ceph/libceph.h> /* for ceph_kvmalloc */
> +#include <linux/ceph/libceph.h> /* for kvmalloc */
>  
>  struct ceph_buffer *ceph_buffer_new(size_t len, gfp_t gfp)
>  {
> @@ -17,7 +17,7 @@ struct ceph_buffer *ceph_buffer_new(size_t len, gfp_t gfp)
>  	if (!b)
>  		return NULL;
>  
> -	b->vec.iov_base = ceph_kvmalloc(len, gfp);
> +	b->vec.iov_base = kvmalloc(len, gfp);
>  	if (!b->vec.iov_base) {
>  		kfree(b);
>  		return NULL;
> diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
> index 97d6ea763e32..9441b4a4912b 100644
> --- a/net/ceph/ceph_common.c
> +++ b/net/ceph/ceph_common.c
> @@ -190,33 +190,6 @@ int ceph_compare_options(struct ceph_options *new_opt,
>  }
>  EXPORT_SYMBOL(ceph_compare_options);
>  
> -/*
> - * kvmalloc() doesn't fall back to the vmalloc allocator unless flags are
> - * compatible with (a superset of) GFP_KERNEL.  This is because while the
> - * actual pages are allocated with the specified flags, the page table pages
> - * are always allocated with GFP_KERNEL.
> - *
> - * ceph_kvmalloc() may be called with GFP_KERNEL, GFP_NOFS or GFP_NOIO.
> - */
> -void *ceph_kvmalloc(size_t size, gfp_t flags)
> -{
> -	void *p;
> -
> -	if ((flags & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) {
> -		p = kvmalloc(size, flags);
> -	} else if ((flags & (__GFP_IO | __GFP_FS)) == __GFP_IO) {
> -		unsigned int nofs_flag = memalloc_nofs_save();
> -		p = kvmalloc(size, GFP_KERNEL);
> -		memalloc_nofs_restore(nofs_flag);
> -	} else {
> -		unsigned int noio_flag = memalloc_noio_save();
> -		p = kvmalloc(size, GFP_KERNEL);
> -		memalloc_noio_restore(noio_flag);
> -	}
> -
> -	return p;
> -}
> -
>  static int parse_fsid(const char *str, struct ceph_fsid *fsid)
>  {
>  	int i = 0;
> diff --git a/net/ceph/crypto.c b/net/ceph/crypto.c
> index 92d89b331645..051d22c0e4ad 100644
> --- a/net/ceph/crypto.c
> +++ b/net/ceph/crypto.c
> @@ -147,7 +147,7 @@ void ceph_crypto_key_destroy(struct ceph_crypto_key *key)
>  static const u8 *aes_iv = (u8 *)CEPH_AES_IV;
>  
>  /*
> - * Should be used for buffers allocated with ceph_kvmalloc().
> + * Should be used for buffers allocated with kvmalloc().
>   * Currently these are encrypt out-buffer (ceph_buffer) and decrypt
>   * in-buffer (msg front).
>   *
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 57d043b382ed..7b891be799d2 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1920,7 +1920,7 @@ struct ceph_msg *ceph_msg_new2(int type, int front_len, int max_data_items,
>  
>  	/* front */
>  	if (front_len) {
> -		m->front.iov_base = ceph_kvmalloc(front_len, flags);
> +		m->front.iov_base = kvmalloc(front_len, flags);
>  		if (m->front.iov_base == NULL) {
>  			dout("ceph_msg_new can't allocate %d bytes\n",
>  			     front_len);
> diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
> index cc40ce4e02fb..c4099b641b38 100644
> --- a/net/ceph/messenger_v2.c
> +++ b/net/ceph/messenger_v2.c
> @@ -308,7 +308,7 @@ static void *alloc_conn_buf(struct ceph_connection *con, int len)
>  	if (WARN_ON(con->v2.conn_buf_cnt >= ARRAY_SIZE(con->v2.conn_bufs)))
>  		return NULL;
>  
> -	buf = ceph_kvmalloc(len, GFP_NOIO);
> +	buf = kvmalloc(len, GFP_NOIO);
>  	if (!buf)
>  		return NULL;
>  
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 75b738083523..2823bb3cff55 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -980,7 +980,7 @@ static struct crush_work *alloc_workspace(const struct crush_map *c)
>  	work_size = crush_work_size(c, CEPH_PG_MAX_SIZE);
>  	dout("%s work_size %zu bytes\n", __func__, work_size);
>  
> -	work = ceph_kvmalloc(work_size, GFP_NOIO);
> +	work = kvmalloc(work_size, GFP_NOIO);
>  	if (!work)
>  		return NULL;
>  
> @@ -1190,9 +1190,9 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, u32 max)
>  	if (max == map->max_osd)
>  		return 0;
>  
> -	state = ceph_kvmalloc(array_size(max, sizeof(*state)), GFP_NOFS);
> -	weight = ceph_kvmalloc(array_size(max, sizeof(*weight)), GFP_NOFS);
> -	addr = ceph_kvmalloc(array_size(max, sizeof(*addr)), GFP_NOFS);
> +	state = kvmalloc(array_size(max, sizeof(*state)), GFP_NOFS);
> +	weight = kvmalloc(array_size(max, sizeof(*weight)), GFP_NOFS);
> +	addr = kvmalloc(array_size(max, sizeof(*addr)), GFP_NOFS);
>  	if (!state || !weight || !addr) {
>  		kvfree(state);
>  		kvfree(weight);
> @@ -1222,7 +1222,7 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, u32 max)
>  	if (map->osd_primary_affinity) {
>  		u32 *affinity;
>  
> -		affinity = ceph_kvmalloc(array_size(max, sizeof(*affinity)),
> +		affinity = kvmalloc(array_size(max, sizeof(*affinity)),
>  					 GFP_NOFS);
>  		if (!affinity)
>  			return -ENOMEM;
> @@ -1503,7 +1503,7 @@ static int set_primary_affinity(struct ceph_osdmap *map, int osd, u32 aff)
>  	if (!map->osd_primary_affinity) {
>  		int i;
>  
> -		map->osd_primary_affinity = ceph_kvmalloc(
> +		map->osd_primary_affinity = kvmalloc(
>  		    array_size(map->max_osd, sizeof(*map->osd_primary_affinity)),
>  		    GFP_NOFS);
>  		if (!map->osd_primary_affinity)
> -- 
> 2.30.2
> 
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/4] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc
  2021-11-22 15:32 ` [PATCH v2 1/4] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc Michal Hocko
@ 2021-11-23 19:05   ` Uladzislau Rezki
  2021-11-26 15:13   ` Vlastimil Babka
  1 sibling, 0 replies; 44+ messages in thread
From: Uladzislau Rezki @ 2021-11-23 19:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Chinner, Neil Brown, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, linux-mm, LKML, Ilya Dryomov,
	Jeff Layton, Michal Hocko

On Mon, Nov 22, 2021 at 04:32:30PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> vmalloc historically hasn't supported GFP_NO{FS,IO} requests because
> page table allocations do not support an externally provided gfp mask
> and are performed as GFP_KERNEL-like allocations.
> 
> For a few years now we have had scope APIs
> (memalloc_no{fs,io}_{save,restore}) to enforce NOFS and NOIO constraints
> implicitly on all allocations within the scope. There was a hope that
> those scopes would be defined at a higher level, where the reclaim
> recursion boundary starts/stops (e.g. when a lock required during memory
> reclaim is taken etc.). It seems that not all NOFS/NOIO users have
> adopted this approach and instead they have taken to wrapping a single
> [k]vmalloc allocation in a scope API as a workaround.
> 
> These workarounds do not serve the purpose of better reclaim recursion
> documentation and reduction of explicit GFP_NO{FS,IO} usage, so let's
> just provide the semantic they are asking for without the need for
> workarounds.
> 
> Add support for GFP_NOFS and GFP_NOIO to vmalloc directly. All internal
> allocations already comply with the given gfp_mask. The only current
> exception is vmap_pages_range, which needs to allocate kernel page
> tables for the mapping. Infer the proper scope API based on the given
> gfp mask.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmalloc.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d2a00ad4e1dd..17ca7001de1f 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2926,6 +2926,8 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	unsigned long array_size;
>  	unsigned int nr_small_pages = size >> PAGE_SHIFT;
>  	unsigned int page_order;
> +	unsigned int flags;
> +	int ret;
>  
>  	array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
>  	gfp_mask |= __GFP_NOWARN;
> @@ -2967,8 +2969,24 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  		goto fail;
>  	}
>  
> -	if (vmap_pages_range(addr, addr + size, prot, area->pages,
> -			page_shift) < 0) {
> +	/*
> +	 * page tables allocations ignore external gfp mask, enforce it
> +	 * by the scope API
> +	 */
> +	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> +		flags = memalloc_nofs_save();
> +	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> +		flags = memalloc_noio_save();
> +
> +	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> +			page_shift);
> +
> +	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> +		memalloc_nofs_restore(flags);
> +	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> +		memalloc_noio_restore(flags);
> +
> +	if (ret < 0) {
>  		warn_alloc(orig_gfp_mask, NULL,
>  			"vmalloc error: size %lu, failed to map pages",
>  			area->nr_pages * PAGE_SIZE);
> -- 
> 2.30.2
> 
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-23 19:01   ` Uladzislau Rezki
@ 2021-11-23 20:09     ` Michal Hocko
  2021-11-24 20:46       ` Uladzislau Rezki
  2021-11-24  1:02     ` Andrew Morton
  1 sibling, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2021-11-23 20:09 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, Dave Chinner, Neil Brown, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

On Tue 23-11-21 20:01:50, Uladzislau Rezki wrote:
[...]
> I have raised two concerns in our previous discussion about this change,
> well that is sad...

I definitely didn't mean to ignore any of the feedback. IIRC we were in
disagreement about the failure mode for the retry loop - i.e. whether to
free all the allocated pages in case the page table pages cannot be
allocated. I still maintain my position and until there is a wider
consensus on that I will keep my current implementation. The other
concern was about the failure warning but that shouldn't be a problem
when basing on the current Linus tree. Am I missing something?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-23 19:01   ` Uladzislau Rezki
  2021-11-23 20:09     ` Michal Hocko
@ 2021-11-24  1:02     ` Andrew Morton
  2021-11-24  3:16       ` NeilBrown
                         ` (2 more replies)
  1 sibling, 3 replies; 44+ messages in thread
From: Andrew Morton @ 2021-11-24  1:02 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Michal Hocko, Dave Chinner, Neil Brown, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote:

> On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Dave Chinner has mentioned that some of the xfs code would benefit from
> > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > cannot fail and they do not fit into a single page.

Perhaps we should tell xfs "no, do it internally".  Because this is a
rather nasty-looking thing - do we want to encourage other callsites to
start using it?

> > A large part of the vmalloc implementation already complies with the
> > given gfp flags, so no additional work is needed there. The vm area and
> > page table allocations are the exception. Implement a retry loop for
> > those.
> > 
> > Add a short sleep before retrying. 1 jiffy is a completely arbitrary
> > timeout. Ideally the retry would wait for an explicit event - e.g. a
> > change to the vmalloc space if the failure was caused by space
> > fragmentation or depletion. But there are multiple different reasons to
> > retry and this could become much more complex. Keep the retry simple for
> > now and just sleep to prevent hogging CPUs.
> > 

Yes, the horse has already bolted.  But we didn't want that horse anyway ;)

I added GFP_NOFAIL back in the mesozoic era because quite a lot of
sites were doing open-coded try-forever loops.  I thought "hey, they
shouldn't be doing that in the first place, but let's at least
centralize the concept to reduce code size, code duplication and so
it's something we can now grep for".  But longer term, all GFP_NOFAIL
sites should be reworked to no longer need to do the retry-forever
thing.  In retrospect, this bright idea of mine seems to have added
license for more sites to use retry-forever.  Sigh.

> > +		if (nofail) {
> > +			schedule_timeout_uninterruptible(1);
> > +			goto again;
> > +		}

The idea behind congestion_wait() is to prevent us from having to
hard-wire delays like this.  congestion_wait(1) would sleep for up to
one millisecond, but will return earlier if reclaim events happened
which make it likely that the caller can now proceed with the
allocation attempt successfully.

However it turns out that congestion_wait() was quietly broken at the
block level some time ago.  We could perhaps resurrect the concept at
another level - say by releasing congestion_wait() callers if an amount
of memory newly becomes allocatable.  This obviously asks for inclusion
of zone/node/etc info from the congestion_wait() caller.  But that's
just an optimization - if the newly-available memory isn't useful to
the congestion_wait() caller, they just fail the allocation attempts
and wait again.
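
For reference, the pattern looked something like this at a typical call
site (illustrative snippet; the BLK_RW_ASYNC/HZ/50 combination is what
such callers usually passed):

#include <linux/backing-dev.h>
#include <linux/gfp.h>

static struct page *alloc_page_persistent(void)
{
        struct page *page;

        while (!(page = alloc_page(GFP_NOFS))) {
                /* back off until writeback makes progress, up to HZ/50 */
                congestion_wait(BLK_RW_ASYNC, HZ / 50);
        }
        return page;
}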

> well that is sad...
> I have raised two concerns in our previous discussion about this change,

Can you please reiterate those concerns here?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-24  1:02     ` Andrew Morton
@ 2021-11-24  3:16       ` NeilBrown
  2021-11-24  3:48         ` Andrew Morton
  2021-11-24 23:45         ` Dave Chinner
  2021-11-24  8:43       ` Michal Hocko
  2021-11-24 20:11       ` Uladzislau Rezki
  2 siblings, 2 replies; 44+ messages in thread
From: NeilBrown @ 2021-11-24  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Uladzislau Rezki, Michal Hocko, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

On Wed, 24 Nov 2021, Andrew Morton wrote:
> 
> I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> sites were doing open-coded try-forever loops.  I thought "hey, they
> shouldn't be doing that in the first place, but let's at least
> centralize the concept to reduce code size, code duplication and so
> it's something we can now grep for".  But longer term, all GFP_NOFAIL
> sites should be reworked to no longer need to do the retry-forever
> thing.  In retrospect, this bright idea of mine seems to have added
> license for more sites to use retry-forever.  Sigh.

One of the costs of not having GFP_NOFAIL (or similar) is lots of
untested failure-path code.

When does an allocation that is allowed to retry and reclaim ever fail
anyway? I think the answer is "only when it has been killed by the oom
killer".  That of course cannot happen to kernel threads, so maybe
kernel threads should never need GFP_NOFAIL??

I'm not sure the above is 100%, but I do think that is the sort of
semantic that we want.  We want to know what kmalloc failure *means*.
We also need well defined and documented strategies to handle it.
mempools are one such strategy, but not always suitable.
preallocating can also be useful but can be clumsy to implement.  Maybe
we should support a process preallocating a bunch of pages which can
only be used by the process - and are auto-freed when the process
returns to user-space.  That might allow the "error paths" to be simple
and early, and subsequent allocations that were GFP_USEPREALLOC would be
safe.

i.e. we need a plan for how to rework all those no-fail call-sites.
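
As a sketch of the preallocation idea above - everything here (the
store, the helpers, GFP_USEPREALLOC itself) is hypothetical, nothing
like it exists today:

<snip>
#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm_types.h>

/* Hypothetical per-task page store; would hang off task_struct, with
 * the list head initialized via INIT_LIST_HEAD at task setup. */
struct prealloc_store {
        struct list_head pages;		/* linked through page->lru */
        unsigned int nr;
};

/* Fill the store up front; this is the only call that may fail, so the
 * error path is early and simple. */
static int prealloc_store_fill(struct prealloc_store *ps, unsigned int nr)
{
        while (ps->nr < nr) {
                struct page *page = alloc_page(GFP_KERNEL);

                if (!page)
                        return -ENOMEM;
                list_add(&page->lru, &ps->pages);
                ps->nr++;
        }
        return 0;
}

/* What alloc_page(GFP_USEPREALLOC) would fall back to under pressure. */
static struct page *prealloc_store_take(struct prealloc_store *ps)
{
        struct page *page;

        if (list_empty(&ps->pages))
                return NULL;
        page = list_first_entry(&ps->pages, struct page, lru);
        list_del(&page->lru);
        ps->nr--;
        return page;
}

/* Auto-free whatever is left when the task returns to user-space. */
static void prealloc_store_drain(struct prealloc_store *ps)
{
        struct page *page, *tmp;

        list_for_each_entry_safe(page, tmp, &ps->pages, lru) {
                list_del(&page->lru);
                __free_page(page);
                ps->nr--;
        }
}
<snip>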

NeilBrown

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-24  3:16       ` NeilBrown
@ 2021-11-24  3:48         ` Andrew Morton
  2021-11-24  5:23           ` NeilBrown
  2021-11-24 23:45         ` Dave Chinner
  1 sibling, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2021-11-24  3:48 UTC (permalink / raw)
  To: NeilBrown
  Cc: Uladzislau Rezki, Michal Hocko, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

On Wed, 24 Nov 2021 14:16:56 +1100 "NeilBrown" <neilb@suse.de> wrote:

> On Wed, 24 Nov 2021, Andrew Morton wrote:
> > 
> > I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> > sites were doing open-coded try-forever loops.  I thought "hey, they
> > shouldn't be doing that in the first place, but let's at least
> > centralize the concept to reduce code size, code duplication and so
> > it's something we can now grep for".  But longer term, all GFP_NOFAIL
> > sites should be reworked to no longer need to do the retry-forever
> > thing.  In retrospect, this bright idea of mine seems to have added
> > license for more sites to use retry-forever.  Sigh.
> 
> One of the costs of not having GFP_NOFAIL (or similar) is lots of
> untested failure-path code.

Well that's bad of the relevant developers and testers!  It isn't that
hard to fake up allocation failures.  Either with the formal fault
injection framework or with ad-hackery.

> When does an allocation that is allowed to retry and reclaim ever fail
> anyway? I think the answer is "only when it has been killed by the oom
> killer".  That of course cannot happen to kernel threads, so maybe
> kernel threads should never need GFP_NOFAIL??

> I'm not sure the above is 100%, but I do think that is the sort of
> semantic that we want.  We want to know what kmalloc failure *means*.
> We also need well defined and documented strategies to handle it.
> mempools are one such strategy, but not always suitable.

Well, mempools aren't "handling" it.  They're just another layer to
make memory allocation attempts appear to be magical.  The preferred
answer is "just handle the damned error and return ENOMEM".

Obviously this gets very painful at times (arguably because of
high-level design shortcomings).  The old radix_tree_preload approach
was bulletproof, but was quite a lot of fuss.

> preallocating can also be useful but can be clumsy to implement.  Maybe
> we should support a process preallocating a bunch of pages which can
> only be used by the process - and are auto-freed when the process
> returns to user-space.  That might allow the "error paths" to be simple
> and early, and subsequent allocations that were GFP_USEPREALLOC would be
> safe.

Yes, I think something like that would be quite feasible.  Need to
prevent interrupt code from stealing the task's page store.

It can be a drag calculating (by hand) what the max possible amount of
allocation will be and one can end up allocating and then not using a
lot of memory.

I forget why radix_tree_preload used a cpu-local store rather than a
per-task one.

Plus "what order pages would you like" and "on which node" and "in
which zone", etc...

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-24  3:48         ` Andrew Morton
@ 2021-11-24  5:23           ` NeilBrown
  2021-11-25  0:32             ` Theodore Y. Ts'o
  2021-11-26 14:50             ` Vlastimil Babka
  0 siblings, 2 replies; 44+ messages in thread
From: NeilBrown @ 2021-11-24  5:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Uladzislau Rezki, Michal Hocko, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

On Wed, 24 Nov 2021, Andrew Morton wrote:
> On Wed, 24 Nov 2021 14:16:56 +1100 "NeilBrown" <neilb@suse.de> wrote:
> 
> > On Wed, 24 Nov 2021, Andrew Morton wrote:
> > > 
> > > I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> > > sites were doing open-coded try-forever loops.  I thought "hey, they
> > > shouldn't be doing that in the first place, but let's at least
> > > centralize the concept to reduce code size, code duplication and so
> > > it's something we can now grep for".  But longer term, all GFP_NOFAIL
> > > sites should be reworked to no longer need to do the retry-forever
> > > thing.  In retrospect, this bright idea of mine seems to have added
> > > license for more sites to use retry-forever.  Sigh.
> > 
> > One of the costs of not having GFP_NOFAIL (or similar) is lots of
> > untested failure-path code.
> 
> Well that's bad of the relevant developers and testers!  It isn't that
> hard to fake up allocation failures.  Either with the formal fault
> injection framework or with ad-hackery.

If we have CONFIG_RANDOM_ALLOC_PAGE_FAILURE then I might agree.
lockdep is AWESOME.  It makes testing for deadlock problems *so* easy.  
That is the level of ease-of-use we need if you want people to handle
page-alloc failures reliably.

> 
> > When does an allocation that is allowed to retry and reclaim ever fail
> > anyway? I think the answer is "only when it has been killed by the oom
> > killer".  That of course cannot happen to kernel threads, so maybe
> > kernel threads should never need GFP_NOFAIL??
> 
> > I'm not sure the above is 100%, but I do think that is the sort of
> > semantic that we want.  We want to know what kmalloc failure *means*.
> > We also need well defined and documented strategies to handle it.
> > mempools are one such strategy, but not always suitable.
> 
> Well, mempools aren't "handling" it.  They're just another layer to
> make memory allocation attempts appear to be magical.  The preferred
> answer is "just handle the damned error and return ENOMEM".

No.  mempools are EXACTLY handling it - in a specific context.  When
writing out dirty pages so as to free up memory, you often need to
allocate memory.  And you may need a sequence of allocations, typically
at different levels in the stack.  Global watermarks cannot help
reliably, as the reserve might all be used up by top-level requests and
the lower levels can starve indefinitely.  mempools ensure that when memory
is freed, it is returned to the same level it was allocated for.  That
ensures you can get at least one request at a time all the way out to
storage.

If swap-out just returned ENOMEM, you'd really be in trouble.
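
A minimal sketch of that pattern (a generic example, not tied to any
particular driver): the pool is sized at init time, and under pressure
mempool_alloc() waits for an element to come back from this level
instead of failing:

<snip>
#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>

struct wb_request {		/* per-I/O bookkeeping for writeback */
        int example_field;
};

static struct kmem_cache *wb_req_cache;
static mempool_t *wb_req_pool;

static int __init wb_pool_init(void)
{
        wb_req_cache = KMEM_CACHE(wb_request, 0);
        if (!wb_req_cache)
                return -ENOMEM;

        /* Guarantee at least 16 in-flight requests even under OOM. */
        wb_req_pool = mempool_create_slab_pool(16, wb_req_cache);
        if (!wb_req_pool) {
                kmem_cache_destroy(wb_req_cache);
                return -ENOMEM;
        }
        return 0;
}

static struct wb_request *wb_req_get(void)
{
        /* With a sleeping gfp mask this does not fail: if the slab
         * allocation fails it waits for another request to be freed
         * back into this pool, i.e. for forward progress at this level. */
        return mempool_alloc(wb_req_pool, GFP_NOIO);
}

static void wb_req_put(struct wb_request *req)
{
        mempool_free(req, wb_req_pool);	/* wakes a starved wb_req_get() */
}
<snip>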

> 
> Obviously this gets very painful at times (arguably because of
> high-level design shortcomings).  The old radix_tree_preload approach
> was bulletproof, but was quite a lot of fuss.

It would get particularly painful if some system call started returning
-ENOMEM when it had never returned that before.  I note that ext4 uses
__GFP_NOFAIL when handling truncate.  I don't think user-space would be
happy with ENOMEM from truncate (or fallocate(PUNCH_HOLE)), though a
recent commit which adds it focuses more on wanting to avoid the need
for fsck.

> 
> > preallocating can also be useful but can be clumsy to implement.  Maybe
> > we should support a process preallocating a bunch of pages which can
> > only be used by the process - and are auto-freed when the process
> > returns to user-space.  That might allow the "error paths" to be simple
> > and early, and subsequent allocations that were GFP_USEPREALLOC would be
> > safe.
> 
> Yes, I think something like that would be quite feasible.  Need to
> prevent interrupt code from stealing the task's page store.
> 
> It can be a drag calculating (by hand) what the max possible amount of
> allocation will be and one can end up allocating and then not using a
> lot of memory.

CONFIG_DEBUG_PREALLOC could force every GFP_USE_PREALLOC request to use
a different page, and warn if there weren't enough.  Not perfect, but
useful.

Concerning the "lot of memory" having prealloc as an easy option means
people can use it until it becomes too expensive, then find a better
solution.  There will likely always be a better solution, but it might
not often be worth the effort.

> 
> I forget why radix_tree_preload used a cpu-local store rather than a
> per-task one.
> 
> Plus "what order pages would you like" and "on which node" and "in
> which zone", etc...

"what order" - only order-0 I hope.  I'd hazard a guess that 90% of
current NOFAIL allocations only need one page (providing slub is used -
slab seems to insist on high-order pages sometimes).
"which node" - whichever.  Unless __GFP_HARDWALL is set, alloc_page()
will fall-back to "whichever" anyway, and NOFAIL with HARDWALL is
probably a poor choice.
"which zone" - NORMAL.  I cannot find any NOFAIL allocations that want
DMA.  fs/ntfs asks for __GFP_HIGHMEM with NOFAIL, but that doesn't
*require* highmem.

Of course, before designing this interface too precisely we should check
whether anyone can use it.  From a quick look through some of the 100-ish
users of __GFP_NOFAIL I'd guess that mempools would help - the
preallocation should happen at init-time, not request-time.  Maybe if we
made mempools even more lightweight .... though that risks allocating a
lot of memory that will never get used.

This brings me back to the idea that
    alloc_page(wait and reclaim allowed)
should only fail on OOM_KILL.  That way kernel threads are safe, and
user-threads are free to return ENOMEM knowing it won't get to
user-space.  If user-thread code really needs NOFAIL, it punts to a
workqueue and waits - aborting the wait if it is killed, while the work
item still runs eventually.
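
A sketch of that punt-to-workqueue idea (illustrative only; it presumes
kvmalloc honouring __GFP_NOFAIL, i.e. what this series adds, and the
refcount is just so the work item can still run and clean up after a
killed waiter):

<snip>
#include <linux/completion.h>
#include <linux/mm.h>
#include <linux/refcount.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct nofail_req {
        struct work_struct work;
        struct completion done;
        size_t size;
        void *result;
        refcount_t ref;		/* one for the waiter, one for the worker */
};

static void nofail_req_put(struct nofail_req *r)
{
        if (refcount_dec_and_test(&r->ref)) {
                kvfree(r->result);	/* NULL (a no-op) if the waiter took it */
                kfree(r);
        }
}

static void nofail_workfn(struct work_struct *work)
{
        struct nofail_req *r = container_of(work, struct nofail_req, work);

        /* Worker context: no user task to kill, so NOFAIL is tolerable. */
        r->result = kvmalloc(r->size, GFP_KERNEL | __GFP_NOFAIL);
        complete(&r->done);
        nofail_req_put(r);
}

/* Returns NULL only if the calling task got a fatal signal (or if the
 * tiny request struct itself could not be allocated - it is a sketch). */
static void *alloc_nofail_killable(size_t size)
{
        struct nofail_req *r = kzalloc(sizeof(*r), GFP_KERNEL);
        void *p = NULL;

        if (!r)
                return NULL;

        INIT_WORK(&r->work, nofail_workfn);
        init_completion(&r->done);
        r->size = size;
        refcount_set(&r->ref, 2);
        queue_work(system_unbound_wq, &r->work);

        if (wait_for_completion_killable(&r->done) == 0) {
                p = r->result;
                r->result = NULL;	/* claim it so nofail_req_put() won't free it */
        }
        nofail_req_put(r);
        return p;
}
<snip>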

NeilBrown

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-24  1:02     ` Andrew Morton
  2021-11-24  3:16       ` NeilBrown
@ 2021-11-24  8:43       ` Michal Hocko
  2021-11-24 20:37         ` Uladzislau Rezki
  2021-11-24 20:11       ` Uladzislau Rezki
  2 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2021-11-24  8:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Uladzislau Rezki, Dave Chinner, Neil Brown, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

On Tue 23-11-21 17:02:38, Andrew Morton wrote:
> On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote:
> 
> > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > cannot fail and they do not fit into a single page.
> 
> Perhaps we should tell xfs "no, do it internally".  Because this is a
> rather nasty-looking thing - do we want to encourage other callsites to
> start using it?

This is what xfs is likely going to do if we do not provide the
functionality. I just do not see why that would be a better outcome
though. My long-term experience tells me that whenever we ignore
requirements from other subsystems, those requirements materialize in
some form in the end, in many cases done either suboptimally or outright
wrong. That might not be the case for xfs, as the quality of
implementation there is high, but it is not the case in general.

Even if people start using vmalloc(GFP_NOFAIL) out of laziness or for
any other stupid reason, then what? Is that something we should worry
about? Retrying within the allocator doesn't make things worse. In
fact it is easier to find such abusers by grepping for the flag than it
would be with custom retry loops.
 
[...]
> > > +		if (nofail) {
> > > +			schedule_timeout_uninterruptible(1);
> > > +			goto again;
> > > +		}
> 
> The idea behind congestion_wait() is to prevent us from having to
> hard-wire delays like this.  congestion_wait(1) would sleep for up to
> one millisecond, but will return earlier if reclaim events happened
> which make it likely that the caller can now proceed with the
> allocation event, successfully.
> 
> However it turns out that congestion_wait() was quietly broken at the
> block level some time ago.  We could perhaps resurrect the concept at
> another level - say by releasing congestion_wait() callers if an amount
> of memory newly becomes allocatable.  This obviously asks for inclusion
> of zone/node/etc info from the congestion_wait() caller.  But that's
> just an optimization - if the newly-available memory isn't useful to
> the congestion_wait() caller, they just fail the allocation attempts
> and wait again.

vmalloc has two potential failure modes. Depleted memory and vmalloc
space. So there are two different events to wait for. I do agree that
schedule_timeout_uninterruptible is both ugly and very simple but do we
really need a much more sophisticated solution at this stage?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-24  1:02     ` Andrew Morton
  2021-11-24  3:16       ` NeilBrown
  2021-11-24  8:43       ` Michal Hocko
@ 2021-11-24 20:11       ` Uladzislau Rezki
  2021-11-25  8:46         ` Michal Hocko
  2 siblings, 1 reply; 44+ messages in thread
From: Uladzislau Rezki @ 2021-11-24 20:11 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko
  Cc: Uladzislau Rezki, Michal Hocko, Dave Chinner, Neil Brown,
	Christoph Hellwig, linux-fsdevel, linux-mm, LKML, Ilya Dryomov,
	Jeff Layton, Michal Hocko

On Tue, Nov 23, 2021 at 05:02:38PM -0800, Andrew Morton wrote:
> On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote:
> 
> > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > cannot fail and they do not fit into a single page.
> 
> Perhaps we should tell xfs "no, do it internally".  Because this is a
> rather nasty-looking thing - do we want to encourage other callsites to
> start using it?
> 
> > > The large part of the vmalloc implementation already complies with the
> > > given gfp flags so there is no work for those to be done. The area
> > > and page table allocations are an exception to that. Implement a retry
> > > loop for those.
> > > 
> > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > a change to the vmalloc space change if the failure was caused by
> > > the space fragmentation or depletion. But there are multiple different
> > > reasons to retry and this could become much more complex. Keep the retry
> > > simple for now and just sleep to prevent from hogging CPUs.
> > > 
> 
> Yes, the horse has already bolted.  But we didn't want that horse anyway ;)
> 
> I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> sites were doing open-coded try-forever loops.  I thought "hey, they
> shouldn't be doing that in the first place, but let's at least
> centralize the concept to reduce code size, code duplication and so
> it's something we can now grep for".  But longer term, all GFP_NOFAIL
> sites should be reworked to no longer need to do the retry-forever
> thing.  In retrospect, this bright idea of mine seems to have added
> license for more sites to use retry-forever.  Sigh.
> 
> > > +		if (nofail) {
> > > +			schedule_timeout_uninterruptible(1);
> > > +			goto again;
> > > +		}
> 
> The idea behind congestion_wait() is to prevent us from having to
> hard-wire delays like this.  congestion_wait(1) would sleep for up to
> one millisecond, but will return earlier if reclaim events happened
> which make it likely that the caller can now proceed with the
> allocation event, successfully.
> 
> However it turns out that congestion_wait() was quietly broken at the
> block level some time ago.  We could perhaps resurrect the concept at
> another level - say by releasing congestion_wait() callers if an amount
> of memory newly becomes allocatable.  This obviously asks for inclusion
> of zone/node/etc info from the congestion_wait() caller.  But that's
> just an optimization - if the newly-available memory isn't useful to
> the congestion_wait() caller, they just fail the allocation attempts
> and wait again.
> 
> > well that is sad...
> > I have raised two concerns in our previous discussion about this change,
> 
> Can you please reiterate those concerns here?
>
1. I proposed to retry (if the allocation fails) in one single place,
i.e. to get rid of duplicating and spreading the logic across several
places. This is about simplification.

2. The second one is about unwinding and releasing everything we have
accumulated so far in terms of memory consumption. If the failure
occurs, the condition we are in is low memory or high memory pressure.
In that case, since we are about to sleep for some milliseconds before
retrying, IMHO it makes sense to release the memory:

- to prevent killing apps or a possible OOM;
- we can end up looping for quite a long time, or even forever, if users
  do nasty things with the vmalloc API and the __GFP_NOFAIL flag.

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-24  8:43       ` Michal Hocko
@ 2021-11-24 20:37         ` Uladzislau Rezki
  2021-11-25  8:48           ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Uladzislau Rezki @ 2021-11-24 20:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Uladzislau Rezki, Dave Chinner, Neil Brown,
	Christoph Hellwig, linux-fsdevel, linux-mm, LKML, Ilya Dryomov,
	Jeff Layton

On Wed, Nov 24, 2021 at 09:43:12AM +0100, Michal Hocko wrote:
> On Tue 23-11-21 17:02:38, Andrew Morton wrote:
> > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote:
> > 
> > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > 
> > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > cannot fail and they do not fit into a single page.
> > 
> > Perhaps we should tell xfs "no, do it internally".  Because this is a
> > rather nasty-looking thing - do we want to encourage other callsites to
> > start using it?
> 
> This is what xfs is likely going to do if we do not provide the
> functionality. I just do not see why that would be a better outcome
> though. My longterm experience tells me that whenever we ignore
> requirements by other subsystems then those requirements materialize in
> some form in the end. In many cases done either suboptimaly or outright
> wrong. This might be not the case for xfs as the quality of
> implementation is high there but this is not the case in general.
> 
> Even if people start using vmalloc(GFP_NOFAIL) out of lazyness or for
> any other stupid reason then what? Is that something we should worry
> about? Retrying within the allocator doesn't make the things worse. In
> fact it is just easier to find such abusers by grep which would be more
> elaborate with custom retry loops.
>  
> [...]
> > > > +		if (nofail) {
> > > > +			schedule_timeout_uninterruptible(1);
> > > > +			goto again;
> > > > +		}
> > 
> > The idea behind congestion_wait() is to prevent us from having to
> > hard-wire delays like this.  congestion_wait(1) would sleep for up to
> > one millisecond, but will return earlier if reclaim events happened
> > which make it likely that the caller can now proceed with the
> > allocation event, successfully.
> > 
> > However it turns out that congestion_wait() was quietly broken at the
> > block level some time ago.  We could perhaps resurrect the concept at
> > another level - say by releasing congestion_wait() callers if an amount
> > of memory newly becomes allocatable.  This obviously asks for inclusion
> > of zone/node/etc info from the congestion_wait() caller.  But that's
> > just an optimization - if the newly-available memory isn't useful to
> > the congestion_wait() caller, they just fail the allocation attempts
> > and wait again.
> 
> vmalloc has two potential failure modes. Depleted memory and vmalloc
> space. So there are two different events to wait for. I do agree that
> schedule_timeout_uninterruptible is both ugly and very simple but do we
> really need a much more sophisticated solution at this stage?
>
I would say there is at least one more. It is the case when users set
their own range (start:end) in which to allocate. In that scenario we
might never return to the user, because there might not be any free
vmap space in the specified range.

To address this, we can allow __GFP_NOFAIL only for the entire vmalloc
address space, i.e. within VMALLOC_START:VMALLOC_END. By doing so we
guarantee that we will not run out of vmap space, at least on 64-bit
systems; on smaller 32-bit ones we cannot guarantee it, but vmap space
is returned when the lazy-free logic kicks in.

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-23 20:09     ` Michal Hocko
@ 2021-11-24 20:46       ` Uladzislau Rezki
  0 siblings, 0 replies; 44+ messages in thread
From: Uladzislau Rezki @ 2021-11-24 20:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Uladzislau Rezki, Andrew Morton, Dave Chinner, Neil Brown,
	Christoph Hellwig, linux-fsdevel, linux-mm, LKML, Ilya Dryomov,
	Jeff Layton

On Tue, Nov 23, 2021 at 09:09:08PM +0100, Michal Hocko wrote:
> On Tue 23-11-21 20:01:50, Uladzislau Rezki wrote:
> [...]
> > I have raised two concerns in our previous discussion about this change,
> > well that is sad...
> 
> I definitely didn't mean to ignore any of the feedback. IIRC we were in
> a disagreement in the failure mode for retry loop - i.e. free all the
> allocated pages in case page table pages cannot be allocated. I still
> maintain my position and until there is a wider consensus on that I will
> keep my current implementation. The other concern was about failures
> warning but that shouldn't be a problem when basing on the current Linus
> tree. Am I missing something?
>
Sorry, I was answering in reverse order, from the latest messages
toward the first :)

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 0/4] extend vmalloc support for constrained allocations
  2021-11-22 15:32 [PATCH v2 0/4] extend vmalloc support for constrained allocations Michal Hocko
                   ` (3 preceding siblings ...)
  2021-11-22 15:32 ` [PATCH v2 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc Michal Hocko
@ 2021-11-24 22:55 ` Dave Chinner
  2021-11-25  8:58   ` Michal Hocko
  4 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2021-11-24 22:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

On Mon, Nov 22, 2021 at 04:32:29PM +0100, Michal Hocko wrote:
> Hi,
> The previous version has been posted here [1] 
> 
> I hope I have addressed all the feedback. There were some suggestions
> for further improvements but I would rather make this smaller as I
> cannot really invest more time and I believe further changes can be done
> on top.
> 
> This version is a rebase on top of the current Linus tree. Except for
> the review feedback and conflicting changes in the area there is only
> one change to filter out __GFP_NOFAIL from the bulk allocator. This is
> not necessary strictly speaking AFAICS but I found it less confusing
> because vmalloc has its fallback strategy and the bulk allocator is
> meant only for the fast path.
> 
> Original cover:
> Based on a recent discussion with Dave and Neil [2] I have tried to
> implement NOFS, NOIO, NOFAIL support for the vmalloc to make
> life of kvmalloc users easier.
> 
> A requirement for NOFAIL support for kvmalloc was new to me but this
> seems to be really needed by the xfs code.
> 
> NOFS/NOIO was a known and a long term problem which was hoped to be
> handled by the scope API. Those scope should have been used at the
> reclaim recursion boundaries both to document them and also to remove
> the necessity of NOFS/NOIO constrains for all allocations within that
> scope. Instead workarounds were developed to wrap a single allocation
> instead (like ceph_kvmalloc).
> 
> First patch implements NOFS/NOIO support for vmalloc. The second one
> adds NOFAIL support and the third one bundles all together into kvmalloc
> and drops ceph_kvmalloc which can use kvmalloc directly now.
> 
> I hope I haven't missed anything in the vmalloc allocator.

Correct __GFP_NOLOCKDEP support is also needed. See:

https://lore.kernel.org/linux-mm/20211119225435.GZ449541@dread.disaster.area/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-24  3:16       ` NeilBrown
  2021-11-24  3:48         ` Andrew Morton
@ 2021-11-24 23:45         ` Dave Chinner
  1 sibling, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2021-11-24 23:45 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Uladzislau Rezki, Michal Hocko, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

On Wed, Nov 24, 2021 at 02:16:56PM +1100, NeilBrown wrote:
> On Wed, 24 Nov 2021, Andrew Morton wrote:
> > 
> > I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> > sites were doing open-coded try-forever loops.  I thought "hey, they
> > shouldn't be doing that in the first place, but let's at least
> > centralize the concept to reduce code size, code duplication and so
> > it's something we can now grep for".  But longer term, all GFP_NOFAIL
> > sites should be reworked to no longer need to do the retry-forever
> > thing.  In retrospect, this bright idea of mine seems to have added
> > license for more sites to use retry-forever.  Sigh.
> 
> One of the costs of not having GFP_NOFAIL (or similar) is lots of
> untested failure-path code.
> 
> When does an allocation that is allowed to retry and reclaim ever fail
> anyway? I think the answer is "only when it has been killed by the oom
> killer".  That of course cannot happen to kernel threads, so maybe
> kernel threads should never need GFP_NOFAIL??
> 
> I'm not sure the above is 100%, but I do think that is the sort of
> semantic that we want.  We want to know what kmalloc failure *means*.
> We also need well defined and documented strategies to handle it.
> mempools are one such strategy, but not always suitable.

mempools are not suitable for anything that uses demand paging to
hold an unbounded data set in memory before it can free anything.
This is basically the definition of memory demand in an XFS
transaction, and most transaction based filesystems have similar
behaviour.

> preallocating can also be useful but can be clumsy to implement.  Maybe
> we should support a process preallocating a bunch of pages which can
> only be used by the process - and are auto-freed when the process
> returns to user-space.  That might allow the "error paths" to be simple
> and early, and subsequent allocations that were GFP_USEPREALLOC would be
> safe.

We talked about this years ago at LSFMM (2013 or 2014, IIRC). The
problem is "how much do you preallocate when the worst case
requirement to guarantee forwards progress is at least tens of
megabytes". Considering that there might be thousands of these
contexts running concurrent at any given time and we might be
running through several million preallocation contexts a second,
suddenly preallocation is a great big CPU and memory pit.

Hence preallocation simply doesn't work when the scope to guarantee
forwards progress is in the order of megabytes (even tens of
megabytes) per "no fail" context scope.  In situations like this we
need -memory reservations-, not preallocation.

Have the mm guarantee that there is a certain amount of memory
available (even if it requires reclaim to make available) before we
start the operation that cannot tolerate memory allocation failure,
track the memory usage as it goes, warn if it overruns, and release
the unused part of the reservation when context completes.

[ This is what we already do in XFS for guaranteeing forwards
progress for writing modifications into the strictly bound journal
space. We reserve space up front and use tracking tickets to account
for space used across each transaction context. This avoids
overcommit and all the deadlock and corruption problems that come
from running out of physical log space to write all the changes
we've made into the log. We could simply add memory reservations and
tracking structures to the transaction context and we've pretty much
got everything we need on the XFS side covered... ]
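
To illustrate the shape of such an API - entirely hypothetical, nothing
like memalloc_reserve() exists in the mm today, it just mirrors the log
ticket model - a toy sketch:

<snip>
/* Hypothetical memory reservation sketch; a real version would have to
 * integrate with reclaim and per-node watermarks instead of a single
 * global counter. */
#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/kernel.h>

static atomic_long_t memalloc_reserved_pages;
#define MEMALLOC_RESV_LIMIT	(1UL << 18)	/* toy limit, in pages */

struct memalloc_resv {
        unsigned long granted;	/* worst-case estimate, reserved up front */
        unsigned long used;	/* actually consumed */
};

/* The one place that may fail (or, in a real version, block in reclaim),
 * called before entering the context that cannot tolerate failure. */
static int memalloc_reserve(struct memalloc_resv *resv, unsigned long pages)
{
        if (atomic_long_add_return(pages, &memalloc_reserved_pages) >
            MEMALLOC_RESV_LIMIT) {
                atomic_long_sub(pages, &memalloc_reserved_pages);
                return -ENOMEM;
        }
        resv->granted = pages;
        resv->used = 0;
        return 0;
}

/* Charged for every allocation made under the reservation; an overrun
 * means the caller's worst-case estimate was wrong. */
static void memalloc_resv_charge(struct memalloc_resv *resv,
                                 unsigned long pages)
{
        resv->used += pages;
        WARN_ON_ONCE(resv->used > resv->granted);
}

/* When the context completes: give the whole grant back. */
static void memalloc_unreserve(struct memalloc_resv *resv)
{
        atomic_long_sub(resv->granted, &memalloc_reserved_pages);
        resv->granted = resv->used = 0;
}
<snip>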

> i.e. we need a plan for how to rework all those no-fail call-sites.

Even if we do make all the filesystems handle ENOMEM errors
gracefully and pass them back up to userspace, how are applications
going to react to random ENOMEM errors when doing data writes or
file creation or any other operation that accesses filesystems?

Given the way applications handle transient errors (i.e. they
don't!) propagating ENOMEM back up to userspace will result in
applications randomly failing under memory pressure.  That's a much
worse situation than having to manage the _extremely rare_ issues
that occur because of __GFP_NOFAIL usage in the kernel.

Let's keep that in mind here - __GFP_NOFAIL usage is not causing
system failures in the real world, whilst propagating ENOMEM back
out to userspace is potentially very harmful to users....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-24  5:23           ` NeilBrown
@ 2021-11-25  0:32             ` Theodore Y. Ts'o
  2021-11-26 14:50             ` Vlastimil Babka
  1 sibling, 0 replies; 44+ messages in thread
From: Theodore Y. Ts'o @ 2021-11-25  0:32 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Uladzislau Rezki, Michal Hocko, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, linux-mm, LKML, Ilya Dryomov,
	Jeff Layton, Michal Hocko

On Wed, Nov 24, 2021 at 04:23:31PM +1100, NeilBrown wrote:
> 
> It would get particularly painful if some system call started returned
> -ENOMEM, which had never returned that before.  I note that ext4 uses
> __GFP_NOFAIL when handling truncate.  I don't think user-space would be
> happy with ENOMEM from truncate (or fallocate(PUNHC_HOLE)), though a
> recent commit which adds it focuses more on wanting to avoid the need
> for fsck.

If the inode is in use (via an open file descriptor) when it is
unlinked, we can't actually do the truncate until the inode is
evicted, and at that point, there is no user space to return to.  For
that reason, the evict_inode() method is not *allowed* to fail.  So
this is why we need to use GFP_NOFAIL or an open-coded retry loop.
The alternative would be to mark the file system corrupt, and then
either remount the file system, panic the system and reboot, or leave
the file system corrupted ("don't worry, be happy").  I considered
GFP_NOFAIL to be the lesser of the evils.  :-)
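
The constraint looks roughly like this (a generic sketch, not the
actual ext4 code): by the time ->evict_inode() runs there is nobody
left to hand an -ENOMEM to, so the allocation simply must not fail:

<snip>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Generic sketch of a filesystem ->evict_inode(), not ext4's code. */
static void example_evict_inode(struct inode *inode)
{
        struct example_trunc_ctx {
                u64 next_block;	/* bookkeeping needed to free blocks */
        } *ctx;

        truncate_inode_pages_final(&inode->i_data);

        if (inode->i_nlink == 0 && inode->i_size) {
                /* No caller left to see -ENOMEM, so this must not fail:
                 * __GFP_NOFAIL (or an open-coded retry loop) is the only
                 * alternative to declaring the fs corrupted. */
                ctx = kmalloc(sizeof(*ctx), GFP_NOFS | __GFP_NOFAIL);
                /* ... walk and free the inode's blocks using ctx ... */
                kfree(ctx);
        }

        clear_inode(inode);
}
<snip>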

If the VFS allowed evict_inode() to fail, all it could do is to put
the inode back on the list of inodes to be later evicted --- which is
to say, we would have to add a lot of complexity to effectively add a
gigantic retry loop.  

Granted, we wouldn't need to be holding any locks in between retries,
so perhaps it's better than adding a retry loop deep in the guts of
the ext4 truncate codepath.  But then we would need to worry about
userspace getting ENOMEM from system calls which, historically, users
have never seen fail.  I suppose we could also solve this
problem by adding retry logic in the top-level VFS truncate codepath,
so instead of returning ENOMEM, we just retry the truncate(2) system
call and hope that we have enough memory to succeed this time.

After all, what can userspace do if truncate() fails with ENOMEM?  It
can fail the userspace program, which in the case of a long-running
daemon such as mysqld is basically the userspace equivalent of "panic
and reboot", or it can retry the truncate(2) system call at the
userspace level.

Are we detecting a pattern here?  There will always be cases where the
choice is "panic" or "retry".

							- Ted

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-24 20:11       ` Uladzislau Rezki
@ 2021-11-25  8:46         ` Michal Hocko
  2021-11-25 18:02           ` Uladzislau Rezki
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2021-11-25  8:46 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, Dave Chinner, Neil Brown, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

On Wed 24-11-21 21:11:42, Uladzislau Rezki wrote:
> On Tue, Nov 23, 2021 at 05:02:38PM -0800, Andrew Morton wrote:
> > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote:
> > 
> > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > 
> > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > cannot fail and they do not fit into a single page.
> > 
> > Perhaps we should tell xfs "no, do it internally".  Because this is a
> > rather nasty-looking thing - do we want to encourage other callsites to
> > start using it?
> > 
> > > > The large part of the vmalloc implementation already complies with the
> > > > given gfp flags so there is no work for those to be done. The area
> > > > and page table allocations are an exception to that. Implement a retry
> > > > loop for those.
> > > > 
> > > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > > a change to the vmalloc space change if the failure was caused by
> > > > the space fragmentation or depletion. But there are multiple different
> > > > reasons to retry and this could become much more complex. Keep the retry
> > > > simple for now and just sleep to prevent from hogging CPUs.
> > > > 
> > 
> > Yes, the horse has already bolted.  But we didn't want that horse anyway ;)
> > 
> > I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> > sites were doing open-coded try-forever loops.  I thought "hey, they
> > shouldn't be doing that in the first place, but let's at least
> > centralize the concept to reduce code size, code duplication and so
> > it's something we can now grep for".  But longer term, all GFP_NOFAIL
> > sites should be reworked to no longer need to do the retry-forever
> > thing.  In retrospect, this bright idea of mine seems to have added
> > license for more sites to use retry-forever.  Sigh.
> > 
> > > > +		if (nofail) {
> > > > +			schedule_timeout_uninterruptible(1);
> > > > +			goto again;
> > > > +		}
> > 
> > The idea behind congestion_wait() is to prevent us from having to
> > hard-wire delays like this.  congestion_wait(1) would sleep for up to
> > one millisecond, but will return earlier if reclaim events happened
> > which make it likely that the caller can now proceed with the
> > allocation event, successfully.
> > 
> > However it turns out that congestion_wait() was quietly broken at the
> > block level some time ago.  We could perhaps resurrect the concept at
> > another level - say by releasing congestion_wait() callers if an amount
> > of memory newly becomes allocatable.  This obviously asks for inclusion
> > of zone/node/etc info from the congestion_wait() caller.  But that's
> > just an optimization - if the newly-available memory isn't useful to
> > the congestion_wait() caller, they just fail the allocation attempts
> > and wait again.
> > 
> > > well that is sad...
> > > I have raised two concerns in our previous discussion about this change,
> > 
> > Can you please reiterate those concerns here?
> >
> 1. I proposed to repeat(if fails) in one solid place, i.e. get rid of
> duplication and spreading the logic across several places. This is about
> simplification.

I am all for simplifications. But the presented simplification leads to 2) and ...

> 2. Second one is about to do an unwinding and release everything what we
> have just accumulated in terms of memory consumption. The failure might
> occur, if so a condition we are in is a low memory one or high memory
> pressure. In this case, since we are about to sleep some milliseconds
> in order to repeat later, IMHO it makes sense to release memory:
> 
> - to prevent killing apps or possible OOM;
> - we can end up looping quite a lot of time or even forever if users do
>   nasty things with vmalloc API and __GFP_NOFAIL flag.

... this is where we disagree and I have tried to explain why. The primary
memory to allocate is the pages backing the vmalloc area. Failing to
allocate a few page tables - which btw. practically do not fail as they
are order-0 - and then throwing away the whole, much more expensive,
work of allocating the former is really wasteful. You had a concern
about OOM killer invocation while retrying the page table allocation,
but you should realize that page table allocations can already invoke
the OOM killer, so that is absolutely nothing new.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-24 20:37         ` Uladzislau Rezki
@ 2021-11-25  8:48           ` Michal Hocko
  2021-11-25 18:40             ` Uladzislau Rezki
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2021-11-25  8:48 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, Dave Chinner, Neil Brown, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

On Wed 24-11-21 21:37:54, Uladzislau Rezki wrote:
> On Wed, Nov 24, 2021 at 09:43:12AM +0100, Michal Hocko wrote:
> > On Tue 23-11-21 17:02:38, Andrew Morton wrote:
> > > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote:
> > > 
> > > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > 
> > > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > > cannot fail and they do not fit into a single page.
> > > 
> > > Perhaps we should tell xfs "no, do it internally".  Because this is a
> > > rather nasty-looking thing - do we want to encourage other callsites to
> > > start using it?
> > 
> > This is what xfs is likely going to do if we do not provide the
> > functionality. I just do not see why that would be a better outcome
> > though. My longterm experience tells me that whenever we ignore
> > requirements by other subsystems then those requirements materialize in
> > some form in the end. In many cases done either suboptimaly or outright
> > wrong. This might be not the case for xfs as the quality of
> > implementation is high there but this is not the case in general.
> > 
> > Even if people start using vmalloc(GFP_NOFAIL) out of lazyness or for
> > any other stupid reason then what? Is that something we should worry
> > about? Retrying within the allocator doesn't make the things worse. In
> > fact it is just easier to find such abusers by grep which would be more
> > elaborate with custom retry loops.
> >  
> > [...]
> > > > > +		if (nofail) {
> > > > > +			schedule_timeout_uninterruptible(1);
> > > > > +			goto again;
> > > > > +		}
> > > 
> > > The idea behind congestion_wait() is to prevent us from having to
> > > hard-wire delays like this.  congestion_wait(1) would sleep for up to
> > > one millisecond, but will return earlier if reclaim events happened
> > > which make it likely that the caller can now proceed with the
> > > allocation event, successfully.
> > > 
> > > However it turns out that congestion_wait() was quietly broken at the
> > > block level some time ago.  We could perhaps resurrect the concept at
> > > another level - say by releasing congestion_wait() callers if an amount
> > > of memory newly becomes allocatable.  This obviously asks for inclusion
> > > of zone/node/etc info from the congestion_wait() caller.  But that's
> > > just an optimization - if the newly-available memory isn't useful to
> > > the congestion_wait() caller, they just fail the allocation attempts
> > > and wait again.
> > 
> > vmalloc has two potential failure modes. Depleted memory and vmalloc
> > space. So there are two different events to wait for. I do agree that
> > schedule_timeout_uninterruptible is both ugly and very simple but do we
> > really need a much more sophisticated solution at this stage?
> >
> I would say there is at least one more. It is about when users set their
> own range(start:end) where to allocate. In that scenario we might never
> return to a user, because there might not be any free vmap space on
> specified range.
> 
> To address this, we can allow __GFP_NOFAIL only for entire vmalloc
> address space, i.e. within VMALLOC_START:VMALLOC_END.

How should we do that?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 0/4] extend vmalloc support for constrained allocations
  2021-11-24 22:55 ` [PATCH v2 0/4] extend vmalloc support for constrained allocations Dave Chinner
@ 2021-11-25  8:58   ` Michal Hocko
  2021-11-25  9:30     ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2021-11-25  8:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

On Thu 25-11-21 09:55:26, Dave Chinner wrote:
[...]
> Correct __GFP_NOLOCKDEP support is also needed. See:
> 
> https://lore.kernel.org/linux-mm/20211119225435.GZ449541@dread.disaster.area/

I will have a closer look. This will require changes on both vmalloc and
sl?b sides.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 0/4] extend vmalloc support for constrained allocations
  2021-11-25  8:58   ` Michal Hocko
@ 2021-11-25  9:30     ` Michal Hocko
  2021-11-25 21:30       ` Dave Chinner
  2021-11-26  9:20       ` Vlastimil Babka
  0 siblings, 2 replies; 44+ messages in thread
From: Michal Hocko @ 2021-11-25  9:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Sebastian Andrzej Siewior, Vlastimil Babka

[Cc Sebastian and Vlastimil]

On Thu 25-11-21 09:58:31, Michal Hocko wrote:
> On Thu 25-11-21 09:55:26, Dave Chinner wrote:
> [...]
> > Correct __GFP_NOLOCKDEP support is also needed. See:
> > 
> > https://lore.kernel.org/linux-mm/20211119225435.GZ449541@dread.disaster.area/
> 
> I will have a closer look. This will require changes on both vmalloc and
> sl?b sides.

This should hopefully do the trick
--- 
From 0082d29c771d831e5d1b9bb4c0a61d39bac017f0 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 25 Nov 2021 10:20:16 +0100
Subject: [PATCH] mm: make slab and vmalloc allocators __GFP_NOLOCKDEP aware

sl?b and vmalloc allocators reduce the given gfp mask for their internal
needs. For that they use GFP_RECLAIM_MASK to preserve the reclaim
behavior and constraints.

__GFP_NOLOCKDEP is not a part of that mask because it doesn't really
control the reclaim behavior strictly speaking. On the other hand
it tells the underlying page allocator to disable reclaim recursion
detection so arguably it should be part of the mask.

Having __GFP_NOLOCKDEP in the mask will not alter the behavior in any
form so this change is safe pretty much by definition. It also adds
support for this flag to the SL?B and vmalloc allocators, which will in
turn allow its use for kvmalloc as well. The lack of support has been
noticed recently in http://lkml.kernel.org/r/20211119225435.GZ449541@dread.disaster.area

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3b79a5c9427a..2ceea20b5b2a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -21,7 +21,7 @@
 #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_NOFAIL|\
 			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\
-			__GFP_ATOMIC)
+			__GFP_ATOMIC|__GFP_NOLOCKDEP)
 
 /* The GFP flags allowed during early boot */
 #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
-- 
2.30.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-25  8:46         ` Michal Hocko
@ 2021-11-25 18:02           ` Uladzislau Rezki
  2021-11-25 19:24             ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Uladzislau Rezki @ 2021-11-25 18:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Chinner, Neil Brown, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

On Thu, Nov 25, 2021 at 9:46 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 24-11-21 21:11:42, Uladzislau Rezki wrote:
> > On Tue, Nov 23, 2021 at 05:02:38PM -0800, Andrew Morton wrote:
> > > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > >
> > > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > > cannot fail and they do not fit into a single page.
> > >
> > > Perhaps we should tell xfs "no, do it internally".  Because this is a
> > > rather nasty-looking thing - do we want to encourage other callsites to
> > > start using it?
> > >
> > > > > The large part of the vmalloc implementation already complies with the
> > > > > given gfp flags so there is no work for those to be done. The area
> > > > > and page table allocations are an exception to that. Implement a retry
> > > > > loop for those.
> > > > >
> > > > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > > > a change to the vmalloc space change if the failure was caused by
> > > > > the space fragmentation or depletion. But there are multiple different
> > > > > reasons to retry and this could become much more complex. Keep the retry
> > > > > simple for now and just sleep to prevent from hogging CPUs.
> > > > >
> > >
> > > Yes, the horse has already bolted.  But we didn't want that horse anyway ;)
> > >
> > > I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> > > sites were doing open-coded try-forever loops.  I thought "hey, they
> > > shouldn't be doing that in the first place, but let's at least
> > > centralize the concept to reduce code size, code duplication and so
> > > it's something we can now grep for".  But longer term, all GFP_NOFAIL
> > > sites should be reworked to no longer need to do the retry-forever
> > > thing.  In retrospect, this bright idea of mine seems to have added
> > > license for more sites to use retry-forever.  Sigh.
> > >
> > > > > +               if (nofail) {
> > > > > +                       schedule_timeout_uninterruptible(1);
> > > > > +                       goto again;
> > > > > +               }
> > >
> > > The idea behind congestion_wait() is to prevent us from having to
> > > hard-wire delays like this.  congestion_wait(1) would sleep for up to
> > > one millisecond, but will return earlier if reclaim events happened
> > > which make it likely that the caller can now proceed with the
> > > allocation event, successfully.
> > >
> > > However it turns out that congestion_wait() was quietly broken at the
> > > block level some time ago.  We could perhaps resurrect the concept at
> > > another level - say by releasing congestion_wait() callers if an amount
> > > of memory newly becomes allocatable.  This obviously asks for inclusion
> > > of zone/node/etc info from the congestion_wait() caller.  But that's
> > > just an optimization - if the newly-available memory isn't useful to
> > > the congestion_wait() caller, they just fail the allocation attempts
> > > and wait again.
> > >
> > > > well that is sad...
> > > > I have raised two concerns in our previous discussion about this change,
> > >
> > > Can you please reiterate those concerns here?
> > >
> > 1. I proposed to repeat(if fails) in one solid place, i.e. get rid of
> > duplication and spreading the logic across several places. This is about
> > simplification.
>
> I am all for simplifications. But the presented simplification lead to 2) and ...
>
> > 2. Second one is about to do an unwinding and release everything what we
> > have just accumulated in terms of memory consumption. The failure might
> > occur, if so a condition we are in is a low memory one or high memory
> > pressure. In this case, since we are about to sleep some milliseconds
> > in order to repeat later, IMHO it makes sense to release memory:
> >
> > - to prevent killing apps or possible OOM;
> > - we can end up looping quite a lot of time or even forever if users do
> >   nasty things with vmalloc API and __GFP_NOFAIL flag.
>
> ... this is where we disagree and I have tried to explain why. The primary
> memory to allocate are pages to back the vmalloc area. Failing to
> allocate few page tables - which btw. do not fail as they are order-0 -
> and result into the whole and much more expensive work to allocate the
> former is really wasteful. You've had a concern about OOM killer
> invocation while retrying the page table allocation but you should
> realize that page table allocations might already invoke OOM killer so that
> is absolutely nothing new.
>
We are in a slow path and this is a corner case; it means we will sleep
for many milliseconds, for example with CONFIG_HZ_100 one jiffy is 10
milliseconds. I would agree with you if we were requesting memory and
retrying in a tight loop because of a time constraint or a
latency-sensitive workload.

Is any workload sensitive to this? If so, we definitely cannot go with
any delay there.

As for OOM, you are right. But it can also be that we are the source
that triggers it, directly or indirectly. Unwinding and cleaning up is
the most we can actually do here while staying fair to the OOM killer.

Therefore I root for simplification and for the OOM-related concerns :)
But maybe there will be other opinions.

Thanks!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-25  8:48           ` Michal Hocko
@ 2021-11-25 18:40             ` Uladzislau Rezki
  2021-11-25 19:21               ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Uladzislau Rezki @ 2021-11-25 18:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Chinner, Neil Brown, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

On Thu, Nov 25, 2021 at 9:48 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 24-11-21 21:37:54, Uladzislau Rezki wrote:
> > On Wed, Nov 24, 2021 at 09:43:12AM +0100, Michal Hocko wrote:
> > > On Tue 23-11-21 17:02:38, Andrew Morton wrote:
> > > > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
> > > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > >
> > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > > > cannot fail and they do not fit into a single page.
> > > >
> > > > Perhaps we should tell xfs "no, do it internally".  Because this is a
> > > > rather nasty-looking thing - do we want to encourage other callsites to
> > > > start using it?
> > >
> > > This is what xfs is likely going to do if we do not provide the
> > > functionality. I just do not see why that would be a better outcome
> > > though. My longterm experience tells me that whenever we ignore
> > > requirements by other subsystems then those requirements materialize in
> > > some form in the end. In many cases done either suboptimaly or outright
> > > wrong. This might be not the case for xfs as the quality of
> > > implementation is high there but this is not the case in general.
> > >
> > > Even if people start using vmalloc(GFP_NOFAIL) out of lazyness or for
> > > any other stupid reason then what? Is that something we should worry
> > > about? Retrying within the allocator doesn't make the things worse. In
> > > fact it is just easier to find such abusers by grep which would be more
> > > elaborate with custom retry loops.
> > >
> > > [...]
> > > > > > +             if (nofail) {
> > > > > > +                     schedule_timeout_uninterruptible(1);
> > > > > > +                     goto again;
> > > > > > +             }
> > > >
> > > > The idea behind congestion_wait() is to prevent us from having to
> > > > hard-wire delays like this.  congestion_wait(1) would sleep for up to
> > > > one millisecond, but will return earlier if reclaim events happened
> > > > which make it likely that the caller can now proceed with the
> > > > allocation event, successfully.
> > > >
> > > > However it turns out that congestion_wait() was quietly broken at the
> > > > block level some time ago.  We could perhaps resurrect the concept at
> > > > another level - say by releasing congestion_wait() callers if an amount
> > > > of memory newly becomes allocatable.  This obviously asks for inclusion
> > > > of zone/node/etc info from the congestion_wait() caller.  But that's
> > > > just an optimization - if the newly-available memory isn't useful to
> > > > the congestion_wait() caller, they just fail the allocation attempts
> > > > and wait again.
> > >
> > > vmalloc has two potential failure modes. Depleted memory and vmalloc
> > > space. So there are two different events to wait for. I do agree that
> > > schedule_timeout_uninterruptible is both ugly and very simple but do we
> > > really need a much more sophisticated solution at this stage?
> > >
> > I would say there is at least one more. It is about when users set their
> > own range(start:end) where to allocate. In that scenario we might never
> > return to a user, because there might not be any free vmap space on
> > specified range.
> >
> > To address this, we can allow __GFP_NOFAIL only for entire vmalloc
> > address space, i.e. within VMALLOC_START:VMALLOC_END.
>
> How should we do that?
>
<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d2a00ad4e1dd..664935bee2a2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3029,6 +3029,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
                return NULL;
        }

+       if (gfp_mask & __GFP_NOFAIL) {
+               if (start != VMALLOC_START || end != VMALLOC_END) {
+                       gfp_mask &= ~__GFP_NOFAIL;
+                       WARN_ONCE(1, "__GFP_NOFAIL is allowed only for entire vmalloc space.");
+               }
+       }
+
        if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
                unsigned long size_per_node;
<snip>

Or we could allow __GFP_NOFAIL usage only for the high-level API, i.e.
__vmalloc(), where a gfp mask can be passed. Because it uses the whole
vmalloc address space, we do not need to check the range and other
parameters like align, etc. This variant is preferable.

But the problem is that there are internal functions which are publicly
available to kernel users, like __vmalloc_node_range(). In that case we
can add a big comment saying: the __GFP_NOFAIL flag can be used __only__
with the high-level API, i.e. __vmalloc().

Any thoughts?

-- 
Uladzislau Rezki

^ permalink raw reply related	[flat|nested] 44+ messages in thread
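
For illustration, the kind of open-coded retry loop that callers would
otherwise have to carry, and which is harder to grep for than an explicit
__GFP_NOFAIL, might look roughly like this (a sketch only, not code from
the series; the function name is made up):

<snip>
#include <linux/vmalloc.h>
#include <linux/sched.h>

/* Illustrative only: a caller-side "must not fail" loop around vmalloc(). */
static void *example_vmalloc_nofail(unsigned long size)
{
	void *p;

	do {
		p = vmalloc(size);
		if (!p)
			schedule_timeout_uninterruptible(1);
	} while (!p);

	return p;
}
<snip>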

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-25 18:40             ` Uladzislau Rezki
@ 2021-11-25 19:21               ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2021-11-25 19:21 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, Dave Chinner, Neil Brown, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

On Thu 25-11-21 19:40:56, Uladzislau Rezki wrote:
> On Thu, Nov 25, 2021 at 9:48 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 24-11-21 21:37:54, Uladzislau Rezki wrote:
> > > On Wed, Nov 24, 2021 at 09:43:12AM +0100, Michal Hocko wrote:
> > > > On Tue 23-11-21 17:02:38, Andrew Morton wrote:
> > > > > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > >
> > > > > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
> > > > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > > >
> > > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > > > > cannot fail and they do not fit into a single page.
> > > > >
> > > > > Perhaps we should tell xfs "no, do it internally".  Because this is a
> > > > > rather nasty-looking thing - do we want to encourage other callsites to
> > > > > start using it?
> > > >
> > > > This is what xfs is likely going to do if we do not provide the
> > > > functionality. I just do not see why that would be a better outcome
> > > > though. My longterm experience tells me that whenever we ignore
> > > > requirements by other subsystems then those requirements materialize in
> > > > some form in the end. In many cases done either suboptimally or outright
> > > > wrong. This might not be the case for xfs as the quality of
> > > > implementation is high there but this is not the case in general.
> > > >
> > > > Even if people start using vmalloc(GFP_NOFAIL) out of laziness or for
> > > > any other stupid reason then what? Is that something we should worry
> > > > about? Retrying within the allocator doesn't make things worse. In
> > > > fact it is just easier to find such abusers by grep which would be more
> > > > elaborate with custom retry loops.
> > > >
> > > > [...]
> > > > > > > +             if (nofail) {
> > > > > > > +                     schedule_timeout_uninterruptible(1);
> > > > > > > +                     goto again;
> > > > > > > +             }
> > > > >
> > > > > The idea behind congestion_wait() is to prevent us from having to
> > > > > hard-wire delays like this.  congestion_wait(1) would sleep for up to
> > > > > one millisecond, but will return earlier if reclaim events happened
> > > > > which make it likely that the caller can now proceed with the
> > > > > allocation event, successfully.
> > > > >
> > > > > However it turns out that congestion_wait() was quietly broken at the
> > > > > block level some time ago.  We could perhaps resurrect the concept at
> > > > > another level - say by releasing congestion_wait() callers if an amount
> > > > > of memory newly becomes allocatable.  This obviously asks for inclusion
> > > > > of zone/node/etc info from the congestion_wait() caller.  But that's
> > > > > just an optimization - if the newly-available memory isn't useful to
> > > > > the congestion_wait() caller, they just fail the allocation attempts
> > > > > and wait again.
> > > >
> > > > vmalloc has two potential failure modes. Depleted memory and vmalloc
> > > > space. So there are two different events to wait for. I do agree that
> > > > schedule_timeout_uninterruptible is both ugly and very simple but do we
> > > > really need a much more sophisticated solution at this stage?
> > > >
> > > I would say there is at least one more. It is about when users set their
> > > own range (start:end) in which to allocate. In that scenario we might never
> > > return to the user, because there might not be any free vmap space in the
> > > specified range.
> > >
> > > To address this, we can allow __GFP_NOFAIL only for the entire vmalloc
> > > address space, i.e. within VMALLOC_START:VMALLOC_END.
> >
> > How should we do that?
> >
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d2a00ad4e1dd..664935bee2a2 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3029,6 +3029,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>                 return NULL;
>         }
> 
> +       if (gfp_mask & __GFP_NOFAIL) {
> +               if (start != VMALLOC_START || end != VMALLOC_END) {
> +                       gfp_mask &= ~__GFP_NOFAIL;
> +                       WARN_ONCE(1, "__GFP_NOFAIL is allowed only for entire vmalloc space.");
> +               }
> +       }

So the called function effectively ignores the flag, which could lead to
an actual failure, and that is something the caller has told us not to
do. I do not consider such an API great, to say the least.

> +
>         if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
>                 unsigned long size_per_node;
> <snip>
> 
> Or just allow __GFP_NOFAIL usage only for the high level API, i.e. the
> __vmalloc() one where a gfp mask can be passed. Because it uses the whole
> vmalloc address space, we do not need to check the range and other
> parameters like align, etc. This variant is preferable.
> 
> But the problem is that there are internal functions which are publicly
> available to kernel users, like __vmalloc_node_range(). In that case we
> can add a big comment saying: the __GFP_NOFAIL flag can be used __only__
> with the high level API, i.e. the __vmalloc() one.
> 
> Any thoughts?

I dunno. I find it rather ugly. We can surely document that some APIs
shouldn't be used with __GFP_NOFAIL because they could result in an
endless loop, but I find it rather subtle to change the contract under
the caller's feet and cause other problems.

I am rather curious about other opinions, but at this moment this is
trying to handle a non-existent problem IMHO. vmalloc, and for that
matter other allocators, are not trying to be defensive in their APIs
because we assume in-kernel users to be good citizens.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-25 18:02           ` Uladzislau Rezki
@ 2021-11-25 19:24             ` Michal Hocko
  2021-11-25 20:03               ` Uladzislau Rezki
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2021-11-25 19:24 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, Dave Chinner, Neil Brown, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

On Thu 25-11-21 19:02:09, Uladzislau Rezki wrote:
[...]
> Therefore I root for simplification and OOM-related concerns :) But
> maybe there will be other opinions.

I have to say that I disagree with your view. I am not sure we have
any other precedent where an allocator would throw away the primary
allocation just because of a metadata allocation failure.

In any case, if there is a wider consensus that we really want to do
that I can rework. I would prefer to keep the patch as is though.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-25 19:24             ` Michal Hocko
@ 2021-11-25 20:03               ` Uladzislau Rezki
  2021-11-25 20:13                 ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Uladzislau Rezki @ 2021-11-25 20:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Uladzislau Rezki, Andrew Morton, Dave Chinner, Neil Brown,
	Christoph Hellwig, linux-fsdevel, linux-mm, LKML, Ilya Dryomov,
	Jeff Layton

On Thu, Nov 25, 2021 at 08:24:04PM +0100, Michal Hocko wrote:
> On Thu 25-11-21 19:02:09, Uladzislau Rezki wrote:
> [...]
> > Therefore I root for simplification and OOM-related concerns :) But
> > maybe there will be other opinions.
> 
> I have to say that I disagree with your view. I am not sure we have
> any other precedent where an allocator would throw away the primary
> allocation just because of a metadata allocation failure.
> 
Well, I tried to do some code review and raised some concerns and
proposals. If you do not agree, then of course I will not be arguing
here.

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-25 20:03               ` Uladzislau Rezki
@ 2021-11-25 20:13                 ` Michal Hocko
  2021-11-25 20:21                   ` Uladzislau Rezki
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2021-11-25 20:13 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, Dave Chinner, Neil Brown, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

On Thu 25-11-21 21:03:23, Uladzislau Rezki wrote:
> On Thu, Nov 25, 2021 at 08:24:04PM +0100, Michal Hocko wrote:
> > On Thu 25-11-21 19:02:09, Uladzislau Rezki wrote:
> > [...]
> > > Therefore I root for simplification and OOM-related concerns :) But
> > > maybe there will be other opinions.
> > 
> > I have to say that I disagree with your view. I am not sure we have
> > any other precedent where an allocator would throw away the primary
> > allocation just because of a metadata allocation failure.
> > 
> Well, I tried to do some code review and raised some concerns and
> proposals.

I do appreciate your review! No question about that.

I was just surprised by your reaction suggesting that your review feedback
had been ignored, because I do not think that is the case.

We were in a disagreement and that is just fine. It is quite normal to
disagree. The question is whether that disagreement is fundamental and
poses a roadblock for merging. I definitely do not want and mean to push
anything by force. My previous understanding was that your concerns are
mostly about aesthetics rather than blocking further progress.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-25 20:13                 ` Michal Hocko
@ 2021-11-25 20:21                   ` Uladzislau Rezki
  0 siblings, 0 replies; 44+ messages in thread
From: Uladzislau Rezki @ 2021-11-25 20:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Uladzislau Rezki, Andrew Morton, Dave Chinner, Neil Brown,
	Christoph Hellwig, linux-fsdevel, linux-mm, LKML, Ilya Dryomov,
	Jeff Layton

> On Thu 25-11-21 21:03:23, Uladzislau Rezki wrote:
> > On Thu, Nov 25, 2021 at 08:24:04PM +0100, Michal Hocko wrote:
> > > On Thu 25-11-21 19:02:09, Uladzislau Rezki wrote:
> > > [...]
> > > > Therefore I root for simplification and OOM-related concerns :) But
> > > > maybe there will be other opinions.
> > > 
> > > I have to say that I disagree with your view. I am not sure we have
> > > any other precedent where an allocator would throw away the primary
> > > allocation just because of a metadata allocation failure.
> > > 
> > Well, I tried to do some code review and raised some concerns and
> > proposals.
> 
> I do appreciate your review! No question about that.
> 
> I was just surprised by your reaction suggesting that your review feedback
> had been ignored, because I do not think that is the case.
> 
> We were in a disagreement and that is just fine. It is quite normal to
> disagree. The question is whether that disagreement is fundamental and
> poses a roadblock for merging. I definitely do not want and mean to push
> anything by force. My previous understanding was that your concerns are
> mostly about aesthetics rather than blocking further progress.
>
It is not up to me to block you from further progress and of course that
was not my intention. You asked for a review, I did it, that is it :)

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 0/4] extend vmalloc support for constrained allocations
  2021-11-25  9:30     ` Michal Hocko
@ 2021-11-25 21:30       ` Dave Chinner
  2021-11-26  9:20       ` Vlastimil Babka
  1 sibling, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2021-11-25 21:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Sebastian Andrzej Siewior, Vlastimil Babka

On Thu, Nov 25, 2021 at 10:30:28AM +0100, Michal Hocko wrote:
> [Cc Sebastian and Vlastimil]
> 
> On Thu 25-11-21 09:58:31, Michal Hocko wrote:
> > On Thu 25-11-21 09:55:26, Dave Chinner wrote:
> > [...]
> > > Correct __GFP_NOLOCKDEP support is also needed. See:
> > > 
> > > https://lore.kernel.org/linux-mm/20211119225435.GZ449541@dread.disaster.area/
> > 
> > I will have a closer look. This will require changes on both vmalloc and
> > sl?b sides.
> 
> This should hopefully do the trick
> --- 
> From 0082d29c771d831e5d1b9bb4c0a61d39bac017f0 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 25 Nov 2021 10:20:16 +0100
> Subject: [PATCH] mm: make slab and vmalloc allocators __GFP_NOLOCKDEP aware
> 
> sl?b and vmalloc allocators reduce the given gfp mask for their internal
> needs. For that they use GFP_RECLAIM_MASK to preserve the reclaim
> behavior and constraints.
> 
> __GFP_NOLOCKDEP is not a part of that mask because it doesn't really
> control the reclaim behavior strictly speaking. On the other hand
> it tells the underlying page allocator to disable reclaim recursion
> detection so arguably it should be part of the mask.
> 
> Having __GFP_NOLOCKDEP in the mask will not alter the behavior in any
> form so this change is safe pretty much by definition. It also adds
> support for this flag to the SL?B and vmalloc allocators, which will in
> turn allow its use with kvmalloc as well. The lack of this support was
> noticed recently in http://lkml.kernel.org/r/20211119225435.GZ449541@dread.disaster.area
> 
> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/internal.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 3b79a5c9427a..2ceea20b5b2a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -21,7 +21,7 @@
>  #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
>  			__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_NOFAIL|\
>  			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\
> -			__GFP_ATOMIC)
> +			__GFP_ATOMIC|__GFP_NOLOCKDEP)
>  
>  /* The GFP flags allowed during early boot */
>  #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))

Looks reasonable to me.

Acked-by: Dave Chinner <dchinner@redhat.com>

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread
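
To see why the flag has to sit in GFP_RECLAIM_MASK at all, recall how the
allocators derive the gfp mask for their internal metadata allocations; a
simplified sketch mirroring the nested_gfp computation in mm/vmalloc.c:

<snip>
	/*
	 * Internal allocations keep only the reclaim-related bits of the
	 * caller's mask. Bits outside GFP_RECLAIM_MASK are silently dropped
	 * here; before this patch __GFP_NOLOCKDEP was one of them, so the
	 * lockdep hint never reached the page allocator.
	 */
	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
<snip>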

* Re: [PATCH v2 0/4] extend vmalloc support for constrained allocations
  2021-11-25  9:30     ` Michal Hocko
  2021-11-25 21:30       ` Dave Chinner
@ 2021-11-26  9:20       ` Vlastimil Babka
  1 sibling, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2021-11-26  9:20 UTC (permalink / raw)
  To: Michal Hocko, Dave Chinner
  Cc: Andrew Morton, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Sebastian Andrzej Siewior

On 11/25/21 10:30, Michal Hocko wrote:
> [Cc Sebastian and Vlastimil]
> 
> On Thu 25-11-21 09:58:31, Michal Hocko wrote:
>> On Thu 25-11-21 09:55:26, Dave Chinner wrote:
>> [...]
>> > Correct __GFP_NOLOCKDEP support is also needed. See:
>> > 
>> > https://lore.kernel.org/linux-mm/20211119225435.GZ449541@dread.disaster.area/
>> 
>> I will have a closer look. This will require changes on both vmalloc and
>> sl?b sides.
> 
> This should hopefully do the trick
> --- 
> From 0082d29c771d831e5d1b9bb4c0a61d39bac017f0 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 25 Nov 2021 10:20:16 +0100
> Subject: [PATCH] mm: make slab and vmalloc allocators __GFP_NOLOCKDEP aware
> 
> sl?b and vmalloc allocators reduce the given gfp mask for their internal
> needs. For that they use GFP_RECLAIM_MASK to preserve the reclaim
> behavior and constraints.
> 
> __GFP_NOLOCKDEP is not a part of that mask because it doesn't really
> control the reclaim behavior strictly speaking. On the other hand
> it tells the underlying page allocator to disable reclaim recursion
> detection so arguably it should be part of the mask.
> 
> Having __GFP_NOLOCKDEP in the mask will not alter the behavior in any
> form so this change is safe pretty much by definition. It also adds
> support for this flag to the SL?B and vmalloc allocators, which will in
> turn allow its use with kvmalloc as well. The lack of this support was
> noticed recently in http://lkml.kernel.org/r/20211119225435.GZ449541@dread.disaster.area
> 
> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/internal.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 3b79a5c9427a..2ceea20b5b2a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -21,7 +21,7 @@
>  #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
>  			__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_NOFAIL|\
>  			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\
> -			__GFP_ATOMIC)
> +			__GFP_ATOMIC|__GFP_NOLOCKDEP)
>  
>  /* The GFP flags allowed during early boot */
>  #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-22 15:32 ` [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL Michal Hocko
  2021-11-23 19:01   ` Uladzislau Rezki
@ 2021-11-26 10:48   ` Michal Hocko
  2021-11-28  0:00     ` Andrew Morton
  2021-11-26 15:32   ` Vlastimil Babka
  2 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2021-11-26 10:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

On Mon 22-11-21 16:32:31, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Dave Chinner has mentioned that some of the xfs code would benefit from
> kvmalloc support for __GFP_NOFAIL because they have allocations that
> cannot fail and they do not fit into a single page.
> 
> The large part of the vmalloc implementation already complies with the
> given gfp flags so there is no work for those to be done. The area
> and page table allocations are an exception to that. Implement a retry
> loop for those.
> 
> Add a short sleep before retrying. 1 jiffy is a completely random
> timeout. Ideally the retry would wait for an explicit event - e.g.
> a change to the vmalloc space if the failure was caused by
> the space fragmentation or depletion. But there are multiple different
> reasons to retry and this could become much more complex. Keep the retry
> simple for now and just sleep to prevent hogging CPUs.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Are there still any concerns around this patch or the approach in
general?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-24  5:23           ` NeilBrown
  2021-11-25  0:32             ` Theodore Y. Ts'o
@ 2021-11-26 14:50             ` Vlastimil Babka
  2021-11-26 15:09               ` Michal Hocko
  1 sibling, 1 reply; 44+ messages in thread
From: Vlastimil Babka @ 2021-11-26 14:50 UTC (permalink / raw)
  To: NeilBrown, Andrew Morton
  Cc: Uladzislau Rezki, Michal Hocko, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

On 11/24/21 06:23, NeilBrown wrote:
>> 
>> I forget why radix_tree_preload used a cpu-local store rather than a
>> per-task one.
>> 
>> Plus "what order pages would you like" and "on which node" and "in
>> which zone", etc...
> 
> "what order" - only order-0 I hope.  I'd hazard a guess that 90% of
> current NOFAIL allocations only need one page (providing slub is used -
> slab seems to insist on high-order pages sometimes).

Yeah, AFAIK SLUB can prefer higher orders than SLAB, but also allows a fallback
to the smallest order that's enough (thus 0 unless the objects are larger than a
page).

> "which node" - whichever.  Unless __GFP_HARDWALL is set, alloc_page()
> will fall-back to "whichever" anyway, and NOFAIL with HARDWALL is
> probably a poor choice.
> "which zone" - NORMAL.  I cannot find any NOFAIL allocations that want
> DMA.  fs/ntfs asks for __GFP_HIGHMEM with NOFAIL, but that doesn't
> *require* highmem.
> 
> Of course, before designing this interface too precisely we should check
> if anyone can use it.  From a quick look through some of the 100-ish
> users of __GFP_NOFAIL I'd guess that mempools would help - the
> preallocation should happen at init-time, not request-time.  Maybe if we
> made mempools even more lightweight .... though that risks allocating a
> lot of memory that will never get used.
> 
> This brings me back to the idea that
>     alloc_page(wait and reclaim allowed)
> should only fail on OOM_KILL.  That way kernel threads are safe, and
> user-threads are free to return ENOMEM knowing it won't get to

Hm, I thought that's already pretty much the case with the "too small to
fail" rule of today. IIRC there's exactly that gotcha that an OOM kill can
result in such an allocation failure. But I believe that approach is rather
fragile. If you encounter such an allocation that does not check the
resulting page != NULL, you can only guess which one is true:

- the author simply forgot to check at all
- the author relied on "too small to fail" without realizing the gotcha
- at the time of writing, the code was verified to only run in kernel
thread context, not user context, and
  - it is still true
  - it stopped being true at some later point
  - might be hard to even decide which is the case

IIRC at some point we tried to abolish the "too small to fail" rule because
of this, but Linus rejected that. But the opposite - making it a hard
guarantee in all cases - also didn't happen, so...

> user-space.  If user-thread code really needs NOFAIL, it punts to a
> workqueue and waits - aborting the wait if it is killed, while the work
> item still runs eventually.
> 
> NeilBrown
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread
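
A minimal sketch of the init-time mempool preallocation mentioned above;
the pool size, object size and all names are purely illustrative:

<snip>
#include <linux/mempool.h>
#include <linux/module.h>

static mempool_t *example_pool;

/* Reserve 16 objects of 256 bytes up front, at init time. */
static int __init example_init(void)
{
	example_pool = mempool_create_kmalloc_pool(16, 256);
	return example_pool ? 0 : -ENOMEM;
}

/*
 * At request time a sleeping allocation cannot fail: mempool_alloc()
 * falls back to the preallocated elements and, if the pool is empty,
 * waits for one to be returned.
 */
static void *example_get(void)
{
	return mempool_alloc(example_pool, GFP_NOFS);
}
<snip>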

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-26 14:50             ` Vlastimil Babka
@ 2021-11-26 15:09               ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2021-11-26 15:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: NeilBrown, Andrew Morton, Uladzislau Rezki, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, linux-mm, LKML, Ilya Dryomov,
	Jeff Layton

On Fri 26-11-21 15:50:15, Vlastimil Babka wrote:
> On 11/24/21 06:23, NeilBrown wrote:
> >> 
> >> I forget why radix_tree_preload used a cpu-local store rather than a
> >> per-task one.
> >> 
> >> Plus "what order pages would you like" and "on which node" and "in
> >> which zone", etc...
> > 
> > "what order" - only order-0 I hope.  I'd hazard a guess that 90% of
> > current NOFAIL allocations only need one page (providing slub is used -
> > slab seems to insist on high-order pages sometimes).
> 
> Yeah, AFAIK SLUB can prefer higher orders than SLAB, but also allows a fallback
> to the smallest order that's enough (thus 0 unless the objects are larger than a
> page).
> 
> > "which node" - whichever.  Unless __GFP_HARDWALL is set, alloc_page()
> > will fall-back to "whichever" anyway, and NOFAIL with HARDWALL is
> > probably a poor choice.
> > "which zone" - NORMAL.  I cannot find any NOFAIL allocations that want
> > DMA.  fs/ntfs asks for __GFP_HIGHMEM with NOFAIL, but that doesn't
> > *require* highmem.
> > 
> > Of course, before designing this interface too precisely we should check
> > if anyone can use it.  From a quick look through some of the 100-ish
> > users of __GFP_NOFAIL I'd guess that mempools would help - the
> > preallocation should happen at init-time, not request-time.  Maybe if we
> > made mempools even more lightweight .... though that risks allocating a
> > lot of memory that will never get used.
> > 
> > This brings me back to the idea that
> >     alloc_page(wait and reclaim allowed)
> > should only fail on OOM_KILL.  That way kernel threads are safe, and
> > user-threads are free to return ENOMEM knowing it won't get to
> 
> Hm, I thought that's already pretty much the case with the "too small to
> fail" rule of today. IIRC there's exactly that gotcha that an OOM kill can
> result in such an allocation failure. But I believe that approach is rather
> fragile. If you encounter such an allocation that does not check the
> resulting page != NULL, you can only guess which one is true:
> 
> - the author simply forgot to check at all
> - the author relied on "too small to fail" without realizing the gotcha
> - at the time of writing, the code was verified to only run in kernel
> thread context, not user context, and
>   - it is still true
>   - it stopped being true at some later point
>   - might be hard to even decide which is the case
> 
> IIRC at some point we tried to abolish the "too small to fail" rule because
> of this, but Linus rejected that. But the opposite - making it a hard
> guarantee in all cases - also didn't happen, so...

Yeah. IMHO we should treat each missing check for allocation failure
(except for GFP_NOFAIL) as a bug, regardless of the practical
implementation detail that small allocations do not fail. They can fail,
and we should never commit to officially supporting an implicit
non-fail semantic.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/4] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc
  2021-11-22 15:32 ` [PATCH v2 1/4] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc Michal Hocko
  2021-11-23 19:05   ` Uladzislau Rezki
@ 2021-11-26 15:13   ` Vlastimil Babka
  1 sibling, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2021-11-26 15:13 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Dave Chinner, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

On 11/22/21 16:32, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> vmalloc historically hasn't supported GFP_NO{FS,IO} requests because
> page table allocations do not support externally provided gfp mask
> and performed GFP_KERNEL like allocations.
> 
> Since few years we have scope (memalloc_no{fs,io}_{save,restore}) APIs
> to enforce NOFS and NOIO constraints implicitly to all allocators within
> the scope. There was a hope that those scopes would be defined on a
> higher level when the reclaim recursion boundary starts/stops (e.g. when
> a lock required during the memory reclaim is required etc.). It seems
> that not all NOFS/NOIO users have adopted this approach and instead
> they have taken a workaround approach to wrap a single [k]vmalloc
> allocation by a scope API.
> 
> These workarounds do not serve the purpose of a better reclaim recursion
> documentation and reduction of explicit GFP_NO{FS,IO} usage so let's
> just provide them with the semantic they are asking for without a need
> for workarounds.
> 
> Add support for GFP_NOFS and GFP_NOIO to vmalloc directly. All internal
> allocations already comply with the given gfp_mask. The only current
> exception is vmap_pages_range which maps kernel page tables. Infer the
> proper scope API based on the given gfp mask.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/vmalloc.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d2a00ad4e1dd..17ca7001de1f 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2926,6 +2926,8 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	unsigned long array_size;
>  	unsigned int nr_small_pages = size >> PAGE_SHIFT;
>  	unsigned int page_order;
> +	unsigned int flags;
> +	int ret;
>  
>  	array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
>  	gfp_mask |= __GFP_NOWARN;
> @@ -2967,8 +2969,24 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  		goto fail;
>  	}
>  
> -	if (vmap_pages_range(addr, addr + size, prot, area->pages,
> -			page_shift) < 0) {
> +	/*
> +	 * page tables allocations ignore external gfp mask, enforce it
> +	 * by the scope API
> +	 */
> +	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> +		flags = memalloc_nofs_save();
> +	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> +		flags = memalloc_noio_save();
> +
> +	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> +			page_shift);
> +
> +	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> +		memalloc_nofs_restore(flags);
> +	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> +		memalloc_noio_restore(flags);
> +
> +	if (ret < 0) {
>  		warn_alloc(orig_gfp_mask, NULL,
>  			"vmalloc error: size %lu, failed to map pages",
>  			area->nr_pages * PAGE_SIZE);
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread
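
For context, the per-allocation workaround pattern that this patch makes
unnecessary looks roughly as follows (a sketch; the actual ceph helper
differs in detail). The gfp checks in the hunk above follow the usual
convention: GFP_NOFS keeps __GFP_IO set but clears __GFP_FS, while
GFP_NOIO clears both.

<snip>
#include <linux/sched/mm.h>
#include <linux/mm.h>

/*
 * Old workaround: wrap a single allocation in a NOFS scope so that all
 * nested allocations done on its behalf (page tables included) observe
 * the constraint. With this series, __vmalloc()/kvmalloc() can simply
 * be called with GFP_NOFS directly.
 */
static void *example_kvmalloc_nofs(size_t size)
{
	unsigned int flags = memalloc_nofs_save();
	void *p = kvmalloc(size, GFP_KERNEL);

	memalloc_nofs_restore(flags);
	return p;
}
<snip>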

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-22 15:32 ` [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL Michal Hocko
  2021-11-23 19:01   ` Uladzislau Rezki
  2021-11-26 10:48   ` Michal Hocko
@ 2021-11-26 15:32   ` Vlastimil Babka
  2 siblings, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2021-11-26 15:32 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Dave Chinner, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

On 11/22/21 16:32, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Dave Chinner has mentioned that some of the xfs code would benefit from
> kvmalloc support for __GFP_NOFAIL because they have allocations that
> cannot fail and they do not fit into a single page.
> 
> The large part of the vmalloc implementation already complies with the
> given gfp flags so there is no work for those to be done. The area
> and page table allocations are an exception to that. Implement a retry
> loop for those.
> 
> Add a short sleep before retrying. 1 jiffy is a completely random
> timeout. Ideally the retry would wait for an explicit event - e.g.
> a change to the vmalloc space if the failure was caused by
> the space fragmentation or depletion. But there are multiple different
> reasons to retry and this could become much more complex. Keep the retry
> simple for now and just sleep to prevent hogging CPUs.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> @@ -2921,6 +2923,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  {
>  	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
>  	const gfp_t orig_gfp_mask = gfp_mask;
> +	bool nofail = gfp_mask & __GFP_NOFAIL;
>  	unsigned long addr = (unsigned long)area->addr;
>  	unsigned long size = get_vm_area_size(area);
>  	unsigned long array_size;
> @@ -2978,8 +2981,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
>  		flags = memalloc_noio_save();
>  
> -	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> +	do {
> +		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
>  			page_shift);
> +		if (nofail && (ret < 0))
> +			schedule_timeout_uninterruptible(1);
> +	} while (nofail && (ret < 0));

Kind of ugly to have the same condition twice here, but no hard feelings.

>  
>  	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
>  		memalloc_nofs_restore(flags);
> @@ -3074,9 +3081,14 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  				  VM_UNINITIALIZED | vm_flags, start, end, node,
>  				  gfp_mask, caller);
>  	if (!area) {
> +		bool nofail = gfp_mask & __GFP_NOFAIL;
>  		warn_alloc(gfp_mask, NULL,
> -			"vmalloc error: size %lu, vm_struct allocation failed",
> -			real_size);
> +			"vmalloc error: size %lu, vm_struct allocation failed%s",
> +			real_size, (nofail) ? ". Retrying." : "");
> +		if (nofail) {
> +			schedule_timeout_uninterruptible(1);
> +			goto again;
> +		}
>  		goto fail;
>  	}
>  
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 3/4] mm/vmalloc: be more explicit about supported gfp flags.
  2021-11-22 15:32 ` [PATCH v2 3/4] mm/vmalloc: be more explicit about supported gfp flags Michal Hocko
  2021-11-23 18:58   ` Uladzislau Rezki
@ 2021-11-26 15:39   ` Vlastimil Babka
  1 sibling, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2021-11-26 15:39 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Dave Chinner, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

On 11/22/21 16:32, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> b7d90e7a5ea8 ("mm/vmalloc: be more explicit about supported gfp flags")
> has been merged prematurely without the rest of the series and without
> addressing the review feedback from Neil. Fix that up now. Only the
> wording is changed slightly.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/vmalloc.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index b6aed4f94a85..b1c115ec13be 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3021,12 +3021,14 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>   *
>   * Allocate enough pages to cover @size from the page level
>   * allocator with @gfp_mask flags. Please note that the full set of gfp
> - * flags are not supported. GFP_KERNEL would be a preferred allocation mode
> - * but GFP_NOFS and GFP_NOIO are supported as well. Zone modifiers are not
> - * supported. From the reclaim modifiers__GFP_DIRECT_RECLAIM is required (aka
> - * GFP_NOWAIT is not supported) and only __GFP_NOFAIL is supported (aka
> - * __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported).
> - * __GFP_NOWARN can be used to suppress error messages about failures.
> + * flags are not supported. GFP_KERNEL, GFP_NOFS and GFP_NOIO are all
> + * supported.
> + * Zone modifiers are not supported. From the reclaim modifiers
> + * __GFP_DIRECT_RECLAIM is required (aka GFP_NOWAIT is not supported)
> + * and only __GFP_NOFAIL is supported (i.e. __GFP_NORETRY and
> + * __GFP_RETRY_MAYFAIL are not supported).
> + *
> + * __GFP_NOWARN can be used to suppress failures messages.
>   *
>   * Map them into contiguous kernel virtual space, using a pagetable
>   * protection of @prot.
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread
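
In practice the supported combinations spelled out in the updated comment
boil down to calls like the following (illustrative snippet; size and ptr
are assumed to be defined by the caller):

<snip>
	/* Sleeping allocation constrained against fs reclaim recursion. */
	ptr = __vmalloc(size, GFP_NOFS | __GFP_NOWARN);

	/* Not supported: GFP_NOWAIT lacks __GFP_DIRECT_RECLAIM. */
	/* ptr = __vmalloc(size, GFP_NOWAIT); */
<snip>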

* Re: [PATCH v2 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc
  2021-11-22 15:32 ` [PATCH v2 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc Michal Hocko
  2021-11-23 18:57   ` Uladzislau Rezki
  2021-11-23 19:02   ` Uladzislau Rezki
@ 2021-11-26 15:50   ` Vlastimil Babka
  2 siblings, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2021-11-26 15:50 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Dave Chinner, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

On 11/22/21 16:32, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented
> by previous patches so we can enable it for kvmalloc as well. This
> will allow some external users to simplify or completely remove
> their helpers.
> 
> The GFP_NOWAIT semantic hasn't been supported so far but that hasn't been
> explicitly documented, so let's add a note about it.
> 
> ceph_kvmalloc is the first helper to be dropped and changed to
> kvmalloc.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 44+ messages in thread
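
With the whole series applied, the callers that motivated it (e.g. the xfs
allocations mentioned in patch 2 that cannot fail and do not fit into a
single page) reduce to something like this sketch; the variable names are
illustrative:

<snip>
	/*
	 * kvmalloc() first tries the slab allocator and then falls back
	 * to vmalloc(); with __GFP_NOFAIL now honoured, the fallback
	 * keeps retrying instead of returning NULL.
	 */
	void *buf = kvmalloc(size, GFP_KERNEL | __GFP_NOFAIL);
<snip>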

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-26 10:48   ` Michal Hocko
@ 2021-11-28  0:00     ` Andrew Morton
  2021-11-29  8:56       ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2021-11-28  0:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Chinner, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

On Fri, 26 Nov 2021 11:48:46 +0100 Michal Hocko <mhocko@suse.com> wrote:

> On Mon 22-11-21 16:32:31, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Dave Chinner has mentioned that some of the xfs code would benefit from
> > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > cannot fail and they do not fit into a single page.
> > 
> > The large part of the vmalloc implementation already complies with the
> > given gfp flags so there is no work for those to be done. The area
> > and page table allocations are an exception to that. Implement a retry
> > loop for those.
> > 
> > Add a short sleep before retrying. 1 jiffy is a completely random
> > timeout. Ideally the retry would wait for an explicit event - e.g.
> > a change to the vmalloc space if the failure was caused by
> > the space fragmentation or depletion. But there are multiple different
> > reasons to retry and this could become much more complex. Keep the retry
> > simple for now and just sleep to prevent hogging CPUs.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Are there still any concerns around this patch or the approach in
> general?

Well.  Like GFP_NOFAIL, every use is a sin.  But I don't think I've
ever seen a real-world report of anyone hitting GFP_NOFAIL's
theoretical issues.

I assume there will be a v3?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-11-28  0:00     ` Andrew Morton
@ 2021-11-29  8:56       ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2021-11-29  8:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Neil Brown, Christoph Hellwig, Uladzislau Rezki,
	linux-fsdevel, linux-mm, LKML, Ilya Dryomov, Jeff Layton

On Sat 27-11-21 16:00:43, Andrew Morton wrote:
> On Fri, 26 Nov 2021 11:48:46 +0100 Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Mon 22-11-21 16:32:31, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > cannot fail and they do not fit into a single page.
> > > 
> > > The large part of the vmalloc implementation already complies with the
> > > given gfp flags so there is no work for those to be done. The area
> > > and page table allocations are an exception to that. Implement a retry
> > > loop for those.
> > > 
> > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > a change to the vmalloc space if the failure was caused by
> > > the space fragmentation or depletion. But there are multiple different
> > > reasons to retry and this could become much more complex. Keep the retry
> > > simple for now and just sleep to prevent hogging CPUs.
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > 
> > Are there still any concerns around this patch or the approach in
> > general?
> 
> Well.  Like GFP_NOFAIL, every use is a sin.  But I don't think I've
> ever seen a real-world report of anyone hitting GFP_NOFAIL's
> theoretical issues.

I am not sure what you mean here. If you are missing real GFP_NOFAIL use
cases, some have been mentioned in the discussion.

> I assume there will be a v3?

Currently I do not have any follow-up changes on top of either of the
patches except for acks and review tags. I can repost with those if you
prefer.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

Thread overview: 44+ messages
2021-11-22 15:32 [PATCH v2 0/4] extend vmalloc support for constrained allocations Michal Hocko
2021-11-22 15:32 ` [PATCH v2 1/4] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc Michal Hocko
2021-11-23 19:05   ` Uladzislau Rezki
2021-11-26 15:13   ` Vlastimil Babka
2021-11-22 15:32 ` [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL Michal Hocko
2021-11-23 19:01   ` Uladzislau Rezki
2021-11-23 20:09     ` Michal Hocko
2021-11-24 20:46       ` Uladzislau Rezki
2021-11-24  1:02     ` Andrew Morton
2021-11-24  3:16       ` NeilBrown
2021-11-24  3:48         ` Andrew Morton
2021-11-24  5:23           ` NeilBrown
2021-11-25  0:32             ` Theodore Y. Ts'o
2021-11-26 14:50             ` Vlastimil Babka
2021-11-26 15:09               ` Michal Hocko
2021-11-24 23:45         ` Dave Chinner
2021-11-24  8:43       ` Michal Hocko
2021-11-24 20:37         ` Uladzislau Rezki
2021-11-25  8:48           ` Michal Hocko
2021-11-25 18:40             ` Uladzislau Rezki
2021-11-25 19:21               ` Michal Hocko
2021-11-24 20:11       ` Uladzislau Rezki
2021-11-25  8:46         ` Michal Hocko
2021-11-25 18:02           ` Uladzislau Rezki
2021-11-25 19:24             ` Michal Hocko
2021-11-25 20:03               ` Uladzislau Rezki
2021-11-25 20:13                 ` Michal Hocko
2021-11-25 20:21                   ` Uladzislau Rezki
2021-11-26 10:48   ` Michal Hocko
2021-11-28  0:00     ` Andrew Morton
2021-11-29  8:56       ` Michal Hocko
2021-11-26 15:32   ` Vlastimil Babka
2021-11-22 15:32 ` [PATCH v2 3/4] mm/vmalloc: be more explicit about supported gfp flags Michal Hocko
2021-11-23 18:58   ` Uladzislau Rezki
2021-11-26 15:39   ` Vlastimil Babka
2021-11-22 15:32 ` [PATCH v2 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc Michal Hocko
2021-11-23 18:57   ` Uladzislau Rezki
2021-11-23 19:02   ` Uladzislau Rezki
2021-11-26 15:50   ` Vlastimil Babka
2021-11-24 22:55 ` [PATCH v2 0/4] extend vmalloc support for constrained allocations Dave Chinner
2021-11-25  8:58   ` Michal Hocko
2021-11-25  9:30     ` Michal Hocko
2021-11-25 21:30       ` Dave Chinner
2021-11-26  9:20       ` Vlastimil Babka
