linux-mm.kvack.org archive mirror
* [PATCH 0/4] extend vmalloc support for constrained allocations
@ 2021-10-25 15:02 Michal Hocko
  2021-10-25 15:02 ` [PATCH 1/4] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc Michal Hocko
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Michal Hocko @ 2021-10-25 15:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Dave Chinner, Neil Brown, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton

Hi,
this has been posted as an RFC previously [1] and it seems there was no
fundamental disagreement about the approach so I am dropping RFC and I
have also integrated some feedback from that discussion.

Based on a recent discussion with Dave and Neil [2] I have tried to
implement NOFS, NOIO, NOFAIL support for the vmalloc to make
life of kvmalloc users easier.


A requirement for NOFAIL support for kvmalloc was new to me but this
seems to be really needed by the xfs code.

NOFS/NOIO was a known and long-standing problem which was hoped to be
handled by the scope API. Those scopes should have been used at the
reclaim recursion boundaries both to document them and also to remove
the necessity of NOFS/NOIO constraints for all allocations within that
scope. Instead, workarounds were developed to wrap a single allocation
(like ceph_kvmalloc).

The first patch implements NOFS/NOIO support for vmalloc. The second one
adds NOFAIL support, the third one documents the supported gfp flags, and
the fourth one bundles it all together into kvmalloc and drops
ceph_kvmalloc, which can use kvmalloc directly now.

Please note that I haven't done any testing on this yet.
I hope I haven't missed anything in the vmalloc allocator. It would be
really great if Christoph and Uladzislau could have a look.

Thanks!

[1] http://lkml.kernel.org/r/20211018114712.9802-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/163184741778.29351.16920832234899124642.stgit@noble.brown





* [PATCH 1/4] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc
  2021-10-25 15:02 [PATCH 0/4] extend vmalloc support for constrained allocations Michal Hocko
@ 2021-10-25 15:02 ` Michal Hocko
  2021-10-25 15:02 ` [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL Michal Hocko
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2021-10-25 15:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Dave Chinner, Neil Brown, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

vmalloc historically hasn't supported GFP_NO{FS,IO} requests because
page table allocations do not support an externally provided gfp mask
and perform GFP_KERNEL-like allocations.

For a few years we have had scope (memalloc_no{fs,io}_{save,restore}) APIs
to enforce NOFS and NOIO constraints implicitly on all allocators within
the scope. There was a hope that those scopes would be defined on a
higher level when the reclaim recursion boundary starts/stops (e.g. when
a lock required during the memory reclaim is taken etc.). It seems
that not all NOFS/NOIO users have adopted this approach and instead
they have taken a workaround approach to wrap a single [k]vmalloc
allocation by a scope API.
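
As a rough illustration of the scope pattern, here is a minimal userspace
model in C. It is not kernel code: every demo_* name and bit value is
hypothetical, and only the save/restore shape is meant to mirror
memalloc_nofs_{save,restore}. Saving sets a per-task flag and returns its
previous state, restoring puts that state back, and any allocation made
inside the scope has the FS bit masked off.

```c
#include <assert.h>

/* Hypothetical bit values for illustration; not the kernel's. */
#define DEMO_GFP_IO	0x1u
#define DEMO_GFP_FS	0x2u
#define DEMO_GFP_KERNEL	(DEMO_GFP_IO | DEMO_GFP_FS)
#define DEMO_GFP_NOFS	(DEMO_GFP_IO)
#define DEMO_PF_NOFS	0x1u

/* Stands in for the per-task flags word. */
static unsigned int demo_task_flags;

/* Enter a NOFS scope; return the previous state of the NOFS bit. */
static unsigned int demo_nofs_save(void)
{
	unsigned int old = demo_task_flags & DEMO_PF_NOFS;

	demo_task_flags |= DEMO_PF_NOFS;
	return old;
}

/* Leave the scope by reinstating whatever state was saved. */
static void demo_nofs_restore(unsigned int old)
{
	/* Only the NOFS bit may come back from demo_nofs_save(). */
	assert((old & ~DEMO_PF_NOFS) == 0);
	demo_task_flags = (demo_task_flags & ~DEMO_PF_NOFS) | old;
}

/* What an allocator in this model honours for a requested mask. */
static unsigned int demo_effective_gfp(unsigned int gfp)
{
	if (demo_task_flags & DEMO_PF_NOFS)
		gfp &= ~DEMO_GFP_FS;
	return gfp;
}

/* The "wrap a single allocation" workaround: scope around one call. */
static unsigned int demo_alloc_wrapped_nofs(unsigned int gfp)
{
	unsigned int old = demo_nofs_save();
	unsigned int effective = demo_effective_gfp(gfp);

	demo_nofs_restore(old);
	return effective;
}
```

In this model demo_alloc_wrapped_nofs(DEMO_GFP_KERNEL) yields DEMO_GFP_NOFS,
while the same request made outside any scope keeps both bits. Because the
saved state is returned and restored, scopes can nest safely.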

These workarounds do not serve the purpose of a better reclaim recursion
documentation and reduction of explicit GFP_NO{FS,IO} usage so let's
just provide them with the semantic they are asking for without a need
for workarounds.

Add support for GFP_NOFS and GFP_NOIO to vmalloc directly. All internal
allocations already comply with the given gfp_mask. The only current
exception is vmap_pages_range which maps kernel page tables. Infer the
proper scope API based on the given gfp mask.
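
The mask-to-scope inference can be sketched in isolation. The following is
a userspace model under stated assumptions, not the patch itself: the
DEMO_* bit values and the demo_scope enum are hypothetical, and only the
two comparisons follow the checks on (__GFP_FS | __GFP_IO) in the diff
below.

```c
#include <assert.h>

/* Hypothetical bit values for illustration; not the kernel's. */
#define DEMO_GFP_IO	0x1u
#define DEMO_GFP_FS	0x2u

static_assert((DEMO_GFP_IO & DEMO_GFP_FS) == 0, "flag bits must be distinct");

enum demo_scope {
	DEMO_SCOPE_NONE,	/* no scope needed */
	DEMO_SCOPE_NOFS,	/* IO allowed, FS recursion forbidden */
	DEMO_SCOPE_NOIO,	/* neither IO nor FS allowed */
};

/* Pick the scope that enforces the caller's FS/IO restrictions. */
static enum demo_scope demo_scope_for(unsigned int gfp_mask)
{
	unsigned int bits = gfp_mask & (DEMO_GFP_FS | DEMO_GFP_IO);

	if (bits == DEMO_GFP_IO)
		return DEMO_SCOPE_NOFS;
	if (bits == 0)
		return DEMO_SCOPE_NOIO;
	return DEMO_SCOPE_NONE;
}
```

Note that a mask with the FS bit but without the IO bit gets no scope in
this model, which matches the two-branch check used in the patch.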

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmalloc.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d77830ff604c..c6cc77d2f366 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2889,6 +2889,8 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	unsigned long array_size;
 	unsigned int nr_small_pages = size >> PAGE_SHIFT;
 	unsigned int page_order;
+	unsigned int flags;
+	int ret;
 
 	array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
 	gfp_mask |= __GFP_NOWARN;
@@ -2930,8 +2932,24 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 		goto fail;
 	}
 
-	if (vmap_pages_range(addr, addr + size, prot, area->pages,
-			page_shift) < 0) {
+	/*
+	 * page tables allocations ignore external gfp mask, enforce it
+	 * by the scope API
+	 */
+	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
+		flags = memalloc_nofs_save();
+	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
+		flags = memalloc_noio_save();
+
+	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
+			page_shift);
+
+	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
+		memalloc_nofs_restore(flags);
+	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
+		memalloc_noio_restore(flags);
+
+	if (ret < 0) {
 		warn_alloc(gfp_mask, NULL,
 			"vmalloc error: size %lu, failed to map pages",
 			area->nr_pages * PAGE_SIZE);
-- 
2.30.2




* [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-10-25 15:02 [PATCH 0/4] extend vmalloc support for constrained allocations Michal Hocko
  2021-10-25 15:02 ` [PATCH 1/4] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc Michal Hocko
@ 2021-10-25 15:02 ` Michal Hocko
  2021-10-25 22:59   ` NeilBrown
  2021-10-26 15:48   ` Uladzislau Rezki
  2021-10-25 15:02 ` [PATCH 3/4] mm/vmalloc: be more explicit about supported gfp flags Michal Hocko
  2021-10-25 15:02 ` [PATCH 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc Michal Hocko
  3 siblings, 2 replies; 26+ messages in thread
From: Michal Hocko @ 2021-10-25 15:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Dave Chinner, Neil Brown, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Dave Chinner has mentioned that some of the xfs code would benefit from
kvmalloc support for __GFP_NOFAIL because they have allocations that
cannot fail and they do not fit into a single page.

The large part of the vmalloc implementation already complies with the
given gfp flags so there is no work for those to be done. The area
and page table allocations are an exception to that. Implement a retry
loop for those.

Add a short sleep before retrying. 1 jiffy is a completely arbitrary
timeout. Ideally the retry would wait for an explicit event - e.g.
a change to the vmalloc space if the failure was caused by
the space fragmentation or depletion. But there are multiple different
reasons to retry and this could become much more complex. Keep the retry
simple for now and just sleep to avoid hogging CPUs.
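
The retry-with-sleep shape can be modelled in userspace as follows. This is
a sketch under assumptions, not the kernel implementation: demo_try_map()
is a hypothetical stand-in for the fallible mapping step, DEMO_GFP_NOFAIL
is a made-up bit value, and a 1 ms nanosleep() takes the place of the
1 jiffy schedule_timeout_uninterruptible().

```c
#define _POSIX_C_SOURCE 200809L	/* for nanosleep() */
#include <assert.h>
#include <time.h>

#define DEMO_GFP_NOFAIL	0x1u	/* hypothetical bit value */

struct demo_result {
	int ret;	/* 0 on success, -1 on failure */
	int attempts;	/* how many times the operation was tried */
};

/* Hypothetical fallible step: fails until *failures_left hits zero. */
static int demo_try_map(int *failures_left)
{
	if (*failures_left > 0) {
		(*failures_left)--;
		return -1;
	}
	return 0;
}

/* Try once; keep retrying only when the caller asked for NOFAIL. */
static struct demo_result demo_map_with_retry(unsigned int gfp_mask,
					      int failures_before_success)
{
	const struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000000 };
	struct demo_result res = { .ret = 0, .attempts = 0 };

	assert(failures_before_success >= 0);

	do {
		res.ret = demo_try_map(&failures_before_success);
		res.attempts++;
		if (res.ret < 0)
			nanosleep(&pause, NULL);	/* back off for 1 ms */
	} while ((gfp_mask & DEMO_GFP_NOFAIL) && res.ret < 0);

	return res;
}
```

With three simulated failures, a NOFAIL request succeeds on the fourth
attempt, whereas a request without the flag gives up after the first one.
As in the patch, the sleep also happens once on the non-NOFAIL failure path.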

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmalloc.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c6cc77d2f366..602649919a9d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
 		flags = memalloc_noio_save();
 
-	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
+	do {
+		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
 			page_shift);
+		if (ret < 0)
+			schedule_timeout_uninterruptible(1);
+	} while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
 
 	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
 		memalloc_nofs_restore(flags);
@@ -3032,6 +3036,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 		warn_alloc(gfp_mask, NULL,
 			"vmalloc error: size %lu, vm_struct allocation failed",
 			real_size);
+		if (gfp_mask & __GFP_NOFAIL) {
+			schedule_timeout_uninterruptible(1);
+			goto again;
+		}
 		goto fail;
 	}
 
-- 
2.30.2




* [PATCH 3/4] mm/vmalloc: be more explicit about supported gfp flags.
  2021-10-25 15:02 [PATCH 0/4] extend vmalloc support for constrained allocations Michal Hocko
  2021-10-25 15:02 ` [PATCH 1/4] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc Michal Hocko
  2021-10-25 15:02 ` [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL Michal Hocko
@ 2021-10-25 15:02 ` Michal Hocko
  2021-10-25 23:26   ` NeilBrown
  2021-10-25 15:02 ` [PATCH 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc Michal Hocko
  3 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2021-10-25 15:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Dave Chinner, Neil Brown, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The core of the vmalloc allocator __vmalloc_area_node doesn't say
anything about gfp mask argument. Not all gfp flags are supported
though. Be more explicit about the constraints.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmalloc.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 602649919a9d..2199d821c981 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2980,8 +2980,16 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
  * @caller:		  caller's return address
  *
  * Allocate enough pages to cover @size from the page level
- * allocator with @gfp_mask flags.  Map them into contiguous
- * kernel virtual space, using a pagetable protection of @prot.
+ * allocator with @gfp_mask flags. Please note that the full set of gfp
+ * flags are not supported. GFP_KERNEL would be a preferred allocation mode
+ * but GFP_NOFS and GFP_NOIO are supported as well. Zone modifiers are not
+ * supported. From the reclaim modifiers __GFP_DIRECT_RECLAIM is required (aka
+ * GFP_NOWAIT is not supported) and only __GFP_NOFAIL is supported (aka
+ * __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported).
+ * __GFP_NOWARN can be used to suppress error messages about failures.
+ * 
+ * Map them into contiguous kernel virtual space, using a pagetable
+ * protection of @prot.
  *
  * Return: the address of the area or %NULL on failure
  */
-- 
2.30.2




* [PATCH 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc
  2021-10-25 15:02 [PATCH 0/4] extend vmalloc support for constrained allocations Michal Hocko
                   ` (2 preceding siblings ...)
  2021-10-25 15:02 ` [PATCH 3/4] mm/vmalloc: be more explicit about supported gfp flags Michal Hocko
@ 2021-10-25 15:02 ` Michal Hocko
  2021-10-25 23:34   ` NeilBrown
  3 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2021-10-25 15:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Dave Chinner, Neil Brown, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented
by the previous patches so we can allow it for kvmalloc as well. This
will allow some external users to simplify or completely remove
their helpers.

GFP_NOWAIT semantic hasn't been supported so far but it hasn't been
explicitly documented so let's add a note about that.

ceph_kvmalloc is the first helper to be dropped and changed to
kvmalloc.
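
How kvmalloc adjusts the flags for its first, physically contiguous attempt
can be sketched as a userspace model. All DEMO_* values are hypothetical.
The size check against the page size is assumed from the surrounding
kvmalloc_node() code, which is not fully visible in the hunk below. The
final line of the branch is the behaviour this patch adds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values for illustration; not the kernel's. */
#define DEMO_PAGE_SIZE		4096u
#define DEMO_GFP_NOWARN		0x1u
#define DEMO_GFP_NORETRY	0x2u
#define DEMO_GFP_RETRY_MAYFAIL	0x4u
#define DEMO_GFP_NOFAIL		0x8u

static_assert((DEMO_GFP_NORETRY & DEMO_GFP_NOFAIL) == 0,
	      "flag bits must be distinct");

/* Flags for the first, physically contiguous attempt. */
static unsigned int demo_kmalloc_flags(size_t size, unsigned int flags)
{
	unsigned int kmalloc_flags = flags;

	if (size > DEMO_PAGE_SIZE) {
		/* A fallback exists, so fail quietly and early. */
		kmalloc_flags |= DEMO_GFP_NOWARN;
		if (!(kmalloc_flags & DEMO_GFP_RETRY_MAYFAIL))
			kmalloc_flags |= DEMO_GFP_NORETRY;
		/* Leave the "must not fail" guarantee to the fallback. */
		kmalloc_flags &= ~DEMO_GFP_NOFAIL;
	}
	return kmalloc_flags;
}
```

In this model a small NOFAIL request reaches the first allocator unchanged,
while a multi-page NOFAIL request has the flag stripped for the first
attempt, so the guarantee is only enforced by the fallback path.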

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/ceph/libceph.h |  1 -
 mm/util.c                    | 15 ++++-----------
 net/ceph/buffer.c            |  4 ++--
 net/ceph/ceph_common.c       | 27 ---------------------------
 net/ceph/crypto.c            |  2 +-
 net/ceph/messenger.c         |  2 +-
 net/ceph/messenger_v2.c      |  2 +-
 net/ceph/osdmap.c            | 12 ++++++------
 8 files changed, 15 insertions(+), 50 deletions(-)

diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
index 409d8c29bc4f..309acbcb5a8a 100644
--- a/include/linux/ceph/libceph.h
+++ b/include/linux/ceph/libceph.h
@@ -295,7 +295,6 @@ extern bool libceph_compatible(void *data);
 
 extern const char *ceph_msg_type_name(int type);
 extern int ceph_check_fsid(struct ceph_client *client, struct ceph_fsid *fsid);
-extern void *ceph_kvmalloc(size_t size, gfp_t flags);
 
 struct fs_parameter;
 struct fc_log;
diff --git a/mm/util.c b/mm/util.c
index bacabe446906..fdec6b4b1267 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -549,13 +549,10 @@ EXPORT_SYMBOL(vm_mmap);
  * Uses kmalloc to get the memory but if the allocation fails then falls back
  * to the vmalloc allocator. Use kvfree for freeing the memory.
  *
- * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
+ * Reclaim modifiers - __GFP_NORETRY and GFP_NOWAIT are not supported.
  * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
  * preferable to the vmalloc fallback, due to visible performance drawbacks.
  *
- * Please note that any use of gfp flags outside of GFP_KERNEL is careful to not
- * fall back to vmalloc.
- *
  * Return: pointer to the allocated memory of %NULL in case of failure
  */
 void *kvmalloc_node(size_t size, gfp_t flags, int node)
@@ -563,13 +560,6 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	gfp_t kmalloc_flags = flags;
 	void *ret;
 
-	/*
-	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
-	 * so the given set of flags has to be compatible.
-	 */
-	if ((flags & GFP_KERNEL) != GFP_KERNEL)
-		return kmalloc_node(size, flags, node);
-
 	/*
 	 * We want to attempt a large physically contiguous block first because
 	 * it is less likely to fragment multiple larger blocks and therefore
@@ -582,6 +572,9 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 
 		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL))
 			kmalloc_flags |= __GFP_NORETRY;
+
+		/* nofail semantic is implemented by the vmalloc fallback */
+		kmalloc_flags &= ~__GFP_NOFAIL;
 	}
 
 	ret = kmalloc_node(size, kmalloc_flags, node);
diff --git a/net/ceph/buffer.c b/net/ceph/buffer.c
index 5622763ad402..7e51f128045d 100644
--- a/net/ceph/buffer.c
+++ b/net/ceph/buffer.c
@@ -7,7 +7,7 @@
 
 #include <linux/ceph/buffer.h>
 #include <linux/ceph/decode.h>
-#include <linux/ceph/libceph.h> /* for ceph_kvmalloc */
+#include <linux/ceph/libceph.h> /* for kvmalloc */
 
 struct ceph_buffer *ceph_buffer_new(size_t len, gfp_t gfp)
 {
@@ -17,7 +17,7 @@ struct ceph_buffer *ceph_buffer_new(size_t len, gfp_t gfp)
 	if (!b)
 		return NULL;
 
-	b->vec.iov_base = ceph_kvmalloc(len, gfp);
+	b->vec.iov_base = kvmalloc(len, gfp);
 	if (!b->vec.iov_base) {
 		kfree(b);
 		return NULL;
diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
index 97d6ea763e32..9441b4a4912b 100644
--- a/net/ceph/ceph_common.c
+++ b/net/ceph/ceph_common.c
@@ -190,33 +190,6 @@ int ceph_compare_options(struct ceph_options *new_opt,
 }
 EXPORT_SYMBOL(ceph_compare_options);
 
-/*
- * kvmalloc() doesn't fall back to the vmalloc allocator unless flags are
- * compatible with (a superset of) GFP_KERNEL.  This is because while the
- * actual pages are allocated with the specified flags, the page table pages
- * are always allocated with GFP_KERNEL.
- *
- * ceph_kvmalloc() may be called with GFP_KERNEL, GFP_NOFS or GFP_NOIO.
- */
-void *ceph_kvmalloc(size_t size, gfp_t flags)
-{
-	void *p;
-
-	if ((flags & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) {
-		p = kvmalloc(size, flags);
-	} else if ((flags & (__GFP_IO | __GFP_FS)) == __GFP_IO) {
-		unsigned int nofs_flag = memalloc_nofs_save();
-		p = kvmalloc(size, GFP_KERNEL);
-		memalloc_nofs_restore(nofs_flag);
-	} else {
-		unsigned int noio_flag = memalloc_noio_save();
-		p = kvmalloc(size, GFP_KERNEL);
-		memalloc_noio_restore(noio_flag);
-	}
-
-	return p;
-}
-
 static int parse_fsid(const char *str, struct ceph_fsid *fsid)
 {
 	int i = 0;
diff --git a/net/ceph/crypto.c b/net/ceph/crypto.c
index 92d89b331645..051d22c0e4ad 100644
--- a/net/ceph/crypto.c
+++ b/net/ceph/crypto.c
@@ -147,7 +147,7 @@ void ceph_crypto_key_destroy(struct ceph_crypto_key *key)
 static const u8 *aes_iv = (u8 *)CEPH_AES_IV;
 
 /*
- * Should be used for buffers allocated with ceph_kvmalloc().
+ * Should be used for buffers allocated with kvmalloc().
  * Currently these are encrypt out-buffer (ceph_buffer) and decrypt
  * in-buffer (msg front).
  *
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 57d043b382ed..7b891be799d2 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1920,7 +1920,7 @@ struct ceph_msg *ceph_msg_new2(int type, int front_len, int max_data_items,
 
 	/* front */
 	if (front_len) {
-		m->front.iov_base = ceph_kvmalloc(front_len, flags);
+		m->front.iov_base = kvmalloc(front_len, flags);
 		if (m->front.iov_base == NULL) {
 			dout("ceph_msg_new can't allocate %d bytes\n",
 			     front_len);
diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
index cc40ce4e02fb..c4099b641b38 100644
--- a/net/ceph/messenger_v2.c
+++ b/net/ceph/messenger_v2.c
@@ -308,7 +308,7 @@ static void *alloc_conn_buf(struct ceph_connection *con, int len)
 	if (WARN_ON(con->v2.conn_buf_cnt >= ARRAY_SIZE(con->v2.conn_bufs)))
 		return NULL;
 
-	buf = ceph_kvmalloc(len, GFP_NOIO);
+	buf = kvmalloc(len, GFP_NOIO);
 	if (!buf)
 		return NULL;
 
diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 75b738083523..2823bb3cff55 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -980,7 +980,7 @@ static struct crush_work *alloc_workspace(const struct crush_map *c)
 	work_size = crush_work_size(c, CEPH_PG_MAX_SIZE);
 	dout("%s work_size %zu bytes\n", __func__, work_size);
 
-	work = ceph_kvmalloc(work_size, GFP_NOIO);
+	work = kvmalloc(work_size, GFP_NOIO);
 	if (!work)
 		return NULL;
 
@@ -1190,9 +1190,9 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, u32 max)
 	if (max == map->max_osd)
 		return 0;
 
-	state = ceph_kvmalloc(array_size(max, sizeof(*state)), GFP_NOFS);
-	weight = ceph_kvmalloc(array_size(max, sizeof(*weight)), GFP_NOFS);
-	addr = ceph_kvmalloc(array_size(max, sizeof(*addr)), GFP_NOFS);
+	state = kvmalloc(array_size(max, sizeof(*state)), GFP_NOFS);
+	weight = kvmalloc(array_size(max, sizeof(*weight)), GFP_NOFS);
+	addr = kvmalloc(array_size(max, sizeof(*addr)), GFP_NOFS);
 	if (!state || !weight || !addr) {
 		kvfree(state);
 		kvfree(weight);
@@ -1222,7 +1222,7 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, u32 max)
 	if (map->osd_primary_affinity) {
 		u32 *affinity;
 
-		affinity = ceph_kvmalloc(array_size(max, sizeof(*affinity)),
+		affinity = kvmalloc(array_size(max, sizeof(*affinity)),
 					 GFP_NOFS);
 		if (!affinity)
 			return -ENOMEM;
@@ -1503,7 +1503,7 @@ static int set_primary_affinity(struct ceph_osdmap *map, int osd, u32 aff)
 	if (!map->osd_primary_affinity) {
 		int i;
 
-		map->osd_primary_affinity = ceph_kvmalloc(
+		map->osd_primary_affinity = kvmalloc(
 		    array_size(map->max_osd, sizeof(*map->osd_primary_affinity)),
 		    GFP_NOFS);
 		if (!map->osd_primary_affinity)
-- 
2.30.2




* Re: [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-10-25 15:02 ` [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL Michal Hocko
@ 2021-10-25 22:59   ` NeilBrown
  2021-10-26  7:03     ` Michal Hocko
  2021-10-26 15:48   ` Uladzislau Rezki
  1 sibling, 1 reply; 26+ messages in thread
From: NeilBrown @ 2021-10-25 22:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Dave Chinner, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

On Tue, 26 Oct 2021, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Dave Chinner has mentioned that some of the xfs code would benefit from
> kvmalloc support for __GFP_NOFAIL because they have allocations that
> cannot fail and they do not fit into a single page.
> 
> The larg part of the vmalloc implementation already complies with the

*large*

> given gfp flags so there is no work for those to be done. The area
> and page table allocations are an exception to that. Implement a retry
> loop for those.
> 
> Add a short sleep before retrying. 1 jiffy is a completely random
> timeout. Ideally the retry would wait for an explicit event - e.g.
> a change to the vmalloc space change if the failure was caused by
> the space fragmentation or depletion. But there are multiple different
> reasons to retry and this could become much more complex. Keep the retry
> simple for now and just sleep to prevent from hogging CPUs.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmalloc.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index c6cc77d2f366..602649919a9d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
>  		flags = memalloc_noio_save();
>  
> -	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> +	do {
> +		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
>  			page_shift);
> +		if (ret < 0)
> +			schedule_timeout_uninterruptible(1);
> +	} while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
>  
>  	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
>  		memalloc_nofs_restore(flags);
> @@ -3032,6 +3036,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  		warn_alloc(gfp_mask, NULL,
>  			"vmalloc error: size %lu, vm_struct allocation failed",
>  			real_size);
> +		if (gfp_mask & __GFP_NOFAIL) {
> +			schedule_timeout_uninterruptible(1);
> +			goto again;
> +		}

Shouldn't the retry happen *before* the warning?

NeilBrown


>  		goto fail;
>  	}
>  
> -- 
> 2.30.2
> 
> 



* Re: [PATCH 3/4] mm/vmalloc: be more explicit about supported gfp flags.
  2021-10-25 15:02 ` [PATCH 3/4] mm/vmalloc: be more explicit about supported gfp flags Michal Hocko
@ 2021-10-25 23:26   ` NeilBrown
  2021-10-26  7:10     ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2021-10-25 23:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Dave Chinner, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

On Tue, 26 Oct 2021, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> The core of the vmalloc allocator __vmalloc_area_node doesn't say
> anything about gfp mask argument. Not all gfp flags are supported
> though. Be more explicit about constrains.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmalloc.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 602649919a9d..2199d821c981 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2980,8 +2980,16 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>   * @caller:		  caller's return address
>   *
>   * Allocate enough pages to cover @size from the page level
> - * allocator with @gfp_mask flags.  Map them into contiguous
> - * kernel virtual space, using a pagetable protection of @prot.
> + * allocator with @gfp_mask flags. Please note that the full set of gfp
> + * flags are not supported. GFP_KERNEL would be a preferred allocation mode
> + * but GFP_NOFS and GFP_NOIO are supported as well. Zone modifiers are not

In what sense is GFP_KERNEL "preferred"??
The choice of GFP_NOFS, when necessary, isn't based on preference but
on need.

I understand that you would prefer that no one ever used GFP_NOFS and just
used the scope API.  I even agree.  But this is not the place to make
that case.

> + * supported. From the reclaim modifiers__GFP_DIRECT_RECLAIM is required (aka
> + * GFP_NOWAIT is not supported) and only __GFP_NOFAIL is supported (aka

I don't think "aka" is the right thing to use here.  It is short for
"also known as" and there is nothing that is being known as something
else.
It would be appropriate to say (i.e. GFP_NOWAIT is not supported).
"i.e." is short for the Latin "id est" which means "that is" and
normally introduces an alternate description (whereas aka introduces an
alternate name).


> + * __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported).

Why do you think __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported?

> + * __GFP_NOWARN can be used to suppress error messages about failures.

Surely "NOWARN" suppresses warning messages, not error messages ....

Thanks,
NeilBrown


> + * 
> + * Map them into contiguous kernel virtual space, using a pagetable
> + * protection of @prot.
>   *
>   * Return: the address of the area or %NULL on failure
>   */
> -- 
> 2.30.2
> 
> 



* Re: [PATCH 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc
  2021-10-25 15:02 ` [PATCH 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc Michal Hocko
@ 2021-10-25 23:34   ` NeilBrown
  2021-10-26  7:15     ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2021-10-25 23:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Dave Chinner, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton,
	Michal Hocko

On Tue, 26 Oct 2021, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> A support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented
> by previous patches so we can allow the support for kvmalloc. This
> will allow some external users to simplify or completely remove
> their helpers.
> 
> GFP_NOWAIT semantic hasn't been supported so far but it hasn't been
> explicitly documented so let's add a note about that.
> 
> ceph_kvmalloc is the first helper to be dropped and changed to
> kvmalloc.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/ceph/libceph.h |  1 -
>  mm/util.c                    | 15 ++++-----------
>  net/ceph/buffer.c            |  4 ++--
>  net/ceph/ceph_common.c       | 27 ---------------------------
>  net/ceph/crypto.c            |  2 +-
>  net/ceph/messenger.c         |  2 +-
>  net/ceph/messenger_v2.c      |  2 +-
>  net/ceph/osdmap.c            | 12 ++++++------
>  8 files changed, 15 insertions(+), 50 deletions(-)
> 
> diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
> index 409d8c29bc4f..309acbcb5a8a 100644
> --- a/include/linux/ceph/libceph.h
> +++ b/include/linux/ceph/libceph.h
> @@ -295,7 +295,6 @@ extern bool libceph_compatible(void *data);
>  
>  extern const char *ceph_msg_type_name(int type);
>  extern int ceph_check_fsid(struct ceph_client *client, struct ceph_fsid *fsid);
> -extern void *ceph_kvmalloc(size_t size, gfp_t flags);
>  
>  struct fs_parameter;
>  struct fc_log;
> diff --git a/mm/util.c b/mm/util.c
> index bacabe446906..fdec6b4b1267 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -549,13 +549,10 @@ EXPORT_SYMBOL(vm_mmap);
>   * Uses kmalloc to get the memory but if the allocation fails then falls back
>   * to the vmalloc allocator. Use kvfree for freeing the memory.
>   *
> - * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
> + * Reclaim modifiers - __GFP_NORETRY and GFP_NOWAIT are not supported.

GFP_NOWAIT is not a modifier.  It is a base value that can be modified.
I think you mean that
    __GFP_NORETRY is not supported and __GFP_DIRECT_RECLAIM is required

But I really cannot see why either of these statements is true.

Before your patch, __GFP_NORETRY would have forced use of kmalloc, so
that would mean it isn't really supported.  But that doesn't happen any more.
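
To illustrate the base-value versus modifier distinction, here is a rough
userspace sketch. The bit values are hypothetical; the way the base masks
are composed is assumed to follow include/linux/gfp.h in kernels of this
era, where GFP_NOWAIT carries only the kswapd reclaim bit. The helper
demo_may_block() is likewise made up for this example.

```c
#include <assert.h>

/* Hypothetical bit values; only the composition is meant to match. */
#define DEMO_DIRECT_RECLAIM	0x1u
#define DEMO_KSWAPD_RECLAIM	0x2u
#define DEMO_IO			0x4u
#define DEMO_FS			0x8u

/* Base values are built out of modifier bits. */
#define DEMO_RECLAIM	(DEMO_DIRECT_RECLAIM | DEMO_KSWAPD_RECLAIM)
#define DEMO_GFP_KERNEL	(DEMO_RECLAIM | DEMO_IO | DEMO_FS)
#define DEMO_GFP_NOFS	(DEMO_RECLAIM | DEMO_IO)
#define DEMO_GFP_NOIO	(DEMO_RECLAIM)
#define DEMO_GFP_NOWAIT	(DEMO_KSWAPD_RECLAIM)

static_assert((DEMO_GFP_NOWAIT & DEMO_DIRECT_RECLAIM) == 0,
	      "NOWAIT must not allow direct reclaim");

/* A request may sleep only if it permits direct reclaim. */
static int demo_may_block(unsigned int gfp)
{
	return (gfp & DEMO_DIRECT_RECLAIM) != 0;
}
```

Under these assumptions, "GFP_NOWAIT is not supported" and
"__GFP_DIRECT_RECLAIM is required" describe the same restriction: the first
names a base value, the second names the modifier bit that base value lacks.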

Thanks,
NeilBrown



* Re: [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-10-25 22:59   ` NeilBrown
@ 2021-10-26  7:03     ` Michal Hocko
  2021-10-26 10:30       ` NeilBrown
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2021-10-26  7:03 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, Dave Chinner, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton

On Tue 26-10-21 09:59:36, Neil Brown wrote:
> On Tue, 26 Oct 2021, Michal Hocko wrote:
[...]
> > @@ -3032,6 +3036,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> >  		warn_alloc(gfp_mask, NULL,
> >  			"vmalloc error: size %lu, vm_struct allocation failed",
> >  			real_size);
> > +		if (gfp_mask & __GFP_NOFAIL) {
> > +			schedule_timeout_uninterruptible(1);
> > +			goto again;
> > +		}
> 
> Shouldn't the retry happen *before* the warning?

I've done it after to catch the "depleted or fragmented" vmalloc space.
This is not related to the memory available and therefore it won't be
handled by the oom killer. The error message shouldn't imply the vmalloc
allocation failure IMHO but I am open to suggestions.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 3/4] mm/vmalloc: be more explicit about supported gfp flags.
  2021-10-25 23:26   ` NeilBrown
@ 2021-10-26  7:10     ` Michal Hocko
  2021-10-26 10:43       ` NeilBrown
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2021-10-26  7:10 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, Dave Chinner, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton

On Tue 26-10-21 10:26:06, Neil Brown wrote:
> On Tue, 26 Oct 2021, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > The core of the vmalloc allocator __vmalloc_area_node doesn't say
> > anything about gfp mask argument. Not all gfp flags are supported
> > though. Be more explicit about constrains.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/vmalloc.c | 12 ++++++++++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 602649919a9d..2199d821c981 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2980,8 +2980,16 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >   * @caller:		  caller's return address
> >   *
> >   * Allocate enough pages to cover @size from the page level
> > - * allocator with @gfp_mask flags.  Map them into contiguous
> > - * kernel virtual space, using a pagetable protection of @prot.
> > + * allocator with @gfp_mask flags. Please note that the full set of gfp
> > + * flags are not supported. GFP_KERNEL would be a preferred allocation mode
> > + * but GFP_NOFS and GFP_NOIO are supported as well. Zone modifiers are not
> 
> In what sense is GFP_KERNEL "preferred"??
> The choice of GFP_NOFS, when necessary, isn't based on preference but
> on need.
> 
> I understand that you would prefer no one ever used GFP_NOFs ever - just
> use the scope API.  I even agree.  But this is not the place to make
> that case. 

Any suggestion for a better wording?

> > + * supported. From the reclaim modifiers__GFP_DIRECT_RECLAIM is required (aka
> > + * GFP_NOWAIT is not supported) and only __GFP_NOFAIL is supported (aka
> 
> I don't think "aka" is the right thing to use here.  It is short for
> "also known as" and there is nothing that is being known as something
> else.
> It would be appropriate to say (i.e. GFP_NOWAIT is not supported).
> "i.e." is short for the Latin "id est" which means "that is" and
> normally introduces an alternate description (whereas aka introduces an
> alternate name).

OK
 
> > + * __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported).
> 
> Why do you think __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported.

Because they cannot be passed to the page table allocator. In both cases
the allocation would fail when the system is short on memory. GFP_KERNEL,
used implicitly for ptes, doesn't behave that way.

> 
> > + * __GFP_NOWARN can be used to suppress error messages about failures.
> 
> Surely "NOWARN" suppresses warning messages, not error messages ....

I am not sure I follow. NOWARN means "do not warn" independently of the
log level chosen for the message. Is an allocation failure an error
message? Is the "vmalloc error: size %lu, failed to map pages" an error
message?

Anyway I will go with "__GFP_NOWARN can be used to suppress failure messages"

Is that better?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc
  2021-10-25 23:34   ` NeilBrown
@ 2021-10-26  7:15     ` Michal Hocko
  2021-10-26 10:48       ` NeilBrown
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2021-10-26  7:15 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, Dave Chinner, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton

On Tue 26-10-21 10:34:34, Neil Brown wrote:
> On Tue, 26 Oct 2021, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > A support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented
> > by previous patches so we can allow the support for kvmalloc. This
> > will allow some external users to simplify or completely remove
> > their helpers.
> > 
> > GFP_NOWAIT semantic hasn't been supported so far but it hasn't been
> > explicitly documented so let's add a note about that.
> > 
> > ceph_kvmalloc is the first helper to be dropped and changed to
> > kvmalloc.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  include/linux/ceph/libceph.h |  1 -
> >  mm/util.c                    | 15 ++++-----------
> >  net/ceph/buffer.c            |  4 ++--
> >  net/ceph/ceph_common.c       | 27 ---------------------------
> >  net/ceph/crypto.c            |  2 +-
> >  net/ceph/messenger.c         |  2 +-
> >  net/ceph/messenger_v2.c      |  2 +-
> >  net/ceph/osdmap.c            | 12 ++++++------
> >  8 files changed, 15 insertions(+), 50 deletions(-)
> > 
> > diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
> > index 409d8c29bc4f..309acbcb5a8a 100644
> > --- a/include/linux/ceph/libceph.h
> > +++ b/include/linux/ceph/libceph.h
> > @@ -295,7 +295,6 @@ extern bool libceph_compatible(void *data);
> >  
> >  extern const char *ceph_msg_type_name(int type);
> >  extern int ceph_check_fsid(struct ceph_client *client, struct ceph_fsid *fsid);
> > -extern void *ceph_kvmalloc(size_t size, gfp_t flags);
> >  
> >  struct fs_parameter;
> >  struct fc_log;
> > diff --git a/mm/util.c b/mm/util.c
> > index bacabe446906..fdec6b4b1267 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -549,13 +549,10 @@ EXPORT_SYMBOL(vm_mmap);
> >   * Uses kmalloc to get the memory but if the allocation fails then falls back
> >   * to the vmalloc allocator. Use kvfree for freeing the memory.
> >   *
> > - * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
> > + * Reclaim modifiers - __GFP_NORETRY and GFP_NOWAIT are not supported.
> 
> GFP_NOWAIT is not a modifier.  It is a base value that can be modified.
> I think you mean that
>     __GFP_NORETRY is not supported and __GFP_DIRECT_RECLAIM is required

I thought naming the higher level gfp mask would be more helpful here.
Most people do not tend to think in terms of __GFP_DIRECT_RECLAIM but
rather GFP_NOWAIT or GFP_ATOMIC.

> But I really cannot see why either of these statements are true.

The reason is the same as why vmalloc does not support either of them.

> Before your patch, __GFP_NORETRY would have forced use of kmalloc, so
> that would mean it isn't really supported.  But that doesn't happen any more.

__GFP_NORETRY is used internally by kvmalloc but that doesn't mean it is
supported by the caller. In fact __GFP_NORETRY is used to implement a
higher level logic of the prioritization between kmalloc and vmalloc
fallback because some users would rather see vmalloc fallback even for
smaller allocations which do not really fail otherwise (e.g. < order-4).
-- 
Michal Hocko
SUSE Labs
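
The prioritization between kmalloc and the vmalloc fallback described
above can be illustrated with a small userspace sketch. This is not the
kernel implementation: every sk_ name, the flag bit values and the
"contiguous limit" are assumptions made up for the example, and malloc
stands in for both allocators. The point is only that the no-retry
behaviour is added internally for the first attempt, without the caller
ever passing it.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag bits and page size, for this sketch only. */
#define SK_GFP_NORETRY 0x1u
#define SK_GFP_NOWARN  0x2u
#define SK_PAGE_SIZE   4096u

enum sk_backend { SK_NONE, SK_KMALLOC, SK_VMALLOC };

/* Pretend contiguous memory above this size is unavailable (assumption). */
static size_t sk_contig_limit = 2 * SK_PAGE_SIZE;

/* Stand-in for kmalloc: fails for "large" contiguous requests. */
static void *sk_kmalloc(size_t size, unsigned int flags)
{
	(void)flags;
	return size <= sk_contig_limit ? malloc(size) : NULL;
}

/* Stand-in for vmalloc: virtually contiguous, so size is not a problem. */
static void *sk_vmalloc(size_t size, unsigned int flags)
{
	(void)flags;
	return malloc(size);
}

static void *sk_kvmalloc(size_t size, unsigned int flags,
			 enum sk_backend *used)
{
	unsigned int kmalloc_flags = flags;
	void *p;

	/* Multi-page request: do not try hard, a fallback exists. */
	if (size > SK_PAGE_SIZE)
		kmalloc_flags |= SK_GFP_NORETRY | SK_GFP_NOWARN;

	p = sk_kmalloc(size, kmalloc_flags);
	if (p) {
		*used = SK_KMALLOC;
		return p;
	}

	/* A request of at most one page gains nothing from the fallback. */
	if (size <= SK_PAGE_SIZE) {
		*used = SK_NONE;
		return NULL;
	}

	/* The fallback sees the caller's original flags only. */
	p = sk_vmalloc(size, flags);
	*used = p ? SK_VMALLOC : SK_NONE;
	return p;
}
```

In this sketch a 100 byte request is served by the kmalloc stand-in,
while a 16 page request falls through to the vmalloc stand-in.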


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-10-26  7:03     ` Michal Hocko
@ 2021-10-26 10:30       ` NeilBrown
  2021-10-26 11:29         ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2021-10-26 10:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Dave Chinner, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton

On Tue, 26 Oct 2021, Michal Hocko wrote:
> On Tue 26-10-21 09:59:36, Neil Brown wrote:
> > On Tue, 26 Oct 2021, Michal Hocko wrote:
> [...]
> > > @@ -3032,6 +3036,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> > >  		warn_alloc(gfp_mask, NULL,
> > >  			"vmalloc error: size %lu, vm_struct allocation failed",
> > >  			real_size);
> > > +		if (gfp_mask & __GFP_NOFAIL) {
> > > +			schedule_timeout_uninterruptible(1);
> > > +			goto again;
> > > +		}
> > 
> > Shouldn't the retry happen *before* the warning?
> 
> I've done it after to catch the "depleted or fragmented" vmalloc space.
> This is not related to the memory available and therefore it won't be
> handled by the oom killer. The error message shouldn't imply the vmalloc
> allocation failure IMHO but I am open to suggestions.

The word "failed" does seem to imply what you don't want it to imply...

I guess it is reasonable to have this warning, but maybe add " -- retrying"
if __GFP_NOFAIL.

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] mm/vmalloc: be more explicit about supported gfp flags.
  2021-10-26  7:10     ` Michal Hocko
@ 2021-10-26 10:43       ` NeilBrown
  2021-10-26 12:20         ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2021-10-26 10:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Dave Chinner, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton

On Tue, 26 Oct 2021, Michal Hocko wrote:
> On Tue 26-10-21 10:26:06, Neil Brown wrote:
> > On Tue, 26 Oct 2021, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > The core of the vmalloc allocator __vmalloc_area_node doesn't say
> > > anything about gfp mask argument. Not all gfp flags are supported
> > > though. Be more explicit about constrains.
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  mm/vmalloc.c | 12 ++++++++++--
> > >  1 file changed, 10 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 602649919a9d..2199d821c981 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2980,8 +2980,16 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > >   * @caller:		  caller's return address
> > >   *
> > >   * Allocate enough pages to cover @size from the page level
> > > - * allocator with @gfp_mask flags.  Map them into contiguous
> > > - * kernel virtual space, using a pagetable protection of @prot.
> > > + * allocator with @gfp_mask flags. Please note that the full set of gfp
> > > + * flags are not supported. GFP_KERNEL would be a preferred allocation mode
> > > + * but GFP_NOFS and GFP_NOIO are supported as well. Zone modifiers are not
> > 
> > In what sense is GFP_KERNEL "preferred"??
> > The choice of GFP_NOFS, when necessary, isn't based on preference but
> > on need.
> > 
> > I understand that you would prefer no one ever used GFP_NOFs ever - just
> > use the scope API.  I even agree.  But this is not the place to make
> > that case. 
> 
> Any suggestion for a better wording?

 "GFP_KERNEL, GFP_NOFS, and GFP_NOIO are all supported".

> 
> > > + * supported. From the reclaim modifiers__GFP_DIRECT_RECLAIM is required (aka
> > > + * GFP_NOWAIT is not supported) and only __GFP_NOFAIL is supported (aka
> > 
> > I don't think "aka" is the right thing to use here.  It is short for
> > "also known as" and there is nothing that is being known as something
> > else.
> > It would be appropriate to say (i.e. GFP_NOWAIT is not supported).
> > "i.e." is short for the Latin "id est" which means "that is" and
> > normally introduces an alternate description (whereas aka introduces an
> > alternate name).
> 
> OK
>  
> > > + * __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported).
> > 
> > Why do you think __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported.
> 
> Because they cannot be passed to the page table allocator. In both cases
> the allocation would fail when system is short on memory. GFP_KERNEL
> used for ptes implicitly doesn't behave that way.

Could you please point me to the particular allocation which uses
GFP_KERNEL rather than the flags passed to __vmalloc_node()?  I cannot
find it.

> 
> > 
> > > + * __GFP_NOWARN can be used to suppress error messages about failures.
> > 
> > Surely "NOWARN" suppresses warning messages, not error messages ....
> 
> I am not sure I follow. NOWARN means "do not warn" independently on the
> log level chosen for the message. Is an allocation failure an error
> message? Is the "vmalloc error: size %lu, failed to map pages" an error
> message?

I guess working with a C compiler has trained me to think that
"warnings" are different from "errors".

> 
> Anyway I will go with "__GFP_NOWARN can be used to suppress failure messages"
> 
> Is that better?

Yes, that's an excellent solution!  Thanks.

NeilBrown


> -- 
> Michal Hocko
> SUSE Labs
> 
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc
  2021-10-26  7:15     ` Michal Hocko
@ 2021-10-26 10:48       ` NeilBrown
  2021-10-26 12:23         ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2021-10-26 10:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Dave Chinner, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton

On Tue, 26 Oct 2021, Michal Hocko wrote:
> On Tue 26-10-21 10:34:34, Neil Brown wrote:
> > On Tue, 26 Oct 2021, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > A support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented
> > > by previous patches so we can allow the support for kvmalloc. This
> > > will allow some external users to simplify or completely remove
> > > their helpers.
> > > 
> > > GFP_NOWAIT semantic hasn't been supported so far but it hasn't been
> > > explicitly documented so let's add a note about that.
> > > 
> > > ceph_kvmalloc is the first helper to be dropped and changed to
> > > kvmalloc.
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  include/linux/ceph/libceph.h |  1 -
> > >  mm/util.c                    | 15 ++++-----------
> > >  net/ceph/buffer.c            |  4 ++--
> > >  net/ceph/ceph_common.c       | 27 ---------------------------
> > >  net/ceph/crypto.c            |  2 +-
> > >  net/ceph/messenger.c         |  2 +-
> > >  net/ceph/messenger_v2.c      |  2 +-
> > >  net/ceph/osdmap.c            | 12 ++++++------
> > >  8 files changed, 15 insertions(+), 50 deletions(-)
> > > 
> > > diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
> > > index 409d8c29bc4f..309acbcb5a8a 100644
> > > --- a/include/linux/ceph/libceph.h
> > > +++ b/include/linux/ceph/libceph.h
> > > @@ -295,7 +295,6 @@ extern bool libceph_compatible(void *data);
> > >  
> > >  extern const char *ceph_msg_type_name(int type);
> > >  extern int ceph_check_fsid(struct ceph_client *client, struct ceph_fsid *fsid);
> > > -extern void *ceph_kvmalloc(size_t size, gfp_t flags);
> > >  
> > >  struct fs_parameter;
> > >  struct fc_log;
> > > diff --git a/mm/util.c b/mm/util.c
> > > index bacabe446906..fdec6b4b1267 100644
> > > --- a/mm/util.c
> > > +++ b/mm/util.c
> > > @@ -549,13 +549,10 @@ EXPORT_SYMBOL(vm_mmap);
> > >   * Uses kmalloc to get the memory but if the allocation fails then falls back
> > >   * to the vmalloc allocator. Use kvfree for freeing the memory.
> > >   *
> > > - * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
> > > + * Reclaim modifiers - __GFP_NORETRY and GFP_NOWAIT are not supported.
> > 
> > GFP_NOWAIT is not a modifier.  It is a base value that can be modified.
> > I think you mean that
> >     __GFP_NORETRY is not supported and __GFP_DIRECT_RECLAIM is required
> 
> I thought naming the higher level gfp mask would be more helpful here.
> Most people do not tend to think in terms of __GFP_DIRECT_RECLAIM but
> rather GFP_NOWAIT or GFP_ATOMIC.

Maybe it would.  But the text says "Reclaim modifiers" and then lists
one modifier and one mask.  That is confusing.
If you want to mention both, keep them separate.

  GFP_NOWAIT and GFP_ATOMIC are not supported, neither is the
  __GFP_NORETRY modifier.

or something like that.

Thanks,
NeilBrown


> 
> > But I really cannot see why either of these statements are true.
> 
> The reason is same as why vmalloc do not support neither of them.
> 
> > Before your patch, __GFP_NORETRY would have forced use of kmalloc, so
> > that would mean it isn't really supported.  But that doesn't happen any more.
> 
> __GFP_NORETRY is used internaly by kvmalloc but that doesn't mean it is
> supported by the caller. In fact __GFP_NORETRY is used to implement a
> higher level logic of the prioritization between kmalloc and vmalloc
> fallback because some users would rather see vmalloc fallback even for
> smaller allocations which do not really fail otherwise (e.g. < order-4).
> -- 
> Michal Hocko
> SUSE Labs
> 
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-10-26 10:30       ` NeilBrown
@ 2021-10-26 11:29         ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2021-10-26 11:29 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, Dave Chinner, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton

On Tue 26-10-21 21:30:52, Neil Brown wrote:
> On Tue, 26 Oct 2021, Michal Hocko wrote:
> > On Tue 26-10-21 09:59:36, Neil Brown wrote:
> > > On Tue, 26 Oct 2021, Michal Hocko wrote:
> > [...]
> > > > @@ -3032,6 +3036,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> > > >  		warn_alloc(gfp_mask, NULL,
> > > >  			"vmalloc error: size %lu, vm_struct allocation failed",
> > > >  			real_size);
> > > > +		if (gfp_mask & __GFP_NOFAIL) {
> > > > +			schedule_timeout_uninterruptible(1);
> > > > +			goto again;
> > > > +		}
> > > 
> > > Shouldn't the retry happen *before* the warning?
> > 
> > I've done it after to catch the "depleted or fragmented" vmalloc space.
> > This is not related to the memory available and therefore it won't be
> > handled by the oom killer. The error message shouldn't imply the vmalloc
> > allocation failure IMHO but I am open to suggestions.
> 
> The word "failed" does seem to imply what you don't want it to imply...
> 
> I guess it is reasonable to have this warning, but maybe add " -- retrying"
> if __GFP_NOFAIL.

I do not have a strong opinion on that. I can surely do
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 602649919a9d..3489928fafa2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3033,10 +3033,11 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 				  VM_UNINITIALIZED | vm_flags, start, end, node,
 				  gfp_mask, caller);
 	if (!area) {
+		bool nofail = gfp_mask & __GFP_NOFAIL;
 		warn_alloc(gfp_mask, NULL,
-			"vmalloc error: size %lu, vm_struct allocation failed",
-			real_size);
-		if (gfp_mask & __GFP_NOFAIL) {
+			"vmalloc error: size %lu, vm_struct allocation failed%s",
+			real_size, (nofail) ? ". Retrying." : "");
+		if (nofail) {
 			schedule_timeout_uninterruptible(1);
 			goto again;
 		}
-- 
Michal Hocko
SUSE Labs
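
The control flow of the change above can be tried outside the kernel
with a mock. The following is a userspace sketch under stated
assumptions: sk_get_area is a made-up allocator that fails a
configurable number of times, SK_GFP_NOFAIL is an arbitrary bit, and
fprintf stands in for warn_alloc. The one-jiffy sleep is only marked
by a comment.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define SK_GFP_NOFAIL 0x4u	/* hypothetical bit value */

/* Mock area allocator: fails until sk_failures_left drops to zero. */
static int sk_failures_left;
static int sk_area_storage;

static void *sk_get_area(void)
{
	if (sk_failures_left > 0) {
		sk_failures_left--;
		return NULL;
	}
	return &sk_area_storage;
}

/* Returns the area or NULL; *warnings counts the messages emitted. */
static void *sk_alloc_area(unsigned long size, unsigned int gfp_mask,
			   int *warnings)
{
	void *area;

again:
	area = sk_get_area();
	if (!area) {
		bool nofail = gfp_mask & SK_GFP_NOFAIL;

		/* Same message shape as in the patch, suffix only on retry. */
		fprintf(stderr,
			"vmalloc error: size %lu, vm_struct allocation failed%s\n",
			size, nofail ? ". Retrying." : "");
		(*warnings)++;
		if (nofail) {
			/* The kernel patch sleeps for one jiffy here. */
			goto again;
		}
	}
	return area;
}
```

With the NOFAIL bit set the mock keeps warning and retrying until the
area is available; without it the first failure is reported once and
NULL is returned.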


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] mm/vmalloc: be more explicit about supported gfp flags.
  2021-10-26 10:43       ` NeilBrown
@ 2021-10-26 12:20         ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2021-10-26 12:20 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, Dave Chinner, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton

On Tue 26-10-21 21:43:17, Neil Brown wrote:
> On Tue, 26 Oct 2021, Michal Hocko wrote:
> > On Tue 26-10-21 10:26:06, Neil Brown wrote:
> > > On Tue, 26 Oct 2021, Michal Hocko wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > 
> > > > The core of the vmalloc allocator __vmalloc_area_node doesn't say
> > > > anything about gfp mask argument. Not all gfp flags are supported
> > > > though. Be more explicit about constrains.
> > > > 
> > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > ---
> > > >  mm/vmalloc.c | 12 ++++++++++--
> > > >  1 file changed, 10 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index 602649919a9d..2199d821c981 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -2980,8 +2980,16 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > > >   * @caller:		  caller's return address
> > > >   *
> > > >   * Allocate enough pages to cover @size from the page level
> > > > - * allocator with @gfp_mask flags.  Map them into contiguous
> > > > - * kernel virtual space, using a pagetable protection of @prot.
> > > > + * allocator with @gfp_mask flags. Please note that the full set of gfp
> > > > + * flags are not supported. GFP_KERNEL would be a preferred allocation mode
> > > > + * but GFP_NOFS and GFP_NOIO are supported as well. Zone modifiers are not
> > > 
> > > In what sense is GFP_KERNEL "preferred"??
> > > The choice of GFP_NOFS, when necessary, isn't based on preference but
> > > on need.
> > > 
> > > I understand that you would prefer no one ever used GFP_NOFs ever - just
> > > use the scope API.  I even agree.  But this is not the place to make
> > > that case. 
> > 
> > Any suggestion for a better wording?
> 
>  "GFP_KERNEL, GFP_NOFS, and GFP_NOIO are all supported".

OK. Check the incremental update at the end of the email

> > > > + * supported. From the reclaim modifiers__GFP_DIRECT_RECLAIM is required (aka
> > > > + * GFP_NOWAIT is not supported) and only __GFP_NOFAIL is supported (aka
> > > 
> > > I don't think "aka" is the right thing to use here.  It is short for
> > > "also known as" and there is nothing that is being known as something
> > > else.
> > > It would be appropriate to say (i.e. GFP_NOWAIT is not supported).
> > > "i.e." is short for the Latin "id est" which means "that is" and
> > > normally introduces an alternate description (whereas aka introduces an
> > > alternate name).
> > 
> > OK
> >  
> > > > + * __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported).
> > > 
> > > Why do you think __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported.
> > 
> > Because they cannot be passed to the page table allocator. In both cases
> > the allocation would fail when system is short on memory. GFP_KERNEL
> > used for ptes implicitly doesn't behave that way.
> 
> Could you please point me to the particular allocation which uses
> GFP_KERNEL rather than the flags passed to __vmalloc_node()?  I cannot
> find it.
> 

It is buried deep in the call chain:
__vmalloc_area_node
  vmap_pages_range
    vmap_pages_range_noflush
      vmap_range_noflush || vmap_small_pages_range_noflush
        vmap_p4d_range
          p4d_alloc_track
            __p4d_alloc
              p4d_alloc_one
                get_zeroed_page(GFP_KERNEL_ACCOUNT)

the same applies for all other levels of page tables.

This is what I have currently
commit ae7fc6c2ef6949a76d697fc61bb350197dfca330
Author: Michal Hocko <mhocko@suse.com>
Date:   Tue Oct 26 14:16:32 2021 +0200

    fold me "mm/vmalloc: be more explicit about supported gfp flags."

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 2ddaa9410aee..82a07b04317e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2981,12 +2981,14 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
  *
  * Allocate enough pages to cover @size from the page level
  * allocator with @gfp_mask flags. Please note that the full set of gfp
- * flags are not supported. GFP_KERNEL would be a preferred allocation mode
- * but GFP_NOFS and GFP_NOIO are supported as well. Zone modifiers are not
- * supported. From the reclaim modifiers__GFP_DIRECT_RECLAIM is required (aka
- * GFP_NOWAIT is not supported) and only __GFP_NOFAIL is supported (aka
- * __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported).
- * __GFP_NOWARN can be used to suppress error messages about failures.
+ * flags are not supported. GFP_KERNEL, GFP_NOFS, and GFP_NOIO are all
+ * supported.
+ * Zone modifiers are not supported. From the reclaim modifiers
+ * __GFP_DIRECT_RECLAIM is required (aka GFP_NOWAIT is not supported)
+ * and only __GFP_NOFAIL is supported (i.e. __GFP_NORETRY and 
+ * __GFP_RETRY_MAYFAIL are not supported).
+ *
+ * __GFP_NOWARN can be used to suppress failure messages.
  * 
  * Map them into contiguous kernel virtual space, using a pagetable
  * protection of @prot.
-- 
Michal Hocko
SUSE Labs
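
The rules spelled out in the comment above can be restated as a
predicate. This is only a sketch: no such helper is proposed in this
series, the sk_ names and bit values are invented for the example, and
only the way the composite masks are built from reclaim, IO and FS bits
follows the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical bit values, for this sketch only. */
#define SK_DMA            0x001u	/* zone modifier */
#define SK_HIGHMEM        0x002u	/* zone modifier */
#define SK_IO             0x004u
#define SK_FS             0x008u
#define SK_DIRECT_RECLAIM 0x010u
#define SK_KSWAPD_RECLAIM 0x020u
#define SK_NORETRY        0x040u
#define SK_RETRY_MAYFAIL  0x080u
#define SK_NOFAIL         0x100u
#define SK_NOWARN         0x200u

#define SK_ZONE_MASK  (SK_DMA | SK_HIGHMEM)
#define SK_RECLAIM    (SK_DIRECT_RECLAIM | SK_KSWAPD_RECLAIM)

/* Composite masks, composed the same way as their kernel counterparts. */
#define SK_GFP_KERNEL (SK_RECLAIM | SK_IO | SK_FS)
#define SK_GFP_NOFS   (SK_RECLAIM | SK_IO)
#define SK_GFP_NOIO   (SK_RECLAIM)
#define SK_GFP_NOWAIT (SK_KSWAPD_RECLAIM)

/* True if the mask obeys the constraints listed in the comment. */
static bool sk_vmalloc_gfp_ok(unsigned int gfp)
{
	/* Zone modifiers are not supported. */
	if (gfp & SK_ZONE_MASK)
		return false;
	/* Direct reclaim is required, which rules out NOWAIT. */
	if (!(gfp & SK_DIRECT_RECLAIM))
		return false;
	/* Of the retry modifiers only NOFAIL is acceptable. */
	if (gfp & (SK_NORETRY | SK_RETRY_MAYFAIL))
		return false;
	return true;
}
```

SK_NOFAIL and SK_NOWARN pass the check unchanged, which matches the
comment: the former is the only supported retry modifier and the latter
only silences failure messages.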


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc
  2021-10-26 10:48       ` NeilBrown
@ 2021-10-26 12:23         ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2021-10-26 12:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, Dave Chinner, Andrew Morton, Christoph Hellwig,
	Uladzislau Rezki, linux-fsdevel, LKML, Ilya Dryomov, Jeff Layton

On Tue 26-10-21 21:48:05, Neil Brown wrote:
> On Tue, 26 Oct 2021, Michal Hocko wrote:
[...]
> > > GFP_NOWAIT is not a modifier.  It is a base value that can be modified.
> > > I think you mean that
> > >     __GFP_NORETRY is not supported and __GFP_DIRECT_RECLAIM is required
> > 
> > I thought naming the higher level gfp mask would be more helpful here.
> > Most people do not tend to think in terms of __GFP_DIRECT_RECLAIM but
> > rather GFP_NOWAIT or GFP_ATOMIC.
> 
> Maybe it would.  But the text says "Reclaim modifiers" and then lists
> one modifier and one mask.  That is confusing.
> If you want to mention both, keep them separate.
> 
>   GFP_NOWAIT and GFP_ATOMIC are not supported, neither is the
>   __GFP_NORETRY modifier.
> 
> or something like that.

Fair enough. I went with this
commit fb93996c217cea864a3b3ffa8a8cd482bf0a1f62
Author: Michal Hocko <mhocko@suse.com>
Date:   Tue Oct 26 14:23:00 2021 +0200

    fold me "mm: allow !GFP_KERNEL allocations for kvmalloc"

diff --git a/mm/util.c b/mm/util.c
index fdec6b4b1267..1fb6dd907bb0 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -549,7 +549,7 @@ EXPORT_SYMBOL(vm_mmap);
  * Uses kmalloc to get the memory but if the allocation fails then falls back
  * to the vmalloc allocator. Use kvfree for freeing the memory.
  *
- * Reclaim modifiers - __GFP_NORETRY and GFP_NOWAIT are not supported.
+ * GFP_NOWAIT and GFP_ATOMIC are not supported, neither is the __GFP_NORETRY modifier.
  * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
  * preferable to the vmalloc fallback, due to visible performance drawbacks.
  *
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-10-25 15:02 ` [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL Michal Hocko
  2021-10-25 22:59   ` NeilBrown
@ 2021-10-26 15:48   ` Uladzislau Rezki
  2021-10-26 16:28     ` Michal Hocko
  1 sibling, 1 reply; 26+ messages in thread
From: Uladzislau Rezki @ 2021-10-26 15:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux Memory Management List, Dave Chinner, Neil Brown,
	Andrew Morton, Christoph Hellwig, linux-fsdevel, LKML,
	Ilya Dryomov, Jeff Layton, Michal Hocko

> From: Michal Hocko <mhocko@suse.com>
>
> Dave Chinner has mentioned that some of the xfs code would benefit from
> kvmalloc support for __GFP_NOFAIL because they have allocations that
> cannot fail and they do not fit into a single page.
>
> The larg part of the vmalloc implementation already complies with the
> given gfp flags so there is no work for those to be done. The area
> and page table allocations are an exception to that. Implement a retry
> loop for those.
>
> Add a short sleep before retrying. 1 jiffy is a completely random
> timeout. Ideally the retry would wait for an explicit event - e.g.
> a change to the vmalloc space change if the failure was caused by
> the space fragmentation or depletion. But there are multiple different
> reasons to retry and this could become much more complex. Keep the retry
> simple for now and just sleep to prevent from hogging CPUs.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmalloc.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index c6cc77d2f366..602649919a9d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>         else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
>                 flags = memalloc_noio_save();
>
> -       ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> +       do {
> +               ret = vmap_pages_range(addr, addr + size, prot, area->pages,
>                         page_shift);
> +               if (ret < 0)
> +                       schedule_timeout_uninterruptible(1);
> +       } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
>

1.
After that change, the code below:

<snip>
if (ret < 0) {
    warn_alloc(orig_gfp_mask, NULL,
        "vmalloc error: size %lu, failed to map pages",
        area->nr_pages * PAGE_SIZE);
    goto fail;
}
<snip>

does not make any sense anymore.

2.
Can we combine the two places where we handle __GFP_NOFAIL into one?
That would look cleaner.

-- 
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-10-26 15:48   ` Uladzislau Rezki
@ 2021-10-26 16:28     ` Michal Hocko
  2021-10-26 19:33       ` Uladzislau Rezki
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2021-10-26 16:28 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Linux Memory Management List, Dave Chinner, Neil Brown,
	Andrew Morton, Christoph Hellwig, linux-fsdevel, LKML,
	Ilya Dryomov, Jeff Layton

On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote:
> > From: Michal Hocko <mhocko@suse.com>
> >
> > Dave Chinner has mentioned that some of the xfs code would benefit from
> > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > cannot fail and they do not fit into a single page.
> >
> > The larg part of the vmalloc implementation already complies with the
> > given gfp flags so there is no work for those to be done. The area
> > and page table allocations are an exception to that. Implement a retry
> > loop for those.
> >
> > Add a short sleep before retrying. 1 jiffy is a completely random
> > timeout. Ideally the retry would wait for an explicit event - e.g.
> > a change to the vmalloc space change if the failure was caused by
> > the space fragmentation or depletion. But there are multiple different
> > reasons to retry and this could become much more complex. Keep the retry
> > simple for now and just sleep to prevent from hogging CPUs.
> >
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/vmalloc.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index c6cc77d2f366..602649919a9d 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >         else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> >                 flags = memalloc_noio_save();
> >
> > -       ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > +       do {
> > +               ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> >                         page_shift);
> > +               if (ret < 0)
> > +                       schedule_timeout_uninterruptible(1);
> > +       } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
> >
> 
> 1.
> After that change a below code:
> 
> <snip>
> if (ret < 0) {
>     warn_alloc(orig_gfp_mask, NULL,
>         "vmalloc error: size %lu, failed to map pages",
>         area->nr_pages * PAGE_SIZE);
>     goto fail;
> }
> <snip>
> 
> does not make any sense anymore.

Why? Allocations without __GFP_NOFAIL can still fail, no?

> 2.
> Can we combine two places where we handle __GFP_NOFAIL into one place?
> That would look like as more sorted out.

I have to admit I am not really fluent in the vmalloc code so I wanted to
make the code as simple as possible. How would I unwind all the allocated
memory (already allocated as GFP_NOFAIL) before retrying at
__vmalloc_node_range (if that is what you suggest)? And isn't that a
bit wasteful?

Or did you have anything else in mind?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-10-26 16:28     ` Michal Hocko
@ 2021-10-26 19:33       ` Uladzislau Rezki
  2021-10-27  6:46         ` Michal Hocko
  2021-10-27 17:55         ` Uladzislau Rezki
  0 siblings, 2 replies; 26+ messages in thread
From: Uladzislau Rezki @ 2021-10-26 19:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Uladzislau Rezki, Linux Memory Management List, Dave Chinner,
	Neil Brown, Andrew Morton, Christoph Hellwig, linux-fsdevel,
	LKML, Ilya Dryomov, Jeff Layton

On Tue, Oct 26, 2021 at 06:28:52PM +0200, Michal Hocko wrote:
> On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > >
> > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > cannot fail and they do not fit into a single page.
> > >
> > > The larg part of the vmalloc implementation already complies with the
> > > given gfp flags so there is no work for those to be done. The area
> > > and page table allocations are an exception to that. Implement a retry
> > > loop for those.
> > >
> > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > a change to the vmalloc space change if the failure was caused by
> > > the space fragmentation or depletion. But there are multiple different
> > > reasons to retry and this could become much more complex. Keep the retry
> > > simple for now and just sleep to prevent from hogging CPUs.
> > >
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  mm/vmalloc.c | 10 +++++++++-
> > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index c6cc77d2f366..602649919a9d 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > >         else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> > >                 flags = memalloc_noio_save();
> > >
> > > -       ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > +       do {
> > > +               ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > >                         page_shift);
> > > +               if (ret < 0)
> > > +                       schedule_timeout_uninterruptible(1);
> > > +       } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
> > >
> > 
> > 1.
> > After that change a below code:
> > 
> > <snip>
> > if (ret < 0) {
> >     warn_alloc(orig_gfp_mask, NULL,
> >         "vmalloc error: size %lu, failed to map pages",
> >         area->nr_pages * PAGE_SIZE);
> >     goto fail;
> > }
> > <snip>
> > 
> > does not make any sense anymore.
> 
> Why? Allocations without __GFP_NOFAIL can still fail, no?
> 
Right. I meant one thing but wrote something slightly different. If
vmap_pages_range() fails while __GFP_NOFAIL is set, should we emit any
warning message? Either we recover on a later iteration or we are stuck
there forever, in which case the user cannot tell what happened.
On the other hand, this is how __GFP_NOFAIL works, hm..

Another thing: I see that schedule_timeout_uninterruptible(1) is invoked
in all cases, even when __GFP_NOFAIL is not set. In that scenario we do
not want to wait; we should return to the caller as soon as possible.
Or am I missing something here?

> > 2.
> > Can we combine two places where we handle __GFP_NOFAIL into one place?
> > That would look like as more sorted out.
> 
> I have to admit I am not really fluent at vmalloc code so I wanted to
> make the code as simple as possible. How would I unwind all the allocated
> memory (already allocated as GFP_NOFAIL) before retrying at
> __vmalloc_node_range (if that is what you suggest). And isn't that a
> bit wasteful?
> 
> Or did you have anything else in mind?
>
It depends on how often all this can fail. But let me double check
whether combining them is easy.

--
Vlad Rezki


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-10-26 19:33       ` Uladzislau Rezki
@ 2021-10-27  6:46         ` Michal Hocko
  2021-10-27 17:55         ` Uladzislau Rezki
  1 sibling, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2021-10-27  6:46 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Linux Memory Management List, Dave Chinner, Neil Brown,
	Andrew Morton, Christoph Hellwig, linux-fsdevel, LKML,
	Ilya Dryomov, Jeff Layton

On Tue 26-10-21 21:33:15, Uladzislau Rezki wrote:
> On Tue, Oct 26, 2021 at 06:28:52PM +0200, Michal Hocko wrote:
> > On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > >
> > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > cannot fail and they do not fit into a single page.
> > > >
> > > > The larg part of the vmalloc implementation already complies with the
> > > > given gfp flags so there is no work for those to be done. The area
> > > > and page table allocations are an exception to that. Implement a retry
> > > > loop for those.
> > > >
> > > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > > a change to the vmalloc space change if the failure was caused by
> > > > the space fragmentation or depletion. But there are multiple different
> > > > reasons to retry and this could become much more complex. Keep the retry
> > > > simple for now and just sleep to prevent from hogging CPUs.
> > > >
> > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > ---
> > > >  mm/vmalloc.c | 10 +++++++++-
> > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index c6cc77d2f366..602649919a9d 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > > >         else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> > > >                 flags = memalloc_noio_save();
> > > >
> > > > -       ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > > +       do {
> > > > +               ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > >                         page_shift);
> > > > +               if (ret < 0)
> > > > +                       schedule_timeout_uninterruptible(1);
> > > > +       } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
> > > >
> > > 
> > > 1.
> > > After that change a below code:
> > > 
> > > <snip>
> > > if (ret < 0) {
> > >     warn_alloc(orig_gfp_mask, NULL,
> > >         "vmalloc error: size %lu, failed to map pages",
> > >         area->nr_pages * PAGE_SIZE);
> > >     goto fail;
> > > }
> > > <snip>
> > > 
> > > does not make any sense anymore.
> > 
> > Why? Allocations without __GFP_NOFAIL can still fail, no?
> > 
> Right. I meant one thing but wrote slightly differently. In case of
> vmap_pages_range() fails(if __GFP_NOFAIL is set) should we emit any
> warning message? Because either we can recover on a future iteration
> or it stuck there infinitely so a user does not understand what happened.
> From the other hand this is how __GFP_NOFAIL works, hm..

Yes, the page allocator doesn't warn either and I would like to keep
this in sync.

> Another thing, i see that schedule_timeout_uninterruptible(1) is invoked
> for all cases even when __GFP_NOFAIL is not set, in that scenario we do
> not want to wait, instead we should return back to a caller asap. Or am
> i missing something here?

OK, I will change that.
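
For illustration only, the behaviour agreed on here (retry and sleep only
when __GFP_NOFAIL is set, otherwise hand the failure back to the caller at
once) can be modelled in plain userspace C. This is a minimal sketch and
not the kernel patch: SKETCH_GFP_NOFAIL, struct sketch_mapper and the
sketch_* helpers are hypothetical stand-ins for __GFP_NOFAIL,
vmap_pages_range() and schedule_timeout_uninterruptible(1).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for __GFP_NOFAIL; not the kernel's value. */
#define SKETCH_GFP_NOFAIL 0x1u

/* Hypothetical mapping step that fails a fixed number of times. */
struct sketch_mapper {
	int failures_left;	/* how many more attempts will fail */
	int attempts;		/* mapping attempts made so far */
	int sleeps;		/* times the caller backed off */
};

/* Stand-in for vmap_pages_range(): 0 on success, -1 on failure. */
static int sketch_map_pages(struct sketch_mapper *m)
{
	assert(m);
	m->attempts++;
	if (m->failures_left > 0) {
		m->failures_left--;
		return -1;
	}
	return 0;
}

/* Stand-in for schedule_timeout_uninterruptible(1): only counts. */
static void sketch_sleep_one_tick(struct sketch_mapper *m)
{
	m->sleeps++;
}

/*
 * Retry the mapping only for NOFAIL requests. Any other request
 * returns the first failure immediately, without sleeping.
 */
static int sketch_map_with_retry(struct sketch_mapper *m, unsigned int gfp_mask)
{
	bool nofail = (gfp_mask & SKETCH_GFP_NOFAIL) != 0;
	int ret;

	for (;;) {
		ret = sketch_map_pages(m);
		if (ret == 0 || !nofail)
			return ret;
		sketch_sleep_one_tick(m);
	}
}
```

With a mapper set to fail three times, a SKETCH_GFP_NOFAIL request makes
four attempts and sleeps three times before returning 0, while a request
without the flag returns -1 after one attempt and never sleeps.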
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-10-26 19:33       ` Uladzislau Rezki
  2021-10-27  6:46         ` Michal Hocko
@ 2021-10-27 17:55         ` Uladzislau Rezki
  2021-10-29  7:57           ` Michal Hocko
  1 sibling, 1 reply; 26+ messages in thread
From: Uladzislau Rezki @ 2021-10-27 17:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michal Hocko, Linux Memory Management List, Dave Chinner,
	Neil Brown, Andrew Morton, Christoph Hellwig, linux-fsdevel,
	LKML, Ilya Dryomov, Jeff Layton

On Tue, Oct 26, 2021 at 09:33:15PM +0200, Uladzislau Rezki wrote:
> On Tue, Oct 26, 2021 at 06:28:52PM +0200, Michal Hocko wrote:
> > On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > >
> > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > cannot fail and they do not fit into a single page.
> > > >
> > > > The larg part of the vmalloc implementation already complies with the
> > > > given gfp flags so there is no work for those to be done. The area
> > > > and page table allocations are an exception to that. Implement a retry
> > > > loop for those.
> > > >
> > > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > > a change to the vmalloc space change if the failure was caused by
> > > > the space fragmentation or depletion. But there are multiple different
> > > > reasons to retry and this could become much more complex. Keep the retry
> > > > simple for now and just sleep to prevent from hogging CPUs.
> > > >
> > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > ---
> > > >  mm/vmalloc.c | 10 +++++++++-
> > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index c6cc77d2f366..602649919a9d 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > > >         else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> > > >                 flags = memalloc_noio_save();
> > > >
> > > > -       ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > > +       do {
> > > > +               ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > >                         page_shift);
> > > > +               if (ret < 0)
> > > > +                       schedule_timeout_uninterruptible(1);
> > > > +       } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
> > > >
> > > 
> > > 1.
> > > After that change a below code:
> > > 
> > > <snip>
> > > if (ret < 0) {
> > >     warn_alloc(orig_gfp_mask, NULL,
> > >         "vmalloc error: size %lu, failed to map pages",
> > >         area->nr_pages * PAGE_SIZE);
> > >     goto fail;
> > > }
> > > <snip>
> > > 
> > > does not make any sense anymore.
> > 
> > Why? Allocations without __GFP_NOFAIL can still fail, no?
> > 
> Right. I meant one thing but wrote slightly differently. In case of
> vmap_pages_range() fails(if __GFP_NOFAIL is set) should we emit any
> warning message? Because either we can recover on a future iteration
> or it stuck there infinitely so a user does not understand what happened.
> From the other hand this is how __GFP_NOFAIL works, hm..
> 
> Another thing, i see that schedule_timeout_uninterruptible(1) is invoked
> for all cases even when __GFP_NOFAIL is not set, in that scenario we do
> not want to wait, instead we should return back to a caller asap. Or am
> i missing something here?
> 
> > > 2.
> > > Can we combine two places where we handle __GFP_NOFAIL into one place?
> > > That would look like as more sorted out.
> > 
> > I have to admit I am not really fluent at vmalloc code so I wanted to
> > make the code as simple as possible. How would I unwind all the allocated
> > memory (already allocated as GFP_NOFAIL) before retrying at
> > __vmalloc_node_range (if that is what you suggest). And isn't that a
> > bit wasteful?
> > 
> > Or did you have anything else in mind?
> >
> It depends on how often all this can fail. But let me double check if
> such combining is easy.
> 
I mean something like the patch below. The idea is not to spread the
__GFP_NOFAIL handling across the vmalloc file but to keep it in one place:

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d77830ff604c..f4b7927e217e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2889,8 +2889,14 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	unsigned long array_size;
 	unsigned int nr_small_pages = size >> PAGE_SHIFT;
 	unsigned int page_order;
+	unsigned long flags;
+	int ret;
 
 	array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
+
+	/*
+	 * I do not understand why we do not want to see warning messages here.
+	 */
 	gfp_mask |= __GFP_NOWARN;
 	if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
 		gfp_mask |= __GFP_HIGHMEM;
@@ -2930,8 +2936,23 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 		goto fail;
 	}
 
-	if (vmap_pages_range(addr, addr + size, prot, area->pages,
-			page_shift) < 0) {
+	/*
+	 * page tables allocations ignore external gfp mask, enforce it
+	 * by the scope API
+	 */
+	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
+		flags = memalloc_nofs_save();
+	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
+		flags = memalloc_noio_save();
+
+	ret = vmap_pages_range(addr, addr + size, prot, area->pages, page_shift);
+
+	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
+		memalloc_nofs_restore(flags);
+	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
+		memalloc_noio_restore(flags);
+
+	if (ret < 0) {
 		warn_alloc(gfp_mask, NULL,
 			"vmalloc error: size %lu, failed to map pages",
 			area->nr_pages * PAGE_SIZE);
@@ -2984,6 +3005,12 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 		return NULL;
 	}
 
+	/*
+	 * Suppress all warnings for __GFP_NOFAIL allocation.
+	 */
+	if (gfp_mask & __GFP_NOFAIL)
+		gfp_mask |= __GFP_NOWARN;
+
 	if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
 		unsigned long size_per_node;
 
@@ -3010,16 +3037,22 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 	area = __get_vm_area_node(real_size, align, shift, VM_ALLOC |
 				  VM_UNINITIALIZED | vm_flags, start, end, node,
 				  gfp_mask, caller);
-	if (!area) {
-		warn_alloc(gfp_mask, NULL,
-			"vmalloc error: size %lu, vm_struct allocation failed",
-			real_size);
-		goto fail;
-	}
+	if (area)
+		addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
+
+	if (!area || !addr) {
+		if (gfp_mask & __GFP_NOFAIL) {
+			schedule_timeout_uninterruptible(1);
+			goto again;
+		}
+
+		if (!area)
+			warn_alloc(gfp_mask, NULL,
+				"vmalloc error: size %lu, vm_struct allocation failed",
+				real_size);
 
-	addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
-	if (!addr)
 		goto fail;
+	}
 
 	/*
 	 * In this function, newly allocated vm_struct has VM_UNINITIALIZED
<snip>

--
Vlad Rezki


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-10-27 17:55         ` Uladzislau Rezki
@ 2021-10-29  7:57           ` Michal Hocko
  2021-10-29 14:05             ` Uladzislau Rezki
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2021-10-29  7:57 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Linux Memory Management List, Dave Chinner, Neil Brown,
	Andrew Morton, Christoph Hellwig, linux-fsdevel, LKML,
	Ilya Dryomov, Jeff Layton

On Wed 27-10-21 19:55:50, Uladzislau Rezki wrote:
> On Tue, Oct 26, 2021 at 09:33:15PM +0200, Uladzislau Rezki wrote:
> > On Tue, Oct 26, 2021 at 06:28:52PM +0200, Michal Hocko wrote:
> > > On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote:
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > >
> > > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > > cannot fail and they do not fit into a single page.
> > > > >
> > > > > The larg part of the vmalloc implementation already complies with the
> > > > > given gfp flags so there is no work for those to be done. The area
> > > > > and page table allocations are an exception to that. Implement a retry
> > > > > loop for those.
> > > > >
> > > > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > > > a change to the vmalloc space change if the failure was caused by
> > > > > the space fragmentation or depletion. But there are multiple different
> > > > > reasons to retry and this could become much more complex. Keep the retry
> > > > > simple for now and just sleep to prevent from hogging CPUs.
> > > > >
> > > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > > ---
> > > > >  mm/vmalloc.c | 10 +++++++++-
> > > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index c6cc77d2f366..602649919a9d 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > > > >         else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> > > > >                 flags = memalloc_noio_save();
> > > > >
> > > > > -       ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > > > +       do {
> > > > > +               ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > > >                         page_shift);
> > > > > +               if (ret < 0)
> > > > > +                       schedule_timeout_uninterruptible(1);
> > > > > +       } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
> > > > >
> > > > 
> > > > 1.
> > > > After that change a below code:
> > > > 
> > > > <snip>
> > > > if (ret < 0) {
> > > >     warn_alloc(orig_gfp_mask, NULL,
> > > >         "vmalloc error: size %lu, failed to map pages",
> > > >         area->nr_pages * PAGE_SIZE);
> > > >     goto fail;
> > > > }
> > > > <snip>
> > > > 
> > > > does not make any sense anymore.
> > > 
> > > Why? Allocations without __GFP_NOFAIL can still fail, no?
> > > 
> > Right. I meant one thing but wrote slightly differently. In case of
> > vmap_pages_range() fails(if __GFP_NOFAIL is set) should we emit any
> > warning message? Because either we can recover on a future iteration
> > or it stuck there infinitely so a user does not understand what happened.
> > From the other hand this is how __GFP_NOFAIL works, hm..
> > 
> > Another thing, i see that schedule_timeout_uninterruptible(1) is invoked
> > for all cases even when __GFP_NOFAIL is not set, in that scenario we do
> > not want to wait, instead we should return back to a caller asap. Or am
> > i missing something here?
> > 
> > > > 2.
> > > > Can we combine two places where we handle __GFP_NOFAIL into one place?
> > > > That would look like as more sorted out.
> > > 
> > > I have to admit I am not really fluent at vmalloc code so I wanted to
> > > make the code as simple as possible. How would I unwind all the allocated
> > > memory (already allocated as GFP_NOFAIL) before retrying at
> > > __vmalloc_node_range (if that is what you suggest). And isn't that a
> > > bit wasteful?
> > > 
> > > Or did you have anything else in mind?
> > >
> > It depends on how often all this can fail. But let me double check if
> > such combining is easy.
> > 
> I mean something like below. The idea is to not spread the __GFP_NOFAIL
> across the vmalloc file keeping it in one solid place:
> 
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d77830ff604c..f4b7927e217e 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2889,8 +2889,14 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	unsigned long array_size;
>  	unsigned int nr_small_pages = size >> PAGE_SHIFT;
>  	unsigned int page_order;
> +	unsigned long flags;
> +	int ret;
>  
>  	array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
> +
> +	/*
> +	 * This is i do not understand why we do not want to see warning messages.
> +	 */
>  	gfp_mask |= __GFP_NOWARN;

I suspect this is because vmalloc wants to have its own failure
reporting.

[...]
> @@ -3010,16 +3037,22 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  	area = __get_vm_area_node(real_size, align, shift, VM_ALLOC |
>  				  VM_UNINITIALIZED | vm_flags, start, end, node,
>  				  gfp_mask, caller);
> -	if (!area) {
> -		warn_alloc(gfp_mask, NULL,
> -			"vmalloc error: size %lu, vm_struct allocation failed",
> -			real_size);
> -		goto fail;
> -	}
> +	if (area)
> +		addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
> +
> +	if (!area || !addr) {
> +		if (gfp_mask & __GFP_NOFAIL) {
> +			schedule_timeout_uninterruptible(1);
> +			goto again;
> +		}
> +
> +		if (!area)
> +			warn_alloc(gfp_mask, NULL,
> +				"vmalloc error: size %lu, vm_struct allocation failed",
> +				real_size);
>  
> -	addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
> -	if (!addr)
>  		goto fail;
> +	}
>  
>  	/*
>  	 * In this function, newly allocated vm_struct has VM_UNINITIALIZED
> <snip>

OK, this looks easier to read, but isn't it quite wasteful to throw away
all the pages backing the area (all of them allocated as __GFP_NOFAIL)
just because we then fail to allocate a few page table pages, and to
drop all of that on the floor (this will happen in __vunmap AFAICS)?

I mean I do not care all that strongly but it seems to me that more
changes would need to be done here and optimizations can be done on top.

Is this something you feel strongly about?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-10-29  7:57           ` Michal Hocko
@ 2021-10-29 14:05             ` Uladzislau Rezki
  2021-10-29 14:45               ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Uladzislau Rezki @ 2021-10-29 14:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux Memory Management List, Dave Chinner, Neil Brown,
	Andrew Morton, Christoph Hellwig, linux-fsdevel, LKML,
	Ilya Dryomov, Jeff Layton

> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index d77830ff604c..f4b7927e217e 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2889,8 +2889,14 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >       unsigned long array_size;
> >       unsigned int nr_small_pages = size >> PAGE_SHIFT;
> >       unsigned int page_order;
> > +     unsigned long flags;
> > +     int ret;
> >
> >       array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
> > +
> > +     /*
> > +      * This is i do not understand why we do not want to see warning messages.
> > +      */
> >       gfp_mask |= __GFP_NOWARN;
>
> I suspect this is becauser vmalloc wants to have its own failure
> reporting.
>
But as I see it, this is broken. All three warn_alloc() reports in
__vmalloc_area_node() are useless because __GFP_NOWARN is added on top
of gfp_mask:

void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
{
        struct va_format vaf;
        va_list args;
        static DEFINE_RATELIMIT_STATE(nopage_rs, 10*HZ, 1);

        if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
                return;
...

so every report is silently dropped once __GFP_NOWARN is set.
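
To make the effect concrete, here is a minimal userspace sketch, not
kernel code. SKETCH_GFP_NOWARN and all sketch_* names are hypothetical
stand-ins, and the rate limiting done by the real warn_alloc() is left
out. It contrasts reporting with the modified mask against reporting
with a saved copy of the caller's mask, which is an assumption about how
the reports could be made to work rather than a tested fix.

```c
#include <assert.h>

/* Hypothetical stand-in for __GFP_NOWARN; not the kernel's value. */
#define SKETCH_GFP_NOWARN 0x2u

static int sketch_warnings_emitted;

/* Models the early return in warn_alloc(), without the rate limit. */
static void sketch_warn_alloc(unsigned int gfp_mask)
{
	if (gfp_mask & SKETCH_GFP_NOWARN)
		return;
	sketch_warnings_emitted++;
}

/* Models a nested allocation that fails and must stay quiet itself. */
static int sketch_nested_alloc_fails(unsigned int gfp_mask)
{
	assert(gfp_mask & SKETCH_GFP_NOWARN);
	return -1;
}

/* Reports with the mask that already carries NOWARN: nothing is printed. */
static void sketch_report_with_modified_mask(unsigned int gfp_mask)
{
	gfp_mask |= SKETCH_GFP_NOWARN;
	if (sketch_nested_alloc_fails(gfp_mask) < 0)
		sketch_warn_alloc(gfp_mask);
}

/* Reports with a saved copy of the caller's mask: the caller decides. */
static void sketch_report_with_orig_mask(unsigned int gfp_mask)
{
	const unsigned int orig_gfp_mask = gfp_mask;

	gfp_mask |= SKETCH_GFP_NOWARN;
	if (sketch_nested_alloc_fails(gfp_mask) < 0)
		sketch_warn_alloc(orig_gfp_mask);
}
```

In this model the first variant never increments the counter, whatever
the caller passed in. The second variant warns unless the caller asked
for SKETCH_GFP_NOWARN explicitly.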

> [...]
> > @@ -3010,16 +3037,22 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> >       area = __get_vm_area_node(real_size, align, shift, VM_ALLOC |
> >                                 VM_UNINITIALIZED | vm_flags, start, end, node,
> >                                 gfp_mask, caller);
> > -     if (!area) {
> > -             warn_alloc(gfp_mask, NULL,
> > -                     "vmalloc error: size %lu, vm_struct allocation failed",
> > -                     real_size);
> > -             goto fail;
> > -     }
> > +     if (area)
> > +             addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
> > +
> > +     if (!area || !addr) {
> > +             if (gfp_mask & __GFP_NOFAIL) {
> > +                     schedule_timeout_uninterruptible(1);
> > +                     goto again;
> > +             }
> > +
> > +             if (!area)
> > +                     warn_alloc(gfp_mask, NULL,
> > +                             "vmalloc error: size %lu, vm_struct allocation failed",
> > +                             real_size);
> >
> > -     addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
> > -     if (!addr)
> >               goto fail;
> > +     }
> >
> >       /*
> >        * In this function, newly allocated vm_struct has VM_UNINITIALIZED
> > <snip>
>
> OK, this looks easier from the code reading but isn't it quite wasteful
> to throw all the pages backing the area (all of them allocated as
> __GFP_NOFAIL) just to then fail to allocate few page tables pages and
> drop all of that on the floor (this will happen in __vunmap AFAICS).
>
> I mean I do not care all that strongly but it seems to me that more
> changes would need to be done here and optimizations can be done on top.
>
> Is this something you feel strongly about?
>
I will try to provide some motivation :)

It depends on how you look at it. My view is that simpler code is
preferred. This is not a hot path; to me it is rather a corner case.
I think "unwinding" has some advantage. At least one motivation is to
release the memory (on failure) before the delay, so that no extra memory
is held if the __GFP_NOFAIL allocation never succeeds, i.e. if a process
is stuck due to __GFP_NOFAIL, it does not "hold" extra memory forever.

-- 
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-10-29 14:05             ` Uladzislau Rezki
@ 2021-10-29 14:45               ` Michal Hocko
  2021-10-29 17:23                 ` Uladzislau Rezki
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2021-10-29 14:45 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Linux Memory Management List, Dave Chinner, Neil Brown,
	Andrew Morton, Christoph Hellwig, linux-fsdevel, LKML,
	Ilya Dryomov, Jeff Layton

On Fri 29-10-21 16:05:32, Uladzislau Rezki wrote:
[...]
> > OK, this looks easier from the code reading but isn't it quite wasteful
> > to throw all the pages backing the area (all of them allocated as
> > __GFP_NOFAIL) just to then fail to allocate few page tables pages and
> > drop all of that on the floor (this will happen in __vunmap AFAICS).
> >
> > I mean I do not care all that strongly but it seems to me that more
> > changes would need to be done here and optimizations can be done on top.
> >
> > Is this something you feel strongly about?
> >
> Will try to provide some motivations :)
> 
> It depends on how to look at it. My view is as follows a more simple code
> is preferred. It is not considered as a hot path and it is rather a corner
> case to me.

Yes, we are definitely talking about corner cases here. Even GFP_KERNEL
allocations usually do not fail.

> I think "unwinding" has some advantage. At least one motivation
> is to release a memory(on failure) before a delay that will prevent holding
> of extra memory in case of __GFP_NOFAIL infinitelly does not succeed, i.e.
> if a process stuck due to __GFP_NOFAIL it does not "hold" an extra memory
> forever.

Well, I suspect this is something that we can disagree on and both of us
would be kinda right. I would see it as throwing the baby out with the
bathwater. The vast majority of the memory will be in the area pages, and
sacrificing that just to allocate a few page tables or whatever else might
fail in that code path is just a lot of cycles wasted.

So unless you really feel strongly about this then I would stick with
this approach.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL
  2021-10-29 14:45               ` Michal Hocko
@ 2021-10-29 17:23                 ` Uladzislau Rezki
  0 siblings, 0 replies; 26+ messages in thread
From: Uladzislau Rezki @ 2021-10-29 17:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Uladzislau Rezki, Linux Memory Management List, Dave Chinner,
	Neil Brown, Andrew Morton, Christoph Hellwig, linux-fsdevel,
	LKML, Ilya Dryomov, Jeff Layton

> On Fri 29-10-21 16:05:32, Uladzislau Rezki wrote:
> [...]
> > > OK, this looks easier from the code reading but isn't it quite wasteful
> > > to throw all the pages backing the area (all of them allocated as
> > > __GFP_NOFAIL) just to then fail to allocate few page tables pages and
> > > drop all of that on the floor (this will happen in __vunmap AFAICS).
> > >
> > > I mean I do not care all that strongly but it seems to me that more
> > > changes would need to be done here and optimizations can be done on top.
> > >
> > > Is this something you feel strongly about?
> > >
> > Will try to provide some motivations :)
> > 
> > It depends on how to look at it. My view is as follows a more simple code
> > is preferred. It is not considered as a hot path and it is rather a corner
> > case to me.
> 
> Yes, we are definitely talking about corner cases here. Even GFP_KERNEL
> allocations usually do not fail.
> 
> > I think "unwinding" has some advantage. At least one motivation
> > is to release a memory(on failure) before a delay that will prevent holding
> > of extra memory in case of __GFP_NOFAIL infinitelly does not succeed, i.e.
> > if a process stuck due to __GFP_NOFAIL it does not "hold" an extra memory
> > forever.
> 
> Well, I suspect this is something that we can disagree on and both of us
> would be kinda right. I would see it as throwing baby out with the
> bathwater. The vast majority of the memory will be in the area pages and
> sacrificing that just to allocate few page tables or whatever that might
> fail in that code path is just a lot of cycles wasted.
>
We are not talking about performance, so there is no sense in measuring
cycles here :)

> 
> So unless you really feel strongly about this then I would stick with
> this approach.
>
I have raised one concern. Memory is a resource shared between all
processes. With __GFP_NOFAIL we might never return to the caller; in
that scenario I prefer to release the held memory for other needs
instead of keeping it for nothing.

If you think it is not a problem, then I do not have much to say.
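
The trade-off argued over in this subthread can be put in numbers with a
toy model. This is only a sketch under simplifying assumptions: backing
page allocation always succeeds, only the mapping step fails, and
struct sketch_stats and both sketch_* functions are hypothetical names
that do not exist in mm/vmalloc.c.

```c
#include <assert.h>

struct sketch_stats {
	unsigned long pages_allocated;	 /* backing-page allocations performed */
	unsigned long pages_held_asleep; /* pages held, summed over every sleep */
};

/*
 * Retry in place: backing pages are allocated once and kept while
 * only the failing mapping step is retried after each sleep.
 */
static struct sketch_stats sketch_retry_in_place(unsigned long nr_pages,
						 unsigned int map_failures)
{
	struct sketch_stats s = { 0, 0 };
	unsigned int i;

	assert(nr_pages > 0);
	s.pages_allocated = nr_pages;
	for (i = 0; i < map_failures; i++)
		s.pages_held_asleep += nr_pages; /* sleeping with pages held */
	return s;
}

/*
 * Unwind and retry: every failed attempt frees all backing pages
 * before sleeping, so each new attempt allocates them again.
 */
static struct sketch_stats sketch_unwind_and_retry(unsigned long nr_pages,
						   unsigned int map_failures)
{
	struct sketch_stats s = { 0, 0 };
	unsigned int i;

	assert(nr_pages > 0);
	for (i = 0; i <= map_failures; i++)
		s.pages_allocated += nr_pages; /* full reallocation per attempt */
	return s;
}
```

For 1024 pages and three mapping failures, retrying in place allocates
1024 pages and sleeps three times with all of them held, while unwinding
allocates 4096 pages in total and holds none during any sleep. Which
cost matters more is exactly the point of disagreement above.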

--
Vlad Rezki


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-10-29 17:23 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
2021-10-25 15:02 [PATCH 0/4] extend vmalloc support for constrained allocations Michal Hocko
2021-10-25 15:02 ` [PATCH 1/4] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc Michal Hocko
2021-10-25 15:02 ` [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL Michal Hocko
2021-10-25 22:59   ` NeilBrown
2021-10-26  7:03     ` Michal Hocko
2021-10-26 10:30       ` NeilBrown
2021-10-26 11:29         ` Michal Hocko
2021-10-26 15:48   ` Uladzislau Rezki
2021-10-26 16:28     ` Michal Hocko
2021-10-26 19:33       ` Uladzislau Rezki
2021-10-27  6:46         ` Michal Hocko
2021-10-27 17:55         ` Uladzislau Rezki
2021-10-29  7:57           ` Michal Hocko
2021-10-29 14:05             ` Uladzislau Rezki
2021-10-29 14:45               ` Michal Hocko
2021-10-29 17:23                 ` Uladzislau Rezki
2021-10-25 15:02 ` [PATCH 3/4] mm/vmalloc: be more explicit about supported gfp flags Michal Hocko
2021-10-25 23:26   ` NeilBrown
2021-10-26  7:10     ` Michal Hocko
2021-10-26 10:43       ` NeilBrown
2021-10-26 12:20         ` Michal Hocko
2021-10-25 15:02 ` [PATCH 4/4] mm: allow !GFP_KERNEL allocations for kvmalloc Michal Hocko
2021-10-25 23:34   ` NeilBrown
2021-10-26  7:15     ` Michal Hocko
2021-10-26 10:48       ` NeilBrown
2021-10-26 12:23         ` Michal Hocko
