All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm: allow huge kvmalloc() calls if they're accounted to memcg
@ 2021-10-16  6:51 Paolo Bonzini
  2021-10-18 15:09 ` Kees Cook
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2021-10-16  6:51 UTC (permalink / raw)
  To: torvalds, kvm
  Cc: linux-kernel, seanjc, Willy Tarreau, Kees Cook,
	syzbot+e0de2333cbf95ea473e8

Commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls")
restricted memory allocation with 'kvmalloc()' to sizes that fit
in an 'int', to protect against trivial integer conversion issues.

However, the WARN triggers with KVM, when it allocates ancillary page
data whose size essentially depends on whatever userspace has passed to
the KVM_SET_USER_MEMORY_REGION ioctl.  The warnings are easily raised by
syzkaller, but the largest allocation that KVM can do is 8 bytes per page
of guest memory; therefore, a 1 TiB memslot will cause a warning even
outside fuzzing, and those allocations are known to happen in the wild.
Google for example already has VMs that create 1.5tb memslots (12tb of
total guest memory spread across 8 virtual NUMA nodes).

Use memcg accounting as evidence that the crazy large allocations are
expected---in which case, it is indeed a good idea to have them
properly accounted---and exempt them from the warning.

Cc: Willy Tarreau <w@1wt.eu>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: syzbot+e0de2333cbf95ea473e8@syzkaller.appspotmail.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
	Linus, what do you think of this?  It is a bit of a hack,
	but the reasoning in the commit message does make at least
	some sense.

	The alternative would be to just use __vmalloc in KVM, and add
	__vcalloc too.	The two underscores would suggest that something
	"different" is going on, but I wonder what you prefer between
	this and having a __vcalloc with 2-3 uses in the whole source.

 mm/util.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index 499b6b5767ed..31fca4a999c6 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -593,8 +593,12 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	if (ret || size <= PAGE_SIZE)
 		return ret;
 
-	/* Don't even allow crazy sizes */
-	if (WARN_ON_ONCE(size > INT_MAX))
+	/*
+	 * Don't even allow crazy sizes unless memcg accounting is
+	 * request.  We take that as a sign that huge allocations
+	 * are indeed expected.
+	 */
+	if (likely(!(flags & __GFP_ACCOUNT)) && WARN_ON_ONCE(size > INT_MAX))
 		return NULL;
 
 	return __vmalloc_node(size, 1, flags, node,
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: allow huge kvmalloc() calls if they're accounted to memcg
  2021-10-16  6:51 [PATCH] mm: allow huge kvmalloc() calls if they're accounted to memcg Paolo Bonzini
@ 2021-10-18 15:09 ` Kees Cook
  0 siblings, 0 replies; 9+ messages in thread
From: Kees Cook @ 2021-10-18 15:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: torvalds, kvm, linux-kernel, seanjc, Willy Tarreau,
	syzbot+e0de2333cbf95ea473e8

On Sat, Oct 16, 2021 at 02:51:30AM -0400, Paolo Bonzini wrote:
> Commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls")
> restricted memory allocation with 'kvmalloc()' to sizes that fit
> in an 'int', to protect against trivial integer conversion issues.
> 
> However, the WARN triggers with KVM, when it allocates ancillary page
> data whose size essentially depends on whatever userspace has passed to
> the KVM_SET_USER_MEMORY_REGION ioctl.  The warnings are easily raised by
> syzkaller, but the largest allocation that KVM can do is 8 bytes per page
> of guest memory; therefore, a 1 TiB memslot will cause a warning even
> outside fuzzing, and those allocations are known to happen in the wild.
> Google for example already has VMs that create 1.5tb memslots (12tb of
> total guest memory spread across 8 virtual NUMA nodes).
> 
> Use memcg accounting as evidence that the crazy large allocations are
> expected---in which case, it is indeed a good idea to have them
> properly accounted---and exempt them from the warning.

Will memcg always have a "sane" upper bound? If so, yeah, this seems a
better solution than dropping the WARN completely. :)

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> 
> Cc: Willy Tarreau <w@1wt.eu>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Reported-by: syzbot+e0de2333cbf95ea473e8@syzkaller.appspotmail.com
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> 	Linus, what do you think of this?  It is a bit of a hack,
> 	but the reasoning in the commit message does make at least
> 	some sense.
> 
> 	The alternative would be to just use __vmalloc in KVM, and add
> 	__vcalloc too.	The two underscores would suggest that something
> 	"different" is going on, but I wonder what you prefer between
> 	this and having a __vcalloc with 2-3 uses in the whole source.
> 
>  mm/util.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/util.c b/mm/util.c
> index 499b6b5767ed..31fca4a999c6 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -593,8 +593,12 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  	if (ret || size <= PAGE_SIZE)
>  		return ret;
>  
> -	/* Don't even allow crazy sizes */
> -	if (WARN_ON_ONCE(size > INT_MAX))
> +	/*
> +	 * Don't even allow crazy sizes unless memcg accounting is
> +	 * request.  We take that as a sign that huge allocations
> +	 * are indeed expected.
> +	 */
> +	if (likely(!(flags & __GFP_ACCOUNT)) && WARN_ON_ONCE(size > INT_MAX))
>  		return NULL;
>  
>  	return __vmalloc_node(size, 1, flags, node,
> -- 
> 2.27.0
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: allow huge kvmalloc() calls if they're accounted to memcg
  2021-10-18 15:19         ` Kees Cook
@ 2021-10-18 15:24           ` Paolo Bonzini
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Bonzini @ 2021-10-18 15:24 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linus Torvalds, KVM list, Linux Kernel Mailing List,
	Sean Christopherson, Willy Tarreau, syzbot+e0de2333cbf95ea473e8

On 18/10/21 17:19, Kees Cook wrote:
> Ah, so memcg wasn't doing sanity checks?
> 
> Is there a cheap way to resolve the question "does this much memory
> exist"? The "__" versions end up lacking context for why they're "__"
> versions. I.e. do we want something more descriptive, like
> __huge_kvmalloc_node() or __unbounded_kvmalloc_node()?

No problem with that, I think "unbounded" is descriptive enough that we 
can remove the __ too.  So that would be kvmalloc_node_unbounded / 
kvmalloc_array_unbounded / kvcallc_unbounded?

Paolo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: allow huge kvmalloc() calls if they're accounted to memcg
  2021-10-16 23:17       ` Paolo Bonzini
@ 2021-10-18 15:19         ` Kees Cook
  2021-10-18 15:24           ` Paolo Bonzini
  0 siblings, 1 reply; 9+ messages in thread
From: Kees Cook @ 2021-10-18 15:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Linus Torvalds, KVM list, Linux Kernel Mailing List,
	Sean Christopherson, Willy Tarreau, syzbot+e0de2333cbf95ea473e8

On Sun, Oct 17, 2021 at 01:17:56AM +0200, Paolo Bonzini wrote:
> On 16/10/21 20:10, Linus Torvalds wrote:
> > That said, I also do wonder if we could possibly change "kvcalloc()"
> > to avoid the warning. The reason I didn't like your patch is that
> > kvmalloc_node() only takes a "size_t", and the overflow condition
> > there is that "MAX_INT".
> > 
> > But the "kvcalloc()" case that takes a "number of elements and size"
> > should _conceptually_ warn not when the total size overflows, but when
> > either number or the element size overflows.
> 
> That makes sense, but the number could still overflow in KVM's case; the
> size is small, just 8, it's the count that's humongous.  In general,
> users of kvcalloc of kvmalloc_array *should* not be doing
> multiplications (that's the whole point of the functions), and that
> lowers a lot the risk of overflows, but the safest way is to provide
> a variant that does not warn.  See the (compile-tested only) patch
> below.
> 
> Pulling the WARN in the inline function is a bit ugly.  For kvcalloc()
> and kvmalloc_array(), one of the two is almost always constant, but
> it is unlikely that the compiler eliminates both.  The impact on a
> localyesconfig build seems to be minimal though (about 150 bytes
> larger out of 20 megabytes of code).
> 
> Paolo
> 
> ---------------- 8< -----------------
> From: Paolo Bonzini <pbonzini@redhat.com>
> Subject: [PATCH] mm: add kvmalloc variants that do not to warn
> 
> Commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls")
> restricted memory allocation with 'kvmalloc()' to sizes that fit
> in an 'int', to protect against trivial integer conversion issues.
> However, the WARN triggers with KVM when it allocates ancillary page
> data, whose size essentially depends on whatever userspace has passed to
> the KVM_SET_USER_MEMORY_REGION ioctl.  The warnings are quickly found by
> syzkaller, but they can also happen with huge but real-world VMs.
> The largest allocation that KVM can do is 8 bytes per page of guest
> memory, meaning a 1 TiB memslot will cause a warning even outside fuzzing.
> In fact, Google already has VMs that create 1.5 TiB memslots (12 TiB of
> total guest memory spread across 8 virtual NUMA nodes).
> 
> For kvcalloc() and kvmalloc_array(), Linus suggested warning if either
> the number or the size are big.  However, this would only move the
> goalpost for KVM's warning without fully avoiding it.  Therefore,
> provide a "double underscore" version of kvcalloc(), kvmalloc_array()
> and kvmalloc_node() that omits the check.
> 
> Cc: Willy Tarreau <w@1wt.eu>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Ah, so memcg wasn't doing sanity checks?

Is there a cheap way to resolve the question "does this much memory
exist"? The "__" versions end up lacking context for why they're "__"
versions. I.e. do we want something more descriptive, like
__huge_kvmalloc_node() or __unbounded_kvmalloc_node()?

-Kees

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 73a52aba448f..92aba7327bd8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -799,7 +799,15 @@ static inline int is_vmalloc_or_module_addr(const void *x)
>  }
>  #endif
> -extern void *kvmalloc_node(size_t size, gfp_t flags, int node);
> +extern void *__kvmalloc_node(size_t size, gfp_t flags, int node);
> +static inline void *kvmalloc_node(size_t size, gfp_t flags, int node)
> +{
> +	/* Don't even allow crazy sizes */
> +	if (WARN_ON(size > INT_MAX))
> +		return NULL;
> +	return __kvmalloc_node(size, flags, node);
> +}
> +
>  static inline void *kvmalloc(size_t size, gfp_t flags)
>  {
>  	return kvmalloc_node(size, flags, NUMA_NO_NODE);
> @@ -813,14 +821,31 @@ static inline void *kvzalloc(size_t size, gfp_t flags)
>  	return kvmalloc(size, flags | __GFP_ZERO);
>  }
> -static inline void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
> +static inline void *__kvmalloc_array(size_t n, size_t size, gfp_t flags)
>  {
>  	size_t bytes;
>  	if (unlikely(check_mul_overflow(n, size, &bytes)))
>  		return NULL;
> -	return kvmalloc(bytes, flags);
> +	return __kvmalloc_node(bytes, flags, NUMA_NO_NODE);
> +}
> +
> +static inline void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
> +{
> +	/*
> +	 * Don't allow crazy sizes here, either.  For 64-bit,
> +	 * this also lets the compiler avoid the overflow check.
> +	 */
> +	if (WARN_ON(size > INT_MAX || n > INT_MAX))
> +		return NULL;
> +
> +	return __kvmalloc_array(n, size, flags);
> +}
> +
> +static inline void *__kvcalloc(size_t n, size_t size, gfp_t flags)
> +{
> +	return __kvmalloc_array(n, size, flags | __GFP_ZERO);
>  }
>  static inline void *kvcalloc(size_t n, size_t size, gfp_t flags)
> diff --git a/mm/util.c b/mm/util.c
> index 499b6b5767ed..0406709d8097 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -558,7 +558,7 @@ EXPORT_SYMBOL(vm_mmap);
>   *
>   * Return: pointer to the allocated memory of %NULL in case of failure
>   */
> -void *kvmalloc_node(size_t size, gfp_t flags, int node)
> +void *__kvmalloc_node(size_t size, gfp_t flags, int node)
>  {
>  	gfp_t kmalloc_flags = flags;
>  	void *ret;
> @@ -593,14 +593,10 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  	if (ret || size <= PAGE_SIZE)
>  		return ret;
> -	/* Don't even allow crazy sizes */
> -	if (WARN_ON_ONCE(size > INT_MAX))
> -		return NULL;
> -
>  	return __vmalloc_node(size, 1, flags, node,
>  			__builtin_return_address(0));
>  }
> -EXPORT_SYMBOL(kvmalloc_node);
> +EXPORT_SYMBOL(__kvmalloc_node);
>  /**
>   * kvfree() - Free memory.
> 
> > So I would also accept a patch that just changes how "kvcalloc()"
> > works (or how "kvmalloc_array()" works).
> > 
> > It's a bit annoying how we've ended up losing that "n/size"
> > information by the time we hit kvmalloc().
> > 
> >                 Linus
> > 
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: allow huge kvmalloc() calls if they're accounted to memcg
  2021-10-16 18:10     ` Linus Torvalds
@ 2021-10-16 23:17       ` Paolo Bonzini
  2021-10-18 15:19         ` Kees Cook
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2021-10-16 23:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KVM list, Linux Kernel Mailing List, Sean Christopherson,
	Willy Tarreau, Kees Cook, syzbot+e0de2333cbf95ea473e8

On 16/10/21 20:10, Linus Torvalds wrote:
> That said, I also do wonder if we could possibly change "kvcalloc()"
> to avoid the warning. The reason I didn't like your patch is that
> kvmalloc_node() only takes a "size_t", and the overflow condition
> there is that "MAX_INT".
> 
> But the "kvcalloc()" case that takes a "number of elements and size"
> should _conceptually_ warn not when the total size overflows, but when
> either number or the element size overflows.

That makes sense, but the number could still overflow in KVM's case; the
size is small, just 8, it's the count that's humongous.  In general,
users of kvcalloc of kvmalloc_array *should* not be doing
multiplications (that's the whole point of the functions), and that
lowers a lot the risk of overflows, but the safest way is to provide
a variant that does not warn.  See the (compile-tested only) patch
below.

Pulling the WARN in the inline function is a bit ugly.  For kvcalloc()
and kvmalloc_array(), one of the two is almost always constant, but
it is unlikely that the compiler eliminates both.  The impact on a
localyesconfig build seems to be minimal though (about 150 bytes
larger out of 20 megabytes of code).

Paolo

---------------- 8< -----------------
From: Paolo Bonzini <pbonzini@redhat.com>
Subject: [PATCH] mm: add kvmalloc variants that do not to warn

Commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls")
restricted memory allocation with 'kvmalloc()' to sizes that fit
in an 'int', to protect against trivial integer conversion issues.
     
However, the WARN triggers with KVM when it allocates ancillary page
data, whose size essentially depends on whatever userspace has passed to
the KVM_SET_USER_MEMORY_REGION ioctl.  The warnings are quickly found by
syzkaller, but they can also happen with huge but real-world VMs.
The largest allocation that KVM can do is 8 bytes per page of guest
memory, meaning a 1 TiB memslot will cause a warning even outside fuzzing.
In fact, Google already has VMs that create 1.5 TiB memslots (12 TiB of
total guest memory spread across 8 virtual NUMA nodes).

For kvcalloc() and kvmalloc_array(), Linus suggested warning if either
the number or the size are big.  However, this would only move the
goalpost for KVM's warning without fully avoiding it.  Therefore,
provide a "double underscore" version of kvcalloc(), kvmalloc_array()
and kvmalloc_node() that omits the check.

Cc: Willy Tarreau <w@1wt.eu>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52aba448f..92aba7327bd8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -799,7 +799,15 @@ static inline int is_vmalloc_or_module_addr(const void *x)
  }
  #endif
  
-extern void *kvmalloc_node(size_t size, gfp_t flags, int node);
+extern void *__kvmalloc_node(size_t size, gfp_t flags, int node);
+static inline void *kvmalloc_node(size_t size, gfp_t flags, int node)
+{
+	/* Don't even allow crazy sizes */
+	if (WARN_ON(size > INT_MAX))
+		return NULL;
+	return __kvmalloc_node(size, flags, node);
+}
+
  static inline void *kvmalloc(size_t size, gfp_t flags)
  {
  	return kvmalloc_node(size, flags, NUMA_NO_NODE);
@@ -813,14 +821,31 @@ static inline void *kvzalloc(size_t size, gfp_t flags)
  	return kvmalloc(size, flags | __GFP_ZERO);
  }
  
-static inline void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
+static inline void *__kvmalloc_array(size_t n, size_t size, gfp_t flags)
  {
  	size_t bytes;
  
  	if (unlikely(check_mul_overflow(n, size, &bytes)))
  		return NULL;
  
-	return kvmalloc(bytes, flags);
+	return __kvmalloc_node(bytes, flags, NUMA_NO_NODE);
+}
+
+static inline void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
+{
+	/*
+	 * Don't allow crazy sizes here, either.  For 64-bit,
+	 * this also lets the compiler avoid the overflow check.
+	 */
+	if (WARN_ON(size > INT_MAX || n > INT_MAX))
+		return NULL;
+
+	return __kvmalloc_array(n, size, flags);
+}
+
+static inline void *__kvcalloc(size_t n, size_t size, gfp_t flags)
+{
+	return __kvmalloc_array(n, size, flags | __GFP_ZERO);
  }
  
  static inline void *kvcalloc(size_t n, size_t size, gfp_t flags)
diff --git a/mm/util.c b/mm/util.c
index 499b6b5767ed..0406709d8097 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -558,7 +558,7 @@ EXPORT_SYMBOL(vm_mmap);
   *
   * Return: pointer to the allocated memory of %NULL in case of failure
   */
-void *kvmalloc_node(size_t size, gfp_t flags, int node)
+void *__kvmalloc_node(size_t size, gfp_t flags, int node)
  {
  	gfp_t kmalloc_flags = flags;
  	void *ret;
@@ -593,14 +593,10 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
  	if (ret || size <= PAGE_SIZE)
  		return ret;
  
-	/* Don't even allow crazy sizes */
-	if (WARN_ON_ONCE(size > INT_MAX))
-		return NULL;
-
  	return __vmalloc_node(size, 1, flags, node,
  			__builtin_return_address(0));
  }
-EXPORT_SYMBOL(kvmalloc_node);
+EXPORT_SYMBOL(__kvmalloc_node);
  
  /**
   * kvfree() - Free memory.

> So I would also accept a patch that just changes how "kvcalloc()"
> works (or how "kvmalloc_array()" works).
> 
> It's a bit annoying how we've ended up losing that "n/size"
> information by the time we hit kvmalloc().
> 
>                 Linus
> 


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: allow huge kvmalloc() calls if they're accounted to memcg
  2021-10-16 17:53   ` Paolo Bonzini
@ 2021-10-16 18:10     ` Linus Torvalds
  2021-10-16 23:17       ` Paolo Bonzini
  0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2021-10-16 18:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: KVM list, Linux Kernel Mailing List, Sean Christopherson,
	Willy Tarreau, Kees Cook, syzbot+e0de2333cbf95ea473e8

On Sat, Oct 16, 2021 at 10:53 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Sounds good, and you'll get a pull request for that tomorrow.  Then I'll
> send via Andrew a patch to add __vcalloc, so that the accounting is
> restored.

Ahh. So you used kvmalloc() mainly because the regular vmalloc()
doesn't do the gfp-flags.

We do have that "__vmalloc()" thing, but the double underscore naming
makes it a bit unfortunate (since it tends to mean "local special use
only").

I suspect you could just make "vcalloc()".

That said, I also do wonder if we could possibly change "kvcalloc()"
to avoid the warning. The reason I didn't like your patch is that
kvmalloc_node() only takes a "size_t", and the overflow condition
there is that "MAX_INT".

But the "kvcalloc()" case that takes a "number of elements and size"
should _conceptually_ warn not when the total size overflows, but when
either number or the element size overflows.

So I would also accept a patch that just changes how "kvcalloc()"
works (or how "kvmalloc_array()" works).

It's a bit annoying how we've ended up losing that "n/size"
information by the time we hit kvmalloc().

               Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: allow huge kvmalloc() calls if they're accounted to memcg
  2021-10-16 15:39 ` Linus Torvalds
@ 2021-10-16 17:53   ` Paolo Bonzini
  2021-10-16 18:10     ` Linus Torvalds
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2021-10-16 17:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KVM list, Linux Kernel Mailing List, Sean Christopherson,
	Willy Tarreau, Kees Cook, syzbot+e0de2333cbf95ea473e8

On 16/10/21 17:39, Linus Torvalds wrote:
> The big allocation warnings are not about whether we have the memory
> or not, or about whether it's accounted or not.
> It's about bugs and overflows. Which we've had.

Yes, I understand that...

> At least GFP_NOWARN would be somewhat sensible - although still wrong.

... and it also seemed wrong to overload GFP_NOWARN.

> It should really be about "I've been careful with growing my
> allocations", not about whether accounting or similar should be
> disabled. If the allocations really are expected to be that big, and
> it's actually valid, just do vmalloc(), which doesn't warn.
Sounds good, and you'll get a pull request for that tomorrow.  Then I'll 
send via Andrew a patch to add __vcalloc, so that the accounting is 
restored.

Paolo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: allow huge kvmalloc() calls if they're accounted to memcg
  2021-10-16  6:43 Paolo Bonzini
@ 2021-10-16 15:39 ` Linus Torvalds
  2021-10-16 17:53   ` Paolo Bonzini
  0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2021-10-16 15:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: KVM list, Linux Kernel Mailing List, Sean Christopherson,
	Willy Tarreau, Kees Cook, syzbot+e0de2333cbf95ea473e8

On Fri, Oct 15, 2021 at 11:43 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Use memcg accounting as evidence that the crazy large allocations are
> expected---in which case, it is indeed a good idea to have them
> properly accounted---and exempt them from the warning.

This is not sensible.

The big allocation warnings are not about whether we have the memory
or not, or about whether it's accounted or not.

It's about bugs and overflows. Which we've had.

At least GFP_NOWARN would be somewhat sensible - although still wrong.
It should really be about "I've been careful with growing my
allocations", not about whether accounting or similar should be
disabled.

If the allocations really are expected to be that big, and it's
actually valid, just do vmalloc(), which doesn't warn.

                      Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH] mm: allow huge kvmalloc() calls if they're accounted to memcg
@ 2021-10-16  6:43 Paolo Bonzini
  2021-10-16 15:39 ` Linus Torvalds
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2021-10-16  6:43 UTC (permalink / raw)
  To: torvalds, kvm
  Cc: linux-kernel, seanjc, Willy Tarreau, Kees Cook,
	syzbot+e0de2333cbf95ea473e8

Commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls")
restricted memory allocation with 'kvmalloc()' to sizes that fit
in an 'int', to protect against trivial integer conversion issues.

However, the WARN triggers with KVM, when it allocates ancillary page
data whose size essentially depends on whatever userspace has passed to
the KVM_SET_USER_MEMORY_REGION ioctl.  The warnings are easily raised by
syzkaller, but the largest allocation that KVM can do is 8 bytes per page
of guest memory; therefore, a 1 TiB memslot will cause a warning even
outside fuzzing, and those allocations are known to happen in the wild.
Google for example already has VMs that create 1.5tb memslots (12tb of
total guest memory spread across 8 virtual NUMA nodes).

Use memcg accounting as evidence that the crazy large allocations are
expected---in which case, it is indeed a good idea to have them
properly accounted---and exempt them from the warning.

Cc: Willy Tarreau <w@1wt.eu>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: syzbot+e0de2333cbf95ea473e8@syzkaller.appspotmail.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
	Linus, what do you think of this?  It is a bit of a hack,
	but the reasoning in the commit message does make at least
	some sense.

	The alternative would be to just use __vmalloc in KVM, and add
	__vcalloc too.	The two underscores would suggest that something
	"different" is going on, but I wonder what you prefer between
	this and having a __vcalloc with 2-3 uses in the whole source.

 mm/util.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index 499b6b5767ed..31fca4a999c6 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -593,8 +593,12 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	if (ret || size <= PAGE_SIZE)
 		return ret;
 
-	/* Don't even allow crazy sizes */
-	if (WARN_ON_ONCE(size > INT_MAX))
+	/*
+	 * Don't even allow crazy sizes unless memcg accounting is
+	 * request.  We take that as a sign that huge allocations
+	 * are indeed expected.
+	 */
+	if (likely(!(flags & __GFP_ACCOUNT)) && WARN_ON_ONCE(size > INT_MAX))
 		return NULL;
 
 	return __vmalloc_node(size, 1, flags, node,
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-10-18 15:24 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-16  6:51 [PATCH] mm: allow huge kvmalloc() calls if they're accounted to memcg Paolo Bonzini
2021-10-18 15:09 ` Kees Cook
  -- strict thread matches above, loose matches on Subject: below --
2021-10-16  6:43 Paolo Bonzini
2021-10-16 15:39 ` Linus Torvalds
2021-10-16 17:53   ` Paolo Bonzini
2021-10-16 18:10     ` Linus Torvalds
2021-10-16 23:17       ` Paolo Bonzini
2021-10-18 15:19         ` Kees Cook
2021-10-18 15:24           ` Paolo Bonzini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.