* [PATCH v2 0/9] slab: Introduce dedicated bucket allocator
@ 2024-03-05 10:10 Kees Cook
  2024-03-05 10:10 ` [PATCH v2 1/9] slab: Introduce kmem_buckets typedef Kees Cook
                   ` (10 more replies)
  0 siblings, 11 replies; 23+ messages in thread
From: Kees Cook @ 2024-03-05 10:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kees Cook, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo, GONG,
	Ruiqi, Xiu Jianfeng, Suren Baghdasaryan, Kent Overstreet,
	Jann Horn, Matteo Rizzo, linux-kernel, linux-mm, linux-hardening

Hi,

Repeating the commit log for patch 4 here:

    Dedicated caches are available for fixed size allocations via
    kmem_cache_alloc(), but for dynamically sized allocations there is only
    the global kmalloc API's set of buckets available. This means it isn't
    possible to separate specific sets of dynamically sized allocations into
    a separate collection of caches.

    This leads to a use-after-free exploitation weakness in the Linux
    kernel since many heap memory spraying/grooming attacks depend on using
    userspace-controllable dynamically sized allocations to collide with
    fixed size allocations that end up in the same cache.

    While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
    against these kinds of "type confusion" attacks, including for fixed
    same-size heap objects, we can create a complementary deterministic
    defense for dynamically sized allocations.

    In order to isolate user-controllable sized allocations from system
    allocations, introduce kmem_buckets_create(), which behaves like
    kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(),
    which behaves like kmem_cache_alloc().)

    This allows confining allocations to a dedicated set of sized caches
    (which have the same layout as the kmalloc caches).

    This can also be used in the future once codetag allocation annotations
    exist to implement per-caller allocation cache isolation[0] even for
    dynamic allocations.

    Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [0]
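
As an illustration, here is a minimal usage sketch based only on the API
introduced by this series (the "foo" subsystem, its init hook, and the
user_len variable are hypothetical placeholders, not part of these patches):

    static kmem_buckets *foo_buckets __ro_after_init;

    static int __init foo_buckets_init(void)
    {
            /* Creates a dedicated set of sized caches named "foo-<size>". */
            foo_buckets = kmem_buckets_create("foo", 0, SLAB_ACCOUNT,
                                              0, INT_MAX, NULL);
            return 0;
    }
    subsys_initcall(foo_buckets_init);

    /* A user-controlled size then lands in the dedicated caches: */
    void *p = kmem_buckets_alloc(foo_buckets, user_len, GFP_KERNEL);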

After the implementation come 2 example patches showing how this can be
used for some repeat "offenders" that get used in exploits. There are more
to be isolated beyond just these. Repeating the commit log for patch 8 here:

    The msg subsystem is a common target for exploiting[1][2][3][4][5][6]
    use-after-free type confusion flaws in the kernel for both read and
    write primitives. Avoid having a user-controlled size cache share the
    global kmalloc allocator by using a separate set of kmalloc buckets.

    Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1]
    Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2]
    Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3]
    Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4]
    Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5]
    Link: https://zplin.me/papers/ELOISE.pdf [6]

-Kees

 v2: significant rewrite, generalized the buckets type, added kvmalloc style
 v1: https://lore.kernel.org/lkml/20240304184252.work.496-kees@kernel.org/

Kees Cook (9):
  slab: Introduce kmem_buckets typedef
  slub: Plumb kmem_buckets into __do_kmalloc_node()
  util: Introduce __kvmalloc_node() that can take kmem_buckets argument
  slab: Introduce kmem_buckets_create()
  slab: Introduce kmem_buckets_alloc()
  slub: Introduce kmem_buckets_alloc_track_caller()
  slab: Introduce kmem_buckets_valloc()
  ipc, msg: Use dedicated slab buckets for alloc_msg()
  mm/util: Use dedicated slab buckets for memdup_user()

 include/linux/slab.h | 50 +++++++++++++++++++++-------
 ipc/msgutil.c        | 13 +++++++-
 lib/fortify_kunit.c  |  2 +-
 mm/slab.h            |  6 ++--
 mm/slab_common.c     | 77 ++++++++++++++++++++++++++++++++++++++++++--
 mm/slub.c            | 14 ++++----
 mm/util.c            | 23 +++++++++----
 7 files changed, 154 insertions(+), 31 deletions(-)

-- 
2.34.1

* [PATCH v2 1/9] slab: Introduce kmem_buckets typedef
  2024-03-05 10:10 [PATCH v2 0/9] slab: Introduce dedicated bucket allocator Kees Cook
@ 2024-03-05 10:10 ` Kees Cook
  2024-03-05 10:10 ` [PATCH v2 2/9] slub: Plumb kmem_buckets into __do_kmalloc_node() Kees Cook
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2024-03-05 10:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, GONG, Ruiqi, Xiu Jianfeng, Suren Baghdasaryan,
	Kent Overstreet, Jann Horn, Matteo Rizzo, linux-kernel,
	linux-hardening

Encapsulate the concept of a single set of kmem_caches that are used
for the kmalloc size buckets. Redefine kmalloc_caches as an array
of these bucket sets, one for each of the global kmalloc cache types.
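
As a sketch of what this means in C (illustration only, not part of the
patch): kmem_buckets is an array type holding one cache pointer per
kmalloc size index, so the two declarations below describe the same
object, and a "kmem_buckets *" refers to a whole set of sized caches:

    struct kmem_cache *kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
    kmem_buckets kmalloc_caches[NR_KMALLOC_TYPES];	/* equivalent, via the typedef */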

Signed-off-by: Kees Cook <keescook@chromium.org>
---
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-mm@kvack.org
---
 include/linux/slab.h | 5 +++--
 mm/slab_common.c     | 3 +--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index b5f5ee8308d0..55059faf166c 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -375,8 +375,9 @@ enum kmalloc_cache_type {
 	NR_KMALLOC_TYPES
 };
 
-extern struct kmem_cache *
-kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
+typedef struct kmem_cache * kmem_buckets[KMALLOC_SHIFT_HIGH + 1];
+
+extern kmem_buckets kmalloc_caches[NR_KMALLOC_TYPES];
 
 /*
  * Define gfp bits that should not be set for KMALLOC_NORMAL.
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 238293b1dbe1..8787cf17d6e4 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -649,8 +649,7 @@ static struct kmem_cache *__init create_kmalloc_cache(const char *name,
 	return s;
 }
 
-struct kmem_cache *
-kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1] __ro_after_init =
+kmem_buckets kmalloc_caches[NR_KMALLOC_TYPES] __ro_after_init =
 { /* initialization for https://bugs.llvm.org/show_bug.cgi?id=42570 */ };
 EXPORT_SYMBOL(kmalloc_caches);
 
-- 
2.34.1


* [PATCH v2 2/9] slub: Plumb kmem_buckets into __do_kmalloc_node()
  2024-03-05 10:10 [PATCH v2 0/9] slab: Introduce dedicated bucket allocator Kees Cook
  2024-03-05 10:10 ` [PATCH v2 1/9] slab: Introduce kmem_buckets typedef Kees Cook
@ 2024-03-05 10:10 ` Kees Cook
  2024-03-05 10:10 ` [PATCH v2 3/9] util: Introduce __kvmalloc_node() that can take kmem_buckets argument Kees Cook
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2024-03-05 10:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, linux-hardening, GONG, Ruiqi, Xiu Jianfeng,
	Suren Baghdasaryan, Kent Overstreet, Jann Horn, Matteo Rizzo,
	linux-kernel

To be able to choose which buckets to allocate from, make the buckets
available to the lower level kmalloc interfaces.
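
As a sketch of the resulting calling convention (illustrative only, where
"my_buckets" stands in for a kmem_buckets pointer created later in the
series): existing callers pass NULL and keep today's behavior of picking
the global kmalloc buckets from the gfp flags, while bucket-aware callers
pass their own set:

    __kmalloc_node(NULL, size, flags, node);	   /* global kmalloc buckets */
    __kmalloc_node(my_buckets, size, flags, node); /* dedicated buckets */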

Signed-off-by: Kees Cook <keescook@chromium.org>
---
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-mm@kvack.org
Cc: linux-hardening@vger.kernel.org
---
 include/linux/slab.h |  8 ++++----
 lib/fortify_kunit.c  |  2 +-
 mm/slab.h            |  6 ++++--
 mm/slab_common.c     |  2 +-
 mm/slub.c            | 12 ++++++------
 5 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 55059faf166c..1cc1a7637b56 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -508,8 +508,8 @@ static __always_inline void kfree_bulk(size_t size, void **p)
 	kmem_cache_free_bulk(NULL, size, p);
 }
 
-void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment
-							 __alloc_size(1);
+void *__kmalloc_node(kmem_buckets *b, size_t size, gfp_t flags, int node)
+		     __assume_kmalloc_alignment __alloc_size(2);
 void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node) __assume_slab_alignment
 									 __malloc;
 
@@ -608,7 +608,7 @@ static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t fla
 				kmalloc_caches[kmalloc_type(flags, _RET_IP_)][index],
 				flags, node, size);
 	}
-	return __kmalloc_node(size, flags, node);
+	return __kmalloc_node(NULL, size, flags, node);
 }
 
 /**
@@ -686,7 +686,7 @@ static inline __alloc_size(1, 2) void *kmalloc_array_node(size_t n, size_t size,
 		return NULL;
 	if (__builtin_constant_p(n) && __builtin_constant_p(size))
 		return kmalloc_node(bytes, flags, node);
-	return __kmalloc_node(bytes, flags, node);
+	return __kmalloc_node(NULL, bytes, flags, node);
 }
 
 static inline __alloc_size(1, 2) void *kcalloc_node(size_t n, size_t size, gfp_t flags, int node)
diff --git a/lib/fortify_kunit.c b/lib/fortify_kunit.c
index 2e4fedc81621..c44400b577f3 100644
--- a/lib/fortify_kunit.c
+++ b/lib/fortify_kunit.c
@@ -182,7 +182,7 @@ static void alloc_size_##allocator##_dynamic_test(struct kunit *test)	\
 	checker(expected_size, __kmalloc(alloc_size, gfp),		\
 		kfree(p));						\
 	checker(expected_size,						\
-		__kmalloc_node(alloc_size, gfp, NUMA_NO_NODE),		\
+		__kmalloc_node(NULL, alloc_size, gfp, NUMA_NO_NODE),	\
 		kfree(p));						\
 									\
 	orig = kmalloc(alloc_size, gfp);				\
diff --git a/mm/slab.h b/mm/slab.h
index 54deeb0428c6..931f261bde48 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -404,16 +404,18 @@ static inline unsigned int size_index_elem(unsigned int bytes)
  * KMALLOC_MAX_CACHE_SIZE and the caller must check that.
  */
 static inline struct kmem_cache *
-kmalloc_slab(size_t size, gfp_t flags, unsigned long caller)
+kmalloc_slab(kmem_buckets *b, size_t size, gfp_t flags, unsigned long caller)
 {
 	unsigned int index;
 
+	if (!b)
+		b = &kmalloc_caches[kmalloc_type(flags, caller)];
 	if (size <= 192)
 		index = kmalloc_size_index[size_index_elem(size)];
 	else
 		index = fls(size - 1);
 
-	return kmalloc_caches[kmalloc_type(flags, caller)][index];
+	return (*b)[index];
 }
 
 gfp_t kmalloc_fix_flags(gfp_t flags);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8787cf17d6e4..1d0f25b6ae91 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -698,7 +698,7 @@ size_t kmalloc_size_roundup(size_t size)
 		 * The flags don't matter since size_index is common to all.
 		 * Neither does the caller for just getting ->object_size.
 		 */
-		return kmalloc_slab(size, GFP_KERNEL, 0)->object_size;
+		return kmalloc_slab(NULL, size, GFP_KERNEL, 0)->object_size;
 	}
 
 	/* Above the smaller buckets, size is a multiple of page size. */
diff --git a/mm/slub.c b/mm/slub.c
index 2ef88bbf56a3..71220b4b1f79 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3959,7 +3959,7 @@ void *kmalloc_large_node(size_t size, gfp_t flags, int node)
 EXPORT_SYMBOL(kmalloc_large_node);
 
 static __always_inline
-void *__do_kmalloc_node(size_t size, gfp_t flags, int node,
+void *__do_kmalloc_node(kmem_buckets *b, size_t size, gfp_t flags, int node,
 			unsigned long caller)
 {
 	struct kmem_cache *s;
@@ -3975,7 +3975,7 @@ void *__do_kmalloc_node(size_t size, gfp_t flags, int node,
 	if (unlikely(!size))
 		return ZERO_SIZE_PTR;
 
-	s = kmalloc_slab(size, flags, caller);
+	s = kmalloc_slab(b, size, flags, caller);
 
 	ret = slab_alloc_node(s, NULL, flags, node, caller, size);
 	ret = kasan_kmalloc(s, ret, size, flags);
@@ -3983,22 +3983,22 @@ void *__do_kmalloc_node(size_t size, gfp_t flags, int node,
 	return ret;
 }
 
-void *__kmalloc_node(size_t size, gfp_t flags, int node)
+void *__kmalloc_node(kmem_buckets *b, size_t size, gfp_t flags, int node)
 {
-	return __do_kmalloc_node(size, flags, node, _RET_IP_);
+	return __do_kmalloc_node(b, size, flags, node, _RET_IP_);
 }
 EXPORT_SYMBOL(__kmalloc_node);
 
 void *__kmalloc(size_t size, gfp_t flags)
 {
-	return __do_kmalloc_node(size, flags, NUMA_NO_NODE, _RET_IP_);
+	return __do_kmalloc_node(NULL, size, flags, NUMA_NO_NODE, _RET_IP_);
 }
 EXPORT_SYMBOL(__kmalloc);
 
 void *__kmalloc_node_track_caller(size_t size, gfp_t flags,
 				  int node, unsigned long caller)
 {
-	return __do_kmalloc_node(size, flags, node, caller);
+	return __do_kmalloc_node(NULL, size, flags, node, caller);
 }
 EXPORT_SYMBOL(__kmalloc_node_track_caller);
 
-- 
2.34.1


* [PATCH v2 3/9] util: Introduce __kvmalloc_node() that can take kmem_buckets argument
  2024-03-05 10:10 [PATCH v2 0/9] slab: Introduce dedicated bucket allocator Kees Cook
  2024-03-05 10:10 ` [PATCH v2 1/9] slab: Introduce kmem_buckets typedef Kees Cook
  2024-03-05 10:10 ` [PATCH v2 2/9] slub: Plumb kmem_buckets into __do_kmalloc_node() Kees Cook
@ 2024-03-05 10:10 ` Kees Cook
  2024-03-05 10:10 ` [PATCH v2 4/9] slab: Introduce kmem_buckets_create() Kees Cook
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2024-03-05 10:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, GONG, Ruiqi, Xiu Jianfeng, Suren Baghdasaryan,
	Kent Overstreet, Jann Horn, Matteo Rizzo, linux-kernel,
	linux-hardening

Provide an API to perform kvmalloc-style allocations with a particular
set of buckets.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-mm@kvack.org
---
 include/linux/slab.h | 9 ++++++++-
 mm/util.c            | 9 +++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 1cc1a7637b56..f26ac9a6ef9f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -723,7 +723,14 @@ static inline __alloc_size(1) void *kzalloc_node(size_t size, gfp_t flags, int n
 	return kmalloc_node(size, flags | __GFP_ZERO, node);
 }
 
-extern void *kvmalloc_node(size_t size, gfp_t flags, int node) __alloc_size(1);
+void * __alloc_size(2)
+__kvmalloc_node(kmem_buckets *b, size_t size, gfp_t flags, int node);
+
+static inline __alloc_size(1) void *kvmalloc_node(size_t size, gfp_t flags, int node)
+{
+	return __kvmalloc_node(NULL, size, flags, node);
+}
+
 static inline __alloc_size(1) void *kvmalloc(size_t size, gfp_t flags)
 {
 	return kvmalloc_node(size, flags, NUMA_NO_NODE);
diff --git a/mm/util.c b/mm/util.c
index 5a6a9802583b..02c895b87a28 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -577,8 +577,9 @@ unsigned long vm_mmap(struct file *file, unsigned long addr,
 EXPORT_SYMBOL(vm_mmap);
 
 /**
- * kvmalloc_node - attempt to allocate physically contiguous memory, but upon
+ * __kvmalloc_node - attempt to allocate physically contiguous memory, but upon
  * failure, fall back to non-contiguous (vmalloc) allocation.
+ * @b: which set of kmalloc buckets to allocate from.
  * @size: size of the request.
  * @flags: gfp mask for the allocation - must be compatible (superset) with GFP_KERNEL.
  * @node: numa node to allocate from
@@ -592,7 +593,7 @@ EXPORT_SYMBOL(vm_mmap);
  *
  * Return: pointer to the allocated memory of %NULL in case of failure
  */
-void *kvmalloc_node(size_t size, gfp_t flags, int node)
+void *__kvmalloc_node(kmem_buckets *b, size_t size, gfp_t flags, int node)
 {
 	gfp_t kmalloc_flags = flags;
 	void *ret;
@@ -614,7 +615,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 		kmalloc_flags &= ~__GFP_NOFAIL;
 	}
 
-	ret = kmalloc_node(size, kmalloc_flags, node);
+	ret = __kmalloc_node(b, size, kmalloc_flags, node);
 
 	/*
 	 * It doesn't really make sense to fallback to vmalloc for sub page
@@ -643,7 +644,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 			flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
 			node, __builtin_return_address(0));
 }
-EXPORT_SYMBOL(kvmalloc_node);
+EXPORT_SYMBOL(__kvmalloc_node);
 
 /**
  * kvfree() - Free memory.
-- 
2.34.1


* [PATCH v2 4/9] slab: Introduce kmem_buckets_create()
  2024-03-05 10:10 [PATCH v2 0/9] slab: Introduce dedicated bucket allocator Kees Cook
                   ` (2 preceding siblings ...)
  2024-03-05 10:10 ` [PATCH v2 3/9] util: Introduce __kvmalloc_node() that can take kmem_buckets argument Kees Cook
@ 2024-03-05 10:10 ` Kees Cook
  2024-03-25 19:40   ` Kent Overstreet
  2024-03-05 10:10 ` [PATCH v2 5/9] slab: Introduce kmem_buckets_alloc() Kees Cook
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Kees Cook @ 2024-03-05 10:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, GONG, Ruiqi, Xiu Jianfeng, Suren Baghdasaryan,
	Kent Overstreet, Jann Horn, Matteo Rizzo, linux-kernel,
	linux-hardening

Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.

This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in the same cache.

While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations.

In order to isolate user-controllable sized allocations from system
allocations, introduce kmem_buckets_create(), which behaves like
kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(),
which behaves like kmem_cache_alloc().)

This allows confining allocations to a dedicated set of sized caches
(which have the same layout as the kmalloc caches).

This can also be used in the future once codetag allocation annotations
exist to implement per-caller allocation cache isolation[1] even for
dynamic allocations.

Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Signed-off-by: Kees Cook <keescook@chromium.org>
---
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-mm@kvack.org
---
 include/linux/slab.h |  5 +++
 mm/slab_common.c     | 72 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index f26ac9a6ef9f..058d0e3cd181 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -493,6 +493,11 @@ void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
 			   gfp_t gfpflags) __assume_slab_alignment __malloc;
 void kmem_cache_free(struct kmem_cache *s, void *objp);
 
+kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
+				  slab_flags_t flags,
+				  unsigned int useroffset, unsigned int usersize,
+				  void (*ctor)(void *));
+
 /*
  * Bulk allocation and freeing operations. These are accelerated in an
  * allocator specific way to avoid taking locks repeatedly or building
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 1d0f25b6ae91..03ba9aac96b6 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -392,6 +392,74 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align,
 }
 EXPORT_SYMBOL(kmem_cache_create);
 
+static struct kmem_cache *kmem_buckets_cache __ro_after_init;
+
+kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
+				  slab_flags_t flags,
+				  unsigned int useroffset,
+				  unsigned int usersize,
+				  void (*ctor)(void *))
+{
+	kmem_buckets *b;
+	int idx;
+
+	if (WARN_ON(!kmem_buckets_cache))
+		return NULL;
+
+	b = kmem_cache_alloc(kmem_buckets_cache, GFP_KERNEL|__GFP_ZERO);
+	if (WARN_ON(!b))
+		return NULL;
+
+	flags |= SLAB_NO_MERGE;
+
+	for (idx = 0; idx < ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]); idx++) {
+		char *short_size, *cache_name;
+		unsigned int cache_useroffset, cache_usersize;
+		unsigned int size;
+
+		if (!kmalloc_caches[KMALLOC_NORMAL][idx])
+			continue;
+
+		size = kmalloc_caches[KMALLOC_NORMAL][idx]->object_size;
+		if (!size)
+			continue;
+
+		short_size = strchr(kmalloc_caches[KMALLOC_NORMAL][idx]->name, '-');
+		if (WARN_ON(!short_size))
+			goto fail;
+
+		cache_name = kasprintf(GFP_KERNEL, "%s-%s", name, short_size + 1);
+		if (WARN_ON(!cache_name))
+			goto fail;
+
+		if (useroffset >= size) {
+			cache_useroffset = 0;
+			cache_usersize = 0;
+		} else {
+			cache_useroffset = useroffset;
+			cache_usersize = min(size - cache_useroffset, usersize);
+		}
+		(*b)[idx] = kmem_cache_create_usercopy(cache_name, size,
+					align, flags, cache_useroffset,
+					cache_usersize, ctor);
+		kfree(cache_name);
+		if (WARN_ON(!(*b)[idx]))
+			goto fail;
+	}
+
+	return b;
+
+fail:
+	for (idx = 0; idx < ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]); idx++) {
+		if ((*b)[idx])
+			kmem_cache_destroy((*b)[idx]);
+	}
+	kfree(b);
+
+	return NULL;
+}
+EXPORT_SYMBOL(kmem_buckets_create);
+
 #ifdef SLAB_SUPPORTS_SYSFS
 /*
  * For a given kmem_cache, kmem_cache_destroy() should only be called
@@ -933,6 +1001,10 @@ void __init create_kmalloc_caches(slab_flags_t flags)
 
 	/* Kmalloc array is now usable */
 	slab_state = UP;
+
+	kmem_buckets_cache = kmem_cache_create("kmalloc_buckets",
+					       sizeof(kmem_buckets),
+					       0, 0, NULL);
 }
 
 /**
-- 
2.34.1


* [PATCH v2 5/9] slab: Introduce kmem_buckets_alloc()
  2024-03-05 10:10 [PATCH v2 0/9] slab: Introduce dedicated bucket allocator Kees Cook
                   ` (3 preceding siblings ...)
  2024-03-05 10:10 ` [PATCH v2 4/9] slab: Introduce kmem_buckets_create() Kees Cook
@ 2024-03-05 10:10 ` Kees Cook
  2024-03-05 10:10 ` [PATCH v2 6/9] slub: Introduce kmem_buckets_alloc_track_caller() Kees Cook
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2024-03-05 10:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, GONG, Ruiqi, Xiu Jianfeng, Suren Baghdasaryan,
	Kent Overstreet, Jann Horn, Matteo Rizzo, linux-kernel,
	linux-hardening

To perform allocations from the buckets created with
kmem_buckets_create(), introduce kmem_buckets_alloc(), which behaves
like kmem_cache_alloc().
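
A minimal usage sketch (illustrative only; "my_buckets" is assumed to
come from an earlier kmem_buckets_create() call):

    void *obj = kmem_buckets_alloc(my_buckets, len, GFP_KERNEL);

    if (!obj)
            return -ENOMEM;
    ...
    kfree(obj);	/* freed normally: the buckets share the kmalloc cache layout */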

Signed-off-by: Kees Cook <keescook@chromium.org>
---
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-mm@kvack.org
---
 include/linux/slab.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 058d0e3cd181..08d248f9a1ba 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -600,6 +600,12 @@ static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
 	return __kmalloc(size, flags);
 }
 
+static __always_inline __alloc_size(2)
+void *kmem_buckets_alloc(kmem_buckets *b, size_t size, gfp_t flags)
+{
+	return __kmalloc_node(b, size, flags, NUMA_NO_NODE);
+}
+
 static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	if (__builtin_constant_p(size) && size) {
-- 
2.34.1


* [PATCH v2 6/9] slub: Introduce kmem_buckets_alloc_track_caller()
  2024-03-05 10:10 [PATCH v2 0/9] slab: Introduce dedicated bucket allocator Kees Cook
                   ` (4 preceding siblings ...)
  2024-03-05 10:10 ` [PATCH v2 5/9] slab: Introduce kmem_buckets_alloc() Kees Cook
@ 2024-03-05 10:10 ` Kees Cook
  2024-03-05 10:10 ` [PATCH v2 7/9] slab: Introduce kmem_buckets_valloc() Kees Cook
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2024-03-05 10:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, GONG, Ruiqi, Xiu Jianfeng, Suren Baghdasaryan,
	Kent Overstreet, Jann Horn, Matteo Rizzo, linux-kernel,
	linux-hardening

To better capture caller details in allocation wrappers, introduce
kmem_buckets_alloc_track_caller() by plumbing the buckets into the
existing *_track_caller() interfaces.
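
A minimal sketch of the intended use in an allocation wrapper (the
wrapper and its buckets are hypothetical, not part of this patch); the
_RET_IP_ plumbing attributes the allocation to the wrapper's caller
rather than to the wrapper itself:

    void *foo_dup(const void *src, size_t len)
    {
            void *p = kmem_buckets_alloc_track_caller(foo_buckets, len,
                                                      GFP_KERNEL);

            if (p)
                    memcpy(p, src, len);
            return p;
    }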

Signed-off-by: Kees Cook <keescook@chromium.org>
---
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-mm@kvack.org
---
 include/linux/slab.h | 11 +++++++----
 mm/slub.c            |  4 ++--
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 08d248f9a1ba..7d84f875dcf4 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -606,6 +606,9 @@ void *kmem_buckets_alloc(kmem_buckets *b, size_t size, gfp_t flags)
 	return __kmalloc_node(b, size, flags, NUMA_NO_NODE);
 }
 
+#define kmem_buckets_alloc_track_caller(b, size, flags)	\
+	__kmalloc_node_track_caller(b, size, flags, NUMA_NO_NODE, _RET_IP_)
+
 static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	if (__builtin_constant_p(size) && size) {
@@ -670,10 +673,10 @@ static inline __alloc_size(1, 2) void *kcalloc(size_t n, size_t size, gfp_t flag
 	return kmalloc_array(n, size, flags | __GFP_ZERO);
 }
 
-void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
-				  unsigned long caller) __alloc_size(1);
+void *__kmalloc_node_track_caller(kmem_buckets *b, size_t size, gfp_t flags, int node,
+				  unsigned long caller) __alloc_size(2);
 #define kmalloc_node_track_caller(size, flags, node) \
-	__kmalloc_node_track_caller(size, flags, node, \
+	__kmalloc_node_track_caller(NULL, size, flags, node, \
 				    _RET_IP_)
 
 /*
@@ -685,7 +688,7 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
  * request comes from.
  */
 #define kmalloc_track_caller(size, flags) \
-	__kmalloc_node_track_caller(size, flags, \
+	__kmalloc_node_track_caller(NULL, size, flags, \
 				    NUMA_NO_NODE, _RET_IP_)
 
 static inline __alloc_size(1, 2) void *kmalloc_array_node(size_t n, size_t size, gfp_t flags,
diff --git a/mm/slub.c b/mm/slub.c
index 71220b4b1f79..ae54ec452a11 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3995,10 +3995,10 @@ void *__kmalloc(size_t size, gfp_t flags)
 }
 EXPORT_SYMBOL(__kmalloc);
 
-void *__kmalloc_node_track_caller(size_t size, gfp_t flags,
+void *__kmalloc_node_track_caller(kmem_buckets *b, size_t size, gfp_t flags,
 				  int node, unsigned long caller)
 {
-	return __do_kmalloc_node(NULL, size, flags, node, caller);
+	return __do_kmalloc_node(b, size, flags, node, caller);
 }
 EXPORT_SYMBOL(__kmalloc_node_track_caller);
 
-- 
2.34.1


* [PATCH v2 7/9] slab: Introduce kmem_buckets_valloc()
  2024-03-05 10:10 [PATCH v2 0/9] slab: Introduce dedicated bucket allocator Kees Cook
                   ` (5 preceding siblings ...)
  2024-03-05 10:10 ` [PATCH v2 6/9] slub: Introduce kmem_buckets_alloc_track_caller() Kees Cook
@ 2024-03-05 10:10 ` Kees Cook
  2024-03-05 10:10 ` [PATCH v2 8/9] ipc, msg: Use dedicated slab buckets for alloc_msg() Kees Cook
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2024-03-05 10:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, GONG, Ruiqi, Xiu Jianfeng, Suren Baghdasaryan,
	Kent Overstreet, Jann Horn, Matteo Rizzo, linux-kernel,
	linux-hardening

For allocations that may need to fall back to vmalloc, add
kmem_buckets_valloc().
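
A minimal usage sketch (illustrative only; as with kvmalloc(), the result
pairs with kvfree(), which handles both the kmalloc and vmalloc cases):

    void *buf = kmem_buckets_valloc(my_buckets, len, GFP_KERNEL);

    if (!buf)
            return -ENOMEM;
    ...
    kvfree(buf);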

Signed-off-by: Kees Cook <keescook@chromium.org>
---
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-mm@kvack.org
---
 include/linux/slab.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 7d84f875dcf4..0cf72861d5fa 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -740,6 +740,12 @@ static inline __alloc_size(1) void *kzalloc_node(size_t size, gfp_t flags, int n
 void * __alloc_size(2)
 __kvmalloc_node(kmem_buckets *b, size_t size, gfp_t flags, int node);
 
+static __always_inline __alloc_size(2)
+void *kmem_buckets_valloc(kmem_buckets *b, size_t size, gfp_t flags)
+{
+	return __kvmalloc_node(b, size, flags, NUMA_NO_NODE);
+}
+
 static inline __alloc_size(1) void *kvmalloc_node(size_t size, gfp_t flags, int node)
 {
 	return __kvmalloc_node(NULL, size, flags, node);
-- 
2.34.1


* [PATCH v2 8/9] ipc, msg: Use dedicated slab buckets for alloc_msg()
  2024-03-05 10:10 [PATCH v2 0/9] slab: Introduce dedicated bucket allocator Kees Cook
                   ` (6 preceding siblings ...)
  2024-03-05 10:10 ` [PATCH v2 7/9] slab: Introduce kmem_buckets_valloc() Kees Cook
@ 2024-03-05 10:10 ` Kees Cook
  2024-03-05 10:10 ` [PATCH v2 9/9] mm/util: Use dedicated slab buckets for memdup_user() Kees Cook
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2024-03-05 10:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kees Cook, GONG, Ruiqi, Xiu Jianfeng, Suren Baghdasaryan,
	Kent Overstreet, Jann Horn, Matteo Rizzo, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Roman Gushchin, Hyeonggon Yoo, linux-kernel, linux-mm,
	linux-hardening

The msg subsystem is a common target for exploiting[1][2][3][4][5][6]
use-after-free type confusion flaws in the kernel for both read and
write primitives. Avoid having a user-controlled size cache share the
global kmalloc allocator by using a separate set of kmalloc buckets.

Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1]
Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2]
Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3]
Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4]
Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5]
Link: https://zplin.me/papers/ELOISE.pdf [6]
Link: https://syst3mfailure.io/wall-of-perdition/
Signed-off-by: Kees Cook <keescook@chromium.org>
---
Cc: "GONG, Ruiqi" <gongruiqi@huaweicloud.com>
Cc: Xiu Jianfeng <xiujianfeng@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jann Horn <jannh@google.com>
Cc: Matteo Rizzo <matteorizzo@google.com>
---
 ipc/msgutil.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index d0a0e877cadd..f392f30a057a 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -42,6 +42,17 @@ struct msg_msgseg {
 #define DATALEN_MSG	((size_t)PAGE_SIZE-sizeof(struct msg_msg))
 #define DATALEN_SEG	((size_t)PAGE_SIZE-sizeof(struct msg_msgseg))
 
+static kmem_buckets *msg_buckets __ro_after_init;
+
+static int __init init_msg_buckets(void)
+{
+	msg_buckets = kmem_buckets_create("msg_msg", 0, SLAB_ACCOUNT,
+					  sizeof(struct msg_msg),
+					  DATALEN_MSG, NULL);
+
+	return 0;
+}
+subsys_initcall(init_msg_buckets);
 
 static struct msg_msg *alloc_msg(size_t len)
 {
@@ -50,7 +61,7 @@ static struct msg_msg *alloc_msg(size_t len)
 	size_t alen;
 
 	alen = min(len, DATALEN_MSG);
-	msg = kmalloc(sizeof(*msg) + alen, GFP_KERNEL_ACCOUNT);
+	msg = kmem_buckets_alloc(msg_buckets, sizeof(*msg) + alen, GFP_KERNEL);
 	if (msg == NULL)
 		return NULL;
 
-- 
2.34.1


* [PATCH v2 9/9] mm/util: Use dedicated slab buckets for memdup_user()
  2024-03-05 10:10 [PATCH v2 0/9] slab: Introduce dedicated bucket allocator Kees Cook
                   ` (7 preceding siblings ...)
  2024-03-05 10:10 ` [PATCH v2 8/9] ipc, msg: Use dedicated slab buckets for alloc_msg() Kees Cook
@ 2024-03-05 10:10 ` Kees Cook
  2024-03-06  1:47 ` [PATCH v2 0/9] slab: Introduce dedicated bucket allocator GONG, Ruiqi
  2024-03-25  9:03 ` Vlastimil Babka
  10 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2024-03-05 10:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kees Cook, Andrew Morton, GONG, Ruiqi, Xiu Jianfeng,
	Suren Baghdasaryan, Kent Overstreet, Jann Horn, Matteo Rizzo,
	linux-mm, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo, linux-kernel,
	linux-hardening

Both memdup_user() and vmemdup_user() handle allocations that are
regularly used for exploiting use-after-free type confusion flaws in
the kernel (e.g. prctl() PR_SET_VMA_ANON_NAME[1] and setxattr[2][3][4]
respectively).

Since both are designed for contents coming from userspace, they allow
for userspace-controlled allocation sizes. Use a dedicated set of kmalloc
buckets so these allocations do not share caches with the global kmalloc
buckets.

After a fresh boot under Ubuntu 23.10, we can see the caches are already
in active use:

 # grep ^memdup /proc/slabinfo
 memdup_user-8k         4      4   8192    4    8 : ...
 memdup_user-4k         8      8   4096    8    8 : ...
 memdup_user-2k        16     16   2048   16    8 : ...
 memdup_user-1k         0      0   1024   16    4 : ...
 memdup_user-512        0      0    512   16    2 : ...
 memdup_user-256        0      0    256   16    1 : ...
 memdup_user-128        0      0    128   32    1 : ...
 memdup_user-64       256    256     64   64    1 : ...
 memdup_user-32       512    512     32  128    1 : ...
 memdup_user-16      1024   1024     16  256    1 : ...
 memdup_user-8       2048   2048      8  512    1 : ...
 memdup_user-192        0      0    192   21    1 : ...
 memdup_user-96       168    168     96   42    1 : ...

Link: https://starlabs.sg/blog/2023/07-prctl-anon_vma_name-an-amusing-heap-spray/ [1]
Link: https://duasynt.com/blog/linux-kernel-heap-spray [2]
Link: https://etenal.me/archives/1336 [3]
Link: https://github.com/a13xp0p0v/kernel-hack-drill/blob/master/drill_exploit_uaf.c [4]
Signed-off-by: Kees Cook <keescook@chromium.org>
---
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "GONG, Ruiqi" <gongruiqi@huaweicloud.com>
Cc: Xiu Jianfeng <xiujianfeng@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jann Horn <jannh@google.com>
Cc: Matteo Rizzo <matteorizzo@google.com>
Cc: linux-mm@kvack.org
---
 mm/util.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index 02c895b87a28..25b9122022a7 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -181,6 +181,16 @@ char *kmemdup_nul(const char *s, size_t len, gfp_t gfp)
 }
 EXPORT_SYMBOL(kmemdup_nul);
 
+static kmem_buckets *user_buckets __ro_after_init;
+
+static int __init init_user_buckets(void)
+{
+	user_buckets = kmem_buckets_create("memdup_user", 0, 0, 0, INT_MAX, NULL);
+
+	return 0;
+}
+subsys_initcall(init_user_buckets);
+
 /**
  * memdup_user - duplicate memory region from user space
  *
@@ -194,7 +204,7 @@ void *memdup_user(const void __user *src, size_t len)
 {
 	void *p;
 
-	p = kmalloc_track_caller(len, GFP_USER | __GFP_NOWARN);
+	p = kmem_buckets_alloc_track_caller(user_buckets, len, GFP_USER | __GFP_NOWARN);
 	if (!p)
 		return ERR_PTR(-ENOMEM);
 
@@ -220,7 +230,7 @@ void *vmemdup_user(const void __user *src, size_t len)
 {
 	void *p;
 
-	p = kvmalloc(len, GFP_USER);
+	p = kmem_buckets_valloc(user_buckets, len, GFP_USER);
 	if (!p)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.34.1


* Re: [PATCH v2 0/9] slab: Introduce dedicated bucket allocator
  2024-03-05 10:10 [PATCH v2 0/9] slab: Introduce dedicated bucket allocator Kees Cook
                   ` (8 preceding siblings ...)
  2024-03-05 10:10 ` [PATCH v2 9/9] mm/util: Use dedicated slab buckets for memdup_user() Kees Cook
@ 2024-03-06  1:47 ` GONG, Ruiqi
  2024-03-07 20:31   ` Kees Cook
  2024-03-25  9:03 ` Vlastimil Babka
  10 siblings, 1 reply; 23+ messages in thread
From: GONG, Ruiqi @ 2024-03-06  1:47 UTC (permalink / raw)
  To: Kees Cook, Vlastimil Babka
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo, Xiu Jianfeng,
	Suren Baghdasaryan, Kent Overstreet, Jann Horn, Matteo Rizzo,
	linux-kernel, linux-mm, linux-hardening



On 2024/03/05 18:10, Kees Cook wrote:
> Hi,
> 
> Repeating the commit logs for patch 4 here:
> 
>     Dedicated caches are available For fixed size allocations via
>     kmem_cache_alloc(), but for dynamically sized allocations there is only
>     the global kmalloc API's set of buckets available. This means it isn't
>     possible to separate specific sets of dynamically sized allocations into
>     a separate collection of caches.
> 
>     This leads to a use-after-free exploitation weakness in the Linux
>     kernel since many heap memory spraying/grooming attacks depend on using
>     userspace-controllable dynamically sized allocations to collide with
>     fixed size allocations that end up in same cache.
> 
>     While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
>     against these kinds of "type confusion" attacks, including for fixed
>     same-size heap objects, we can create a complementary deterministic
>     defense for dynamically sized allocations.
> 
>     In order to isolate user-controllable sized allocations from system
>     allocations, introduce kmem_buckets_create(), which behaves like
>     kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(),
>     which behaves like kmem_cache_alloc().)

So can I say the vision here would be to make all the kernel interfaces
that handle user space input use separate caches? That would look like
creating a "grey zone" between kernel space (trusted) and user space
(untrusted) memory. I've also thought that maybe hardening on this
"border" could be more efficient and targeted than a mitigation that
applies globally, e.g. CONFIG_RANDOM_KMALLOC_CACHES.


* Re: [PATCH v2 0/9] slab: Introduce dedicated bucket allocator
  2024-03-06  1:47 ` [PATCH v2 0/9] slab: Introduce dedicated bucket allocator GONG, Ruiqi
@ 2024-03-07 20:31   ` Kees Cook
  2024-03-15 10:28     ` GONG, Ruiqi
  0 siblings, 1 reply; 23+ messages in thread
From: Kees Cook @ 2024-03-07 20:31 UTC (permalink / raw)
  To: GONG, Ruiqi
  Cc: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Xiu Jianfeng, Suren Baghdasaryan, Kent Overstreet, Jann Horn,
	Matteo Rizzo, linux-kernel, linux-mm, linux-hardening

On Wed, Mar 06, 2024 at 09:47:36AM +0800, GONG, Ruiqi wrote:
> 
> 
> On 2024/03/05 18:10, Kees Cook wrote:
> > Hi,
> > 
> > Repeating the commit logs for patch 4 here:
> > 
> >     Dedicated caches are available For fixed size allocations via
> >     kmem_cache_alloc(), but for dynamically sized allocations there is only
> >     the global kmalloc API's set of buckets available. This means it isn't
> >     possible to separate specific sets of dynamically sized allocations into
> >     a separate collection of caches.
> > 
> >     This leads to a use-after-free exploitation weakness in the Linux
> >     kernel since many heap memory spraying/grooming attacks depend on using
> >     userspace-controllable dynamically sized allocations to collide with
> >     fixed size allocations that end up in same cache.
> > 
> >     While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
> >     against these kinds of "type confusion" attacks, including for fixed
> >     same-size heap objects, we can create a complementary deterministic
> >     defense for dynamically sized allocations.
> > 
> >     In order to isolate user-controllable sized allocations from system
> >     allocations, introduce kmem_buckets_create(), which behaves like
> >     kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(),
> >     which behaves like kmem_cache_alloc().)
> 
> So can I say the vision here would be to make all the kernel interfaces
> that handles user space input to use separated caches? Which looks like
> creating a "grey zone" in the middle of kernel space (trusted) and user
> space (untrusted) memory. I've also thought that maybe hardening on the
> "border" could be more efficient and targeted than a mitigation that
> affects globally, e.g. CONFIG_RANDOM_KMALLOC_CACHES.

I think it ends up having a similar effect, yes. The more copies that
move to memdup_user(), the more coverage is created. The main point is to
just not share caches between different kinds of allocations. The most
abused version of this is the userspace size-controllable allocations,
which this targets. The existing caches (which could still be used for
type confusion attacks when the sizes are sufficiently similar) have a
good chance of being mitigated by CONFIG_RANDOM_KMALLOC_CACHES already,
so this proposed change is just complementary, IMO.

-Kees

-- 
Kees Cook

* Re: [PATCH v2 0/9] slab: Introduce dedicated bucket allocator
  2024-03-07 20:31   ` Kees Cook
@ 2024-03-15 10:28     ` GONG, Ruiqi
  0 siblings, 0 replies; 23+ messages in thread
From: GONG, Ruiqi @ 2024-03-15 10:28 UTC (permalink / raw)
  To: Kees Cook
  Cc: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	Xiu Jianfeng, Suren Baghdasaryan, Kent Overstreet, Jann Horn,
	Matteo Rizzo, linux-kernel, linux-mm, linux-hardening



On 2024/03/08 4:31, Kees Cook wrote:
> On Wed, Mar 06, 2024 at 09:47:36AM +0800, GONG, Ruiqi wrote:
>>
>>
>> On 2024/03/05 18:10, Kees Cook wrote:
>>> Hi,
>>>
>>> Repeating the commit logs for patch 4 here:
>>>
>>>     Dedicated caches are available For fixed size allocations via
>>>     kmem_cache_alloc(), but for dynamically sized allocations there is only
>>>     the global kmalloc API's set of buckets available. This means it isn't
>>>     possible to separate specific sets of dynamically sized allocations into
>>>     a separate collection of caches.
>>>
>>>     This leads to a use-after-free exploitation weakness in the Linux
>>>     kernel since many heap memory spraying/grooming attacks depend on using
>>>     userspace-controllable dynamically sized allocations to collide with
>>>     fixed size allocations that end up in same cache.
>>>
>>>     While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
>>>     against these kinds of "type confusion" attacks, including for fixed
>>>     same-size heap objects, we can create a complementary deterministic
>>>     defense for dynamically sized allocations.
>>>
>>>     In order to isolate user-controllable sized allocations from system
>>>     allocations, introduce kmem_buckets_create(), which behaves like
>>>     kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(),
>>>     which behaves like kmem_cache_alloc().)
>>
>> So can I say the vision here would be to make all the kernel interfaces
>> that handles user space input to use separated caches? Which looks like
>> creating a "grey zone" in the middle of kernel space (trusted) and user
>> space (untrusted) memory. I've also thought that maybe hardening on the
>> "border" could be more efficient and targeted than a mitigation that
>> affects globally, e.g. CONFIG_RANDOM_KMALLOC_CACHES.
> 
> I think it ends up having a similar effect, yes. The more copies that
> move to memdup_user(), the more coverage is created. The main point is to
> just not share caches between different kinds of allocations. The most
> abused version of this is the userspace size-controllable allocations,
> which this targets. 

I agree. Currently, if we want to enforce a stricter separation between
user-space-manageable memory and other memory in kernel space, then for
fixed size allocations we could technically convert them to dedicated
caches (i.e. kmem_cache_create()), but for dynamically sized allocations
I can't think of any existing solution. With the APIs provided by this
patch set, we've got something that works.


> ... The existing caches (which could still be used for
> type confusion attacks when the sizes are sufficiently similar) have a
> good chance of being mitigated by CONFIG_RANDOM_KMALLOC_CACHES already,
> so this proposed change is just complementary, IMO.

Maybe in the future we could require that all user-kernel interfaces
that make use of SLAB caches should use either kmem_cache_create() or
kmem_buckets_create()? ;)

> 
> -Kees
> 


* Re: [PATCH v2 0/9] slab: Introduce dedicated bucket allocator
  2024-03-05 10:10 [PATCH v2 0/9] slab: Introduce dedicated bucket allocator Kees Cook
                   ` (9 preceding siblings ...)
  2024-03-06  1:47 ` [PATCH v2 0/9] slab: Introduce dedicated bucket allocator GONG, Ruiqi
@ 2024-03-25  9:03 ` Vlastimil Babka
  2024-03-25 18:24   ` Kees Cook
  2024-03-25 19:32   ` Kent Overstreet
  10 siblings, 2 replies; 23+ messages in thread
From: Vlastimil Babka @ 2024-03-25  9:03 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo, GONG, Ruiqi,
	Xiu Jianfeng, Suren Baghdasaryan, Kent Overstreet, Jann Horn,
	Matteo Rizzo, linux-kernel, linux-mm, linux-hardening, jvoisin

On 3/5/24 11:10 AM, Kees Cook wrote:
> Hi,
> 
> Repeating the commit logs for patch 4 here:
> 
>     Dedicated caches are available For fixed size allocations via
>     kmem_cache_alloc(), but for dynamically sized allocations there is only
>     the global kmalloc API's set of buckets available. This means it isn't
>     possible to separate specific sets of dynamically sized allocations into
>     a separate collection of caches.
> 
>     This leads to a use-after-free exploitation weakness in the Linux
>     kernel since many heap memory spraying/grooming attacks depend on using
>     userspace-controllable dynamically sized allocations to collide with
>     fixed size allocations that end up in same cache.
> 
>     While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
>     against these kinds of "type confusion" attacks, including for fixed
>     same-size heap objects, we can create a complementary deterministic
>     defense for dynamically sized allocations.
> 
>     In order to isolate user-controllable sized allocations from system
>     allocations, introduce kmem_buckets_create(), which behaves like
>     kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(),
>     which behaves like kmem_cache_alloc().)
> 
>     Allows for confining allocations to a dedicated set of sized caches
>     (which have the same layout as the kmalloc caches).
> 
>     This can also be used in the future once codetag allocation annotations
>     exist to implement per-caller allocation cache isolation[0] even for
>     dynamic allocations.
> 
>     Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [0]
> 
> After the implemetation are 2 example patches of how this could be used
> for some repeat "offenders" that get used in exploits. There are more to
> be isolated beyond just these. Repeating the commit log for patch 8 here:
> 
>     The msg subsystem is a common target for exploiting[1][2][3][4][5][6]
>     use-after-free type confusion flaws in the kernel for both read and
>     write primitives. Avoid having a user-controlled size cache share the
>     global kmalloc allocator by using a separate set of kmalloc buckets.
> 
>     Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1]
>     Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2]
>     Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3]
>     Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4]
>     Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5]
>     Link: https://zplin.me/papers/ELOISE.pdf [6]

Hi Kees,

after reading [1] I think the points should be addressed, mainly about the
feasibility of converting users manually. On a related technical note I
worry what will become of /proc/slabinfo when we convert non-trivial amounts
of users.

Also would be interested to hear Jann Horn et al.'s opinion, and whether the
SLAB_VIRTUAL effort will continue?

Thanks,
Vlastimil


[1]
https://dustri.org/b/notes-on-the-slab-introduce-dedicated-bucket-allocator-series.html

> -Kees
> 
>  v2: significant rewrite, generalized the buckets type, added kvmalloc style
>  v1: https://lore.kernel.org/lkml/20240304184252.work.496-kees@kernel.org/
> 
> Kees Cook (9):
>   slab: Introduce kmem_buckets typedef
>   slub: Plumb kmem_buckets into __do_kmalloc_node()
>   util: Introduce __kvmalloc_node() that can take kmem_buckets argument
>   slab: Introduce kmem_buckets_create()
>   slab: Introduce kmem_buckets_alloc()
>   slub: Introduce kmem_buckets_alloc_track_caller()
>   slab: Introduce kmem_buckets_valloc()
>   ipc, msg: Use dedicated slab buckets for alloc_msg()
>   mm/util: Use dedicated slab buckets for memdup_user()
> 
>  include/linux/slab.h | 50 +++++++++++++++++++++-------
>  ipc/msgutil.c        | 13 +++++++-
>  lib/fortify_kunit.c  |  2 +-
>  mm/slab.h            |  6 ++--
>  mm/slab_common.c     | 77 ++++++++++++++++++++++++++++++++++++++++++--
>  mm/slub.c            | 14 ++++----
>  mm/util.c            | 23 +++++++++----
>  7 files changed, 154 insertions(+), 31 deletions(-)
> 


* Re: [PATCH v2 0/9] slab: Introduce dedicated bucket allocator
  2024-03-25  9:03 ` Vlastimil Babka
@ 2024-03-25 18:24   ` Kees Cook
  2024-03-26 18:07     ` julien.voisin
  2024-03-25 19:32   ` Kent Overstreet
  1 sibling, 1 reply; 23+ messages in thread
From: Kees Cook @ 2024-03-25 18:24 UTC (permalink / raw)
  To: Vlastimil Babka, Julien Voisin
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo, GONG, Ruiqi,
	Xiu Jianfeng, Suren Baghdasaryan, Kent Overstreet, Jann Horn,
	Matteo Rizzo, linux-kernel, linux-mm, linux-hardening, jvoisin

On Mon, Mar 25, 2024 at 10:03:23AM +0100, Vlastimil Babka wrote:
> On 3/5/24 11:10 AM, Kees Cook wrote:
> > Hi,
> > 
> > Repeating the commit logs for patch 4 here:
> > 
> >     Dedicated caches are available For fixed size allocations via
> >     kmem_cache_alloc(), but for dynamically sized allocations there is only
> >     the global kmalloc API's set of buckets available. This means it isn't
> >     possible to separate specific sets of dynamically sized allocations into
> >     a separate collection of caches.
> > 
> >     This leads to a use-after-free exploitation weakness in the Linux
> >     kernel since many heap memory spraying/grooming attacks depend on using
> >     userspace-controllable dynamically sized allocations to collide with
> >     fixed size allocations that end up in same cache.
> > 
> >     While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
> >     against these kinds of "type confusion" attacks, including for fixed
> >     same-size heap objects, we can create a complementary deterministic
> >     defense for dynamically sized allocations.
> > 
> >     In order to isolate user-controllable sized allocations from system
> >     allocations, introduce kmem_buckets_create(), which behaves like
> >     kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(),
> >     which behaves like kmem_cache_alloc().)
> > 
> >     Allows for confining allocations to a dedicated set of sized caches
> >     (which have the same layout as the kmalloc caches).
> > 
> >     This can also be used in the future once codetag allocation annotations
> >     exist to implement per-caller allocation cache isolation[0] even for
> >     dynamic allocations.
> > 
> >     Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [0]
> > 
> > After the implemetation are 2 example patches of how this could be used
> > for some repeat "offenders" that get used in exploits. There are more to
> > be isolated beyond just these. Repeating the commit log for patch 8 here:
> > 
> >     The msg subsystem is a common target for exploiting[1][2][3][4][5][6]
> >     use-after-free type confusion flaws in the kernel for both read and
> >     write primitives. Avoid having a user-controlled size cache share the
> >     global kmalloc allocator by using a separate set of kmalloc buckets.
> > 
> >     Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1]
> >     Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2]
> >     Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3]
> >     Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4]
> >     Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5]
> >     Link: https://zplin.me/papers/ELOISE.pdf [6]
> 
> Hi Kees,
> 
> after reading [1] I think the points should be addressed, mainly about the
> feasibility of converting users manually.

Sure, I can do that.

Adding Julien to this thread... Julien can you please respond to LKML
patches in email? It's much easier to keep things in a single thread. :)

] This is playing wack-a-mole

Kind of, but not really. These patches provide a mechanism for having
dedicated dynamically-sized slab caches (to match kmem_cache_create(),
which only works for fixed-size allocations). This is needed to expand
the codetag work into doing per-call-site allocations, as I detailed
here[1].

Also, adding uses manually isn't very difficult, as can be seen in the
examples I included. In fact, my examples between v1 and v2 collapsed
from 3 to 2, because covering memdup_user() actually covered 2 known
allocation paths (attrs and vma names), and given its usage pattern,
will cover more in the future without changes.

] something like AUTOSLAB would be better

Yes, that's the goal of [1]. This is a prerequisite for that, as
mentioned in the cover letter.

] The slabs needs to be pinned

Yes, and this is a general problem[2] with all kmalloc allocations, though.
This isn't unique to this patch series. SLAB_VIRTUAL solves it, and
is under development.

] Lacks guard pages

Yes, and again, this is a general problem with all kmalloc allocations.
Solving it, like SLAB_VIRTUAL, would be a complementary hardening
improvement to the allocator generally.

] PAX_USERCOPY has been marking these sites since 2012

Either it's whack-a-mole or it's not. :) PAX_USERCOPY shows that it _is_
possible to mark all sites. Regardless, like AUTOSLAB, PAX_USERCOPY isn't
upstream, and its current implementation is an unpublished modification
to a GPL project. I look forward to someone proposing it for inclusion
in Linux, but for now we can work with the patches where an effort _has_
been made to upstream them for the benefit of the entire ecosystem.

] What about CONFIG_KMALLOC_SPLIT_VARSIZE

This proposed improvement is hampered by not having dedicated
_dynamically_ sized kmem caches, which this series provides. And with
codetag-split allocations[1], the goals of CONFIG_KMALLOC_SPLIT_VARSIZE
are more fully realized, providing much more complete coverage.

] I have no idea how the community around the Linux kernel works with
] their email-based workflows

Step 1: reply to the proposal in email instead of (or perhaps in
addition to) making blog posts. :)

> On a related technical note I
> worry what will become of /proc/slabinfo when we convert non-trivial amounts
> of users.

It gets longer. :) And potentially makes the codetag /proc file
redundant. All that said, there are very few APIs in the kernel where
userspace can control both the size and contents of an allocation.

> Also would be interested to hear Jann Horn et al.'s opinion, and whether the
> SLAB_VIRTUAL effort will continue?

SLAB_VIRTUAL is needed to address the reclamation UAF gap, and is
still being developed. I don't intend to let it fall off the radar.
(Which is why I included Jann and Matteo in CC originally.)

In the meantime, adding this series as-is kills two long-standing
exploitation methodologies, and paves the way to providing very
fine-grained caches using codetags (which I imagine would be entirely
optional and trivial to control with a boot param).

-Kees

[1] https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook/
[2] https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/9] slab: Introduce dedicated bucket allocator
  2024-03-25  9:03 ` Vlastimil Babka
  2024-03-25 18:24   ` Kees Cook
@ 2024-03-25 19:32   ` Kent Overstreet
  2024-03-25 20:26     ` Kees Cook
  1 sibling, 1 reply; 23+ messages in thread
From: Kent Overstreet @ 2024-03-25 19:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kees Cook, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo, GONG,
	Ruiqi, Xiu Jianfeng, Suren Baghdasaryan, Jann Horn, Matteo Rizzo,
	linux-kernel, linux-mm, linux-hardening, jvoisin

On Mon, Mar 25, 2024 at 10:03:23AM +0100, Vlastimil Babka wrote:
> On 3/5/24 11:10 AM, Kees Cook wrote:
> > [...]
> 
> Hi Kees,
> 
> after reading [1] I think the points should be addressed, mainly about the
> feasibility of converting users manually. On a related technical note I
> worry what will become of /proc/slabinfo when we convert non-trivial amounts
> of users.

There shouldn't be any need to convert users to this interface - just
leverage the alloc_hooks() macro.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/9] slab: Introduce kmem_buckets_create()
  2024-03-05 10:10 ` [PATCH v2 4/9] slab: Introduce kmem_buckets_create() Kees Cook
@ 2024-03-25 19:40   ` Kent Overstreet
  2024-03-25 20:40     ` Kees Cook
  0 siblings, 1 reply; 23+ messages in thread
From: Kent Overstreet @ 2024-03-25 19:40 UTC (permalink / raw)
  To: Kees Cook
  Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, GONG, Ruiqi, Xiu Jianfeng, Suren Baghdasaryan,
	Jann Horn, Matteo Rizzo, linux-kernel, linux-hardening

On Tue, Mar 05, 2024 at 02:10:20AM -0800, Kees Cook wrote:
> Dedicated caches are available For fixed size allocations via
> kmem_cache_alloc(), but for dynamically sized allocations there is only
> the global kmalloc API's set of buckets available. This means it isn't
> possible to separate specific sets of dynamically sized allocations into
> a separate collection of caches.
> 
> This leads to a use-after-free exploitation weakness in the Linux
> kernel since many heap memory spraying/grooming attacks depend on using
> userspace-controllable dynamically sized allocations to collide with
> fixed size allocations that end up in same cache.
> 
> While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
> against these kinds of "type confusion" attacks, including for fixed
> same-size heap objects, we can create a complementary deterministic
> defense for dynamically sized allocations.
> 
> In order to isolate user-controllable sized allocations from system
> allocations, introduce kmem_buckets_create(), which behaves like
> kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(),
> which behaves like kmem_cache_alloc().)
> 
> Allows for confining allocations to a dedicated set of sized caches
> (which have the same layout as the kmalloc caches).
> 
> This can also be used in the future once codetag allocation annotations
> exist to implement per-caller allocation cache isolation[1] even for
> dynamic allocations.
> 
> Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Cc: linux-mm@kvack.org
> ---
>  include/linux/slab.h |  5 +++
>  mm/slab_common.c     | 72 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 77 insertions(+)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index f26ac9a6ef9f..058d0e3cd181 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -493,6 +493,11 @@ void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
>  			   gfp_t gfpflags) __assume_slab_alignment __malloc;
>  void kmem_cache_free(struct kmem_cache *s, void *objp);
>  
> +kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
> +				  slab_flags_t flags,
> +				  unsigned int useroffset, unsigned int usersize,
> +				  void (*ctor)(void *));

I'd prefer an API that initializes an object over one that allocates it
- that is, prefer

kmem_buckets_init(kmem_buckets *buckets, ...)

By forcing it to be separately allocated, you're adding a pointer deref
to every access.

That would also allow for kmem_buckets to be lazily initialized, which
would play nicely with declaring the kmem_buckets in the alloc_hooks() macro.
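
Something shaped roughly like this, i.e. the caller owns the storage
(sketch only, the exact argument list is hand-waved):

	static kmem_buckets foo_buckets;

	static int __init foo_init(void)
	{
		return kmem_buckets_init(&foo_buckets, "foo", 0, SLAB_ACCOUNT,
					 0, 0, NULL);
	}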

I'm curious what all the arguments to kmem_buckets_create() are needed
for, if this is supposed to be a replacement for kmalloc() users.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/9] slab: Introduce dedicated bucket allocator
  2024-03-25 19:32   ` Kent Overstreet
@ 2024-03-25 20:26     ` Kees Cook
  0 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2024-03-25 20:26 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo, GONG,
	Ruiqi, Xiu Jianfeng, Suren Baghdasaryan, Jann Horn, Matteo Rizzo,
	linux-kernel, linux-mm, linux-hardening, jvoisin

On Mon, Mar 25, 2024 at 03:32:12PM -0400, Kent Overstreet wrote:
> On Mon, Mar 25, 2024 at 10:03:23AM +0100, Vlastimil Babka wrote:
> > On 3/5/24 11:10 AM, Kees Cook wrote:
> > > [...]
> > 
> > Hi Kees,
> > 
> > after reading [1] I think the points should be addressed, mainly about the
> > feasibility of converting users manually. On a related technical note I
> > worry what will become of /proc/slabinfo when we convert non-trivial amounts
> > of users.
> 
> There shouldn't be any need to convert users to this interface - just
> leverage the alloc_hooks() macro.

I expect to do both -- using the alloc_hooks() macro to do
per-call-site-allocation caches will certainly have a non-trivial amount
of memory usage overhead, and not all systems will want it. We can have
a boot param to choose between per-site and normal, though normal can
include a handful of these manually identified places.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/9] slab: Introduce kmem_buckets_create()
  2024-03-25 19:40   ` Kent Overstreet
@ 2024-03-25 20:40     ` Kees Cook
  2024-03-25 21:49       ` Kent Overstreet
  0 siblings, 1 reply; 23+ messages in thread
From: Kees Cook @ 2024-03-25 20:40 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, GONG, Ruiqi, Xiu Jianfeng, Suren Baghdasaryan,
	Jann Horn, Matteo Rizzo, linux-kernel, linux-hardening

On Mon, Mar 25, 2024 at 03:40:51PM -0400, Kent Overstreet wrote:
> On Tue, Mar 05, 2024 at 02:10:20AM -0800, Kees Cook wrote:
> > [...]
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index f26ac9a6ef9f..058d0e3cd181 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -493,6 +493,11 @@ void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
> >  			   gfp_t gfpflags) __assume_slab_alignment __malloc;
> >  void kmem_cache_free(struct kmem_cache *s, void *objp);
> >  
> > +kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
> > +				  slab_flags_t flags,
> > +				  unsigned int useroffset, unsigned int usersize,
> > +				  void (*ctor)(void *));
> 
> I'd prefer an API that initializes an object over one that allocates it
> - that is, prefer
> 
> kmem_buckets_init(kmem_buckets *buckets, ...)

Sure, that can work. kmem_cache_init() would need to exist for the same
reason though.

> 
> by forcing it to be separately allocated, you're adding a pointer deref
> to every access.

I don't understand what you mean here. "every access"? I take a guess
below...

> That would also allow for kmem_buckets to be lazily initialized, which
> would play nicely declaring the kmem_buckets in the alloc_hooks() macro.

Sure, I think it'll depend on how the per-site allocations got wired up.
I think you're meaning to include a full copy of the kmem cache/bucket
struct with the codetag instead of just a pointer? I don't think that'll
work well to make it runtime selectable, and I don't see it using an
extra deref -- allocations already get the struct from somewhere and
deref it. The only change is where to find the struct.

> I'm curious what all the arguments to kmem_buckets_create() are needed
> for, if this is supposed to be a replacement for kmalloc() users.

Are you confusing kmem_buckets_create() with kmem_buckets_alloc()? These
args are needed to initialize the per-bucket caches, just like is
already done for the global kmalloc per-bucket caches. This mirrors
kmem_cache_create(). (Or more specifically, calls kmem_cache_create()
for each bucket size, so the args need to be passed through.)
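
Very roughly -- ignoring error handling, the per-size cache naming, and
the clamping of the usercopy window to each bucket, which the real patch
handles -- the create path is just a loop over the kmalloc size info:

	kmem_buckets *b = kmalloc(sizeof(*b), GFP_KERNEL);

	for (idx = 0; idx < ARRAY_SIZE(*b); idx++) {
		if (!kmalloc_info[idx].size)
			continue;
		(*b)[idx] = kmem_cache_create_usercopy(name,
						kmalloc_info[idx].size,
						align, flags, useroffset,
						usersize, ctor);
	}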

If you mean "why expose these arguments because they can just use the
existing defaults already used by the global kmalloc caches" then I
would say, it's to gain the benefit here of narrowing the scope of the
usercopy offsets. Right now kmalloc is forced to allow the full usercopy
window into an allocation, but we don't have to do this any more. For
example, see patch 8, where struct msg_msg doesn't need to expose the
header to userspace:

	msg_buckets = kmem_buckets_create("msg_msg", 0, SLAB_ACCOUNT,
					  sizeof(struct msg_msg),
					  DATALEN_MSG, NULL);

Only DATALEN_MSG many bytes, starting at sizeof(struct msg_msg), will be
allowed to be copied in/out of userspace. Before, it was unbounded.

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/9] slab: Introduce kmem_buckets_create()
  2024-03-25 20:40     ` Kees Cook
@ 2024-03-25 21:49       ` Kent Overstreet
  2024-03-25 23:13         ` Kees Cook
  0 siblings, 1 reply; 23+ messages in thread
From: Kent Overstreet @ 2024-03-25 21:49 UTC (permalink / raw)
  To: Kees Cook
  Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, GONG, Ruiqi, Xiu Jianfeng, Suren Baghdasaryan,
	Jann Horn, Matteo Rizzo, linux-kernel, linux-hardening

On Mon, Mar 25, 2024 at 01:40:34PM -0700, Kees Cook wrote:
> On Mon, Mar 25, 2024 at 03:40:51PM -0400, Kent Overstreet wrote:
> > On Tue, Mar 05, 2024 at 02:10:20AM -0800, Kees Cook wrote:
> > > [...]
> > > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > > index f26ac9a6ef9f..058d0e3cd181 100644
> > > --- a/include/linux/slab.h
> > > +++ b/include/linux/slab.h
> > > @@ -493,6 +493,11 @@ void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
> > >  			   gfp_t gfpflags) __assume_slab_alignment __malloc;
> > >  void kmem_cache_free(struct kmem_cache *s, void *objp);
> > >  
> > > +kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
> > > +				  slab_flags_t flags,
> > > +				  unsigned int useroffset, unsigned int usersize,
> > > +				  void (*ctor)(void *));
> > 
> > I'd prefer an API that initializes an object over one that allocates it
> > - that is, prefer
> > 
> > kmem_buckets_init(kmem_buckets *buckets, ...)
> 
> Sure, that can work. kmem_cache_init() would need to exist for the same
> reason though.

That'll be a very worthwhile addition too; IPC when running kernel code
is always crap, and dependent loads are a big part of that.

I did mempool_init() and bioset_init() awhile back, so it's someone
else's turn for this one :)

> Sure, I think it'll depend on how the per-site allocations got wired up.
> I think you're meaning to include a full copy of the kmem cache/bucket
> struct with the codetag instead of just a pointer? I don't think that'll
> work well to make it runtime selectable, and I don't see it using an
> extra deref -- allocations already get the struct from somewhere and
> deref it. The only change is where to find the struct.

The codetags are in their own dedicated elf sections already, so if you
put the kmem_buckets in the codetag the entire elf section can be
discarded if it's not in use.

Also, the issue isn't derefs - it's dependent loads and locality. Taking
the address of the kmem_buckets to pass it is fine; the data referred to
will still get pulled into cache when we touch the codetag. If it's
behind a pointer we have to pull the codetag into cache, wait for that
so we can get the kmem_buckets pointer - then start to pull in the
kmem_buckets itself.

If it's a cache miss you just slowed the entire allocation down by
around 30 ns.
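
Concretely, the difference is between something like this (hypothetical
layout, the codetag side isn't settled yet):

	struct alloc_tag {
		struct codetag	ct;
		kmem_buckets	buckets;	/* comes into cache with the tag */
	};

versus a "kmem_buckets *buckets" member, where you eat a second, dependent
cache miss before you can even start the allocation.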

> > I'm curious what all the arguments to kmem_buckets_create() are needed
> > for, if this is supposed to be a replacement for kmalloc() users.
> 
> Are you confusing kmem_buckets_create() with kmem_buckets_alloc()? These
> args are needed to initialize the per-bucket caches, just like is
> already done for the global kmalloc per-bucket caches. This mirrors
> kmem_cache_create(). (Or more specifically, calls kmem_cache_create()
> for each bucket size, so the args need to be passed through.)
> 
> If you mean "why expose these arguments because they can just use the
> existing defaults already used by the global kmalloc caches" then I
> would say, it's to gain the benefit here of narrowing the scope of the
> usercopy offsets. Right now kmalloc is forced to allow the full usercopy
> window into an allocation, but we don't have to do this any more. For
> example, see patch 8, where struct msg_msg doesn't need to expose the
> header to userspace:

"usercopy window"? You're now annotating which data can be copied to
userspace?

I'm skeptical, this looks like defensive programming gone amuck to me.
 
> 	msg_buckets = kmem_buckets_create("msg_msg", 0, SLAB_ACCOUNT,
> 					  sizeof(struct msg_msg),
> 					  DATALEN_MSG, NULL);

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/9] slab: Introduce kmem_buckets_create()
  2024-03-25 21:49       ` Kent Overstreet
@ 2024-03-25 23:13         ` Kees Cook
  0 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2024-03-25 23:13 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, GONG, Ruiqi, Xiu Jianfeng, Suren Baghdasaryan,
	Jann Horn, Matteo Rizzo, linux-kernel, linux-hardening

On Mon, Mar 25, 2024 at 05:49:49PM -0400, Kent Overstreet wrote:
> The codetags are in their own dedicated elf sections already, so if you
> put the kmem_buckets in the codetag the entire elf section can be
> discarded if it's not in use.

Gotcha. Yeah, sounds good. Once codetags and this series land, I can
start working on making the per-site series.

> "usercopy window"? You're now annotating which data can be copied to
> userspace?

Hm? Yes. That's been there for over 7 years. :) It's just that it was only
meaningful for kmem_cache_create() users, since the proposed GFP_USERCOPY
for kmalloc() never landed[1].

-Kees

[1] https://lore.kernel.org/lkml/1497915397-93805-23-git-send-email-keescook@chromium.org/

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/9] slab: Introduce dedicated bucket allocator
  2024-03-25 18:24   ` Kees Cook
@ 2024-03-26 18:07     ` julien.voisin
  2024-03-26 19:41       ` Kees Cook
  0 siblings, 1 reply; 23+ messages in thread
From: julien.voisin @ 2024-03-26 18:07 UTC (permalink / raw)
  To: Kees Cook, Vlastimil Babka, Julien Voisin
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo, GONG, Ruiqi,
	Xiu Jianfeng, Suren Baghdasaryan, Kent Overstreet, Jann Horn,
	Matteo Rizzo, linux-kernel, linux-mm, linux-hardening

25 March 2024 at 19:24, "Kees Cook" <keescook@chromium.org> wrote:



> On Mon, Mar 25, 2024 at 10:03:23AM +0100, Vlastimil Babka wrote:
> > On 3/5/24 11:10 AM, Kees Cook wrote:
> > [...]
> > 
> > Hi Kees,
> > 
> > after reading [1] I think the points should be addressed, mainly about the
> > feasibility of converting users manually.
> 
> Sure, I can do that.
> 
> Adding Julien to this thread... Julien can you please respond to LKML
> patches in email? It's much easier to keep things in a single thread. :)
> 
> ] This is playing wack-a-mole
> 
> Kind of, but not really. These patches provide a mechanism for having
> dedicated dynamically-sized slab caches (to match kmem_cache_create(),
> which only works for fixed-size allocations). This is needed to expand
> the codetag work into doing per-call-site allocations, as I detailed
> here[1].
> 
> Also, adding uses manually isn't very difficult, as can be seen in the
> examples I included. In fact, my examples between v1 and v2 collapsed
> from 3 to 2, because covering memdup_user() actually covered 2 known
> allocation paths (attrs and vma names), and given its usage pattern,
> will cover more in the future without changes.

It's not about difficulty, it's about scale. There are hundreds of interesting structures: I'm worried that no one will take the time to add a separate bucket for each of them, chase their call-sites down, and monitor every single newly added structure to check if they are "interesting" and should benefit from their own bucket as well.

> ] something like AUTOSLAB would be better
> 
> Yes, that's the goal of [1]. This is a prerequisite for that, as
> mentioned in the cover letter.

This series looks unrelated to [1] to me: the former adds a mechanism to add buckets and expects developers to manually make use of them, while the latter is about adding infrastructure to automate call-site-based segregation.

> ] The slabs needs to be pinned
> 
> Yes, and this is a general problem[2] with all kmalloc allocations, though.
> This isn't unique to this patch series. SLAB_VIRTUAL solves it, and
> is under development.

Then it would be nice to mention it in the series, as an acknowledged limitation.

> ] Lacks guard pages
> 
> Yes, and again, this is a general problem with all kmalloc allocations.
> Solving it, like SLAB_VIRTUAL, would be a complementary hardening
> improvement to the allocator generally.

Then it would also be nice to mention it, because currently it's unclear that those limitations are both known and will be properly addressed.

> ] PAX_USERCOPY has been marking these sites since 2012
> 
> Either it's whack-a-mole or it's not. :)

This annotation was added 12 years ago in PaX, and while it was state of the art back then, I think that in 2024 we can do better than this.

> PAX_USERCOPY shows that it _is_ possible to mark all sites.

It shows that it's possible to annotate some sites (17 in grsecurity-3.1-4.9.9-201702122044.patch), and while it has a similar approach to your series, its annotations aren't conveying the same meaning.

> Regardless, like AUTOSLAB, PAX_USERCOPY isn't
> upstream, and its current implementation is an unpublished modification
> to a GPL project. I look forward to someone proposing it for inclusion
> in Linux, but for now we can work with the patches where an effort _has_
> been made to upstream them for the benefit of the entire ecosystem.
> 
> ] What about CONFIG_KMALLOC_SPLIT_VARSIZE
> 
> This proposed improvement is hampered by not having dedicated
> _dynamically_ sized kmem caches, which this series provides. And with
> codetag-split allocations[1], the goals of CONFIG_KMALLOC_SPLIT_VARSIZE
> are more fully realized, providing much more complete coverage.

CONFIG_KMALLOC_SPLIT_VARSIZE has been bypassed dozens of times in various ways as part of Google's kernelCTF.
Your series is, to my understanding, a weaker form of it. So I'm not super-convinced that it's the right approach to mitigate UAF.

Do you think it would be possible for Google to add this series to its kernelCTF, to gather empirical data on how feasible/easy it is to bypass it?

> ] I have no idea how the community around the Linux kernel works with
> ] their email-based workflows
> 
> Step 1: reply to the proposal in email instead of (or perhaps in
> addition to) making blog posts. :)
> 
> > On a related technical note I
> > worry what will become of /proc/slabinfo when we convert non-trivial amounts
> > of users.
> 
> It gets longer. :) And potentially makes the codetag /proc file
> redundant. All that said, there are very few APIs in the kernel where
> userspace can control both the size and contents of an allocation.
> 
> > Also would be interested to hear Jann Horn et al.'s opinion, and whether the
> > SLAB_VIRTUAL effort will continue?
> 
> SLAB_VIRTUAL is needed to address the reclamation UAF gap, and is
> still being developed. I don't intend to let it fall off the radar.
> (Which is why I included Jann and Matteo in CC originally.)
> 
> In the meantime, adding this series as-is kills two long-standing
> exploitation methodologies, and paves the way to providing very
> fine-grained caches using codetags (which I imagine would be entirely
> optional and trivial to control with a boot param).
> 
> -Kees
> 
> [1] https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook/
> [2] https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html
> 
> -- 
> Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/9] slab: Introduce dedicated bucket allocator
  2024-03-26 18:07     ` julien.voisin
@ 2024-03-26 19:41       ` Kees Cook
  0 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2024-03-26 19:41 UTC (permalink / raw)
  To: julien.voisin
  Cc: Vlastimil Babka, Julien Voisin, Andrew Morton, Christoph Lameter,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Roman Gushchin,
	Hyeonggon Yoo, GONG, Ruiqi, Xiu Jianfeng, Suren Baghdasaryan,
	Kent Overstreet, Jann Horn, Matteo Rizzo, linux-kernel, linux-mm,
	linux-hardening

On Tue, Mar 26, 2024 at 06:07:07PM +0000, julien.voisin@dustri.org wrote:
> 25 March 2024 at 19:24, "Kees Cook" <keescook@chromium.org> wrote:
> > On Mon, Mar 25, 2024 at 10:03:23AM +0100, Vlastimil Babka wrote:
> > > On 3/5/24 11:10 AM, Kees Cook wrote:
> > >  [...]
> > >  
> > >  Hi Kees,
> > >  
> > >  after reading [1] I think the points should be addressed, mainly about the
> > >  feasibility of converting users manually.
> >
> > Sure, I can do that.
> > Adding Julien to this thread... Julien can you please respond to LKML
> > patches in email? It's much easier to keep things in a single thread. :)
> >
> > ] This is playing wack-a-mole
> > Kind of, but not really. These patches provide a mechanism for having
> > dedicated dynamically-sized slab caches (to match kmem_cache_create(),
> > which only works for fixed-size allocations). This is needed to expand
> > the codetag work into doing per-call-site allocations, as I detailed
> > here[1].
> >
> > Also, adding uses manually isn't very difficult, as can be seen in the
> > examples I included. In fact, my examples between v1 and v2 collapsed
> > from 3 to 2, because covering memdup_user() actually covered 2 known
> > allocation paths (attrs and vma names), and given its usage pattern,
> > will cover more in the future without changes.
> 
> It's not about difficulty, it's about scale. There are hundreds of interesting structures: I'm worried that no one will take the time to add a separate bucket for each of them, chase their call-sites down, and monitor every single newly added structure to check if they are "interesting" and should benefit from their own bucket as well.

Very few are both: 1) dynamically sized, and 2) coming from userspace,
so I think the scale is fine.

> > ] something like AUTOSLAB would be better
> > Yes, that's the goal of [1]. This is a prerequisite for that, as
> > mentioned in the cover letter.
> 
> This series looks unrelated to [1] to me: the former adds a mechanism to add buckets and expects developers to manually make use of them, while the latter is about adding infrastructure to automate call-site-based segregation.

Right -- but for call-site-based separation, there is currently no way
to separate _dynamically_ sized allocations; only fixed size (via
kmem_cache_create()). This series adds the ability for call-site-based
separation to also use kmem_buckets_create(). Call-site-based
separation isn't possible without this series.

> 
> > ] The slabs needs to be pinned
> > Yes, and this is a general problem[2] with all kmalloc allocations, though.
> > This isn't unique to this patch series. SLAB_VIRTUAL solves it, and
> > is under development.
> 
> Then it would be nice to mention it in the series, as an acknowledged limitation.

Sure, I can update the cover letter.

> 
> > ] Lacks guard pages
> > Yes, and again, this is a general problem with all kmalloc allocations.
> > Solving it, like SLAB_VIRTUAL, would be a complementary hardening
> > improvement to the allocator generally.
> 
> Then it would also be nice to mention it, because currently it's unclear that those limitations are both known and will be properly addressed.

Sure. For both this and pinning, the issues are orthogonal, so it didn't
seem useful to distract from what the series was doing, but I can
explicitly mention them going forward.

> 
> > ] PAX_USERCOPY has been marking these sites since 2012
> > Either it's whack-a-mole or it's not. :) 
> 
> This annotation was added 12 years ago in PaX, and while it was state of the art back then, I think that in 2024 we can do better than this.

Agreed. Here's my series to start that. :)

> > PAX_USERCOPY shows that it _is_ possible to mark all sites.
> 
> It shows that it's possible to annotate some sites (17 in grsecurity-3.1-4.9.9-201702122044.patch), and while it has a similar approach to your series, its annotations aren't conveying the same meaning.

Sure, GFP_USERCOPY is separate.

> > Regardless, like AUTOSLAB, PAX_USERCOPY isn't
> > upstream, and its current implementation is an unpublished modification
> > to a GPL project. I look forward to someone proposing it for inclusion
> > in Linux, but for now we can work with the patches where an effort _has_
> > been made to upstream them for the benefit of the entire ecosystem.
> > ] What about CONFIG_KMALLOC_SPLIT_VARSIZE
> > This proposed improvement is hampered by not having dedicated
> > _dynamically_ sized kmem caches, which this series provides. And with
> > codetag-split allocations[1], the goals of CONFIG_KMALLOC_SPLIT_VARSIZE
> > are more fully realized, providing much more complete coverage.
> 
> CONFIG_KMALLOC_SPLIT_VARSIZE has been bypassed dozens of times in various ways as part of Google's kernelCTF.
> Your series is, to my understanding, a weaker form of it. So I'm not super-convinced that it's the right approach to mitigate UAF.

This series doesn't do anything that CONFIG_KMALLOC_SPLIT_VARSIZE does.
The call-site-separation series (which would depend on this series)
would do that work.

> Do you think it would be possible for Google to add this series to its kernelCTF, to gather empirical data on how feasible/easy it is to bypass it?

Sure, feel free to make that happen. :) But again, I'm less interested
in this series as a _standalone_ solution. It's a prerequisite for
call-site-based allocation separation. As part of it, though, we can
plug the blatant exploitation methods that currently exist.

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2024-03-26 19:41 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-05 10:10 [PATCH v2 0/9] slab: Introduce dedicated bucket allocator Kees Cook
2024-03-05 10:10 ` [PATCH v2 1/9] slab: Introduce kmem_buckets typedef Kees Cook
2024-03-05 10:10 ` [PATCH v2 2/9] slub: Plumb kmem_buckets into __do_kmalloc_node() Kees Cook
2024-03-05 10:10 ` [PATCH v2 3/9] util: Introduce __kvmalloc_node() that can take kmem_buckets argument Kees Cook
2024-03-05 10:10 ` [PATCH v2 4/9] slab: Introduce kmem_buckets_create() Kees Cook
2024-03-25 19:40   ` Kent Overstreet
2024-03-25 20:40     ` Kees Cook
2024-03-25 21:49       ` Kent Overstreet
2024-03-25 23:13         ` Kees Cook
2024-03-05 10:10 ` [PATCH v2 5/9] slab: Introduce kmem_buckets_alloc() Kees Cook
2024-03-05 10:10 ` [PATCH v2 6/9] slub: Introduce kmem_buckets_alloc_track_caller() Kees Cook
2024-03-05 10:10 ` [PATCH v2 7/9] slab: Introduce kmem_buckets_valloc() Kees Cook
2024-03-05 10:10 ` [PATCH v2 8/9] ipc, msg: Use dedicated slab buckets for alloc_msg() Kees Cook
2024-03-05 10:10 ` [PATCH v2 9/9] mm/util: Use dedicated slab buckets for memdup_user() Kees Cook
2024-03-06  1:47 ` [PATCH v2 0/9] slab: Introduce dedicated bucket allocator GONG, Ruiqi
2024-03-07 20:31   ` Kees Cook
2024-03-15 10:28     ` GONG, Ruiqi
2024-03-25  9:03 ` Vlastimil Babka
2024-03-25 18:24   ` Kees Cook
2024-03-26 18:07     ` julien.voisin
2024-03-26 19:41       ` Kees Cook
2024-03-25 19:32   ` Kent Overstreet
2024-03-25 20:26     ` Kees Cook
