linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] slab cleanups
@ 2022-02-21 10:53 Hyeonggon Yoo
  2022-02-21 10:53 ` [PATCH 1/5] mm/sl[au]b: Unify __ksize() Hyeonggon Yoo
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Hyeonggon Yoo @ 2022-02-21 10:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Andrew Morton, Vlastimil Babka, linux-kernel,
	Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
	Hyeonggon Yoo

Hi, these are cleanup patches for slab. Please consider them for slab-next :)
The locking correctness of patch 5 was checked with lockdep.

Hyeonggon Yoo (5):
  mm/sl[au]b: Unify __ksize()
  mm/sl[auo]b: Do not export __ksize()
  mm/slab: Do not call kmalloc_large() for unsupported size
  mm/slub: Limit min_partial only in cache creation
  mm/slub: Refactor deactivate_slab()

 include/linux/slab.h |  23 +++++++---
 mm/slab.c            |  23 ----------
 mm/slab_common.c     |  28 ++++++++++++
 mm/slob.c            |   1 -
 mm/slub.c            | 101 +++++++++++++++++--------------------------
 5 files changed, 84 insertions(+), 92 deletions(-)

-- 
2.33.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/5] mm/sl[au]b: Unify __ksize()
  2022-02-21 10:53 [PATCH 0/5] slab cleanups Hyeonggon Yoo
@ 2022-02-21 10:53 ` Hyeonggon Yoo
  2022-02-23 18:39   ` Vlastimil Babka
  2022-02-21 10:53 ` [PATCH 2/5] mm/sl[auo]b: Do not export __ksize() Hyeonggon Yoo
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 27+ messages in thread
From: Hyeonggon Yoo @ 2022-02-21 10:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Andrew Morton, Vlastimil Babka, linux-kernel,
	Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
	Hyeonggon Yoo

Only SLOB needs to implement __ksize() separately, because SLOB records the
size in the object header for kmalloc objects. Unify SLAB/SLUB's __ksize().

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slab.c        | 23 -----------------------
 mm/slab_common.c | 29 +++++++++++++++++++++++++++++
 mm/slub.c        | 16 ----------------
 3 files changed, 29 insertions(+), 39 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index ddf5737c63d9..eb73d2499480 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -4199,27 +4199,4 @@ void __check_heap_object(const void *ptr, unsigned long n,
 }
 #endif /* CONFIG_HARDENED_USERCOPY */
 
-/**
- * __ksize -- Uninstrumented ksize.
- * @objp: pointer to the object
- *
- * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
- * safety checks as ksize() with KASAN instrumentation enabled.
- *
- * Return: size of the actual memory used by @objp in bytes
- */
-size_t __ksize(const void *objp)
-{
-	struct kmem_cache *c;
-	size_t size;
 
-	BUG_ON(!objp);
-	if (unlikely(objp == ZERO_SIZE_PTR))
-		return 0;
-
-	c = virt_to_cache(objp);
-	size = c ? c->object_size : 0;
-
-	return size;
-}
-EXPORT_SYMBOL(__ksize);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 23f2ab0713b7..488997db0d97 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1245,6 +1245,35 @@ void kfree_sensitive(const void *p)
 }
 EXPORT_SYMBOL(kfree_sensitive);
 
+#ifndef CONFIG_SLOB
+/**
+ * __ksize -- Uninstrumented ksize.
+ * @objp: pointer to the object
+ *
+ * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
+ * safety checks as ksize() with KASAN instrumentation enabled.
+ *
+ * Return: size of the actual memory used by @objp in bytes
+ */
+size_t __ksize(const void *object)
+{
+	struct folio *folio;
+
+	if (unlikely(object == ZERO_SIZE_PTR))
+		return 0;
+
+	folio = virt_to_folio(object);
+
+#ifdef CONFIG_SLUB
+	if (unlikely(!folio_test_slab(folio)))
+		return folio_size(folio);
+#endif
+
+	return slab_ksize(folio_slab(folio)->slab_cache);
+}
+EXPORT_SYMBOL(__ksize);
+#endif
+
 /**
  * ksize - get the actual amount of memory allocated for a given object
  * @objp: Pointer to the object
diff --git a/mm/slub.c b/mm/slub.c
index 261474092e43..3a4458976ab7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4526,22 +4526,6 @@ void __check_heap_object(const void *ptr, unsigned long n,
 }
 #endif /* CONFIG_HARDENED_USERCOPY */
 
-size_t __ksize(const void *object)
-{
-	struct folio *folio;
-
-	if (unlikely(object == ZERO_SIZE_PTR))
-		return 0;
-
-	folio = virt_to_folio(object);
-
-	if (unlikely(!folio_test_slab(folio)))
-		return folio_size(folio);
-
-	return slab_ksize(folio_slab(folio)->slab_cache);
-}
-EXPORT_SYMBOL(__ksize);
-
 void kfree(const void *x)
 {
 	struct folio *folio;
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread
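
For reference, the reason SLOB keeps its own __ksize() is that SLOB's kmalloc
path stores the allocation size immediately in front of the returned object.
A simplified sketch of that lookup (paraphrased from mm/slob.c; details such
as the exact alignment computation are from memory, not from this thread):

size_t __ksize(const void *block)
{
	struct folio *folio;
	unsigned int align;
	unsigned int *m;

	BUG_ON(!block);
	if (unlikely(block == ZERO_SIZE_PTR))
		return 0;

	folio = virt_to_folio(block);
	if (unlikely(!folio_test_slab(folio)))
		return folio_size(folio);

	/* kmalloc objects carry their size just before the object */
	align = max_t(unsigned int, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
	m = (unsigned int *)(block - align);
	return SLOB_UNITS(*m) * SLOB_UNIT;
}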

* [PATCH 2/5] mm/sl[auo]b: Do not export __ksize()
  2022-02-21 10:53 [PATCH 0/5] slab cleanups Hyeonggon Yoo
  2022-02-21 10:53 ` [PATCH 1/5] mm/sl[au]b: Unify __ksize() Hyeonggon Yoo
@ 2022-02-21 10:53 ` Hyeonggon Yoo
  2022-02-21 15:46   ` Matthew Wilcox
  2022-02-21 10:53 ` [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size Hyeonggon Yoo
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 27+ messages in thread
From: Hyeonggon Yoo @ 2022-02-21 10:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Andrew Morton, Vlastimil Babka, linux-kernel,
	Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
	Hyeonggon Yoo

Do not export __ksize(). Only kasan calls __ksize() directly.

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slab_common.c | 1 -
 mm/slob.c        | 1 -
 2 files changed, 2 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 488997db0d97..f2c021b57579 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1271,7 +1271,6 @@ size_t __ksize(const void *object)
 
 	return slab_ksize(folio_slab(folio)->slab_cache);
 }
-EXPORT_SYMBOL(__ksize);
 #endif
 
 /**
diff --git a/mm/slob.c b/mm/slob.c
index 60c5842215f1..d8af6c54f133 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -588,7 +588,6 @@ size_t __ksize(const void *block)
 	m = (unsigned int *)(block - align);
 	return SLOB_UNITS(*m) * SLOB_UNIT;
 }
-EXPORT_SYMBOL(__ksize);
 
 int __kmem_cache_create(struct kmem_cache *c, slab_flags_t flags)
 {
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
  2022-02-21 10:53 [PATCH 0/5] slab cleanups Hyeonggon Yoo
  2022-02-21 10:53 ` [PATCH 1/5] mm/sl[au]b: Unify __ksize() Hyeonggon Yoo
  2022-02-21 10:53 ` [PATCH 2/5] mm/sl[auo]b: Do not export __ksize() Hyeonggon Yoo
@ 2022-02-21 10:53 ` Hyeonggon Yoo
  2022-02-21 15:53   ` Matthew Wilcox
  2022-02-24 12:48   ` Vlastimil Babka
  2022-02-21 10:53 ` [PATCH 4/5] mm/slub: Limit min_partial only in cache creation Hyeonggon Yoo
  2022-02-21 10:53 ` [PATCH 5/5] mm/slub: Refactor deactivate_slab() Hyeonggon Yoo
  4 siblings, 2 replies; 27+ messages in thread
From: Hyeonggon Yoo @ 2022-02-21 10:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Andrew Morton, Vlastimil Babka, linux-kernel,
	Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
	Hyeonggon Yoo

SLAB's kfree() does not support freeing an object that was allocated by
kmalloc_large(). Fix this, as SLAB does not pass requests larger than
KMALLOC_MAX_CACHE_SIZE directly to the page allocator.

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 include/linux/slab.h | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 37bde99b74af..aeda3e863f2b 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -564,15 +564,19 @@ static __always_inline __alloc_size(1) void *kmalloc_large(size_t size, gfp_t fl
  *	Try really hard to succeed the allocation but fail
  *	eventually.
  */
+#ifndef CONFIG_SLOB
 static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
 {
 	if (__builtin_constant_p(size)) {
-#ifndef CONFIG_SLOB
 		unsigned int index;
-#endif
-		if (size > KMALLOC_MAX_CACHE_SIZE)
-			return kmalloc_large(size, flags);
-#ifndef CONFIG_SLOB
+
+		if (size > KMALLOC_MAX_CACHE_SIZE) {
+			if (IS_ENABLED(CONFIG_SLUB))
+				return kmalloc_large(size, flags);
+			else
+				return NULL;
+		}
+
 		index = kmalloc_index(size);
 
 		if (!index)
@@ -581,10 +585,17 @@ static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
 		return kmem_cache_alloc_trace(
 				kmalloc_caches[kmalloc_type(flags)][index],
 				flags, size);
-#endif
 	}
 	return __kmalloc(size, flags);
 }
+#else
+static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
+{
+	if (__builtin_constant_p(size) && size > KMALLOC_MAX_CACHE_SIZE)
+		return kmalloc_large(size, flags);
+	return __kmalloc(size, flags);
+}
+#endif
 
 static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/5] mm/slub: Limit min_partial only in cache creation
  2022-02-21 10:53 [PATCH 0/5] slab cleanups Hyeonggon Yoo
                   ` (2 preceding siblings ...)
  2022-02-21 10:53 ` [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size Hyeonggon Yoo
@ 2022-02-21 10:53 ` Hyeonggon Yoo
  2022-02-22 23:48   ` David Rientjes
  2022-02-21 10:53 ` [PATCH 5/5] mm/slub: Refactor deactivate_slab() Hyeonggon Yoo
  4 siblings, 1 reply; 27+ messages in thread
From: Hyeonggon Yoo @ 2022-02-21 10:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Andrew Morton, Vlastimil Babka, linux-kernel,
	Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
	Hyeonggon Yoo

SLUB sets the minimum number of partial slabs per node (min_partial) using
set_min_partial(). SLUB keeps at least min_partial slabs on a node's partial
list, even if they are empty, to avoid excessive use of the page allocator.

set_min_partial() clamps min_partial between MIN_PARTIAL and MAX_PARTIAL.
As set_min_partial() can also be called from min_partial_store(), clamp
min_partial only in kmem_cache_open() so that it can later be changed to
whatever value a user wants.

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 3a4458976ab7..a4964deccb61 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4002,10 +4002,6 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
 
 static void set_min_partial(struct kmem_cache *s, unsigned long min)
 {
-	if (min < MIN_PARTIAL)
-		min = MIN_PARTIAL;
-	else if (min > MAX_PARTIAL)
-		min = MAX_PARTIAL;
 	s->min_partial = min;
 }
 
@@ -4184,6 +4180,8 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
 
 static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
 {
+	int min_partial;
+
 	s->flags = kmem_cache_flags(s->size, flags, s->name);
 #ifdef CONFIG_SLAB_FREELIST_HARDENED
 	s->random = get_random_long();
@@ -4215,7 +4213,10 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
 	 * The larger the object size is, the more slabs we want on the partial
 	 * list to avoid pounding the page allocator excessively.
 	 */
-	set_min_partial(s, ilog2(s->size) / 2);
+	min_partial = min(MAX_PARTIAL, ilog2(s->size) / 2);
+	min_partial = max(MIN_PARTIAL, min_partial);
+
+	set_min_partial(s, min_partial);
 
 	set_cpu_partial(s);
 
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 5/5] mm/slub: Refactor deactivate_slab()
  2022-02-21 10:53 [PATCH 0/5] slab cleanups Hyeonggon Yoo
                   ` (3 preceding siblings ...)
  2022-02-21 10:53 ` [PATCH 4/5] mm/slub: Limit min_partial only in cache creation Hyeonggon Yoo
@ 2022-02-21 10:53 ` Hyeonggon Yoo
  2022-02-24 18:16   ` Vlastimil Babka
  4 siblings, 1 reply; 27+ messages in thread
From: Hyeonggon Yoo @ 2022-02-21 10:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Andrew Morton, Vlastimil Babka, linux-kernel,
	Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
	Hyeonggon Yoo

Simplify deactivate_slab() by removing the variable 'lock' and replacing
'l' and 'm' with a single 'mode'. Instead, remove the slab from the list
and unlock n->list_lock when cmpxchg_double() fails, and then retry.

One slight functional change is releasing and re-taking n->list_lock when
cmpxchg_double() fails. This is not harmful because SLUB avoids
deactivating slabs as much as possible.

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 74 +++++++++++++++++++++++++------------------------------
 1 file changed, 33 insertions(+), 41 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index a4964deccb61..2d0663befb9e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2350,8 +2350,8 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
 {
 	enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
 	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
-	int lock = 0, free_delta = 0;
-	enum slab_modes l = M_NONE, m = M_NONE;
+	int free_delta = 0;
+	enum slab_modes mode = M_NONE;
 	void *nextfree, *freelist_iter, *freelist_tail;
 	int tail = DEACTIVATE_TO_HEAD;
 	unsigned long flags = 0;
@@ -2420,57 +2420,49 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
 	new.frozen = 0;
 
 	if (!new.inuse && n->nr_partial >= s->min_partial)
-		m = M_FREE;
+		mode = M_FREE;
 	else if (new.freelist) {
-		m = M_PARTIAL;
-		if (!lock) {
-			lock = 1;
-			/*
-			 * Taking the spinlock removes the possibility that
-			 * acquire_slab() will see a slab that is frozen
-			 */
-			spin_lock_irqsave(&n->list_lock, flags);
-		}
-	} else {
-		m = M_FULL;
-		if (kmem_cache_debug_flags(s, SLAB_STORE_USER) && !lock) {
-			lock = 1;
-			/*
-			 * This also ensures that the scanning of full
-			 * slabs from diagnostic functions will not see
-			 * any frozen slabs.
-			 */
-			spin_lock_irqsave(&n->list_lock, flags);
-		}
+		mode = M_PARTIAL;
+		/*
+		 * Taking the spinlock removes the possibility that
+		 * acquire_slab() will see a slab that is frozen
+		 */
+		spin_lock_irqsave(&n->list_lock, flags);
+		add_partial(n, slab, tail);
+	} else if (kmem_cache_debug_flags(s, SLAB_STORE_USER)) {
+		mode = M_FULL;
+		/*
+		 * This also ensures that the scanning of full
+		 * slabs from diagnostic functions will not see
+		 * any frozen slabs.
+		 */
+		spin_lock_irqsave(&n->list_lock, flags);
+		add_full(s, n, slab);
 	}
 
-	if (l != m) {
-		if (l == M_PARTIAL)
-			remove_partial(n, slab);
-		else if (l == M_FULL)
-			remove_full(s, n, slab);
 
-		if (m == M_PARTIAL)
-			add_partial(n, slab, tail);
-		else if (m == M_FULL)
-			add_full(s, n, slab);
-	}
-
-	l = m;
 	if (!cmpxchg_double_slab(s, slab,
 				old.freelist, old.counters,
 				new.freelist, new.counters,
-				"unfreezing slab"))
+				"unfreezing slab")) {
+		if (mode == M_PARTIAL) {
+			remove_partial(n, slab);
+			spin_unlock_irqrestore(&n->list_lock, flags);
+		} else if (mode == M_FULL) {
+			remove_full(s, n, slab);
+			spin_unlock_irqrestore(&n->list_lock, flags);
+		}
 		goto redo;
+	}
 
-	if (lock)
-		spin_unlock_irqrestore(&n->list_lock, flags);
 
-	if (m == M_PARTIAL)
+	if (mode == M_PARTIAL) {
+		spin_unlock_irqrestore(&n->list_lock, flags);
 		stat(s, tail);
-	else if (m == M_FULL)
+	} else if (mode == M_FULL) {
+		spin_unlock_irqrestore(&n->list_lock, flags);
 		stat(s, DEACTIVATE_FULL);
-	else if (m == M_FREE) {
+	} else if (mode == M_FREE) {
 		stat(s, DEACTIVATE_EMPTY);
 		discard_slab(s, slab);
 		stat(s, FREE_SLAB);
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/5] mm/sl[auo]b: Do not export __ksize()
  2022-02-21 10:53 ` [PATCH 2/5] mm/sl[auo]b: Do not export __ksize() Hyeonggon Yoo
@ 2022-02-21 15:46   ` Matthew Wilcox
  2022-02-23  3:26     ` Hyeonggon Yoo
  2022-02-23 18:40     ` Vlastimil Babka
  0 siblings, 2 replies; 27+ messages in thread
From: Matthew Wilcox @ 2022-02-21 15:46 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
	linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
	Pekka Enberg

On Mon, Feb 21, 2022 at 10:53:33AM +0000, Hyeonggon Yoo wrote:
> Do not export __ksize(). Only kasan calls __ksize() directly.

We should probably delete (most of) the kernel-doc comment on __ksize,
leaving only the note that it's uninstrumented and only called by
kasan.  Also, we should probably move the declaration of __ksize from
include/linux/slab.h to mm/slab.h since we don't want to grow new callers.


^ permalink raw reply	[flat|nested] 27+ messages in thread
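
A minimal sketch of what that could look like: drop the declaration from
include/linux/slab.h and keep a trimmed comment next to a declaration in
mm/slab.h. The exact wording and placement here are assumptions, not
something spelled out in the thread:

/* mm/slab.h */

/*
 * __ksize() is uninstrumented and does not do the safety checks that
 * ksize() does with KASAN instrumentation enabled; it is only meant
 * to be called from KASAN internals.
 */
size_t __ksize(const void *objp);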

* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
  2022-02-21 10:53 ` [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size Hyeonggon Yoo
@ 2022-02-21 15:53   ` Matthew Wilcox
  2022-02-22  8:10     ` Hyeonggon Yoo
  2022-02-24 12:48   ` Vlastimil Babka
  1 sibling, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2022-02-21 15:53 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
	linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
	Pekka Enberg

On Mon, Feb 21, 2022 at 10:53:34AM +0000, Hyeonggon Yoo wrote:
> SLAB's kfree() does not support freeing an object that is allocated from
> kmalloc_large(). Fix this as SLAB do not pass requests larger than
> KMALLOC_MAX_CACHE_SIZE directly to page allocator.

I was wondering if we wanted to go in the other direction and get rid of
kmalloc cache sizes larger than, say, 64kB from the SLAB allocator.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
  2022-02-21 15:53   ` Matthew Wilcox
@ 2022-02-22  8:10     ` Hyeonggon Yoo
  2022-02-22 19:59       ` Matthew Wilcox
  0 siblings, 1 reply; 27+ messages in thread
From: Hyeonggon Yoo @ 2022-02-22  8:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
	linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
	Pekka Enberg

On Mon, Feb 21, 2022 at 03:53:39PM +0000, Matthew Wilcox wrote:
> On Mon, Feb 21, 2022 at 10:53:34AM +0000, Hyeonggon Yoo wrote:
> > SLAB's kfree() does not support freeing an object that is allocated from
> > kmalloc_large(). Fix this as SLAB do not pass requests larger than
> > KMALLOC_MAX_CACHE_SIZE directly to page allocator.
> 
> I was wondering if we wanted to go in the other direction and get rid of
> kmalloc cache sizes larger than, say, 64kB from the SLAB allocator.

Good point.

Hmm.. I don't think SLAB benefits from queueing such large objects,
and the maximum size is still limited to what the buddy allocator supports.

I'll try reducing the kmalloc caches to at most an order-1 page, like SLUB.
That would be easier to maintain.

Thanks,
Hyeonggon

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
  2022-02-22  8:10     ` Hyeonggon Yoo
@ 2022-02-22 19:59       ` Matthew Wilcox
  2022-02-23  3:24         ` Hyeonggon Yoo
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2022-02-22 19:59 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
	linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
	Pekka Enberg

On Tue, Feb 22, 2022 at 08:10:32AM +0000, Hyeonggon Yoo wrote:
> On Mon, Feb 21, 2022 at 03:53:39PM +0000, Matthew Wilcox wrote:
> > On Mon, Feb 21, 2022 at 10:53:34AM +0000, Hyeonggon Yoo wrote:
> > > SLAB's kfree() does not support freeing an object that is allocated from
> > > kmalloc_large(). Fix this as SLAB do not pass requests larger than
> > > KMALLOC_MAX_CACHE_SIZE directly to page allocator.
> > 
> > I was wondering if we wanted to go in the other direction and get rid of
> > kmalloc cache sizes larger than, say, 64kB from the SLAB allocator.
> 
> Good point.
> 
> Hmm.. I don't think SLAB is benefiting from queueing that large objects,
> and maximum size is still limited to what buddy allocator supports.
> 
> I'll try reducing kmalloc caches up to order-1 page like SLUB.
> That would be easier to maintain.

If you have time to investigate these kinds of things, I think SLUB would
benefit from caching order-2 and order-3 slabs as well.  Maybe not so much
now that Mel included order-2 and order-3 caching in the page allocator.
But it'd be interesting to have numbers.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/5] mm/slub: Limit min_partial only in cache creation
  2022-02-21 10:53 ` [PATCH 4/5] mm/slub: Limit min_partial only in cache creation Hyeonggon Yoo
@ 2022-02-22 23:48   ` David Rientjes
  2022-02-23  3:37     ` Hyeonggon Yoo
  0 siblings, 1 reply; 27+ messages in thread
From: David Rientjes @ 2022-02-22 23:48 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
	linux-kernel, Joonsoo Kim, Christoph Lameter, Pekka Enberg

On Mon, 21 Feb 2022, Hyeonggon Yoo wrote:

> SLUB sets number of minimum partial slabs for node (min_partial) using
> set_min_partial(). SLUB holds at least min_partial slabs even if they're empty
> to avoid excessive use of page allocator.
> 
> set_min_partial() limits value of min_partial between MIN_PARTIAL and
> MAX_PARTIAL. As set_min_partial() can be called by min_partial_store()
> too, Only limit value of min_partial in kmem_cache_open() so that it
> can be changed to value that a user wants.
> 
> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

I think this makes sense and there is no reason to limit the bounds that 
may be set at runtime with undocumented behavior.

However, since set_min_partial() is only setting the value in the 
kmem_cache, could we remove the helper function entirely and fold it into 
its two callers?

> ---
>  mm/slub.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 3a4458976ab7..a4964deccb61 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4002,10 +4002,6 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>  
>  static void set_min_partial(struct kmem_cache *s, unsigned long min)
>  {
> -	if (min < MIN_PARTIAL)
> -		min = MIN_PARTIAL;
> -	else if (min > MAX_PARTIAL)
> -		min = MAX_PARTIAL;
>  	s->min_partial = min;
>  }
>  
> @@ -4184,6 +4180,8 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
>  
>  static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
>  {
> +	int min_partial;
> +
>  	s->flags = kmem_cache_flags(s->size, flags, s->name);
>  #ifdef CONFIG_SLAB_FREELIST_HARDENED
>  	s->random = get_random_long();
> @@ -4215,7 +4213,10 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
>  	 * The larger the object size is, the more slabs we want on the partial
>  	 * list to avoid pounding the page allocator excessively.
>  	 */
> -	set_min_partial(s, ilog2(s->size) / 2);
> +	min_partial = min(MAX_PARTIAL, ilog2(s->size) / 2);
> +	min_partial = max(MIN_PARTIAL, min_partial);
> +
> +	set_min_partial(s, min_partial);
>  
>  	set_cpu_partial(s);
>  
> -- 
> 2.33.1
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread
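
A rough sketch of the fold-in David suggests, using the kernel's clamp()
helper (from <linux/minmax.h>) at cache creation and a plain assignment in
the sysfs path; untested, and whether clamp() or the min()/max() pair reads
better is a style call:

	/* in kmem_cache_open(): clamp only at cache creation time */
	s->min_partial = clamp(ilog2(s->size) / 2, MIN_PARTIAL, MAX_PARTIAL);

	/* in min_partial_store(): accept whatever value the user asked for */
	s->min_partial = min;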

* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
  2022-02-22 19:59       ` Matthew Wilcox
@ 2022-02-23  3:24         ` Hyeonggon Yoo
  0 siblings, 0 replies; 27+ messages in thread
From: Hyeonggon Yoo @ 2022-02-23  3:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
	linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
	Pekka Enberg

On Tue, Feb 22, 2022 at 07:59:08PM +0000, Matthew Wilcox wrote:
> On Tue, Feb 22, 2022 at 08:10:32AM +0000, Hyeonggon Yoo wrote:
> > On Mon, Feb 21, 2022 at 03:53:39PM +0000, Matthew Wilcox wrote:
> > > On Mon, Feb 21, 2022 at 10:53:34AM +0000, Hyeonggon Yoo wrote:
> > > > SLAB's kfree() does not support freeing an object that is allocated from
> > > > kmalloc_large(). Fix this as SLAB do not pass requests larger than
> > > > KMALLOC_MAX_CACHE_SIZE directly to page allocator.
> > > 
> > > I was wondering if we wanted to go in the other direction and get rid of
> > > kmalloc cache sizes larger than, say, 64kB from the SLAB allocator.
> > 
> > Good point.
> > 
> > Hmm.. I don't think SLAB is benefiting from queueing that large objects,
> > and maximum size is still limited to what buddy allocator supports.
> > 
> > I'll try reducing kmalloc caches up to order-1 page like SLUB.
> > That would be easier to maintain.
> 
> If you have time to investigate these kinds of things, I think SLUB would
> benefit from caching order-2 and order-3 slabs as well.  Maybe not so much
> now that Mel included order-2 and order-3 caching in the page allocator.
> But it'd be interesting to have numbers.

That's an interesting topic, but I think it is slightly different from this one.
AFAIK it's rare that a workload benefits more from using slab for
large objects (8K, 16K, etc.) than from using the page allocator.

And yeah, caching high-order slabs may affect the numbers even if the page
allocator caches high-order pages. SLUB already caches them, and SLUB can
cache more slabs by tuning the number of cpu partial slabs (s->cpu_partial_slabs)
and the number of node partial slabs (s->min_partial).

I need to investigate what Mel actually did and learn how it affects
SLUB, so it will take some time. Thanks!

--
Hyeonggon

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/5] mm/sl[auo]b: Do not export __ksize()
  2022-02-21 15:46   ` Matthew Wilcox
@ 2022-02-23  3:26     ` Hyeonggon Yoo
  2022-02-23 18:40     ` Vlastimil Babka
  1 sibling, 0 replies; 27+ messages in thread
From: Hyeonggon Yoo @ 2022-02-23  3:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
	linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
	Pekka Enberg

On Mon, Feb 21, 2022 at 03:46:01PM +0000, Matthew Wilcox wrote:
> On Mon, Feb 21, 2022 at 10:53:33AM +0000, Hyeonggon Yoo wrote:
> > Do not export __ksize(). Only kasan calls __ksize() directly.
> 
> We should probably delete (most of) the kernel-doc comment on __ksize,
> leaving only the note that it's uninstrumented and only called by
> kasan.  Also, we should probably move the declaration of __ksize from
> include/linux/slab.h to mm/slab.h since we don't want to grow new callers.

Seems reasonable. I'll update it in the next version!

Thanks.

--
Hyeonggon

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/5] mm/slub: Limit min_partial only in cache creation
  2022-02-22 23:48   ` David Rientjes
@ 2022-02-23  3:37     ` Hyeonggon Yoo
  2022-02-24 12:52       ` Vlastimil Babka
  0 siblings, 1 reply; 27+ messages in thread
From: Hyeonggon Yoo @ 2022-02-23  3:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Roman Gushchin, Andrew Morton, Vlastimil Babka,
	linux-kernel, Joonsoo Kim, Christoph Lameter, Pekka Enberg

On Tue, Feb 22, 2022 at 03:48:16PM -0800, David Rientjes wrote:
> On Mon, 21 Feb 2022, Hyeonggon Yoo wrote:
> 
> > SLUB sets number of minimum partial slabs for node (min_partial) using
> > set_min_partial(). SLUB holds at least min_partial slabs even if they're empty
> > to avoid excessive use of page allocator.
> > 
> > set_min_partial() limits value of min_partial between MIN_PARTIAL and
> > MAX_PARTIAL. As set_min_partial() can be called by min_partial_store()
> > too, Only limit value of min_partial in kmem_cache_open() so that it
> > can be changed to value that a user wants.
> > 
> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> 
> I think this makes sense and there is no reason to limit the bounds that 
> may be set at runtime with undocumented behavior.

Thank you for comment.

> 
> However, since set_min_partial() is only setting the value in the 
> kmem_cache, could we remove the helper function entirely and fold it into 
> its two callers?

Right. We don't need to keep this as a separate function. I'll update it
in the next version. Thanks!

> 
> > ---
> >  mm/slub.c | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 3a4458976ab7..a4964deccb61 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4002,10 +4002,6 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
> >  
> >  static void set_min_partial(struct kmem_cache *s, unsigned long min)
> >  {
> > -	if (min < MIN_PARTIAL)
> > -		min = MIN_PARTIAL;
> > -	else if (min > MAX_PARTIAL)
> > -		min = MAX_PARTIAL;
> >  	s->min_partial = min;
> >  }
> >  
> > @@ -4184,6 +4180,8 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
> >  
> >  static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
> >  {
> > +	int min_partial;
> > +
> >  	s->flags = kmem_cache_flags(s->size, flags, s->name);
> >  #ifdef CONFIG_SLAB_FREELIST_HARDENED
> >  	s->random = get_random_long();
> > @@ -4215,7 +4213,10 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
> >  	 * The larger the object size is, the more slabs we want on the partial
> >  	 * list to avoid pounding the page allocator excessively.
> >  	 */
> > -	set_min_partial(s, ilog2(s->size) / 2);
> > +	min_partial = min(MAX_PARTIAL, ilog2(s->size) / 2);
> > +	min_partial = max(MIN_PARTIAL, min_partial);
> > +
> > +	set_min_partial(s, min_partial);
> >  
> >  	set_cpu_partial(s);
> >  
> > -- 
> > 2.33.1
> > 
> >

-- 
Hyeonggon

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/5] mm/sl[au]b: Unify __ksize()
  2022-02-21 10:53 ` [PATCH 1/5] mm/sl[au]b: Unify __ksize() Hyeonggon Yoo
@ 2022-02-23 18:39   ` Vlastimil Babka
  2022-02-23 19:06     ` Marco Elver
  0 siblings, 1 reply; 27+ messages in thread
From: Vlastimil Babka @ 2022-02-23 18:39 UTC (permalink / raw)
  To: Hyeonggon Yoo, linux-mm
  Cc: Roman Gushchin, Andrew Morton, linux-kernel, Joonsoo Kim,
	David Rientjes, Christoph Lameter, Pekka Enberg, Kees Cook,
	kasan-dev, Marco Elver

On 2/21/22 11:53, Hyeonggon Yoo wrote:
> Only SLOB need to implement __ksize() separately because SLOB records
> size in object header for kmalloc objects. Unify SLAB/SLUB's __ksize().
> 
> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> ---
>  mm/slab.c        | 23 -----------------------
>  mm/slab_common.c | 29 +++++++++++++++++++++++++++++
>  mm/slub.c        | 16 ----------------
>  3 files changed, 29 insertions(+), 39 deletions(-)
> 
> diff --git a/mm/slab.c b/mm/slab.c
> index ddf5737c63d9..eb73d2499480 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -4199,27 +4199,4 @@ void __check_heap_object(const void *ptr, unsigned long n,
>  }
>  #endif /* CONFIG_HARDENED_USERCOPY */
>  
> -/**
> - * __ksize -- Uninstrumented ksize.
> - * @objp: pointer to the object
> - *
> - * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
> - * safety checks as ksize() with KASAN instrumentation enabled.
> - *
> - * Return: size of the actual memory used by @objp in bytes
> - */
> -size_t __ksize(const void *objp)
> -{
> -	struct kmem_cache *c;
> -	size_t size;
>  
> -	BUG_ON(!objp);
> -	if (unlikely(objp == ZERO_SIZE_PTR))
> -		return 0;
> -
> -	c = virt_to_cache(objp);
> -	size = c ? c->object_size : 0;

This comes from commit a64b53780ec3 ("mm/slab: sanity-check page type when
looking up cache") by Kees and virt_to_cache() is an implicit check for
folio slab flag ...

> -
> -	return size;
> -}
> -EXPORT_SYMBOL(__ksize);
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 23f2ab0713b7..488997db0d97 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1245,6 +1245,35 @@ void kfree_sensitive(const void *p)
>  }
>  EXPORT_SYMBOL(kfree_sensitive);
>  
> +#ifndef CONFIG_SLOB
> +/**
> + * __ksize -- Uninstrumented ksize.
> + * @objp: pointer to the object
> + *
> + * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
> + * safety checks as ksize() with KASAN instrumentation enabled.
> + *
> + * Return: size of the actual memory used by @objp in bytes
> + */
> +size_t __ksize(const void *object)
> +{
> +	struct folio *folio;
> +
> +	if (unlikely(object == ZERO_SIZE_PTR))
> +		return 0;
> +
> +	folio = virt_to_folio(object);
> +
> +#ifdef CONFIG_SLUB
> +	if (unlikely(!folio_test_slab(folio)))
> +		return folio_size(folio);
> +#endif
> +
> +	return slab_ksize(folio_slab(folio)->slab_cache);

... and here in the common version you now for SLAB trust that the folio
will be a slab folio, thus undoing the intention of that commit. Maybe
that's not good and we should keep the folio_test_slab() for both cases?
Although maybe it's also strange that prior this patch, SLAB would return 0
if the test fails, and SLUB would return folio_size(). Probably because with
SLUB this can be a large kmalloc here and with SLAB not. So we could keep
doing that in the unified version, or KASAN devs (CC'd) could advise
something better?

> +}
> +EXPORT_SYMBOL(__ksize);
> +#endif
> +
>  /**
>   * ksize - get the actual amount of memory allocated for a given object
>   * @objp: Pointer to the object
> diff --git a/mm/slub.c b/mm/slub.c
> index 261474092e43..3a4458976ab7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4526,22 +4526,6 @@ void __check_heap_object(const void *ptr, unsigned long n,
>  }
>  #endif /* CONFIG_HARDENED_USERCOPY */
>  
> -size_t __ksize(const void *object)
> -{
> -	struct folio *folio;
> -
> -	if (unlikely(object == ZERO_SIZE_PTR))
> -		return 0;
> -
> -	folio = virt_to_folio(object);
> -
> -	if (unlikely(!folio_test_slab(folio)))
> -		return folio_size(folio);
> -
> -	return slab_ksize(folio_slab(folio)->slab_cache);
> -}
> -EXPORT_SYMBOL(__ksize);
> -
>  void kfree(const void *x)
>  {
>  	struct folio *folio;


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/5] mm/sl[auo]b: Do not export __ksize()
  2022-02-21 15:46   ` Matthew Wilcox
  2022-02-23  3:26     ` Hyeonggon Yoo
@ 2022-02-23 18:40     ` Vlastimil Babka
  1 sibling, 0 replies; 27+ messages in thread
From: Vlastimil Babka @ 2022-02-23 18:40 UTC (permalink / raw)
  To: Matthew Wilcox, Hyeonggon Yoo
  Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
	Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg

On 2/21/22 16:46, Matthew Wilcox wrote:
> On Mon, Feb 21, 2022 at 10:53:33AM +0000, Hyeonggon Yoo wrote:
>> Do not export __ksize(). Only kasan calls __ksize() directly.
> 
> We should probably delete (most of) the kernel-doc comment on __ksize,
> leaving only the note that it's uninstrumented and only called by
> kasan.  Also, we should probably move the declaration of __ksize from
> include/linux/slab.h to mm/slab.h since we don't want to grow new callers.

Agreed, thanks.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/5] mm/sl[au]b: Unify __ksize()
  2022-02-23 18:39   ` Vlastimil Babka
@ 2022-02-23 19:06     ` Marco Elver
  2022-02-24 12:26       ` Vlastimil Babka
  0 siblings, 1 reply; 27+ messages in thread
From: Marco Elver @ 2022-02-23 19:06 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hyeonggon Yoo, linux-mm, Roman Gushchin, Andrew Morton,
	linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
	Pekka Enberg, Kees Cook, kasan-dev, Andrey Konovalov

On Wed, 23 Feb 2022 at 19:39, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 2/21/22 11:53, Hyeonggon Yoo wrote:
> > Only SLOB need to implement __ksize() separately because SLOB records
> > size in object header for kmalloc objects. Unify SLAB/SLUB's __ksize().
> >
> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > ---
> >  mm/slab.c        | 23 -----------------------
> >  mm/slab_common.c | 29 +++++++++++++++++++++++++++++
> >  mm/slub.c        | 16 ----------------
> >  3 files changed, 29 insertions(+), 39 deletions(-)
> >
> > diff --git a/mm/slab.c b/mm/slab.c
> > index ddf5737c63d9..eb73d2499480 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -4199,27 +4199,4 @@ void __check_heap_object(const void *ptr, unsigned long n,
> >  }
> >  #endif /* CONFIG_HARDENED_USERCOPY */
> >
> > -/**
> > - * __ksize -- Uninstrumented ksize.
> > - * @objp: pointer to the object
> > - *
> > - * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
> > - * safety checks as ksize() with KASAN instrumentation enabled.
> > - *
> > - * Return: size of the actual memory used by @objp in bytes
> > - */
> > -size_t __ksize(const void *objp)
> > -{
> > -     struct kmem_cache *c;
> > -     size_t size;
> >
> > -     BUG_ON(!objp);
> > -     if (unlikely(objp == ZERO_SIZE_PTR))
> > -             return 0;
> > -
> > -     c = virt_to_cache(objp);
> > -     size = c ? c->object_size : 0;
>
> This comes from commit a64b53780ec3 ("mm/slab: sanity-check page type when
> looking up cache") by Kees and virt_to_cache() is an implicit check for
> folio slab flag ...
>
> > -
> > -     return size;
> > -}
> > -EXPORT_SYMBOL(__ksize);
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index 23f2ab0713b7..488997db0d97 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -1245,6 +1245,35 @@ void kfree_sensitive(const void *p)
> >  }
> >  EXPORT_SYMBOL(kfree_sensitive);
> >
> > +#ifndef CONFIG_SLOB
> > +/**
> > + * __ksize -- Uninstrumented ksize.
> > + * @objp: pointer to the object
> > + *
> > + * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
> > + * safety checks as ksize() with KASAN instrumentation enabled.
> > + *
> > + * Return: size of the actual memory used by @objp in bytes
> > + */
> > +size_t __ksize(const void *object)
> > +{
> > +     struct folio *folio;
> > +
> > +     if (unlikely(object == ZERO_SIZE_PTR))
> > +             return 0;
> > +
> > +     folio = virt_to_folio(object);
> > +
> > +#ifdef CONFIG_SLUB
> > +     if (unlikely(!folio_test_slab(folio)))
> > +             return folio_size(folio);
> > +#endif
> > +
> > +     return slab_ksize(folio_slab(folio)->slab_cache);
>
> ... and here in the common version you now for SLAB trust that the folio
> will be a slab folio, thus undoing the intention of that commit. Maybe
> that's not good and we should keep the folio_test_slab() for both cases?
> Although maybe it's also strange that prior this patch, SLAB would return 0
> if the test fails, and SLUB would return folio_size(). Probably because with
> SLUB this can be a large kmalloc here and with SLAB not. So we could keep
> doing that in the unified version, or KASAN devs (CC'd) could advise
> something better?

Is this a definitive failure case? My opinion here is that returning 0
from ksize() in case of failure will a) provide a way to check for
error, and b) be the more graceful failure mode if the size is used
unconditionally to compute an address (see the comment added in
0d4ca4c9bab39 for what happens when memory that KASAN considers invalid
is accessed).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/5] mm/sl[au]b: Unify __ksize()
  2022-02-23 19:06     ` Marco Elver
@ 2022-02-24 12:26       ` Vlastimil Babka
  0 siblings, 0 replies; 27+ messages in thread
From: Vlastimil Babka @ 2022-02-24 12:26 UTC (permalink / raw)
  To: Marco Elver
  Cc: Hyeonggon Yoo, linux-mm, Roman Gushchin, Andrew Morton,
	linux-kernel, Joonsoo Kim, David Rientjes, Christoph Lameter,
	Pekka Enberg, Kees Cook, kasan-dev, Andrey Konovalov

On 2/23/22 20:06, Marco Elver wrote:
> On Wed, 23 Feb 2022 at 19:39, Vlastimil Babka <vbabka@suse.cz> wrote:
>> On 2/21/22 11:53, Hyeonggon Yoo wrote:
>> > Only SLOB need to implement __ksize() separately because SLOB records
>> > size in object header for kmalloc objects. Unify SLAB/SLUB's __ksize().
>> >
>> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>> > ---
>> >  mm/slab.c        | 23 -----------------------
>> >  mm/slab_common.c | 29 +++++++++++++++++++++++++++++
>> >  mm/slub.c        | 16 ----------------
>> >  3 files changed, 29 insertions(+), 39 deletions(-)
>> >
>> > diff --git a/mm/slab.c b/mm/slab.c
>> > index ddf5737c63d9..eb73d2499480 100644
>> > --- a/mm/slab.c
>> > +++ b/mm/slab.c
>> > @@ -4199,27 +4199,4 @@ void __check_heap_object(const void *ptr, unsigned long n,
>> >  }
>> >  #endif /* CONFIG_HARDENED_USERCOPY */
>> >
>> > -/**
>> > - * __ksize -- Uninstrumented ksize.
>> > - * @objp: pointer to the object
>> > - *
>> > - * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
>> > - * safety checks as ksize() with KASAN instrumentation enabled.
>> > - *
>> > - * Return: size of the actual memory used by @objp in bytes
>> > - */
>> > -size_t __ksize(const void *objp)
>> > -{
>> > -     struct kmem_cache *c;
>> > -     size_t size;
>> >
>> > -     BUG_ON(!objp);
>> > -     if (unlikely(objp == ZERO_SIZE_PTR))
>> > -             return 0;
>> > -
>> > -     c = virt_to_cache(objp);
>> > -     size = c ? c->object_size : 0;
>>
>> This comes from commit a64b53780ec3 ("mm/slab: sanity-check page type when
>> looking up cache") by Kees and virt_to_cache() is an implicit check for
>> folio slab flag ...
>>
>> > -
>> > -     return size;
>> > -}
>> > -EXPORT_SYMBOL(__ksize);
>> > diff --git a/mm/slab_common.c b/mm/slab_common.c
>> > index 23f2ab0713b7..488997db0d97 100644
>> > --- a/mm/slab_common.c
>> > +++ b/mm/slab_common.c
>> > @@ -1245,6 +1245,35 @@ void kfree_sensitive(const void *p)
>> >  }
>> >  EXPORT_SYMBOL(kfree_sensitive);
>> >
>> > +#ifndef CONFIG_SLOB
>> > +/**
>> > + * __ksize -- Uninstrumented ksize.
>> > + * @objp: pointer to the object
>> > + *
>> > + * Unlike ksize(), __ksize() is uninstrumented, and does not provide the same
>> > + * safety checks as ksize() with KASAN instrumentation enabled.
>> > + *
>> > + * Return: size of the actual memory used by @objp in bytes
>> > + */
>> > +size_t __ksize(const void *object)
>> > +{
>> > +     struct folio *folio;
>> > +
>> > +     if (unlikely(object == ZERO_SIZE_PTR))
>> > +             return 0;
>> > +
>> > +     folio = virt_to_folio(object);
>> > +
>> > +#ifdef CONFIG_SLUB
>> > +     if (unlikely(!folio_test_slab(folio)))
>> > +             return folio_size(folio);
>> > +#endif
>> > +
>> > +     return slab_ksize(folio_slab(folio)->slab_cache);
>>
>> ... and here in the common version you now for SLAB trust that the folio
>> will be a slab folio, thus undoing the intention of that commit. Maybe
>> that's not good and we should keep the folio_test_slab() for both cases?
>> Although maybe it's also strange that prior this patch, SLAB would return 0
>> if the test fails, and SLUB would return folio_size(). Probably because with
>> SLUB this can be a large kmalloc here and with SLAB not. So we could keep
>> doing that in the unified version, or KASAN devs (CC'd) could advise
>> something better?
> 
> Is this a definitive failure case?

Yeah, if we called it on a supposed object pointer that turns out to be not
slab, it usually means some UAF, so a failure.

> My opinion here is that returning 0
> from ksize() in case of failure will a) provide a way to check for
> error, and b) if the size is used unconditionally to compute an
> address may be the more graceful failure mode (see comment added in
> 0d4ca4c9bab39 for what happens if we see invalid memory per KASAN
> being accessed).

Sounds good, thanks. Then the patch should be fixed up to keep checking for
slab flag and returning 0 otherwise for CONFIG_SLAB.
For SLUB we might fail to detect the failure case by assuming it was a large
kmalloc. Maybe we could improve and only assume that when folio_size() is
large enough that the corresponding allocation would actually be done as a
large kmalloc, and the object pointer is to the beginning of the folio?

^ permalink raw reply	[flat|nested] 27+ messages in thread
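
Putting the above together, a unified __ksize() might look roughly like the
sketch below: keep the slab-flag check for both allocators, return 0 for
SLAB when the folio is not a slab, and for SLUB only believe a non-slab
folio is a large kmalloc when the size and pointer make that plausible.
This is only an illustration of the discussion, not tested code:

size_t __ksize(const void *object)
{
	struct folio *folio;

	if (unlikely(object == ZERO_SIZE_PTR))
		return 0;

	folio = virt_to_folio(object);

	if (unlikely(!folio_test_slab(folio))) {
		/* SLAB never hands kmalloc() requests to the page allocator */
		if (IS_ENABLED(CONFIG_SLAB))
			return 0;
		/*
		 * SLUB: only treat this as a large kmalloc if the folio is
		 * big enough and the pointer is to its start; otherwise it
		 * is likely a use-after-free or a bogus pointer.
		 */
		if (WARN_ON(folio_size(folio) <= KMALLOC_MAX_CACHE_SIZE))
			return 0;
		if (WARN_ON(object != folio_address(folio)))
			return 0;
		return folio_size(folio);
	}

	return slab_ksize(folio_slab(folio)->slab_cache);
}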

* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
  2022-02-21 10:53 ` [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size Hyeonggon Yoo
  2022-02-21 15:53   ` Matthew Wilcox
@ 2022-02-24 12:48   ` Vlastimil Babka
  2022-02-24 13:31     ` Hyeonggon Yoo
  1 sibling, 1 reply; 27+ messages in thread
From: Vlastimil Babka @ 2022-02-24 12:48 UTC (permalink / raw)
  To: Hyeonggon Yoo, linux-mm
  Cc: Roman Gushchin, Andrew Morton, linux-kernel, Joonsoo Kim,
	David Rientjes, Christoph Lameter, Pekka Enberg, Matthew Wilcox

On 2/21/22 11:53, Hyeonggon Yoo wrote:
> SLAB's kfree() does not support freeing an object that is allocated from
> kmalloc_large(). Fix this as SLAB do not pass requests larger than
> KMALLOC_MAX_CACHE_SIZE directly to page allocator.

AFAICS this issue is limited to build-time constant sizes. Might be better
to make this a build error rather than build-time NULL?

> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread
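
One possible shape of the build-error variant, relying on the dead-code
elimination that already makes the __builtin_constant_p() branch of kmalloc()
compile-time resolvable; BUILD_BUG_ON_MSG() comes from <linux/build_bug.h>.
This is an untested sketch, not something proposed in the thread:

		if (size > KMALLOC_MAX_CACHE_SIZE) {
			if (IS_ENABLED(CONFIG_SLUB))
				return kmalloc_large(size, flags);
			/*
			 * SLAB's kfree() cannot free page-allocator memory,
			 * so refuse at build time instead of returning NULL
			 * at runtime.
			 */
			BUILD_BUG_ON_MSG(1,
				"kmalloc(): size unsupported by SLAB");
			return NULL;	/* unreachable; keeps the compiler happy */
		}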

* Re: [PATCH 4/5] mm/slub: Limit min_partial only in cache creation
  2022-02-23  3:37     ` Hyeonggon Yoo
@ 2022-02-24 12:52       ` Vlastimil Babka
  0 siblings, 0 replies; 27+ messages in thread
From: Vlastimil Babka @ 2022-02-24 12:52 UTC (permalink / raw)
  To: Hyeonggon Yoo, David Rientjes
  Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
	Joonsoo Kim, Christoph Lameter, Pekka Enberg

On 2/23/22 04:37, Hyeonggon Yoo wrote:
> On Tue, Feb 22, 2022 at 03:48:16PM -0800, David Rientjes wrote:
>> On Mon, 21 Feb 2022, Hyeonggon Yoo wrote:
>> 
>> > SLUB sets number of minimum partial slabs for node (min_partial) using
>> > set_min_partial(). SLUB holds at least min_partial slabs even if they're empty
>> > to avoid excessive use of page allocator.
>> > 
>> > set_min_partial() limits value of min_partial between MIN_PARTIAL and
>> > MAX_PARTIAL. As set_min_partial() can be called by min_partial_store()
>> > too, Only limit value of min_partial in kmem_cache_open() so that it
>> > can be changed to value that a user wants.
>> > 
>> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>> 
>> I think this makes sense and there is no reason to limit the bounds that 
>> may be set at runtime with undocumented behavior.

Right.

> Thank you for comment.
> 
>> 
>> However, since set_min_partial() is only setting the value in the 
>> kmem_cache, could we remove the helper function entirely and fold it into 
>> its two callers?
> 
> Right. We don't need to separate this as function. I'll update this
> in next version. Thanks!

Agreed, thanks!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
  2022-02-24 12:48   ` Vlastimil Babka
@ 2022-02-24 13:31     ` Hyeonggon Yoo
  2022-02-24 15:08       ` Vlastimil Babka
  0 siblings, 1 reply; 27+ messages in thread
From: Hyeonggon Yoo @ 2022-02-24 13:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
	Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
	Matthew Wilcox

On Thu, Feb 24, 2022 at 01:48:47PM +0100, Vlastimil Babka wrote:
> On 2/21/22 11:53, Hyeonggon Yoo wrote:
> > SLAB's kfree() does not support freeing an object that is allocated from
> > kmalloc_large(). Fix this as SLAB do not pass requests larger than
> > KMALLOC_MAX_CACHE_SIZE directly to page allocator.
> 
> AFAICS this issue is limited to build-time constant sizes. Might be better
> to make this a build error rather than build-time NULL?

Right, and that sounds better. But what about the other direction Matthew
suggested: passing large requests to the page allocator, like SLUB does?

I think that would be easier to maintain. Are there any obstacles to going
that way?

Thank you!

> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

-- 
Hyeonggon

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size
  2022-02-24 13:31     ` Hyeonggon Yoo
@ 2022-02-24 15:08       ` Vlastimil Babka
  0 siblings, 0 replies; 27+ messages in thread
From: Vlastimil Babka @ 2022-02-24 15:08 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
	Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg,
	Matthew Wilcox

On 2/24/22 14:31, Hyeonggon Yoo wrote:
> On Thu, Feb 24, 2022 at 01:48:47PM +0100, Vlastimil Babka wrote:
>> On 2/21/22 11:53, Hyeonggon Yoo wrote:
>> > SLAB's kfree() does not support freeing an object that is allocated from
>> > kmalloc_large(). Fix this as SLAB do not pass requests larger than
>> > KMALLOC_MAX_CACHE_SIZE directly to page allocator.
>> 
>> AFAICS this issue is limited to build-time constant sizes. Might be better
>> to make this a build error rather than build-time NULL?
> 
> Right. And sounds better. But what about another direction as Matthew said:
> passing large requests to page allocator like SLUB?

Sounds like a good idea, that would reduce the number of kmalloc caches with
SLAB, and I expect also simplify the common code further.

> I think it's better for maintenance. Any obstacles for this direction?
> 
> Thank you!
> 
>> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] mm/slub: Refactor deactivate_slab()
  2022-02-21 10:53 ` [PATCH 5/5] mm/slub: Refactor deactivate_slab() Hyeonggon Yoo
@ 2022-02-24 18:16   ` Vlastimil Babka
  2022-02-25  9:34     ` Hyeonggon Yoo
  0 siblings, 1 reply; 27+ messages in thread
From: Vlastimil Babka @ 2022-02-24 18:16 UTC (permalink / raw)
  To: Hyeonggon Yoo, linux-mm
  Cc: Roman Gushchin, Andrew Morton, linux-kernel, Joonsoo Kim,
	David Rientjes, Christoph Lameter, Pekka Enberg

On 2/21/22 11:53, Hyeonggon Yoo wrote:
> Simply deactivate_slab() by removing variable 'lock' and replacing
> 'l' and 'm' with 'mode'. Instead, remove slab from list and unlock
> n->list_lock when cmpxchg_double() fails, and then retry.
> 
> One slight functional change is releasing and taking n->list_lock again
> when cmpxchg_double() fails. This is not harmful because SLUB avoids
> deactivating slabs as much as possible.
> 
> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Hm I wonder if we could simplify even a bit more. Do we have to actually
place the slab on a partial (full) list before the cmpxchg, only to remove
it when cmpxchg fails? Seems it's to avoid anyone else seeing the slab
un-frozen, but not on the list, which would be unexpected. However if anyone
sees such slab, they have to take the list_lock first to start working with
the slab... so this should be safe, because we hold the list_lock here, and
will place the slab on the list before we release it. But it thus shouldn't
matter if the placement happens before or after a successful cmpxchg, no? So
we can only do it once after a successful cmpxchg and need no undo's?

Specifically AFAIK the only possible race should be with a __slab_free()
which might observe !was_frozen after we succeed an unfreezing cmpxchg and
go through the
"} else { /* Needs to be taken off a list */"
branch but then it takes the list_lock as the first thing, so will be able
to proceed only after the slab is actually on the list.

Do I miss anything or would you agree?

> ---
>  mm/slub.c | 74 +++++++++++++++++++++++++------------------------------
>  1 file changed, 33 insertions(+), 41 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index a4964deccb61..2d0663befb9e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2350,8 +2350,8 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>  {
>  	enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
>  	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> -	int lock = 0, free_delta = 0;
> -	enum slab_modes l = M_NONE, m = M_NONE;
> +	int free_delta = 0;
> +	enum slab_modes mode = M_NONE;
>  	void *nextfree, *freelist_iter, *freelist_tail;
>  	int tail = DEACTIVATE_TO_HEAD;
>  	unsigned long flags = 0;
> @@ -2420,57 +2420,49 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>  	new.frozen = 0;
>  
>  	if (!new.inuse && n->nr_partial >= s->min_partial)
> -		m = M_FREE;
> +		mode = M_FREE;
>  	else if (new.freelist) {
> -		m = M_PARTIAL;
> -		if (!lock) {
> -			lock = 1;
> -			/*
> -			 * Taking the spinlock removes the possibility that
> -			 * acquire_slab() will see a slab that is frozen
> -			 */
> -			spin_lock_irqsave(&n->list_lock, flags);
> -		}
> -	} else {
> -		m = M_FULL;
> -		if (kmem_cache_debug_flags(s, SLAB_STORE_USER) && !lock) {
> -			lock = 1;
> -			/*
> -			 * This also ensures that the scanning of full
> -			 * slabs from diagnostic functions will not see
> -			 * any frozen slabs.
> -			 */
> -			spin_lock_irqsave(&n->list_lock, flags);
> -		}
> +		mode = M_PARTIAL;
> +		/*
> +		 * Taking the spinlock removes the possibility that
> +		 * acquire_slab() will see a slab that is frozen
> +		 */
> +		spin_lock_irqsave(&n->list_lock, flags);
> +		add_partial(n, slab, tail);
> +	} else if (kmem_cache_debug_flags(s, SLAB_STORE_USER)) {
> +		mode = M_FULL;
> +		/*
> +		 * This also ensures that the scanning of full
> +		 * slabs from diagnostic functions will not see
> +		 * any frozen slabs.
> +		 */
> +		spin_lock_irqsave(&n->list_lock, flags);
> +		add_full(s, n, slab);
>  	}
>  
> -	if (l != m) {
> -		if (l == M_PARTIAL)
> -			remove_partial(n, slab);
> -		else if (l == M_FULL)
> -			remove_full(s, n, slab);
>  
> -		if (m == M_PARTIAL)
> -			add_partial(n, slab, tail);
> -		else if (m == M_FULL)
> -			add_full(s, n, slab);
> -	}
> -
> -	l = m;
>  	if (!cmpxchg_double_slab(s, slab,
>  				old.freelist, old.counters,
>  				new.freelist, new.counters,
> -				"unfreezing slab"))
> +				"unfreezing slab")) {
> +		if (mode == M_PARTIAL) {
> +			remove_partial(n, slab);
> +			spin_unlock_irqrestore(&n->list_lock, flags);
> +		} else if (mode == M_FULL) {
> +			remove_full(s, n, slab);
> +			spin_unlock_irqrestore(&n->list_lock, flags);
> +		}
>  		goto redo;
> +	}
>  
> -	if (lock)
> -		spin_unlock_irqrestore(&n->list_lock, flags);
>  
> -	if (m == M_PARTIAL)
> +	if (mode == M_PARTIAL) {
> +		spin_unlock_irqrestore(&n->list_lock, flags);
>  		stat(s, tail);
> -	else if (m == M_FULL)
> +	} else if (mode == M_FULL) {
> +		spin_unlock_irqrestore(&n->list_lock, flags);
>  		stat(s, DEACTIVATE_FULL);
> -	else if (m == M_FREE) {
> +	} else if (mode == M_FREE) {
>  		stat(s, DEACTIVATE_EMPTY);
>  		discard_slab(s, slab);
>  		stat(s, FREE_SLAB);


^ permalink raw reply	[flat|nested] 27+ messages in thread
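
A sketch of the ordering described above, with the names from the patch:
take n->list_lock up front where a list is involved, keep holding it across
the cmpxchg, and only add the slab to a list after the cmpxchg succeeds, so
a failed cmpxchg needs no undo. Illustration only, not tested:

	if (mode == M_PARTIAL || mode == M_FULL)
		spin_lock_irqsave(&n->list_lock, flags);

	if (!cmpxchg_double_slab(s, slab,
				old.freelist, old.counters,
				new.freelist, new.counters,
				"unfreezing slab")) {
		if (mode == M_PARTIAL || mode == M_FULL)
			spin_unlock_irqrestore(&n->list_lock, flags);
		goto redo;
	}

	/* now unfrozen and not yet on any list; add while still holding the lock */
	if (mode == M_PARTIAL) {
		add_partial(n, slab, tail);
		spin_unlock_irqrestore(&n->list_lock, flags);
		stat(s, tail);
	} else if (mode == M_FULL) {
		add_full(s, n, slab);
		spin_unlock_irqrestore(&n->list_lock, flags);
		stat(s, DEACTIVATE_FULL);
	} else if (mode == M_FREE) {
		stat(s, DEACTIVATE_EMPTY);
		discard_slab(s, slab);
		stat(s, FREE_SLAB);
	}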

* Re: [PATCH 5/5] mm/slub: Refactor deactivate_slab()
  2022-02-24 18:16   ` Vlastimil Babka
@ 2022-02-25  9:34     ` Hyeonggon Yoo
  2022-02-25  9:50       ` Hyeonggon Yoo
  0 siblings, 1 reply; 27+ messages in thread
From: Hyeonggon Yoo @ 2022-02-25  9:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
	Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg

On Thu, Feb 24, 2022 at 07:16:11PM +0100, Vlastimil Babka wrote:
> On 2/21/22 11:53, Hyeonggon Yoo wrote:
> > Simply deactivate_slab() by removing variable 'lock' and replacing
> > 'l' and 'm' with 'mode'. Instead, remove slab from list and unlock
> > n->list_lock when cmpxchg_double() fails, and then retry.
> > 
> > One slight functional change is releasing and taking n->list_lock again
> > when cmpxchg_double() fails. This is not harmful because SLUB avoids
> > deactivating slabs as much as possible.
> > 
> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> 
> Hm I wonder if we could simplify even a bit more. Do we have to actually
> place the slab on a partial (full) list before the cmpxchg, only to remove
> it when cmpxchg fails? Seems it's to avoid anyone else seeing the slab
> un-frozen, but not on the list, which would be unexpected. However if anyone
> sees such slab, they have to take the list_lock first to start working with
> the slab... so this should be safe, because we hold the list_lock here, and
> will place the slab on the list before we release it. But it thus shouldn't
> matter if the placement happens before or after a successful cmpxchg, no? So
> we can only do it once after a successful cmpxchg and need no undo's?
>

My thought was similar. But after testing I noticed that holding n->list_lock
prevents a race between __slab_free() and deactivate_slab().

> Specifically AFAIK the only possible race should be with a __slab_free()
> which might observe !was_frozen after we succeed an unfreezing cmpxchg and
> go through the
> "} else { /* Needs to be taken off a list */"
> branch but then it takes the list_lock as the first thing, so will be able
> to proceed only after the slab is actually on the list.
> 
> Do I miss anything or would you agree?
>

It's quite tricky.

I tried to simplify further as you suggested. Seeing a frozen slab on the list
was not a problem. The problem was that something might interfere between
cmpxchg_double() and taking the spinlock.

This is what I faced:

	CPU A				CPU B
deactivate_slab() {			__slab_free() {
	/* slab is full */
	slab.frozen = 0;
	cmpxchg_double();
						/* Hmm... 
						slab->frozen == 0 &&
						slab->freelist != NULL?
						Oh This must be on the list.. */
						
						spin_lock_irqsave();
						cmpxchg_double();
						/* Corruption: slab
						 * was not yet inserted to
						 * list but try removing */
						remove_full();
						spin_unlock_irqrestore();
					}
	spin_lock_irqsave();
	add_full();
	spin_unlock_irqrestore();
}

I think it's quite confusing because it's protecting code, not data.

Would you have an idea to solve it, or should we add a comment for this?
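
For reference, below is roughly the variant I tried, reconstructed from the
diagram above. It is only a sketch of the idea (full slab case shown), not the
patch as posted:

	/* slab is full: unfreeze first, put it on the list afterwards */
	new.frozen = 0;

	if (!cmpxchg_double_slab(s, slab,
				old.freelist, old.counters,
				new.freelist, new.counters,
				"unfreezing slab"))
		goto redo;

	/*
	 * Window: the slab is unfrozen but on no list yet, so CPU B's
	 * __slab_free() can act on it before we take n->list_lock.
	 */
	if (mode == M_FULL) {
		spin_lock_irqsave(&n->list_lock, flags);
		add_full(s, n, slab);
		spin_unlock_irqrestore(&n->list_lock, flags);
	}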

> > ---
> >  mm/slub.c | 74 +++++++++++++++++++++++++------------------------------
> >  1 file changed, 33 insertions(+), 41 deletions(-)
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index a4964deccb61..2d0663befb9e 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2350,8 +2350,8 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> >  {
> >  	enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
> >  	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> > -	int lock = 0, free_delta = 0;
> > -	enum slab_modes l = M_NONE, m = M_NONE;
> > +	int free_delta = 0;
> > +	enum slab_modes mode = M_NONE;
> >  	void *nextfree, *freelist_iter, *freelist_tail;
> >  	int tail = DEACTIVATE_TO_HEAD;
> >  	unsigned long flags = 0;
> > @@ -2420,57 +2420,49 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> >  	new.frozen = 0;
> >  
> >  	if (!new.inuse && n->nr_partial >= s->min_partial)
> > -		m = M_FREE;
> > +		mode = M_FREE;
> >  	else if (new.freelist) {
> > -		m = M_PARTIAL;
> > -		if (!lock) {
> > -			lock = 1;
> > -			/*
> > -			 * Taking the spinlock removes the possibility that
> > -			 * acquire_slab() will see a slab that is frozen
> > -			 */
> > -			spin_lock_irqsave(&n->list_lock, flags);
> > -		}
> > -	} else {
> > -		m = M_FULL;
> > -		if (kmem_cache_debug_flags(s, SLAB_STORE_USER) && !lock) {
> > -			lock = 1;
> > -			/*
> > -			 * This also ensures that the scanning of full
> > -			 * slabs from diagnostic functions will not see
> > -			 * any frozen slabs.
> > -			 */
> > -			spin_lock_irqsave(&n->list_lock, flags);
> > -		}
> > +		mode = M_PARTIAL;
> > +		/*
> > +		 * Taking the spinlock removes the possibility that
> > +		 * acquire_slab() will see a slab that is frozen
> > +		 */
> > +		spin_lock_irqsave(&n->list_lock, flags);
> > +		add_partial(n, slab, tail);
> > +	} else if (kmem_cache_debug_flags(s, SLAB_STORE_USER)) {
> > +		mode = M_FULL;
> > +		/*
> > +		 * This also ensures that the scanning of full
> > +		 * slabs from diagnostic functions will not see
> > +		 * any frozen slabs.
> > +		 */
> > +		spin_lock_irqsave(&n->list_lock, flags);
> > +		add_full(s, n, slab);
> >  	}
> >  
> > -	if (l != m) {
> > -		if (l == M_PARTIAL)
> > -			remove_partial(n, slab);
> > -		else if (l == M_FULL)
> > -			remove_full(s, n, slab);
> >  
> > -		if (m == M_PARTIAL)
> > -			add_partial(n, slab, tail);
> > -		else if (m == M_FULL)
> > -			add_full(s, n, slab);
> > -	}
> > -
> > -	l = m;
> >  	if (!cmpxchg_double_slab(s, slab,
> >  				old.freelist, old.counters,
> >  				new.freelist, new.counters,
> > -				"unfreezing slab"))
> > +				"unfreezing slab")) {
> > +		if (mode == M_PARTIAL) {
> > +			remove_partial(n, slab);
> > +			spin_unlock_irqrestore(&n->list_lock, flags);
> > +		} else if (mode == M_FULL) {
> > +			remove_full(s, n, slab);
> > +			spin_unlock_irqrestore(&n->list_lock, flags);
> > +		}
> >  		goto redo;
> > +	}
> >  
> > -	if (lock)
> > -		spin_unlock_irqrestore(&n->list_lock, flags);
> >  
> > -	if (m == M_PARTIAL)
> > +	if (mode == M_PARTIAL) {
> > +		spin_unlock_irqrestore(&n->list_lock, flags);
> >  		stat(s, tail);
> > -	else if (m == M_FULL)
> > +	} else if (mode == M_FULL) {
> > +		spin_unlock_irqrestore(&n->list_lock, flags);
> >  		stat(s, DEACTIVATE_FULL);
> > -	else if (m == M_FREE) {
> > +	} else if (mode == M_FREE) {
> >  		stat(s, DEACTIVATE_EMPTY);
> >  		discard_slab(s, slab);
> >  		stat(s, FREE_SLAB);
> 

-- 
Thank you, You are awesome!
Hyeonggon :-)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] mm/slub: Refactor deactivate_slab()
  2022-02-25  9:34     ` Hyeonggon Yoo
@ 2022-02-25  9:50       ` Hyeonggon Yoo
  2022-02-25 10:07         ` Vlastimil Babka
  0 siblings, 1 reply; 27+ messages in thread
From: Hyeonggon Yoo @ 2022-02-25  9:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
	Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg

On Fri, Feb 25, 2022 at 09:34:09AM +0000, Hyeonggon Yoo wrote:
> On Thu, Feb 24, 2022 at 07:16:11PM +0100, Vlastimil Babka wrote:
> > On 2/21/22 11:53, Hyeonggon Yoo wrote:
> > > Simply deactivate_slab() by removing variable 'lock' and replacing
> > > 'l' and 'm' with 'mode'. Instead, remove slab from list and unlock
> > > n->list_lock when cmpxchg_double() fails, and then retry.
> > > 
> > > One slight functional change is releasing and taking n->list_lock again
> > > when cmpxchg_double() fails. This is not harmful because SLUB avoids
> > > deactivating slabs as much as possible.
> > > 
> > > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > 
> > Hm I wonder if we could simplify even a bit more. Do we have to actually
> > place the slab on a partial (full) list before the cmpxchg, only to remove
> > it when cmpxchg fails? Seems it's to avoid anyone else seeing the slab
> > un-frozen, but not on the list, which would be unexpected. However if anyone
> > sees such slab, they have to take the list_lock first to start working with
> > the slab... so this should be safe, because we hold the list_lock here, and
> > will place the slab on the list before we release it. But it thus shouldn't
> > matter if the placement happens before or after a successful cmpxchg, no? So
> > we can only do it once after a successful cmpxchg and need no undo's?
> >
> 
> My thought was similar. But after testing I noticed that &n->list_lock prevents
> race between __slab_free() and deactivate_slab().
> 
> > Specifically AFAIK the only possible race should be with a __slab_free()
> > which might observe !was_frozen after we succeed an unfreezing cmpxchg and
> > go through the
> > "} else { /* Needs to be taken off a list */"
> > branch but then it takes the list_lock as the first thing, so will be able
> > to proceed only after the slab is actually on the list.
> > 
> > Do I miss anything or would you agree?
> >
> 
> It's so tricky.
> 
> I tried to simplify more as you said. Seeing frozen slab on list was not
> problem. But the problem was that something might interfere between
> cmpxchg_double() and taking spinlock.
> 
> This is what I faced:
> 
> 	CPU A				CPU B
> deactivate_slab() {			__slab_free() {
> 	/* slab is full */
> 	slab.frozen = 0;
> 	cmpxchg_double();
> 						/* Hmm... 
> 						slab->frozen == 0 &&
> 						slab->freelist != NULL?
> 						Oh This must be on the list.. */
						Oh, this is wrong.
						slab->freelist must be
						NULL because it's a full
						slab.

						It's more complex
						than I thought...


> 						spin_lock_irqsave();
> 						cmpxchg_double();
> 						/* Corruption: slab
> 						 * was not yet inserted to
> 						 * list but try removing */
> 						remove_full();
> 						spin_unlock_irqrestore();
> 					}
> 	spin_lock_irqsave();
> 	add_full();
> 	spin_unlock_irqrestore();
> }

So it was...

 	CPU A				CPU B
 deactivate_slab() {			__slab_free() {
 	/* slab is full */
 	slab.frozen = 0;
 	cmpxchg_double();
 						/*
							Hmm... 
							!was_frozen &&
							prior == NULL?
							Let's freeze this!
						*/
						put_cpu_partial();
 					}
 	spin_lock_irqsave();
 	add_full();
	/* It's now frozen by CPU B and at the same time on full list */
 	spin_unlock_irqrestore();

And &n->list_lock prevents such a race.
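
(For reference, the __slab_free() side of this is, as I understand it, roughly
the following. This is a simplified paraphrase, not the exact mm/slub.c code;
debug handling and the later list/discard path are omitted.)

	do {
		...
		prior = slab->freelist;
		was_frozen = new.frozen;
		new.inuse -= cnt;
		if ((!new.inuse || !prior) && !was_frozen) {
			if (kmem_cache_has_cpu_partial(s) && !prior) {
				/* was a full slab: just freeze it */
				new.frozen = 1;
			} else {
				/* Needs to be taken off a list */
				n = get_node(s, slab_nid(slab));
				spin_lock_irqsave(&n->list_lock, flags);
			}
		}
	} while (!cmpxchg_double_slab(s, slab, prior, counters,
				      head, new.counters, "__slab_free"));

	if (likely(!n)) {
		if (!was_frozen && new.frozen)
			/* this is the put_cpu_partial() path in the diagram */
			put_cpu_partial(s, slab, 1);
		return;
	}
	/* otherwise, list manipulation under n->list_lock follows */

So the dangerous case is exactly the full slab: prior == NULL, so nothing
forces __slab_free() to take n->list_lock before freezing it.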

> 
> I think it's quite confusing because it's protecting code, not data.
> 
> Would you have an idea to solve it, or should we add a comment for this?
> 
> > > ---
> > >  mm/slub.c | 74 +++++++++++++++++++++++++------------------------------
> > >  1 file changed, 33 insertions(+), 41 deletions(-)
> > > 
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index a4964deccb61..2d0663befb9e 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -2350,8 +2350,8 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> > >  {
> > >  	enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
> > >  	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> > > -	int lock = 0, free_delta = 0;
> > > -	enum slab_modes l = M_NONE, m = M_NONE;
> > > +	int free_delta = 0;
> > > +	enum slab_modes mode = M_NONE;
> > >  	void *nextfree, *freelist_iter, *freelist_tail;
> > >  	int tail = DEACTIVATE_TO_HEAD;
> > >  	unsigned long flags = 0;
> > > @@ -2420,57 +2420,49 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> > >  	new.frozen = 0;
> > >  
> > >  	if (!new.inuse && n->nr_partial >= s->min_partial)
> > > -		m = M_FREE;
> > > +		mode = M_FREE;
> > >  	else if (new.freelist) {
> > > -		m = M_PARTIAL;
> > > -		if (!lock) {
> > > -			lock = 1;
> > > -			/*
> > > -			 * Taking the spinlock removes the possibility that
> > > -			 * acquire_slab() will see a slab that is frozen
> > > -			 */
> > > -			spin_lock_irqsave(&n->list_lock, flags);
> > > -		}
> > > -	} else {
> > > -		m = M_FULL;
> > > -		if (kmem_cache_debug_flags(s, SLAB_STORE_USER) && !lock) {
> > > -			lock = 1;
> > > -			/*
> > > -			 * This also ensures that the scanning of full
> > > -			 * slabs from diagnostic functions will not see
> > > -			 * any frozen slabs.
> > > -			 */
> > > -			spin_lock_irqsave(&n->list_lock, flags);
> > > -		}
> > > +		mode = M_PARTIAL;
> > > +		/*
> > > +		 * Taking the spinlock removes the possibility that
> > > +		 * acquire_slab() will see a slab that is frozen
> > > +		 */
> > > +		spin_lock_irqsave(&n->list_lock, flags);
> > > +		add_partial(n, slab, tail);
> > > +	} else if (kmem_cache_debug_flags(s, SLAB_STORE_USER)) {
> > > +		mode = M_FULL;
> > > +		/*
> > > +		 * This also ensures that the scanning of full
> > > +		 * slabs from diagnostic functions will not see
> > > +		 * any frozen slabs.
> > > +		 */
> > > +		spin_lock_irqsave(&n->list_lock, flags);
> > > +		add_full(s, n, slab);
> > >  	}
> > >  
> > > -	if (l != m) {
> > > -		if (l == M_PARTIAL)
> > > -			remove_partial(n, slab);
> > > -		else if (l == M_FULL)
> > > -			remove_full(s, n, slab);
> > >  
> > > -		if (m == M_PARTIAL)
> > > -			add_partial(n, slab, tail);
> > > -		else if (m == M_FULL)
> > > -			add_full(s, n, slab);
> > > -	}
> > > -
> > > -	l = m;
> > >  	if (!cmpxchg_double_slab(s, slab,
> > >  				old.freelist, old.counters,
> > >  				new.freelist, new.counters,
> > > -				"unfreezing slab"))
> > > +				"unfreezing slab")) {
> > > +		if (mode == M_PARTIAL) {
> > > +			remove_partial(n, slab);
> > > +			spin_unlock_irqrestore(&n->list_lock, flags);
> > > +		} else if (mode == M_FULL) {
> > > +			remove_full(s, n, slab);
> > > +			spin_unlock_irqrestore(&n->list_lock, flags);
> > > +		}
> > >  		goto redo;
> > > +	}
> > >  
> > > -	if (lock)
> > > -		spin_unlock_irqrestore(&n->list_lock, flags);
> > >  
> > > -	if (m == M_PARTIAL)
> > > +	if (mode == M_PARTIAL) {
> > > +		spin_unlock_irqrestore(&n->list_lock, flags);
> > >  		stat(s, tail);
> > > -	else if (m == M_FULL)
> > > +	} else if (mode == M_FULL) {
> > > +		spin_unlock_irqrestore(&n->list_lock, flags);
> > >  		stat(s, DEACTIVATE_FULL);
> > > -	else if (m == M_FREE) {
> > > +	} else if (mode == M_FREE) {
> > >  		stat(s, DEACTIVATE_EMPTY);
> > >  		discard_slab(s, slab);
> > >  		stat(s, FREE_SLAB);
> > 
> 
> -- 
> Thank you, You are awesome!
> Hyeonggon :-)

-- 
Thank you, You are awesome!
Hyeonggon :-)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] mm/slub: Refactor deactivate_slab()
  2022-02-25  9:50       ` Hyeonggon Yoo
@ 2022-02-25 10:07         ` Vlastimil Babka
  2022-02-25 10:26           ` Hyeonggon Yoo
  0 siblings, 1 reply; 27+ messages in thread
From: Vlastimil Babka @ 2022-02-25 10:07 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
	Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg

On 2/25/22 10:50, Hyeonggon Yoo wrote:
> On Fri, Feb 25, 2022 at 09:34:09AM +0000, Hyeonggon Yoo wrote:
>> On Thu, Feb 24, 2022 at 07:16:11PM +0100, Vlastimil Babka wrote:
>> > On 2/21/22 11:53, Hyeonggon Yoo wrote:
>> > > Simply deactivate_slab() by removing variable 'lock' and replacing
>> > > 'l' and 'm' with 'mode'. Instead, remove slab from list and unlock
>> > > n->list_lock when cmpxchg_double() fails, and then retry.
>> > > 
>> > > One slight functional change is releasing and taking n->list_lock again
>> > > when cmpxchg_double() fails. This is not harmful because SLUB avoids
>> > > deactivating slabs as much as possible.
>> > > 
>> > > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>> > 
>> > Hm I wonder if we could simplify even a bit more. Do we have to actually
>> > place the slab on a partial (full) list before the cmpxchg, only to remove
>> > it when cmpxchg fails? Seems it's to avoid anyone else seeing the slab
>> > un-frozen, but not on the list, which would be unexpected. However if anyone
>> > sees such slab, they have to take the list_lock first to start working with
>> > the slab... so this should be safe, because we hold the list_lock here, and
>> > will place the slab on the list before we release it. But it thus shouldn't
>> > matter if the placement happens before or after a successful cmpxchg, no? So
>> > we can only do it once after a successful cmpxchg and need no undo's?
>> >
>> 
>> My thought was similar. But after testing I noticed that &n->list_lock prevents
>> race between __slab_free() and deactivate_slab().
>> 
>> > Specifically AFAIK the only possible race should be with a __slab_free()
>> > which might observe !was_frozen after we succeed an unfreezing cmpxchg and
>> > go through the
>> > "} else { /* Needs to be taken off a list */"
>> > branch but then it takes the list_lock as the first thing, so will be able
>> > to proceed only after the slab is actually on the list.
>> > 
>> > Do I miss anything or would you agree?
>> >
>> 
>> It's so tricky.
>> 
>> I tried to simplify more as you said. Seeing frozen slab on list was not
>> problem. But the problem was that something might interfere between
>> cmpxchg_double() and taking spinlock.
>> 
>> This is what I faced:
>> 
>> 	CPU A				CPU B
>> deactivate_slab() {			__slab_free() {
>> 	/* slab is full */
>> 	slab.frozen = 0;
>> 	cmpxchg_double();
>> 						/* Hmm... 
>> 						slab->frozen == 0 &&
>> 						slab->freelist != NULL?
>> 						Oh This must be on the list.. */
> 						Oh this is wrong.
> 						slab->freelist must be
> 						NULL because it's full
> 						slab.
> 
> 						It's more complex
> 						than I thought...
> 
> 
>> 						spin_lock_irqsave();
>> 						cmpxchg_double();
>> 						/* Corruption: slab
>> 						 * was not yet inserted to
>> 						 * list but try removing */
>> 						remove_full();
>> 						spin_unlock_irqrestore();
>> 					}
>> 	spin_lock_irqsave();
>> 	add_full();
>> 	spin_unlock_irqrestore();
>> }
> 
> So it was...
> 
>  	CPU A				CPU B
>  deactivate_slab() {			__slab_free() {
>  	/* slab is full */
>  	slab.frozen = 0;
>  	cmpxchg_double();
>  						/*
> 							Hmm... 
> 							!was_frozen &&
> 							prior == NULL?
> 							Let's freeze this!
> 						*/
> 						put_cpu_partial();
>  					}
>  	spin_lock_irqsave();

Yeah, in my proposal I didn't intend to take spin_lock_irqsave() only at
this point.
My idea for CPU A would be something like:

spin_lock_irqsave();
slab.frozen = 0;
if (cmpxchg_double()) {
	/* success */
	add_partial(); // (or add_full())
	spin_unlock_irqrestore();
} else {
	/* fail */
	spin_unlock_irqrestore();
	goto redo;
}
	
So we would still have the list_lock protection around cmpxchg as in the
current code. We just wouldn't do e.g. add_partial() before cmpxchg, only to
remove_partial() when cmpxchg failed.
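
In full, the tail of deactivate_slab() could then look something like this
(using the names from the patch above; an untested sketch only):

redo:
	...
	new.frozen = 0;

	if (!new.inuse && n->nr_partial >= s->min_partial)
		mode = M_FREE;
	else if (new.freelist)
		mode = M_PARTIAL;
	else if (kmem_cache_debug_flags(s, SLAB_STORE_USER))
		mode = M_FULL;
	else
		mode = M_NONE;

	/* Hold the list_lock across the cmpxchg, as the current code does */
	if (mode == M_PARTIAL || mode == M_FULL)
		spin_lock_irqsave(&n->list_lock, flags);

	if (!cmpxchg_double_slab(s, slab,
				old.freelist, old.counters,
				new.freelist, new.counters,
				"unfreezing slab")) {
		if (mode == M_PARTIAL || mode == M_FULL)
			spin_unlock_irqrestore(&n->list_lock, flags);
		goto redo;
	}

	/* success: do the list placement now, then drop the lock */
	if (mode == M_PARTIAL) {
		add_partial(n, slab, tail);
		spin_unlock_irqrestore(&n->list_lock, flags);
		stat(s, tail);
	} else if (mode == M_FULL) {
		add_full(s, n, slab);
		spin_unlock_irqrestore(&n->list_lock, flags);
		stat(s, DEACTIVATE_FULL);
	} else if (mode == M_FREE) {
		stat(s, DEACTIVATE_EMPTY);
		discard_slab(s, slab);
		stat(s, FREE_SLAB);
	}

The failure path then needs no undo at all; it just drops the lock and retries.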

>  	add_full();
> 	/* It's now frozen by CPU B and at the same time on full list */
>  	spin_unlock_irqrestore();
> 
> And &n->list_lock prevents such a race.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] mm/slub: Refactor deactivate_slab()
  2022-02-25 10:07         ` Vlastimil Babka
@ 2022-02-25 10:26           ` Hyeonggon Yoo
  0 siblings, 0 replies; 27+ messages in thread
From: Hyeonggon Yoo @ 2022-02-25 10:26 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Roman Gushchin, Andrew Morton, linux-kernel,
	Joonsoo Kim, David Rientjes, Christoph Lameter, Pekka Enberg

On Fri, Feb 25, 2022 at 11:07:45AM +0100, Vlastimil Babka wrote:
> On 2/25/22 10:50, Hyeonggon Yoo wrote:
> > On Fri, Feb 25, 2022 at 09:34:09AM +0000, Hyeonggon Yoo wrote:
> >> On Thu, Feb 24, 2022 at 07:16:11PM +0100, Vlastimil Babka wrote:
> >> > On 2/21/22 11:53, Hyeonggon Yoo wrote:
> >> > > Simply deactivate_slab() by removing variable 'lock' and replacing
> >> > > 'l' and 'm' with 'mode'. Instead, remove slab from list and unlock
> >> > > n->list_lock when cmpxchg_double() fails, and then retry.
> >> > > 
> >> > > One slight functional change is releasing and taking n->list_lock again
> >> > > when cmpxchg_double() fails. This is not harmful because SLUB avoids
> >> > > deactivating slabs as much as possible.
> >> > > 
> >> > > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> >> > 
> >> > Hm I wonder if we could simplify even a bit more. Do we have to actually
> >> > place the slab on a partial (full) list before the cmpxchg, only to remove
> >> > it when cmpxchg fails? Seems it's to avoid anyone else seeing the slab
> >> > un-frozen, but not on the list, which would be unexpected. However if anyone
> >> > sees such slab, they have to take the list_lock first to start working with
> >> > the slab... so this should be safe, because we hold the list_lock here, and
> >> > will place the slab on the list before we release it. But it thus shouldn't
> >> > matter if the placement happens before or after a successful cmpxchg, no? So
> >> > we can only do it once after a successful cmpxchg and need no undo's?
> >> >
> >> 
> >> My thought was similar. But after testing I noticed that &n->list_lock prevents
> >> race between __slab_free() and deactivate_slab().
> >> 
> >> > Specifically AFAIK the only possible race should be with a __slab_free()
> >> > which might observe !was_frozen after we succeed an unfreezing cmpxchg and
> >> > go through the
> >> > "} else { /* Needs to be taken off a list */"
> >> > branch but then it takes the list_lock as the first thing, so will be able
> >> > to proceed only after the slab is actually on the list.
> >> > 
> >> > Do I miss anything or would you agree?
> >> >
> >> 
> >> It's so tricky.
> >> 
> >> I tried to simplify more as you said. Seeing frozen slab on list was not
> >> problem. But the problem was that something might interfere between
> >> cmpxchg_double() and taking spinlock.
> >> 
> >> This is what I faced:
> >> 
> >> 	CPU A				CPU B
> >> deactivate_slab() {			__slab_free() {
> >> 	/* slab is full */
> >> 	slab.frozen = 0;
> >> 	cmpxchg_double();
> >> 						/* Hmm... 
> >> 						slab->frozen == 0 &&
> >> 						slab->freelist != NULL?
> >> 						Oh This must be on the list.. */
> > 						Oh this is wrong.
> > 						slab->freelist must be
> > 						NULL because it's full
> > 						slab.
> > 
> > 						It's more complex
> > 						than I thought...
> > 
> > 
> >> 						spin_lock_irqsave();
> >> 						cmpxchg_double();
> >> 						/* Corruption: slab
> >> 						 * was not yet inserted to
> >> 						 * list but try removing */
> >> 						remove_full();
> >> 						spin_unlock_irqrestore();
> >> 					}
> >> 	spin_lock_irqsave();
> >> 	add_full();
> >> 	spin_unlock_irqrestore();
> >> }
> > 
> > So it was...
> > 
> >  	CPU A				CPU B
> >  deactivate_slab() {			__slab_free() {
> >  	/* slab is full */
> >  	slab.frozen = 0;
> >  	cmpxchg_double();
> >  						/*
> > 							Hmm... 
> > 							!was_frozen &&
> > 							prior == NULL?
> > 							Let's freeze this!
> > 						*/
> > 						put_cpu_partial();
> >  					}
> >  	spin_lock_irqsave();
> 
> Yeah in my proposal I didn't intend to only take spin_lock_irqsave() here.
> My idea for CPU A would be something like:
>
Oh, I misunderstood your proposal.
I spent hours figuring out what was wrong, haha.

> spin_lock_irqsave();
> slab.frozen = 0;
> if (cmpxchg_double()) {
> 	/* success */
> 	add_partial(); // (or add_full())
> 	spin_unlock_irqrestore();
> } else {
> 	/* fail */
> 	spin_unlock_irqrestore();
> 	goto redo;
> }
> 	
> So we would still have the list_lock protection around cmpxchg as in the
> current code. We just wouldn't do e.g. add_partial() before cmpxchg, only to
> remove_partial() when cmpxchg failed.

Now I get what you mean...
I think that would work. I'll try that.

Thank you for the nice proposal!

> 
> >  	add_full();
> > 	/* It's now frozen by CPU B and at the same time on full list */
> >  	spin_unlock_irqrestore();
> > 
> > And &n->list_lock prevents such a race.
> 

-- 
Thank you, You are awesome!
Hyeonggon :-)

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-02-25 10:27 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-21 10:53 [PATCH 0/5] slab cleanups Hyeonggon Yoo
2022-02-21 10:53 ` [PATCH 1/5] mm/sl[au]b: Unify __ksize() Hyeonggon Yoo
2022-02-23 18:39   ` Vlastimil Babka
2022-02-23 19:06     ` Marco Elver
2022-02-24 12:26       ` Vlastimil Babka
2022-02-21 10:53 ` [PATCH 2/5] mm/sl[auo]b: Do not export __ksize() Hyeonggon Yoo
2022-02-21 15:46   ` Matthew Wilcox
2022-02-23  3:26     ` Hyeonggon Yoo
2022-02-23 18:40     ` Vlastimil Babka
2022-02-21 10:53 ` [PATCH 3/5] mm/slab: Do not call kmalloc_large() for unsupported size Hyeonggon Yoo
2022-02-21 15:53   ` Matthew Wilcox
2022-02-22  8:10     ` Hyeonggon Yoo
2022-02-22 19:59       ` Matthew Wilcox
2022-02-23  3:24         ` Hyeonggon Yoo
2022-02-24 12:48   ` Vlastimil Babka
2022-02-24 13:31     ` Hyeonggon Yoo
2022-02-24 15:08       ` Vlastimil Babka
2022-02-21 10:53 ` [PATCH 4/5] mm/slub: Limit min_partial only in cache creation Hyeonggon Yoo
2022-02-22 23:48   ` David Rientjes
2022-02-23  3:37     ` Hyeonggon Yoo
2022-02-24 12:52       ` Vlastimil Babka
2022-02-21 10:53 ` [PATCH 5/5] mm/slub: Refactor deactivate_slab() Hyeonggon Yoo
2022-02-24 18:16   ` Vlastimil Babka
2022-02-25  9:34     ` Hyeonggon Yoo
2022-02-25  9:50       ` Hyeonggon Yoo
2022-02-25 10:07         ` Vlastimil Babka
2022-02-25 10:26           ` Hyeonggon Yoo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).