linux-mm.kvack.org archive mirror
* [PATCH 0/5] fix validation races and cleanup locking
@ 2022-08-12  9:14 Vlastimil Babka
  2022-08-12  9:14 ` [PATCH 2/5] mm/slub: restrict sysfs validation to debug caches and make it safe Vlastimil Babka
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Vlastimil Babka @ 2022-08-12  9:14 UTC (permalink / raw)
  To: Rongwei Wang, Christoph Lameter, Joonsoo Kim, David Rientjes,
	Pekka Enberg
  Cc: Hyeonggon Yoo, Roman Gushchin, linux-mm,
	Sebastian Andrzej Siewior, Thomas Gleixner, Mike Galbraith,
	Vlastimil Babka

This series builds on the validation races fix posted previously [1],
which became patch 2 here; its description contains all the details.

Thanks to Hyeonggon Yoo's observation, patch 3 removes more slab_lock()
usage that became unnecessary after patch 2.

This made it possible to further simplify locking code in patches 4 and
5. Since those are related to PREEMPT_RT, I'm CCing relevant people on
this series.

[1] https://lore.kernel.org/all/20220809140043.9903-1-vbabka@suse.cz/

Vlastimil Babka (5):
  mm/slub: move free_debug_processing() further
  mm/slub: restrict sysfs validation to debug caches and make it safe
  mm/slub: remove slab_lock() usage for debug operations
  mm/slub: convert object_map_lock to non-raw spinlock
  mm/slub: simplify __cmpxchg_double_slab() and slab_[un]lock()

 mm/slub.c | 417 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 252 insertions(+), 165 deletions(-)

-- 
2.37.1




* [PATCH 2/5] mm/slub: restrict sysfs validation to debug caches and make it safe
  2022-08-12  9:14 [PATCH 0/5] fix validation races and cleanup locking Vlastimil Babka
@ 2022-08-12  9:14 ` Vlastimil Babka
  2022-08-14 14:39   ` Hyeonggon Yoo
  2022-08-12  9:16 ` [PATCH 0/5] fix validation races and cleanup locking Vlastimil Babka
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Vlastimil Babka @ 2022-08-12  9:14 UTC (permalink / raw)
  To: Rongwei Wang, Christoph Lameter, Joonsoo Kim, David Rientjes,
	Pekka Enberg
  Cc: Hyeonggon Yoo, Roman Gushchin, linux-mm,
	Sebastian Andrzej Siewior, Thomas Gleixner, Mike Galbraith,
	Vlastimil Babka

Rongwei Wang reports [1] that cache validation triggered by writing to
/sys/kernel/slab/<cache>/validate is racy against normal cache
operations (e.g. freeing) in a way that can cause false positive
inconsistency reports for caches with debugging enabled. The problem is
that debugging actions that mark object free or active and actual
freelist operations are not atomic, and the validation can see an
inconsistent state.

For caches that do or don't have debugging enabled, additional races
involving n->nr_slabs are possible that result in false reports of wrong
slab counts.

This patch attempts to solve these issues while not adding overhead to
normal (especially fastpath) operations for caches that do not have
debugging enabled. Such overhead would not be justified to make possible
userspace-triggered validation safe. Instead, disable the validation for
caches that don't have debugging enabled and make their sysfs validate
handler return -EINVAL.
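
For illustration, triggering validation from userspace then behaves roughly as
follows (a sketch, not part of the patch; "kmalloc-64" is only an example and
the cache must have been set up with slub_debug flags for the write to be
accepted):

/*
 * Illustration only: exercise /sys/kernel/slab/<cache>/validate.
 * With this patch the write succeeds only for kmem_cache_debug()
 * caches and fails with -EINVAL otherwise.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/kernel/slab/kmalloc-64/validate", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "1", 1) < 0)
		printf("validate refused: %s\n", strerror(errno));
	else
		printf("validation ran, check dmesg for reports\n");
	close(fd);
	return 0;
}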

For caches that do have debugging enabled, we can instead extend the
existing approach of not using percpu freelists to force all alloc/free
perations to the slow paths where debugging flags are checked and acted
upon. There we can adjust the debug-specific paths to increase n->list_lock
coverage against concurrent validation as necessary.

The processing on free in free_debug_processing() already happens under
n->list_lock so we can extend it to actually do the freeing as well and
thus make it atomic against concurrent validation. As observed by
Hyeonggon Yoo, we do not really need to take slab_lock() anymore here
because all paths we could race with are protected by n->list_lock under
the new scheme, so drop its usage here.

The processing on alloc in alloc_debug_processing() currently doesn't
take any locks, but we have to first allocate the object from a slab on
the partial list (as debugging caches have no percpu slabs) and thus
take the n->list_lock anyway. Add a function alloc_single_from_partial()
that grabs just the allocated object instead of the whole freelist, and
does the debug processing. The n->list_lock coverage again makes it
atomic against validation and it is also ultimately more efficient than
the current grabbing of freelist immediately followed by slab
deactivation.

To prevent races on n->nr_slabs updates, make sure that for caches with
debugging enabled, inc_slabs_node() or dec_slabs_node() is called under
n->list_lock. When allocating a new slab for a debug cache, handle the
allocation by a new function alloc_single_from_new_slab() instead of the
current forced deactivation path.

Neither of these changes affect the fast paths at all. The changes in
slow paths are negligible for non-debug caches.

[1] https://lore.kernel.org/all/20220529081535.69275-1-rongwei.wang@linux.alibaba.com/
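
For orientation, the resulting scheme can be summarized like this (an informal
overview; the diff below is authoritative):

/*
 * The main paths touching a debug cache's slabs now all run under
 * n->list_lock, which is also what validation takes:
 *
 *  alloc from a partial slab: get_partial_node()
 *                               -> alloc_single_from_partial()  (list_lock held)
 *  alloc from a new slab:     ___slab_alloc()
 *                               -> alloc_single_from_new_slab() (takes list_lock)
 *  free:                      __slab_free()
 *                               -> free_debug_processing()      (takes list_lock)
 *  validation:                validate_slab_cache()             (takes list_lock)
 *
 * inc_slabs_node()/dec_slabs_node() for these caches are likewise meant to be
 * called with list_lock held.
 */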

Reported-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 227 ++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 177 insertions(+), 50 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 87e794ab101a..fa7efd2d98be 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1324,17 +1324,14 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
 }
 
 static noinline int alloc_debug_processing(struct kmem_cache *s,
-					struct slab *slab,
-					void *object, unsigned long addr)
+					struct slab *slab, void *object)
 {
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
 		if (!alloc_consistency_checks(s, slab, object))
 			goto bad;
 	}
 
-	/* Success perform special debug activities for allocs */
-	if (s->flags & SLAB_STORE_USER)
-		set_track(s, object, TRACK_ALLOC, addr);
+	/* Success. Perform special debug activities for allocs */
 	trace(s, slab, object, 1);
 	init_object(s, object, SLUB_RED_ACTIVE);
 	return 1;
@@ -1604,9 +1601,9 @@ static inline
 void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}
 
 static inline int alloc_debug_processing(struct kmem_cache *s,
-	struct slab *slab, void *object, unsigned long addr) { return 0; }
+	struct slab *slab, void *object) { return 0; }
 
-static inline int free_debug_processing(
+static inline void free_debug_processing(
 	struct kmem_cache *s, struct slab *slab,
 	void *head, void *tail, int bulk_cnt,
 	unsigned long addr) { return 0; }
@@ -1614,6 +1611,8 @@ static inline int free_debug_processing(
 static inline void slab_pad_check(struct kmem_cache *s, struct slab *slab) {}
 static inline int check_object(struct kmem_cache *s, struct slab *slab,
 			void *object, u8 val) { return 1; }
+static inline void set_track(struct kmem_cache *s, void *object,
+			     enum track_item alloc, unsigned long addr) {}
 static inline void add_full(struct kmem_cache *s, struct kmem_cache_node *n,
 					struct slab *slab) {}
 static inline void remove_full(struct kmem_cache *s, struct kmem_cache_node *n,
@@ -1919,11 +1918,13 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 		 */
 		slab = alloc_slab_page(alloc_gfp, node, oo);
 		if (unlikely(!slab))
-			goto out;
+			return NULL;
 		stat(s, ORDER_FALLBACK);
 	}
 
 	slab->objects = oo_objects(oo);
+	slab->inuse = 0;
+	slab->frozen = 0;
 
 	account_slab(slab, oo_order(oo), s, flags);
 
@@ -1950,15 +1951,6 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 		set_freepointer(s, p, NULL);
 	}
 
-	slab->inuse = slab->objects;
-	slab->frozen = 1;
-
-out:
-	if (!slab)
-		return NULL;
-
-	inc_slabs_node(s, slab_nid(slab), slab->objects);
-
 	return slab;
 }
 
@@ -2045,6 +2037,76 @@ static inline void remove_partial(struct kmem_cache_node *n,
 	n->nr_partial--;
 }
 
+/*
+ * Called only for kmem_cache_debug() caches instead of acquire_slab(), with a
+ * slab from the n->partial list. Remove only a single object from the slab, do
+ * the alloc_debug_processing() checks and leave the slab on the list, or move
+ * it to full list if it was the last free object.
+ */
+static void *alloc_single_from_partial(struct kmem_cache *s,
+		struct kmem_cache_node *n, struct slab *slab)
+{
+	void *object;
+
+	lockdep_assert_held(&n->list_lock);
+
+	object = slab->freelist;
+	slab->freelist = get_freepointer(s, object);
+	slab->inuse++;
+
+	if (!alloc_debug_processing(s, slab, object)) {
+		remove_partial(n, slab);
+		return NULL;
+	}
+
+	if (slab->inuse == slab->objects) {
+		remove_partial(n, slab);
+		add_full(s, n, slab);
+	}
+
+	return object;
+}
+
+/*
+ * Called only for kmem_cache_debug() caches to allocate from a freshly
+ * allocated slab. Allocate a single object instead of whole freelist
+ * and put the slab to the partial (or full) list.
+ */
+static void *alloc_single_from_new_slab(struct kmem_cache *s,
+					struct slab *slab)
+{
+	int nid = slab_nid(slab);
+	struct kmem_cache_node *n = get_node(s, nid);
+	unsigned long flags;
+	void *object;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+
+	object = slab->freelist;
+	slab->freelist = get_freepointer(s, object);
+	slab->inuse = 1;
+
+	if (!alloc_debug_processing(s, slab, object)) {
+		/*
+		 * It's not really expected that this would fail on a
+		 * freshly allocated slab, but a concurrent memory
+		 * corruption in theory could cause that.
+		 */
+		spin_unlock_irqrestore(&n->list_lock, flags);
+		return NULL;
+	}
+
+	if (slab->inuse == slab->objects)
+		add_full(s, n, slab);
+	else
+		add_partial(n, slab, DEACTIVATE_TO_HEAD);
+
+	inc_slabs_node(s, nid, slab->objects);
+	spin_unlock_irqrestore(&n->list_lock, flags);
+
+	return object;
+}
+
 /*
  * Remove slab from the partial list, freeze it and
  * return the pointer to the freelist.
@@ -2125,6 +2187,13 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 		if (!pfmemalloc_match(slab, gfpflags))
 			continue;
 
+		if (kmem_cache_debug(s)) {
+			object = alloc_single_from_partial(s, n, slab);
+			if (object)
+				break;
+			continue;
+		}
+
 		t = acquire_slab(s, n, slab, object == NULL);
 		if (!t)
 			break;
@@ -2733,31 +2802,39 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 }
 
 /* Supports checking bulk free of a constructed freelist */
-static noinline int free_debug_processing(
+static noinline void free_debug_processing(
 	struct kmem_cache *s, struct slab *slab,
 	void *head, void *tail, int bulk_cnt,
 	unsigned long addr)
 {
 	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
+	struct slab *slab_free = NULL;
 	void *object = head;
 	int cnt = 0;
-	unsigned long flags, flags2;
-	int ret = 0;
+	unsigned long flags;
+	bool checks_ok = false;
 	depot_stack_handle_t handle = 0;
 
 	if (s->flags & SLAB_STORE_USER)
 		handle = set_track_prepare();
 
 	spin_lock_irqsave(&n->list_lock, flags);
-	slab_lock(slab, &flags2);
 
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
 		if (!check_slab(s, slab))
 			goto out;
 	}
 
+	if (slab->inuse < bulk_cnt) {
+		slab_err(s, slab, "Slab has %d allocated objects but %d are to be freed\n",
+			 slab->inuse, bulk_cnt);
+		goto out;
+	}
+
 next_object:
-	cnt++;
+
+	if (++cnt > bulk_cnt)
+		goto out_cnt;
 
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
 		if (!free_consistency_checks(s, slab, object, addr))
@@ -2775,18 +2852,56 @@ static noinline int free_debug_processing(
 		object = get_freepointer(s, object);
 		goto next_object;
 	}
-	ret = 1;
+	checks_ok = true;
 
-out:
+out_cnt:
 	if (cnt != bulk_cnt)
-		slab_err(s, slab, "Bulk freelist count(%d) invalid(%d)\n",
+		slab_err(s, slab, "Bulk free expected %d objects but found %d\n",
 			 bulk_cnt, cnt);
 
-	slab_unlock(slab, &flags2);
+out:
+	if (checks_ok) {
+		void *prior = slab->freelist;
+
+		/* Perform the actual freeing while we still hold the locks */
+		slab->inuse -= cnt;
+		set_freepointer(s, tail, prior);
+		slab->freelist = head;
+
+		/* Do we need to remove the slab from full or partial list? */
+		if (!prior) {
+			remove_full(s, n, slab);
+		} else if (slab->inuse == 0) {
+			remove_partial(n, slab);
+			stat(s, FREE_REMOVE_PARTIAL);
+		}
+
+		/* Do we need to discard the slab or add to partial list? */
+		if (slab->inuse == 0) {
+			slab_free = slab;
+		} else if (!prior) {
+			add_partial(n, slab, DEACTIVATE_TO_TAIL);
+			stat(s, FREE_ADD_PARTIAL);
+		}
+	}
+
+	if (slab_free) {
+		/*
+		 * Update the counters while still holding n->list_lock to
+		 * prevent spurious validation warnings
+		 */
+		dec_slabs_node(s, slab_nid(slab_free), slab_free->objects);
+	}
+
 	spin_unlock_irqrestore(&n->list_lock, flags);
-	if (!ret)
+
+	if (!checks_ok)
 		slab_fix(s, "Object at 0x%p not freed", object);
-	return ret;
+
+	if (slab_free) {
+		stat(s, FREE_SLAB);
+		free_slab(s, slab_free);
+	}
 }
 #endif /* CONFIG_SLUB_DEBUG */
 
@@ -3036,36 +3151,52 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		return NULL;
 	}
 
+	stat(s, ALLOC_SLAB);
+
+	if (kmem_cache_debug(s)) {
+		freelist = alloc_single_from_new_slab(s, slab);
+
+		if (unlikely(!freelist))
+			goto new_objects;
+
+		if (s->flags & SLAB_STORE_USER)
+			set_track(s, freelist, TRACK_ALLOC, addr);
+
+		return freelist;
+	}
+
 	/*
 	 * No other reference to the slab yet so we can
 	 * muck around with it freely without cmpxchg
 	 */
 	freelist = slab->freelist;
 	slab->freelist = NULL;
+	slab->inuse = slab->objects;
+	slab->frozen = 1;
 
-	stat(s, ALLOC_SLAB);
+	inc_slabs_node(s, slab_nid(slab), slab->objects);
 
 check_new_slab:
 
 	if (kmem_cache_debug(s)) {
-		if (!alloc_debug_processing(s, slab, freelist, addr)) {
-			/* Slab failed checks. Next slab needed */
-			goto new_slab;
-		} else {
-			/*
-			 * For debug case, we don't load freelist so that all
-			 * allocations go through alloc_debug_processing()
-			 */
-			goto return_single;
-		}
+		/*
+		 * For debug caches here we had to go through
+		 * alloc_single_from_partial() so just store the tracking info
+		 * and return the object
+		 */
+		if (s->flags & SLAB_STORE_USER)
+			set_track(s, freelist, TRACK_ALLOC, addr);
+		return freelist;
 	}
 
-	if (unlikely(!pfmemalloc_match(slab, gfpflags)))
+	if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
 		/*
 		 * For !pfmemalloc_match() case we don't load freelist so that
 		 * we don't make further mismatched allocations easier.
 		 */
-		goto return_single;
+		deactivate_slab(s, slab, get_freepointer(s, freelist));
+		return freelist;
+	}
 
 retry_load_slab:
 
@@ -3089,11 +3220,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	c->slab = slab;
 
 	goto load_freelist;
-
-return_single:
-
-	deactivate_slab(s, slab, get_freepointer(s, freelist));
-	return freelist;
 }
 
 /*
@@ -3341,9 +3467,10 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 	if (kfence_free(head))
 		return;
 
-	if (kmem_cache_debug(s) &&
-	    !free_debug_processing(s, slab, head, tail, cnt, addr))
+	if (kmem_cache_debug(s)) {
+		free_debug_processing(s, slab, head, tail, cnt, addr);
 		return;
+	}
 
 	do {
 		if (unlikely(n)) {
@@ -3936,6 +4063,7 @@ static void early_kmem_cache_node_alloc(int node)
 	slab = new_slab(kmem_cache_node, GFP_NOWAIT, node);
 
 	BUG_ON(!slab);
+	inc_slabs_node(kmem_cache_node, slab_nid(slab), slab->objects);
 	if (slab_nid(slab) != node) {
 		pr_err("SLUB: Unable to allocate memory from node %d\n", node);
 		pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
@@ -3950,7 +4078,6 @@ static void early_kmem_cache_node_alloc(int node)
 	n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL, false);
 	slab->freelist = get_freepointer(kmem_cache_node, n);
 	slab->inuse = 1;
-	slab->frozen = 0;
 	kmem_cache_node->node[node] = n;
 	init_kmem_cache_node(n);
 	inc_slabs_node(kmem_cache_node, node, slab->objects);
@@ -5601,7 +5728,7 @@ static ssize_t validate_store(struct kmem_cache *s,
 {
 	int ret = -EINVAL;
 
-	if (buf[0] == '1') {
+	if (buf[0] == '1' && kmem_cache_debug(s)) {
 		ret = validate_slab_cache(s);
 		if (ret >= 0)
 			ret = length;
-- 
2.37.1




* Re: [PATCH 0/5] fix validation races and cleanup locking
  2022-08-12  9:14 [PATCH 0/5] fix validation races and cleanup locking Vlastimil Babka
  2022-08-12  9:14 ` [PATCH 2/5] mm/slub: restrict sysfs validation to debug caches and make it safe Vlastimil Babka
@ 2022-08-12  9:16 ` Vlastimil Babka
       [not found] ` <20220812091426.18418-4-vbabka@suse.cz>
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Vlastimil Babka @ 2022-08-12  9:16 UTC (permalink / raw)
  To: Rongwei Wang, Christoph Lameter, Joonsoo Kim, David Rientjes,
	Pekka Enberg
  Cc: Hyeonggon Yoo, Roman Gushchin, linux-mm,
	Sebastian Andrzej Siewior, Thomas Gleixner, Mike Galbraith

On 8/12/22 11:14, Vlastimil Babka wrote:
> This series builds on the validation races fix posted previously [1],
> which became patch 2 here; its description contains all the details.
> 
> Thanks to Hyeonggon Yoo's observation, patch 3 removes more slab_lock()
> usage that became unnecessary after patch 2.
> 
> This made it possible to further simplify locking code in patches 4 and
> 5. Since those are related to PREEMPT_RT, I'm CCing relevant people on
> this series.

Forgot the link to the git version of this:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-validate-fix-v2r1

> [1] https://lore.kernel.org/all/20220809140043.9903-1-vbabka@suse.cz/
> 
> Vlastimil Babka (5):
>   mm/slub: move free_debug_processing() further
>   mm/slub: restrict sysfs validation to debug caches and make it safe
>   mm/slub: remove slab_lock() usage for debug operations
>   mm/slub: convert object_map_lock to non-raw spinlock
>   mm/slub: simplify __cmpxchg_double_slab() and slab_[un]lock()
> 
>  mm/slub.c | 417 +++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 252 insertions(+), 165 deletions(-)
> 




* Re: [PATCH 1/5] mm/slub: move free_debug_processing() further
       [not found] ` <20220812091426.18418-2-vbabka@suse.cz>
@ 2022-08-14 13:42   ` Hyeonggon Yoo
  2022-08-15  0:03   ` David Rientjes
  1 sibling, 0 replies; 12+ messages in thread
From: Hyeonggon Yoo @ 2022-08-14 13:42 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Rongwei Wang, Christoph Lameter, Joonsoo Kim, David Rientjes,
	Pekka Enberg, Roman Gushchin, linux-mm,
	Sebastian Andrzej Siewior, Thomas Gleixner, Mike Galbraith

On Fri, Aug 12, 2022 at 11:14:22AM +0200, Vlastimil Babka wrote:
> In the following patch, the function free_debug_processing() will be
> calling add_partial(), remove_partial() and discard_slab(), se move it
> below their definitions to avoid forward declarations. To make review
> easier, separate the move from functional changes.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slub.c | 114 +++++++++++++++++++++++++++---------------------------
>  1 file changed, 57 insertions(+), 57 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 862dbd9af4f5..87e794ab101a 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1385,63 +1385,6 @@ static inline int free_consistency_checks(struct kmem_cache *s,
>  	return 1;
>  }
>  
> -/* Supports checking bulk free of a constructed freelist */
> -static noinline int free_debug_processing(
> -	struct kmem_cache *s, struct slab *slab,
> -	void *head, void *tail, int bulk_cnt,
> -	unsigned long addr)
> -{
> -	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> -	void *object = head;
> -	int cnt = 0;
> -	unsigned long flags, flags2;
> -	int ret = 0;
> -	depot_stack_handle_t handle = 0;
> -
> -	if (s->flags & SLAB_STORE_USER)
> -		handle = set_track_prepare();
> -
> -	spin_lock_irqsave(&n->list_lock, flags);
> -	slab_lock(slab, &flags2);
> -
> -	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
> -		if (!check_slab(s, slab))
> -			goto out;
> -	}
> -
> -next_object:
> -	cnt++;
> -
> -	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
> -		if (!free_consistency_checks(s, slab, object, addr))
> -			goto out;
> -	}
> -
> -	if (s->flags & SLAB_STORE_USER)
> -		set_track_update(s, object, TRACK_FREE, addr, handle);
> -	trace(s, slab, object, 0);
> -	/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
> -	init_object(s, object, SLUB_RED_INACTIVE);
> -
> -	/* Reached end of constructed freelist yet? */
> -	if (object != tail) {
> -		object = get_freepointer(s, object);
> -		goto next_object;
> -	}
> -	ret = 1;
> -
> -out:
> -	if (cnt != bulk_cnt)
> -		slab_err(s, slab, "Bulk freelist count(%d) invalid(%d)\n",
> -			 bulk_cnt, cnt);
> -
> -	slab_unlock(slab, &flags2);
> -	spin_unlock_irqrestore(&n->list_lock, flags);
> -	if (!ret)
> -		slab_fix(s, "Object at 0x%p not freed", object);
> -	return ret;
> -}
> -
>  /*
>   * Parse a block of slub_debug options. Blocks are delimited by ';'
>   *
> @@ -2788,6 +2731,63 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
>  {
>  	return atomic_long_read(&n->total_objects);
>  }
> +
> +/* Supports checking bulk free of a constructed freelist */
> +static noinline int free_debug_processing(
> +	struct kmem_cache *s, struct slab *slab,
> +	void *head, void *tail, int bulk_cnt,
> +	unsigned long addr)
> +{
> +	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> +	void *object = head;
> +	int cnt = 0;
> +	unsigned long flags, flags2;
> +	int ret = 0;
> +	depot_stack_handle_t handle = 0;
> +
> +	if (s->flags & SLAB_STORE_USER)
> +		handle = set_track_prepare();
> +
> +	spin_lock_irqsave(&n->list_lock, flags);
> +	slab_lock(slab, &flags2);
> +
> +	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
> +		if (!check_slab(s, slab))
> +			goto out;
> +	}
> +
> +next_object:
> +	cnt++;
> +
> +	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
> +		if (!free_consistency_checks(s, slab, object, addr))
> +			goto out;
> +	}
> +
> +	if (s->flags & SLAB_STORE_USER)
> +		set_track_update(s, object, TRACK_FREE, addr, handle);
> +	trace(s, slab, object, 0);
> +	/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
> +	init_object(s, object, SLUB_RED_INACTIVE);
> +
> +	/* Reached end of constructed freelist yet? */
> +	if (object != tail) {
> +		object = get_freepointer(s, object);
> +		goto next_object;
> +	}
> +	ret = 1;
> +
> +out:
> +	if (cnt != bulk_cnt)
> +		slab_err(s, slab, "Bulk freelist count(%d) invalid(%d)\n",
> +			 bulk_cnt, cnt);
> +
> +	slab_unlock(slab, &flags2);
> +	spin_unlock_irqrestore(&n->list_lock, flags);
> +	if (!ret)
> +		slab_fix(s, "Object at 0x%p not freed", object);
> +	return ret;
> +}
>  #endif /* CONFIG_SLUB_DEBUG */
>  
>  #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
> -- 
> 2.37.1
> 

Looks good to me.

Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Thanks!

-- 
Thanks,
Hyeonggon



* Re: [PATCH 2/5] mm/slub: restrict sysfs validation to debug caches and make it safe
  2022-08-12  9:14 ` [PATCH 2/5] mm/slub: restrict sysfs validation to debug caches and make it safe Vlastimil Babka
@ 2022-08-14 14:39   ` Hyeonggon Yoo
  2022-08-23 16:39     ` Vlastimil Babka
  0 siblings, 1 reply; 12+ messages in thread
From: Hyeonggon Yoo @ 2022-08-14 14:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Rongwei Wang, Christoph Lameter, Joonsoo Kim, David Rientjes,
	Pekka Enberg, Roman Gushchin, linux-mm,
	Sebastian Andrzej Siewior, Thomas Gleixner, Mike Galbraith

On Fri, Aug 12, 2022 at 11:14:23AM +0200, Vlastimil Babka wrote:
> Rongwei Wang reports [1] that cache validation triggered by writing to
> /sys/kernel/slab/<cache>/validate is racy against normal cache
> operations (e.g. freeing) in a way that can cause false positive
> inconsistency reports for caches with debugging enabled. The problem is
> that debugging actions that mark object free or active and actual
> freelist operations are not atomic, and the validation can see an
> inconsistent state.
> 
> For caches that do or don't have debugging enabled, additional races
> involving n->nr_slabs are possible that result in false reports of wrong
> slab counts.
> 
> This patch attempts to solve these issues while not adding overhead to
> normal (especially fastpath) operations for caches that do not have
> debugging enabled. Such overhead would not be justified to make possible
> userspace-triggered validation safe. Instead, disable the validation for
> caches that don't have debugging enabled and make their sysfs validate
> handler return -EINVAL.
> 
> For caches that do have debugging enabled, we can instead extend the
> existing approach of not using percpu freelists to force all alloc/free
> perations to the slow paths where debugging flags are checked and acted
> upon. There we can adjust the debug-specific paths to increase n->list_lock
> coverage against concurrent validation as necessary.

s/perations/operations

> The processing on free in free_debug_processing() already happens under
> n->list_lock so we can extend it to actually do the freeing as well and
> thus make it atomic against concurrent validation. As observed by
> Hyeonggon Yoo, we do not really need to take slab_lock() anymore here
> because all paths we could race with are protected by n->list_lock under
> the new scheme, so drop its usage here.
> 
> The processing on alloc in alloc_debug_processing() currently doesn't
> take any locks, but we have to first allocate the object from a slab on
> the partial list (as debugging caches have no percpu slabs) and thus
> take the n->list_lock anyway. Add a function alloc_single_from_partial()
> that grabs just the allocated object instead of the whole freelist, and
> does the debug processing. The n->list_lock coverage again makes it
> atomic against validation and it is also ultimately more efficient than
> the current grabbing of freelist immediately followed by slab
> deactivation.
> 
> To prevent races on n->nr_slabs updates, make sure that for caches with
> debugging enabled, inc_slabs_node() or dec_slabs_node() is called under
> n->list_lock. When allocating a new slab for a debug cache, handle the
> allocation by a new function alloc_single_from_new_slab() instead of the
> current forced deactivation path.
> 
> Neither of these changes affect the fast paths at all. The changes in
> slow paths are negligible for non-debug caches.
> 
> [1] https://lore.kernel.org/all/20220529081535.69275-1-rongwei.wang@linux.alibaba.com/
> 
> Reported-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slub.c | 227 ++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 177 insertions(+), 50 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 87e794ab101a..fa7efd2d98be 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1324,17 +1324,14 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
>  }
>  
>  static noinline int alloc_debug_processing(struct kmem_cache *s,
> -					struct slab *slab,
> -					void *object, unsigned long addr)
> +					struct slab *slab, void *object)
>  {
>  	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
>  		if (!alloc_consistency_checks(s, slab, object))
>  			goto bad;
>  	}
>  
> -	/* Success perform special debug activities for allocs */
> -	if (s->flags & SLAB_STORE_USER)
> -		set_track(s, object, TRACK_ALLOC, addr);
> +	/* Success. Perform special debug activities for allocs */
>  	trace(s, slab, object, 1);
>  	init_object(s, object, SLUB_RED_ACTIVE);
>  	return 1;
> @@ -1604,9 +1601,9 @@ static inline
>  void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}
>  
>  static inline int alloc_debug_processing(struct kmem_cache *s,
> -	struct slab *slab, void *object, unsigned long addr) { return 0; }
> +	struct slab *slab, void *object) { return 0; }
>  
> -static inline int free_debug_processing(
> +static inline void free_debug_processing(
>  	struct kmem_cache *s, struct slab *slab,
>  	void *head, void *tail, int bulk_cnt,
>  	unsigned long addr) { return 0; }

IIRC, as the bot reported on an earlier patch, a void function
should not return 0;
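
A sketch of the fix (the !CONFIG_SLUB_DEBUG stub simply becomes empty):

static inline void free_debug_processing(
	struct kmem_cache *s, struct slab *slab,
	void *head, void *tail, int bulk_cnt,
	unsigned long addr) {}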

> @@ -1614,6 +1611,8 @@ static inline int free_debug_processing(
>  static inline void slab_pad_check(struct kmem_cache *s, struct slab *slab) {}
>  static inline int check_object(struct kmem_cache *s, struct slab *slab,
>  			void *object, u8 val) { return 1; }
> +static inline void set_track(struct kmem_cache *s, void *object,
> +			     enum track_item alloc, unsigned long addr) {}
>  static inline void add_full(struct kmem_cache *s, struct kmem_cache_node *n,
>  					struct slab *slab) {}
>  static inline void remove_full(struct kmem_cache *s, struct kmem_cache_node *n,
> @@ -1919,11 +1918,13 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>  		 */
>  		slab = alloc_slab_page(alloc_gfp, node, oo);
>  		if (unlikely(!slab))
> -			goto out;
> +			return NULL;
>  		stat(s, ORDER_FALLBACK);
>  	}
>  
>  	slab->objects = oo_objects(oo);
> +	slab->inuse = 0;
> +	slab->frozen = 0;
>  
>  	account_slab(slab, oo_order(oo), s, flags);
>  
> @@ -1950,15 +1951,6 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>  		set_freepointer(s, p, NULL);
>  	}
>  
> -	slab->inuse = slab->objects;
> -	slab->frozen = 1;
> -
> -out:
> -	if (!slab)
> -		return NULL;
> -
> -	inc_slabs_node(s, slab_nid(slab), slab->objects);
> -
>  	return slab;
>  }
>  
> @@ -2045,6 +2037,76 @@ static inline void remove_partial(struct kmem_cache_node *n,
>  	n->nr_partial--;
>  }
>  
> +/*
> + * Called only for kmem_cache_debug() caches instead of acquire_slab(), with a
> + * slab from the n->partial list. Remove only a single object from the slab, do
> + * the alloc_debug_processing() checks and leave the slab on the list, or move
> + * it to full list if it was the last free object.
> + */
> +static void *alloc_single_from_partial(struct kmem_cache *s,
> +		struct kmem_cache_node *n, struct slab *slab)
> +{
> +	void *object;
> +
> +	lockdep_assert_held(&n->list_lock);
> +
> +	object = slab->freelist;
> +	slab->freelist = get_freepointer(s, object);
> +	slab->inuse++;
> +
> +	if (!alloc_debug_processing(s, slab, object)) {
> +		remove_partial(n, slab);
> +		return NULL;
> +	}
> +
> +	if (slab->inuse == slab->objects) {
> +		remove_partial(n, slab);
> +		add_full(s, n, slab);
> +	}
> +
> +	return object;
> +}
> +
> +/*
> + * Called only for kmem_cache_debug() caches to allocate from a freshly
> + * allocated slab. Allocate a single object instead of whole freelist
> + * and put the slab to the partial (or full) list.
> + */
> +static void *alloc_single_from_new_slab(struct kmem_cache *s,
> +					struct slab *slab)
> +{
> +	int nid = slab_nid(slab);
> +	struct kmem_cache_node *n = get_node(s, nid);
> +	unsigned long flags;
> +	void *object;
> +
> +	spin_lock_irqsave(&n->list_lock, flags);
> +
> +	object = slab->freelist;
> +	slab->freelist = get_freepointer(s, object);
> +	slab->inuse = 1;
> +
> +	if (!alloc_debug_processing(s, slab, object)) {
> +		/*
> +		 * It's not really expected that this would fail on a
> +		 * freshly allocated slab, but a concurrent memory
> +		 * corruption in theory could cause that.
> +		 */
> +		spin_unlock_irqrestore(&n->list_lock, flags);
> +		return NULL;
> +	}
> +

Nit: spin_lock_irqsave() could be taken here instead, as a freshly
allocated slab has no other reference yet.
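
Something like this, just as a sketch of the idea:

	object = slab->freelist;
	slab->freelist = get_freepointer(s, object);
	slab->inuse = 1;

	if (!alloc_debug_processing(s, slab, object))
		/* nobody else can see this slab yet, nothing to unlock */
		return NULL;

	spin_lock_irqsave(&n->list_lock, flags);
	/* ... add_full()/add_partial(), inc_slabs_node(), unlock as before ... */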

> +	if (slab->inuse == slab->objects)
> +		add_full(s, n, slab);
> +	else
> +		add_partial(n, slab, DEACTIVATE_TO_HEAD);
> +
> +	inc_slabs_node(s, nid, slab->objects);
> +	spin_unlock_irqrestore(&n->list_lock, flags);
> +
> +	return object;
> +}
> +
>  /*
>   * Remove slab from the partial list, freeze it and
>   * return the pointer to the freelist.
> @@ -2125,6 +2187,13 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
>  		if (!pfmemalloc_match(slab, gfpflags))
>  			continue;
>  
> +		if (kmem_cache_debug(s)) {
> +			object = alloc_single_from_partial(s, n, slab);
> +			if (object)
> +				break;
> +			continue;
> +		}
> +
>  		t = acquire_slab(s, n, slab, object == NULL);
>  		if (!t)
>  			break;
> @@ -2733,31 +2802,39 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
>  }
>  
>  /* Supports checking bulk free of a constructed freelist */
> -static noinline int free_debug_processing(
> +static noinline void free_debug_processing(
>  	struct kmem_cache *s, struct slab *slab,
>  	void *head, void *tail, int bulk_cnt,
>  	unsigned long addr)
>  {
>  	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> +	struct slab *slab_free = NULL;
>  	void *object = head;
>  	int cnt = 0;
> -	unsigned long flags, flags2;
> -	int ret = 0;
> +	unsigned long flags;
> +	bool checks_ok = false;
>  	depot_stack_handle_t handle = 0;
>  
>  	if (s->flags & SLAB_STORE_USER)
>  		handle = set_track_prepare();
>  
>  	spin_lock_irqsave(&n->list_lock, flags);
> -	slab_lock(slab, &flags2);
>  
>  	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
>  		if (!check_slab(s, slab))
>  			goto out;
>  	}
>  
> +	if (slab->inuse < bulk_cnt) {
> +		slab_err(s, slab, "Slab has %d allocated objects but %d are to be freed\n",
> +			 slab->inuse, bulk_cnt);
> +		goto out;
> +	}
> +
>  next_object:
> -	cnt++;
> +
> +	if (++cnt > bulk_cnt)
> +		goto out_cnt;
>  
>  	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
>  		if (!free_consistency_checks(s, slab, object, addr))
> @@ -2775,18 +2852,56 @@ static noinline int free_debug_processing(
>  		object = get_freepointer(s, object);
>  		goto next_object;
>  	}
> -	ret = 1;
> +	checks_ok = true;
>  
> -out:
> +out_cnt:
>  	if (cnt != bulk_cnt)
> -		slab_err(s, slab, "Bulk freelist count(%d) invalid(%d)\n",
> +		slab_err(s, slab, "Bulk free expected %d objects but found %d\n",
>  			 bulk_cnt, cnt);
>  
> -	slab_unlock(slab, &flags2);
> +out:
> +	if (checks_ok) {
> +		void *prior = slab->freelist;
> +
> +		/* Perform the actual freeing while we still hold the locks */
> +		slab->inuse -= cnt;
> +		set_freepointer(s, tail, prior);
> +		slab->freelist = head;
> +
> +		/* Do we need to remove the slab from full or partial list? */
> +		if (!prior) {
> +			remove_full(s, n, slab);
> +		} else if (slab->inuse == 0) {
> +			remove_partial(n, slab);
> +			stat(s, FREE_REMOVE_PARTIAL);
> +		}
> +
> +		/* Do we need to discard the slab or add to partial list? */
> +		if (slab->inuse == 0) {
> +			slab_free = slab;
> +		} else if (!prior) {
> +			add_partial(n, slab, DEACTIVATE_TO_TAIL);
> +			stat(s, FREE_ADD_PARTIAL);
> +		}
> +	}
> +
> +	if (slab_free) {
> +		/*
> +		 * Update the counters while still holding n->list_lock to
> +		 * prevent spurious validation warnings
> +		 */
> +		dec_slabs_node(s, slab_nid(slab_free), slab_free->objects);
> +	}

This looks good but maybe kmem_cache_shrink() can lead to
spurious validation warnings?

> +
>  	spin_unlock_irqrestore(&n->list_lock, flags);
> -	if (!ret)
> +
> +	if (!checks_ok)
>  		slab_fix(s, "Object at 0x%p not freed", object);
> -	return ret;
> +
> +	if (slab_free) {
> +		stat(s, FREE_SLAB);
> +		free_slab(s, slab_free);
> +	}
>  }
>  #endif /* CONFIG_SLUB_DEBUG */
>  
> @@ -3036,36 +3151,52 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  		return NULL;
>  	}
>  
> +	stat(s, ALLOC_SLAB);
> +
> +	if (kmem_cache_debug(s)) {
> +		freelist = alloc_single_from_new_slab(s, slab);
> +
> +		if (unlikely(!freelist))
> +			goto new_objects;
> +
> +		if (s->flags & SLAB_STORE_USER)
> +			set_track(s, freelist, TRACK_ALLOC, addr);
> +
> +		return freelist;
> +	}
> +
>  	/*
>  	 * No other reference to the slab yet so we can
>  	 * muck around with it freely without cmpxchg
>  	 */
>  	freelist = slab->freelist;
>  	slab->freelist = NULL;
> +	slab->inuse = slab->objects;
> +	slab->frozen = 1;
>  
> -	stat(s, ALLOC_SLAB);
> +	inc_slabs_node(s, slab_nid(slab), slab->objects);
>  
>  check_new_slab:
>  
>  	if (kmem_cache_debug(s)) {
> -		if (!alloc_debug_processing(s, slab, freelist, addr)) {
> -			/* Slab failed checks. Next slab needed */
> -			goto new_slab;
> -		} else {
> -			/*
> -			 * For debug case, we don't load freelist so that all
> -			 * allocations go through alloc_debug_processing()
> -			 */
> -			goto return_single;
> -		}
> +		/*
> +		 * For debug caches here we had to go through
> +		 * alloc_single_from_partial() so just store the tracking info
> +		 * and return the object
> +		 */
> +		if (s->flags & SLAB_STORE_USER)
> +			set_track(s, freelist, TRACK_ALLOC, addr);
> +		return freelist;
>  	}
>  
> -	if (unlikely(!pfmemalloc_match(slab, gfpflags)))
> +	if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
>  		/*
>  		 * For !pfmemalloc_match() case we don't load freelist so that
>  		 * we don't make further mismatched allocations easier.
>  		 */
> -		goto return_single;
> +		deactivate_slab(s, slab, get_freepointer(s, freelist));
> +		return freelist;
> +	}
>  
>  retry_load_slab:
>  
> @@ -3089,11 +3220,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	c->slab = slab;
>  
>  	goto load_freelist;
> -
> -return_single:
> -
> -	deactivate_slab(s, slab, get_freepointer(s, freelist));
> -	return freelist;
>  }
>  
>  /*
> @@ -3341,9 +3467,10 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>  	if (kfence_free(head))
>  		return;
>  
> -	if (kmem_cache_debug(s) &&
> -	    !free_debug_processing(s, slab, head, tail, cnt, addr))
> +	if (kmem_cache_debug(s)) {
> +		free_debug_processing(s, slab, head, tail, cnt, addr);
>  		return;
> +	}
>  
>  	do {
>  		if (unlikely(n)) {
> @@ -3936,6 +4063,7 @@ static void early_kmem_cache_node_alloc(int node)
>  	slab = new_slab(kmem_cache_node, GFP_NOWAIT, node);
>  
>  	BUG_ON(!slab);
> +	inc_slabs_node(kmem_cache_node, slab_nid(slab), slab->objects);
>  	if (slab_nid(slab) != node) {
>  		pr_err("SLUB: Unable to allocate memory from node %d\n", node);
>  		pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
> @@ -3950,7 +4078,6 @@ static void early_kmem_cache_node_alloc(int node)
>  	n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL, false);
>  	slab->freelist = get_freepointer(kmem_cache_node, n);
>  	slab->inuse = 1;
> -	slab->frozen = 0;
>  	kmem_cache_node->node[node] = n;
>  	init_kmem_cache_node(n);
>  	inc_slabs_node(kmem_cache_node, node, slab->objects);
> @@ -5601,7 +5728,7 @@ static ssize_t validate_store(struct kmem_cache *s,
>  {
>  	int ret = -EINVAL;
>  
> -	if (buf[0] == '1') {
> +	if (buf[0] == '1' && kmem_cache_debug(s)) {
>  		ret = validate_slab_cache(s);
>  		if (ret >= 0)
>  			ret = length;
> -- 
> 2.37.1
> 

Otherwise looks good to me!

-- 
Thanks,
Hyeonggon



* Re: [PATCH 3/5] mm/slub: remove slab_lock() usage for debug operations
       [not found] ` <20220812091426.18418-4-vbabka@suse.cz>
@ 2022-08-14 14:54   ` Hyeonggon Yoo
  2022-08-15  0:04   ` David Rientjes
  1 sibling, 0 replies; 12+ messages in thread
From: Hyeonggon Yoo @ 2022-08-14 14:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Rongwei Wang, Christoph Lameter, Joonsoo Kim, David Rientjes,
	Pekka Enberg, Roman Gushchin, linux-mm,
	Sebastian Andrzej Siewior, Thomas Gleixner, Mike Galbraith

On Fri, Aug 12, 2022 at 11:14:24AM +0200, Vlastimil Babka wrote:
> All alloc and free operations on debug caches are now serialized by
> n->list_lock, so we can remove slab_lock() usage in validate_slab()
> and list_slab_objects() as those also happen under n->list_lock.
> 
> Note the usage in list_slab_objects() could happen even on non-debug
> caches, but only during cache shutdown time so there should not be any
> parallel freeing activity anymore. Except for buggy slab users, but in
> that case the slab_lock() would not help against the common cmpxchg
> based fast paths (in non-debug caches) anyway.
> 
> Also adjust documentation comments accordingly.
> 
> Suggested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slub.c | 19 ++++++++-----------
>  1 file changed, 8 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index fa7efd2d98be..32b79bc3ae6d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -50,7 +50,7 @@
>   *   1. slab_mutex (Global Mutex)
>   *   2. node->list_lock (Spinlock)
>   *   3. kmem_cache->cpu_slab->lock (Local lock)
> - *   4. slab_lock(slab) (Only on some arches or for debugging)
> + *   4. slab_lock(slab) (Only on some arches)
>   *   5. object_map_lock (Only for debugging)
>   *
>   *   slab_mutex
> @@ -64,8 +64,9 @@
>   *   The slab_lock is a wrapper around the page lock, thus it is a bit
>   *   spinlock.
>   *
> - *   The slab_lock is only used for debugging and on arches that do not
> - *   have the ability to do a cmpxchg_double. It only protects:
> + *   The slab_lock is only used on arches that do not have the ability
> + *   to do a cmpxchg_double. It only protects:
> + *
>   *	A. slab->freelist	-> List of free objects in a slab
>   *	B. slab->inuse		-> Number of objects in use
>   *	C. slab->objects	-> Number of objects in slab
> @@ -94,6 +95,9 @@
>   *   allocating a long series of objects that fill up slabs does not require
>   *   the list lock.
>   *
> + *   For debug caches, all allocations are forced to go through a list_lock
> + *   protected region to serialize against concurrent validation.
> + *
>   *   cpu_slab->lock local lock
>   *
>   *   This locks protect slowpath manipulation of all kmem_cache_cpu fields
> @@ -4369,7 +4373,6 @@ static void list_slab_objects(struct kmem_cache *s, struct slab *slab,
>  	void *p;
>  
>  	slab_err(s, slab, text, s->name);
> -	slab_lock(slab, &flags);
>  
>  	map = get_map(s, slab);
>  	for_each_object(p, s, addr, slab->objects) {
> @@ -4380,7 +4383,6 @@ static void list_slab_objects(struct kmem_cache *s, struct slab *slab,
>  		}
>  	}
>  	put_map(map);
> -	slab_unlock(slab, &flags);
>  #endif
>  }
>  
> @@ -5107,12 +5109,9 @@ static void validate_slab(struct kmem_cache *s, struct slab *slab,
>  {
>  	void *p;
>  	void *addr = slab_address(slab);
> -	unsigned long flags;
> -
> -	slab_lock(slab, &flags);
>  
>  	if (!check_slab(s, slab) || !on_freelist(s, slab, NULL))
> -		goto unlock;
> +		return;
>  
>  	/* Now we know that a valid freelist exists */
>  	__fill_map(obj_map, s, slab);
> @@ -5123,8 +5122,6 @@ static void validate_slab(struct kmem_cache *s, struct slab *slab,
>  		if (!check_object(s, slab, p, val))
>  			break;
>  	}
> -unlock:
> -	slab_unlock(slab, &flags);
>  }
>  
>  static int validate_slab_node(struct kmem_cache *s,
> -- 
> 2.37.1
> 

Looks good to me.
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

-- 
Thanks,
Hyeonggon



* Re: [PATCH 4/5] mm/slub: convert object_map_lock to non-raw spinlock
       [not found] ` <20220812091426.18418-5-vbabka@suse.cz>
@ 2022-08-15  0:03   ` David Rientjes
  2022-08-15 12:53   ` Hyeonggon Yoo
  1 sibling, 0 replies; 12+ messages in thread
From: David Rientjes @ 2022-08-15  0:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Rongwei Wang, Christoph Lameter, Joonsoo Kim, Pekka Enberg,
	Hyeonggon Yoo, Roman Gushchin, linux-mm,
	Sebastian Andrzej Siewior, Thomas Gleixner, Mike Galbraith

On Fri, 12 Aug 2022, Vlastimil Babka wrote:

> The only remaining user of object_map_lock is list_slab_objects().
> Obtaining the lock there used to happen under slab_lock() which implied
> disabling irqs on PREEMPT_RT, thus it's a raw_spinlock. With the
> slab_lock() removed, we can convert it to a normal spinlock.
> 
> Also remove the get_map()/put_map() wrappers as list_slab_objects()
> became their only remaining user.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

I don't see a benefit to keeping the VM_BUG_ON(!irqs_disabled()); around 
either.

Acked-by: David Rientjes <rientjes@google.com>



* Re: [PATCH 1/5] mm/slub: move free_debug_processing() further
       [not found] ` <20220812091426.18418-2-vbabka@suse.cz>
  2022-08-14 13:42   ` [PATCH 1/5] mm/slub: move free_debug_processing() further Hyeonggon Yoo
@ 2022-08-15  0:03   ` David Rientjes
  1 sibling, 0 replies; 12+ messages in thread
From: David Rientjes @ 2022-08-15  0:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Rongwei Wang, Christoph Lameter, Joonsoo Kim, Pekka Enberg,
	Hyeonggon Yoo, Roman Gushchin, linux-mm,
	Sebastian Andrzej Siewior, Thomas Gleixner, Mike Galbraith

On Fri, 12 Aug 2022, Vlastimil Babka wrote:

> In the following patch, the function free_debug_processing() will be
> calling add_partial(), remove_partial() and discard_slab(), so move it
> below their definitions to avoid forward declarations. To make review
> easier, separate the move from functional changes.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: David Rientjes <rientjes@google.com>



* Re: [PATCH 3/5] mm/slub: remove slab_lock() usage for debug operations
       [not found] ` <20220812091426.18418-4-vbabka@suse.cz>
  2022-08-14 14:54   ` [PATCH 3/5] mm/slub: remove slab_lock() usage for debug operations Hyeonggon Yoo
@ 2022-08-15  0:04   ` David Rientjes
  1 sibling, 0 replies; 12+ messages in thread
From: David Rientjes @ 2022-08-15  0:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Rongwei Wang, Christoph Lameter, Joonsoo Kim, Pekka Enberg,
	Hyeonggon Yoo, Roman Gushchin, linux-mm,
	Sebastian Andrzej Siewior, Thomas Gleixner, Mike Galbraith

On Fri, 12 Aug 2022, Vlastimil Babka wrote:

> All alloc and free operations on debug caches are now serialized by
> n->list_lock, so we can remove slab_lock() usage in validate_slab()
> and list_slab_objects() as those also happen under n->list_lock.
> 
> Note the usage in list_slab_objects() could happen even on non-debug
> caches, but only during cache shutdown time so there should not be any
> parallel freeing activity anymore. Except for buggy slab users, but in
> that case the slab_lock() would not help against the common cmpxchg
> based fast paths (in non-debug caches) anyway.
> 
> Also adjust documentation comments accordingly.
> 
> Suggested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: David Rientjes <rientjes@google.com>



* Re: [PATCH 5/5] mm/slub: simplify __cmpxchg_double_slab() and slab_[un]lock()
       [not found] ` <20220812091426.18418-6-vbabka@suse.cz>
@ 2022-08-15  0:04   ` David Rientjes
  0 siblings, 0 replies; 12+ messages in thread
From: David Rientjes @ 2022-08-15  0:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Rongwei Wang, Christoph Lameter, Joonsoo Kim, Pekka Enberg,
	Hyeonggon Yoo, Roman Gushchin, linux-mm,
	Sebastian Andrzej Siewior, Thomas Gleixner, Mike Galbraith

On Fri, 12 Aug 2022, Vlastimil Babka wrote:

> The PREEMPT_RT specific disabling of irqs in __cmpxchg_double_slab()
> (through slab_[un]lock()) makes it behave on PREEMPT_RT exactly as
> cmpxchg_double_slab() does, so it's simpler to just redirect the usage
> of the former to the latter on RT.
> 
> That means we no longer need the slab_[un]lock() wrappers, so delete
> them and rename the current __slab_[un]lock() to slab_[un]lock().
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: David Rientjes <rientjes@google.com>



* Re: [PATCH 4/5] mm/slub: convert object_map_lock to non-raw spinlock
       [not found] ` <20220812091426.18418-5-vbabka@suse.cz>
  2022-08-15  0:03   ` [PATCH 4/5] mm/slub: convert object_map_lock to non-raw spinlock David Rientjes
@ 2022-08-15 12:53   ` Hyeonggon Yoo
  1 sibling, 0 replies; 12+ messages in thread
From: Hyeonggon Yoo @ 2022-08-15 12:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Rongwei Wang, Christoph Lameter, Joonsoo Kim, David Rientjes,
	Pekka Enberg, Roman Gushchin, linux-mm,
	Sebastian Andrzej Siewior, Thomas Gleixner, Mike Galbraith

On Fri, Aug 12, 2022 at 11:14:25AM +0200, Vlastimil Babka wrote:
> The only remaining user of object_map_lock is list_slab_objects().
> Obtaining the lock there used to happen under slab_lock() which implied
> disabling irqs on PREEMPT_RT, thus it's a raw_spinlock. With the
> slab_lock() removed, we can convert it to a normal spinlock.
> 
> Also remove the get_map()/put_map() wrappers as list_slab_objects()
> became their only remaining user.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slub.c | 36 ++++++------------------------------
>  1 file changed, 6 insertions(+), 30 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 32b79bc3ae6d..dffb5063acbf 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -565,7 +565,7 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct slab *slab,
>  
>  #ifdef CONFIG_SLUB_DEBUG
>  static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)];
> -static DEFINE_RAW_SPINLOCK(object_map_lock);
> +static DEFINE_SPINLOCK(object_map_lock);
>  
>  static void __fill_map(unsigned long *obj_map, struct kmem_cache *s,
>  		       struct slab *slab)
> @@ -599,30 +599,6 @@ static bool slab_add_kunit_errors(void)
>  static inline bool slab_add_kunit_errors(void) { return false; }
>  #endif
>  
> -/*
> - * Determine a map of objects in use in a slab.
> - *
> - * Node listlock must be held to guarantee that the slab does
> - * not vanish from under us.
> - */
> -static unsigned long *get_map(struct kmem_cache *s, struct slab *slab)
> -	__acquires(&object_map_lock)
> -{
> -	VM_BUG_ON(!irqs_disabled());
> -
> -	raw_spin_lock(&object_map_lock);
> -
> -	__fill_map(object_map, s, slab);
> -
> -	return object_map;
> -}
> -
> -static void put_map(unsigned long *map) __releases(&object_map_lock)
> -{
> -	VM_BUG_ON(map != object_map);
> -	raw_spin_unlock(&object_map_lock);
> -}
> -
>  static inline unsigned int size_from_object(struct kmem_cache *s)
>  {
>  	if (s->flags & SLAB_RED_ZONE)
> @@ -4368,21 +4344,21 @@ static void list_slab_objects(struct kmem_cache *s, struct slab *slab,
>  {
>  #ifdef CONFIG_SLUB_DEBUG
>  	void *addr = slab_address(slab);
> -	unsigned long flags;
> -	unsigned long *map;
>  	void *p;
>  
>  	slab_err(s, slab, text, s->name);
>  
> -	map = get_map(s, slab);
> +	spin_lock(&object_map_lock);
> +	__fill_map(object_map, s, slab);
> +
>  	for_each_object(p, s, addr, slab->objects) {
>  
> -		if (!test_bit(__obj_to_index(s, addr, p), map)) {
> +		if (!test_bit(__obj_to_index(s, addr, p), object_map)) {
>  			pr_err("Object 0x%p @offset=%tu\n", p, p - addr);
>  			print_tracking(s, p);
>  		}
>  	}
> -	put_map(map);
> +	spin_unlock(&object_map_lock);
>  #endif
>  }
>  
> -- 
> 2.37.1
> 

Looks good to me.
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

-- 
Thanks,
Hyeonggon



* Re: [PATCH 2/5] mm/slub: restrict sysfs validation to debug caches and make it safe
  2022-08-14 14:39   ` Hyeonggon Yoo
@ 2022-08-23 16:39     ` Vlastimil Babka
  0 siblings, 0 replies; 12+ messages in thread
From: Vlastimil Babka @ 2022-08-23 16:39 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Rongwei Wang, Christoph Lameter, Joonsoo Kim, David Rientjes,
	Pekka Enberg, Roman Gushchin, linux-mm,
	Sebastian Andrzej Siewior, Thomas Gleixner, Mike Galbraith

On 8/14/22 16:39, Hyeonggon Yoo wrote:
> On Fri, Aug 12, 2022 at 11:14:23AM +0200, Vlastimil Babka wrote:
>> Rongwei Wang reports [1] that cache validation triggered by writing to
>> /sys/kernel/slab/<cache>/validate is racy against normal cache
>> operations (e.g. freeing) in a way that can cause false positive
>> inconsistency reports for caches with debugging enabled. The problem is
>> that debugging actions that mark object free or active and actual
>> freelist operations are not atomic, and the validation can see an
>> inconsistent state.
>> 
>> For caches that do or don't have debugging enabled, additional races
>> involving n->nr_slabs are possible that result in false reports of wrong
>> slab counts.
>> 
>> This patch attempts to solve these issues while not adding overhead to
>> normal (especially fastpath) operations for caches that do not have
>> debugging enabled. Such overhead would not be justified to make possible
>> userspace-triggered validation safe. Instead, disable the validation for
>> caches that don't have debugging enabled and make their sysfs validate
>> handler return -EINVAL.
>> 
>> For caches that do have debugging enabled, we can instead extend the
>> existing approach of not using percpu freelists to force all alloc/free
>> perations to the slow paths where debugging flags are checked and acted
>> upon. There we can adjust the debug-specific paths to increase n->list_lock
>> coverage against concurrent validation as necessary.
> 
> s/perations/operations

OK

>> @@ -1604,9 +1601,9 @@ static inline
>>  void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}
>>  
>>  static inline int alloc_debug_processing(struct kmem_cache *s,
>> -	struct slab *slab, void *object, unsigned long addr) { return 0; }
>> +	struct slab *slab, void *object) { return 0; }
>>  
>> -static inline int free_debug_processing(
>> +static inline void free_debug_processing(
>>  	struct kmem_cache *s, struct slab *slab,
>>  	void *head, void *tail, int bulk_cnt,
>>  	unsigned long addr) { return 0; }
> 
> IIRC, as the bot reported on an earlier patch, a void function
> should not return 0;

OK

>> +/*
>> + * Called only for kmem_cache_debug() caches to allocate from a freshly
>> + * allocated slab. Allocate a single object instead of whole freelist
>> + * and put the slab to the partial (or full) list.
>> + */
>> +static void *alloc_single_from_new_slab(struct kmem_cache *s,
>> +					struct slab *slab)
>> +{
>> +	int nid = slab_nid(slab);
>> +	struct kmem_cache_node *n = get_node(s, nid);
>> +	unsigned long flags;
>> +	void *object;
>> +
>> +	spin_lock_irqsave(&n->list_lock, flags);
>> +
>> +	object = slab->freelist;
>> +	slab->freelist = get_freepointer(s, object);
>> +	slab->inuse = 1;
>> +
>> +	if (!alloc_debug_processing(s, slab, object)) {
>> +		/*
>> +		 * It's not really expected that this would fail on a
>> +		 * freshly allocated slab, but a concurrent memory
>> +		 * corruption in theory could cause that.
>> +		 */
>> +		spin_unlock_irqrestore(&n->list_lock, flags);
>> +		return NULL;
>> +	}
>> +
> 
> Nit: spin_lock_irqsave() could be taken here instead, as a freshly
> allocated slab has no other reference yet.

Right.

>> +out:
>> +	if (checks_ok) {
>> +		void *prior = slab->freelist;
>> +
>> +		/* Perform the actual freeing while we still hold the locks */
>> +		slab->inuse -= cnt;
>> +		set_freepointer(s, tail, prior);
>> +		slab->freelist = head;
>> +
>> +		/* Do we need to remove the slab from full or partial list? */
>> +		if (!prior) {
>> +			remove_full(s, n, slab);
>> +		} else if (slab->inuse == 0) {
>> +			remove_partial(n, slab);
>> +			stat(s, FREE_REMOVE_PARTIAL);
>> +		}
>> +
>> +		/* Do we need to discard the slab or add to partial list? */
>> +		if (slab->inuse == 0) {
>> +			slab_free = slab;
>> +		} else if (!prior) {
>> +			add_partial(n, slab, DEACTIVATE_TO_TAIL);
>> +			stat(s, FREE_ADD_PARTIAL);
>> +		}
>> +	}
>> +
>> +	if (slab_free) {
>> +		/*
>> +		 * Update the counters while still holding n->list_lock to
>> +		 * prevent spurious validation warnings
>> +		 */
>> +		dec_slabs_node(s, slab_nid(slab_free), slab_free->objects);
>> +	}
> 
> This looks good but maybe kmem_cache_shrink() can lead to
> spurious validation warnings?

Good catch, I'll fix that too.

> 
> Otherwise looks good to me!
> 

Thanks!


