* [RFC PATCH] mm, slub: restrict sysfs validation to debug caches and make it safe
From: Vlastimil Babka @ 2022-07-27 18:29 UTC
  To: Rongwei Wang, Christoph Lameter, Joonsoo Kim, David Rientjes,
	Pekka Enberg, Hyeonggon Yoo
  Cc: linux-mm, Feng Tang, Vlastimil Babka

Rongwei Wang reports [1] that cache validation triggered by writing to
/sys/kernel/slab/<cache>/validate is racy against normal cache
operations (e.g. freeing) in a way that can cause false positive
inconsistency reports for caches with debugging enabled. The problem is
that debugging actions that mark object free or active and actual
freelist operations are not atomic, and the validation can see an
inconsistent state.

For caches that do or don't have debugging enabled, additional races
regarding n->nr_slabs are possible that result in false reports of wrong
slab counts.

This patch attempts to solve these issues while not adding overhead to
normal (especially fastpath) operations for caches that do not have
debugging enabled, just to make possible userspace-triggered validation
safe. Instead, disable the validation for caches that don't have
debugging enabled and make the sysfs handler return -EINVAL.

For caches that do have debugging enabled, we can instead extend the
existing approach of not using percpu freelists to force all operations
to the slow paths where debugging is checked for and processed.

The processing on free in free_debug_processing() already happens under
n->list_lock and slab_lock() so we can extend it to actually do the
freeing as well and thus make it atomic against concurrent validation.

The processing on alloc in alloc_debug_processing() currently doesn't
take any locks, but we have to first allocate the object from a slab on
the partial list (as percpu slabs are always non-existent) and thus take
the n->list_lock anyway. Add a function alloc_single_from_partial() that
additionally takes slab_lock() for the debug processing and then grabs
just the allocated object instead of the whole freelist. This again
makes it atomic against validation and it is also ultimately more
efficient than the current grabbing of freelist immediately followed by
slab deactivation.

To prevent races on n->nr_slabs, make sure that for caches with
debugging enabled, inc_slabs_node() or dec_slabs_node() is called under
n->list_lock. When allocating a new slab for a debug cache, handle the
allocation by a new function alloc_single_from_new_slab() instead of the
current forced deactivation path.

Neither of these changes affect the fast paths.

The function free_debug_processing() is moved so that it is placed
later than the definitions of add_partial(), remove_partial() and
discard_slab(), to avoid a need for forward declarations.

[1] https://lore.kernel.org/all/20220529081535.69275-1-rongwei.wang@linux.alibaba.com/

Reported-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
Hi, this extends the pre-RFC from [1] to also cover the racy n->nr_slabs
updates, and hopefully thus addresses everything that Rongwei's series did;
testing should confirm that.
Thanks, Vlastimil

[1] https://lore.kernel.org/all/69462916-2d1c-dd50-2e64-b31c2b61690e@suse.cz/

 mm/slub.c | 322 +++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 231 insertions(+), 91 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index b1281b8654bd..01e5228809d7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1324,17 +1324,14 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
 }
 
 static noinline int alloc_debug_processing(struct kmem_cache *s,
-					struct slab *slab,
-					void *object, unsigned long addr)
+					struct slab *slab, void *object)
 {
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
 		if (!alloc_consistency_checks(s, slab, object))
 			goto bad;
 	}
 
-	/* Success perform special debug activities for allocs */
-	if (s->flags & SLAB_STORE_USER)
-		set_track(s, object, TRACK_ALLOC, addr);
+	/* Success. Perform special debug activities for allocs */
 	trace(s, slab, object, 1);
 	init_object(s, object, SLUB_RED_ACTIVE);
 	return 1;
@@ -1385,63 +1382,6 @@ static inline int free_consistency_checks(struct kmem_cache *s,
 	return 1;
 }
 
-/* Supports checking bulk free of a constructed freelist */
-static noinline int free_debug_processing(
-	struct kmem_cache *s, struct slab *slab,
-	void *head, void *tail, int bulk_cnt,
-	unsigned long addr)
-{
-	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
-	void *object = head;
-	int cnt = 0;
-	unsigned long flags, flags2;
-	int ret = 0;
-	depot_stack_handle_t handle = 0;
-
-	if (s->flags & SLAB_STORE_USER)
-		handle = set_track_prepare();
-
-	spin_lock_irqsave(&n->list_lock, flags);
-	slab_lock(slab, &flags2);
-
-	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
-		if (!check_slab(s, slab))
-			goto out;
-	}
-
-next_object:
-	cnt++;
-
-	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
-		if (!free_consistency_checks(s, slab, object, addr))
-			goto out;
-	}
-
-	if (s->flags & SLAB_STORE_USER)
-		set_track_update(s, object, TRACK_FREE, addr, handle);
-	trace(s, slab, object, 0);
-	/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
-	init_object(s, object, SLUB_RED_INACTIVE);
-
-	/* Reached end of constructed freelist yet? */
-	if (object != tail) {
-		object = get_freepointer(s, object);
-		goto next_object;
-	}
-	ret = 1;
-
-out:
-	if (cnt != bulk_cnt)
-		slab_err(s, slab, "Bulk freelist count(%d) invalid(%d)\n",
-			 bulk_cnt, cnt);
-
-	slab_unlock(slab, &flags2);
-	spin_unlock_irqrestore(&n->list_lock, flags);
-	if (!ret)
-		slab_fix(s, "Object at 0x%p not freed", object);
-	return ret;
-}
-
 /*
  * Parse a block of slub_debug options. Blocks are delimited by ';'
  *
@@ -1661,7 +1601,7 @@ static inline
 void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}
 
 static inline int alloc_debug_processing(struct kmem_cache *s,
-	struct slab *slab, void *object, unsigned long addr) { return 0; }
+	struct slab *slab, void *object) { return 0; }
 
 static inline int free_debug_processing(
 	struct kmem_cache *s, struct slab *slab,
@@ -1671,6 +1611,8 @@ static inline int free_debug_processing(
 static inline void slab_pad_check(struct kmem_cache *s, struct slab *slab) {}
 static inline int check_object(struct kmem_cache *s, struct slab *slab,
 			void *object, u8 val) { return 1; }
+static inline void set_track(struct kmem_cache *s, void *object,
+			     enum track_item alloc, unsigned long addr) {}
 static inline void add_full(struct kmem_cache *s, struct kmem_cache_node *n,
 					struct slab *slab) {}
 static inline void remove_full(struct kmem_cache *s, struct kmem_cache_node *n,
@@ -1976,7 +1918,7 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 		 */
 		slab = alloc_slab_page(alloc_gfp, node, oo);
 		if (unlikely(!slab))
-			goto out;
+			return NULL;
 		stat(s, ORDER_FALLBACK);
 	}
 
@@ -2010,12 +1952,6 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	slab->inuse = slab->objects;
 	slab->frozen = 1;
 
-out:
-	if (!slab)
-		return NULL;
-
-	inc_slabs_node(s, slab_nid(slab), slab->objects);
-
 	return slab;
 }
 
@@ -2102,6 +2038,87 @@ static inline void remove_partial(struct kmem_cache_node *n,
 	n->nr_partial--;
 }
 
+/*
+ * Called only for kmem_cache_debug() caches instead of acquire_slab(), with a
+ * slab from the n->partial list. Removes only a single object from the slab
+ * under slab_lock(), does the alloc_debug_processing() checks and leaves the
+ * slab on the list, or moves it to full list if it was the last object.
+ */
+static void *alloc_single_from_partial(struct kmem_cache *s,
+		struct kmem_cache_node *n, struct slab *slab)
+{
+	void *object;
+	unsigned long flags;
+
+	lockdep_assert_held(&n->list_lock);
+
+	slab_lock(slab, &flags);
+
+	object = slab->freelist;
+	slab->freelist = get_freepointer(s, object);
+	slab->inuse++;
+
+	if (!alloc_debug_processing(s, slab, object)) {
+		remove_partial(n, slab);
+		slab_unlock(slab, &flags);
+		return NULL;
+	}
+
+	if (slab->inuse == slab->objects) {
+		remove_partial(n, slab);
+		add_full(s, n, slab);
+	}
+
+	slab_unlock(slab, &flags);
+
+	return object;
+}
+
+/*
+ * Called only for kmem_cache_debug() caches to allocate from a freshly
+ * allocated slab. Allocates a single object instead of whole freelist
+ * and puts the slab to the partial (or full) list.
+ */
+static void *alloc_single_from_new_slab(struct kmem_cache *s,
+					struct slab *slab)
+{
+	int nid = slab_nid(slab);
+	struct kmem_cache_node *n = get_node(s, nid);
+	unsigned long flags, flags2;
+	void *object;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	slab_lock(slab, &flags2);
+
+	object = slab->freelist;
+	slab->freelist = get_freepointer(s, object);
+	/* Undo what allocate_slab() did */
+	slab->frozen = 0;
+	slab->inuse = 1;
+
+	if (!alloc_debug_processing(s, slab, object)) {
+		/*
+		 * It's not really expected that this would fail on a
+		 * freshly allocated slab, but a concurrent memory
+		 * corruption in theory could cause that.
+		 */
+		slab_unlock(slab, &flags2);
+		spin_unlock_irqrestore(&n->list_lock, flags);
+		return NULL;
+	}
+
+	if (slab->inuse == slab->objects)
+		add_full(s, n, slab);
+	else
+		add_partial(n, slab, DEACTIVATE_TO_HEAD);
+
+	slab_unlock(slab, &flags2);
+	inc_slabs_node(s, nid, slab->objects);
+	spin_unlock_irqrestore(&n->list_lock, flags);
+
+	return object;
+}
+
 /*
  * Remove slab from the partial list, freeze it and
  * return the pointer to the freelist.
@@ -2182,6 +2199,13 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 		if (!pfmemalloc_match(slab, gfpflags))
 			continue;
 
+		if (kmem_cache_debug(s)) {
+			object = alloc_single_from_partial(s, n, slab);
+			if (object)
+				break;
+			continue;
+		}
+
 		t = acquire_slab(s, n, slab, object == NULL);
 		if (!t)
 			break;
@@ -2788,6 +2812,111 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 {
 	return atomic_long_read(&n->total_objects);
 }
+
+/* Supports checking bulk free of a constructed freelist */
+static noinline int free_debug_processing(
+	struct kmem_cache *s, struct slab *slab,
+	void *head, void *tail, int bulk_cnt,
+	unsigned long addr)
+{
+	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
+	struct slab *slab_free = NULL;
+	void *object = head;
+	int cnt = 0;
+	unsigned long flags, flags2;
+	int ret = 0;
+	depot_stack_handle_t handle = 0;
+
+	if (s->flags & SLAB_STORE_USER)
+		handle = set_track_prepare();
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	slab_lock(slab, &flags2);
+
+	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
+		if (!check_slab(s, slab))
+			goto out;
+	}
+
+	if (slab->inuse < bulk_cnt) {
+		slab_err(s, slab, "Slab has %d allocated objects but %d are to be freed\n",
+			 slab->inuse, bulk_cnt);
+		goto out;
+	}
+
+next_object:
+
+	if (++cnt > bulk_cnt)
+		goto out_cnt;
+
+	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
+		if (!free_consistency_checks(s, slab, object, addr))
+			goto out;
+	}
+
+	if (s->flags & SLAB_STORE_USER)
+		set_track_update(s, object, TRACK_FREE, addr, handle);
+	trace(s, slab, object, 0);
+	/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
+	init_object(s, object, SLUB_RED_INACTIVE);
+
+	/* Reached end of constructed freelist yet? */
+	if (object != tail) {
+		object = get_freepointer(s, object);
+		goto next_object;
+	}
+	ret = 1;
+
+out_cnt:
+	if (cnt != bulk_cnt)
+		slab_err(s, slab, "Bulk free expected %d objects but found %d\n",
+			 bulk_cnt, cnt);
+
+out:
+	if (ret) {
+		void *prior = slab->freelist;
+
+		/* Perform the actual freeing while we still hold the locks */
+		slab->inuse -= cnt;
+		set_freepointer(s, tail, prior);
+		slab->freelist = head;
+
+		/* Do we need to remove the slab from full or partial list? */
+		if (!prior) {
+			remove_full(s, n, slab);
+		} else if (slab->inuse == 0) {
+			remove_partial(n, slab);
+			stat(s, FREE_REMOVE_PARTIAL);
+		}
+
+		/* Do we need to discard the slab or add to partial list? */
+		if (slab->inuse == 0) {
+			slab_free = slab;
+		} else if (!prior) {
+			add_partial(n, slab, DEACTIVATE_TO_TAIL);
+			stat(s, FREE_ADD_PARTIAL);
+		}
+
+	}
+
+	slab_unlock(slab, &flags2);
+	if (slab_free) {
+		/*
+		 * Update the counters while still holding n->list_lock to
+		 * prevent spurious validation warnings
+		 */
+		dec_slabs_node(s, slab_nid(slab_free), slab_free->objects);
+	}
+	spin_unlock_irqrestore(&n->list_lock, flags);
+	if (!ret)
+		slab_fix(s, "Object at 0x%p not freed", object);
+	if (slab_free) {
+		stat(s, FREE_SLAB);
+		free_slab(s, slab_free);
+	}
+
+	return ret;
+}
 #endif /* CONFIG_SLUB_DEBUG */
 
 #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
@@ -3036,6 +3165,20 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		return NULL;
 	}
 
+	stat(s, ALLOC_SLAB);
+
+	if (kmem_cache_debug(s)) {
+		freelist = alloc_single_from_new_slab(s, slab);
+
+		if (unlikely(!freelist))
+			goto new_objects;
+
+		if (s->flags & SLAB_STORE_USER)
+			set_track(s, freelist, TRACK_ALLOC, addr);
+
+		return freelist;
+	}
+
 	/*
 	 * No other reference to the slab yet so we can
 	 * muck around with it freely without cmpxchg
@@ -3043,29 +3186,29 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	freelist = slab->freelist;
 	slab->freelist = NULL;
 
-	stat(s, ALLOC_SLAB);
+	inc_slabs_node(s, slab_nid(slab), slab->objects);
 
 check_new_slab:
 
 	if (kmem_cache_debug(s)) {
-		if (!alloc_debug_processing(s, slab, freelist, addr)) {
-			/* Slab failed checks. Next slab needed */
-			goto new_slab;
-		} else {
-			/*
-			 * For debug case, we don't load freelist so that all
-			 * allocations go through alloc_debug_processing()
-			 */
-			goto return_single;
-		}
+		/*
+		 * For debug caches here we had to go through
+		 * alloc_single_from_partial() so just store the tracking info
+		 * and return the object
+		 */
+		if (s->flags & SLAB_STORE_USER)
+			set_track(s, freelist, TRACK_ALLOC, addr);
+		return freelist;
 	}
 
-	if (unlikely(!pfmemalloc_match(slab, gfpflags)))
+	if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
 		/*
 		 * For !pfmemalloc_match() case we don't load freelist so that
 		 * we don't make further mismatched allocations easier.
 		 */
-		goto return_single;
+		deactivate_slab(s, slab, get_freepointer(s, freelist));
+		return freelist;
+	}
 
 retry_load_slab:
 
@@ -3089,11 +3232,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	c->slab = slab;
 
 	goto load_freelist;
-
-return_single:
-
-	deactivate_slab(s, slab, get_freepointer(s, freelist));
-	return freelist;
 }
 
 /*
@@ -3341,9 +3479,10 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 	if (kfence_free(head))
 		return;
 
-	if (kmem_cache_debug(s) &&
-	    !free_debug_processing(s, slab, head, tail, cnt, addr))
+	if (kmem_cache_debug(s)) {
+		free_debug_processing(s, slab, head, tail, cnt, addr);
 		return;
+	}
 
 	do {
 		if (unlikely(n)) {
@@ -3958,6 +4097,7 @@ static void early_kmem_cache_node_alloc(int node)
 	slab = new_slab(kmem_cache_node, GFP_NOWAIT, node);
 
 	BUG_ON(!slab);
+	inc_slabs_node(kmem_cache_node, slab_nid(slab), slab->objects);
 	if (slab_nid(slab) != node) {
 		pr_err("SLUB: Unable to allocate memory from node %d\n", node);
 		pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
@@ -5625,7 +5765,7 @@ static ssize_t validate_store(struct kmem_cache *s,
 {
 	int ret = -EINVAL;
 
-	if (buf[0] == '1') {
+	if (buf[0] == '1' && kmem_cache_debug(s)) {
 		ret = validate_slab_cache(s);
 		if (ret >= 0)
 			ret = length;
-- 
2.37.1




* Re: [RFC PATCH] mm, slub: restrict sysfs validation to debug caches and make it safe
From: Hyeonggon Yoo @ 2022-07-31  8:44 UTC
  To: Vlastimil Babka
  Cc: Rongwei Wang, Christoph Lameter, Joonsoo Kim, David Rientjes,
	Pekka Enberg, linux-mm, Feng Tang

On Wed, Jul 27, 2022 at 08:29:09PM +0200, Vlastimil Babka wrote:
> Rongwei Wang reports [1] that cache validation triggered by writing to
> /sys/kernel/slab/<cache>/validate is racy against normal cache
> operations (e.g. freeing) in a way that can cause false positive
> inconsistency reports for caches with debugging enabled. The problem is
> that debugging actions that mark object free or active and actual
> freelist operations are not atomic, and the validation can see an
> inconsistent state.
> 
> For caches that do or don't have debugging enabled, additional races
> regarding n->nr_slabs are possible that result in false reports of wrong
> slab counts.
> 
> This patch attempts to solve these issues while not adding overhead to
> normal (especially fastpath) operations for caches that do not have
> debugging enabled, just to make possible userspace-triggered validation
> safe. Instead, disable the validation for caches that don't have
> debugging enabled and make the sysfs handler return -EINVAL.
> 
> For caches that do have debugging enabled, we can instead extend the
> existing approach of not using percpu freelists to force all operations
> to the slow paths where debugging is checked for and processed.
> 
> The processing on free in free_debug_processing() already happens under
> n->list_lock and slab_lock() so we can extend it to actually do the
> freeing as well and thus make it atomic against concurrent validation.
> 
> The processing on alloc in alloc_debug_processing() currently doesn't
> take any locks, but we have to first allocate the object from a slab on
> the partial list (as percpu slabs are always non-existent) and thus take
> the n->list_lock anyway. Add a function alloc_single_from_partial() that
> additionally takes slab_lock() for the debug processing and then grabs
> just the allocated object instead of the whole freelist. This again
> makes it atomic against validation and it is also ultimately more
> efficient than the current grabbing of freelist immediately followed by
> slab deactivation.
> 
> To prevent races on n->nr_slabs, make sure that for caches with
> debugging enabled, inc_slabs_node() or dec_slabs_node() is called under
> n->list_lock. When allocating a new slab for a debug cache, handle the
> allocation by a new function alloc_single_from_new_slab() instead of the
> current forced deactivation path.
> 
> Neither of these changes affect the fast paths.
> 
> The function free_debug_processing() is moved so that it is placed
> later than the definitions of add_partial(), remove_partial() and
> discard_slab(), to avoid a need for forward declarations.
> 
> [1] https://lore.kernel.org/all/20220529081535.69275-1-rongwei.wang@linux.alibaba.com/
> 
> Reported-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> Hi, this extends the pre-RFC from [1] to also cover the racy n->nr_slabs
> updates, and hopefully thus addresses everything that Rongwei's series did;
> testing should confirm that.
> Thanks, Vlastimil
>

I don't mind whose patch gets ACKed in the end.
Maybe Rongwei will post his own patch?

Anyway, this patch overall looks good.

Also, all issues (as far as I know) related to the validate attribute
are gone after this patch.

Silly question:
	Do we want to apply on stable trees?
	I doubt someone would use validate attribute when not debugging.

> [1] https://lore.kernel.org/all/69462916-2d1c-dd50-2e64-b31c2b61690e@suse.cz/
> 
>  mm/slub.c | 322 +++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 231 insertions(+), 91 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index b1281b8654bd..01e5228809d7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1324,17 +1324,14 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
>  }

[...]

> +/*
> + * Called only for kmem_cache_debug() caches instead of acquire_slab(), with a
> + * slab from the n->partial list. Removes only a single object from the slab
> + * under slab_lock(), does the alloc_debug_processing() checks and leaves the
> + * slab on the list, or moves it to full list if it was the last object.
> + */
> +static void *alloc_single_from_partial(struct kmem_cache *s,
> +		struct kmem_cache_node *n, struct slab *slab)
> +{
> +	void *object;
> +	unsigned long flags;
> +
> +	lockdep_assert_held(&n->list_lock);
> +
> +	slab_lock(slab, &flags);
> +
> +	object = slab->freelist;
> +	slab->freelist = get_freepointer(s, object);
> +	slab->inuse++;
> +
> +	if (!alloc_debug_processing(s, slab, object)) {
> +		remove_partial(n, slab);
> +		slab_unlock(slab, &flags);
> +		return NULL;
> +	}
> +
> +	if (slab->inuse == slab->objects) {
> +		remove_partial(n, slab);
> +		add_full(s, n, slab);
> +	}
> +
> +	slab_unlock(slab, &flags);

AFAIK add_full/remove_full/add_partial/remove_partial
can be called outside slab_lock but inside list_lock.

> +	return object;
> +}
> +
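
For illustration, a minimal sketch of how the tail of alloc_single_from_partial()
could look with that adjustment, i.e. the list manipulation kept under
n->list_lock but moved outside slab_lock() (just the idea, not the actual
follow-up change):

        if (!alloc_debug_processing(s, slab, object)) {
                /* checks failed: drop slab_lock() first, then unlink the slab */
                slab_unlock(slab, &flags);
                remove_partial(n, slab);
                return NULL;
        }

        slab_unlock(slab, &flags);

        /* still under n->list_lock, but no longer under slab_lock() */
        if (slab->inuse == slab->objects) {
                remove_partial(n, slab);
                add_full(s, n, slab);
        }

        return object;
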
> +/*
> + * Called only for kmem_cache_debug() caches to allocate from a freshly
> + * allocated slab. Allocates a single object instead of whole freelist
> + * and puts the slab to the partial (or full) list.
> + */
> +static void *alloc_single_from_new_slab(struct kmem_cache *s,
> +					struct slab *slab)
> +{
> +	int nid = slab_nid(slab);
> +	struct kmem_cache_node *n = get_node(s, nid);
> +	unsigned long flags, flags2;
> +	void *object;
> +
> +	spin_lock_irqsave(&n->list_lock, flags);
> +	slab_lock(slab, &flags2);
> +
> +	object = slab->freelist;
> +	slab->freelist = get_freepointer(s, object);
> +	/* Undo what allocate_slab() did */
> +	slab->frozen = 0;
> +	slab->inuse = 1;

Maybe do it in allocate_slab()?

> +	if (!alloc_debug_processing(s, slab, object)) {
> +		/*
> +		 * It's not really expected that this would fail on a
> +		 * freshly allocated slab, but a concurrent memory
> +		 * corruption in theory could cause that.
> +		 */
> +		slab_unlock(slab, &flags2);
> +		spin_unlock_irqrestore(&n->list_lock, flags);
> +		return NULL;
> +	}
> +
> +	if (slab->inuse == slab->objects)
> +		add_full(s, n, slab);
> +	else
> +		add_partial(n, slab, DEACTIVATE_TO_HEAD);
> +
> +	slab_unlock(slab, &flags2);
> +	inc_slabs_node(s, nid, slab->objects);
> +	spin_unlock_irqrestore(&n->list_lock, flags);
> +
> +	return object;
> +}

[...]

>  #endif /* CONFIG_SLUB_DEBUG */
>  
>  #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
> @@ -3036,6 +3165,20 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  		return NULL;
>  	}
>  
> +	stat(s, ALLOC_SLAB);
> +
> +	if (kmem_cache_debug(s)) {
> +		freelist = alloc_single_from_new_slab(s, slab);
> +
> +		if (unlikely(!freelist))
> +			goto new_objects;
> +
> +		if (s->flags & SLAB_STORE_USER)
> +			set_track(s, freelist, TRACK_ALLOC, addr);
> +
> +		return freelist;
> +	}
> +
>  	/*
>  	 * No other reference to the slab yet so we can
>  	 * muck around with it freely without cmpxchg
> @@ -3043,29 +3186,29 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	freelist = slab->freelist;
>  	slab->freelist = NULL;
>  
> -	stat(s, ALLOC_SLAB);
> +	inc_slabs_node(s, slab_nid(slab), slab->objects);
>  
>  check_new_slab:
>  
>  	if (kmem_cache_debug(s)) {
> -		if (!alloc_debug_processing(s, slab, freelist, addr)) {
> -			/* Slab failed checks. Next slab needed */
> -			goto new_slab;
> -		} else {
> -			/*
> -			 * For debug case, we don't load freelist so that all
> -			 * allocations go through alloc_debug_processing()
> -			 */
> -			goto return_single;
> -		}
> +		/*
> +		 * For debug caches here we had to go through
> +		 * alloc_single_from_partial() so just store the tracking info
> +		 * and return the object
> +		 */
> +		if (s->flags & SLAB_STORE_USER)
> +			set_track(s, freelist, TRACK_ALLOC, addr);
> +		return freelist;
>  	}
>  
> -	if (unlikely(!pfmemalloc_match(slab, gfpflags)))
> +	if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
>  		/*
>  		 * For !pfmemalloc_match() case we don't load freelist so that
>  		 * we don't make further mismatched allocations easier.
>  		 */
> -		goto return_single;
> +		deactivate_slab(s, slab, get_freepointer(s, freelist));
> +		return freelist;
> +	}



>  
>  retry_load_slab:
>  
> @@ -3089,11 +3232,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	c->slab = slab;
>  
>  	goto load_freelist;
> -
> -return_single:
> -
> -	deactivate_slab(s, slab, get_freepointer(s, freelist));
> -	return freelist;
>  }
>  
>  /*
> @@ -3341,9 +3479,10 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>  	if (kfence_free(head))
>  		return;
>  
> -	if (kmem_cache_debug(s) &&
> -	    !free_debug_processing(s, slab, head, tail, cnt, addr))
> +	if (kmem_cache_debug(s)) {
> +		free_debug_processing(s, slab, head, tail, cnt, addr);
>  		return;
> +	}

Oh, now debugging caches does not share free path with non-debugging
caches.

Now free_debug_processing's return type can be void?
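
Presumably something like the sketch below. The local 'ret' would stay to
decide whether to perform the actual free; only the return value (and the
final "return ret;") goes away, since __slab_free() now ignores it:

-static noinline int free_debug_processing(
+static noinline void free_debug_processing(
 	struct kmem_cache *s, struct slab *slab,
 	void *head, void *tail, int bulk_cnt,
 	unsigned long addr)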

>  
>  	do {
>  		if (unlikely(n)) {
> @@ -3958,6 +4097,7 @@ static void early_kmem_cache_node_alloc(int node)
>  	slab = new_slab(kmem_cache_node, GFP_NOWAIT, node);
>  
>  	BUG_ON(!slab);
> +	inc_slabs_node(kmem_cache_node, slab_nid(slab), slab->objects);
>  	if (slab_nid(slab) != node) {
>  		pr_err("SLUB: Unable to allocate memory from node %d\n", node);
>  		pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
> @@ -5625,7 +5765,7 @@ static ssize_t validate_store(struct kmem_cache *s,
>  {
>  	int ret = -EINVAL;
>  
> -	if (buf[0] == '1') {
> +	if (buf[0] == '1' && kmem_cache_debug(s)) {
>  		ret = validate_slab_cache(s);
>  		if (ret >= 0)
>  			ret = length;

Yeah definitely this is what it should be,
instead of serializing inc_slabs_node()/dec_slabs_node()
for non-debugging caches.

> -- 
> 2.37.1
> 

-- 
Thanks,
Hyeonggon



* Re: [RFC PATCH] mm, slub: restrict sysfs validation to debug caches and make it safe
From: Rongwei Wang @ 2022-07-31 10:07 UTC
  To: Hyeonggon Yoo, Vlastimil Babka
  Cc: Christoph Lameter, Joonsoo Kim, David Rientjes, Pekka Enberg,
	linux-mm, Feng Tang



On 7/31/22 4:44 PM, Hyeonggon Yoo wrote:
> On Wed, Jul 27, 2022 at 08:29:09PM +0200, Vlastimil Babka wrote:
>> Rongwei Wang reports [1] that cache validation triggered by writing to
>> /sys/kernel/slab/<cache>/validate is racy against normal cache
>> operations (e.g. freeing) in a way that can cause false positive
>> inconsistency reports for caches with debugging enabled. The problem is
>> that debugging actions that mark object free or active and actual
>> freelist operations are not atomic, and the validation can see an
>> inconsistent state.
>>
>> For caches that do or don't have debugging enabled, additional races
>> regarding n->nr_slabs are possible that result in false reports of wrong
>> slab counts.
>>
>> This patch attempts to solve these issues while not adding overhead to
>> normal (especially fastpath) operations for caches that do not have
>> debugging enabled, just to make possible userspace-triggered validation
>> safe. Instead, disable the validation for caches that don't have
>> debugging enabled and make the sysfs handler return -EINVAL.
>>
>> For caches that do have debugging enabled, we can instead extend the
>> existing approach of not using percpu freelists to force all operations
>> to the slow paths where debugging is checked for and processed.
>>
>> The processing on free in free_debug_processing() already happens under
>> n->list_lock and slab_lock() so we can extend it to actually do the
>> freeing as well and thus make it atomic against concurrent validation.
>>
>> The processing on alloc in alloc_debug_processing() currently doesn't
>> take any locks, but we have to first allocate the object from a slab on
>> the partial list (as percpu slabs are always non-existent) and thus take
>> the n->list_lock anyway. Add a function alloc_single_from_partial() that
>> additionally takes slab_lock() for the debug processing and then grabs
>> just the allocated object instead of the whole freelist. This again
>> makes it atomic against validation and it is also ultimately more
>> efficient than the current grabbing of freelist immediately followed by
>> slab deactivation.
>>
>> To prevent races on n->nr_slabs, make sure that for caches with
>> debugging enabled, inc_slabs_node() or dec_slabs_node() is called under
>> n->list_lock. When allocating a new slab for a debug cache, handle the
>> allocation by a new function alloc_single_from_new_slab() instead of the
>> current forced deactivation path.
>>
>> Neither of these changes affect the fast paths.
>>
>> The function free_debug_processing() is moved so that it is placed
>> later than the definitions of add_partial(), remove_partial() and
>> discard_slab(), to avoid a need for forward declarations.
>>
>> [1] https://lore.kernel.org/all/20220529081535.69275-1-rongwei.wang@linux.alibaba.com/
>>
>> Reported-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>> Hi, this extends the pre-RFC from [1] to also cover the racy n->nr_slabs
>> updates, and hopefully thus addresses everything that Rongwei's series did;
>> testing should confirm that.
>> Thanks, Vlastimil
>>
> 
> I don't mind whose patch gets ACKed in the end.
> Maybe Rongwei will post his own patch?
Hi Hyeonggon

Vlastimil's method is better than mine, and it looks good to me. I will 
not continue posting my own patch.

-wrw
> 
> Anyway, this patch overall looks good.
> 
> Also, all issues (as far as I know) related to the validate attribute
> are gone after this patch.
> 
> Silly question:
> 	Do we want to apply on stable trees?
> 	I doubt someone would use validate attribute when not debugging.
> 
>> [1] https://lore.kernel.org/all/69462916-2d1c-dd50-2e64-b31c2b61690e@suse.cz/
>>
>>   mm/slub.c | 322 +++++++++++++++++++++++++++++++++++++++---------------
>>   1 file changed, 231 insertions(+), 91 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index b1281b8654bd..01e5228809d7 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1324,17 +1324,14 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
>>   }
> 
> [...]
> 
>> +/*
>> + * Called only for kmem_cache_debug() caches instead of acquire_slab(), with a
>> + * slab from the n->partial list. Removes only a single object from the slab
>> + * under slab_lock(), does the alloc_debug_processing() checks and leaves the
>> + * slab on the list, or moves it to full list if it was the last object.
>> + */
>> +static void *alloc_single_from_partial(struct kmem_cache *s,
>> +		struct kmem_cache_node *n, struct slab *slab)
>> +{
>> +	void *object;
>> +	unsigned long flags;
>> +
>> +	lockdep_assert_held(&n->list_lock);
>> +
>> +	slab_lock(slab, &flags);
>> +
>> +	object = slab->freelist;
>> +	slab->freelist = get_freepointer(s, object);
>> +	slab->inuse++;
>> +
>> +	if (!alloc_debug_processing(s, slab, object)) {
>> +		remove_partial(n, slab);
>> +		slab_unlock(slab, &flags);
>> +		return NULL;
>> +	}
>> +
>> +	if (slab->inuse == slab->objects) {
>> +		remove_partial(n, slab);
>> +		add_full(s, n, slab);
>> +	}
>> +
>> +	slab_unlock(slab, &flags);
> 
> AFAIK add_full/remove_full/add_partial/remove_partial
> can be called outside slab_lock but inside list_lock.
> 
>> +	return object;
>> +}
>> +
>> +/*
>> + * Called only for kmem_cache_debug() caches to allocate from a freshly
>> + * allocated slab. Allocates a single object instead of whole freelist
>> + * and puts the slab to the partial (or full) list.
>> + */
>> +static void *alloc_single_from_new_slab(struct kmem_cache *s,
>> +					struct slab *slab)
>> +{
>> +	int nid = slab_nid(slab);
>> +	struct kmem_cache_node *n = get_node(s, nid);
>> +	unsigned long flags, flags2;
>> +	void *object;
>> +
>> +	spin_lock_irqsave(&n->list_lock, flags);
>> +	slab_lock(slab, &flags2);
>> +
>> +	object = slab->freelist;
>> +	slab->freelist = get_freepointer(s, object);
>> +	/* Undo what allocate_slab() did */
>> +	slab->frozen = 0;
>> +	slab->inuse = 1;
> 
> Maybe do it in allocate_slab()?
> 
>> +	if (!alloc_debug_processing(s, slab, object)) {
>> +		/*
>> +		 * It's not really expected that this would fail on a
>> +		 * freshly allocated slab, but a concurrent memory
>> +		 * corruption in theory could cause that.
>> +		 */
>> +		slab_unlock(slab, &flags2);
>> +		spin_unlock_irqrestore(&n->list_lock, flags);
>> +		return NULL;
>> +	}
>> +
>> +	if (slab->inuse == slab->objects)
>> +		add_full(s, n, slab);
>> +	else
>> +		add_partial(n, slab, DEACTIVATE_TO_HEAD);
>> +
>> +	slab_unlock(slab, &flags2);
>> +	inc_slabs_node(s, nid, slab->objects);
>> +	spin_unlock_irqrestore(&n->list_lock, flags);
>> +
>> +	return object;
>> +}
> 
> [...]
> 
>>   #endif /* CONFIG_SLUB_DEBUG */
>>   
>>   #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
>> @@ -3036,6 +3165,20 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>   		return NULL;
>>   	}
>>   
>> +	stat(s, ALLOC_SLAB);
>> +
>> +	if (kmem_cache_debug(s)) {
>> +		freelist = alloc_single_from_new_slab(s, slab);
>> +
>> +		if (unlikely(!freelist))
>> +			goto new_objects;
>> +
>> +		if (s->flags & SLAB_STORE_USER)
>> +			set_track(s, freelist, TRACK_ALLOC, addr);
>> +
>> +		return freelist;
>> +	}
>> +
>>   	/*
>>   	 * No other reference to the slab yet so we can
>>   	 * muck around with it freely without cmpxchg
>> @@ -3043,29 +3186,29 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>   	freelist = slab->freelist;
>>   	slab->freelist = NULL;
>>   
>> -	stat(s, ALLOC_SLAB);
>> +	inc_slabs_node(s, slab_nid(slab), slab->objects);
>>   
>>   check_new_slab:
>>   
>>   	if (kmem_cache_debug(s)) {
>> -		if (!alloc_debug_processing(s, slab, freelist, addr)) {
>> -			/* Slab failed checks. Next slab needed */
>> -			goto new_slab;
>> -		} else {
>> -			/*
>> -			 * For debug case, we don't load freelist so that all
>> -			 * allocations go through alloc_debug_processing()
>> -			 */
>> -			goto return_single;
>> -		}
>> +		/*
>> +		 * For debug caches here we had to go through
>> +		 * alloc_single_from_partial() so just store the tracking info
>> +		 * and return the object
>> +		 */
>> +		if (s->flags & SLAB_STORE_USER)
>> +			set_track(s, freelist, TRACK_ALLOC, addr);
>> +		return freelist;
>>   	}
>>   
>> -	if (unlikely(!pfmemalloc_match(slab, gfpflags)))
>> +	if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
>>   		/*
>>   		 * For !pfmemalloc_match() case we don't load freelist so that
>>   		 * we don't make further mismatched allocations easier.
>>   		 */
>> -		goto return_single;
>> +		deactivate_slab(s, slab, get_freepointer(s, freelist));
>> +		return freelist;
>> +	}
> 
> 
> 
>>   
>>   retry_load_slab:
>>   
>> @@ -3089,11 +3232,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>   	c->slab = slab;
>>   
>>   	goto load_freelist;
>> -
>> -return_single:
>> -
>> -	deactivate_slab(s, slab, get_freepointer(s, freelist));
>> -	return freelist;
>>   }
>>   
>>   /*
>> @@ -3341,9 +3479,10 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>>   	if (kfence_free(head))
>>   		return;
>>   
>> -	if (kmem_cache_debug(s) &&
>> -	    !free_debug_processing(s, slab, head, tail, cnt, addr))
>> +	if (kmem_cache_debug(s)) {
>> +		free_debug_processing(s, slab, head, tail, cnt, addr);
>>   		return;
>> +	}
> 
> Oh, now debugging caches does not share free path with non-debugging
> caches.
> 
> Now free_debug_processing's return type can be void?
> 
>>   
>>   	do {
>>   		if (unlikely(n)) {
>> @@ -3958,6 +4097,7 @@ static void early_kmem_cache_node_alloc(int node)
>>   	slab = new_slab(kmem_cache_node, GFP_NOWAIT, node);
>>   
>>   	BUG_ON(!slab);
>> +	inc_slabs_node(kmem_cache_node, slab_nid(slab), slab->objects);
>>   	if (slab_nid(slab) != node) {
>>   		pr_err("SLUB: Unable to allocate memory from node %d\n", node);
>>   		pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
>> @@ -5625,7 +5765,7 @@ static ssize_t validate_store(struct kmem_cache *s,
>>   {
>>   	int ret = -EINVAL;
>>   
>> -	if (buf[0] == '1') {
>> +	if (buf[0] == '1' && kmem_cache_debug(s)) {
>>   		ret = validate_slab_cache(s);
>>   		if (ret >= 0)
>>   			ret = length;
> 
> Yeah definitely this is what it should be,
> instead of serializing inc_slabs_node()/dec_slabs_node()
> for non-debugging caches.
> 
>> -- 
>> 2.37.1
>>
> 



* Re: [RFC PATCH] mm, slub: restrict sysfs validation to debug caches and make it safe
From: Vlastimil Babka @ 2022-08-01 13:51 UTC
  To: Hyeonggon Yoo
  Cc: Rongwei Wang, Christoph Lameter, Joonsoo Kim, David Rientjes,
	Pekka Enberg, linux-mm, Feng Tang

On 7/31/22 10:44, Hyeonggon Yoo wrote:
> On Wed, Jul 27, 2022 at 08:29:09PM +0200, Vlastimil Babka wrote:
> Also, all issues (as far as I know) related to the validate attribute
> are gone after this patch.

As you (and Rongwei) were able to trigger/reproduce the issues, does your
testing also no longer reproduce them?

> Silly question:
> 	Do we want to apply on stable trees?

I'd prefer not to, it's too intrusive for stable.

> 	I doubt someone would use validate attribute when not debugging.

I doubt as well. Also it requires root, and even if somebody hits the issue,
it's just spurious warnings, nothing fatal. So that doesn't warrant the
intrusive stable backport IMHO.

>> [1] https://lore.kernel.org/all/69462916-2d1c-dd50-2e64-b31c2b61690e@suse.cz/
>> 
>>  mm/slub.c | 322 +++++++++++++++++++++++++++++++++++++++---------------
>>  1 file changed, 231 insertions(+), 91 deletions(-)
>> 
>> diff --git a/mm/slub.c b/mm/slub.c
>> index b1281b8654bd..01e5228809d7 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1324,17 +1324,14 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
>>  }
> 
> [...]
> 
>> +/*
>> + * Called only for kmem_cache_debug() caches instead of acquire_slab(), with a
>> + * slab from the n->partial list. Removes only a single object from the slab
>> + * under slab_lock(), does the alloc_debug_processing() checks and leaves the
>> + * slab on the list, or moves it to full list if it was the last object.
>> + */
>> +static void *alloc_single_from_partial(struct kmem_cache *s,
>> +		struct kmem_cache_node *n, struct slab *slab)
>> +{
>> +	void *object;
>> +	unsigned long flags;
>> +
>> +	lockdep_assert_held(&n->list_lock);
>> +
>> +	slab_lock(slab, &flags);
>> +
>> +	object = slab->freelist;
>> +	slab->freelist = get_freepointer(s, object);
>> +	slab->inuse++;
>> +
>> +	if (!alloc_debug_processing(s, slab, object)) {
>> +		remove_partial(n, slab);
>> +		slab_unlock(slab, &flags);
>> +		return NULL;
>> +	}
>> +
>> +	if (slab->inuse == slab->objects) {
>> +		remove_partial(n, slab);
>> +		add_full(s, n, slab);
>> +	}
>> +
>> +	slab_unlock(slab, &flags);
> 
> AFAIK add_full/remove_full/add_partial/remove_partial
> can be called outside slab_lock but inside list_lock.

Right, I will adjust, thanks.

>> +	return object;
>> +}
>> +
>> +/*
>> + * Called only for kmem_cache_debug() caches to allocate from a freshly
>> + * allocated slab. Allocates a single object instead of whole freelist
>> + * and puts the slab to the partial (or full) list.
>> + */
>> +static void *alloc_single_from_new_slab(struct kmem_cache *s,
>> +					struct slab *slab)
>> +{
>> +	int nid = slab_nid(slab);
>> +	struct kmem_cache_node *n = get_node(s, nid);
>> +	unsigned long flags, flags2;
>> +	void *object;
>> +
>> +	spin_lock_irqsave(&n->list_lock, flags);
>> +	slab_lock(slab, &flags2);
>> +
>> +	object = slab->freelist;
>> +	slab->freelist = get_freepointer(s, object);
>> +	/* Undo what allocate_slab() did */
>> +	slab->frozen = 0;
>> +	slab->inuse = 1;
> 
> Maybe do it in allocate_slab()?

Hmm yeah, I guess we could stop doing that pre-freezing and inuse = objects
in allocate_slab(), and do it in __slab_alloc(), which thus won't add any
overhead. Then we won't have to unfreeze in early_kmem_cache_node_alloc() as
well.
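
Something along these lines, presumably (an untested sketch of the idea,
reusing the context lines visible in this patch):

@@ allocate_slab() @@
-	slab->inuse = slab->objects;
-	slab->frozen = 1;

@@ ___slab_alloc(), the non-debug path that takes the whole freelist @@
 	freelist = slab->freelist;
 	slab->freelist = NULL;
+	slab->inuse = slab->objects;
+	slab->frozen = 1;
 
 	inc_slabs_node(s, slab_nid(slab), slab->objects);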

>> +	if (!alloc_debug_processing(s, slab, object)) {
>> +		/*
>> +		 * It's not really expected that this would fail on a
>> +		 * freshly allocated slab, but a concurrent memory
>> +		 * corruption in theory could cause that.
>> +		 */
>> +		slab_unlock(slab, &flags2);
>> +		spin_unlock_irqrestore(&n->list_lock, flags);
>> +		return NULL;
>> +	}
>> +
>> +	if (slab->inuse == slab->objects)
>> +		add_full(s, n, slab);
>> +	else
>> +		add_partial(n, slab, DEACTIVATE_TO_HEAD);
>> +
>> +	slab_unlock(slab, &flags2);
>> +	inc_slabs_node(s, nid, slab->objects);
>> +	spin_unlock_irqrestore(&n->list_lock, flags);
>> +
>> +	return object;
>> +}
> 
> [...]
> 
>>  #endif /* CONFIG_SLUB_DEBUG */
>>  
>>  #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
>> @@ -3036,6 +3165,20 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>  		return NULL;
>>  	}
>>  
>> +	stat(s, ALLOC_SLAB);
>> +
>> +	if (kmem_cache_debug(s)) {
>> +		freelist = alloc_single_from_new_slab(s, slab);
>> +
>> +		if (unlikely(!freelist))
>> +			goto new_objects;
>> +
>> +		if (s->flags & SLAB_STORE_USER)
>> +			set_track(s, freelist, TRACK_ALLOC, addr);
>> +
>> +		return freelist;
>> +	}
>> +
>>  	/*
>>  	 * No other reference to the slab yet so we can
>>  	 * muck around with it freely without cmpxchg
>> @@ -3043,29 +3186,29 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>  	freelist = slab->freelist;
>>  	slab->freelist = NULL;
>>  
>> -	stat(s, ALLOC_SLAB);
>> +	inc_slabs_node(s, slab_nid(slab), slab->objects);
>>  
>>  check_new_slab:
>>  
>>  	if (kmem_cache_debug(s)) {
>> -		if (!alloc_debug_processing(s, slab, freelist, addr)) {
>> -			/* Slab failed checks. Next slab needed */
>> -			goto new_slab;
>> -		} else {
>> -			/*
>> -			 * For debug case, we don't load freelist so that all
>> -			 * allocations go through alloc_debug_processing()
>> -			 */
>> -			goto return_single;
>> -		}
>> +		/*
>> +		 * For debug caches here we had to go through
>> +		 * alloc_single_from_partial() so just store the tracking info
>> +		 * and return the object
>> +		 */
>> +		if (s->flags & SLAB_STORE_USER)
>> +			set_track(s, freelist, TRACK_ALLOC, addr);
>> +		return freelist;
>>  	}
>>  
>> -	if (unlikely(!pfmemalloc_match(slab, gfpflags)))
>> +	if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
>>  		/*
>>  		 * For !pfmemalloc_match() case we don't load freelist so that
>>  		 * we don't make further mismatched allocations easier.
>>  		 */
>> -		goto return_single;
>> +		deactivate_slab(s, slab, get_freepointer(s, freelist));
>> +		return freelist;
>> +	}
> 
> 
> 
>>  
>>  retry_load_slab:
>>  
>> @@ -3089,11 +3232,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>  	c->slab = slab;
>>  
>>  	goto load_freelist;
>> -
>> -return_single:
>> -
>> -	deactivate_slab(s, slab, get_freepointer(s, freelist));
>> -	return freelist;
>>  }
>>  
>>  /*
>> @@ -3341,9 +3479,10 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>>  	if (kfence_free(head))
>>  		return;
>>  
>> -	if (kmem_cache_debug(s) &&
>> -	    !free_debug_processing(s, slab, head, tail, cnt, addr))
>> +	if (kmem_cache_debug(s)) {
>> +		free_debug_processing(s, slab, head, tail, cnt, addr);
>>  		return;
>> +	}
> 
> Oh, now debugging caches does not share free path with non-debugging
> caches.
> 
> Now free_debug_processing's return type can be void?

Right.

>>  
>>  	do {
>>  		if (unlikely(n)) {
>> @@ -3958,6 +4097,7 @@ static void early_kmem_cache_node_alloc(int node)
>>  	slab = new_slab(kmem_cache_node, GFP_NOWAIT, node);
>>  
>>  	BUG_ON(!slab);
>> +	inc_slabs_node(kmem_cache_node, slab_nid(slab), slab->objects);
>>  	if (slab_nid(slab) != node) {
>>  		pr_err("SLUB: Unable to allocate memory from node %d\n", node);
>>  		pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
>> @@ -5625,7 +5765,7 @@ static ssize_t validate_store(struct kmem_cache *s,
>>  {
>>  	int ret = -EINVAL;
>>  
>> -	if (buf[0] == '1') {
>> +	if (buf[0] == '1' && kmem_cache_debug(s)) {
>>  		ret = validate_slab_cache(s);
>>  		if (ret >= 0)
>>  			ret = length;
> 
> Yeah definitely this is what it should be,
> instead of serializing inc_slabs_node()/dec_slabs_node()
> for non-debugging caches.
> 
>> -- 
>> 2.37.1
>> 
> 




* Re: [RFC PATCH] mm, slub: restrict sysfs validation to debug caches and make it safe
From: Hyeonggon Yoo @ 2022-08-02  2:47 UTC
  To: Vlastimil Babka
  Cc: Rongwei Wang, Christoph Lameter, Joonsoo Kim, David Rientjes,
	Pekka Enberg, linux-mm, Feng Tang

On Mon, Aug 01, 2022 at 03:51:49PM +0200, Vlastimil Babka wrote:
> On 7/31/22 10:44, Hyeonggon Yoo wrote:
> > On Wed, Jul 27, 2022 at 08:29:09PM +0200, Vlastimil Babka wrote:
> > Also, all issues (as far as I know) related to the validate attribute
> > are gone after this patch.
> 
> As you (and Rongwei) were able to trigger/reproduce the issues, does your
> testing also no longer reproduce them?

Yes. I was no longer able to reproduce them.
And it would be better if we could get Rongwei's Tested-by: too.

> > Silly question:
> > 	Do we want to apply on stable trees?
> 
> I'd prefer not to, it's too intrusive for stable.
> 
> > 	I doubt someone would use validate attribute when not debugging.
> 
> I doubt as well. Also it requires root, and even if somebody hits the issue,
> it's just spurious warnings, nothing fatal. So that doesn't warrant the
> intrusive stable backport IMHO.

Agreed.

> 
> >> [1] https://lore.kernel.org/all/69462916-2d1c-dd50-2e64-b31c2b61690e@suse.cz/
> >> 
> >>  mm/slub.c | 322 +++++++++++++++++++++++++++++++++++++++---------------
> >>  1 file changed, 231 insertions(+), 91 deletions(-)
> >> 
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index b1281b8654bd..01e5228809d7 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -1324,17 +1324,14 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
> >>  }
> > 
> > [...]
> > 
> >> +/*
> >> + * Called only for kmem_cache_debug() caches instead of acquire_slab(), with a
> >> + * slab from the n->partial list. Removes only a single object from the slab
> >> + * under slab_lock(), does the alloc_debug_processing() checks and leaves the
> >> + * slab on the list, or moves it to full list if it was the last object.
> >> + */
> >> +static void *alloc_single_from_partial(struct kmem_cache *s,
> >> +		struct kmem_cache_node *n, struct slab *slab)
> >> +{
> >> +	void *object;
> >> +	unsigned long flags;
> >> +
> >> +	lockdep_assert_held(&n->list_lock);
> >> +
> >> +	slab_lock(slab, &flags);
> >> +
> >> +	object = slab->freelist;
> >> +	slab->freelist = get_freepointer(s, object);
> >> +	slab->inuse++;
> >> +
> >> +	if (!alloc_debug_processing(s, slab, object)) {
> >> +		remove_partial(n, slab);
> >> +		slab_unlock(slab, &flags);
> >> +		return NULL;
> >> +	}
> >> +
> >> +	if (slab->inuse == slab->objects) {
> >> +		remove_partial(n, slab);
> >> +		add_full(s, n, slab);
> >> +	}
> >> +
> >> +	slab_unlock(slab, &flags);
> > 
> > AFAIK add_full/remove_full/add_partial/remove_partial
> > can be called outside slab_lock but inside list_lock.
> 
> Right, I will adjust, thanks.
> 
> >> +	return object;
> >> +}
> >> +
> >> +/*
> >> + * Called only for kmem_cache_debug() caches to allocate from a freshly
> >> + * allocated slab. Allocates a single object instead of whole freelist
> >> + * and puts the slab to the partial (or full) list.
> >> + */
> >> +static void *alloc_single_from_new_slab(struct kmem_cache *s,
> >> +					struct slab *slab)
> >> +{
> >> +	int nid = slab_nid(slab);
> >> +	struct kmem_cache_node *n = get_node(s, nid);
> >> +	unsigned long flags, flags2;
> >> +	void *object;
> >> +
> >> +	spin_lock_irqsave(&n->list_lock, flags);
> >> +	slab_lock(slab, &flags2);
> >> +
> >> +	object = slab->freelist;
> >> +	slab->freelist = get_freepointer(s, object);
> >> +	/* Undo what allocate_slab() did */
> >> +	slab->frozen = 0;
> >> +	slab->inuse = 1;
> > 
> > Maybe do it in allocate_slab()?
> 
> Hmm yeah, I guess we could stop doing that pre-freezing and inuse = objects
> in allocate_slab(), and do it in __slab_alloc(), which thus won't add any
> overhead. Then we won't have to unfreeze in early_kmem_cache_node_alloc() as
> well.

Sounds good.

> 
> >> +	if (!alloc_debug_processing(s, slab, object)) {
> >> +		/*
> >> +		 * It's not really expected that this would fail on a
> >> +		 * freshly allocated slab, but a concurrent memory
> >> +		 * corruption in theory could cause that.
> >> +		 */
> >> +		slab_unlock(slab, &flags2);
> >> +		spin_unlock_irqrestore(&n->list_lock, flags);
> >> +		return NULL;
> >> +	}
> >> +
> >> +	if (slab->inuse == slab->objects)
> >> +		add_full(s, n, slab);
> >> +	else
> >> +		add_partial(n, slab, DEACTIVATE_TO_HEAD);
> >> +
> >> +	slab_unlock(slab, &flags2);
> >> +	inc_slabs_node(s, nid, slab->objects);
> >> +	spin_unlock_irqrestore(&n->list_lock, flags);
> >> +
> >> +	return object;
> >> +}
> > 
> > [...]
> > 
> >>  #endif /* CONFIG_SLUB_DEBUG */
> >>  
> >>  #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
> >> @@ -3036,6 +3165,20 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >>  		return NULL;
> >>  	}
> >>  
> >> +	stat(s, ALLOC_SLAB);
> >> +
> >> +	if (kmem_cache_debug(s)) {
> >> +		freelist = alloc_single_from_new_slab(s, slab);
> >> +
> >> +		if (unlikely(!freelist))
> >> +			goto new_objects;
> >> +
> >> +		if (s->flags & SLAB_STORE_USER)
> >> +			set_track(s, freelist, TRACK_ALLOC, addr);
> >> +
> >> +		return freelist;
> >> +	}
> >> +
> >>  	/*
> >>  	 * No other reference to the slab yet so we can
> >>  	 * muck around with it freely without cmpxchg
> >> @@ -3043,29 +3186,29 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >>  	freelist = slab->freelist;
> >>  	slab->freelist = NULL;
> >>  
> >> -	stat(s, ALLOC_SLAB);
> >> +	inc_slabs_node(s, slab_nid(slab), slab->objects);
> >>  
> >>  check_new_slab:
> >>  
> >>  	if (kmem_cache_debug(s)) {
> >> -		if (!alloc_debug_processing(s, slab, freelist, addr)) {
> >> -			/* Slab failed checks. Next slab needed */
> >> -			goto new_slab;
> >> -		} else {
> >> -			/*
> >> -			 * For debug case, we don't load freelist so that all
> >> -			 * allocations go through alloc_debug_processing()
> >> -			 */
> >> -			goto return_single;
> >> -		}
> >> +		/*
> >> +		 * For debug caches here we had to go through
> >> +		 * alloc_single_from_partial() so just store the tracking info
> >> +		 * and return the object
> >> +		 */
> >> +		if (s->flags & SLAB_STORE_USER)
> >> +			set_track(s, freelist, TRACK_ALLOC, addr);
> >> +		return freelist;
> >>  	}
> >>  
> >> -	if (unlikely(!pfmemalloc_match(slab, gfpflags)))
> >> +	if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
> >>  		/*
> >>  		 * For !pfmemalloc_match() case we don't load freelist so that
> >>  		 * we don't make further mismatched allocations easier.
> >>  		 */
> >> -		goto return_single;
> >> +		deactivate_slab(s, slab, get_freepointer(s, freelist));
> >> +		return freelist;
> >> +	}
> > 
> > 
> > 
> >>  
> >>  retry_load_slab:
> >>  
> >> @@ -3089,11 +3232,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >>  	c->slab = slab;
> >>  
> >>  	goto load_freelist;
> >> -
> >> -return_single:
> >> -
> >> -	deactivate_slab(s, slab, get_freepointer(s, freelist));
> >> -	return freelist;
> >>  }
> >>  
> >>  /*
> >> @@ -3341,9 +3479,10 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> >>  	if (kfence_free(head))
> >>  		return;
> >>  
> >> -	if (kmem_cache_debug(s) &&
> >> -	    !free_debug_processing(s, slab, head, tail, cnt, addr))
> >> +	if (kmem_cache_debug(s)) {
> >> +		free_debug_processing(s, slab, head, tail, cnt, addr);
> >>  		return;
> >> +	}
> > 
> > Oh, now debugging caches does not share free path with non-debugging
> > caches.
> > 
> > Now free_debug_processing's return type can be void?
> 
> Right.

Thanks!

> 
> >>  
> >>  	do {
> >>  		if (unlikely(n)) {
> >> @@ -3958,6 +4097,7 @@ static void early_kmem_cache_node_alloc(int node)
> >>  	slab = new_slab(kmem_cache_node, GFP_NOWAIT, node);
> >>  
> >>  	BUG_ON(!slab);
> >> +	inc_slabs_node(kmem_cache_node, slab_nid(slab), slab->objects);
> >>  	if (slab_nid(slab) != node) {
> >>  		pr_err("SLUB: Unable to allocate memory from node %d\n", node);
> >>  		pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
> >> @@ -5625,7 +5765,7 @@ static ssize_t validate_store(struct kmem_cache *s,
> >>  {
> >>  	int ret = -EINVAL;
> >>  
> >> -	if (buf[0] == '1') {
> >> +	if (buf[0] == '1' && kmem_cache_debug(s)) {
> >>  		ret = validate_slab_cache(s);
> >>  		if (ret >= 0)
> >>  			ret = length;
> > 
> > Yeah definitely this is what it should be,
> > instead of serializing inc_slabs_node()/dec_slabs_node()
> > for non-debugging caches.
> > 
> >> -- 
> >> 2.37.1
> >> 
> > 
> 

-- 
Thanks,
Hyeonggon


