* [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs
@ 2023-10-17 15:44 chengming.zhou
  2023-10-17 15:44 ` [RFC PATCH 1/5] slub: Introduce on_partial() chengming.zhou
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: chengming.zhou @ 2023-10-17 15:44 UTC (permalink / raw)
  To: cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, vbabka, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, chengming.zhou,
	Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

1. Problem
==========
Currently we have to freeze a slab when taking it from the node partial list,
and unfreeze it when putting it back on the node partial list, because we
rely on the node list_lock to synchronize the "frozen" bit changes.

This implementation has some drawbacks:

 - Alloc path: two cmpxchg_double operations (see the sketch after this list).
   When the allocator has used up its CPU partial slabs, it has to get some
   partial slabs from the node. It freezes each slab (one cmpxchg_double)
   with the node list_lock held and puts those frozen slabs on its CPU
   partial list. Later ___slab_alloc() runs another cmpxchg_double try-loop
   when one of those slabs is picked for use.

 - Alloc path: amplified contention on the node list_lock.
   Since the "frozen" bit changes have to be synchronized under the node
   list_lock, contention on the slab (struct page) is transferred to the
   node list_lock. On a machine with many CPUs in one node, the list_lock
   contention is amplified by every CPU's alloc path.

   The current code works around this by avoiding the cmpxchg_double
   try-loop: it simply breaks and returns when the page is contended and
   the first cmpxchg_double fails. But this workaround has problems of its
   own.

 - Free path: redundant unfreeze.
   __slab_free() freezes and caches some slabs on its CPU partial list, and
   flushes them to the node partial list when the limit is exceeded, which
   requires unfreezing those slabs again under the node list_lock. Slabs on
   the CPU partial list don't actually need to be frozen, and if they
   aren't, the unfreeze cmpxchg_double operations in the flush path can be
   saved.
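
For reference, a minimal sketch of the current alloc path (simplified; the
retry loops and error handling inside the real helpers are elided), showing
where the two cmpxchg_double operations come from:

	/* Simplified sketch, not the literal mm/slub.c code. */
	spin_lock_irqsave(&n->list_lock, flags);
	object = acquire_slab(s, n, slab, object == NULL); /* cmpxchg #1: freeze */
	spin_unlock_irqrestore(&n->list_lock, flags);

	/* ... later, in ___slab_alloc(), when this slab is picked to use ... */
	freelist = get_freelist(s, slab);                  /* cmpxchg #2: take freelist */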

2. Solution
===========
We solve these problems by leaving slabs unfrozen when they are moved off
the node partial list and while they sit on the CPU partial list, so the
"frozen" bit stays 0.

These partial slabs are not manipulated concurrently by the alloc path; the
only racer is the free path, which may manipulate the slab list when the
slab becomes empty (!inuse). So we need another way to synchronize against
it: we use a bit in slab->flags to indicate whether the slab is on the node
partial list, and only in that case may the slab list be manipulated.

Freezing is delayed until the slab is picked for active use by a CPU, at
which point its whole freelist is taken and it appears full, and we still
rely on the "frozen" bit to keep its list untouched. So a slab is frozen
only on activation and unfrozen only on deactivation.
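
Concretely, a minimal sketch of how the free slowpath then decides whether
it may touch slab->slab_list (this is what patches 1 and 2 implement, using
the on_partial() helper introduced there):

	spin_lock_irqsave(&n->list_lock, flags);
	on_node_partial = on_partial(n, slab); /* slab->flags bit, under list_lock */

	/* ... cmpxchg_double updating freelist/counters as before ... */

	if (!on_node_partial && prior) {
		/*
		 * Unfrozen slab sitting on some CPU partial list: its list
		 * must not be touched here, so just drop the lock and return.
		 */
		spin_unlock_irqrestore(&n->list_lock, flags);
		return;
	}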

3. Patches
==========
Patch 1 introduces the new slab->flags to indicate whether the slab is on
the node partial list; the flags are protected by the node list_lock.

Patch 2 changes the free path to check whether the slab is on the node
partial list, and to manipulate its list only in that case. This lets us
keep unfrozen partial slabs off the node partial list, since the free path
will no longer manipulate their lists concurrently.

Patch 3 optimizes the deactivate path: the slab can be unfrozen directly
(the node list_lock is no longer needed to synchronize the "frozen" bit),
and the node list_lock is grabbed only if the slab needs to be put on the
node partial list.

Patch 4 stops freezing slabs when they are moved off the node partial list
or put on the CPU partial list, so they no longer need to be unfrozen when
they are moved back from the CPU partial list to the node partial list.

Patch 5 changes the alloc path to freeze a CPU partial slab only when it is
picked for use.

4. Testing
==========
For now we have only done some simple performance testing on a server with
128 CPUs (2 nodes).

 - perf bench sched messaging -g 5 -t -l 100000
   baseline	RFC
   7.042s	6.966s
   7.022s	7.045s
   7.054s	6.985s

 - stress-ng --rawpkt 128 --rawpkt-ops 100000000
   baseline	RFC
   2.42s	2.15s
   2.45s	2.16s
   2.44s	2.17s

The results above show roughly a 10% improvement in the stress-ng rawpkt
testcase, though not much improvement in the perf bench sched testcase.

Thanks for any comment and code review!

Chengming Zhou (5):
  slub: Introduce on_partial()
  slub: Don't manipulate slab list when used by cpu
  slub: Optimize deactivate_slab()
  slub: Don't freeze slabs for cpu partial
  slub: Introduce get_cpu_partial()

 mm/slab.h |   2 +-
 mm/slub.c | 257 +++++++++++++++++++++++++++++++-----------------------
 2 files changed, 150 insertions(+), 109 deletions(-)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH 1/5] slub: Introduce on_partial()
  2023-10-17 15:44 [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs chengming.zhou
@ 2023-10-17 15:44 ` chengming.zhou
  2023-10-17 15:54   ` Matthew Wilcox
  2023-10-27  5:26   ` kernel test robot
  2023-10-17 15:44 ` [RFC PATCH 2/5] slub: Don't manipulate slab list when used by cpu chengming.zhou
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 12+ messages in thread
From: chengming.zhou @ 2023-10-17 15:44 UTC (permalink / raw)
  To: cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, vbabka, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, chengming.zhou,
	Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

We change slab->__unused to slab->flags and use it as SLUB_FLAGS, which for
now only includes the SF_NODE_PARTIAL flag. It indicates whether or not the
slab is on the node partial list.

The following patches will stop freezing slabs when moving them from the
node partial list to the cpu partial list, so we can no longer rely on the
frozen bit to decide whether we should manipulate slab->slab_list.

Instead we will rely on this SF_NODE_PARTIAL flag, which is protected by
the node list_lock.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/slab.h |  2 +-
 mm/slub.c | 28 ++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/mm/slab.h b/mm/slab.h
index 8cd3294fedf5..11e9c9a0f648 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -89,7 +89,7 @@ struct slab {
 		};
 		struct rcu_head rcu_head;
 	};
-	unsigned int __unused;
+	unsigned int flags;
 
 #else
 #error "Unexpected slab allocator configured"
diff --git a/mm/slub.c b/mm/slub.c
index 63d281dfacdb..e5356ad14951 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1993,6 +1993,12 @@ static inline bool shuffle_freelist(struct kmem_cache *s, struct slab *slab)
 }
 #endif /* CONFIG_SLAB_FREELIST_RANDOM */
 
+enum SLUB_FLAGS {
+	SF_INIT_VALUE = 0,
+	SF_EXIT_VALUE = -1,
+	SF_NODE_PARTIAL = 1 << 0,
+};
+
 static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
 	struct slab *slab;
@@ -2031,6 +2037,7 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	slab->objects = oo_objects(oo);
 	slab->inuse = 0;
 	slab->frozen = 0;
+	slab->flags = SF_INIT_VALUE;
 
 	account_slab(slab, oo_order(oo), s, flags);
 
@@ -2077,6 +2084,7 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
 	int order = folio_order(folio);
 	int pages = 1 << order;
 
+	slab->flags = SF_EXIT_VALUE;
 	__slab_clear_pfmemalloc(slab);
 	folio->mapping = NULL;
 	/* Make the mapping reset visible before clearing the flag */
@@ -2119,9 +2127,28 @@ static void discard_slab(struct kmem_cache *s, struct slab *slab)
 /*
  * Management of partially allocated slabs.
  */
+static void ___add_partial(struct kmem_cache_node *n, struct slab *slab)
+{
+	lockdep_assert_held(&n->list_lock);
+	slab->flags |= SF_NODE_PARTIAL;
+}
+
+static void ___remove_partial(struct kmem_cache_node *n, struct slab *slab)
+{
+	lockdep_assert_held(&n->list_lock);
+	slab->flags &= ~SF_NODE_PARTIAL;
+}
+
+static inline bool on_partial(struct kmem_cache_node *n, struct slab *slab)
+{
+	lockdep_assert_held(&n->list_lock);
+	return slab->flags & SF_NODE_PARTIAL;
+}
+
 static inline void
 __add_partial(struct kmem_cache_node *n, struct slab *slab, int tail)
 {
+	___add_partial(n, slab);
 	n->nr_partial++;
 	if (tail == DEACTIVATE_TO_TAIL)
 		list_add_tail(&slab->slab_list, &n->partial);
@@ -2142,6 +2169,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
 	lockdep_assert_held(&n->list_lock);
 	list_del(&slab->slab_list);
 	n->nr_partial--;
+	___remove_partial(n, slab);
 }
 
 /*
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 2/5] slub: Don't manipulate slab list when used by cpu
  2023-10-17 15:44 [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs chengming.zhou
  2023-10-17 15:44 ` [RFC PATCH 1/5] slub: Introduce on_partial() chengming.zhou
@ 2023-10-17 15:44 ` chengming.zhou
  2023-10-17 15:44 ` [RFC PATCH 3/5] slub: Optimize deactivate_slab() chengming.zhou
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: chengming.zhou @ 2023-10-17 15:44 UTC (permalink / raw)
  To: cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, vbabka, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, chengming.zhou,
	Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

We will stop freezing slabs when moving them off the node partial list in
the following patch, so we can't rely on the frozen bit to indicate whether
we should manipulate the slab list.

This patch uses the newly introduced on_partial() helper, which checks
slab->flags under the protection of the node list_lock, so we know whether
the slab is on the node partial list.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/slub.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index e5356ad14951..27eac93baa13 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3636,6 +3636,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 	unsigned long counters;
 	struct kmem_cache_node *n = NULL;
 	unsigned long flags;
+	bool on_node_partial;
 
 	stat(s, FREE_SLOWPATH);
 
@@ -3683,6 +3684,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 				 */
 				spin_lock_irqsave(&n->list_lock, flags);
 
+				on_node_partial = on_partial(n, slab);
 			}
 		}
 
@@ -3711,6 +3713,15 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 		return;
 	}
 
+	/*
+	 * This slab was not on node partial list and not full either,
+	 * in which case we shouldn't manipulate its list, early return.
+	 */
+	if (!on_node_partial && prior) {
+		spin_unlock_irqrestore(&n->list_lock, flags);
+		return;
+	}
+
 	if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
 		goto slab_empty;
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 3/5] slub: Optimize deactivate_slab()
  2023-10-17 15:44 [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs chengming.zhou
  2023-10-17 15:44 ` [RFC PATCH 1/5] slub: Introduce on_partial() chengming.zhou
  2023-10-17 15:44 ` [RFC PATCH 2/5] slub: Don't manipulate slab list when used by cpu chengming.zhou
@ 2023-10-17 15:44 ` chengming.zhou
  2023-10-17 15:44 ` [RFC PATCH 4/5] slub: Don't freeze slabs for cpu partial chengming.zhou
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: chengming.zhou @ 2023-10-17 15:44 UTC (permalink / raw)
  To: cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, vbabka, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, chengming.zhou,
	Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

With the introduction of unfrozen slabs on the cpu partial list, we no
longer need to synchronize the slab's frozen state under the node
list_lock.

The caller of deactivate_slab() and the caller of __slab_free() won't
manipulate the slab list concurrently.

So we can take the node list_lock only in stage three, if we need to
manipulate the slab list in this path.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/slub.c | 70 ++++++++++++++++++++-----------------------------------
 1 file changed, 25 insertions(+), 45 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 27eac93baa13..5a9711b35c74 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2520,10 +2520,8 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
 static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
 			    void *freelist)
 {
-	enum slab_modes { M_NONE, M_PARTIAL, M_FREE, M_FULL_NOLIST };
 	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
 	int free_delta = 0;
-	enum slab_modes mode = M_NONE;
 	void *nextfree, *freelist_iter, *freelist_tail;
 	int tail = DEACTIVATE_TO_HEAD;
 	unsigned long flags = 0;
@@ -2570,58 +2568,40 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
 	 * unfrozen and number of objects in the slab may have changed.
 	 * Then release lock and retry cmpxchg again.
 	 */
-redo:
-
-	old.freelist = READ_ONCE(slab->freelist);
-	old.counters = READ_ONCE(slab->counters);
-	VM_BUG_ON(!old.frozen);
-
-	/* Determine target state of the slab */
-	new.counters = old.counters;
-	if (freelist_tail) {
-		new.inuse -= free_delta;
-		set_freepointer(s, freelist_tail, old.freelist);
-		new.freelist = freelist;
-	} else
-		new.freelist = old.freelist;
+	do {
+		old.freelist = READ_ONCE(slab->freelist);
+		old.counters = READ_ONCE(slab->counters);
+		VM_BUG_ON(!old.frozen);
+
+		/* Determine target state of the slab */
+		new.counters = old.counters;
+		new.frozen = 0;
+		if (freelist_tail) {
+			new.inuse -= free_delta;
+			set_freepointer(s, freelist_tail, old.freelist);
+			new.freelist = freelist;
+		} else
+			new.freelist = old.freelist;
 
-	new.frozen = 0;
+	} while (!slab_update_freelist(s, slab,
+		old.freelist, old.counters,
+		new.freelist, new.counters,
+		"unfreezing slab"));
 
+	/*
+	 * Stage three: Manipulate the slab list based on the updated state.
+	 */
 	if (!new.inuse && n->nr_partial >= s->min_partial) {
-		mode = M_FREE;
+		stat(s, DEACTIVATE_EMPTY);
+		discard_slab(s, slab);
+		stat(s, FREE_SLAB);
 	} else if (new.freelist) {
-		mode = M_PARTIAL;
-		/*
-		 * Taking the spinlock removes the possibility that
-		 * acquire_slab() will see a slab that is frozen
-		 */
 		spin_lock_irqsave(&n->list_lock, flags);
-	} else {
-		mode = M_FULL_NOLIST;
-	}
-
-
-	if (!slab_update_freelist(s, slab,
-				old.freelist, old.counters,
-				new.freelist, new.counters,
-				"unfreezing slab")) {
-		if (mode == M_PARTIAL)
-			spin_unlock_irqrestore(&n->list_lock, flags);
-		goto redo;
-	}
-
-
-	if (mode == M_PARTIAL) {
 		add_partial(n, slab, tail);
 		spin_unlock_irqrestore(&n->list_lock, flags);
 		stat(s, tail);
-	} else if (mode == M_FREE) {
-		stat(s, DEACTIVATE_EMPTY);
-		discard_slab(s, slab);
-		stat(s, FREE_SLAB);
-	} else if (mode == M_FULL_NOLIST) {
+	} else
 		stat(s, DEACTIVATE_FULL);
-	}
 }
 
 #ifdef CONFIG_SLUB_CPU_PARTIAL
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 4/5] slub: Don't freeze slabs for cpu partial
  2023-10-17 15:44 [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs chengming.zhou
                   ` (2 preceding siblings ...)
  2023-10-17 15:44 ` [RFC PATCH 3/5] slub: Optimize deactivate_slab() chengming.zhou
@ 2023-10-17 15:44 ` chengming.zhou
  2023-10-17 15:44 ` [RFC PATCH 5/5] slub: Introduce get_cpu_partial() chengming.zhou
  2023-10-18  6:34 ` [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs Hyeonggon Yoo
  5 siblings, 0 replies; 12+ messages in thread
From: chengming.zhou @ 2023-10-17 15:44 UTC (permalink / raw)
  To: cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, vbabka, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, chengming.zhou,
	Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

Currently we freeze slabs when moving them from the node partial list to
the cpu partial list. This method needs two cmpxchg_double operations:

1. freeze the slab (acquire_slab()) under the node list_lock
2. get_freelist() when the slab is picked for use in ___slab_alloc()

Actually we don't need to freeze slabs when moving them off the node
partial list; we can delay freezing until we get the slab's freelist in
___slab_alloc(), which saves one cmpxchg_double().

There are other benefits as well:

1. Moving slabs between the node partial list and the cpu partial list
   becomes simpler, since we don't need to freeze or unfreeze at all.

2. The node list_lock contention is reduced, since we only need to freeze
   one slab under the node list_lock. (In fact, we could first move slabs
   off the node partial list without freezing any slab at all, so the
   contention on the slab won't be transferred to the node list_lock.)

We can do this because no concurrent path manipulates the partial slab
list except __slab_free(), which is serialized using the newly introduced
slab->flags.

Note that this patch only changes the part that moves the partial slabs,
for easier code review; the other parts are fixed in the following
patches.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/slub.c | 61 ++++++++++++++++---------------------------------------
 1 file changed, 17 insertions(+), 44 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 5a9711b35c74..044235bd8a45 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2329,19 +2329,21 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 			continue;
 		}
 
-		t = acquire_slab(s, n, slab, object == NULL);
-		if (!t)
-			break;
-
 		if (!object) {
-			*pc->slab = slab;
-			stat(s, ALLOC_FROM_PARTIAL);
-			object = t;
-		} else {
-			put_cpu_partial(s, slab, 0);
-			stat(s, CPU_PARTIAL_NODE);
-			partial_slabs++;
+			t = acquire_slab(s, n, slab, object == NULL);
+			if (t) {
+				*pc->slab = slab;
+				stat(s, ALLOC_FROM_PARTIAL);
+				object = t;
+				continue;
+			}
 		}
+
+		remove_partial(n, slab);
+		put_cpu_partial(s, slab, 0);
+		stat(s, CPU_PARTIAL_NODE);
+		partial_slabs++;
+
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 		if (!kmem_cache_has_cpu_partial(s)
 			|| partial_slabs > s->cpu_partial_slabs / 2)
@@ -2612,9 +2614,6 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
 	unsigned long flags = 0;
 
 	while (partial_slab) {
-		struct slab new;
-		struct slab old;
-
 		slab = partial_slab;
 		partial_slab = slab->next;
 
@@ -2627,23 +2626,7 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
 			spin_lock_irqsave(&n->list_lock, flags);
 		}
 
-		do {
-
-			old.freelist = slab->freelist;
-			old.counters = slab->counters;
-			VM_BUG_ON(!old.frozen);
-
-			new.counters = old.counters;
-			new.freelist = old.freelist;
-
-			new.frozen = 0;
-
-		} while (!__slab_update_freelist(s, slab,
-				old.freelist, old.counters,
-				new.freelist, new.counters,
-				"unfreezing slab"));
-
-		if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) {
+		if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial)) {
 			slab->next = slab_to_discard;
 			slab_to_discard = slab;
 		} else {
@@ -3640,18 +3623,8 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 		was_frozen = new.frozen;
 		new.inuse -= cnt;
 		if ((!new.inuse || !prior) && !was_frozen) {
-
-			if (kmem_cache_has_cpu_partial(s) && !prior) {
-
-				/*
-				 * Slab was on no list before and will be
-				 * partially empty
-				 * We can defer the list move and instead
-				 * freeze it.
-				 */
-				new.frozen = 1;
-
-			} else { /* Needs to be taken off a list */
+			/* Needs to be taken off a list */
+			if (!kmem_cache_has_cpu_partial(s) || prior) {
 
 				n = get_node(s, slab_nid(slab));
 				/*
@@ -3681,7 +3654,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 			 * activity can be necessary.
 			 */
 			stat(s, FREE_FROZEN);
-		} else if (new.frozen) {
+		} else if (kmem_cache_has_cpu_partial(s) && !prior) {
 			/*
 			 * If we just froze the slab then put it onto the
 			 * per cpu partial list.
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 5/5] slub: Introduce get_cpu_partial()
  2023-10-17 15:44 [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs chengming.zhou
                   ` (3 preceding siblings ...)
  2023-10-17 15:44 ` [RFC PATCH 4/5] slub: Don't freeze slabs for cpu partial chengming.zhou
@ 2023-10-17 15:44 ` chengming.zhou
  2023-10-18  6:34 ` [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs Hyeonggon Yoo
  5 siblings, 0 replies; 12+ messages in thread
From: chengming.zhou @ 2023-10-17 15:44 UTC (permalink / raw)
  To: cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, vbabka, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, chengming.zhou,
	Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

Since the slabs on the cpu partial list are no longer frozen, introduce
get_cpu_partial() to get a frozen slab together with its freelist from the
cpu partial list. It now works much like getting a frozen slab with its
freelist from the node partial list.

Another change concerns get_partial(): it can now return no frozen slab
when acquire_slab() fails for every slab, while still having moved some
unfrozen slabs onto its cpu partial list, so we need to check for this rare
case to avoid allocating a new slab.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/slub.c | 87 +++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 68 insertions(+), 19 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 044235bd8a45..d58eaf8447fd 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3064,6 +3064,68 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
 	return freelist;
 }
 
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+
+static void *get_cpu_partial(struct kmem_cache *s, struct kmem_cache_cpu *c,
+			     struct slab **slabptr, int node, gfp_t gfpflags)
+{
+	unsigned long flags;
+	struct slab *slab;
+	struct slab new;
+	unsigned long counters;
+	void *freelist;
+
+	while (slub_percpu_partial(c)) {
+		local_lock_irqsave(&s->cpu_slab->lock, flags);
+		if (unlikely(!slub_percpu_partial(c))) {
+			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+			/* we were preempted and partial list got empty */
+			return NULL;
+		}
+
+		slab = slub_percpu_partial(c);
+		slub_set_percpu_partial(c, slab);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+		stat(s, CPU_PARTIAL_ALLOC);
+
+		if (unlikely(!node_match(slab, node) ||
+			     !pfmemalloc_match(slab, gfpflags))) {
+			slab->next = NULL;
+			__unfreeze_partials(s, slab);
+			continue;
+		}
+
+		do {
+			freelist = slab->freelist;
+			counters = slab->counters;
+
+			new.counters = counters;
+			VM_BUG_ON(new.frozen);
+
+			new.inuse = slab->objects;
+			new.frozen = 1;
+		} while (!__slab_update_freelist(s, slab,
+			freelist, counters,
+			NULL, new.counters,
+			"get_cpu_partial"));
+
+		*slabptr = slab;
+		return freelist;
+	}
+
+	return NULL;
+}
+
+#else /* CONFIG_SLUB_CPU_PARTIAL */
+
+static void *get_cpu_partial(struct kmem_cache *s, struct kmem_cache_cpu *c,
+			     struct slab **slabptr, int node, gfp_t gfpflags)
+{
+	return NULL;
+}
+
+#endif
+
 /*
  * Slow path. The lockless freelist is empty or we need to perform
  * debugging duties.
@@ -3106,7 +3168,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 			node = NUMA_NO_NODE;
 		goto new_slab;
 	}
-redo:
 
 	if (unlikely(!node_match(slab, node))) {
 		/*
@@ -3182,24 +3243,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 
 new_slab:
 
-	if (slub_percpu_partial(c)) {
-		local_lock_irqsave(&s->cpu_slab->lock, flags);
-		if (unlikely(c->slab)) {
-			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
-			goto reread_slab;
-		}
-		if (unlikely(!slub_percpu_partial(c))) {
-			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
-			/* we were preempted and partial list got empty */
-			goto new_objects;
-		}
-
-		slab = c->slab = slub_percpu_partial(c);
-		slub_set_percpu_partial(c, slab);
-		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
-		stat(s, CPU_PARTIAL_ALLOC);
-		goto redo;
-	}
+	freelist = get_cpu_partial(s, c, &slab, node, gfpflags);
+	if (freelist)
+		goto retry_load_slab;
 
 new_objects:
 
@@ -3210,6 +3256,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	if (freelist)
 		goto check_new_slab;
 
+	if (slub_percpu_partial(c))
+		goto new_slab;
+
 	slub_put_cpu_ptr(s->cpu_slab);
 	slab = new_slab(s, gfpflags, node);
 	c = slub_get_cpu_ptr(s->cpu_slab);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 1/5] slub: Introduce on_partial()
  2023-10-17 15:44 ` [RFC PATCH 1/5] slub: Introduce on_partial() chengming.zhou
@ 2023-10-17 15:54   ` Matthew Wilcox
  2023-10-18  7:37     ` Chengming Zhou
  2023-10-27  5:26   ` kernel test robot
  1 sibling, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2023-10-17 15:54 UTC (permalink / raw)
  To: chengming.zhou
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, vbabka,
	roman.gushchin, 42.hyeyoo, linux-mm, linux-kernel,
	Chengming Zhou

On Tue, Oct 17, 2023 at 03:44:35PM +0000, chengming.zhou@linux.dev wrote:
> We change slab->__unused to slab->flags to use it as SLUB_FLAGS, which
> now only include SF_NODE_PARTIAL flag. It indicates whether or not the
> slab is on node partial list.

This is an unnecessarily complex solution.  As with the pfmemalloc bit,
we can reuse the folio flags for a few flags.  I would recommend the
PG_workingset bit for this purpose.
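
A minimal sketch of what such reuse could look like (the slab_*_node_partial
helper names here are hypothetical, only the folio flag accessors are the
existing ones):

/* Hypothetical helpers, sketch only: reuse PG_workingset on the slab's folio. */
static inline void slab_set_node_partial(struct slab *slab)
{
	set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
}

static inline void slab_clear_node_partial(struct slab *slab)
{
	clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
}

static inline bool slab_test_node_partial(struct slab *slab)
{
	return folio_test_workingset(slab_folio(slab));
}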


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs
  2023-10-17 15:44 [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs chengming.zhou
                   ` (4 preceding siblings ...)
  2023-10-17 15:44 ` [RFC PATCH 5/5] slub: Introduce get_cpu_partial() chengming.zhou
@ 2023-10-18  6:34 ` Hyeonggon Yoo
  2023-10-18  7:44   ` Chengming Zhou
  5 siblings, 1 reply; 12+ messages in thread
From: Hyeonggon Yoo @ 2023-10-18  6:34 UTC (permalink / raw)
  To: chengming.zhou
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, vbabka,
	roman.gushchin, linux-mm, linux-kernel, Chengming Zhou

On Wed, Oct 18, 2023 at 12:45 AM <chengming.zhou@linux.dev> wrote:
> 4. Testing
> ==========
> We just did some simple testing on a server with 128 CPUs (2 nodes) to
> compare performance for now.
>
>  - perf bench sched messaging -g 5 -t -l 100000
>    baseline     RFC
>    7.042s       6.966s
>    7.022s       7.045s
>    7.054s       6.985s
>
>  - stress-ng --rawpkt 128 --rawpkt-ops 100000000
>    baseline     RFC
>    2.42s        2.15s
>    2.45s        2.16s
>    2.44s        2.17s
>
> It shows above there is about 10% improvement on stress-ng rawpkt
> testcase, although no much improvement on perf sched bench testcase.
>
> Thanks for any comment and code review!

Hi Chengming, this is the kerneltesting.org test report for your patch series.

I applied this series on my slab-experimental tree [1] for testing,
and I observed several kernel panics [2] [3] [4] on kernels without
CONFIG_SLUB_CPU_PARTIAL.

To verify that this series caused kernel panics, I tested before and after
applying it on Vlastimil's slab/for-next and yeah, this series was the cause.

System is deadlocked on memory and the OOM-killer says there is a
huge amount of slab memory. So maybe there is a memory leak or it makes
slab memory grow unboundedly?

[1] https://git.kerneltesting.org/slab-experimental/
[2] https://lava.kerneltesting.org/scheduler/job/127#bottom
[3] https://lava.kerneltesting.org/scheduler/job/131#bottom
[4] https://lava.kerneltesting.org/scheduler/job/134#bottom

>
> Chengming Zhou (5):
>   slub: Introduce on_partial()
>   slub: Don't manipulate slab list when used by cpu
>   slub: Optimize deactivate_slab()
>   slub: Don't freeze slabs for cpu partial
>   slub: Introduce get_cpu_partial()
>
>  mm/slab.h |   2 +-
>  mm/slub.c | 257 +++++++++++++++++++++++++++++++-----------------------
>  2 files changed, 150 insertions(+), 109 deletions(-)
>
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 1/5] slub: Introduce on_partial()
  2023-10-17 15:54   ` Matthew Wilcox
@ 2023-10-18  7:37     ` Chengming Zhou
  0 siblings, 0 replies; 12+ messages in thread
From: Chengming Zhou @ 2023-10-18  7:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, vbabka,
	roman.gushchin, 42.hyeyoo, linux-mm, linux-kernel,
	Chengming Zhou

On 2023/10/17 23:54, Matthew Wilcox wrote:
> On Tue, Oct 17, 2023 at 03:44:35PM +0000, chengming.zhou@linux.dev wrote:
>> We change slab->__unused to slab->flags to use it as SLUB_FLAGS, which
>> now only include SF_NODE_PARTIAL flag. It indicates whether or not the
>> slab is on node partial list.
> 
> This is an unnecessarily complex solution.  As with the pfmemalloc bit,
> we can reuse the folio flags for a few flags.  I would recommend the
> PG_workingset bit for this purpose.
> 

Yeah, this is better indeed. Thanks for your suggestion!


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs
  2023-10-18  6:34 ` [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs Hyeonggon Yoo
@ 2023-10-18  7:44   ` Chengming Zhou
  0 siblings, 0 replies; 12+ messages in thread
From: Chengming Zhou @ 2023-10-18  7:44 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, vbabka,
	roman.gushchin, linux-mm, linux-kernel, Chengming Zhou

On 2023/10/18 14:34, Hyeonggon Yoo wrote:
> On Wed, Oct 18, 2023 at 12:45 AM <chengming.zhou@linux.dev> wrote:
>> 4. Testing
>> ==========
>> We just did some simple testing on a server with 128 CPUs (2 nodes) to
>> compare performance for now.
>>
>>  - perf bench sched messaging -g 5 -t -l 100000
>>    baseline     RFC
>>    7.042s       6.966s
>>    7.022s       7.045s
>>    7.054s       6.985s
>>
>>  - stress-ng --rawpkt 128 --rawpkt-ops 100000000
>>    baseline     RFC
>>    2.42s        2.15s
>>    2.45s        2.16s
>>    2.44s        2.17s
>>
>> It shows above there is about 10% improvement on stress-ng rawpkt
>> testcase, although no much improvement on perf sched bench testcase.
>>
>> Thanks for any comment and code review!
> 
> Hi Chengming, this is the kerneltesting.org test report for your patch series.
> 
> I applied this series on my slab-experimental tree [1] for testing,
> and I observed several kernel panics [2] [3] [4] on kernels without
> CONFIG_SLUB_CPU_PARTIAL.
> 
> To verify that this series caused kernel panics, I tested before and after
> applying it on Vlastimil's slab/for-next and yeah, this series was the cause.
> 
> System is deadlocked on memory and the OOM-killer says there is a
> huge amount of slab memory. So maybe there is a memory leak or it makes
> slab memory grow unboundedly?

Thanks for the testing!

I can reproduce the OOM locally without CONFIG_SLUB_CPU_PARTIAL.

I made a quick fix below (a better fix will be needed). The root cause is
in patch 4, which wrongly puts some partial slabs onto the CPU partial list
even without CONFIG_SLUB_CPU_PARTIAL, so these partial slabs are leaked.

diff --git a/mm/slub.c b/mm/slub.c
index d58eaf8447fd..b7ba6c008122 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2339,12 +2339,12 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
                        }
                }

+#ifdef CONFIG_SLUB_CPU_PARTIAL
                remove_partial(n, slab);
                put_cpu_partial(s, slab, 0);
                stat(s, CPU_PARTIAL_NODE);
                partial_slabs++;

-#ifdef CONFIG_SLUB_CPU_PARTIAL
                if (!kmem_cache_has_cpu_partial(s)
                        || partial_slabs > s->cpu_partial_slabs / 2)
                        break;


> 
> [1] https://git.kerneltesting.org/slab-experimental/
> [2] https://lava.kerneltesting.org/scheduler/job/127#bottom
> [3] https://lava.kerneltesting.org/scheduler/job/131#bottom
> [4] https://lava.kerneltesting.org/scheduler/job/134#bottom
> 
>>
>> Chengming Zhou (5):
>>   slub: Introduce on_partial()
>>   slub: Don't manipulate slab list when used by cpu
>>   slub: Optimize deactivate_slab()
>>   slub: Don't freeze slabs for cpu partial
>>   slub: Introduce get_cpu_partial()
>>
>>  mm/slab.h |   2 +-
>>  mm/slub.c | 257 +++++++++++++++++++++++++++++++-----------------------
>>  2 files changed, 150 insertions(+), 109 deletions(-)
>>
>> --
>> 2.40.1
>>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 1/5] slub: Introduce on_partial()
  2023-10-17 15:44 ` [RFC PATCH 1/5] slub: Introduce on_partial() chengming.zhou
  2023-10-17 15:54   ` Matthew Wilcox
@ 2023-10-27  5:26   ` kernel test robot
  2023-10-27  9:43     ` Chengming Zhou
  1 sibling, 1 reply; 12+ messages in thread
From: kernel test robot @ 2023-10-27  5:26 UTC (permalink / raw)
  To: chengming.zhou
  Cc: oe-lkp, lkp, linux-mm, cl, penberg, rientjes, iamjoonsoo.kim,
	akpm, vbabka, roman.gushchin, 42.hyeyoo, linux-kernel,
	chengming.zhou, Chengming Zhou, oliver.sang



Hello,

kernel test robot noticed "WARNING:at_mm/slub.c:#___add_partial" on:

commit: 0805463ab860a2dde667bd4423a30efbf650b34b ("[RFC PATCH 1/5] slub: Introduce on_partial()")
url: https://github.com/intel-lab-lkp/linux/commits/chengming-zhou-linux-dev/slub-Introduce-on_partial/20231017-234739
base: git://git.kernel.org/cgit/linux/kernel/git/vbabka/slab.git for-next
patch link: https://lore.kernel.org/all/20231017154439.3036608-2-chengming.zhou@linux.dev/
patch subject: [RFC PATCH 1/5] slub: Introduce on_partial()

in testcase: boot

compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+--------------------------------------------+------------+------------+
|                                            | e050a704f3 | 0805463ab8 |
+--------------------------------------------+------------+------------+
| WARNING:at_mm/slub.c:#___add_partial       | 0          | 16         |
| RIP:___add_partial                         | 0          | 16         |
+--------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202310271308.9076b4c0-oliver.sang@intel.com


[    2.344426][    T0] ------------[ cut here ]------------
[ 2.345095][ T0] WARNING: CPU: 0 PID: 0 at mm/slub.c:2132 ___add_partial (mm/slub.c:2132) 
[    2.346072][    T0] Modules linked in:
[    2.346555][    T0] CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.0-rc5-00008-g0805463ab860 #1 e88a4d31ac7553ddd9cc4ecfa6b6cbc9ab8c98ab
[    2.348039][    T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 2.349271][ T0] RIP: 0010:___add_partial (mm/slub.c:2132) 
[ 2.349920][ T0] Code: 05 52 3f fb 05 53 48 89 f3 85 c0 75 0a 83 4b 30 01 5b e9 28 3c 06 03 48 83 c7 18 be ff ff ff ff e8 6a ec 02 03 85 c0 75 e4 90 <0f> 0b 90 83 4b 30 01 5b e9 08 3c 06 03 0f 1f 84 00 00 00 00 00 f6
All code
========
   0:	05 52 3f fb 05       	add    $0x5fb3f52,%eax
   5:	53                   	push   %rbx
   6:	48 89 f3             	mov    %rsi,%rbx
   9:	85 c0                	test   %eax,%eax
   b:	75 0a                	jne    0x17
   d:	83 4b 30 01          	orl    $0x1,0x30(%rbx)
  11:	5b                   	pop    %rbx
  12:	e9 28 3c 06 03       	jmp    0x3063c3f
  17:	48 83 c7 18          	add    $0x18,%rdi
  1b:	be ff ff ff ff       	mov    $0xffffffff,%esi
  20:	e8 6a ec 02 03       	call   0x302ec8f
  25:	85 c0                	test   %eax,%eax
  27:	75 e4                	jne    0xd
  29:	90                   	nop
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	90                   	nop
  2d:	83 4b 30 01          	orl    $0x1,0x30(%rbx)
  31:	5b                   	pop    %rbx
  32:	e9 08 3c 06 03       	jmp    0x3063c3f
  37:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
  3e:	00 
  3f:	f6                   	.byte 0xf6

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	90                   	nop
   3:	83 4b 30 01          	orl    $0x1,0x30(%rbx)
   7:	5b                   	pop    %rbx
   8:	e9 08 3c 06 03       	jmp    0x3063c15
   d:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
  14:	00 
  15:	f6                   	.byte 0xf6
[    2.352308][    T0] RSP: 0000:ffffffff86407dd8 EFLAGS: 00010046
[    2.353078][    T0] RAX: 0000000000000000 RBX: ffffea0004001000 RCX: 0000000000000001
[    2.354058][    T0] RDX: 0000000000000000 RSI: ffffffff84e8e940 RDI: ffffffff855b1ca0
[    2.355041][    T0] RBP: ffff888100040000 R08: 0000000000000002 R09: 0000000000000000
[    2.355978][    T0] R10: ffffffff86f35083 R11: ffffffff819fd2f1 R12: 0000000000000000
[    2.356822][    T0] R13: ffff888100040048 R14: 0000000000000015 R15: ffffffff886073e0
[    2.357702][    T0] FS:  0000000000000000(0000) GS:ffff8883aec00000(0000) knlGS:0000000000000000
[    2.358674][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.359469][    T0] CR2: ffff88843ffff000 CR3: 00000000064dc000 CR4: 00000000000000b0
[    2.360402][    T0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    2.361328][    T0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    2.362305][    T0] Call Trace:
[    2.362722][    T0]  <TASK>
[ 2.363087][ T0] ? show_regs (arch/x86/kernel/dumpstack.c:479) 
[ 2.365499][ T0] ? __warn (kernel/panic.c:673) 
[ 2.366034][ T0] ? ___add_partial (mm/slub.c:2132) 
[ 2.366627][ T0] ? report_bug (lib/bug.c:180 lib/bug.c:219) 
[ 2.367200][ T0] ? handle_bug (arch/x86/kernel/traps.c:237) 
[ 2.367743][ T0] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1)) 
[ 2.368309][ T0] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) 
[ 2.368930][ T0] ? kasan_set_track (mm/kasan/common.c:52) 
[ 2.369529][ T0] ? ___add_partial (mm/slub.c:2132) 
[ 2.370121][ T0] ? ___add_partial (mm/slub.c:2132 (discriminator 1)) 
[ 2.370706][ T0] early_kmem_cache_node_alloc (include/linux/list.h:169 mm/slub.c:2156 mm/slub.c:4308) 
[ 2.371471][ T0] kmem_cache_open (mm/slub.c:4340 mm/slub.c:4578) 
[ 2.372060][ T0] __kmem_cache_create (mm/slub.c:5140) 
[ 2.372688][ T0] create_boot_cache (mm/slab_common.c:654) 
[ 2.373317][ T0] kmem_cache_init (mm/slub.c:5075) 
[ 2.373936][ T0] mm_core_init (mm/mm_init.c:2786) 
[ 2.374519][ T0] start_kernel (init/main.c:929) 
[ 2.375103][ T0] x86_64_start_reservations (arch/x86/kernel/head64.c:544) 
[ 2.375763][ T0] x86_64_start_kernel (arch/x86/kernel/head64.c:486 (discriminator 17)) 
[ 2.376353][ T0] secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433) 
[    2.377096][    T0]  </TASK>
[    2.377447][    T0] irq event stamp: 0
[ 2.377916][ T0] hardirqs last enabled at (0): 0x0 
[ 2.378794][ T0] hardirqs last disabled at (0): 0x0 
[ 2.379684][ T0] softirqs last enabled at (0): 0x0 
[ 2.380551][ T0] softirqs last disabled at (0): 0x0 
[    2.381441][    T0] ---[ end trace 0000000000000000 ]---
[    2.384117][    T0] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231027/202310271308.9076b4c0-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 1/5] slub: Introduce on_partial()
  2023-10-27  5:26   ` kernel test robot
@ 2023-10-27  9:43     ` Chengming Zhou
  0 siblings, 0 replies; 12+ messages in thread
From: Chengming Zhou @ 2023-10-27  9:43 UTC (permalink / raw)
  To: kernel test robot
  Cc: oe-lkp, lkp, linux-mm, cl, penberg, rientjes, iamjoonsoo.kim,
	akpm, vbabka, roman.gushchin, 42.hyeyoo, linux-kernel,
	Chengming Zhou

On 2023/10/27 13:26, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed "WARNING:at_mm/slub.c:#___add_partial" on:
> 
> commit: 0805463ab860a2dde667bd4423a30efbf650b34b ("[RFC PATCH 1/5] slub: Introduce on_partial()")
> url: https://github.com/intel-lab-lkp/linux/commits/chengming-zhou-linux-dev/slub-Introduce-on_partial/20231017-234739
> base: git://git.kernel.org/cgit/linux/kernel/git/vbabka/slab.git for-next
> patch link: https://lore.kernel.org/all/20231017154439.3036608-2-chengming.zhou@linux.dev/
> patch subject: [RFC PATCH 1/5] slub: Introduce on_partial()
> 
> in testcase: boot
> 
> compiler: gcc-12
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> 
> (please refer to attached dmesg/kmsg for entire log/backtrace)
> 
> 
> +--------------------------------------------+------------+------------+
> |                                            | e050a704f3 | 0805463ab8 |
> +--------------------------------------------+------------+------------+
> | WARNING:at_mm/slub.c:#___add_partial       | 0          | 16         |
> | RIP:___add_partial                         | 0          | 16         |
> +--------------------------------------------+------------+------------+
> 
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202310271308.9076b4c0-oliver.sang@intel.com
> 
> 
> [    2.344426][    T0] ------------[ cut here ]------------
> [ 2.345095][ T0] WARNING: CPU: 0 PID: 0 at mm/slub.c:2132 ___add_partial (mm/slub.c:2132) 

The latest version, "RFC v3", should not have this problem, since it
changed to use the page flag "workingset" bit instead of the mapcount,
which had to be initialized from -1 to 0 in allocate_slab().

Here, the problem is that the boot cache is not allocated from
allocate_slab().

RFC v3: https://lore.kernel.org/all/20231024093345.3676493-1-chengming.zhou@linux.dev/

Thanks!

> [    2.346072][    T0] Modules linked in:
> [    2.346555][    T0] CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.0-rc5-00008-g0805463ab860 #1 e88a4d31ac7553ddd9cc4ecfa6b6cbc9ab8c98ab
> [    2.348039][    T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 2.349271][ T0] RIP: 0010:___add_partial (mm/slub.c:2132) 
> [ 2.349920][ T0] Code: 05 52 3f fb 05 53 48 89 f3 85 c0 75 0a 83 4b 30 01 5b e9 28 3c 06 03 48 83 c7 18 be ff ff ff ff e8 6a ec 02 03 85 c0 75 e4 90 <0f> 0b 90 83 4b 30 01 5b e9 08 3c 06 03 0f 1f 84 00 00 00 00 00 f6
> All code
> ========
>    0:	05 52 3f fb 05       	add    $0x5fb3f52,%eax
>    5:	53                   	push   %rbx
>    6:	48 89 f3             	mov    %rsi,%rbx
>    9:	85 c0                	test   %eax,%eax
>    b:	75 0a                	jne    0x17
>    d:	83 4b 30 01          	orl    $0x1,0x30(%rbx)
>   11:	5b                   	pop    %rbx
>   12:	e9 28 3c 06 03       	jmp    0x3063c3f
>   17:	48 83 c7 18          	add    $0x18,%rdi
>   1b:	be ff ff ff ff       	mov    $0xffffffff,%esi
>   20:	e8 6a ec 02 03       	call   0x302ec8f
>   25:	85 c0                	test   %eax,%eax
>   27:	75 e4                	jne    0xd
>   29:	90                   	nop
>   2a:*	0f 0b                	ud2		<-- trapping instruction
>   2c:	90                   	nop
>   2d:	83 4b 30 01          	orl    $0x1,0x30(%rbx)
>   31:	5b                   	pop    %rbx
>   32:	e9 08 3c 06 03       	jmp    0x3063c3f
>   37:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
>   3e:	00 
>   3f:	f6                   	.byte 0xf6
> 
> Code starting with the faulting instruction
> ===========================================
>    0:	0f 0b                	ud2
>    2:	90                   	nop
>    3:	83 4b 30 01          	orl    $0x1,0x30(%rbx)
>    7:	5b                   	pop    %rbx
>    8:	e9 08 3c 06 03       	jmp    0x3063c15
>    d:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
>   14:	00 
>   15:	f6                   	.byte 0xf6
> [    2.352308][    T0] RSP: 0000:ffffffff86407dd8 EFLAGS: 00010046
> [    2.353078][    T0] RAX: 0000000000000000 RBX: ffffea0004001000 RCX: 0000000000000001
> [    2.354058][    T0] RDX: 0000000000000000 RSI: ffffffff84e8e940 RDI: ffffffff855b1ca0
> [    2.355041][    T0] RBP: ffff888100040000 R08: 0000000000000002 R09: 0000000000000000
> [    2.355978][    T0] R10: ffffffff86f35083 R11: ffffffff819fd2f1 R12: 0000000000000000
> [    2.356822][    T0] R13: ffff888100040048 R14: 0000000000000015 R15: ffffffff886073e0
> [    2.357702][    T0] FS:  0000000000000000(0000) GS:ffff8883aec00000(0000) knlGS:0000000000000000
> [    2.358674][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    2.359469][    T0] CR2: ffff88843ffff000 CR3: 00000000064dc000 CR4: 00000000000000b0
> [    2.360402][    T0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    2.361328][    T0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    2.362305][    T0] Call Trace:
> [    2.362722][    T0]  <TASK>
> [ 2.363087][ T0] ? show_regs (arch/x86/kernel/dumpstack.c:479) 
> [ 2.365499][ T0] ? __warn (kernel/panic.c:673) 
> [ 2.366034][ T0] ? ___add_partial (mm/slub.c:2132) 
> [ 2.366627][ T0] ? report_bug (lib/bug.c:180 lib/bug.c:219) 
> [ 2.367200][ T0] ? handle_bug (arch/x86/kernel/traps.c:237) 
> [ 2.367743][ T0] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1)) 
> [ 2.368309][ T0] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) 
> [ 2.368930][ T0] ? kasan_set_track (mm/kasan/common.c:52) 
> [ 2.369529][ T0] ? ___add_partial (mm/slub.c:2132) 
> [ 2.370121][ T0] ? ___add_partial (mm/slub.c:2132 (discriminator 1)) 
> [ 2.370706][ T0] early_kmem_cache_node_alloc (include/linux/list.h:169 mm/slub.c:2156 mm/slub.c:4308) 
> [ 2.371471][ T0] kmem_cache_open (mm/slub.c:4340 mm/slub.c:4578) 
> [ 2.372060][ T0] __kmem_cache_create (mm/slub.c:5140) 
> [ 2.372688][ T0] create_boot_cache (mm/slab_common.c:654) 
> [ 2.373317][ T0] kmem_cache_init (mm/slub.c:5075) 
> [ 2.373936][ T0] mm_core_init (mm/mm_init.c:2786) 
> [ 2.374519][ T0] start_kernel (init/main.c:929) 
> [ 2.375103][ T0] x86_64_start_reservations (arch/x86/kernel/head64.c:544) 
> [ 2.375763][ T0] x86_64_start_kernel (arch/x86/kernel/head64.c:486 (discriminator 17)) 
> [ 2.376353][ T0] secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433) 
> [    2.377096][    T0]  </TASK>
> [    2.377447][    T0] irq event stamp: 0
> [ 2.377916][ T0] hardirqs last enabled at (0): 0x0 
> [ 2.378794][ T0] hardirqs last disabled at (0): 0x0 
> [ 2.379684][ T0] softirqs last enabled at (0): 0x0 
> [ 2.380551][ T0] softirqs last disabled at (0): 0x0 
> [    2.381441][    T0] ---[ end trace 0000000000000000 ]---
> [    2.384117][    T0] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
> 
> 
> 
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20231027/202310271308.9076b4c0-oliver.sang@intel.com
> 
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2023-10-27  9:44 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-17 15:44 [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs chengming.zhou
2023-10-17 15:44 ` [RFC PATCH 1/5] slub: Introduce on_partial() chengming.zhou
2023-10-17 15:54   ` Matthew Wilcox
2023-10-18  7:37     ` Chengming Zhou
2023-10-27  5:26   ` kernel test robot
2023-10-27  9:43     ` Chengming Zhou
2023-10-17 15:44 ` [RFC PATCH 2/5] slub: Don't manipulate slab list when used by cpu chengming.zhou
2023-10-17 15:44 ` [RFC PATCH 3/5] slub: Optimize deactivate_slab() chengming.zhou
2023-10-17 15:44 ` [RFC PATCH 4/5] slub: Don't freeze slabs for cpu partial chengming.zhou
2023-10-17 15:44 ` [RFC PATCH 5/5] slub: Introduce get_cpu_partial() chengming.zhou
2023-10-18  6:34 ` [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs Hyeonggon Yoo
2023-10-18  7:44   ` Chengming Zhou
