* [PATCH v3 0/4] mm/slub: Fix count_partial() problem
@ 2021-03-09 15:25 Xunlei Pang
  2021-03-09 15:25 ` [PATCH v3 1/4] mm/slub: Introduce two counters for partial objects Xunlei Pang
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Xunlei Pang @ 2021-03-09 15:25 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, Vlastimil Babka, Roman Gushchin,
	Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming,
	Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang, Xunlei Pang

count_partial() can hold the n->list_lock spinlock for quite a long time,
which causes a lot of trouble for the system. This series eliminates the
problem.
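
For reference, this is the count_partial() path that patch 2 removes; it
walks the whole per-node partial list with list_lock held and irqs
disabled, so the lock hold time grows with the length of the list:

  /* get_count is count_free(), count_inuse() or count_total() */
  static int count_free(struct page *page)
  {
          return page->objects - page->inuse;
  }

  static unsigned long count_partial(struct kmem_cache_node *n,
                                     int (*get_count)(struct page *))
  {
          unsigned long flags;
          unsigned long x = 0;
          struct page *page;

          spin_lock_irqsave(&n->list_lock, flags);
          list_for_each_entry(page, &n->partial, slab_list)
                  x += get_count(page);
          spin_unlock_irqrestore(&n->list_lock, flags);
          return x;
  }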

v1->v2:
- Improved changelog and variable naming for PATCH 1~2.
- PATCH3 adds a per-cpu counter to avoid a performance regression
  in concurrent __slab_free().

v2->v3:
- Changed "page->inuse" to the safe "new.inuse", etc.
- Used the CONFIG_SLUB_DEBUG and CONFIG_SYSFS conditions for the new counters.
- atomic_long_t -> unsigned long

[Testing]
According to my tests, there might be a slight performance impact under
extreme concurrent __slab_free() calls.

On my 32-cpu 2-socket physical machine:
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz

1) perf stat --null --repeat 10 -- hackbench 20 thread 20000

== original, not patched
Performance counter stats for 'hackbench 20 thread 20000' (10 runs):

      24.536050899 seconds time elapsed                                          ( +-  0.24% )


Performance counter stats for 'hackbench 20 thread 20000' (10 runs):

      24.588049142 seconds time elapsed                                          ( +-  0.35% )


== patched with patch1~4
Performance counter stats for 'hackbench 20 thread 20000' (10 runs):

      24.670892273 seconds time elapsed                                          ( +-  0.29% )


Performance counter stats for 'hackbench 20 thread 20000' (10 runs):

      24.746755689 seconds time elapsed                                          ( +-  0.21% )


2) perf stat --null --repeat 10 -- hackbench 32 thread 20000

== original, not patched
 Performance counter stats for 'hackbench 32 thread 20000' (10 runs):

      39.784911855 seconds time elapsed                                          ( +-  0.14% )

 Performance counter stats for 'hackbench 32 thread 20000' (10 runs):

      39.868687608 seconds time elapsed                                          ( +-  0.19% )

== patched with patch1~4
 Performance counter stats for 'hackbench 32 thread 20000' (10 runs):

      39.681273015 seconds time elapsed                                          ( +-  0.21% )

 Performance counter stats for 'hackbench 32 thread 20000' (10 runs):

      39.681238459 seconds time elapsed                                          ( +-  0.09% )


Xunlei Pang (4):
  mm/slub: Introduce two counters for partial objects
  mm/slub: Get rid of count_partial()
  percpu: Export per_cpu_sum()
  mm/slub: Use percpu partial free counter

 include/linux/percpu-defs.h   |  10 ++++
 kernel/locking/percpu-rwsem.c |  10 ----
 mm/slab.h                     |   4 ++
 mm/slub.c                     | 120 +++++++++++++++++++++++++++++-------------
 4 files changed, 97 insertions(+), 47 deletions(-)

-- 
1.8.3.1



* [PATCH v3 1/4] mm/slub: Introduce two counters for partial objects
  2021-03-09 15:25 [PATCH v3 0/4] mm/slub: Fix count_partial() problem Xunlei Pang
@ 2021-03-09 15:25 ` Xunlei Pang
  2021-03-09 15:25 ` [PATCH v3 2/4] mm/slub: Get rid of count_partial() Xunlei Pang
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Xunlei Pang @ 2021-03-09 15:25 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, Vlastimil Babka, Roman Gushchin,
	Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming,
	Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang, Xunlei Pang

count_partial() holds the node list_lock for a long time while iterating
over a large partial page list, which can cause a thundering herd effect
on list_lock contention.

We have HSF RT (High-speed Service Framework Response-Time) monitors, and
the RT figures fluctuated randomly, so we deployed a tool that detects
"irq off" and "preempt off" events and dumps the culprit's calltrace. It
captured the list_lock being held for nearly 100ms with irqs off, issued
by "ss", which also caused network timeouts.

This patch introduces two counters that maintain the actual number of
partial objects dynamically, instead of iterating over the partial page
lists with list_lock held.

The new kmem_cache_node counters are partial_free_objs and
partial_total_objs. The main update operations happen under list_lock in
the slow path, so their performance impact should be minimal, except for
the __slab_free() path, which will be addressed later.
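
In other words, under list_lock the new fields are meant to maintain,
roughly, the following invariant (illustration only, not part of the diff):

  /*
   * n->partial_total_objs == sum of page->objects over n->partial
   * n->partial_free_objs  == sum of (page->objects - page->inuse)
   *                          over n->partial
   */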

Tested-by: James Wang <jnwang@linux.alibaba.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
---
 mm/slab.h |  4 ++++
 mm/slub.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/mm/slab.h b/mm/slab.h
index 076582f..817bfa0 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -547,6 +547,10 @@ struct kmem_cache_node {
 #ifdef CONFIG_SLUB
 	unsigned long nr_partial;
 	struct list_head partial;
+#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
+	atomic_long_t partial_free_objs;
+	unsigned long partial_total_objs;
+#endif
 #ifdef CONFIG_SLUB_DEBUG
 	atomic_long_t nr_slabs;
 	atomic_long_t total_objects;
diff --git a/mm/slub.c b/mm/slub.c
index e26c274..4d02831 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1890,10 +1890,31 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
 /*
  * Management of partially allocated slabs.
  */
+#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
+static inline void
+__update_partial_free(struct kmem_cache_node *n, long delta)
+{
+	atomic_long_add(delta, &n->partial_free_objs);
+}
+
+static inline void
+__update_partial_total(struct kmem_cache_node *n, long delta)
+{
+	n->partial_total_objs += delta;
+}
+#else
+static inline void
+__update_partial_free(struct kmem_cache_node *n, long delta) { }
+
+static inline void
+__update_partial_total(struct kmem_cache_node *n, long delta) { }
+#endif
+
 static inline void
 __add_partial(struct kmem_cache_node *n, struct page *page, int tail)
 {
 	n->nr_partial++;
+	__update_partial_total(n, page->objects);
 	if (tail == DEACTIVATE_TO_TAIL)
 		list_add_tail(&page->slab_list, &n->partial);
 	else
@@ -1913,6 +1934,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
 	lockdep_assert_held(&n->list_lock);
 	list_del(&page->slab_list);
 	n->nr_partial--;
+	__update_partial_total(n, -page->objects);
 }
 
 /*
@@ -1957,6 +1979,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
 		return NULL;
 
 	remove_partial(n, page);
+	__update_partial_free(n, -*objects);
 	WARN_ON(!freelist);
 	return freelist;
 }
@@ -2286,8 +2309,11 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
 				"unfreezing slab"))
 		goto redo;
 
-	if (lock)
+	if (lock) {
+		if (m == M_PARTIAL)
+			__update_partial_free(n, new.objects - new.inuse);
 		spin_unlock(&n->list_lock);
+	}
 
 	if (m == M_PARTIAL)
 		stat(s, tail);
@@ -2353,6 +2379,7 @@ static void unfreeze_partials(struct kmem_cache *s,
 			discard_page = page;
 		} else {
 			add_partial(n, page, DEACTIVATE_TO_TAIL);
+			__update_partial_free(n, new.objects - new.inuse);
 			stat(s, FREE_ADD_PARTIAL);
 		}
 	}
@@ -3039,6 +3066,13 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		head, new.counters,
 		"__slab_free"));
 
+	if (!was_frozen && prior) {
+		if (n)
+			__update_partial_free(n, cnt);
+		else
+			__update_partial_free(get_node(s, page_to_nid(page)), cnt);
+	}
+
 	if (likely(!n)) {
 
 		if (likely(was_frozen)) {
@@ -3069,6 +3103,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 	if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) {
 		remove_full(s, n, page);
 		add_partial(n, page, DEACTIVATE_TO_TAIL);
+		__update_partial_free(n, cnt);
 		stat(s, FREE_ADD_PARTIAL);
 	}
 	spin_unlock_irqrestore(&n->list_lock, flags);
@@ -3080,6 +3115,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		 * Slab on the partial list.
 		 */
 		remove_partial(n, page);
+		__update_partial_free(n, -page->objects);
 		stat(s, FREE_REMOVE_PARTIAL);
 	} else {
 		/* Slab must be on the full list */
@@ -3520,6 +3556,10 @@ static inline int calculate_order(unsigned int size)
 	n->nr_partial = 0;
 	spin_lock_init(&n->list_lock);
 	INIT_LIST_HEAD(&n->partial);
+#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
+	atomic_long_set(&n->partial_free_objs, 0);
+	n->partial_total_objs = 0;
+#endif
 #ifdef CONFIG_SLUB_DEBUG
 	atomic_long_set(&n->nr_slabs, 0);
 	atomic_long_set(&n->total_objects, 0);
@@ -3592,6 +3632,7 @@ static void early_kmem_cache_node_alloc(int node)
 	 * initialized and there is no concurrent access.
 	 */
 	__add_partial(n, page, DEACTIVATE_TO_HEAD);
+	__update_partial_free(n, page->objects - page->inuse);
 }
 
 static void free_kmem_cache_nodes(struct kmem_cache *s)
@@ -3922,6 +3963,7 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
 	list_for_each_entry_safe(page, h, &n->partial, slab_list) {
 		if (!page->inuse) {
 			remove_partial(n, page);
+			__update_partial_free(n, -page->objects);
 			list_add(&page->slab_list, &discard);
 		} else {
 			list_slab_objects(s, page,
@@ -4263,6 +4305,8 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 			if (free == page->objects) {
 				list_move(&page->slab_list, &discard);
 				n->nr_partial--;
+				__update_partial_free(n, -free);
+				__update_partial_total(n, -free);
 			} else if (free <= SHRINK_PROMOTE_MAX)
 				list_move(&page->slab_list, promote + free - 1);
 		}
-- 
1.8.3.1



* [PATCH v3 2/4] mm/slub: Get rid of count_partial()
  2021-03-09 15:25 [PATCH v3 0/4] mm/slub: Fix count_partial() problem Xunlei Pang
  2021-03-09 15:25 ` [PATCH v3 1/4] mm/slub: Introduce two counters for partial objects Xunlei Pang
@ 2021-03-09 15:25 ` Xunlei Pang
  2021-03-09 15:25 ` [PATCH v3 3/4] percpu: Export per_cpu_sum() Xunlei Pang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Xunlei Pang @ 2021-03-09 15:25 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, Vlastimil Babka, Roman Gushchin,
	Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming,
	Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang, Xunlei Pang

Now that the partial counters are ready, use them directly and get rid
of count_partial().

Tested-by: James Wang <jnwang@linux.alibaba.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
---
 mm/slub.c | 54 ++++++++++++++++++++++--------------------------------
 1 file changed, 22 insertions(+), 32 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 4d02831..3f76b57 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2533,11 +2533,6 @@ static inline int node_match(struct page *page, int node)
 }
 
 #ifdef CONFIG_SLUB_DEBUG
-static int count_free(struct page *page)
-{
-	return page->objects - page->inuse;
-}
-
 static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 {
 	return atomic_long_read(&n->total_objects);
@@ -2545,19 +2540,26 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 #endif /* CONFIG_SLUB_DEBUG */
 
 #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
-static unsigned long count_partial(struct kmem_cache_node *n,
-					int (*get_count)(struct page *))
+enum partial_item { PARTIAL_FREE, PARTIAL_INUSE, PARTIAL_TOTAL };
+
+static unsigned long partial_counter(struct kmem_cache_node *n,
+		enum partial_item item)
 {
-	unsigned long flags;
-	unsigned long x = 0;
-	struct page *page;
+	unsigned long ret = 0;
 
-	spin_lock_irqsave(&n->list_lock, flags);
-	list_for_each_entry(page, &n->partial, slab_list)
-		x += get_count(page);
-	spin_unlock_irqrestore(&n->list_lock, flags);
-	return x;
+	if (item == PARTIAL_FREE) {
+		ret = atomic_long_read(&n->partial_free_objs);
+	} else if (item == PARTIAL_TOTAL) {
+		ret = n->partial_total_objs;
+	} else if (item == PARTIAL_INUSE) {
+		ret = n->partial_total_objs - atomic_long_read(&n->partial_free_objs);
+		if ((long)ret < 0)
+			ret = 0;
+	}
+
+	return ret;
 }
+
 #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
 
 static noinline void
@@ -2587,7 +2589,7 @@ static unsigned long count_partial(struct kmem_cache_node *n,
 		unsigned long nr_objs;
 		unsigned long nr_free;
 
-		nr_free  = count_partial(n, count_free);
+		nr_free  = partial_counter(n, PARTIAL_FREE);
 		nr_slabs = node_nr_slabs(n);
 		nr_objs  = node_nr_objs(n);
 
@@ -4643,18 +4645,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
 EXPORT_SYMBOL(__kmalloc_node_track_caller);
 #endif
 
-#ifdef CONFIG_SYSFS
-static int count_inuse(struct page *page)
-{
-	return page->inuse;
-}
-
-static int count_total(struct page *page)
-{
-	return page->objects;
-}
-#endif
-
 #ifdef CONFIG_SLUB_DEBUG
 static void validate_slab(struct kmem_cache *s, struct page *page)
 {
@@ -5091,7 +5081,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 				x = atomic_long_read(&n->total_objects);
 			else if (flags & SO_OBJECTS)
 				x = atomic_long_read(&n->total_objects) -
-					count_partial(n, count_free);
+					partial_counter(n, PARTIAL_FREE);
 			else
 				x = atomic_long_read(&n->nr_slabs);
 			total += x;
@@ -5105,9 +5095,9 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 
 		for_each_kmem_cache_node(s, node, n) {
 			if (flags & SO_TOTAL)
-				x = count_partial(n, count_total);
+				x = partial_counter(n, PARTIAL_TOTAL);
 			else if (flags & SO_OBJECTS)
-				x = count_partial(n, count_inuse);
+				x = partial_counter(n, PARTIAL_INUSE);
 			else
 				x = n->nr_partial;
 			total += x;
@@ -5873,7 +5863,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
 	for_each_kmem_cache_node(s, node, n) {
 		nr_slabs += node_nr_slabs(n);
 		nr_objs += node_nr_objs(n);
-		nr_free += count_partial(n, count_free);
+		nr_free += partial_counter(n, PARTIAL_FREE);
 	}
 
 	sinfo->active_objs = nr_objs - nr_free;
-- 
1.8.3.1



* [PATCH v3 3/4] percpu: Export per_cpu_sum()
  2021-03-09 15:25 [PATCH v3 0/4] mm/slub: Fix count_partial() problem Xunlei Pang
  2021-03-09 15:25 ` [PATCH v3 1/4] mm/slub: Introduce two counters for partial objects Xunlei Pang
  2021-03-09 15:25 ` [PATCH v3 2/4] mm/slub: Get rid of count_partial() Xunlei Pang
@ 2021-03-09 15:25 ` Xunlei Pang
  2021-03-09 15:25 ` [PATCH v3 4/4] mm/slub: Use percpu partial free counter Xunlei Pang
  2021-03-15 18:49 ` [PATCH v3 0/4] mm/slub: Fix count_partial() problem Vlastimil Babka
  4 siblings, 0 replies; 15+ messages in thread
From: Xunlei Pang @ 2021-03-09 15:25 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, Vlastimil Babka, Roman Gushchin,
	Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming,
	Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang, Xunlei Pang

per_cpu_sum() is useful, and deserves to be exported.
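
A minimal usage sketch (the per-cpu variable below is hypothetical; only
per_cpu_sum() itself comes from this patch):

  static DEFINE_PER_CPU(unsigned long, nr_foo);

  static unsigned long total_foo(void)
  {
          /* sum the per-CPU values over all possible CPUs */
          return per_cpu_sum(nr_foo);
  }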

Tested-by: James Wang <jnwang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
---
 include/linux/percpu-defs.h   | 10 ++++++++++
 kernel/locking/percpu-rwsem.c | 10 ----------
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index dff7040..0e71b68 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -220,6 +220,16 @@
 	(void)__vpp_verify;						\
 } while (0)
 
+#define per_cpu_sum(var)						\
+({									\
+	typeof(var) __sum = 0;						\
+	int cpu;							\
+	compiletime_assert_atomic_type(__sum);				\
+	for_each_possible_cpu(cpu)					\
+		__sum += per_cpu(var, cpu);				\
+	__sum;								\
+})
+
 #ifdef CONFIG_SMP
 
 /*
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 70a32a5..0980e51 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -178,16 +178,6 @@ bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
 }
 EXPORT_SYMBOL_GPL(__percpu_down_read);
 
-#define per_cpu_sum(var)						\
-({									\
-	typeof(var) __sum = 0;						\
-	int cpu;							\
-	compiletime_assert_atomic_type(__sum);				\
-	for_each_possible_cpu(cpu)					\
-		__sum += per_cpu(var, cpu);				\
-	__sum;								\
-})
-
 /*
  * Return true if the modular sum of the sem->read_count per-CPU variable is
  * zero.  If this sum is zero, then it is stable due to the fact that if any
-- 
1.8.3.1



* [PATCH v3 4/4] mm/slub: Use percpu partial free counter
  2021-03-09 15:25 [PATCH v3 0/4] mm/slub: Fix count_partial() problem Xunlei Pang
                   ` (2 preceding siblings ...)
  2021-03-09 15:25 ` [PATCH v3 3/4] percpu: Export per_cpu_sum() Xunlei Pang
@ 2021-03-09 15:25 ` Xunlei Pang
  2021-03-15 18:49 ` [PATCH v3 0/4] mm/slub: Fix count_partial() problem Vlastimil Babka
  4 siblings, 0 replies; 15+ messages in thread
From: Xunlei Pang @ 2021-03-09 15:25 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, Vlastimil Babka, Roman Gushchin,
	Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming,
	Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang, Xunlei Pang

The only concern with introducing the partial counters is that
partial_free_objs may cause cacheline and atomic-operation contention
when the same SLUB cache sees concurrent __slab_free() calls.

This patch changes it to a percpu counter, and also rearranges the
counter fields within kmem_cache_node to avoid cacheline issues.

Tested-by: James Wang <jnwang@linux.alibaba.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
---
 mm/slab.h |  6 ++++--
 mm/slub.c | 30 +++++++++++++++++++++++-------
 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 817bfa0..c819597 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -546,16 +546,18 @@ struct kmem_cache_node {
 
 #ifdef CONFIG_SLUB
 	unsigned long nr_partial;
-	struct list_head partial;
 #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
-	atomic_long_t partial_free_objs;
 	unsigned long partial_total_objs;
 #endif
+	struct list_head partial;
 #ifdef CONFIG_SLUB_DEBUG
 	atomic_long_t nr_slabs;
 	atomic_long_t total_objects;
 	struct list_head full;
 #endif
+#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
+	unsigned long __percpu *partial_free_objs;
+#endif
 #endif
 
 };
diff --git a/mm/slub.c b/mm/slub.c
index 3f76b57..b6ec065 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1894,7 +1894,7 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
 static inline void
 __update_partial_free(struct kmem_cache_node *n, long delta)
 {
-	atomic_long_add(delta, &n->partial_free_objs);
+	this_cpu_add(*n->partial_free_objs, delta);
 }
 
 static inline void
@@ -2548,11 +2548,16 @@ static unsigned long partial_counter(struct kmem_cache_node *n,
 	unsigned long ret = 0;
 
 	if (item == PARTIAL_FREE) {
-		ret = atomic_long_read(&n->partial_free_objs);
+		ret = per_cpu_sum(*n->partial_free_objs);
+		if ((long)ret < 0)
+			ret = 0;
 	} else if (item == PARTIAL_TOTAL) {
 		ret = n->partial_total_objs;
 	} else if (item == PARTIAL_INUSE) {
-		ret = n->partial_total_objs - atomic_long_read(&n->partial_free_objs);
+		ret = per_cpu_sum(*n->partial_free_objs);
+		if ((long)ret < 0)
+			ret = 0;
+		ret = n->partial_total_objs - ret;
 		if ((long)ret < 0)
 			ret = 0;
 	}
@@ -3552,14 +3557,16 @@ static inline int calculate_order(unsigned int size)
 	return -ENOSYS;
 }
 
-static void
+static int
 init_kmem_cache_node(struct kmem_cache_node *n)
 {
 	n->nr_partial = 0;
 	spin_lock_init(&n->list_lock);
 	INIT_LIST_HEAD(&n->partial);
 #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
-	atomic_long_set(&n->partial_free_objs, 0);
+	n->partial_free_objs = alloc_percpu(unsigned long);
+	if (!n->partial_free_objs)
+		return -ENOMEM;
 	n->partial_total_objs = 0;
 #endif
 #ifdef CONFIG_SLUB_DEBUG
@@ -3567,6 +3574,8 @@ static inline int calculate_order(unsigned int size)
 	atomic_long_set(&n->total_objects, 0);
 	INIT_LIST_HEAD(&n->full);
 #endif
+
+	return 0;
 }
 
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
@@ -3626,7 +3635,7 @@ static void early_kmem_cache_node_alloc(int node)
 	page->inuse = 1;
 	page->frozen = 0;
 	kmem_cache_node->node[node] = n;
-	init_kmem_cache_node(n);
+	BUG_ON(init_kmem_cache_node(n) < 0);
 	inc_slabs_node(kmem_cache_node, node, page->objects);
 
 	/*
@@ -3644,6 +3653,9 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 
 	for_each_kmem_cache_node(s, node, n) {
 		s->node[node] = NULL;
+#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
+		free_percpu(n->partial_free_objs);
+#endif
 		kmem_cache_free(kmem_cache_node, n);
 	}
 }
@@ -3674,7 +3686,11 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
 			return 0;
 		}
 
-		init_kmem_cache_node(n);
+		if (init_kmem_cache_node(n) < 0) {
+			free_kmem_cache_nodes(s);
+			return 0;
+		}
+
 		s->node[node] = n;
 	}
 	return 1;
-- 
1.8.3.1



* Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem
  2021-03-09 15:25 [PATCH v3 0/4] mm/slub: Fix count_partial() problem Xunlei Pang
                   ` (3 preceding siblings ...)
  2021-03-09 15:25 ` [PATCH v3 4/4] mm/slub: Use percpu partial free counter Xunlei Pang
@ 2021-03-15 18:49 ` Vlastimil Babka
  2021-03-15 19:05   ` Roman Gushchin
  2021-03-16 10:42   ` Xunlei Pang
  4 siblings, 2 replies; 15+ messages in thread
From: Vlastimil Babka @ 2021-03-15 18:49 UTC (permalink / raw)
  To: Xunlei Pang, Christoph Lameter, Pekka Enberg, Roman Gushchin,
	Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming,
	Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang, Thomas Gleixner

On 3/9/21 4:25 PM, Xunlei Pang wrote:
> count_partial() can hold n->list_lock spinlock for quite long, which
> makes much trouble to the system. This series eliminate this problem.

Before I check the details, I have two high-level comments:

- patch 1 introduces some counting scheme that patch 4 then changes, could we do
this in one step to avoid the churn?

- the series addresses the concern that spinlock is being held, but doesn't
address the fact that counting partial per-node slabs is not nearly enough if we
want accurate <active_objs> in /proc/slabinfo because there are also percpu
slabs and per-cpu partial slabs, where we don't track the free objects at all.
So after this series while the readers of /proc/slabinfo won't block the
spinlock, they will get the same garbage data as before. So Christoph is not
wrong to say that we can just report active_objs == num_objs and it won't
actually break any ABI.
At the same time somebody might actually want accurate object statistics at the
expense of peak performance, and it would be nice to give them such option in
SLUB. Right now we don't provide this accuracy even with CONFIG_SLUB_STATS,
although that option provides many additional tuning stats, with additional
overhead.
So my proposal would be a new config for "accurate active objects" (or just tie
it to CONFIG_SLUB_DEBUG?) that would extend the approach of percpu counters in
patch 4 to all alloc/free, so that it includes percpu slabs. Without this config
enabled, let's just report active_objs == num_objs.
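
Roughly something like this (just a sketch, the field and helper names are
made up):

  /* hypothetical per-cache counter of free objects */
  unsigned long __percpu *free_objects;

  static inline void acct_alloc(struct kmem_cache *s)
  {
          this_cpu_dec(*s->free_objects);
  }

  static inline void acct_free(struct kmem_cache *s)
  {
          this_cpu_inc(*s->free_objects);
  }

so that slabinfo could report num_objs - per_cpu_sum(*s->free_objects) as
active_objs when the config is enabled, and active_objs == num_objs
otherwise.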

Vlastimil

> v1->v2:
> - Improved changelog and variable naming for PATCH 1~2.
> - PATCH3 adds per-cpu counter to avoid performance regression
>   in concurrent __slab_free().
> 
> v2->v3:
> - Changed "page->inuse" to the safe "new.inuse", etc.
> - Used CONFIG_SLUB_DEBUG and CONFIG_SYSFS condition for new counters.
> - atomic_long_t -> unsigned long
> 
> [Testing]
> There seems might be a little performance impact under extreme
> __slab_free() concurrent calls according to my tests.
> 
> On my 32-cpu 2-socket physical machine:
> Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
> 
> 1) perf stat --null --repeat 10 -- hackbench 20 thread 20000
> 
> == original, no patched
> Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> 
>       24.536050899 seconds time elapsed                                          ( +-  0.24% )
> 
> 
> Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> 
>       24.588049142 seconds time elapsed                                          ( +-  0.35% )
> 
> 
> == patched with patch1~4
> Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> 
>       24.670892273 seconds time elapsed                                          ( +-  0.29% )
> 
> 
> Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> 
>       24.746755689 seconds time elapsed                                          ( +-  0.21% )
> 
> 
> 2) perf stat --null --repeat 10 -- hackbench 32 thread 20000
> 
> == original, no patched
>  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> 
>       39.784911855 seconds time elapsed                                          ( +-  0.14% )
> 
>  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> 
>       39.868687608 seconds time elapsed                                          ( +-  0.19% )
> 
> == patched with patch1~4
>  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> 
>       39.681273015 seconds time elapsed                                          ( +-  0.21% )
> 
>  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> 
>       39.681238459 seconds time elapsed                                          ( +-  0.09% )
> 
> 
> Xunlei Pang (4):
>   mm/slub: Introduce two counters for partial objects
>   mm/slub: Get rid of count_partial()
>   percpu: Export per_cpu_sum()
>   mm/slub: Use percpu partial free counter
> 
>  include/linux/percpu-defs.h   |  10 ++++
>  kernel/locking/percpu-rwsem.c |  10 ----
>  mm/slab.h                     |   4 ++
>  mm/slub.c                     | 120 +++++++++++++++++++++++++++++-------------
>  4 files changed, 97 insertions(+), 47 deletions(-)
> 



* Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem
  2021-03-15 18:49 ` [PATCH v3 0/4] mm/slub: Fix count_partial() problem Vlastimil Babka
@ 2021-03-15 19:05   ` Roman Gushchin
  2021-03-15 19:22       ` Yang Shi
  2021-03-16 10:42   ` Xunlei Pang
  1 sibling, 1 reply; 15+ messages in thread
From: Roman Gushchin @ 2021-03-15 19:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Xunlei Pang, Christoph Lameter, Pekka Enberg,
	Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming,
	Andrew Morton, linux-kernel, linux-mm, Wen Yang, James Wang,
	Thomas Gleixner


On Mon, Mar 15, 2021 at 07:49:57PM +0100, Vlastimil Babka wrote:
> On 3/9/21 4:25 PM, Xunlei Pang wrote:
> > count_partial() can hold n->list_lock spinlock for quite long, which
> > makes much trouble to the system. This series eliminate this problem.
> 
> Before I check the details, I have two high-level comments:
> 
> - patch 1 introduces some counting scheme that patch 4 then changes, could we do
> this in one step to avoid the churn?
> 
> - the series addresses the concern that spinlock is being held, but doesn't
> address the fact that counting partial per-node slabs is not nearly enough if we
> want accurate <active_objs> in /proc/slabinfo because there are also percpu
> slabs and per-cpu partial slabs, where we don't track the free objects at all.
> So after this series while the readers of /proc/slabinfo won't block the
> spinlock, they will get the same garbage data as before. So Christoph is not
> wrong to say that we can just report active_objs == num_objs and it won't
> actually break any ABI.
> At the same time somebody might actually want accurate object statistics at the
> expense of peak performance, and it would be nice to give them such option in
> SLUB. Right now we don't provide this accuracy even with CONFIG_SLUB_STATS,
> although that option provides many additional tuning stats, with additional
> overhead.
> So my proposal would be a new config for "accurate active objects" (or just tie
> it to CONFIG_SLUB_DEBUG?) that would extend the approach of percpu counters in
> patch 4 to all alloc/free, so that it includes percpu slabs. Without this config
> enabled, let's just report active_objs == num_objs.

It sounds really good to me! The only thing is that I'd avoid introducing
a new option and would use CONFIG_SLUB_STATS instead.

It seems like CONFIG_SLUB_DEBUG is a more popular option than
CONFIG_SLUB_STATS: CONFIG_SLUB_DEBUG is enabled on my Fedora workstation,
while CONFIG_SLUB_STATS is off. I doubt an average user needs this data,
so I'd go with CONFIG_SLUB_STATS.

Thanks!

> 
> Vlastimil
> 
> > v1->v2:
> > - Improved changelog and variable naming for PATCH 1~2.
> > - PATCH3 adds per-cpu counter to avoid performance regression
> >   in concurrent __slab_free().
> > 
> > v2->v3:
> > - Changed "page->inuse" to the safe "new.inuse", etc.
> > - Used CONFIG_SLUB_DEBUG and CONFIG_SYSFS condition for new counters.
> > - atomic_long_t -> unsigned long
> > 
> > [Testing]
> > There seems might be a little performance impact under extreme
> > __slab_free() concurrent calls according to my tests.
> > 
> > On my 32-cpu 2-socket physical machine:
> > Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
> > 
> > 1) perf stat --null --repeat 10 -- hackbench 20 thread 20000
> > 
> > == original, no patched
> > Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> > 
> >       24.536050899 seconds time elapsed                                          ( +-  0.24% )
> > 
> > 
> > Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> > 
> >       24.588049142 seconds time elapsed                                          ( +-  0.35% )
> > 
> > 
> > == patched with patch1~4
> > Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> > 
> >       24.670892273 seconds time elapsed                                          ( +-  0.29% )
> > 
> > 
> > Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> > 
> >       24.746755689 seconds time elapsed                                          ( +-  0.21% )
> > 
> > 
> > 2) perf stat --null --repeat 10 -- hackbench 32 thread 20000
> > 
> > == original, no patched
> >  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> > 
> >       39.784911855 seconds time elapsed                                          ( +-  0.14% )
> > 
> >  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> > 
> >       39.868687608 seconds time elapsed                                          ( +-  0.19% )
> > 
> > == patched with patch1~4
> >  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> > 
> >       39.681273015 seconds time elapsed                                          ( +-  0.21% )
> > 
> >  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> > 
> >       39.681238459 seconds time elapsed                                          ( +-  0.09% )
> > 
> > 
> > Xunlei Pang (4):
> >   mm/slub: Introduce two counters for partial objects
> >   mm/slub: Get rid of count_partial()
> >   percpu: Export per_cpu_sum()
> >   mm/slub: Use percpu partial free counter
> > 
> >  include/linux/percpu-defs.h   |  10 ++++
> >  kernel/locking/percpu-rwsem.c |  10 ----
> >  mm/slab.h                     |   4 ++
> >  mm/slub.c                     | 120 +++++++++++++++++++++++++++++-------------
> >  4 files changed, 97 insertions(+), 47 deletions(-)
> > 
> 


* Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem
  2021-03-15 19:05   ` Roman Gushchin
@ 2021-03-15 19:22       ` Yang Shi
  0 siblings, 0 replies; 15+ messages in thread
From: Yang Shi @ 2021-03-15 19:22 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vlastimil Babka, Xunlei Pang, Christoph Lameter, Pekka Enberg,
	Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming,
	Andrew Morton, Linux Kernel Mailing List, Linux MM, Wen Yang,
	James Wang, Thomas Gleixner

On Mon, Mar 15, 2021 at 12:15 PM Roman Gushchin <guro@fb.com> wrote:
>
>
> On Mon, Mar 15, 2021 at 07:49:57PM +0100, Vlastimil Babka wrote:
> > On 3/9/21 4:25 PM, Xunlei Pang wrote:
> > > count_partial() can hold n->list_lock spinlock for quite long, which
> > > makes much trouble to the system. This series eliminate this problem.
> >
> > Before I check the details, I have two high-level comments:
> >
> > - patch 1 introduces some counting scheme that patch 4 then changes, could we do
> > this in one step to avoid the churn?
> >
> > - the series addresses the concern that spinlock is being held, but doesn't
> > address the fact that counting partial per-node slabs is not nearly enough if we
> > want accurate <active_objs> in /proc/slabinfo because there are also percpu
> > slabs and per-cpu partial slabs, where we don't track the free objects at all.
> > So after this series while the readers of /proc/slabinfo won't block the
> > spinlock, they will get the same garbage data as before. So Christoph is not
> > wrong to say that we can just report active_objs == num_objs and it won't
> > actually break any ABI.
> > At the same time somebody might actually want accurate object statistics at the
> > expense of peak performance, and it would be nice to give them such option in
> > SLUB. Right now we don't provide this accuracy even with CONFIG_SLUB_STATS,
> > although that option provides many additional tuning stats, with additional
> > overhead.
> > So my proposal would be a new config for "accurate active objects" (or just tie
> > it to CONFIG_SLUB_DEBUG?) that would extend the approach of percpu counters in
> > patch 4 to all alloc/free, so that it includes percpu slabs. Without this config
> > enabled, let's just report active_objs == num_objs.
>
> It sounds really good to me! The only thing, I'd avoid introducing a new option
> and use CONFIG_SLUB_STATS instead.
>
> It seems like CONFIG_SLUB_DEBUG is a more popular option than CONFIG_SLUB_STATS.
> CONFIG_SLUB_DEBUG is enabled on my Fedora workstation, CONFIG_SLUB_STATS is off.
> I doubt an average user needs this data, so I'd go with CONFIG_SLUB_STATS.

I think CONFIG_SLUB_DEBUG is enabled by default on most distros since
it is not supposed to incur too much overhead unless specific debugging
(e.g. red_zone) is turned on on demand.

>
> Thanks!
>
> >
> > Vlastimil
> >
> > > v1->v2:
> > > - Improved changelog and variable naming for PATCH 1~2.
> > > - PATCH3 adds per-cpu counter to avoid performance regression
> > >   in concurrent __slab_free().
> > >
> > > v2->v3:
> > > - Changed "page->inuse" to the safe "new.inuse", etc.
> > > - Used CONFIG_SLUB_DEBUG and CONFIG_SYSFS condition for new counters.
> > > - atomic_long_t -> unsigned long
> > >
> > > [Testing]
> > > There seems might be a little performance impact under extreme
> > > __slab_free() concurrent calls according to my tests.
> > >
> > > On my 32-cpu 2-socket physical machine:
> > > Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
> > >
> > > 1) perf stat --null --repeat 10 -- hackbench 20 thread 20000
> > >
> > > == original, no patched
> > > Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> > >
> > >       24.536050899 seconds time elapsed                                          ( +-  0.24% )
> > >
> > >
> > > Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> > >
> > >       24.588049142 seconds time elapsed                                          ( +-  0.35% )
> > >
> > >
> > > == patched with patch1~4
> > > Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> > >
> > >       24.670892273 seconds time elapsed                                          ( +-  0.29% )
> > >
> > >
> > > Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> > >
> > >       24.746755689 seconds time elapsed                                          ( +-  0.21% )
> > >
> > >
> > > 2) perf stat --null --repeat 10 -- hackbench 32 thread 20000
> > >
> > > == original, no patched
> > >  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> > >
> > >       39.784911855 seconds time elapsed                                          ( +-  0.14% )
> > >
> > >  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> > >
> > >       39.868687608 seconds time elapsed                                          ( +-  0.19% )
> > >
> > > == patched with patch1~4
> > >  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> > >
> > >       39.681273015 seconds time elapsed                                          ( +-  0.21% )
> > >
> > >  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> > >
> > >       39.681238459 seconds time elapsed                                          ( +-  0.09% )
> > >
> > >
> > > Xunlei Pang (4):
> > >   mm/slub: Introduce two counters for partial objects
> > >   mm/slub: Get rid of count_partial()
> > >   percpu: Export per_cpu_sum()
> > >   mm/slub: Use percpu partial free counter
> > >
> > >  include/linux/percpu-defs.h   |  10 ++++
> > >  kernel/locking/percpu-rwsem.c |  10 ----
> > >  mm/slab.h                     |   4 ++
> > >  mm/slub.c                     | 120 +++++++++++++++++++++++++++++-------------
> > >  4 files changed, 97 insertions(+), 47 deletions(-)
> > >
> >
>


* Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem
@ 2021-03-15 19:22       ` Yang Shi
  0 siblings, 0 replies; 15+ messages in thread
From: Yang Shi @ 2021-03-15 19:22 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vlastimil Babka, Xunlei Pang, Christoph Lameter, Pekka Enberg,
	Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming,
	Andrew Morton, Linux Kernel Mailing List, Linux MM, Wen Yang,
	James Wang, Thomas Gleixner

On Mon, Mar 15, 2021 at 12:15 PM Roman Gushchin <guro@fb.com> wrote:
>
>
> On Mon, Mar 15, 2021 at 07:49:57PM +0100, Vlastimil Babka wrote:
> > On 3/9/21 4:25 PM, Xunlei Pang wrote:
> > > count_partial() can hold n->list_lock spinlock for quite long, which
> > > makes much trouble to the system. This series eliminate this problem.
> >
> > Before I check the details, I have two high-level comments:
> >
> > - patch 1 introduces some counting scheme that patch 4 then changes, could we do
> > this in one step to avoid the churn?
> >
> > - the series addresses the concern that spinlock is being held, but doesn't
> > address the fact that counting partial per-node slabs is not nearly enough if we
> > want accurate <active_objs> in /proc/slabinfo because there are also percpu
> > slabs and per-cpu partial slabs, where we don't track the free objects at all.
> > So after this series while the readers of /proc/slabinfo won't block the
> > spinlock, they will get the same garbage data as before. So Christoph is not
> > wrong to say that we can just report active_objs == num_objs and it won't
> > actually break any ABI.
> > At the same time somebody might actually want accurate object statistics at the
> > expense of peak performance, and it would be nice to give them such option in
> > SLUB. Right now we don't provide this accuracy even with CONFIG_SLUB_STATS,
> > although that option provides many additional tuning stats, with additional
> > overhead.
> > So my proposal would be a new config for "accurate active objects" (or just tie
> > it to CONFIG_SLUB_DEBUG?) that would extend the approach of percpu counters in
> > patch 4 to all alloc/free, so that it includes percpu slabs. Without this config
> > enabled, let's just report active_objs == num_objs.
>
> It sounds really good to me! The only thing, I'd avoid introducing a new option
> and use CONFIG_SLUB_STATS instead.
>
> It seems like CONFIG_SLUB_DEBUG is a more popular option than CONFIG_SLUB_STATS.
> CONFIG_SLUB_DEBUG is enabled on my Fedora workstation, CONFIG_SLUB_STATS is off.
> I doubt an average user needs this data, so I'd go with CONFIG_SLUB_STATS.

I think CONFIG_SLUB_DEBUG is enabled by default on most distros since
it is not supposed to incur too much overhead unless specific debugging
(e.g. red_zone) is turned on on demand.

>
> Thanks!
>
> >
> > Vlastimil
> >
> > > v1->v2:
> > > - Improved changelog and variable naming for PATCH 1~2.
> > > - PATCH3 adds per-cpu counter to avoid performance regression
> > >   in concurrent __slab_free().
> > >
> > > v2->v3:
> > > - Changed "page->inuse" to the safe "new.inuse", etc.
> > > - Used CONFIG_SLUB_DEBUG and CONFIG_SYSFS condition for new counters.
> > > - atomic_long_t -> unsigned long
> > >
> > > [Testing]
> > > There seems might be a little performance impact under extreme
> > > __slab_free() concurrent calls according to my tests.
> > >
> > > On my 32-cpu 2-socket physical machine:
> > > Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
> > >
> > > 1) perf stat --null --repeat 10 -- hackbench 20 thread 20000
> > >
> > > == original, no patched
> > > Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> > >
> > >       24.536050899 seconds time elapsed                                          ( +-  0.24% )
> > >
> > >
> > > Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> > >
> > >       24.588049142 seconds time elapsed                                          ( +-  0.35% )
> > >
> > >
> > > == patched with patch1~4
> > > Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> > >
> > >       24.670892273 seconds time elapsed                                          ( +-  0.29% )
> > >
> > >
> > > Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
> > >
> > >       24.746755689 seconds time elapsed                                          ( +-  0.21% )
> > >
> > >
> > > 2) perf stat --null --repeat 10 -- hackbench 32 thread 20000
> > >
> > > == original, no patched
> > >  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> > >
> > >       39.784911855 seconds time elapsed                                          ( +-  0.14% )
> > >
> > >  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> > >
> > >       39.868687608 seconds time elapsed                                          ( +-  0.19% )
> > >
> > > == patched with patch1~4
> > >  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> > >
> > >       39.681273015 seconds time elapsed                                          ( +-  0.21% )
> > >
> > >  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
> > >
> > >       39.681238459 seconds time elapsed                                          ( +-  0.09% )
> > >
> > >
> > > Xunlei Pang (4):
> > >   mm/slub: Introduce two counters for partial objects
> > >   mm/slub: Get rid of count_partial()
> > >   percpu: Export per_cpu_sum()
> > >   mm/slub: Use percpu partial free counter
> > >
> > >  include/linux/percpu-defs.h   |  10 ++++
> > >  kernel/locking/percpu-rwsem.c |  10 ----
> > >  mm/slab.h                     |   4 ++
> > >  mm/slub.c                     | 120 +++++++++++++++++++++++++++++-------------
> > >  4 files changed, 97 insertions(+), 47 deletions(-)
> > >
> >
>



* Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem
  2021-03-15 19:22       ` Yang Shi
@ 2021-03-16 10:07         ` Christoph Lameter
  -1 siblings, 0 replies; 15+ messages in thread
From: Christoph Lameter @ 2021-03-16 10:07 UTC (permalink / raw)
  To: Yang Shi
  Cc: Roman Gushchin, Vlastimil Babka, Xunlei Pang, Pekka Enberg,
	Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming,
	Andrew Morton, Linux Kernel Mailing List, Linux MM, Wen Yang,
	James Wang, Thomas Gleixner

On Mon, 15 Mar 2021, Yang Shi wrote:

> > It seems like CONFIG_SLUB_DEBUG is a more popular option than CONFIG_SLUB_STATS.
> > CONFIG_SLUB_DEBUG is enabled on my Fedora workstation, CONFIG_SLUB_STATS is off.
> > I doubt an average user needs this data, so I'd go with CONFIG_SLUB_STATS.
>
> I think CONFIG_SLUB_DEBUG is enabled by default on most distros since
> it is supposed not incur too much overhead unless specific debug (i.e.
> red_zone) is turned on on demand.

Correct. CONFIG_SLUB_DEBUG includes the code, so the debugging can be
enabled on distro kernels with a kernel command-line option. That way you
don't have to recompile the kernel to find weird memory corruption issues
from strange device drivers.

Somehow my email address dropped off this thread.



* Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem
@ 2021-03-16 10:07         ` Christoph Lameter
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Lameter @ 2021-03-16 10:07 UTC (permalink / raw)
  To: Yang Shi
  Cc: Roman Gushchin, Vlastimil Babka, Xunlei Pang, Pekka Enberg,
	Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming,
	Andrew Morton, Linux Kernel Mailing List, Linux MM, Wen Yang,
	James Wang, Thomas Gleixner

On Mon, 15 Mar 2021, Yang Shi wrote:

> > It seems like CONFIG_SLUB_DEBUG is a more popular option than CONFIG_SLUB_STATS.
> > CONFIG_SLUB_DEBUG is enabled on my Fedora workstation, CONFIG_SLUB_STATS is off.
> > I doubt an average user needs this data, so I'd go with CONFIG_SLUB_STATS.
>
> I think CONFIG_SLUB_DEBUG is enabled by default on most distros since
> it is supposed not incur too much overhead unless specific debug (i.e.
> red_zone) is turned on on demand.

Correct. CONFIG_SLUB_DEBUG includes the code, so the debugging can be
enabled on distro kernels with a kernel command-line option. That way you
don't have to recompile the kernel to find weird memory corruption issues
from strange device drivers.

Somehow my email address dropped off this thread.




* Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem
  2021-03-16 10:07         ` Christoph Lameter
  (?)
@ 2021-03-16 10:32         ` Vlastimil Babka
  -1 siblings, 0 replies; 15+ messages in thread
From: Vlastimil Babka @ 2021-03-16 10:32 UTC (permalink / raw)
  To: Christoph Lameter, Yang Shi
  Cc: Roman Gushchin, Xunlei Pang, Pekka Enberg, Konstantin Khlebnikov,
	David Rientjes, Matthew Wilcox, Shu Ming, Andrew Morton,
	Linux Kernel Mailing List, Linux MM, Wen Yang, James Wang,
	Thomas Gleixner

On 3/16/21 11:07 AM, Christoph Lameter wrote:
> On Mon, 15 Mar 2021, Yang Shi wrote:
> 
>> > It seems like CONFIG_SLUB_DEBUG is a more popular option than CONFIG_SLUB_STATS.
>> > CONFIG_SLUB_DEBUG is enabled on my Fedora workstation, CONFIG_SLUB_STATS is off.
>> > I doubt an average user needs this data, so I'd go with CONFIG_SLUB_STATS.

Hm I can imagine that (after due performance testing) we would consider having
accurate slabinfo in our distro kernel, just as we have CONFIG_SLUB_DEBUG but
not the full stats.

>> I think CONFIG_SLUB_DEBUG is enabled by default on most distros since
>> it is supposed not incur too much overhead unless specific debug (i.e.
>> red_zone) is turned on on demand.
> 
> Correct. CONFIG_SLUB_DEBUG includes the code so the debugging can be
> enabled on Distro kernels with a kernel command line option. So you dont
> have to recompile the kernel to find weird memory corruption issues from
> strange device drivers.
> 
> Somehow my email address dropped off this thread.

Hm, I see cl@linux.com in all e-mails of the thread, but now you have
replaced it with cl@gentwo.de?



* Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem
  2021-03-15 18:49 ` [PATCH v3 0/4] mm/slub: Fix count_partial() problem Vlastimil Babka
  2021-03-15 19:05   ` Roman Gushchin
@ 2021-03-16 10:42   ` Xunlei Pang
  2021-03-16 11:02     ` Vlastimil Babka
  1 sibling, 1 reply; 15+ messages in thread
From: Xunlei Pang @ 2021-03-16 10:42 UTC (permalink / raw)
  To: Vlastimil Babka, Xunlei Pang, Christoph Lameter, Pekka Enberg,
	Roman Gushchin, Konstantin Khlebnikov, David Rientjes,
	Matthew Wilcox, Shu Ming, Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang, Thomas Gleixner

On 3/16/21 2:49 AM, Vlastimil Babka wrote:
> On 3/9/21 4:25 PM, Xunlei Pang wrote:
>> count_partial() can hold n->list_lock spinlock for quite long, which
>> makes much trouble to the system. This series eliminate this problem.
> 
> Before I check the details, I have two high-level comments:
> 
> - patch 1 introduces some counting scheme that patch 4 then changes, could we do
> this in one step to avoid the churn?
> 
> - the series addresses the concern that spinlock is being held, but doesn't
> address the fact that counting partial per-node slabs is not nearly enough if we
> want accurate <active_objs> in /proc/slabinfo because there are also percpu
> slabs and per-cpu partial slabs, where we don't track the free objects at all.
> So after this series while the readers of /proc/slabinfo won't block the
> spinlock, they will get the same garbage data as before. So Christoph is not
> wrong to say that we can just report active_objs == num_objs and it won't
> actually break any ABI.

If the maintainers don't mind this inaccuracy (and I also doubt its
importance), then it becomes easy. In case some people really do care,
introducing an extra config option (default off) for it would be a good
choice.

> At the same time somebody might actually want accurate object statistics at the
> expense of peak performance, and it would be nice to give them such option in
> SLUB. Right now we don't provide this accuracy even with CONFIG_SLUB_STATS,
> although that option provides many additional tuning stats, with additional
> overhead.
> So my proposal would be a new config for "accurate active objects" (or just tie
> it to CONFIG_SLUB_DEBUG?) that would extend the approach of percpu counters in
> patch 4 to all alloc/free, so that it includes percpu slabs. Without this config
> enabled, let's just report active_objs == num_objs.
For percpu slabs, the numbers can be retrieved from the existing
slub_percpu_partial()->pobjects, so it looks like no extra work is needed.
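
e.g. something along these lines (rough sketch only):

  for_each_online_cpu(cpu) {
          struct page *page;

          page = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
          if (page)
                  objects += page->pobjects;
  }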

> 
> Vlastimil
> 
>> v1->v2:
>> - Improved changelog and variable naming for PATCH 1~2.
>> - PATCH3 adds per-cpu counter to avoid performance regression
>>   in concurrent __slab_free().
>>
>> v2->v3:
>> - Changed "page->inuse" to the safe "new.inuse", etc.
>> - Used CONFIG_SLUB_DEBUG and CONFIG_SYSFS condition for new counters.
>> - atomic_long_t -> unsigned long
>>
>> [Testing]
>> There seems might be a little performance impact under extreme
>> __slab_free() concurrent calls according to my tests.
>>
>> On my 32-cpu 2-socket physical machine:
>> Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
>>
>> 1) perf stat --null --repeat 10 -- hackbench 20 thread 20000
>>
>> == original, no patched
>> Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
>>
>>       24.536050899 seconds time elapsed                                          ( +-  0.24% )
>>
>>
>> Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
>>
>>       24.588049142 seconds time elapsed                                          ( +-  0.35% )
>>
>>
>> == patched with patch1~4
>> Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
>>
>>       24.670892273 seconds time elapsed                                          ( +-  0.29% )
>>
>>
>> Performance counter stats for 'hackbench 20 thread 20000' (10 runs):
>>
>>       24.746755689 seconds time elapsed                                          ( +-  0.21% )
>>
>>
>> 2) perf stat --null --repeat 10 -- hackbench 32 thread 20000
>>
>> == original, no patched
>>  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
>>
>>       39.784911855 seconds time elapsed                                          ( +-  0.14% )
>>
>>  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
>>
>>       39.868687608 seconds time elapsed                                          ( +-  0.19% )
>>
>> == patched with patch1~4
>>  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
>>
>>       39.681273015 seconds time elapsed                                          ( +-  0.21% )
>>
>>  Performance counter stats for 'hackbench 32 thread 20000' (10 runs):
>>
>>       39.681238459 seconds time elapsed                                          ( +-  0.09% )
>>
>>
>> Xunlei Pang (4):
>>   mm/slub: Introduce two counters for partial objects
>>   mm/slub: Get rid of count_partial()
>>   percpu: Export per_cpu_sum()
>>   mm/slub: Use percpu partial free counter
>>
>>  include/linux/percpu-defs.h   |  10 ++++
>>  kernel/locking/percpu-rwsem.c |  10 ----
>>  mm/slab.h                     |   4 ++
>>  mm/slub.c                     | 120 +++++++++++++++++++++++++++++-------------
>>  4 files changed, 97 insertions(+), 47 deletions(-)
>>


* Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem
  2021-03-16 10:42   ` Xunlei Pang
@ 2021-03-16 11:02     ` Vlastimil Babka
  2021-03-16 11:49       ` Xunlei Pang
  0 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2021-03-16 11:02 UTC (permalink / raw)
  To: xlpang, Christoph Lameter, Pekka Enberg, Roman Gushchin,
	Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming,
	Andrew Morton, Christoph Lameter
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang, Thomas Gleixner

On 3/16/21 11:42 AM, Xunlei Pang wrote:
> On 3/16/21 2:49 AM, Vlastimil Babka wrote:
>> On 3/9/21 4:25 PM, Xunlei Pang wrote:
>>> count_partial() can hold n->list_lock spinlock for quite long, which
>>> makes much trouble to the system. This series eliminate this problem.
>> 
>> Before I check the details, I have two high-level comments:
>> 
>> - patch 1 introduces some counting scheme that patch 4 then changes, could we do
>> this in one step to avoid the churn?
>> 
>> - the series addresses the concern that spinlock is being held, but doesn't
>> address the fact that counting partial per-node slabs is not nearly enough if we
>> want accurate <active_objs> in /proc/slabinfo because there are also percpu
>> slabs and per-cpu partial slabs, where we don't track the free objects at all.
>> So after this series while the readers of /proc/slabinfo won't block the
>> spinlock, they will get the same garbage data as before. So Christoph is not
>> wrong to say that we can just report active_objs == num_objs and it won't
>> actually break any ABI.
> 
> If maintainers don't mind this inaccuracy which I also doubt its
> importance, then it becomes easy. For fear that some people who really
> cares, introducing an extra config(default-off) for it would be a good
> option.

Great.

>> At the same time somebody might actually want accurate object statistics at the
>> expense of peak performance, and it would be nice to give them such option in
>> SLUB. Right now we don't provide this accuracy even with CONFIG_SLUB_STATS,
>> although that option provides many additional tuning stats, with additional
>> overhead.
>> So my proposal would be a new config for "accurate active objects" (or just tie
>> it to CONFIG_SLUB_DEBUG?) that would extend the approach of percpu counters in
>> patch 4 to all alloc/free, so that it includes percpu slabs. Without this config
>> enabled, let's just report active_objs == num_objs.
> For percpu slabs, the numbers can be retrieved from the existing
> slub_percpu_partial()->pobjects, looks no need extra work.

Hm, unfortunately it's not that simple; the number there is a snapshot
that can become wildly inaccurate afterwards.


* Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem
  2021-03-16 11:02     ` Vlastimil Babka
@ 2021-03-16 11:49       ` Xunlei Pang
  0 siblings, 0 replies; 15+ messages in thread
From: Xunlei Pang @ 2021-03-16 11:49 UTC (permalink / raw)
  To: Vlastimil Babka, xlpang, Christoph Lameter, Pekka Enberg,
	Roman Gushchin, Konstantin Khlebnikov, David Rientjes,
	Matthew Wilcox, Shu Ming, Andrew Morton, Christoph Lameter
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang, Thomas Gleixner

On 3/16/21 7:02 PM, Vlastimil Babka wrote:
> On 3/16/21 11:42 AM, Xunlei Pang wrote:
>> On 3/16/21 2:49 AM, Vlastimil Babka wrote:
>>> On 3/9/21 4:25 PM, Xunlei Pang wrote:
>>>> count_partial() can hold n->list_lock spinlock for quite long, which
>>>> makes much trouble to the system. This series eliminate this problem.
>>>
>>> Before I check the details, I have two high-level comments:
>>>
>>> - patch 1 introduces some counting scheme that patch 4 then changes, could we do
>>> this in one step to avoid the churn?
>>>
>>> - the series addresses the concern that spinlock is being held, but doesn't
>>> address the fact that counting partial per-node slabs is not nearly enough if we
>>> want accurate <active_objs> in /proc/slabinfo because there are also percpu
>>> slabs and per-cpu partial slabs, where we don't track the free objects at all.
>>> So after this series while the readers of /proc/slabinfo won't block the
>>> spinlock, they will get the same garbage data as before. So Christoph is not
>>> wrong to say that we can just report active_objs == num_objs and it won't
>>> actually break any ABI.
>>
>> If maintainers don't mind this inaccuracy which I also doubt its
>> importance, then it becomes easy. For fear that some people who really
>> cares, introducing an extra config(default-off) for it would be a good
>> option.
> 
> Great.
> 
>>> At the same time somebody might actually want accurate object statistics at the
>>> expense of peak performance, and it would be nice to give them such option in
>>> SLUB. Right now we don't provide this accuracy even with CONFIG_SLUB_STATS,
>>> although that option provides many additional tuning stats, with additional
>>> overhead.
>>> So my proposal would be a new config for "accurate active objects" (or just tie
>>> it to CONFIG_SLUB_DEBUG?) that would extend the approach of percpu counters in
>>> patch 4 to all alloc/free, so that it includes percpu slabs. Without this config
>>> enabled, let's just report active_objs == num_objs.
>> For percpu slabs, the numbers can be retrieved from the existing
>> slub_percpu_partial()->pobjects, looks no need extra work.
> 
> Hm, unfortunately it's not that simple, the number there is a snapshot that can
> become wildly inacurate afterwards.
> 

It's hard to make it absolutely accurate using percpu counters; the data
can change while you iterate over all the CPUs and total_objects. I can't
imagine a real-world use for that level of accuracy, not to mention the
percpu freelist cache. I think the sysfs slabs_cpu_partial attribute
should be enough for common debugging purposes.


end of thread

Thread overview: 15+ messages
2021-03-09 15:25 [PATCH v3 0/4] mm/slub: Fix count_partial() problem Xunlei Pang
2021-03-09 15:25 ` [PATCH v3 1/4] mm/slub: Introduce two counters for partial objects Xunlei Pang
2021-03-09 15:25 ` [PATCH v3 2/4] mm/slub: Get rid of count_partial() Xunlei Pang
2021-03-09 15:25 ` [PATCH v3 3/4] percpu: Export per_cpu_sum() Xunlei Pang
2021-03-09 15:25 ` [PATCH v3 4/4] mm/slub: Use percpu partial free counter Xunlei Pang
2021-03-15 18:49 ` [PATCH v3 0/4] mm/slub: Fix count_partial() problem Vlastimil Babka
2021-03-15 19:05   ` Roman Gushchin
2021-03-15 19:22     ` Yang Shi
2021-03-15 19:22       ` Yang Shi
2021-03-16 10:07       ` Christoph Lameter
2021-03-16 10:07         ` Christoph Lameter
2021-03-16 10:32         ` Vlastimil Babka
2021-03-16 10:42   ` Xunlei Pang
2021-03-16 11:02     ` Vlastimil Babka
2021-03-16 11:49       ` Xunlei Pang
