* [PATCH v4 0/3] mm/slub: Fix count_partial() problem
@ 2021-03-17  7:54 Xunlei Pang
  2021-03-17  7:54 ` [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects Xunlei Pang
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Xunlei Pang @ 2021-03-17  7:54 UTC (permalink / raw)
  To: Christoph Lameter, Christoph Lameter, Pekka Enberg,
	Vlastimil Babka, Roman Gushchin, Konstantin Khlebnikov,
	David Rientjes, Matthew Wilcox, Shu Ming, Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang, Xunlei Pang

count_partial() can hold the n->list_lock spinlock for quite a long time,
which causes a lot of trouble for the system. This series eliminates the
problem.
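
For reference, the count_partial() that patch 3 removes walks the whole
partial list with the node list_lock held and interrupts disabled,
roughly:

static unsigned long count_partial(struct kmem_cache_node *n,
					int (*get_count)(struct page *))
{
	unsigned long flags;
	unsigned long x = 0;
	struct page *page;

	/* held across the whole walk; with huge partial lists this is
	 * where the long irq-off sections come from */
	spin_lock_irqsave(&n->list_lock, flags);
	list_for_each_entry(page, &n->partial, slab_list)
		x += get_count(page);
	spin_unlock_irqrestore(&n->list_lock, flags);
	return x;
}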

v1->v2:
- Improved changelog and variable naming for PATCH 1~2.
- PATCH3 adds per-cpu counter to avoid performance regression
  in concurrent __slab_free().

v2->v3:
- Changed "page->inuse" to the safe "new.inuse", etc.
- Used CONFIG_SLUB_DEBUG and CONFIG_SYSFS condition for new counters.
- atomic_long_t -> unsigned long

v3->v4:
- Introduced a new CONFIG_SLUB_DEBUG_PARTIAL option so that accurate counting can be enabled for production use.
- Merged PATCH 4 into PATCH 1.

[Testing]
According to my tests, there may be a slight performance impact under
extreme concurrent __slab_free() calls.

On my 32-cpu 2-socket physical machine:
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz

1) perf stat --null --repeat 10 -- hackbench 20 thread 20000

== original, not patched
Performance counter stats for 'hackbench 20 thread 20000' (10 runs):

      24.536050899 seconds time elapsed                                          ( +-  0.24% )


Performance counter stats for 'hackbench 20 thread 20000' (10 runs):

      24.588049142 seconds time elapsed                                          ( +-  0.35% )


== patched with patches 1~4
Performance counter stats for 'hackbench 20 thread 20000' (10 runs):

      24.670892273 seconds time elapsed                                          ( +-  0.29% )


Performance counter stats for 'hackbench 20 thread 20000' (10 runs):

      24.746755689 seconds time elapsed                                          ( +-  0.21% )


2) perf stat --null --repeat 10 -- hackbench 32 thread 20000

== original, not patched
 Performance counter stats for 'hackbench 32 thread 20000' (10 runs):

      39.784911855 seconds time elapsed                                          ( +-  0.14% )

 Performance counter stats for 'hackbench 32 thread 20000' (10 runs):

      39.868687608 seconds time elapsed                                          ( +-  0.19% )

== patched with patches 1~4
 Performance counter stats for 'hackbench 32 thread 20000' (10 runs):

      39.681273015 seconds time elapsed                                          ( +-  0.21% )

 Performance counter stats for 'hackbench 32 thread 20000' (10 runs):

      39.681238459 seconds time elapsed                                          ( +-  0.09% )

Xunlei Pang (3):
  mm/slub: Introduce two counters for partial objects
  percpu: Export per_cpu_sum()
  mm/slub: Get rid of count_partial()

 include/linux/percpu-defs.h   |  10 ++++
 init/Kconfig                  |  13 +++++
 kernel/locking/percpu-rwsem.c |  10 ----
 mm/slab.h                     |   6 ++
 mm/slub.c                     | 129 +++++++++++++++++++++++++++++-------------
 5 files changed, 120 insertions(+), 48 deletions(-)

-- 
1.8.3.1




* [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects
  2021-03-17  7:54 [PATCH v4 0/3] mm/slub: Fix count_partial() problem Xunlei Pang
@ 2021-03-17  7:54 ` Xunlei Pang
  2021-03-17 18:45   ` Vlastimil Babka
  2021-03-18 12:18   ` Vlastimil Babka
  2021-03-17  7:54 ` [PATCH v4 2/3] percpu: Export per_cpu_sum() Xunlei Pang
  2021-03-17  7:54 ` [PATCH v4 3/3] mm/slub: Get rid of count_partial() Xunlei Pang
  2 siblings, 2 replies; 11+ messages in thread
From: Xunlei Pang @ 2021-03-17  7:54 UTC (permalink / raw)
  To: Christoph Lameter, Christoph Lameter, Pekka Enberg,
	Vlastimil Babka, Roman Gushchin, Konstantin Khlebnikov,
	David Rientjes, Matthew Wilcox, Shu Ming, Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang, Xunlei Pang

count_partial() iterates the partial page lists while holding the node
list_lock; with a large number of partial pages this takes a long time
and can cause a thundering herd effect on list_lock contention.

We have HSF RT (High-speed Service Framework Response-Time) monitors
whose RT figures fluctuated randomly, so we deployed a tool that detects
"irq off" and "preempt off" periods and dumps the culprit's calltrace.
It captured the list_lock being held for nearly 100ms with irqs off,
triggered by "ss"; this also caused network timeouts.

This patch introduces two counters to maintain the actual number
of partial objects dynamically instead of iterating the partial
page lists with list_lock held.

Two new counters are added to kmem_cache_node: partial_free_objs and
partial_total_objs. The main update operations happen under list_lock
in the slow path, so the performance impact is expected to be minimal
except for the __slab_free() path.

The only concern with introducing the partial counters is that
partial_free_objs may cause cacheline contention and false sharing when
concurrent __slab_free() calls hit the same SLUB cache, so it is defined
as a percpu counter and placed carefully.

As "Vlastimil Babka" and "Christoph Lameter" both suggested that we
don't care about the accurate partial objects, and in case that somebody
might actually want accurate object statistics at the expense of peak
performance, and it would be nice to give them such option in SLUB.

Thus a new CONFIG_SLUB_DEBUG_PARTIAL option is introduced to maintain
accurate partial object counts; it defaults to off to avoid unexpected
performance degradation.

When CONFIG_SLUB_DEBUG_PARTIAL is not selected, the sysfs and proc files
related to the partial list (/proc/slabinfo, sysfs objects_partial, etc.)
report zero usage, e.g.,
  the "active_objs" and "num_objs" fields of /proc/slabinfo become equal;
  "cat /sys/kernel/slab/<slab>/partial" displays "0".

Tested-by: James Wang <jnwang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
---
 init/Kconfig | 13 +++++++++++++
 mm/slab.h    |  6 ++++++
 mm/slub.c    | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 22946fe..686851b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1867,6 +1867,19 @@ config SLUB_DEBUG
 	  SLUB sysfs support. /sys/slab will not exist and there will be
 	  no support for cache validation etc.
 
+config SLUB_DEBUG_PARTIAL
+	default n
+	bool "Enable SLUB debugging for the node partial list" if EXPERT
+	depends on SLUB && SYSFS
+	help
+	  Maintain runtime counters for the node partial list debugging.
+	  When CONFIG_SLUB_DEBUG_PARTIAL is not selected, the sysfs or proc
+	  files related to the partial list assume zero data usage, e.g.,
+	  value of "active_objs" and "num_objs" of "/proc/slabinfo" equals.
+	  "cat /sys/kernel/slab/<slab>/partial" displays "0".
+
+	  It might have slight performance impact for production use.
+
 config COMPAT_BRK
 	bool "Disable heap randomization"
 	default y
diff --git a/mm/slab.h b/mm/slab.h
index 076582f..f47c811 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -546,12 +546,18 @@ struct kmem_cache_node {
 
 #ifdef CONFIG_SLUB
 	unsigned long nr_partial;
+#ifdef CONFIG_SLUB_DEBUG_PARTIAL
+	unsigned long partial_total_objs;
+#endif
 	struct list_head partial;
 #ifdef CONFIG_SLUB_DEBUG
 	atomic_long_t nr_slabs;
 	atomic_long_t total_objects;
 	struct list_head full;
 #endif
+#ifdef CONFIG_SLUB_DEBUG_PARTIAL
+	unsigned long __percpu *partial_free_objs ____cacheline_aligned_in_smp;
+#endif
 #endif
 
 };
diff --git a/mm/slub.c b/mm/slub.c
index e26c274..856aea4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1890,10 +1890,31 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
 /*
  * Management of partially allocated slabs.
  */
+#ifdef CONFIG_SLUB_DEBUG_PARTIAL
+static inline void
+__update_partial_free(struct kmem_cache_node *n, long delta)
+{
+	this_cpu_add(*n->partial_free_objs, delta);
+}
+
+static inline void
+__update_partial_total(struct kmem_cache_node *n, long delta)
+{
+	n->partial_total_objs += delta;
+}
+#else
+static inline void
+__update_partial_free(struct kmem_cache_node *n, long delta) { }
+
+static inline void
+__update_partial_total(struct kmem_cache_node *n, long delta) { }
+#endif
+
 static inline void
 __add_partial(struct kmem_cache_node *n, struct page *page, int tail)
 {
 	n->nr_partial++;
+	__update_partial_total(n, page->objects);
 	if (tail == DEACTIVATE_TO_TAIL)
 		list_add_tail(&page->slab_list, &n->partial);
 	else
@@ -1913,6 +1934,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
 	lockdep_assert_held(&n->list_lock);
 	list_del(&page->slab_list);
 	n->nr_partial--;
+	__update_partial_total(n, -page->objects);
 }
 
 /*
@@ -1957,6 +1979,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
 		return NULL;
 
 	remove_partial(n, page);
+	__update_partial_free(n, -*objects);
 	WARN_ON(!freelist);
 	return freelist;
 }
@@ -2286,8 +2309,11 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
 				"unfreezing slab"))
 		goto redo;
 
-	if (lock)
+	if (lock) {
+		if (m == M_PARTIAL)
+			__update_partial_free(n, new.objects - new.inuse);
 		spin_unlock(&n->list_lock);
+	}
 
 	if (m == M_PARTIAL)
 		stat(s, tail);
@@ -2353,6 +2379,7 @@ static void unfreeze_partials(struct kmem_cache *s,
 			discard_page = page;
 		} else {
 			add_partial(n, page, DEACTIVATE_TO_TAIL);
+			__update_partial_free(n, new.objects - new.inuse);
 			stat(s, FREE_ADD_PARTIAL);
 		}
 	}
@@ -3039,6 +3066,13 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		head, new.counters,
 		"__slab_free"));
 
+	if (!was_frozen && prior) {
+		if (n)
+			__update_partial_free(n, cnt);
+		else
+			__update_partial_free(get_node(s, page_to_nid(page)), cnt);
+	}
+
 	if (likely(!n)) {
 
 		if (likely(was_frozen)) {
@@ -3069,6 +3103,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 	if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) {
 		remove_full(s, n, page);
 		add_partial(n, page, DEACTIVATE_TO_TAIL);
+		__update_partial_free(n, cnt);
 		stat(s, FREE_ADD_PARTIAL);
 	}
 	spin_unlock_irqrestore(&n->list_lock, flags);
@@ -3080,6 +3115,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		 * Slab on the partial list.
 		 */
 		remove_partial(n, page);
+		__update_partial_free(n, -page->objects);
 		stat(s, FREE_REMOVE_PARTIAL);
 	} else {
 		/* Slab must be on the full list */
@@ -3514,17 +3550,25 @@ static inline int calculate_order(unsigned int size)
 	return -ENOSYS;
 }
 
-static void
+static int
 init_kmem_cache_node(struct kmem_cache_node *n)
 {
 	n->nr_partial = 0;
 	spin_lock_init(&n->list_lock);
 	INIT_LIST_HEAD(&n->partial);
+#ifdef CONFIG_SLUB_DEBUG_PARTIAL
+	n->partial_free_objs = alloc_percpu(unsigned long);
+	if (!n->partial_free_objs)
+		return -ENOMEM;
+	n->partial_total_objs = 0;
+#endif
 #ifdef CONFIG_SLUB_DEBUG
 	atomic_long_set(&n->nr_slabs, 0);
 	atomic_long_set(&n->total_objects, 0);
 	INIT_LIST_HEAD(&n->full);
 #endif
+
+	return 0;
 }
 
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
@@ -3584,7 +3628,7 @@ static void early_kmem_cache_node_alloc(int node)
 	page->inuse = 1;
 	page->frozen = 0;
 	kmem_cache_node->node[node] = n;
-	init_kmem_cache_node(n);
+	BUG_ON(init_kmem_cache_node(n) < 0);
 	inc_slabs_node(kmem_cache_node, node, page->objects);
 
 	/*
@@ -3592,6 +3636,7 @@ static void early_kmem_cache_node_alloc(int node)
 	 * initialized and there is no concurrent access.
 	 */
 	__add_partial(n, page, DEACTIVATE_TO_HEAD);
+	__update_partial_free(n, page->objects - page->inuse);
 }
 
 static void free_kmem_cache_nodes(struct kmem_cache *s)
@@ -3601,6 +3646,9 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 
 	for_each_kmem_cache_node(s, node, n) {
 		s->node[node] = NULL;
+#ifdef CONFIG_SLUB_DEBUG_PARTIAL
+		free_percpu(n->partial_free_objs);
+#endif
 		kmem_cache_free(kmem_cache_node, n);
 	}
 }
@@ -3631,7 +3679,11 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
 			return 0;
 		}
 
-		init_kmem_cache_node(n);
+		if (init_kmem_cache_node(n) < 0) {
+			free_kmem_cache_nodes(s);
+			return 0;
+		}
+
 		s->node[node] = n;
 	}
 	return 1;
@@ -3922,6 +3974,7 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
 	list_for_each_entry_safe(page, h, &n->partial, slab_list) {
 		if (!page->inuse) {
 			remove_partial(n, page);
+			__update_partial_free(n, -page->objects);
 			list_add(&page->slab_list, &discard);
 		} else {
 			list_slab_objects(s, page,
@@ -4263,6 +4316,8 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 			if (free == page->objects) {
 				list_move(&page->slab_list, &discard);
 				n->nr_partial--;
+				__update_partial_free(n, -free);
+				__update_partial_total(n, -free);
 			} else if (free <= SHRINK_PROMOTE_MAX)
 				list_move(&page->slab_list, promote + free - 1);
 		}
-- 
1.8.3.1




* [PATCH v4 2/3] percpu: Export per_cpu_sum()
  2021-03-17  7:54 [PATCH v4 0/3] mm/slub: Fix count_partial() problem Xunlei Pang
  2021-03-17  7:54 ` [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects Xunlei Pang
@ 2021-03-17  7:54 ` Xunlei Pang
  2021-03-17  7:54 ` [PATCH v4 3/3] mm/slub: Get rid of count_partial() Xunlei Pang
  2 siblings, 0 replies; 11+ messages in thread
From: Xunlei Pang @ 2021-03-17  7:54 UTC (permalink / raw)
  To: Christoph Lameter, Christoph Lameter, Pekka Enberg,
	Vlastimil Babka, Roman Gushchin, Konstantin Khlebnikov,
	David Rientjes, Matthew Wilcox, Shu Ming, Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang, Xunlei Pang

per_cpu_sum() is generally useful, so move it from percpu-rwsem.c to
include/linux/percpu-defs.h and make it available to other users.
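
A minimal usage sketch (mirroring how patch 3 sums n->partial_free_objs;
the variable names are illustrative only, error handling omitted):

	unsigned long __percpu *ctr = alloc_percpu(unsigned long);
	unsigned long total;

	this_cpu_inc(*ctr);		/* cheap per-cpu update, no lock */
	total = per_cpu_sum(*ctr);	/* sum across all possible CPUs */
	free_percpu(ctr);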

Tested-by: James Wang <jnwang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
---
 include/linux/percpu-defs.h   | 10 ++++++++++
 kernel/locking/percpu-rwsem.c | 10 ----------
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index dff7040..0e71b68 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -220,6 +220,16 @@
 	(void)__vpp_verify;						\
 } while (0)
 
+#define per_cpu_sum(var)						\
+({									\
+	typeof(var) __sum = 0;						\
+	int cpu;							\
+	compiletime_assert_atomic_type(__sum);				\
+	for_each_possible_cpu(cpu)					\
+		__sum += per_cpu(var, cpu);				\
+	__sum;								\
+})
+
 #ifdef CONFIG_SMP
 
 /*
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 70a32a5..0980e51 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -178,16 +178,6 @@ bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
 }
 EXPORT_SYMBOL_GPL(__percpu_down_read);
 
-#define per_cpu_sum(var)						\
-({									\
-	typeof(var) __sum = 0;						\
-	int cpu;							\
-	compiletime_assert_atomic_type(__sum);				\
-	for_each_possible_cpu(cpu)					\
-		__sum += per_cpu(var, cpu);				\
-	__sum;								\
-})
-
 /*
  * Return true if the modular sum of the sem->read_count per-CPU variable is
  * zero.  If this sum is zero, then it is stable due to the fact that if any
-- 
1.8.3.1




* [PATCH v4 3/3] mm/slub: Get rid of count_partial()
  2021-03-17  7:54 [PATCH v4 0/3] mm/slub: Fix count_partial() problem Xunlei Pang
  2021-03-17  7:54 ` [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects Xunlei Pang
  2021-03-17  7:54 ` [PATCH v4 2/3] percpu: Export per_cpu_sum() Xunlei Pang
@ 2021-03-17  7:54 ` Xunlei Pang
  2 siblings, 0 replies; 11+ messages in thread
From: Xunlei Pang @ 2021-03-17  7:54 UTC (permalink / raw)
  To: Christoph Lameter, Christoph Lameter, Pekka Enberg,
	Vlastimil Babka, Roman Gushchin, Konstantin Khlebnikov,
	David Rientjes, Matthew Wilcox, Shu Ming, Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang, Xunlei Pang

Now that the partial counters are in place, use them to get rid of
count_partial().

The partial counters are used to calculate accurate partial usage when
CONFIG_SLUB_DEBUG_PARTIAL is on; otherwise zero usage statistics are
simply assumed.

Tested-by: James Wang <jnwang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
---
 mm/slub.c | 64 +++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 31 insertions(+), 33 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 856aea4..9bff669 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2533,11 +2533,6 @@ static inline int node_match(struct page *page, int node)
 }
 
 #ifdef CONFIG_SLUB_DEBUG
-static int count_free(struct page *page)
-{
-	return page->objects - page->inuse;
-}
-
 static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 {
 	return atomic_long_read(&n->total_objects);
@@ -2545,18 +2540,33 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 #endif /* CONFIG_SLUB_DEBUG */
 
 #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
-static unsigned long count_partial(struct kmem_cache_node *n,
-					int (*get_count)(struct page *))
+enum partial_item { PARTIAL_FREE, PARTIAL_INUSE, PARTIAL_TOTAL, PARTIAL_SLAB };
+
+static unsigned long partial_counter(struct kmem_cache_node *n,
+		enum partial_item item)
 {
-	unsigned long flags;
-	unsigned long x = 0;
-	struct page *page;
+	unsigned long ret = 0;
 
-	spin_lock_irqsave(&n->list_lock, flags);
-	list_for_each_entry(page, &n->partial, slab_list)
-		x += get_count(page);
-	spin_unlock_irqrestore(&n->list_lock, flags);
-	return x;
+#ifdef CONFIG_SLUB_DEBUG_PARTIAL
+	if (item == PARTIAL_FREE) {
+		ret = per_cpu_sum(*n->partial_free_objs);
+		if ((long)ret < 0)
+			ret = 0;
+	} else if (item == PARTIAL_TOTAL) {
+		ret = n->partial_total_objs;
+	} else if (item == PARTIAL_INUSE) {
+		ret = per_cpu_sum(*n->partial_free_objs);
+		if ((long)ret < 0)
+			ret = 0;
+		ret = n->partial_total_objs - ret;
+		if ((long)ret < 0)
+			ret = 0;
+	} else { /* item == PARTIAL_SLAB */
+		ret = n->nr_partial;
+	}
+#endif
+
+	return ret;
 }
 #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
 
@@ -2587,7 +2597,7 @@ static unsigned long count_partial(struct kmem_cache_node *n,
 		unsigned long nr_objs;
 		unsigned long nr_free;
 
-		nr_free  = count_partial(n, count_free);
+		nr_free  = partial_counter(n, PARTIAL_FREE);
 		nr_slabs = node_nr_slabs(n);
 		nr_objs  = node_nr_objs(n);
 
@@ -4654,18 +4664,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
 EXPORT_SYMBOL(__kmalloc_node_track_caller);
 #endif
 
-#ifdef CONFIG_SYSFS
-static int count_inuse(struct page *page)
-{
-	return page->inuse;
-}
-
-static int count_total(struct page *page)
-{
-	return page->objects;
-}
-#endif
-
 #ifdef CONFIG_SLUB_DEBUG
 static void validate_slab(struct kmem_cache *s, struct page *page)
 {
@@ -5102,7 +5100,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 				x = atomic_long_read(&n->total_objects);
 			else if (flags & SO_OBJECTS)
 				x = atomic_long_read(&n->total_objects) -
-					count_partial(n, count_free);
+					partial_counter(n, PARTIAL_FREE);
 			else
 				x = atomic_long_read(&n->nr_slabs);
 			total += x;
@@ -5116,11 +5114,11 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 
 		for_each_kmem_cache_node(s, node, n) {
 			if (flags & SO_TOTAL)
-				x = count_partial(n, count_total);
+				x = partial_counter(n, PARTIAL_TOTAL);
 			else if (flags & SO_OBJECTS)
-				x = count_partial(n, count_inuse);
+				x = partial_counter(n, PARTIAL_INUSE);
 			else
-				x = n->nr_partial;
+				x = partial_counter(n, PARTIAL_SLAB);
 			total += x;
 			nodes[node] += x;
 		}
@@ -5884,7 +5882,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
 	for_each_kmem_cache_node(s, node, n) {
 		nr_slabs += node_nr_slabs(n);
 		nr_objs += node_nr_objs(n);
-		nr_free += count_partial(n, count_free);
+		nr_free += partial_counter(n, PARTIAL_FREE);
 	}
 
 	sinfo->active_objs = nr_objs - nr_free;
-- 
1.8.3.1




* Re: [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects
  2021-03-17  7:54 ` [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects Xunlei Pang
@ 2021-03-17 18:45   ` Vlastimil Babka
  2021-03-18  4:52     ` Xunlei Pang
  2021-03-18 12:18   ` Vlastimil Babka
  1 sibling, 1 reply; 11+ messages in thread
From: Vlastimil Babka @ 2021-03-17 18:45 UTC (permalink / raw)
  To: Xunlei Pang, Christoph Lameter, Christoph Lameter, Pekka Enberg,
	Roman Gushchin, Konstantin Khlebnikov, David Rientjes,
	Matthew Wilcox, Shu Ming, Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang

On 3/17/21 8:54 AM, Xunlei Pang wrote:
> The node list_lock in count_partial() spends long time iterating
> in case of large amount of partial page lists, which can cause
> thunder herd effect to the list_lock contention.
> 
> We have HSF RT(High-speed Service Framework Response-Time) monitors,
> the RT figures fluctuated randomly, then we deployed a tool detecting
> "irq off" and "preempt off" to dump the culprit's calltrace, capturing
> the list_lock cost nearly 100ms with irq off issued by "ss", this also
> caused network timeouts.
> 
> This patch introduces two counters to maintain the actual number
> of partial objects dynamically instead of iterating the partial
> page lists with list_lock held.
> 
> New counters of kmem_cache_node: partial_free_objs, partial_total_objs.
> The main operations are under list_lock in slow path, its performance
> impact is expected to be minimal except the __slab_free() path.
> 
> The only concern of introducing partial counter is that partial_free_objs
> may cause cacheline contention and false sharing issues in case of same
> SLUB concurrent __slab_free(), so define it to be a percpu counter and
> places it carefully.

Hm I wonder, is it possible that this will eventually overflow/underflow the
counter on some CPU? (I guess practically only on 32bit). Maybe the operations
that are already done under n->list_lock should flush the percpu counter to a
shared counter?
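
A rough sketch of that idea, illustrative only (partial_free_shared
would be a new shared field, it is not something this patch adds):

/* callers already hold n->list_lock */
static void flush_partial_free(struct kmem_cache_node *n)
{
	unsigned long delta;

	lockdep_assert_held(&n->list_lock);
	/* fold this CPU's accumulated delta into the shared counter so
	 * the per-cpu value stays bounded */
	delta = this_cpu_xchg(*n->partial_free_objs, 0);
	n->partial_free_shared += delta;	/* hypothetical field */
}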

...

> @@ -3039,6 +3066,13 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>  		head, new.counters,
>  		"__slab_free"));
>  
> +	if (!was_frozen && prior) {
> +		if (n)
> +			__update_partial_free(n, cnt);
> +		else
> +			__update_partial_free(get_node(s, page_to_nid(page)), cnt);
> +	}

I would guess this is the part that makes your measurements notice that
(although tiny) difference. We didn't need to obtain the node pointer before and
now we do. And that is really done just for the per-node breakdown in "objects"
and "objects_partial" files under /sys/kernel/slab - distinguishing nodes is not
needed for /proc/slabinfo. So that kinda justifies putting this under a new
CONFIG as you did. Although perhaps somebody interested in these kind of stats
would enable CONFIG_SLUB_STATS anyway, so that's still an option to use instead
of introducing a new oddly specific CONFIG? At least until somebody comes up and
presents a use case where they want the per-node breakdowns in /sys but cannot
afford CONFIG_SLUB_STATS.

But I'm also still thinking about simply counting all free objects (for the
purposes of accurate <active_objs> in /proc/slabinfo) as a percpu variable in
struct kmem_cache itself. That would basically put this_cpu_add() in all the
fast paths, but AFAICS thanks to the segment register it doesn't mean disabling
interrupts nor a LOCK operation, so maybe it wouldn't be that bad? And it
shouldn't need to deal with these node pointers. So maybe that would be
acceptable for CONFIG_SLUB_DEBUG? Guess I'll have to try...
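
Very roughly, and untested (free_objects would be a new hypothetical
member of struct kmem_cache):

	/* in struct kmem_cache: */
	unsigned long __percpu *free_objects;

	/* alloc fast path, after taking an object off the freelist: */
	this_cpu_dec(*s->free_objects);

	/* free fast path, after putting an object back: */
	this_cpu_inc(*s->free_objects);

	/* get_slabinfo() could then compute: */
	nr_free += per_cpu_sum(*s->free_objects);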




* Re: [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects
  2021-03-17 18:45   ` Vlastimil Babka
@ 2021-03-18  4:52     ` Xunlei Pang
  0 siblings, 0 replies; 11+ messages in thread
From: Xunlei Pang @ 2021-03-18  4:52 UTC (permalink / raw)
  To: Vlastimil Babka, Xunlei Pang, Christoph Lameter,
	Christoph Lameter, Pekka Enberg, Roman Gushchin,
	Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming,
	Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang

On 3/18/21 2:45 AM, Vlastimil Babka wrote:
> On 3/17/21 8:54 AM, Xunlei Pang wrote:
>> The node list_lock in count_partial() spends long time iterating
>> in case of large amount of partial page lists, which can cause
>> thunder herd effect to the list_lock contention.
>>
>> We have HSF RT(High-speed Service Framework Response-Time) monitors,
>> the RT figures fluctuated randomly, then we deployed a tool detecting
>> "irq off" and "preempt off" to dump the culprit's calltrace, capturing
>> the list_lock cost nearly 100ms with irq off issued by "ss", this also
>> caused network timeouts.
>>
>> This patch introduces two counters to maintain the actual number
>> of partial objects dynamically instead of iterating the partial
>> page lists with list_lock held.
>>
>> New counters of kmem_cache_node: partial_free_objs, partial_total_objs.
>> The main operations are under list_lock in slow path, its performance
>> impact is expected to be minimal except the __slab_free() path.
>>
>> The only concern of introducing partial counter is that partial_free_objs
>> may cause cacheline contention and false sharing issues in case of same
>> SLUB concurrent __slab_free(), so define it to be a percpu counter and
>> places it carefully.
> 
> Hm I wonder, is it possible that this will eventually overflow/underflow the
> counter on some CPU? (I guess practically only on 32bit). Maybe the operations
> that are already done under n->list_lock should flush the percpu counter to a
> shared counter?

You are right, thanks a lot for noticing this.

> 
> ...
> 
>> @@ -3039,6 +3066,13 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>>  		head, new.counters,
>>  		"__slab_free"));
>>  
>> +	if (!was_frozen && prior) {
>> +		if (n)
>> +			__update_partial_free(n, cnt);
>> +		else
>> +			__update_partial_free(get_node(s, page_to_nid(page)), cnt);
>> +	}
> 
> I would guess this is the part that makes your measurements notice that
> (although tiny) difference. We didn't need to obtain the node pointer before and
> now we do. And that is really done just for the per-node breakdown in "objects"
> and "objects_partial" files under /sys/kernel/slab - distinguishing nodes is not
> needed for /proc/slabinfo. So that kinda justifies putting this under a new
> CONFIG as you did. Although perhaps somebody interested in these kind of stats
> would enable CONFIG_SLUB_STATS anyway, so that's still an option to use instead
> of introducing a new oddly specific CONFIG? At least until somebody comes up and
> presents a use case where they want the per-node breakdowns in /sys but cannot
> afford CONFIG_SLUB_STATS.
> 
> But I'm also still thinking about simply counting all free objects (for the
> purposes of accurate <active_objs> in /proc/slabinfo) as a percpu variable in
> struct kmem_cache itself. That would basically put this_cpu_add() in all the
> fast paths, but AFAICS thanks to the segment register it doesn't mean disabling
> interrupts nor a LOCK operation, so maybe it wouldn't be that bad? And it
> shouldn't need to deal with these node pointers. So maybe that would be
> acceptable for CONFIG_SLUB_DEBUG? Guess I'll have to try...
> 

The percpu operation itself should be fine; it looks to be a cacheline
ping-pong issue caused by the extra percpu counter access, so making it
cacheline aligned improves things a little according to my tests.



* Re: [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects
  2021-03-17  7:54 ` [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects Xunlei Pang
  2021-03-17 18:45   ` Vlastimil Babka
@ 2021-03-18 12:18   ` Vlastimil Babka
  2021-03-18 12:56     ` Xunlei Pang
  1 sibling, 1 reply; 11+ messages in thread
From: Vlastimil Babka @ 2021-03-18 12:18 UTC (permalink / raw)
  To: Xunlei Pang, Christoph Lameter, Christoph Lameter, Pekka Enberg,
	Roman Gushchin, Konstantin Khlebnikov, David Rientjes,
	Matthew Wilcox, Shu Ming, Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang

On 3/17/21 8:54 AM, Xunlei Pang wrote:
> The node list_lock in count_partial() spends long time iterating
> in case of large amount of partial page lists, which can cause
> thunder herd effect to the list_lock contention.
> 
> We have HSF RT(High-speed Service Framework Response-Time) monitors,
> the RT figures fluctuated randomly, then we deployed a tool detecting
> "irq off" and "preempt off" to dump the culprit's calltrace, capturing
> the list_lock cost nearly 100ms with irq off issued by "ss", this also
> caused network timeouts.

I forgot to ask, how does "ss" come into this? It displays network connections
AFAIK. Does it read any SLUB counters or slabinfo?



* Re: [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects
  2021-03-18 12:18   ` Vlastimil Babka
@ 2021-03-18 12:56     ` Xunlei Pang
  2021-03-22  1:46       ` Shu Ming
  0 siblings, 1 reply; 11+ messages in thread
From: Xunlei Pang @ 2021-03-18 12:56 UTC (permalink / raw)
  To: Vlastimil Babka, Xunlei Pang, Christoph Lameter,
	Christoph Lameter, Pekka Enberg, Roman Gushchin,
	Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming,
	Andrew Morton
  Cc: linux-kernel, linux-mm, Wen Yang, James Wang



On 3/18/21 8:18 PM, Vlastimil Babka wrote:
> On 3/17/21 8:54 AM, Xunlei Pang wrote:
>> The node list_lock in count_partial() spends long time iterating
>> in case of large amount of partial page lists, which can cause
>> thunder herd effect to the list_lock contention.
>>
>> We have HSF RT(High-speed Service Framework Response-Time) monitors,
>> the RT figures fluctuated randomly, then we deployed a tool detecting
>> "irq off" and "preempt off" to dump the culprit's calltrace, capturing
>> the list_lock cost nearly 100ms with irq off issued by "ss", this also
>> caused network timeouts.
> 
> I forgot to ask, how does "ss" come into this? It displays network connections
> AFAIK. Does it read any SLUB counters or slabinfo?
> 

ss may access /proc/slabinfo to acquire network related slab statistics.



* Re: [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects
  2021-03-18 12:56     ` Xunlei Pang
@ 2021-03-22  1:46       ` Shu Ming
  2021-03-22 10:22         ` Vlastimil Babka
  0 siblings, 1 reply; 11+ messages in thread
From: Shu Ming @ 2021-03-22  1:46 UTC (permalink / raw)
  To: Xunlei Pang
  Cc: Vlastimil Babka, Christoph Lameter, Christoph Lameter,
	Pekka Enberg, Roman Gushchin, Konstantin Khlebnikov,
	David Rientjes, Matthew Wilcox, Andrew Morton, LKML, linux-mm,
	Wen Yang, James Wang

More precisely, ss will count partial objects like dentry objects with
"/sys/kernel/slab/dentry/partial"   whose number can become huge.

On Thu, Mar 18, 2021 at 8:56 PM Xunlei Pang <xlpang@linux.alibaba.com> wrote:
>
>
>
> On 3/18/21 8:18 PM, Vlastimil Babka wrote:
> > On 3/17/21 8:54 AM, Xunlei Pang wrote:
> >> The node list_lock in count_partial() spends long time iterating
> >> in case of large amount of partial page lists, which can cause
> >> thunder herd effect to the list_lock contention.
> >>
> >> We have HSF RT(High-speed Service Framework Response-Time) monitors,
> >> the RT figures fluctuated randomly, then we deployed a tool detecting
> >> "irq off" and "preempt off" to dump the culprit's calltrace, capturing
> >> the list_lock cost nearly 100ms with irq off issued by "ss", this also
> >> caused network timeouts.
> >
> > I forgot to ask, how does "ss" come into this? It displays network connections
> > AFAIK. Does it read any SLUB counters or slabinfo?
> >
>
> ss may access /proc/slabinfo to acquire network related slab statistics.



* Re: [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects
  2021-03-22  1:46       ` Shu Ming
@ 2021-03-22 10:22         ` Vlastimil Babka
  2021-03-29  1:58           ` Shu Ming
  0 siblings, 1 reply; 11+ messages in thread
From: Vlastimil Babka @ 2021-03-22 10:22 UTC (permalink / raw)
  To: Shu Ming, Xunlei Pang
  Cc: Christoph Lameter, Christoph Lameter, Pekka Enberg,
	Roman Gushchin, Konstantin Khlebnikov, David Rientjes,
	Matthew Wilcox, Andrew Morton, LKML, linux-mm, Wen Yang,
	James Wang

On 3/22/21 2:46 AM, Shu Ming wrote:
> More precisely, ss will count partial objects like dentry objects with
> "/sys/kernel/slab/dentry/partial"   whose number can become huge.

Uh, that's interesting. Would you know what exactly it uses the value for? I can
think of several reasons why it might be misleading.

> On Thu, Mar 18, 2021 at 8:56 PM Xunlei Pang <xlpang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 3/18/21 8:18 PM, Vlastimil Babka wrote:
>> > On 3/17/21 8:54 AM, Xunlei Pang wrote:
>> >> The node list_lock in count_partial() spends long time iterating
>> >> in case of large amount of partial page lists, which can cause
>> >> thunder herd effect to the list_lock contention.
>> >>
>> >> We have HSF RT(High-speed Service Framework Response-Time) monitors,
>> >> the RT figures fluctuated randomly, then we deployed a tool detecting
>> >> "irq off" and "preempt off" to dump the culprit's calltrace, capturing
>> >> the list_lock cost nearly 100ms with irq off issued by "ss", this also
>> >> caused network timeouts.
>> >
>> > I forgot to ask, how does "ss" come into this? It displays network connections
>> > AFAIK. Does it read any SLUB counters or slabinfo?
>> >
>>
>> ss may access /proc/slabinfo to acquire network related slab statistics.
> 




* Re: [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects
  2021-03-22 10:22         ` Vlastimil Babka
@ 2021-03-29  1:58           ` Shu Ming
  0 siblings, 0 replies; 11+ messages in thread
From: Shu Ming @ 2021-03-29  1:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Xunlei Pang, Christoph Lameter, Christoph Lameter, Pekka Enberg,
	Roman Gushchin, Konstantin Khlebnikov, David Rientjes,
	Matthew Wilcox, Andrew Morton, LKML, linux-mm, Wen Yang,
	James Wang

I am not sure how people are using partial object accounting.  I
believe it is used as a memory usage hint for slabs.

On Mon, Mar 22, 2021 at 6:22 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 3/22/21 2:46 AM, Shu Ming wrote:
> > More precisely, ss will count partial objects like dentry objects with
> > "/sys/kernel/slab/dentry/partial"   whose number can become huge.
>
> Uh, that's interesting. Would you know what exactly it uses the value for? I can
> think of several reasons why it might be misleading.
>
> > On Thu, Mar 18, 2021 at 8:56 PM Xunlei Pang <xlpang@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 3/18/21 8:18 PM, Vlastimil Babka wrote:
> >> > On 3/17/21 8:54 AM, Xunlei Pang wrote:
> >> >> The node list_lock in count_partial() spends long time iterating
> >> >> in case of large amount of partial page lists, which can cause
> >> >> thunder herd effect to the list_lock contention.
> >> >>
> >> >> We have HSF RT(High-speed Service Framework Response-Time) monitors,
> >> >> the RT figures fluctuated randomly, then we deployed a tool detecting
> >> >> "irq off" and "preempt off" to dump the culprit's calltrace, capturing
> >> >> the list_lock cost nearly 100ms with irq off issued by "ss", this also
> >> >> caused network timeouts.
> >> >
> >> > I forgot to ask, how does "ss" come into this? It displays network connections
> >> > AFAIK. Does it read any SLUB counters or slabinfo?
> >> >
> >>
> >> ss may access /proc/slabinfo to acquire network related slab statistics.
> >
>

