* [PATCH v2 0/3] debugobjects: Reduce global pool_lock contention
@ 2016-12-02 21:14 Waiman Long
  2016-12-02 21:14 ` [PATCH v2 1/3] debugobjects: Track number of kmem_cache_alloc/kmem_cache_free done Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Waiman Long @ 2016-12-02 21:14 UTC (permalink / raw)
  To: Thomas Gleixner, Andrew Morton
  Cc: linux-kernel, Du Changbin, Christian Borntraeger, Jan Stancek,
	Waiman Long

 v1->v2:
  - Move patch 2 in front of patch 1.
  - Fix merge conflict with linux-next.

This patchset aims to reduce contention on the global pool_lock while
improving performance at the same time. It resolves the following soft
lockup problem seen with a debug kernel on some large SMP systems:

 NMI watchdog: BUG: soft lockup - CPU#35 stuck for 22s! [rcuos/1:21]
 ...
 RIP: 0010:[<ffffffff817c216b>]  [<ffffffff817c216b>]
	_raw_spin_unlock_irqrestore+0x3b/0x60
 ...
 Call Trace:
  [<ffffffff813f40d1>] free_object+0x81/0xb0
  [<ffffffff813f4f33>] debug_check_no_obj_freed+0x193/0x220
  [<ffffffff81101a59>] ? trace_hardirqs_on_caller+0xf9/0x1c0
  [<ffffffff81284996>] ? file_free_rcu+0x36/0x60
  [<ffffffff81251712>] kmem_cache_free+0xd2/0x380
  [<ffffffff81284960>] ? fput+0x90/0x90
  [<ffffffff81284996>] file_free_rcu+0x36/0x60
  [<ffffffff81124c23>] rcu_nocb_kthread+0x1b3/0x550
  [<ffffffff81124b71>] ? rcu_nocb_kthread+0x101/0x550
  [<ffffffff81124a70>] ? sync_exp_work_done.constprop.63+0x50/0x50
  [<ffffffff810c59d1>] kthread+0x101/0x120
  [<ffffffff81101a59>] ? trace_hardirqs_on_caller+0xf9/0x1c0
  [<ffffffff817c2d32>] ret_from_fork+0x22/0x50

On an 8-socket IvyBridge-EX system (120 cores, 240 threads), the
elapsed time of a 4.9-rc7 kernel parallel build (make -j 240) was
reduced from 7m57s with an unpatched kernel to 7m19s with a patched
one. There was also about a 10X reduction in the number of debug
objects allocated from or freed to the kmem_cache during the kernel
build.

Waiman Long (3):
  debugobjects: Track number of kmem_cache_alloc/kmem_cache_free done
  debugobjects: Scale thresholds with # of CPUs
  debugobjects: Reduce contention on the global pool_lock

 lib/debugobjects.c | 57 ++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 45 insertions(+), 12 deletions(-)

-- 
1.8.3.1


* [PATCH v2 1/3] debugobjects: Track number of kmem_cache_alloc/kmem_cache_free done
  2016-12-02 21:14 [PATCH v2 0/3] debugobjects: Reduce global pool_lock contention Waiman Long
@ 2016-12-02 21:14 ` Waiman Long
  2016-12-02 21:14 ` [PATCH v2 2/3] debugobjects: Scale thresholds with # of CPUs Waiman Long
  2016-12-02 21:14 ` [PATCH v2 3/3] debugobjects: Reduce contention on the global pool_lock Waiman Long
  2 siblings, 0 replies; 4+ messages in thread
From: Waiman Long @ 2016-12-02 21:14 UTC (permalink / raw)
  To: Thomas Gleixner, Andrew Morton
  Cc: linux-kernel, Du Changbin, Christian Borntraeger, Jan Stancek,
	Waiman Long

New debugfs stat counters are added to track the number of
kmem_cache_alloc() and kmem_cache_free() calls, to give a sense of how
the internal debug objects cache management is performing.
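
For reference, the new counters show up in the existing debug objects
stats file under debugfs (typically /sys/kernel/debug/debug_objects/stats);
the values below are purely illustrative:

 # cat /sys/kernel/debug/debug_objects/stats
 ...
 objects_alloc :154326
 objects_freed :152413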

Signed-off-by: Waiman Long <longman@redhat.com>
---
 lib/debugobjects.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/lib/debugobjects.c b/lib/debugobjects.c
index 04c1ef7..d78673e 100644
--- a/lib/debugobjects.c
+++ b/lib/debugobjects.c
@@ -55,6 +55,12 @@ struct debug_bucket {
 
 static struct debug_obj_descr	*descr_test  __read_mostly;
 
+/*
+ * Track numbers of kmem_cache_alloc and kmem_cache_free done.
+ */
+static int			debug_objects_alloc;
+static int			debug_objects_freed;
+
 static void free_obj_work(struct work_struct *work);
 static DECLARE_WORK(debug_obj_work, free_obj_work);
 
@@ -102,6 +108,7 @@ static void fill_pool(void)
 
 		raw_spin_lock_irqsave(&pool_lock, flags);
 		hlist_add_head(&new->node, &obj_pool);
+		debug_objects_alloc++;
 		obj_pool_free++;
 		raw_spin_unlock_irqrestore(&pool_lock, flags);
 	}
@@ -173,6 +180,7 @@ static void free_obj_work(struct work_struct *work)
 		obj = hlist_entry(obj_pool.first, typeof(*obj), node);
 		hlist_del(&obj->node);
 		obj_pool_free--;
+		debug_objects_freed++;
 		/*
 		 * We release pool_lock across kmem_cache_free() to
 		 * avoid contention on pool_lock.
@@ -758,6 +766,8 @@ static int debug_stats_show(struct seq_file *m, void *v)
 	seq_printf(m, "pool_min_free :%d\n", obj_pool_min_free);
 	seq_printf(m, "pool_used     :%d\n", obj_pool_used);
 	seq_printf(m, "pool_max_used :%d\n", obj_pool_max_used);
+	seq_printf(m, "objects_alloc :%d\n", debug_objects_alloc);
+	seq_printf(m, "objects_freed :%d\n", debug_objects_freed);
 	return 0;
 }
 
-- 
1.8.3.1


* [PATCH v2 2/3] debugobjects: Scale thresholds with # of CPUs
  2016-12-02 21:14 [PATCH v2 0/3] debugobjects: Reduce global pool_lock contention Waiman Long
  2016-12-02 21:14 ` [PATCH v2 1/3] debugobjects: Track number of kmem_cache_alloc/kmem_cache_free done Waiman Long
@ 2016-12-02 21:14 ` Waiman Long
  2016-12-02 21:14 ` [PATCH v2 3/3] debugobjects: Reduce contention on the global pool_lock Waiman Long
  2 siblings, 0 replies; 4+ messages in thread
From: Waiman Long @ 2016-12-02 21:14 UTC (permalink / raw)
  To: Thomas Gleixner, Andrew Morton
  Cc: linux-kernel, Du Changbin, Christian Borntraeger, Jan Stancek,
	Waiman Long

On large SMP systems with hundreds of CPUs, the current thresholds
for allocating and freeing debug objects (256 and 1024 respectively)
may not work well. This can cause a lot of needless calls to
kmem_cache_alloc() and kmem_cache_free() on those systems.

To alleviate this thrashing problem, the object freeing threshold is
increased to "1024 + # of CPUs * 32", and the object allocation
threshold is increased to "256 + # of CPUs * 4". That should make the
debug objects subsystem scale better with the number of CPUs available
in the system.
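
For illustration only (the CPU counts are just examples), with these
formulas a 240-CPU system like the test machine gets

 freeing threshold    = 1024 + 240 * 32 = 8704 objects
 allocation threshold =  256 + 240 *  4 = 1216 objects

while a small 4-CPU box stays close to the original values (1152 and
272 respectively).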

Signed-off-by: Waiman Long <longman@redhat.com>
---
 lib/debugobjects.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/lib/debugobjects.c b/lib/debugobjects.c
index d78673e..dc78217 100644
--- a/lib/debugobjects.c
+++ b/lib/debugobjects.c
@@ -52,7 +52,10 @@ struct debug_bucket {
 static int			debug_objects_warnings __read_mostly;
 static int			debug_objects_enabled __read_mostly
 				= CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT;
-
+static int			debug_objects_pool_size __read_mostly
+				= ODEBUG_POOL_SIZE;
+static int			debug_objects_pool_min_level __read_mostly
+				= ODEBUG_POOL_MIN_LEVEL;
 static struct debug_obj_descr	*descr_test  __read_mostly;
 
 /*
@@ -94,13 +97,13 @@ static void fill_pool(void)
 	struct debug_obj *new;
 	unsigned long flags;
 
-	if (likely(obj_pool_free >= ODEBUG_POOL_MIN_LEVEL))
+	if (likely(obj_pool_free >= debug_objects_pool_min_level))
 		return;
 
 	if (unlikely(!obj_cache))
 		return;
 
-	while (obj_pool_free < ODEBUG_POOL_MIN_LEVEL) {
+	while (obj_pool_free < debug_objects_pool_min_level) {
 
 		new = kmem_cache_zalloc(obj_cache, gfp);
 		if (!new)
@@ -176,7 +179,7 @@ static void free_obj_work(struct work_struct *work)
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&pool_lock, flags);
-	while (obj_pool_free > ODEBUG_POOL_SIZE) {
+	while (obj_pool_free > debug_objects_pool_size) {
 		obj = hlist_entry(obj_pool.first, typeof(*obj), node);
 		hlist_del(&obj->node);
 		obj_pool_free--;
@@ -206,7 +209,7 @@ static void free_object(struct debug_obj *obj)
 	 * schedule work when the pool is filled and the cache is
 	 * initialized:
 	 */
-	if (obj_pool_free > ODEBUG_POOL_SIZE && obj_cache)
+	if (obj_pool_free > debug_objects_pool_size && obj_cache)
 		sched = 1;
 	hlist_add_head(&obj->node, &obj_pool);
 	obj_pool_free++;
@@ -1126,4 +1129,11 @@ void __init debug_objects_mem_init(void)
 		pr_warn("out of memory.\n");
 	} else
 		debug_objects_selftest();
+
+	/*
+	 * Increase the thresholds for allocating and freeing objects
+	 * according to the number of possible CPUs available in the system.
+	 */
+	debug_objects_pool_size += num_possible_cpus() * 32;
+	debug_objects_pool_min_level += num_possible_cpus() * 4;
 }
-- 
1.8.3.1


* [PATCH v2 3/3] debugobjects: Reduce contention on the global pool_lock
  2016-12-02 21:14 [PATCH v2 0/3] debugobjects: Reduce global pool_lock contention Waiman Long
  2016-12-02 21:14 ` [PATCH v2 1/3] debugobjects: Track number of kmem_cache_alloc/kmem_cache_free done Waiman Long
  2016-12-02 21:14 ` [PATCH v2 2/3] debugobjects: Scale thresholds with # of CPUs Waiman Long
@ 2016-12-02 21:14 ` Waiman Long
  2 siblings, 0 replies; 4+ messages in thread
From: Waiman Long @ 2016-12-02 21:14 UTC (permalink / raw)
  To: Thomas Gleixner, Andrew Morton
  Cc: linux-kernel, Du Changbin, Christian Borntraeger, Jan Stancek,
	Waiman Long

On a large SMP system with many CPUs, the global pool_lock may become
a performance bottleneck as all the CPUs that need to allocate or
free debug objects have to take the lock. That can sometimes cause
soft lockups like:

 NMI watchdog: BUG: soft lockup - CPU#35 stuck for 22s! [rcuos/1:21]
 ...
 RIP: 0010:[<ffffffff817c216b>]  [<ffffffff817c216b>]
	_raw_spin_unlock_irqrestore+0x3b/0x60
 ...
 Call Trace:
  [<ffffffff813f40d1>] free_object+0x81/0xb0
  [<ffffffff813f4f33>] debug_check_no_obj_freed+0x193/0x220
  [<ffffffff81101a59>] ? trace_hardirqs_on_caller+0xf9/0x1c0
  [<ffffffff81284996>] ? file_free_rcu+0x36/0x60
  [<ffffffff81251712>] kmem_cache_free+0xd2/0x380
  [<ffffffff81284960>] ? fput+0x90/0x90
  [<ffffffff81284996>] file_free_rcu+0x36/0x60
  [<ffffffff81124c23>] rcu_nocb_kthread+0x1b3/0x550
  [<ffffffff81124b71>] ? rcu_nocb_kthread+0x101/0x550
  [<ffffffff81124a70>] ? sync_exp_work_done.constprop.63+0x50/0x50
  [<ffffffff810c59d1>] kthread+0x101/0x120
  [<ffffffff81101a59>] ? trace_hardirqs_on_caller+0xf9/0x1c0
  [<ffffffff817c2d32>] ret_from_fork+0x22/0x50

To reduce the amount of contention on the pool_lock, the actual
kmem_cache_free() of the debug objects is delayed if the pool_lock is
busy. This temporarily increases the number of free objects available
in the free pool when the system is busy. As a result, the number of
kmem_cache allocations and frees should be reduced.

This patch also groups the freeing of debug objects into batches of 4
to reduce the total number of lock/unlock cycles.
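
As a rough illustration (the numbers are hypothetical), with
ODEBUG_FREE_BATCH = 4 the workqueue drains an over-full pool with a
quarter of the lock traffic:

 excess objects to free : 1024
 old code               : 1024 lock/unlock cycles, 1 kmem_cache_free() each
 new code               :  256 trylock/unlock cycles, 4 kmem_cache_free() each

and each trylock simply bails out if the pool_lock is contended,
leaving the remaining objects in the pool until the work item is
scheduled again.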

Signed-off-by: Waiman Long <longman@redhat.com>
---
 lib/debugobjects.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/lib/debugobjects.c b/lib/debugobjects.c
index dc78217..5476bbe 100644
--- a/lib/debugobjects.c
+++ b/lib/debugobjects.c
@@ -172,25 +172,38 @@ static struct debug_obj *lookup_object(void *addr, struct debug_bucket *b)
 
 /*
  * workqueue function to free objects.
+ *
+ * To reduce contention on the global pool_lock, the actual freeing of
+ * debug objects will be delayed if the pool_lock is busy. We also free
+ * the objects in a batch of 4 for each lock/unlock cycle.
  */
+#define ODEBUG_FREE_BATCH	4
 static void free_obj_work(struct work_struct *work)
 {
-	struct debug_obj *obj;
+	struct debug_obj *objs[ODEBUG_FREE_BATCH];
 	unsigned long flags;
+	int i;
 
-	raw_spin_lock_irqsave(&pool_lock, flags);
-	while (obj_pool_free > debug_objects_pool_size) {
-		obj = hlist_entry(obj_pool.first, typeof(*obj), node);
-		hlist_del(&obj->node);
-		obj_pool_free--;
-		debug_objects_freed++;
+	if (!raw_spin_trylock_irqsave(&pool_lock, flags))
+		return;
+	while (obj_pool_free >= debug_objects_pool_size + ODEBUG_FREE_BATCH) {
+		for (i = 0; i < ODEBUG_FREE_BATCH; i++) {
+			objs[i] = hlist_entry(obj_pool.first,
+					      typeof(*objs[0]), node);
+			hlist_del(&objs[i]->node);
+		}
+
+		obj_pool_free -= ODEBUG_FREE_BATCH;
+		debug_objects_freed += ODEBUG_FREE_BATCH;
 		/*
 		 * We release pool_lock across kmem_cache_free() to
 		 * avoid contention on pool_lock.
 		 */
 		raw_spin_unlock_irqrestore(&pool_lock, flags);
-		kmem_cache_free(obj_cache, obj);
-		raw_spin_lock_irqsave(&pool_lock, flags);
+		for (i = 0; i < ODEBUG_FREE_BATCH; i++)
+			kmem_cache_free(obj_cache, objs[i]);
+		if (!raw_spin_trylock_irqsave(&pool_lock, flags))
+			return;
 	}
 	raw_spin_unlock_irqrestore(&pool_lock, flags);
 }
-- 
1.8.3.1

