* [PATCH v2 0/4] mm/slub: Fix sysfs circular locking dependency
@ 2020-04-27 23:56 Waiman Long
  2020-04-27 23:56 ` [PATCH v2 1/4] mm, slab: Revert "extend slab/shrink to shrink all memcg caches" Waiman Long
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Waiman Long @ 2020-04-27 23:56 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov
  Cc: linux-mm, linux-kernel, cgroups, Juri Lelli, Qian Cai, Waiman Long

v2:
 - Use regular cmpxchg() instead of x86-only try_cmpxchg() in patch 2.
 - Add patches 3 and 4 to fix the circular locking dependency showing
   up at shutdown time.

With lockdep enabled, issuing the following commands to the slub sysfs
files will cause a splat about a circular locking dependency to show up
either immediately afterwards or at shutdown time.

 # echo 1 > validate
 # echo 1 > shrink

This patchset fixes these lockdep splats by replacing slab_mutex with
memcg_cache_ids_sem as well as by converting some of the lock
operations to trylocks.
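
For illustration only, the trylock pattern that patches 3 and 4
introduce looks roughly like this (simplified from the actual hunks in
those patches, not an exact copy):

	/* patch 3: bounded retries on the memcg_cache_ids_sem read lock */
	int retries = 3;

	while (!memcg_tryget_cache_ids()) {
		if (retries-- <= 0)
			return -EBUSY;
		msleep(100);
	}
	/* ... walk the memcg caches ... */
	memcg_put_cache_ids();

	/* patch 4: refuse the shrink if a hotplug operation is in flight */
	if (!cpus_read_trylock())
		return -EBUSY;
	if (!tryget_online_mems()) {
		cpus_read_unlock();
		return -EBUSY;
	}
	kasan_cache_shrink(s);
	__kmem_cache_shrink(s);
	put_online_mems();
	cpus_read_unlock();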

Waiman Long (4):
  mm, slab: Revert "extend slab/shrink to shrink all memcg caches"
  mm/slub: Fix slab_mutex circular locking problem in slab_attr_store()
  mm/slub: Fix another circular locking dependency in slab_attr_store()
  mm/slub: Fix sysfs shrink circular locking dependency

 include/linux/memcontrol.h     |  1 +
 include/linux/memory_hotplug.h |  2 +
 mm/memcontrol.c                |  5 ++
 mm/memory_hotplug.c            |  5 ++
 mm/slab.h                      |  1 -
 mm/slab_common.c               | 37 -------------
 mm/slub.c                      | 98 +++++++++++++++++++++++++++++-----
 7 files changed, 99 insertions(+), 50 deletions(-)

-- 
2.18.1


* [PATCH v2 1/4] mm, slab: Revert "extend slab/shrink to shrink all memcg caches"
  2020-04-27 23:56 [PATCH v2 0/4] mm/slub: Fix sysfs circular locking dependency Waiman Long
@ 2020-04-27 23:56 ` Waiman Long
  2020-04-27 23:56 ` [PATCH v2 2/4] mm/slub: Fix slab_mutex circular locking problem in slab_attr_store() Waiman Long
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2020-04-27 23:56 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov
  Cc: linux-mm, linux-kernel, cgroups, Juri Lelli, Qian Cai, Waiman Long

When the slub shrink sysfs file is written to, the function call
sequence is as follows:

  kernfs_fop_write
    => slab_attr_store
      => shrink_store
        => kmem_cache_shrink_all

It turns out that doing a memcg cache scan in kmem_cache_shrink_all()
is redundant, as the same memcg cache scan is already being done in
slab_attr_store(). So revert commit 04f768a39d55 ("mm, slab: extend
slab/shrink to shrink all memcg caches") except for the documentation
change, which is still valid.
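
In other words, before this revert the same write to the shrink file
scans the memcg caches twice, roughly like this (a simplified call
tree, not actual code):

	slab_attr_store(root, "1")
	  shrink_store(root)
	    kmem_cache_shrink_all(root)   /* shrinks root + every memcg cache */
	  for_each_memcg_cache(c, root)   /* second scan, under slab_mutex */
	    shrink_store(c)
	      kmem_cache_shrink_all(c)    /* non-root: shrinks c once more */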

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/slab.h        |  1 -
 mm/slab_common.c | 37 -------------------------------------
 mm/slub.c        |  2 +-
 3 files changed, 1 insertion(+), 39 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 207c83ef6e06..0937cb2ae8aa 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -237,7 +237,6 @@ int __kmem_cache_shrink(struct kmem_cache *);
 void __kmemcg_cache_deactivate(struct kmem_cache *s);
 void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
 void slab_kmem_cache_release(struct kmem_cache *);
-void kmem_cache_shrink_all(struct kmem_cache *s);
 
 struct seq_file;
 struct file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 23c7500eea7d..2e367ab8c15c 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -995,43 +995,6 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
-/**
- * kmem_cache_shrink_all - shrink a cache and all memcg caches for root cache
- * @s: The cache pointer
- */
-void kmem_cache_shrink_all(struct kmem_cache *s)
-{
-	struct kmem_cache *c;
-
-	if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !is_root_cache(s)) {
-		kmem_cache_shrink(s);
-		return;
-	}
-
-	get_online_cpus();
-	get_online_mems();
-	kasan_cache_shrink(s);
-	__kmem_cache_shrink(s);
-
-	/*
-	 * We have to take the slab_mutex to protect from the memcg list
-	 * modification.
-	 */
-	mutex_lock(&slab_mutex);
-	for_each_memcg_cache(c, s) {
-		/*
-		 * Don't need to shrink deactivated memcg caches.
-		 */
-		if (s->flags & SLAB_DEACTIVATED)
-			continue;
-		kasan_cache_shrink(c);
-		__kmem_cache_shrink(c);
-	}
-	mutex_unlock(&slab_mutex);
-	put_online_mems();
-	put_online_cpus();
-}
-
 bool slab_is_available(void)
 {
 	return slab_state >= UP;
diff --git a/mm/slub.c b/mm/slub.c
index 9bf44955c4f1..183ccc364ccf 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5343,7 +5343,7 @@ static ssize_t shrink_store(struct kmem_cache *s,
 			const char *buf, size_t length)
 {
 	if (buf[0] == '1')
-		kmem_cache_shrink_all(s);
+		kmem_cache_shrink(s);
 	else
 		return -EINVAL;
 	return length;
-- 
2.18.1


* [PATCH v2 2/4] mm/slub: Fix slab_mutex circular locking problem in slab_attr_store()
  2020-04-27 23:56 [PATCH v2 0/4] mm/slub: Fix sysfs circular locking dependency Waiman Long
  2020-04-27 23:56 ` [PATCH v2 1/4] mm, slab: Revert "extend slab/shrink to shrink all memcg caches" Waiman Long
@ 2020-04-27 23:56 ` Waiman Long
  2020-04-27 23:56 ` [PATCH v2 3/4] mm/slub: Fix another circular locking dependency " Waiman Long
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2020-04-27 23:56 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov
  Cc: linux-mm, linux-kernel, cgroups, Juri Lelli, Qian Cai, Waiman Long

The following lockdep splat was reported:

  [  176.241923] ======================================================
  [  176.241924] WARNING: possible circular locking dependency detected
  [  176.241926] 4.18.0-172.rt13.29.el8.x86_64+debug #1 Not tainted
  [  176.241927] ------------------------------------------------------
  [  176.241929] slub_cpu_partia/5371 is trying to acquire lock:
  [  176.241930] ffffffffa0b83718 (slab_mutex){+.+.}, at: slab_attr_store+0x6b/0xe0
  [  176.241941]
                 but task is already holding lock:
  [  176.241942] ffff88bb6d8b83c8 (kn->count#103){++++}, at: kernfs_fop_write+0x1cc/0x400
  [  176.241947]
                 which lock already depends on the new lock.

  [  176.241949]
                 the existing dependency chain (in reverse order) is:
  [  176.241949]
                 -> #1 (kn->count#103){++++}:
  [  176.241955]        __kernfs_remove+0x616/0x800
  [  176.241957]        kernfs_remove_by_name_ns+0x3e/0x80
  [  176.241959]        sysfs_slab_add+0x1c6/0x330
  [  176.241961]        __kmem_cache_create+0x15f/0x1b0
  [  176.241964]        create_cache+0xe1/0x220
  [  176.241966]        kmem_cache_create_usercopy+0x1a3/0x260
  [  176.241967]        kmem_cache_create+0x12/0x20
  [  176.242076]        mlx5_init_fs+0x18d/0x1a00 [mlx5_core]
  [  176.242100]        mlx5_load_one+0x3b4/0x1730 [mlx5_core]
  [  176.242124]        init_one+0x901/0x11b0 [mlx5_core]
  [  176.242127]        local_pci_probe+0xd4/0x180
  [  176.242131]        work_for_cpu_fn+0x51/0xa0
  [  176.242133]        process_one_work+0x91a/0x1ac0
  [  176.242134]        worker_thread+0x536/0xb40
  [  176.242136]        kthread+0x30c/0x3d0
  [  176.242140]        ret_from_fork+0x27/0x50
  [  176.242140]
                 -> #0 (slab_mutex){+.+.}:
  [  176.242145]        __lock_acquire+0x22cb/0x48c0
  [  176.242146]        lock_acquire+0x134/0x4c0
  [  176.242148]        _mutex_lock+0x28/0x40
  [  176.242150]        slab_attr_store+0x6b/0xe0
  [  176.242151]        kernfs_fop_write+0x251/0x400
  [  176.242154]        vfs_write+0x157/0x460
  [  176.242155]        ksys_write+0xb8/0x170
  [  176.242158]        do_syscall_64+0x13c/0x710
  [  176.242160]        entry_SYSCALL_64_after_hwframe+0x6a/0xdf
  [  176.242161]
                 other info that might help us debug this:

  [  176.242161]  Possible unsafe locking scenario:

  [  176.242162]        CPU0                    CPU1
  [  176.242163]        ----                    ----
  [  176.242163]   lock(kn->count#103);
  [  176.242165]                                lock(slab_mutex);
  [  176.242166]                                lock(kn->count#103);
  [  176.242167]   lock(slab_mutex);
  [  176.242169]
                  *** DEADLOCK ***

  [  176.242170] 3 locks held by slub_cpu_partia/5371:
  [  176.242170]  #0: ffff888705e3a800 (sb_writers#4){.+.+}, at: vfs_write+0x31c/0x460
  [  176.242174]  #1: ffff889aeec4d658 (&of->mutex){+.+.}, at: kernfs_fop_write+0x1a9/0x400
  [  176.242177]  #2: ffff88bb6d8b83c8 (kn->count#103){++++}, at: kernfs_fop_write+0x1cc/0x400
  [  176.242180]
                 stack backtrace:
  [  176.242183] CPU: 36 PID: 5371 Comm: slub_cpu_partia Not tainted 4.18.0-172.rt13.29.el8.x86_64+debug #1
  [  176.242184] Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RDY1005C 11/22/2019
  [  176.242185] Call Trace:
  [  176.242190]  dump_stack+0x9a/0xf0
  [  176.242193]  check_noncircular+0x317/0x3c0
  [  176.242195]  ? print_circular_bug+0x1e0/0x1e0
  [  176.242199]  ? native_sched_clock+0x32/0x1e0
  [  176.242202]  ? sched_clock+0x5/0x10
  [  176.242205]  ? sched_clock_cpu+0x238/0x340
  [  176.242208]  __lock_acquire+0x22cb/0x48c0
  [  176.242213]  ? trace_hardirqs_on+0x10/0x10
  [  176.242215]  ? trace_hardirqs_on+0x10/0x10
  [  176.242218]  lock_acquire+0x134/0x4c0
  [  176.242220]  ? slab_attr_store+0x6b/0xe0
  [  176.242223]  _mutex_lock+0x28/0x40
  [  176.242225]  ? slab_attr_store+0x6b/0xe0
  [  176.242227]  slab_attr_store+0x6b/0xe0
  [  176.242229]  ? sysfs_file_ops+0x160/0x160
  [  176.242230]  kernfs_fop_write+0x251/0x400
  [  176.242232]  ? __sb_start_write+0x26a/0x3f0
  [  176.242234]  vfs_write+0x157/0x460
  [  176.242237]  ksys_write+0xb8/0x170
  [  176.242239]  ? __ia32_sys_read+0xb0/0xb0
  [  176.242242]  ? do_syscall_64+0xb9/0x710
  [  176.242245]  do_syscall_64+0x13c/0x710
  [  176.242247]  entry_SYSCALL_64_after_hwframe+0x6a/0xdf

There was another lockdep splat generated by echoing "1" to
"/sys/kernel/slab/fs_cache/shrink":

[  445.231443] Chain exists of:
                 cpu_hotplug_lock --> mem_hotplug_lock --> slab_mutex

[  445.242025]  Possible unsafe locking scenario:

[  445.247977]        CPU0                    CPU1
[  445.252529]        ----                    ----
[  445.257082]   lock(slab_mutex);
[  445.260239]                                lock(mem_hotplug_lock);
[  445.266452]                                lock(slab_mutex);
[  445.272141]   lock(cpu_hotplug_lock);

So it is problematic to use slab_mutex to iterate the list of
child memcg caches with for_each_memcg_cache(). Fortunately, there is
another way to iterate the child memcg caches: go through the array
entries in memcg_params.memcg_caches while holding a read lock on
memcg_cache_ids_sem.

To avoid other possible circular locking problems, we only take a
reference to the child memcg caches and store their addresses while
holding memcg_cache_ids_sem. The actual store method is called for
each of the child caches after releasing the lock.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/slub.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 50 insertions(+), 8 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 183ccc364ccf..44cb5215c17f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5567,13 +5567,32 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 		return -EIO;
 
 	err = attribute->store(s, buf, len);
-#ifdef CONFIG_MEMCG
-	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
-		struct kmem_cache *c;
+#ifdef CONFIG_MEMCG_KMEM
+	if (slab_state >= FULL && err >= 0 && is_root_cache(s) &&
+	    !list_empty(&s->memcg_params.children)) {
+		struct kmem_cache *c, **pcaches;
+		int idx, max, cnt = 0;
+		size_t size, old = s->max_attr_size;
+		struct memcg_cache_array *arr;
+
+		/*
+		 * Make atomic update to s->max_attr_size.
+		 */
+		do {
+			size = old;
+			if (len <= size)
+				break;
+			old = cmpxchg(&s->max_attr_size, size, len);
+		} while (old != size);
 
-		mutex_lock(&slab_mutex);
-		if (s->max_attr_size < len)
-			s->max_attr_size = len;
+		memcg_get_cache_ids();
+		max = memcg_nr_cache_ids;
+
+		pcaches = kmalloc_array(max, sizeof(void *), GFP_KERNEL);
+		if (!pcaches) {
+			memcg_put_cache_ids();
+			return -ENOMEM;
+		}
 
 		/*
 		 * This is a best effort propagation, so this function's return
@@ -5591,10 +5610,33 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 		 * has well defined semantics. The cache being written to
 		 * directly either failed or succeeded, in which case we loop
 		 * through the descendants with best-effort propagation.
+		 *
+		 * To avoid potential circular lock dependency problems, we
+		 * just get a reference and store child cache pointers while
+		 * holding the memcg_cache_ids_sem read lock. The store
+		 * method is then called for each child cache after releasing
+		 * the lock. Code sequence partly borrowed from
+		 * memcg_kmem_get_cache().
 		 */
-		for_each_memcg_cache(c, s)
+		rcu_read_lock();
+		arr = rcu_dereference(s->memcg_params.memcg_caches);
+		for (idx = 0; idx < max; idx++) {
+			c = READ_ONCE(arr->entries[idx]);
+			if (!c)
+				continue;
+			if (!percpu_ref_tryget(&c->memcg_params.refcnt))
+				continue;
+			pcaches[cnt++] = c;
+		}
+		rcu_read_unlock();
+		memcg_put_cache_ids();
+
+		for (idx = 0; idx < cnt; idx++) {
+			c = pcaches[idx];
 			attribute->store(c, buf, len);
-		mutex_unlock(&slab_mutex);
+			percpu_ref_put(&c->memcg_params.refcnt);
+		}
+		kfree(pcaches);
 	}
 #endif
 	return err;
-- 
2.18.1


* [PATCH v2 3/4] mm/slub: Fix another circular locking dependency in slab_attr_store()
  2020-04-27 23:56 [PATCH v2 0/4] mm/slub: Fix sysfs circular locking dependency Waiman Long
  2020-04-27 23:56 ` [PATCH v2 1/4] mm, slab: Revert "extend slab/shrink to shrink all memcg caches" Waiman Long
  2020-04-27 23:56 ` [PATCH v2 2/4] mm/slub: Fix slab_mutex circular locking problem in slab_attr_store() Waiman Long
@ 2020-04-27 23:56 ` Waiman Long
  2020-05-17  2:19   ` Qian Cai
  2020-04-27 23:56 ` [PATCH v2 4/4] mm/slub: Fix sysfs shrink circular locking dependency Waiman Long
  2020-05-17  1:54 ` [PATCH v2 0/4] mm/slub: Fix sysfs " Qian Cai
  4 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2020-04-27 23:56 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov
  Cc: linux-mm, linux-kernel, cgroups, Juri Lelli, Qian Cai, Waiman Long

It turns out that switching from slab_mutex to memcg_cache_ids_sem in
slab_attr_store() does not completely eliminate the circular locking
dependency, as shown by the following lockdep splat when the system is
shut down:

[ 2095.079697] Chain exists of:
[ 2095.079697]   kn->count#278 --> memcg_cache_ids_sem --> slab_mutex
[ 2095.079697]
[ 2095.090278]  Possible unsafe locking scenario:
[ 2095.090278]
[ 2095.096227]        CPU0                    CPU1
[ 2095.100779]        ----                    ----
[ 2095.105331]   lock(slab_mutex);
[ 2095.108486]                                lock(memcg_cache_ids_sem);
[ 2095.114961]                                lock(slab_mutex);
[ 2095.120649]   lock(kn->count#278);
[ 2095.124068]
[ 2095.124068]  *** DEADLOCK ***

To eliminate this possibility, we have to use trylock to acquire
memcg_cache_ids_sem. Unlike slab_mutex, which can be acquired in
many places, the memcg_cache_ids_sem write lock is only acquired
in memcg_alloc_cache_id() to double the size of memcg_nr_cache_ids.
So the chance of successive calls to memcg_alloc_cache_id() within
a short time is pretty low. As a result, we can retry the read lock
acquisition a few times if the first attempt fails.
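
For reference, the write-side user looks roughly like this (a
simplified, from-memory sketch of memcg_alloc_cache_id(), not a
verbatim copy); the write lock is only reached when a newly allocated
id does not fit in the existing memcg_caches arrays:

	id = ida_simple_get(&memcg_cache_ida, 0, MEMCG_CACHES_MAX_SIZE,
			    GFP_KERNEL);
	if (id < 0)
		return id;
	if (id < memcg_nr_cache_ids)
		return id;			/* common case, no write lock */

	down_write(&memcg_cache_ids_sem);	/* rare: grow the arrays */
	size = 2 * (id + 1);			/* clamped in the real code */
	err = memcg_update_all_caches(size);
	if (!err)
		memcg_nr_cache_ids = size;
	up_write(&memcg_cache_ids_sem);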

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            |  5 +++++
 mm/slub.c                  | 25 +++++++++++++++++++++++--
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d275c72c4f8e..9285f14965b1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1379,6 +1379,7 @@ extern struct workqueue_struct *memcg_kmem_cache_wq;
 extern int memcg_nr_cache_ids;
 void memcg_get_cache_ids(void);
 void memcg_put_cache_ids(void);
+int  memcg_tryget_cache_ids(void);
 
 /*
  * Helper macro to loop through all memcg-specific caches. Callers must still
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5beea03dd58a..9fa8535ff72a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -279,6 +279,11 @@ void memcg_get_cache_ids(void)
 	down_read(&memcg_cache_ids_sem);
 }
 
+int memcg_tryget_cache_ids(void)
+{
+	return down_read_trylock(&memcg_cache_ids_sem);
+}
+
 void memcg_put_cache_ids(void)
 {
 	up_read(&memcg_cache_ids_sem);
diff --git a/mm/slub.c b/mm/slub.c
index 44cb5215c17f..cf2114ca27f7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -34,6 +34,7 @@
 #include <linux/prefetch.h>
 #include <linux/memcontrol.h>
 #include <linux/random.h>
+#include <linux/delay.h>
 
 #include <trace/events/kmem.h>
 
@@ -5572,6 +5573,7 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 	    !list_empty(&s->memcg_params.children)) {
 		struct kmem_cache *c, **pcaches;
 		int idx, max, cnt = 0;
+		int retries = 3;
 		size_t size, old = s->max_attr_size;
 		struct memcg_cache_array *arr;
 
@@ -5585,9 +5587,28 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 			old = cmpxchg(&s->max_attr_size, size, len);
 		} while (old != size);
 
-		memcg_get_cache_ids();
-		max = memcg_nr_cache_ids;
+		/*
+		 * To avoid the following circular lock chain
+		 *
+		 *   kn->count#278 --> memcg_cache_ids_sem --> slab_mutex
+		 *
+		 * We need to use trylock to acquire memcg_cache_ids_sem.
+		 *
+		 * Since the write lock is acquired only in
+		 * memcg_alloc_cache_id() to double the size of
+		 * memcg_nr_cache_ids. The chance of successive
+		 * memcg_alloc_cache_id() calls within a short time is
+		 * very low except at the beginning where the number of
+		 * memory cgroups is low. So we retry a few times to get
+		 * the memcg_cache_ids_sem read lock.
+		 */
+		while (!memcg_tryget_cache_ids()) {
+			if (retries-- <= 0)
+				return -EBUSY;
+			msleep(100);
+		}
 
+		max = memcg_nr_cache_ids;
 		pcaches = kmalloc_array(max, sizeof(void *), GFP_KERNEL);
 		if (!pcaches) {
 			memcg_put_cache_ids();
-- 
2.18.1


* [PATCH v2 4/4] mm/slub: Fix sysfs shrink circular locking dependency
  2020-04-27 23:56 [PATCH v2 0/4] mm/slub: Fix sysfs circular locking dependency Waiman Long
                   ` (2 preceding siblings ...)
  2020-04-27 23:56 ` [PATCH v2 3/4] mm/slub: Fix another circular locking dependency " Waiman Long
@ 2020-04-27 23:56 ` Waiman Long
  2020-04-28  0:13   ` Qian Cai
  2020-05-17  1:57   ` Qian Cai
  2020-05-17  1:54 ` [PATCH v2 0/4] mm/slub: Fix sysfs " Qian Cai
  4 siblings, 2 replies; 16+ messages in thread
From: Waiman Long @ 2020-04-27 23:56 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov
  Cc: linux-mm, linux-kernel, cgroups, Juri Lelli, Qian Cai, Waiman Long

A lockdep splat is observed by echoing "1" to the shrink sysfs file
and then shutting down the system:

[  167.473392] Chain exists of:
[  167.473392]   kn->count#279 --> mem_hotplug_lock.rw_sem --> slab_mutex
[  167.473392]
[  167.484323]  Possible unsafe locking scenario:
[  167.484323]
[  167.490273]        CPU0                    CPU1
[  167.494825]        ----                    ----
[  167.499376]   lock(slab_mutex);
[  167.502530]                                lock(mem_hotplug_lock.rw_sem);
[  167.509356]                                lock(slab_mutex);
[  167.515044]   lock(kn->count#279);
[  167.518462]
[  167.518462]  *** DEADLOCK ***

This is because of the get_online_cpus() and get_online_mems() calls in
kmem_cache_shrink() invoked via the shrink sysfs file. To fix that, we
have to use trylock to get the memory and cpu hotplug read locks. Since
hotplug events are rare, it should be fine to refuse a kmem cache
shrink operation when a hotplug event is in progress.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/memory_hotplug.h |  2 ++
 mm/memory_hotplug.c            |  5 +++++
 mm/slub.c                      | 19 +++++++++++++++----
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 93d9ada74ddd..4ec4b0a2f0fa 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -231,6 +231,7 @@ extern void get_page_bootmem(unsigned long ingo, struct page *page,
 
 void get_online_mems(void);
 void put_online_mems(void);
+int  tryget_online_mems(void);
 
 void mem_hotplug_begin(void);
 void mem_hotplug_done(void);
@@ -274,6 +275,7 @@ static inline int try_online_node(int nid)
 
 static inline void get_online_mems(void) {}
 static inline void put_online_mems(void) {}
+static inline int  tryget_online_mems(void) { return 1; }
 
 static inline void mem_hotplug_begin(void) {}
 static inline void mem_hotplug_done(void) {}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index fc0aad0bc1f5..38f9ccec9259 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -59,6 +59,11 @@ void get_online_mems(void)
 	percpu_down_read(&mem_hotplug_lock);
 }
 
+int tryget_online_mems(void)
+{
+	return percpu_down_read_trylock(&mem_hotplug_lock);
+}
+
 void put_online_mems(void)
 {
 	percpu_up_read(&mem_hotplug_lock);
diff --git a/mm/slub.c b/mm/slub.c
index cf2114ca27f7..c4977ac3271b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5343,10 +5343,20 @@ static ssize_t shrink_show(struct kmem_cache *s, char *buf)
 static ssize_t shrink_store(struct kmem_cache *s,
 			const char *buf, size_t length)
 {
-	if (buf[0] == '1')
-		kmem_cache_shrink(s);
-	else
+	if (buf[0] != '1')
 		return -EINVAL;
+
+	if (!cpus_read_trylock())
+		return -EBUSY;
+	if (!tryget_online_mems()) {
+		length = -EBUSY;
+		goto cpus_unlock_out;
+	}
+	kasan_cache_shrink(s);
+	__kmem_cache_shrink(s);
+	put_online_mems();
+cpus_unlock_out:
+	cpus_read_unlock();
 	return length;
 }
 SLAB_ATTR(shrink);
@@ -5654,7 +5664,8 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 
 		for (idx = 0; idx < cnt; idx++) {
 			c = pcaches[idx];
-			attribute->store(c, buf, len);
+			if (attribute->store(c, buf, len) == -EBUSY)
+				err = -EBUSY;
 			percpu_ref_put(&c->memcg_params.refcnt);
 		}
 		kfree(pcaches);
-- 
2.18.1


* Re: [PATCH v2 4/4] mm/slub: Fix sysfs shrink circular locking dependency
  2020-04-27 23:56 ` [PATCH v2 4/4] mm/slub: Fix sysfs shrink circular locking dependency Waiman Long
@ 2020-04-28  0:13   ` Qian Cai
  2020-04-28  1:39     ` Waiman Long
  2020-05-17  1:57   ` Qian Cai
  1 sibling, 1 reply; 16+ messages in thread
From: Qian Cai @ 2020-04-28  0:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-kernel, cgroups, Juri Lelli



> On Apr 27, 2020, at 7:56 PM, Waiman Long <longman@redhat.com> wrote:
> 
> A lockdep splat is observed by echoing "1" to the shrink sysfs file
> and then shutting down the system:
> 
> [  167.473392] Chain exists of:
> [  167.473392]   kn->count#279 --> mem_hotplug_lock.rw_sem --> slab_mutex
> [  167.473392]
> [  167.484323]  Possible unsafe locking scenario:
> [  167.484323]
> [  167.490273]        CPU0                    CPU1
> [  167.494825]        ----                    ----
> [  167.499376]   lock(slab_mutex);
> [  167.502530]                                lock(mem_hotplug_lock.rw_sem);
> [  167.509356]                                lock(slab_mutex);
> [  167.515044]   lock(kn->count#279);
> [  167.518462]
> [  167.518462]  *** DEADLOCK ***
> 
> It is because of the get_online_cpus() and get_online_mems() calls in
> kmem_cache_shrink() invoked via the shrink sysfs file. To fix that, we
> have to use trylock to get the memory and cpu hotplug read locks. Since
> hotplug events are rare, it should be fine to refuse a kmem caches
> shrink operation when some hotplug events are in progress.

I don’t understand how trylock could prevent a splat. The fundamental issue is that in the sysfs slab store case, the locking order (once the trylock succeeds) is,

kn->count —> cpu/memory_hotplug

But we have the existing reverse chain everywhere.

cpu/memory_hotplug —> slab_mutex —> kn->count



* Re: [PATCH v2 4/4] mm/slub: Fix sysfs shrink circular locking dependency
  2020-04-28  0:13   ` Qian Cai
@ 2020-04-28  1:39     ` Waiman Long
  2020-04-28  2:11       ` Qian Cai
  0 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2020-04-28  1:39 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-kernel, cgroups, Juri Lelli

On 4/27/20 8:13 PM, Qian Cai wrote:
>
>> On Apr 27, 2020, at 7:56 PM, Waiman Long <longman@redhat.com> wrote:
>>
>> A lockdep splat is observed by echoing "1" to the shrink sysfs file
>> and then shutting down the system:
>>
>> [  167.473392] Chain exists of:
>> [  167.473392]   kn->count#279 --> mem_hotplug_lock.rw_sem --> slab_mutex
>> [  167.473392]
>> [  167.484323]  Possible unsafe locking scenario:
>> [  167.484323]
>> [  167.490273]        CPU0                    CPU1
>> [  167.494825]        ----                    ----
>> [  167.499376]   lock(slab_mutex);
>> [  167.502530]                                lock(mem_hotplug_lock.rw_sem);
>> [  167.509356]                                lock(slab_mutex);
>> [  167.515044]   lock(kn->count#279);
>> [  167.518462]
>> [  167.518462]  *** DEADLOCK ***
>>
>> It is because of the get_online_cpus() and get_online_mems() calls in
>> kmem_cache_shrink() invoked via the shrink sysfs file. To fix that, we
>> have to use trylock to get the memory and cpu hotplug read locks. Since
>> hotplug events are rare, it should be fine to refuse a kmem caches
>> shrink operation when some hotplug events are in progress.
> I don’t understand how trylock could prevent a splat. The fundamental issue is that in sysfs slab store case, the locking order (once trylock succeed) is,
>
> kn->count —> cpu/memory_hotplug
>
> But we have the existing reverse chain everywhere.
>
> cpu/memory_hotplug —> slab_mutex —> kn->count
>
The sequence that was prevented by this patch is "kn->count --> 
mem_hotplug_lock.rwsem". This sequence isn't directly in the splat. Once 
this link is broken, the 3-lock circular loop cannot be formed. Maybe I 
should modify the commit log to make this point more clear.

Cheers,
Longman



* Re: [PATCH v2 4/4] mm/slub: Fix sysfs shrink circular locking dependency
  2020-04-28  1:39     ` Waiman Long
@ 2020-04-28  2:11       ` Qian Cai
  2020-04-28 14:06         ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: Qian Cai @ 2020-04-28  2:11 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-kernel, cgroups, Juri Lelli



> On Apr 27, 2020, at 9:39 PM, Waiman Long <longman@redhat.com> wrote:
> 
> The sequence that was prevented by this patch is "kn->count --> mem_hotplug_lock.rwsem". This sequence isn't directly in the splat. Once this link is broken, the 3-lock circular loop cannot be formed. Maybe I should modify the commit log to make this point more clear.

I don’t know what you are talking about. Once the trylock has succeeded once, you will have kn->count —> cpu/memory_hotplug_lock.

* Re: [PATCH v2 4/4] mm/slub: Fix sysfs shrink circular locking dependency
  2020-04-28  2:11       ` Qian Cai
@ 2020-04-28 14:06         ` Waiman Long
  2020-04-29  2:52           ` Qian Cai
  2020-05-17  1:46           ` Qian Cai
  0 siblings, 2 replies; 16+ messages in thread
From: Waiman Long @ 2020-04-28 14:06 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-kernel, cgroups, Juri Lelli

On 4/27/20 10:11 PM, Qian Cai wrote:
>
>> On Apr 27, 2020, at 9:39 PM, Waiman Long <longman@redhat.com> wrote:
>>
>> The sequence that was prevented by this patch is "kn->count --> mem_hotplug_lock.rwsem". This sequence isn't directly in the splat. Once this link is broken, the 3-lock circular loop cannot be formed. Maybe I should modify the commit log to make this point more clear.
> I don’t know what you are talking about. Once trylock succeed once, you will have kn->count —> cpu/memory_hotplug_lock.
>
Trylock is handled differently from lockdep's perspective as a trylock
can fail. When the trylock succeeds, the critical section is executed.
As long as it doesn't try to acquire another lock in the circular
chain, the execution will finish at some point and release the lock.
On the other hand, if another task already holds all those locks, the
trylock will fail and the held locks should be released. Again, no
deadlock will happen.
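
As a concrete (purely hypothetical) two-task example of why the cycle
cannot close once one side uses a trylock:

	/* task A */
	mutex_lock(&X);
	if (!mutex_trylock(&Y))		/* task B already holds Y */
		goto out;		/* back off instead of blocking */
	/* ... */
	mutex_unlock(&Y);
out:
	mutex_unlock(&X);

	/* task B */
	mutex_lock(&Y);
	mutex_lock(&X);			/* blocks only until A drops X */

A's trylock simply fails while B holds Y, A then releases X, and B can
acquire X, so neither task blocks forever.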

Regards,
Longman


* Re: [PATCH v2 4/4] mm/slub: Fix sysfs shrink circular locking dependency
  2020-04-28 14:06         ` Waiman Long
@ 2020-04-29  2:52           ` Qian Cai
  2020-05-17  1:46           ` Qian Cai
  1 sibling, 0 replies; 16+ messages in thread
From: Qian Cai @ 2020-04-29  2:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-kernel, cgroups, Juri Lelli



> On Apr 28, 2020, at 10:06 AM, Waiman Long <longman@redhat.com> wrote:
> 
> On 4/27/20 10:11 PM, Qian Cai wrote:
>> 
>>> On Apr 27, 2020, at 9:39 PM, Waiman Long <longman@redhat.com> wrote:
>>> 
>>> The sequence that was prevented by this patch is "kn->count --> mem_hotplug_lock.rwsem". This sequence isn't directly in the splat. Once this link is broken, the 3-lock circular loop cannot be formed. Maybe I should modify the commit log to make this point more clear.
>> I don’t know what you are talking about. Once trylock succeed once, you will have kn->count —> cpu/memory_hotplug_lock.
>> 
> Trylock is handled differently from lockdep's perspective as trylock can failed. When trylock succeeds, the critical section is executed. As long as it doesn't try to acquire another lock in the circular chain, the execution will finish at some point and release the lock. On the other hand, if another task has already held all those locks, the trylock will fail and held locks should be released. Again, no deadlock will happen.

So once,

CPU0 (trylock succeeds):
kn->count —> cpu/memory_hotplug_lock.

Did you mean that lockdep will not record this existing chain?

If it did, then later, are you still sure that CPU1 (via the memcg path below) still cannot trigger a splat, just because lockdep will be able to tell that those are non-exclusive (cpu/memory_hotplug_lock) locks instead?

 cpu/memory_hotplug_lock -> kn->count

[  290.805818] -> #3 (kn->count#86){++++}-{0:0}:
[  290.811954]        __kernfs_remove+0x455/0x4c0
[  290.816428]        kernfs_remove+0x23/0x40
[  290.820554]        sysfs_remove_dir+0x74/0x80
[  290.824947]        kobject_del+0x57/0xa0
[  290.828905]        sysfs_slab_unlink+0x1c/0x20
[  290.833377]        shutdown_cache+0x15d/0x1c0
[  290.837964]        kmemcg_cache_shutdown_fn+0xe/0x20
[  290.842963]        kmemcg_workfn+0x35/0x50   <—— cpu/memory_hotplug_lock
[  290.847095]        process_one_work+0x57e/0xb90
[  290.851658]        worker_thread+0x63/0x5b0
[  290.855872]        kthread+0x1f7/0x220
[  290.859653]        ret_from_fork+0x27/0x50

* Re: [PATCH v2 4/4] mm/slub: Fix sysfs shrink circular locking dependency
  2020-04-28 14:06         ` Waiman Long
  2020-04-29  2:52           ` Qian Cai
@ 2020-05-17  1:46           ` Qian Cai
  1 sibling, 0 replies; 16+ messages in thread
From: Qian Cai @ 2020-05-17  1:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-kernel, cgroups, Juri Lelli



> On Apr 28, 2020, at 10:07 AM, Waiman Long <longman@redhat.com> wrote:
> 
> Trylock is handled differently from lockdep's perspective as trylock can failed. When trylock succeeds, the critical section is executed. As long as it doesn't try to acquire another lock in the circular chain, the execution will finish at some point and release the lock. On the other hand, if another task has already held all those locks, the trylock will fail and held locks should be released. Again, no deadlock will happen.

Ok, I can see that the comment in validate_chain() specifically mentions,

“Trylock needs to maintain the stack of held locks, but it does not add new dependencies, because trylock can be done in any order.”

So, I agree this trylock trick could really work. Besides, I don’t know of any better way to fix this.

* Re: [PATCH v2 0/4] mm/slub: Fix sysfs circular locking dependency
  2020-04-27 23:56 [PATCH v2 0/4] mm/slub: Fix sysfs circular locking dependency Waiman Long
                   ` (3 preceding siblings ...)
  2020-04-27 23:56 ` [PATCH v2 4/4] mm/slub: Fix sysfs shrink circular locking dependency Waiman Long
@ 2020-05-17  1:54 ` Qian Cai
  4 siblings, 0 replies; 16+ messages in thread
From: Qian Cai @ 2020-05-17  1:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-kernel, cgroups, Juri Lelli



> On Apr 27, 2020, at 7:56 PM, Waiman Long <longman@redhat.com> wrote:
> 
> v2:
> - Use regular cmpxchg() instead of x86-only try_cmpxchg() in patch 2.
> - Add patches 3 and 4 to fix circular locking dependency showing up
>   at shutdown time.
> 
> With lockdep enabled, issuing the following command to the slub sysfs
> files will cause splat about circular locking dependency to show up
> either immediately afterwards or at shutdown time.
> 
> # echo 1 > validate
> # echo 1 > shrink
> 
> This patchset fixes these lockdep splats by replacing slab_mutex with
> memcg_cache_ids_sem as well as changing some of the lock operations
> with trylock.

For the whole series, feel free to use,

Tested-by: Qian Cai <cai@lca.pw>

* Re: [PATCH v2 4/4] mm/slub: Fix sysfs shrink circular locking dependency
  2020-04-27 23:56 ` [PATCH v2 4/4] mm/slub: Fix sysfs shrink circular locking dependency Waiman Long
  2020-04-28  0:13   ` Qian Cai
@ 2020-05-17  1:57   ` Qian Cai
  1 sibling, 0 replies; 16+ messages in thread
From: Qian Cai @ 2020-05-17  1:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-kernel, cgroups, Juri Lelli



> On Apr 27, 2020, at 7:56 PM, Waiman Long <longman@redhat.com> wrote:
> 
> A lockdep splat is observed by echoing "1" to the shrink sysfs file
> and then shutting down the system:
> 
> [  167.473392] Chain exists of:
> [  167.473392]   kn->count#279 --> mem_hotplug_lock.rw_sem --> slab_mutex
> [  167.473392]
> [  167.484323]  Possible unsafe locking scenario:
> [  167.484323]
> [  167.490273]        CPU0                    CPU1
> [  167.494825]        ----                    ----
> [  167.499376]   lock(slab_mutex);
> [  167.502530]                                lock(mem_hotplug_lock.rw_sem);
> [  167.509356]                                lock(slab_mutex);
> [  167.515044]   lock(kn->count#279);
> [  167.518462]
> [  167.518462]  *** DEADLOCK ***
> 
> It is because of the get_online_cpus() and get_online_mems() calls in
> kmem_cache_shrink() invoked via the shrink sysfs file. To fix that, we
> have to use trylock to get the memory and cpu hotplug read locks. Since
> hotplug events are rare, it should be fine to refuse a kmem caches
> shrink operation when some hotplug events are in progress.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Feel free to use,

Reviewed-by: Qian Cai <cai@lca.pw>

* Re: [PATCH v2 3/4] mm/slub: Fix another circular locking dependency in slab_attr_store()
  2020-04-27 23:56 ` [PATCH v2 3/4] mm/slub: Fix another circular locking dependency " Waiman Long
@ 2020-05-17  2:19   ` Qian Cai
  2020-05-18 22:05     ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: Qian Cai @ 2020-05-17  2:19 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-kernel, cgroups, Juri Lelli



> On Apr 27, 2020, at 7:56 PM, Waiman Long <longman@redhat.com> wrote:
> 
> It turns out that switching from slab_mutex to memcg_cache_ids_sem in
> slab_attr_store() does not completely eliminate circular locking dependency
> as shown by the following lockdep splat when the system is shut down:
> 
> [ 2095.079697] Chain exists of:
> [ 2095.079697]   kn->count#278 --> memcg_cache_ids_sem --> slab_mutex
> [ 2095.079697]
> [ 2095.090278]  Possible unsafe locking scenario:
> [ 2095.090278]
> [ 2095.096227]        CPU0                    CPU1
> [ 2095.100779]        ----                    ----
> [ 2095.105331]   lock(slab_mutex);
> [ 2095.108486]                                lock(memcg_cache_ids_sem);
> [ 2095.114961]                                lock(slab_mutex);
> [ 2095.120649]   lock(kn->count#278);
> [ 2095.124068]
> [ 2095.124068]  *** DEADLOCK ***

Can you show the full splat?

> 
> To eliminate this possibility, we have to use trylock to acquire
> memcg_cache_ids_sem. Unlikely slab_mutex which can be acquired in
> many places, the memcg_cache_ids_sem write lock is only acquired
> in memcg_alloc_cache_id() to double the size of memcg_nr_cache_ids.
> So the chance of successive calls to memcg_alloc_cache_id() within
> a short time is pretty low. As a result, we can retry the read lock
> acquisition a few times if the first attempt fails.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

The code looks a bit hacky and probably not that robust. Since this only affects the shutdown path, which is not all that important unless lockdep is enabled, maybe you could drop this single patch for now until there is a better solution?

> ---
> include/linux/memcontrol.h |  1 +
> mm/memcontrol.c            |  5 +++++
> mm/slub.c                  | 25 +++++++++++++++++++++++--
> 3 files changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d275c72c4f8e..9285f14965b1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1379,6 +1379,7 @@ extern struct workqueue_struct *memcg_kmem_cache_wq;
> extern int memcg_nr_cache_ids;
> void memcg_get_cache_ids(void);
> void memcg_put_cache_ids(void);
> +int  memcg_tryget_cache_ids(void);
> 
> /*
>  * Helper macro to loop through all memcg-specific caches. Callers must still
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5beea03dd58a..9fa8535ff72a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -279,6 +279,11 @@ void memcg_get_cache_ids(void)
>    down_read(&memcg_cache_ids_sem);
> }
> 
> +int memcg_tryget_cache_ids(void)
> +{
> +    return down_read_trylock(&memcg_cache_ids_sem);
> +}
> +
> void memcg_put_cache_ids(void)
> {
>    up_read(&memcg_cache_ids_sem);
> diff --git a/mm/slub.c b/mm/slub.c
> index 44cb5215c17f..cf2114ca27f7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -34,6 +34,7 @@
> #include <linux/prefetch.h>
> #include <linux/memcontrol.h>
> #include <linux/random.h>
> +#include <linux/delay.h>
> 
> #include <trace/events/kmem.h>
> 
> @@ -5572,6 +5573,7 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>        !list_empty(&s->memcg_params.children)) {
>        struct kmem_cache *c, **pcaches;
>        int idx, max, cnt = 0;
> +        int retries = 3;
>        size_t size, old = s->max_attr_size;
>        struct memcg_cache_array *arr;
> 
> @@ -5585,9 +5587,28 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>            old = cmpxchg(&s->max_attr_size, size, len);
>        } while (old != size);
> 
> -        memcg_get_cache_ids();
> -        max = memcg_nr_cache_ids;
> +        /*
> +         * To avoid the following circular lock chain
> +         *
> +         *   kn->count#278 --> memcg_cache_ids_sem --> slab_mutex
> +         *
> +         * We need to use trylock to acquire memcg_cache_ids_sem.
> +         *
> +         * Since the write lock is acquired only in
> +         * memcg_alloc_cache_id() to double the size of
> +         * memcg_nr_cache_ids. The chance of successive
> +         * memcg_alloc_cache_id() calls within a short time is
> +         * very low except at the beginning where the number of
> +         * memory cgroups is low. So we retry a few times to get
> +         * the memcg_cache_ids_sem read lock.
> +         */
> +        while (!memcg_tryget_cache_ids()) {
> +            if (retries-- <= 0)
> +                return -EBUSY;
> +            msleep(100);
> +        }
> 
> +        max = memcg_nr_cache_ids;
>        pcaches = kmalloc_array(max, sizeof(void *), GFP_KERNEL);
>        if (!pcaches) {
>            memcg_put_cache_ids();

* Re: [PATCH v2 3/4] mm/slub: Fix another circular locking dependency in slab_attr_store()
  2020-05-17  2:19   ` Qian Cai
@ 2020-05-18 22:05     ` Waiman Long
  2020-05-19  2:37       ` Qian Cai
  0 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2020-05-18 22:05 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm, linux-kernel, cgroups, Juri Lelli

On 5/16/20 10:19 PM, Qian Cai wrote:
>
>> On Apr 27, 2020, at 7:56 PM, Waiman Long <longman@redhat.com> wrote:
>>
>> It turns out that switching from slab_mutex to memcg_cache_ids_sem in
>> slab_attr_store() does not completely eliminate circular locking dependency
>> as shown by the following lockdep splat when the system is shut down:
>>
>> [ 2095.079697] Chain exists of:
>> [ 2095.079697]   kn->count#278 --> memcg_cache_ids_sem --> slab_mutex
>> [ 2095.079697]
>> [ 2095.090278]  Possible unsafe locking scenario:
>> [ 2095.090278]
>> [ 2095.096227]        CPU0                    CPU1
>> [ 2095.100779]        ----                    ----
>> [ 2095.105331]   lock(slab_mutex);
>> [ 2095.108486]                                lock(memcg_cache_ids_sem);
>> [ 2095.114961]                                lock(slab_mutex);
>> [ 2095.120649]   lock(kn->count#278);
>> [ 2095.124068]
>> [ 2095.124068]  *** DEADLOCK ***
> Can you show the full splat?
>
>> To eliminate this possibility, we have to use trylock to acquire
>> memcg_cache_ids_sem. Unlikely slab_mutex which can be acquired in
>> many places, the memcg_cache_ids_sem write lock is only acquired
>> in memcg_alloc_cache_id() to double the size of memcg_nr_cache_ids.
>> So the chance of successive calls to memcg_alloc_cache_id() within
>> a short time is pretty low. As a result, we can retry the read lock
>> acquisition a few times if the first attempt fails.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> The code looks a bit hacky and probably not that robust. Since it is the shutdown path which is not all that important without lockdep, maybe you could drop this single patch for now until there is a better solution?

That is true. Unlike with slab_mutex, the chance of failing to
acquire a read lock on memcg_cache_ids_sem is pretty low. Maybe just
print_once a warning if that happens.
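
Something like the following untested sketch of that idea (the exact
message text is only a placeholder):

	if (!memcg_tryget_cache_ids()) {
		pr_warn_once("%s: memcg_cache_ids_sem contended, skipping memcg propagation\n",
			     __func__);
		return err;	/* the root cache was already updated */
	}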

Thanks,
Longman


* Re: [PATCH v2 3/4] mm/slub: Fix another circular locking dependency in slab_attr_store()
  2020-05-18 22:05     ` Waiman Long
@ 2020-05-19  2:37       ` Qian Cai
  0 siblings, 0 replies; 16+ messages in thread
From: Qian Cai @ 2020-05-19  2:37 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Linux-MM, Linux Kernel Mailing List, Cgroups, Juri Lelli

On Mon, May 18, 2020 at 6:05 PM Waiman Long <longman@redhat.com> wrote:
>
> On 5/16/20 10:19 PM, Qian Cai wrote:
> >
> >> On Apr 27, 2020, at 7:56 PM, Waiman Long <longman@redhat.com> wrote:
> >>
> >> It turns out that switching from slab_mutex to memcg_cache_ids_sem in
> >> slab_attr_store() does not completely eliminate circular locking dependency
> >> as shown by the following lockdep splat when the system is shut down:
> >>
> >> [ 2095.079697] Chain exists of:
> >> [ 2095.079697]   kn->count#278 --> memcg_cache_ids_sem --> slab_mutex
> >> [ 2095.079697]
> >> [ 2095.090278]  Possible unsafe locking scenario:
> >> [ 2095.090278]
> >> [ 2095.096227]        CPU0                    CPU1
> >> [ 2095.100779]        ----                    ----
> >> [ 2095.105331]   lock(slab_mutex);
> >> [ 2095.108486]                                lock(memcg_cache_ids_sem);
> >> [ 2095.114961]                                lock(slab_mutex);
> >> [ 2095.120649]   lock(kn->count#278);
> >> [ 2095.124068]
> >> [ 2095.124068]  *** DEADLOCK ***
> > Can you show the full splat?
> >
> >> To eliminate this possibility, we have to use trylock to acquire
> >> memcg_cache_ids_sem. Unlikely slab_mutex which can be acquired in
> >> many places, the memcg_cache_ids_sem write lock is only acquired
> >> in memcg_alloc_cache_id() to double the size of memcg_nr_cache_ids.
> >> So the chance of successive calls to memcg_alloc_cache_id() within
> >> a short time is pretty low. As a result, we can retry the read lock
> >> acquisition a few times if the first attempt fails.
> >>
> >> Signed-off-by: Waiman Long <longman@redhat.com>
> > The code looks a bit hacky and probably not that robust. Since it is the shutdown path which is not all that important without lockdep, maybe you could drop this single patch for now until there is a better solution?
>
> That is true. Unlike using the slab_mutex, the chance of failing to
> acquire a read lock on memcg_cache_ids_sem is pretty low. Maybe just
> print_once a warning if that happen.

That seems cleaner. If you are going to repost this series, you could
also mention that the series will fix slabinfo triggering a splat as
well.
