* [PATCH 0/3] locking/mutex: Add mutex_timed_lock() to solve potential deadlock problems
@ 2020-02-10 20:46 Waiman Long
  2020-02-10 20:46 ` [PATCH 1/3] locking/mutex: Add mutex_timed_lock() Waiman Long
  ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Waiman Long @ 2020-02-10 20:46 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Christoph Lameter,
      Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton
  Cc: linux-kernel, linux-mm, Waiman Long

When writing to some of the sysctl parameters or sysfs files, locks may
be taken in an order different from other parts of the kernel. As a
result, lockdep may report a circular locking dependency, indicating
that a deadlock may actually happen in some corner cases. Patch 3 shows
an example of this in the sysfs files of the slub allocator.

It is typically hard to change the locking order in many cases. One
possible solution is to use a trylock loop, but that is considered
inelegant and makes it hard to control the actual wait time.

An alternative solution proposed by this patchset is to add a new
mutex_timed_lock() call that takes an additional timeout argument and
returns an error code if the timeout expires. The use of this new API
prevents deadlock from happening while still allowing the task to wait
a sufficient period of time before giving up.

Since the goal of this new API is to prevent deadlock, timeout accuracy
is not high on the priority list. A coarse-grained and easily understood
millisecond-based integer timeout argument is used. That is somewhat
different from the rt_mutex_timed_lock() function, where a more precise
but complex hrtimer_sleeper argument is used.
On a 4-socket 128-thread x86-64 system running a 128-thread mutex
locking microbenchmark with a 1ms timeout, the output of the
microbenchmark was:

  Running locktest with mutex [runtime = 10s, load = 1]
  Threads = 128, Min/Mean/Max = 247,667/601,134/1,621,145
  Threads = 128, Total Rate = 7,694 kop/s; Percpu Rate = 60 kop/s

The corresponding mutex locking events were:

  mutex_handoff=2032
  mutex_optspin=3486239
  mutex_sleep=2047
  mutex_slowpath=3626
  mutex_timeout=294

Waiman Long (3):
  locking/mutex: Add mutex_timed_lock()
  locking/mutex: Enable some lock event counters
  mm/slub: Fix potential deadlock problem in slab_attr_store()

 include/linux/mutex.h             |   3 +
 kernel/locking/lock_events_list.h |   9 +++
 kernel/locking/mutex.c            | 114 +++++++++++++++++++++++++++---
 mm/slub.c                         |   7 +-
 4 files changed, 123 insertions(+), 10 deletions(-)

-- 
2.18.1

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [PATCH 1/3] locking/mutex: Add mutex_timed_lock()
  2020-02-10 20:46 [PATCH 0/3] locking/mutex: Add mutex_timed_lock() to solve potential deadlock problems Waiman Long
@ 2020-02-10 20:46 ` Waiman Long
  2020-02-10 20:46 ` [PATCH 2/3] locking/mutex: Enable some lock event counters Waiman Long
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2020-02-10 20:46 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Christoph Lameter,
      Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton
  Cc: linux-kernel, linux-mm, Waiman Long

There are cases where a task wants to acquire a mutex but doesn't want
to wait for an indefinite period of time; instead, it may want to bail
out after a certain amount of time. There are also cases where waiting
indefinitely can potentially lead to deadlock. Doing this with a trylock
loop is inelegant, as it increases cacheline contention and makes it
difficult to control the actual wait time.

To address this dilemma, a new mutex_timed_lock() variant is introduced
which takes an additional timeout argument in milliseconds relative to
now. With this new API, a task can wait for a given period of time and
bail out if the lock cannot be acquired within that period. In practice,
the actual wait time is likely to be longer than the given timeout.

Timeout checking isn't done during optimistic spinning, since timeout
accuracy isn't a design goal here. Therefore a short timeout value
smaller than the typical task scheduling period may be less accurate.

From the lockdep perspective, mutex_timed_lock() is treated similarly to
mutex_trylock(). With a timeout value of 0, mutex_timed_lock() behaves
like mutex_trylock().
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/mutex.h  |   3 ++
 kernel/locking/mutex.c | 105 +++++++++++++++++++++++++++++++++++++----
 2 files changed, 100 insertions(+), 8 deletions(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index aca8f36dfac9..d6b3ac84d60b 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -19,6 +19,7 @@
 #include <asm/processor.h>
 #include <linux/osq_lock.h>
 #include <linux/debug_locks.h>
+#include <linux/ktime.h>
 
 struct ww_acquire_ctx;
 
@@ -165,6 +166,8 @@ do {									\
 extern void mutex_lock(struct mutex *lock);
 extern int __must_check mutex_lock_interruptible(struct mutex *lock);
 extern int __must_check mutex_lock_killable(struct mutex *lock);
+extern int __must_check mutex_timed_lock(struct mutex *lock,
+					 unsigned int timeout_ms);
 extern void mutex_lock_io(struct mutex *lock);
 
 # define mutex_lock_nested(lock, subclass) mutex_lock(lock)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 5352ce50a97e..976179a4ed9e 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -23,6 +23,7 @@
 #include <linux/sched/signal.h>
 #include <linux/sched/rt.h>
 #include <linux/sched/wake_q.h>
+#include <linux/sched/clock.h>
 #include <linux/sched/debug.h>
 #include <linux/export.h>
 #include <linux/spinlock.h>
@@ -101,6 +102,26 @@ static inline unsigned long __owner_flags(unsigned long owner)
 	return owner & MUTEX_FLAGS;
 }
 
+/*
+ * Set up the hrtimer to fire at a future time relative to now.
+ * Return: The hrtimer_sleeper pointer if success, or NULL if it
+ *	   has timed out.
+ */
+static inline struct hrtimer_sleeper *
+mutex_setup_hrtimer(struct hrtimer_sleeper *to, ktime_t timeout)
+{
+	ktime_t curtime = ns_to_ktime(sched_clock());
+
+	if (!ktime_before(curtime, timeout))
+		return NULL;
+
+	hrtimer_init_sleeper_on_stack(to, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_set_expires_range_ns(&to->timer, timeout - curtime,
+				     current->timer_slack_ns);
+	hrtimer_start_expires(&to->timer, HRTIMER_MODE_REL);
+	return to;
+}
+
 /*
  * Trylock variant that returns the owning task on failure.
  */
@@ -925,12 +946,15 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
 static __always_inline int __sched
 __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		    struct lockdep_map *nest_lock, unsigned long ip,
-		    struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
+		    struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx,
+		    const unsigned int timeout_ms)
 {
 	struct mutex_waiter waiter;
 	bool first = false;
 	struct ww_mutex *ww;
-	int ret;
+	int ret = 0;
+	struct hrtimer_sleeper timer_sleeper, *to = NULL;
+	ktime_t timeout = 0;
 
 	might_sleep();
 
@@ -953,7 +977,18 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	}
 
 	preempt_disable();
-	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
+
+	/*
+	 * Treated as a trylock if timeout value is specified.
+	 */
+	mutex_acquire_nest(&lock->dep_map, subclass, !!timeout_ms,
+			   nest_lock, ip);
+
+	/*
+	 * The timeout value is now the end time when the timer will expire.
+	 */
+	if (timeout_ms)
+		timeout = ktime_add_ns(ms_to_ktime(timeout_ms), sched_clock());
 
 	if (__mutex_trylock(lock) ||
 	    mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx, NULL)) {
@@ -1029,6 +1064,16 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 			goto err;
 		}
 
+		if (timeout_ms) {
+			if (!to)
+				to = mutex_setup_hrtimer(&timer_sleeper,
+							 timeout);
+			if (!to || !to->task) {
+				ret = -ETIMEDOUT;
+				goto err;
+			}
+		}
+
 		spin_unlock(&lock->wait_lock);
 		schedule_preempt_disabled();
 
@@ -1082,8 +1127,13 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		ww_mutex_lock_acquired(ww, ww_ctx);
 
 	spin_unlock(&lock->wait_lock);
+out:
+	if (to) {
+		hrtimer_cancel(&to->timer);
+		destroy_hrtimer_on_stack(&to->timer);
+	}
 	preempt_enable();
-	return 0;
+	return ret;
 
 err:
 	__set_current_state(TASK_RUNNING);
@@ -1092,15 +1142,15 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	spin_unlock(&lock->wait_lock);
 	debug_mutex_free_waiter(&waiter);
 	mutex_release(&lock->dep_map, ip);
-	preempt_enable();
-	return ret;
+	goto out;
 }
 
 static int __sched
 __mutex_lock(struct mutex *lock, long state, unsigned int subclass,
	     struct lockdep_map *nest_lock, unsigned long ip)
 {
-	return __mutex_lock_common(lock, state, subclass, nest_lock, ip, NULL, false);
+	return __mutex_lock_common(lock, state, subclass, nest_lock, ip,
+				   NULL, false, 0);
 }
 
 static int __sched
@@ -1108,7 +1158,17 @@ __ww_mutex_lock(struct mutex *lock, long state, unsigned int subclass,
		struct lockdep_map *nest_lock, unsigned long ip,
		struct ww_acquire_ctx *ww_ctx)
 {
-	return __mutex_lock_common(lock, state, subclass, nest_lock, ip, ww_ctx, true);
+	return __mutex_lock_common(lock, state, subclass, nest_lock, ip,
+				   ww_ctx, true, 0);
+}
+
+static int __sched
+__mutex_timed_lock(struct mutex *lock, const unsigned int timeout_ms,
+		   unsigned int subclass, struct lockdep_map *nest_lock,
+		   unsigned long ip)
+{
+	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, subclass,
+				   nest_lock, ip, NULL, false, timeout_ms);
 }
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -1393,6 +1453,35 @@ __ww_mutex_lock_interruptible_slowpath(struct ww_mutex *lock,
 
 #endif
 
+/**
+ * mutex_timed_lock - acquire the mutex with timeout in ms
+ * @lock: mutex to be acquired
+ * @timeout_ms: timeout value in ms
+ *
+ * Return: 0 if mutex successfully acquired
+ *	   -ETIMEDOUT if timed out
+ *	   -EINTR if a signal is received.
+ *
+ * Wait to acquire the mutex atomically. The wait will expire after
+ * the given timeout value in ms.
+ */
+int __sched mutex_timed_lock(struct mutex *lock, unsigned int timeout_ms)
+{
+#ifdef CONFIG_DEBUG_MUTEXES
+	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
+#endif
+
+	if (__mutex_trylock(lock)) {
+		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+		return 0;
+	} else if (!timeout_ms) {
+		return -ETIMEDOUT;
+	}
+
+	return __mutex_timed_lock(lock, timeout_ms, 0, NULL, _RET_IP_);
+}
+EXPORT_SYMBOL(mutex_timed_lock);
+
 /**
  * mutex_trylock - try to acquire the mutex, without waiting
  * @lock: the mutex to be acquired
-- 
2.18.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* [PATCH 2/3] locking/mutex: Enable some lock event counters
  2020-02-10 20:46 [PATCH 0/3] locking/mutex: Add mutex_timed_lock() to solve potential deadlock problems Waiman Long
  2020-02-10 20:46 ` [PATCH 1/3] locking/mutex: Add mutex_timed_lock() Waiman Long
@ 2020-02-10 20:46 ` Waiman Long
  2020-02-10 20:46 ` [PATCH 3/3] mm/slub: Fix potential deadlock problem in slab_attr_store() Waiman Long
  2020-02-11 12:31 ` [PATCH 0/3] locking/mutex: Add mutex_timed_lock() to solve potential deadlock problems Peter Zijlstra
  3 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2020-02-10 20:46 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Christoph Lameter,
      Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton
  Cc: linux-kernel, linux-mm, Waiman Long

Add a preliminary set of mutex locking event counters.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h | 9 +++++++++
 kernel/locking/mutex.c            | 9 ++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 239039d0ce21..90db996f0608 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -69,3 +69,12 @@ LOCK_EVENT(rwsem_rlock_handoff)	/* # of read lock handoffs	*/
 LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired	*/
 LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions */
 LOCK_EVENT(rwsem_wlock_handoff)	/* # of write lock handoffs	*/
+
+/*
+ * Locking events for mutex
+ */
+LOCK_EVENT(mutex_slowpath)	/* # of mutex sleeping slowpaths	 */
+LOCK_EVENT(mutex_optspin)	/* # of successful optimistic spinnings */
+LOCK_EVENT(mutex_timeout)	/* # of mutex timeouts		*/
+LOCK_EVENT(mutex_sleep)		/* # of mutex sleeps		*/
+LOCK_EVENT(mutex_handoff)	/* # of mutex lock handoffs	*/
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 976179a4ed9e..1c5c252ce6fb 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -36,6 +36,7 @@
 #else
 # include "mutex.h"
 #endif
+#include "lock_events.h"
 
 void
 __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
@@ -996,10 +997,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		lock_acquired(&lock->dep_map, ip);
 		if (use_ww_ctx && ww_ctx)
 			ww_mutex_set_context_fastpath(ww, ww_ctx);
+		lockevent_inc(mutex_optspin);
 		preempt_enable();
 		return 0;
 	}
 
+	lockevent_inc(mutex_slowpath);
 	spin_lock(&lock->wait_lock);
 	/*
 	 * After waiting to acquire the wait_lock, try again.
@@ -1069,12 +1072,14 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 				to = mutex_setup_hrtimer(&timer_sleeper,
 							 timeout);
 			if (!to || !to->task) {
+				lockevent_inc(mutex_timeout);
 				ret = -ETIMEDOUT;
 				goto err;
 			}
 		}
 
 		spin_unlock(&lock->wait_lock);
+		lockevent_inc(mutex_sleep);
 		schedule_preempt_disabled();
 
 		/*
@@ -1083,8 +1088,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		 */
 		if ((use_ww_ctx && ww_ctx) || !first) {
 			first = __mutex_waiter_is_first(lock, &waiter);
-			if (first)
+			if (first) {
 				__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
+				lockevent_inc(mutex_handoff);
+			}
 		}
 
 		set_current_state(state);
-- 
2.18.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* [PATCH 3/3] mm/slub: Fix potential deadlock problem in slab_attr_store()
  2020-02-10 20:46 [PATCH 0/3] locking/mutex: Add mutex_timed_lock() to solve potential deadlock problems Waiman Long
  2020-02-10 20:46 ` [PATCH 1/3] locking/mutex: Add mutex_timed_lock() Waiman Long
  2020-02-10 20:46 ` [PATCH 2/3] locking/mutex: Enable some lock event counters Waiman Long
@ 2020-02-10 20:46 ` Waiman Long
  2020-02-10 22:03 ` Andrew Morton
  ` (2 more replies)
  2020-02-11 12:31 ` [PATCH 0/3] locking/mutex: Add mutex_timed_lock() to solve potential deadlock problems Peter Zijlstra
  3 siblings, 3 replies; 13+ messages in thread
From: Waiman Long @ 2020-02-10 20:46 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Christoph Lameter,
      Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton
  Cc: linux-kernel, linux-mm, Waiman Long

The following lockdep splat was seen:

[  176.241923] ======================================================
[  176.241924] WARNING: possible circular locking dependency detected
[  176.241926] 4.18.0-172.rt13.29.el8.x86_64+debug #1 Not tainted
[  176.241927] ------------------------------------------------------
[  176.241929] slub_cpu_partia/5371 is trying to acquire lock:
[  176.241930] ffffffffa0b83718 (slab_mutex){+.+.}, at: slab_attr_store+0x6b/0xe0
[  176.241941] but task is already holding lock:
[  176.241942] ffff88bb6d8b83c8 (kn->count#103){++++}, at: kernfs_fop_write+0x1cc/0x400
[  176.241947] which lock already depends on the new lock.
[  176.241949] the existing dependency chain (in reverse order) is:
[  176.241949] -> #1 (kn->count#103){++++}:
[  176.241955]        __kernfs_remove+0x616/0x800
[  176.241957]        kernfs_remove_by_name_ns+0x3e/0x80
[  176.241959]        sysfs_slab_add+0x1c6/0x330
[  176.241961]        __kmem_cache_create+0x15f/0x1b0
[  176.241964]        create_cache+0xe1/0x220
[  176.241966]        kmem_cache_create_usercopy+0x1a3/0x260
[  176.241967]        kmem_cache_create+0x12/0x20
[  176.242076]        mlx5_init_fs+0x18d/0x1a00 [mlx5_core]
[  176.242100]        mlx5_load_one+0x3b4/0x1730 [mlx5_core]
[  176.242124]        init_one+0x901/0x11b0 [mlx5_core]
[  176.242127]        local_pci_probe+0xd4/0x180
[  176.242131]        work_for_cpu_fn+0x51/0xa0
[  176.242133]        process_one_work+0x91a/0x1ac0
[  176.242134]        worker_thread+0x536/0xb40
[  176.242136]        kthread+0x30c/0x3d0
[  176.242140]        ret_from_fork+0x27/0x50
[  176.242140] -> #0 (slab_mutex){+.+.}:
[  176.242145]        __lock_acquire+0x22cb/0x48c0
[  176.242146]        lock_acquire+0x134/0x4c0
[  176.242148]        _mutex_lock+0x28/0x40
[  176.242150]        slab_attr_store+0x6b/0xe0
[  176.242151]        kernfs_fop_write+0x251/0x400
[  176.242154]        vfs_write+0x157/0x460
[  176.242155]        ksys_write+0xb8/0x170
[  176.242158]        do_syscall_64+0x13c/0x710
[  176.242160]        entry_SYSCALL_64_after_hwframe+0x6a/0xdf
[  176.242161] other info that might help us debug this:
[  176.242161]  Possible unsafe locking scenario:
[  176.242162]        CPU0                    CPU1
[  176.242163]        ----                    ----
[  176.242163]   lock(kn->count#103);
[  176.242165]                                lock(slab_mutex);
[  176.242166]                                lock(kn->count#103);
[  176.242167]   lock(slab_mutex);
[  176.242169]  *** DEADLOCK ***
[  176.242170] 3 locks held by slub_cpu_partia/5371:
[  176.242170]  #0: ffff888705e3a800 (sb_writers#4){.+.+}, at: vfs_write+0x31c/0x460
[  176.242174]  #1: ffff889aeec4d658 (&of->mutex){+.+.}, at: kernfs_fop_write+0x1a9/0x400
[  176.242177]  #2: ffff88bb6d8b83c8 (kn->count#103){++++}, at: kernfs_fop_write+0x1cc/0x400
[  176.242180] stack backtrace:
[  176.242183] CPU: 36 PID: 5371 Comm: slub_cpu_partia Not tainted 4.18.0-172.rt13.29.el8.x86_64+debug #1
[  176.242184] Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RDY1005C 11/22/2019
[  176.242185] Call Trace:
[  176.242190]  dump_stack+0x9a/0xf0
[  176.242193]  check_noncircular+0x317/0x3c0
[  176.242195]  ? print_circular_bug+0x1e0/0x1e0
[  176.242199]  ? native_sched_clock+0x32/0x1e0
[  176.242202]  ? sched_clock+0x5/0x10
[  176.242205]  ? sched_clock_cpu+0x238/0x340
[  176.242208]  __lock_acquire+0x22cb/0x48c0
[  176.242213]  ? trace_hardirqs_on+0x10/0x10
[  176.242215]  ? trace_hardirqs_on+0x10/0x10
[  176.242218]  lock_acquire+0x134/0x4c0
[  176.242220]  ? slab_attr_store+0x6b/0xe0
[  176.242223]  _mutex_lock+0x28/0x40
[  176.242225]  ? slab_attr_store+0x6b/0xe0
[  176.242227]  slab_attr_store+0x6b/0xe0
[  176.242229]  ? sysfs_file_ops+0x160/0x160
[  176.242230]  kernfs_fop_write+0x251/0x400
[  176.242232]  ? __sb_start_write+0x26a/0x3f0
[  176.242234]  vfs_write+0x157/0x460
[  176.242237]  ksys_write+0xb8/0x170
[  176.242239]  ? __ia32_sys_read+0xb0/0xb0
[  176.242242]  ? do_syscall_64+0xb9/0x710
[  176.242245]  do_syscall_64+0x13c/0x710
[  176.242247]  entry_SYSCALL_64_after_hwframe+0x6a/0xdf

In order to fix this circular lock dependency problem, we need to do a
mutex_trylock(&slab_mutex) in slab_attr_store() for CPU0 above. A simple
trylock, however, can easily fail, causing user dissatisfaction. So the
new mutex_timed_lock() function is used instead to do a trylock with a
100ms timeout.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/slub.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 17dc00e33115..495bec9b66ab 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
 		struct kmem_cache *c;
 
-		mutex_lock(&slab_mutex);
+		/*
+		 * Timeout after 100ms
+		 */
+		if (mutex_timed_lock(&slab_mutex, 100) < 0)
+			return -EBUSY;
+
 		if (s->max_attr_size < len)
 			s->max_attr_size = len;
 
-- 
2.18.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* Re: [PATCH 3/3] mm/slub: Fix potential deadlock problem in slab_attr_store()
  2020-02-10 20:46 ` [PATCH 3/3] mm/slub: Fix potential deadlock problem in slab_attr_store() Waiman Long
@ 2020-02-10 22:03 ` Andrew Morton
  2020-02-10 22:14 ` Waiman Long
  2020-02-13 12:22 ` kbuild test robot
  2020-02-13 16:48 ` kbuild test robot
  2 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2020-02-10 22:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Christoph Lameter,
      Pekka Enberg, David Rientjes, Joonsoo Kim, linux-kernel, linux-mm

On Mon, 10 Feb 2020 15:46:51 -0500 Waiman Long <longman@redhat.com> wrote:

> In order to fix this circular lock dependency problem, we need to do a
> mutex_trylock(&slab_mutex) in slab_attr_store() for CPU0 above. A simple
> trylock, however, is easy to fail causing user dissatisfaction. So the
> new mutex_timed_lock() function is now used to do a trylock with a
> 100ms timeout.
>
> ...
>
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
>  		struct kmem_cache *c;
>  
> -		mutex_lock(&slab_mutex);
> +		/*
> +		 * Timeout after 100ms
> +		 */
> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
> +			return -EBUSY;
> +

Oh dear.  Surely there's a better fix here.  Does slab really need to
hold slab_mutex while creating that sysfs file?  Why?

If the issue is two threads trying to create the same sysfs file
(unlikely, given that both will need to have created the same cache)
then can we add a new mutex specifically for this purpose?

Or something else.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH 3/3] mm/slub: Fix potential deadlock problem in slab_attr_store()
  2020-02-10 22:03 ` Andrew Morton
@ 2020-02-10 22:14 ` Waiman Long
  2020-02-10 23:10 ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Waiman Long @ 2020-02-10 22:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Christoph Lameter,
      Pekka Enberg, David Rientjes, Joonsoo Kim, linux-kernel, linux-mm

On 2/10/20 5:03 PM, Andrew Morton wrote:
> On Mon, 10 Feb 2020 15:46:51 -0500 Waiman Long <longman@redhat.com> wrote:
>
>> In order to fix this circular lock dependency problem, we need to do a
>> mutex_trylock(&slab_mutex) in slab_attr_store() for CPU0 above. A simple
>> trylock, however, is easy to fail causing user dissatisfaction. So the
>> new mutex_timed_lock() function is now used to do a trylock with a
>> 100ms timeout.
>>
>> ...
>>
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
>>  		struct kmem_cache *c;
>>  
>> -		mutex_lock(&slab_mutex);
>> +		/*
>> +		 * Timeout after 100ms
>> +		 */
>> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
>> +			return -EBUSY;
>> +
> Oh dear.  Surely there's a better fix here.  Does slab really need to
> hold slab_mutex while creating that sysfs file?  Why?
>
> If the issue is two threads trying to create the same sysfs file
> (unlikely, given that both will need to have created the same cache)
> then can we add a new mutex specifically for this purpose?
>
> Or something else.
>
Well, the current code iterates all the memory cgroups to set the same
value in all of them. I believe the reason for holding the slab mutex is
to make sure that the memcg hierarchy is stable during this iteration
process. Of course, we can argue about whether the attribute value should
be duplicated in all memcgs.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH 3/3] mm/slub: Fix potential deadlock problem in slab_attr_store()
  2020-02-10 22:14 ` Waiman Long
@ 2020-02-10 23:10 ` Andrew Morton
  2020-02-11 23:30 ` Waiman Long
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2020-02-10 23:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Christoph Lameter,
      Pekka Enberg, David Rientjes, Joonsoo Kim, linux-kernel, linux-mm

On Mon, 10 Feb 2020 17:14:31 -0500 Waiman Long <longman@redhat.com> wrote:

> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
> >>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
> >>  		struct kmem_cache *c;
> >>  
> >> -		mutex_lock(&slab_mutex);
> >> +		/*
> >> +		 * Timeout after 100ms
> >> +		 */
> >> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
> >> +			return -EBUSY;
> >> +
> > Oh dear.  Surely there's a better fix here.  Does slab really need to
> > hold slab_mutex while creating that sysfs file?  Why?
> >
> > If the issue is two threads trying to create the same sysfs file
> > (unlikely, given that both will need to have created the same cache)
> > then can we add a new mutex specifically for this purpose?
> >
> > Or something else.
> >
> Well, the current code iterates all the memory cgroups to set the same
> value in all of them. I believe the reason for holding the slab mutex is
> to make sure that memcg hierarchy is stable during this iteration
> process.

But that is unrelated to creation of the sysfs file?

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH 3/3] mm/slub: Fix potential deadlock problem in slab_attr_store()
  2020-02-10 23:10 ` Andrew Morton
@ 2020-02-11 23:30 ` Waiman Long
  2020-02-12 20:40 ` Waiman Long
  0 siblings, 1 reply; 13+ messages in thread
From: Waiman Long @ 2020-02-11 23:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Christoph Lameter,
      Pekka Enberg, David Rientjes, Joonsoo Kim, linux-kernel, linux-mm

On 2/10/20 6:10 PM, Andrew Morton wrote:
> On Mon, 10 Feb 2020 17:14:31 -0500 Waiman Long <longman@redhat.com> wrote:
>
>>>> --- a/mm/slub.c
>>>> +++ b/mm/slub.c
>>>> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>>>>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
>>>>  		struct kmem_cache *c;
>>>>  
>>>> -		mutex_lock(&slab_mutex);
>>>> +		/*
>>>> +		 * Timeout after 100ms
>>>> +		 */
>>>> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
>>>> +			return -EBUSY;
>>>> +
>>> Oh dear.  Surely there's a better fix here.  Does slab really need to
>>> hold slab_mutex while creating that sysfs file?  Why?
>>>
>>> If the issue is two threads trying to create the same sysfs file
>>> (unlikely, given that both will need to have created the same cache)
>>> then can we add a new mutex specifically for this purpose?
>>>
>>> Or something else.
>>>
>> Well, the current code iterates all the memory cgroups to set the same
>> value in all of them. I believe the reason for holding the slab mutex is
>> to make sure that memcg hierarchy is stable during this iteration
>> process.
> But that is unrelated to creation of the sysfs file?
>
OK, I will take a closer look at that.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH 3/3] mm/slub: Fix potential deadlock problem in slab_attr_store()
  2020-02-11 23:30 ` Waiman Long
@ 2020-02-12 20:40 ` Waiman Long
  0 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2020-02-12 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Christoph Lameter,
      Pekka Enberg, David Rientjes, Joonsoo Kim, linux-kernel, linux-mm

On 2/11/20 6:30 PM, Waiman Long wrote:
> On 2/10/20 6:10 PM, Andrew Morton wrote:
>> On Mon, 10 Feb 2020 17:14:31 -0500 Waiman Long <longman@redhat.com> wrote:
>>
>>>>> --- a/mm/slub.c
>>>>> +++ b/mm/slub.c
>>>>> @@ -5536,7 +5536,12 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>>>>>  	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
>>>>>  		struct kmem_cache *c;
>>>>>  
>>>>> -		mutex_lock(&slab_mutex);
>>>>> +		/*
>>>>> +		 * Timeout after 100ms
>>>>> +		 */
>>>>> +		if (mutex_timed_lock(&slab_mutex, 100) < 0)
>>>>> +			return -EBUSY;
>>>>> +
>>>> Oh dear.  Surely there's a better fix here.  Does slab really need to
>>>> hold slab_mutex while creating that sysfs file?  Why?
>>>>
>>>> If the issue is two threads trying to create the same sysfs file
>>>> (unlikely, given that both will need to have created the same cache)
>>>> then can we add a new mutex specifically for this purpose?
>>>>
>>>> Or something else.
>>>>
>>> Well, the current code iterates all the memory cgroups to set the same
>>> value in all of them. I believe the reason for holding the slab mutex is
>>> to make sure that memcg hierarchy is stable during this iteration
>>> process.
>> But that is unrelated to creation of the sysfs file?
>>
> OK, I will take a closer look at that.

During the creation of a sysfs file:

static int sysfs_slab_add(struct kmem_cache *s)
{
	:
	if (unmergeable) {
		/*
		 * Slabcache can never be merged so we can use the name proper.
		 * This is typically the case for debug situations. In that
		 * case we can catch duplicate names easily.
		 */
		sysfs_remove_link(&slab_kset->kobj, s->name);
		name = s->name;

the code is trying to remove the sysfs files of a cache with a
conflicting name. So it seems like kmem_cache_create() is called with a
name that has been used before. If a write to one of the sysfs files
being removed happens at the same time, a deadlock can happen.

In this particular case, the kmem_cache_create() call comes from the
mlx5_core module:

	steering->fgs_cache = kmem_cache_create("mlx5_fs_fgs",
						sizeof(struct mlx5_flow_group), 0,
						0, NULL);

Perhaps the module is somehow unloaded and then loaded again.
Unfortunately this lockdep error was only seen once, and it is hard to
find the right fix without an easy way to reproduce it. So I will table
this for now until there is a way to reproduce it.

Thanks,
Longman

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH 3/3] mm/slub: Fix potential deadlock problem in slab_attr_store()
  2020-02-10 20:46 ` [PATCH 3/3] mm/slub: Fix potential deadlock problem in slab_attr_store() Waiman Long
  2020-02-10 22:03 ` Andrew Morton
@ 2020-02-13 12:22 ` kbuild test robot
  2020-02-13 16:48 ` kbuild test robot
  2 siblings, 0 replies; 13+ messages in thread
From: kbuild test robot @ 2020-02-13 12:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: kbuild-all, Peter Zijlstra, Ingo Molnar, Will Deacon,
      Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
      Andrew Morton, linux-kernel, linux-mm, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 3367 bytes --]

Hi Waiman,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/locking/core]
[also build test ERROR on tip/auto-latest linus/master v5.6-rc1 next-20200213]
[cannot apply to arm-perf/for-next/perf]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/locking-mutex-Add-mutex_timed_lock-to-solve-potential-deadlock-problems/20200213-063401
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 41f0e29190ac9e38099a37abd1a8a4cb1dc21233
config: x86_64-randconfig-s2-20200213 (attached as .config)
compiler: gcc-7 (Debian 7.5.0-4) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/slub.c: In function 'slab_attr_store':
>> mm/slub.c:5542:7: error: implicit declaration of function 'mutex_timed_lock'; did you mean 'mutex_trylock'? [-Werror=implicit-function-declaration]
      if (mutex_timed_lock(&slab_mutex, 100) < 0)
          ^~~~~~~~~~~~~~~~
          mutex_trylock
   cc1: some warnings being treated as errors

vim +5542 mm/slub.c

  5519	
  5520	static ssize_t slab_attr_store(struct kobject *kobj,
  5521					struct attribute *attr,
  5522					const char *buf, size_t len)
  5523	{
  5524		struct slab_attribute *attribute;
  5525		struct kmem_cache *s;
  5526		int err;
  5527	
  5528		attribute = to_slab_attr(attr);
  5529		s = to_slab(kobj);
  5530	
  5531		if (!attribute->store)
  5532			return -EIO;
  5533	
  5534		err = attribute->store(s, buf, len);
  5535	#ifdef CONFIG_MEMCG
  5536		if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
  5537			struct kmem_cache *c;
  5538	
  5539			/*
  5540			 * Timeout after 100ms
  5541			 */
> 5542			if (mutex_timed_lock(&slab_mutex, 100) < 0)
  5543				return -EBUSY;
  5544	
  5545			if (s->max_attr_size < len)
  5546				s->max_attr_size = len;
  5547	
  5548			/*
  5549			 * This is a best effort propagation, so this function's return
  5550			 * value will be determined by the parent cache only. This is
  5551			 * basically because not all attributes will have a well
  5552			 * defined semantics for rollbacks - most of the actions will
  5553			 * have permanent effects.
  5554			 *
  5555			 * Returning the error value of any of the children that fail
  5556			 * is not 100 % defined, in the sense that users seeing the
  5557			 * error code won't be able to know anything about the state of
  5558			 * the cache.
  5559			 *
  5560			 * Only returning the error code for the parent cache at least
  5561			 * has well defined semantics. The cache being written to
  5562			 * directly either failed or succeeded, in which case we loop
  5563			 * through the descendants with best-effort propagation.
  5564			 */
  5565			for_each_memcg_cache(c, s)
  5566				attribute->store(c, buf, len);
  5567			mutex_unlock(&slab_mutex);
  5568		}
  5569	#endif
  5570		return err;
  5571	}
  5572	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 36057 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH 3/3] mm/slub: Fix potential deadlock problem in slab_attr_store()
  2020-02-10 20:46 ` [PATCH 3/3] mm/slub: Fix potential deadlock problem in slab_attr_store() Waiman Long
  2020-02-10 22:03   ` Andrew Morton
  2020-02-13 12:22   ` kbuild test robot
@ 2020-02-13 16:48   ` kbuild test robot
  2 siblings, 0 replies; 13+ messages in thread
From: kbuild test robot @ 2020-02-13 16:48 UTC (permalink / raw)
To: Waiman Long
Cc: kbuild-all, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-kernel, linux-mm, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 3348 bytes --]

Hi Waiman,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/locking/core]
[also build test ERROR on tip/auto-latest linus/master v5.6-rc1 next-20200213]
[cannot apply to arm-perf/for-next/perf]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/locking-mutex-Add-mutex_timed_lock-to-solve-potential-deadlock-problems/20200213-063401
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 41f0e29190ac9e38099a37abd1a8a4cb1dc21233
config: x86_64-randconfig-s2-20200213 (attached as .config)
compiler: gcc-6 (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/slub.c: In function 'slab_attr_store':
>> mm/slub.c:5542:7: error: implicit declaration of function 'mutex_timed_lock' [-Werror=implicit-function-declaration]
     if (mutex_timed_lock(&slab_mutex, 100) < 0)
         ^~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/mutex_timed_lock +5542 mm/slub.c

  5519	
  5520	static ssize_t slab_attr_store(struct kobject *kobj,
  5521					struct attribute *attr,
  5522					const char *buf, size_t len)
  5523	{
  5524		struct slab_attribute *attribute;
  5525		struct kmem_cache *s;
  5526		int err;
  5527	
  5528		attribute = to_slab_attr(attr);
  5529		s = to_slab(kobj);
  5530	
  5531		if (!attribute->store)
  5532			return -EIO;
  5533	
  5534		err = attribute->store(s, buf, len);
  5535	#ifdef CONFIG_MEMCG
  5536		if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
  5537			struct kmem_cache *c;
  5538	
  5539			/*
  5540			 * Timeout after 100ms
  5541			 */
> 5542			if (mutex_timed_lock(&slab_mutex, 100) < 0)
  5543				return -EBUSY;
  5544	
  5545			if (s->max_attr_size < len)
  5546				s->max_attr_size = len;
  5547	
  5548			/*
  5549			 * This is a best effort propagation, so this function's return
  5550			 * value will be determined by the parent cache only. This is
  5551			 * basically because not all attributes will have a well
  5552			 * defined semantics for rollbacks - most of the actions will
  5553			 * have permanent effects.
  5554			 *
  5555			 * Returning the error value of any of the children that fail
  5556			 * is not 100 % defined, in the sense that users seeing the
  5557			 * error code won't be able to know anything about the state of
  5558			 * the cache.
  5559			 *
  5560			 * Only returning the error code for the parent cache at least
  5561			 * has well defined semantics. The cache being written to
  5562			 * directly either failed or succeeded, in which case we loop
  5563			 * through the descendants with best-effort propagation.
  5564			 */
  5565			for_each_memcg_cache(c, s)
  5566				attribute->store(c, buf, len);
  5567			mutex_unlock(&slab_mutex);
  5568		}
  5569	#endif
  5570		return err;
  5571	}
  5572	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 36070 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH 0/3] locking/mutex: Add mutex_timed_lock() to solve potential deadlock problems
  2020-02-10 20:46 [PATCH 0/3] locking/mutex: Add mutex_timed_lock() to solve potential deadlock problems Waiman Long
                   ` (2 preceding siblings ...)
  2020-02-10 20:46 ` [PATCH 3/3] mm/slub: Fix potential deadlock problem in slab_attr_store() Waiman Long
@ 2020-02-11 12:31 ` Peter Zijlstra
  2020-02-11 23:31   ` Waiman Long
  3 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2020-02-11 12:31 UTC (permalink / raw)
To: Waiman Long
Cc: Ingo Molnar, Will Deacon, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, linux-kernel, linux-mm

On Mon, Feb 10, 2020 at 03:46:48PM -0500, Waiman Long wrote:
> An alternative solution proposed by this patchset is to add a new
> mutex_timed_lock() call that allows an additional timeout argument. This
> function will return an error code if timeout happens. The use of this
> new API will prevent deadlock from happening while allowing the task
> to wait a sufficient period of time before giving up.

We've always rejected timed_lock implementation because, as akpm has
already expressed, their need is disgusting.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH 0/3] locking/mutex: Add mutex_timed_lock() to solve potential deadlock problems
  2020-02-11 12:31 ` [PATCH 0/3] locking/mutex: Add mutex_timed_lock() to solve potential deadlock problems Peter Zijlstra
@ 2020-02-11 23:31 ` Waiman Long
  0 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2020-02-11 23:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Will Deacon, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, linux-kernel, linux-mm

On 2/11/20 7:31 AM, Peter Zijlstra wrote:
> On Mon, Feb 10, 2020 at 03:46:48PM -0500, Waiman Long wrote:
>> An alternative solution proposed by this patchset is to add a new
>> mutex_timed_lock() call that allows an additional timeout argument. This
>> function will return an error code if timeout happens. The use of this
>> new API will prevent deadlock from happening while allowing the task
>> to wait a sufficient period of time before giving up.
> We've always rejected timed_lock implementation because, as akpm has
> already expressed, their need is disgusting.
>
That is fine. I will see if the lock order can be changed in a way to
address the problem.

Thanks,
Longman

^ permalink raw reply	[flat|nested] 13+ messages in thread
end of thread, other threads:[~2020-02-13 16:49 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-10 20:46 [PATCH 0/3] locking/mutex: Add mutex_timed_lock() to solve potential deadlock problems Waiman Long
2020-02-10 20:46 ` [PATCH 1/3] locking/mutex: Add mutex_timed_lock() Waiman Long
2020-02-10 20:46 ` [PATCH 2/3] locking/mutex: Enable some lock event counters Waiman Long
2020-02-10 20:46 ` [PATCH 3/3] mm/slub: Fix potential deadlock problem in slab_attr_store() Waiman Long
2020-02-10 22:03   ` Andrew Morton
2020-02-10 22:14     ` Waiman Long
2020-02-10 23:10       ` Andrew Morton
2020-02-11 23:30         ` Waiman Long
2020-02-12 20:40           ` Waiman Long
2020-02-13 12:22   ` kbuild test robot
2020-02-13 16:48   ` kbuild test robot
2020-02-11 12:31 ` [PATCH 0/3] locking/mutex: Add mutex_timed_lock() to solve potential deadlock problems Peter Zijlstra
2020-02-11 23:31   ` Waiman Long