linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex
@ 2016-10-07 14:52 Peter Zijlstra
  2016-10-07 14:52 ` [PATCH -v4 1/8] locking/drm: Kill mutex trickery Peter Zijlstra
                   ` (9 more replies)
  0 siblings, 10 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-07 14:52 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Peter Zijlstra
  Cc: Chris Wilson, Daniel Vetter


Hi all,

Since you all should not be sending patches during the merge window, I figured
I should to keep you all occupied with something.

Please review, test and otherwise try to break these here patches.

I would like to get these patches into -tip (and -next) once the merge window
closes, so please spend these quiet days staring at this stuff.

Small changes only, mostly the handoff logic as suggested by Waiman last time.


---
 arch/alpha/include/asm/mutex.h           |   9 -
 arch/arc/include/asm/mutex.h             |  18 -
 arch/arm/include/asm/mutex.h             |  21 --
 arch/arm64/include/asm/Kbuild            |   1 -
 arch/avr32/include/asm/mutex.h           |   9 -
 arch/blackfin/include/asm/Kbuild         |   1 -
 arch/c6x/include/asm/mutex.h             |   6 -
 arch/cris/include/asm/mutex.h            |   9 -
 arch/frv/include/asm/mutex.h             |   9 -
 arch/h8300/include/asm/mutex.h           |   9 -
 arch/hexagon/include/asm/mutex.h         |   8 -
 arch/ia64/include/asm/mutex.h            |  90 -----
 arch/m32r/include/asm/mutex.h            |   9 -
 arch/m68k/include/asm/Kbuild             |   1 -
 arch/metag/include/asm/Kbuild            |   1 -
 arch/microblaze/include/asm/mutex.h      |   1 -
 arch/mips/include/asm/Kbuild             |   1 -
 arch/mn10300/include/asm/mutex.h         |  16 -
 arch/nios2/include/asm/mutex.h           |   1 -
 arch/openrisc/include/asm/mutex.h        |  27 --
 arch/parisc/include/asm/Kbuild           |   1 -
 arch/powerpc/include/asm/mutex.h         | 132 -------
 arch/s390/include/asm/mutex.h            |   9 -
 arch/score/include/asm/mutex.h           |   6 -
 arch/sh/include/asm/mutex-llsc.h         | 109 ------
 arch/sh/include/asm/mutex.h              |  12 -
 arch/sparc/include/asm/Kbuild            |   1 -
 arch/tile/include/asm/Kbuild             |   1 -
 arch/um/include/asm/Kbuild               |   1 -
 arch/unicore32/include/asm/mutex.h       |  20 --
 arch/x86/include/asm/mutex.h             |   5 -
 arch/x86/include/asm/mutex_32.h          | 110 ------
 arch/x86/include/asm/mutex_64.h          | 127 -------
 arch/xtensa/include/asm/mutex.h          |   9 -
 drivers/gpu/drm/i915/i915_gem_shrinker.c |  26 +-
 drivers/gpu/drm/msm/msm_gem_shrinker.c   |  23 +-
 include/asm-generic/mutex-dec.h          |  88 -----
 include/asm-generic/mutex-null.h         |  19 --
 include/asm-generic/mutex-xchg.h         | 120 -------
 include/asm-generic/mutex.h              |   9 -
 include/linux/mutex-debug.h              |  24 --
 include/linux/mutex.h                    |  46 ++-
 kernel/Kconfig.locks                     |   2 +-
 kernel/locking/mutex-debug.c             |  13 -
 kernel/locking/mutex-debug.h             |  10 -
 kernel/locking/mutex.c                   | 569 ++++++++++++++++++-------------
 kernel/locking/mutex.h                   |  26 --
 kernel/sched/core.c                      |   2 +-
 48 files changed, 364 insertions(+), 1403 deletions(-)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-10-07 14:52 [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex Peter Zijlstra
@ 2016-10-07 14:52 ` Peter Zijlstra
  2016-10-07 15:43   ` Peter Zijlstra
  2016-10-18 12:48   ` Peter Zijlstra
  2016-10-07 14:52 ` [PATCH -v4 2/8] locking/mutex: Rework mutex::owner Peter Zijlstra
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-07 14:52 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Peter Zijlstra
  Cc: Chris Wilson, Daniel Vetter, Rob Clark

[-- Attachment #1: peterz-revert-drm-i915-brainmelt.patch --]
[-- Type: text/plain, Size: 2415 bytes --]

Poking at lock internals is not cool. Since I'm going to change the
implementation this will break, take it out.

Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Rob Clark <robdclark@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/gpu/drm/i915/i915_gem_shrinker.c |   26 +++-----------------------
 drivers/gpu/drm/msm/msm_gem_shrinker.c   |   23 +++--------------------
 2 files changed, 6 insertions(+), 43 deletions(-)

--- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
+++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
@@ -35,19 +35,6 @@
 #include "i915_drv.h"
 #include "i915_trace.h"
 
-static bool mutex_is_locked_by(struct mutex *mutex, struct task_struct *task)
-{
-	if (!mutex_is_locked(mutex))
-		return false;
-
-#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_MUTEX_SPIN_ON_OWNER)
-	return mutex->owner == task;
-#else
-	/* Since UP may be pre-empted, we cannot assume that we own the lock */
-	return false;
-#endif
-}
-
 static int num_vma_bound(struct drm_i915_gem_object *obj)
 {
 	struct i915_vma *vma;
@@ -238,17 +225,10 @@ unsigned long i915_gem_shrink_all(struct
 
 static bool i915_gem_shrinker_lock(struct drm_device *dev, bool *unlock)
 {
-	if (!mutex_trylock(&dev->struct_mutex)) {
-		if (!mutex_is_locked_by(&dev->struct_mutex, current))
-			return false;
-
-		if (to_i915(dev)->mm.shrinker_no_lock_stealing)
-			return false;
-
-		*unlock = false;
-	} else
-		*unlock = true;
+	if (!mutex_trylock(&dev->struct_mutex))
+		return false;
 
+	*unlock = true;
 	return true;
 }
 
--- a/drivers/gpu/drm/msm/msm_gem_shrinker.c
+++ b/drivers/gpu/drm/msm/msm_gem_shrinker.c
@@ -18,29 +18,12 @@
 #include "msm_drv.h"
 #include "msm_gem.h"
 
-static bool mutex_is_locked_by(struct mutex *mutex, struct task_struct *task)
-{
-	if (!mutex_is_locked(mutex))
-		return false;
-
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_MUTEXES)
-	return mutex->owner == task;
-#else
-	/* Since UP may be pre-empted, we cannot assume that we own the lock */
-	return false;
-#endif
-}
-
 static bool msm_gem_shrinker_lock(struct drm_device *dev, bool *unlock)
 {
-	if (!mutex_trylock(&dev->struct_mutex)) {
-		if (!mutex_is_locked_by(&dev->struct_mutex, current))
-			return false;
-		*unlock = false;
-	} else {
-		*unlock = true;
-	}
+	if (!mutex_trylock(&dev->struct_mutex))
+		return false;
 
+	*unlock = true;
 	return true;
 }
 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH -v4 2/8] locking/mutex: Rework mutex::owner
  2016-10-07 14:52 [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex Peter Zijlstra
  2016-10-07 14:52 ` [PATCH -v4 1/8] locking/drm: Kill mutex trickery Peter Zijlstra
@ 2016-10-07 14:52 ` Peter Zijlstra
  2016-10-12 17:59   ` Davidlohr Bueso
  2016-10-13 15:18   ` Will Deacon
  2016-10-07 14:52 ` [PATCH -v4 3/8] locking/mutex: Kill arch specific code Peter Zijlstra
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-07 14:52 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Peter Zijlstra
  Cc: Chris Wilson, Daniel Vetter

[-- Attachment #1: peterz-locking-rewrite-mutex-owner.patch --]
[-- Type: text/plain, Size: 25584 bytes --]

The current mutex implementation has an atomic lock word and a
non-atomic owner field.

This disparity leads to a number of issues with the current mutex code
as it means that we can have a locked mutex without an explicit owner
(because the owner field has not been set, or already cleared).

This leads to a number of weird corner cases, esp. between the
optimistic spinning and debug code. Where the optimistic spinning
code needs the owner field updated inside the lock region, the debug
code is more relaxed because the whole lock is serialized by the
wait_lock.

Also, the spinning code itself has a few corner cases where we need to
deal with a held lock without an owner field.

Furthermore, it becomes even more of a problem when trying to fix
starvation cases in the current code. We end up stacking special case
on special case.

To solve this rework the basic mutex implementation to be a single
atomic word that contains the owner and uses the low bits for extra
state.

This matches how PI futexes and rt_mutex already work. By having the
owner an integral part of the lock state a lot of the problems
dissapear and we get a better option to deal with starvation cases,
direct owner handoff.

Changing the basic mutex does however invalidate all the arch specific
mutex code; this patch leaves that unused in-place, a later patch will
remove that.


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/mutex-debug.h  |   24 --
 include/linux/mutex.h        |   46 +++--
 kernel/locking/mutex-debug.c |   13 -
 kernel/locking/mutex-debug.h |   10 -
 kernel/locking/mutex.c       |  371 ++++++++++++++++++-------------------------
 kernel/locking/mutex.h       |   26 ---
 kernel/sched/core.c          |    2 
 7 files changed, 187 insertions(+), 305 deletions(-)

--- a/include/linux/mutex-debug.h
+++ /dev/null
@@ -1,24 +0,0 @@
-#ifndef __LINUX_MUTEX_DEBUG_H
-#define __LINUX_MUTEX_DEBUG_H
-
-#include <linux/linkage.h>
-#include <linux/lockdep.h>
-#include <linux/debug_locks.h>
-
-/*
- * Mutexes - debugging helpers:
- */
-
-#define __DEBUG_MUTEX_INITIALIZER(lockname)				\
-	, .magic = &lockname
-
-#define mutex_init(mutex)						\
-do {									\
-	static struct lock_class_key __key;				\
-									\
-	__mutex_init((mutex), #mutex, &__key);				\
-} while (0)
-
-extern void mutex_destroy(struct mutex *lock);
-
-#endif
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -18,6 +18,7 @@
 #include <linux/atomic.h>
 #include <asm/processor.h>
 #include <linux/osq_lock.h>
+#include <linux/debug_locks.h>
 
 /*
  * Simple, straightforward mutexes with strict semantics:
@@ -48,16 +49,12 @@
  *   locks and tasks (and only those tasks)
  */
 struct mutex {
-	/* 1: unlocked, 0: locked, negative: locked, possible waiters */
-	atomic_t		count;
+	atomic_long_t		owner;
 	spinlock_t		wait_lock;
-	struct list_head	wait_list;
-#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_MUTEX_SPIN_ON_OWNER)
-	struct task_struct	*owner;
-#endif
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
 	struct optimistic_spin_queue osq; /* Spinner MCS lock */
 #endif
+	struct list_head	wait_list;
 #ifdef CONFIG_DEBUG_MUTEXES
 	void			*magic;
 #endif
@@ -66,6 +63,11 @@ struct mutex {
 #endif
 };
 
+static inline struct task_struct *__mutex_owner(struct mutex *lock)
+{
+	return (struct task_struct *)(atomic_long_read(&lock->owner) & ~0x03);
+}
+
 /*
  * This is the control structure for tasks blocked on mutex,
  * which resides on the blocked task's kernel stack:
@@ -79,9 +81,20 @@ struct mutex_waiter {
 };
 
 #ifdef CONFIG_DEBUG_MUTEXES
-# include <linux/mutex-debug.h>
+
+#define __DEBUG_MUTEX_INITIALIZER(lockname)				\
+	, .magic = &lockname
+
+extern void mutex_destroy(struct mutex *lock);
+
 #else
+
 # define __DEBUG_MUTEX_INITIALIZER(lockname)
+
+static inline void mutex_destroy(struct mutex *lock) {}
+
+#endif
+
 /**
  * mutex_init - initialize the mutex
  * @mutex: the mutex to be initialized
@@ -90,14 +103,12 @@ struct mutex_waiter {
  *
  * It is not allowed to initialize an already locked mutex.
  */
-# define mutex_init(mutex) \
-do {							\
-	static struct lock_class_key __key;		\
-							\
-	__mutex_init((mutex), #mutex, &__key);		\
+#define mutex_init(mutex)						\
+do {									\
+	static struct lock_class_key __key;				\
+									\
+	__mutex_init((mutex), #mutex, &__key);				\
 } while (0)
-static inline void mutex_destroy(struct mutex *lock) {}
-#endif
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # define __DEP_MAP_MUTEX_INITIALIZER(lockname) \
@@ -107,7 +118,7 @@ static inline void mutex_destroy(struct
 #endif
 
 #define __MUTEX_INITIALIZER(lockname) \
-		{ .count = ATOMIC_INIT(1) \
+		{ .owner = ATOMIC_LONG_INIT(0) \
 		, .wait_lock = __SPIN_LOCK_UNLOCKED(lockname.wait_lock) \
 		, .wait_list = LIST_HEAD_INIT(lockname.wait_list) \
 		__DEBUG_MUTEX_INITIALIZER(lockname) \
@@ -127,7 +138,10 @@ extern void __mutex_init(struct mutex *l
  */
 static inline int mutex_is_locked(struct mutex *lock)
 {
-	return atomic_read(&lock->count) != 1;
+	/*
+	 * XXX think about spin_is_locked
+	 */
+	return __mutex_owner(lock) != NULL;
 }
 
 /*
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -73,21 +73,8 @@ void debug_mutex_unlock(struct mutex *lo
 {
 	if (likely(debug_locks)) {
 		DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-
-		if (!lock->owner)
-			DEBUG_LOCKS_WARN_ON(!lock->owner);
-		else
-			DEBUG_LOCKS_WARN_ON(lock->owner != current);
-
 		DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
 	}
-
-	/*
-	 * __mutex_slowpath_needs_to_unlock() is explicitly 0 for debug
-	 * mutexes so that we can do it here after we've verified state.
-	 */
-	mutex_clear_owner(lock);
-	atomic_set(&lock->count, 1);
 }
 
 void debug_mutex_init(struct mutex *lock, const char *name,
--- a/kernel/locking/mutex-debug.h
+++ b/kernel/locking/mutex-debug.h
@@ -27,16 +27,6 @@ extern void debug_mutex_unlock(struct mu
 extern void debug_mutex_init(struct mutex *lock, const char *name,
 			     struct lock_class_key *key);
 
-static inline void mutex_set_owner(struct mutex *lock)
-{
-	WRITE_ONCE(lock->owner, current);
-}
-
-static inline void mutex_clear_owner(struct mutex *lock)
-{
-	WRITE_ONCE(lock->owner, NULL);
-}
-
 #define spin_lock_mutex(lock, flags)			\
 	do {						\
 		struct mutex *l = container_of(lock, struct mutex, wait_lock); \
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -27,41 +27,113 @@
 #include <linux/debug_locks.h>
 #include <linux/osq_lock.h>
 
-/*
- * In the DEBUG case we are using the "NULL fastpath" for mutexes,
- * which forces all calls into the slowpath:
- */
 #ifdef CONFIG_DEBUG_MUTEXES
 # include "mutex-debug.h"
-# include <asm-generic/mutex-null.h>
-/*
- * Must be 0 for the debug case so we do not do the unlock outside of the
- * wait_lock region. debug_mutex_unlock() will do the actual unlock in this
- * case.
- */
-# undef __mutex_slowpath_needs_to_unlock
-# define  __mutex_slowpath_needs_to_unlock()	0
 #else
 # include "mutex.h"
-# include <asm/mutex.h>
 #endif
 
 void
 __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
 {
-	atomic_set(&lock->count, 1);
+	atomic_long_set(&lock->owner, 0);
 	spin_lock_init(&lock->wait_lock);
 	INIT_LIST_HEAD(&lock->wait_list);
-	mutex_clear_owner(lock);
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
 	osq_lock_init(&lock->osq);
 #endif
 
 	debug_mutex_init(lock, name, key);
 }
-
 EXPORT_SYMBOL(__mutex_init);
 
+/*
+ * @owner: contains: 'struct task_struct *' to the current lock owner,
+ * NULL means not owned. Since task_struct pointers are aligned at
+ * ARCH_MIN_TASKALIGN (which is at least sizeof(void *)), we have low
+ * bits to store extra state.
+ *
+ * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
+ */
+#define MUTEX_FLAG_WAITERS	0x01
+
+#define MUTEX_FLAGS		0x03
+
+static inline struct task_struct *__owner_task(unsigned long owner)
+{
+	return (struct task_struct *)(owner & ~MUTEX_FLAGS);
+}
+
+static inline unsigned long __owner_flags(unsigned long owner)
+{
+	return owner & MUTEX_FLAGS;
+}
+
+/*
+ * Actual trylock that will work on any unlocked state.
+ */
+static inline bool __mutex_trylock(struct mutex *lock)
+{
+	unsigned long owner, curr = (unsigned long)current;
+
+	owner = atomic_long_read(&lock->owner);
+	for (;;) { /* must loop, can race against a flag */
+		unsigned long old;
+
+		if (__owner_task(owner))
+			return false;
+
+		old = atomic_long_cmpxchg_acquire(&lock->owner, owner,
+						  curr | __owner_flags(owner));
+		if (old == owner)
+			return true;
+
+		owner = old;
+	}
+}
+
+#ifndef CONFIG_DEBUG_LOCK_ALLOC
+/*
+ * Lockdep annotations are contained to the slow paths for simplicity.
+ * There is nothing that would stop spreading the lockdep annotations outwards
+ * except more code.
+ */
+
+/*
+ * Optimistic trylock that only works in the uncontended case. Make sure to
+ * follow with a __mutex_trylock() before failing.
+ */
+static __always_inline bool __mutex_trylock_fast(struct mutex *lock)
+{
+	unsigned long curr = (unsigned long)current;
+
+	if (!atomic_long_cmpxchg_acquire(&lock->owner, 0UL, curr))
+		return true;
+
+	return false;
+}
+
+static __always_inline bool __mutex_unlock_fast(struct mutex *lock)
+{
+	unsigned long curr = (unsigned long)current;
+
+	if (atomic_long_cmpxchg_release(&lock->owner, curr, 0UL) == curr)
+		return true;
+
+	return false;
+}
+#endif
+
+static inline void __mutex_set_flag(struct mutex *lock, unsigned long flag)
+{
+	atomic_long_or(flag, &lock->owner);
+}
+
+static inline void __mutex_clear_flag(struct mutex *lock, unsigned long flag)
+{
+	atomic_long_andnot(flag, &lock->owner);
+}
+
 #ifndef CONFIG_DEBUG_LOCK_ALLOC
 /*
  * We split the mutex lock/unlock logic into separate fastpath and
@@ -69,7 +141,7 @@ EXPORT_SYMBOL(__mutex_init);
  * We also put the fastpath first in the kernel image, to make sure the
  * branch is predicted by the CPU as default-untaken.
  */
-__visible void __sched __mutex_lock_slowpath(atomic_t *lock_count);
+static void __sched __mutex_lock_slowpath(struct mutex *lock);
 
 /**
  * mutex_lock - acquire the mutex
@@ -95,14 +167,10 @@ __visible void __sched __mutex_lock_slow
 void __sched mutex_lock(struct mutex *lock)
 {
 	might_sleep();
-	/*
-	 * The locking fastpath is the 1->0 transition from
-	 * 'unlocked' into 'locked' state.
-	 */
-	__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
-	mutex_set_owner(lock);
-}
 
+	if (!__mutex_trylock_fast(lock))
+		__mutex_lock_slowpath(lock);
+}
 EXPORT_SYMBOL(mutex_lock);
 #endif
 
@@ -149,9 +217,6 @@ static __always_inline void ww_mutex_loc
 /*
  * After acquiring lock with fastpath or when we lost out in contested
  * slowpath, set ctx and wake up any waiters so they can recheck.
- *
- * This function is never called when CONFIG_DEBUG_LOCK_ALLOC is set,
- * as the fastpath and opportunistic spinning are disabled in that case.
  */
 static __always_inline void
 ww_mutex_set_context_fastpath(struct ww_mutex *lock,
@@ -176,7 +241,7 @@ ww_mutex_set_context_fastpath(struct ww_
 	/*
 	 * Check if lock is contended, if not there is nobody to wake up
 	 */
-	if (likely(atomic_read(&lock->base.count) == 0))
+	if (likely(!(atomic_long_read(&lock->base.owner) & MUTEX_FLAG_WAITERS)))
 		return;
 
 	/*
@@ -227,7 +292,7 @@ bool mutex_spin_on_owner(struct mutex *l
 	bool ret = true;
 
 	rcu_read_lock();
-	while (lock->owner == owner) {
+	while (__mutex_owner(lock) == owner) {
 		/*
 		 * Ensure we emit the owner->on_cpu, dereference _after_
 		 * checking lock->owner still matches owner. If that fails,
@@ -260,27 +325,20 @@ static inline int mutex_can_spin_on_owne
 		return 0;
 
 	rcu_read_lock();
-	owner = READ_ONCE(lock->owner);
+	owner = __mutex_owner(lock);
 	if (owner)
 		retval = owner->on_cpu;
 	rcu_read_unlock();
+
 	/*
-	 * if lock->owner is not set, the mutex owner may have just acquired
-	 * it and not set the owner yet or the mutex has been released.
+	 * If lock->owner is not set, the mutex has been released. Return true
+	 * such that we'll trylock in the spin path, which is a faster option
+	 * than the blocking slow path.
 	 */
 	return retval;
 }
 
 /*
- * Atomically try to take the lock when it is available
- */
-static inline bool mutex_try_to_acquire(struct mutex *lock)
-{
-	return !mutex_is_locked(lock) &&
-		(atomic_cmpxchg_acquire(&lock->count, 1, 0) == 1);
-}
-
-/*
  * Optimistic spinning.
  *
  * We try to spin for acquisition when we find that the lock owner
@@ -288,13 +346,6 @@ static inline bool mutex_try_to_acquire(
  * need to reschedule. The rationale is that if the lock owner is
  * running, it is likely to release the lock soon.
  *
- * Since this needs the lock owner, and this mutex implementation
- * doesn't track the owner atomically in the lock field, we need to
- * track it non-atomically.
- *
- * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
- * to serialize everything.
- *
  * The mutex spinners are queued up using MCS lock so that only one
  * spinner can compete for the mutex. However, if mutex spinning isn't
  * going to happen, there is no point in going through the lock/unlock
@@ -342,36 +393,17 @@ static bool mutex_optimistic_spin(struct
 		 * If there's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */
-		owner = READ_ONCE(lock->owner);
+		owner = __mutex_owner(lock);
 		if (owner && !mutex_spin_on_owner(lock, owner))
 			break;
 
 		/* Try to acquire the mutex if it is unlocked. */
-		if (mutex_try_to_acquire(lock)) {
-			lock_acquired(&lock->dep_map, ip);
-
-			if (use_ww_ctx) {
-				struct ww_mutex *ww;
-				ww = container_of(lock, struct ww_mutex, base);
-
-				ww_mutex_set_context_fastpath(ww, ww_ctx);
-			}
-
-			mutex_set_owner(lock);
+		if (__mutex_trylock(lock)) {
 			osq_unlock(&lock->osq);
 			return true;
 		}
 
 		/*
-		 * When there's no owner, we might have preempted between the
-		 * owner acquiring the lock and setting the owner field. If
-		 * we're an RT task that will live-lock because we won't let
-		 * the owner complete.
-		 */
-		if (!owner && (need_resched() || rt_task(task)))
-			break;
-
-		/*
 		 * The cpu_relax() call is a compiler barrier which forces
 		 * everything in this loop to be re-loaded. We don't need
 		 * memory barriers as we'll eventually observe the right
@@ -406,8 +438,7 @@ static bool mutex_optimistic_spin(struct
 }
 #endif
 
-__visible __used noinline
-void __sched __mutex_unlock_slowpath(atomic_t *lock_count);
+static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip);
 
 /**
  * mutex_unlock - release the mutex
@@ -422,21 +453,12 @@ void __sched __mutex_unlock_slowpath(ato
  */
 void __sched mutex_unlock(struct mutex *lock)
 {
-	/*
-	 * The unlocking fastpath is the 0->1 transition from 'locked'
-	 * into 'unlocked' state:
-	 */
-#ifndef CONFIG_DEBUG_MUTEXES
-	/*
-	 * When debugging is enabled we must not clear the owner before time,
-	 * the slow path will always be taken, and that clears the owner field
-	 * after verifying that it was indeed current.
-	 */
-	mutex_clear_owner(lock);
+#ifndef CONFIG_DEBUG_LOCK_ALLOC
+	if (__mutex_unlock_fast(lock))
+		return;
 #endif
-	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
+	__mutex_unlock_slowpath(lock, _RET_IP_);
 }
-
 EXPORT_SYMBOL(mutex_unlock);
 
 /**
@@ -465,15 +487,7 @@ void __sched ww_mutex_unlock(struct ww_m
 		lock->ctx = NULL;
 	}
 
-#ifndef CONFIG_DEBUG_MUTEXES
-	/*
-	 * When debugging is enabled we must not clear the owner before time,
-	 * the slow path will always be taken, and that clears the owner field
-	 * after verifying that it was indeed current.
-	 */
-	mutex_clear_owner(&lock->base);
-#endif
-	__mutex_fastpath_unlock(&lock->base.count, __mutex_unlock_slowpath);
+	mutex_unlock(&lock->base);
 }
 EXPORT_SYMBOL(ww_mutex_unlock);
 
@@ -520,20 +534,24 @@ __mutex_lock_common(struct mutex *lock,
 	preempt_disable();
 	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
 
-	if (mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
+	if (__mutex_trylock(lock) || mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
 		/* got the lock, yay! */
+		lock_acquired(&lock->dep_map, ip);
+		if (use_ww_ctx) {
+			struct ww_mutex *ww;
+			ww = container_of(lock, struct ww_mutex, base);
+
+			ww_mutex_set_context_fastpath(ww, ww_ctx);
+		}
 		preempt_enable();
 		return 0;
 	}
 
 	spin_lock_mutex(&lock->wait_lock, flags);
-
 	/*
-	 * Once more, try to acquire the lock. Only try-lock the mutex if
-	 * it is unlocked to reduce unnecessary xchg() operations.
+	 * After waiting to acquire the wait_lock, try again.
 	 */
-	if (!mutex_is_locked(lock) &&
-	    (atomic_xchg_acquire(&lock->count, 0) == 1))
+	if (__mutex_trylock(lock))
 		goto skip_wait;
 
 	debug_mutex_lock_common(lock, &waiter);
@@ -543,21 +561,13 @@ __mutex_lock_common(struct mutex *lock,
 	list_add_tail(&waiter.list, &lock->wait_list);
 	waiter.task = task;
 
+	if (list_first_entry(&lock->wait_list, struct mutex_waiter, list) == &waiter)
+		__mutex_set_flag(lock, MUTEX_FLAG_WAITERS);
+
 	lock_contended(&lock->dep_map, ip);
 
 	for (;;) {
-		/*
-		 * Lets try to take the lock again - this is needed even if
-		 * we get here for the first time (shortly after failing to
-		 * acquire the lock), to make sure that we get a wakeup once
-		 * it's unlocked. Later on, if we sleep, this is the
-		 * operation that gives us the lock. We xchg it to -1, so
-		 * that when we release the lock, we properly wake up the
-		 * other waiters. We only attempt the xchg if the count is
-		 * non-negative in order to avoid unnecessary xchg operations:
-		 */
-		if (atomic_read(&lock->count) >= 0 &&
-		    (atomic_xchg_acquire(&lock->count, -1) == 1))
+		if (__mutex_trylock(lock))
 			break;
 
 		/*
@@ -585,15 +595,14 @@ __mutex_lock_common(struct mutex *lock,
 	__set_task_state(task, TASK_RUNNING);
 
 	mutex_remove_waiter(lock, &waiter, task);
-	/* set it to 0 if there are no waiters left: */
 	if (likely(list_empty(&lock->wait_list)))
-		atomic_set(&lock->count, 0);
+		__mutex_clear_flag(lock, MUTEX_FLAG_WAITERS);
+
 	debug_mutex_free_waiter(&waiter);
 
 skip_wait:
 	/* got the lock - cleanup and rejoice! */
 	lock_acquired(&lock->dep_map, ip);
-	mutex_set_owner(lock);
 
 	if (use_ww_ctx) {
 		struct ww_mutex *ww = container_of(lock, struct ww_mutex, base);
@@ -631,7 +640,6 @@ _mutex_lock_nest_lock(struct mutex *lock
 	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE,
 			    0, nest, _RET_IP_, NULL, 0);
 }
-
 EXPORT_SYMBOL_GPL(_mutex_lock_nest_lock);
 
 int __sched
@@ -650,7 +658,6 @@ mutex_lock_interruptible_nested(struct m
 	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE,
 				   subclass, NULL, _RET_IP_, NULL, 0);
 }
-
 EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);
 
 static inline int
@@ -715,29 +722,22 @@ EXPORT_SYMBOL_GPL(__ww_mutex_lock_interr
 /*
  * Release the lock, slowpath:
  */
-static inline void
-__mutex_unlock_common_slowpath(struct mutex *lock, int nested)
+static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip)
 {
-	unsigned long flags;
+	unsigned long owner, flags;
 	WAKE_Q(wake_q);
 
+	mutex_release(&lock->dep_map, 1, ip);
+
 	/*
-	 * As a performance measurement, release the lock before doing other
-	 * wakeup related duties to follow. This allows other tasks to acquire
-	 * the lock sooner, while still handling cleanups in past unlock calls.
-	 * This can be done as we do not enforce strict equivalence between the
-	 * mutex counter and wait_list.
-	 *
-	 *
-	 * Some architectures leave the lock unlocked in the fastpath failure
-	 * case, others need to leave it locked. In the later case we have to
-	 * unlock it here - as the lock counter is currently 0 or negative.
+	 * Release the lock before (potentially) taking the spinlock
+	 * such that other contenders can get on with things ASAP.
 	 */
-	if (__mutex_slowpath_needs_to_unlock())
-		atomic_set(&lock->count, 1);
+	owner = atomic_long_fetch_and_release(MUTEX_FLAGS, &lock->owner);
+	if (!__owner_flags(owner))
+		return;
 
 	spin_lock_mutex(&lock->wait_lock, flags);
-	mutex_release(&lock->dep_map, nested, _RET_IP_);
 	debug_mutex_unlock(lock);
 
 	if (!list_empty(&lock->wait_list)) {
@@ -754,17 +754,6 @@ __mutex_unlock_common_slowpath(struct mu
 	wake_up_q(&wake_q);
 }
 
-/*
- * Release the lock, slowpath:
- */
-__visible void
-__mutex_unlock_slowpath(atomic_t *lock_count)
-{
-	struct mutex *lock = container_of(lock_count, struct mutex, count);
-
-	__mutex_unlock_common_slowpath(lock, 1);
-}
-
 #ifndef CONFIG_DEBUG_LOCK_ALLOC
 /*
  * Here come the less common (and hence less performance-critical) APIs:
@@ -789,38 +778,30 @@ __mutex_lock_interruptible_slowpath(stru
  */
 int __sched mutex_lock_interruptible(struct mutex *lock)
 {
-	int ret;
-
 	might_sleep();
-	ret =  __mutex_fastpath_lock_retval(&lock->count);
-	if (likely(!ret)) {
-		mutex_set_owner(lock);
+
+	if (__mutex_trylock_fast(lock))
 		return 0;
-	} else
-		return __mutex_lock_interruptible_slowpath(lock);
+
+	return __mutex_lock_interruptible_slowpath(lock);
 }
 
 EXPORT_SYMBOL(mutex_lock_interruptible);
 
 int __sched mutex_lock_killable(struct mutex *lock)
 {
-	int ret;
-
 	might_sleep();
-	ret = __mutex_fastpath_lock_retval(&lock->count);
-	if (likely(!ret)) {
-		mutex_set_owner(lock);
+
+	if (__mutex_trylock_fast(lock))
 		return 0;
-	} else
-		return __mutex_lock_killable_slowpath(lock);
+
+	return __mutex_lock_killable_slowpath(lock);
 }
 EXPORT_SYMBOL(mutex_lock_killable);
 
-__visible void __sched
-__mutex_lock_slowpath(atomic_t *lock_count)
+static noinline void __sched
+__mutex_lock_slowpath(struct mutex *lock)
 {
-	struct mutex *lock = container_of(lock_count, struct mutex, count);
-
 	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0,
 			    NULL, _RET_IP_, NULL, 0);
 }
@@ -856,37 +837,6 @@ __ww_mutex_lock_interruptible_slowpath(s
 
 #endif
 
-/*
- * Spinlock based trylock, we take the spinlock and check whether we
- * can get the lock:
- */
-static inline int __mutex_trylock_slowpath(atomic_t *lock_count)
-{
-	struct mutex *lock = container_of(lock_count, struct mutex, count);
-	unsigned long flags;
-	int prev;
-
-	/* No need to trylock if the mutex is locked. */
-	if (mutex_is_locked(lock))
-		return 0;
-
-	spin_lock_mutex(&lock->wait_lock, flags);
-
-	prev = atomic_xchg_acquire(&lock->count, -1);
-	if (likely(prev == 1)) {
-		mutex_set_owner(lock);
-		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
-	}
-
-	/* Set it back to 0 if there are no waiters: */
-	if (likely(list_empty(&lock->wait_list)))
-		atomic_set(&lock->count, 0);
-
-	spin_unlock_mutex(&lock->wait_lock, flags);
-
-	return prev == 1;
-}
-
 /**
  * mutex_trylock - try to acquire the mutex, without waiting
  * @lock: the mutex to be acquired
@@ -903,13 +853,12 @@ static inline int __mutex_trylock_slowpa
  */
 int __sched mutex_trylock(struct mutex *lock)
 {
-	int ret;
+	bool locked = __mutex_trylock(lock);
 
-	ret = __mutex_fastpath_trylock(&lock->count, __mutex_trylock_slowpath);
-	if (ret)
-		mutex_set_owner(lock);
+	if (locked)
+		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
 
-	return ret;
+	return locked;
 }
 EXPORT_SYMBOL(mutex_trylock);
 
@@ -917,36 +866,28 @@ EXPORT_SYMBOL(mutex_trylock);
 int __sched
 __ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 {
-	int ret;
-
 	might_sleep();
 
-	ret = __mutex_fastpath_lock_retval(&lock->base.count);
-
-	if (likely(!ret)) {
+	if (__mutex_trylock_fast(&lock->base)) {
 		ww_mutex_set_context_fastpath(lock, ctx);
-		mutex_set_owner(&lock->base);
-	} else
-		ret = __ww_mutex_lock_slowpath(lock, ctx);
-	return ret;
+		return 0;
+	}
+
+	return __ww_mutex_lock_slowpath(lock, ctx);
 }
 EXPORT_SYMBOL(__ww_mutex_lock);
 
 int __sched
 __ww_mutex_lock_interruptible(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 {
-	int ret;
-
 	might_sleep();
 
-	ret = __mutex_fastpath_lock_retval(&lock->base.count);
-
-	if (likely(!ret)) {
+	if (__mutex_trylock_fast(&lock->base)) {
 		ww_mutex_set_context_fastpath(lock, ctx);
-		mutex_set_owner(&lock->base);
-	} else
-		ret = __ww_mutex_lock_interruptible_slowpath(lock, ctx);
-	return ret;
+		return 0;
+	}
+
+	return __ww_mutex_lock_interruptible_slowpath(lock, ctx);
 }
 EXPORT_SYMBOL(__ww_mutex_lock_interruptible);
 
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -16,32 +16,6 @@
 #define mutex_remove_waiter(lock, waiter, task) \
 		__list_del((waiter)->list.prev, (waiter)->list.next)
 
-#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-/*
- * The mutex owner can get read and written to locklessly.
- * We should use WRITE_ONCE when writing the owner value to
- * avoid store tearing, otherwise, a thread could potentially
- * read a partially written and incomplete owner value.
- */
-static inline void mutex_set_owner(struct mutex *lock)
-{
-	WRITE_ONCE(lock->owner, current);
-}
-
-static inline void mutex_clear_owner(struct mutex *lock)
-{
-	WRITE_ONCE(lock->owner, NULL);
-}
-#else
-static inline void mutex_set_owner(struct mutex *lock)
-{
-}
-
-static inline void mutex_clear_owner(struct mutex *lock)
-{
-}
-#endif
-
 #define debug_mutex_wake_waiter(lock, waiter)		do { } while (0)
 #define debug_mutex_free_waiter(waiter)			do { } while (0)
 #define debug_mutex_add_waiter(lock, waiter, ti)	do { } while (0)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -75,11 +75,11 @@
 #include <linux/compiler.h>
 #include <linux/frame.h>
 #include <linux/prefetch.h>
+#include <linux/mutex.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
-#include <asm/mutex.h>
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #endif

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH -v4 3/8] locking/mutex: Kill arch specific code
  2016-10-07 14:52 [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex Peter Zijlstra
  2016-10-07 14:52 ` [PATCH -v4 1/8] locking/drm: Kill mutex trickery Peter Zijlstra
  2016-10-07 14:52 ` [PATCH -v4 2/8] locking/mutex: Rework mutex::owner Peter Zijlstra
@ 2016-10-07 14:52 ` Peter Zijlstra
  2016-10-07 14:52 ` [PATCH -v4 4/8] locking/mutex: Allow MUTEX_SPIN_ON_OWNER when DEBUG_MUTEXES Peter Zijlstra
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-07 14:52 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Peter Zijlstra
  Cc: Chris Wilson, Daniel Vetter

[-- Attachment #1: peterz-locking-rewrite-mutex-rm.patch --]
[-- Type: text/plain, Size: 37398 bytes --]

Its all generic atomic_long_t stuff now.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/alpha/include/asm/mutex.h      |    9 --
 arch/arc/include/asm/mutex.h        |   18 ----
 arch/arm/include/asm/mutex.h        |   21 -----
 arch/arm64/include/asm/Kbuild       |    1 
 arch/avr32/include/asm/mutex.h      |    9 --
 arch/blackfin/include/asm/Kbuild    |    1 
 arch/c6x/include/asm/mutex.h        |    6 -
 arch/cris/include/asm/mutex.h       |    9 --
 arch/frv/include/asm/mutex.h        |    9 --
 arch/h8300/include/asm/mutex.h      |    9 --
 arch/hexagon/include/asm/mutex.h    |    8 --
 arch/ia64/include/asm/mutex.h       |   90 ------------------------
 arch/m32r/include/asm/mutex.h       |    9 --
 arch/m68k/include/asm/Kbuild        |    1 
 arch/metag/include/asm/Kbuild       |    1 
 arch/microblaze/include/asm/mutex.h |    1 
 arch/mips/include/asm/Kbuild        |    1 
 arch/mn10300/include/asm/mutex.h    |   16 ----
 arch/nios2/include/asm/mutex.h      |    1 
 arch/openrisc/include/asm/mutex.h   |   27 -------
 arch/parisc/include/asm/Kbuild      |    1 
 arch/powerpc/include/asm/mutex.h    |  132 ------------------------------------
 arch/s390/include/asm/mutex.h       |    9 --
 arch/score/include/asm/mutex.h      |    6 -
 arch/sh/include/asm/mutex-llsc.h    |  109 -----------------------------
 arch/sh/include/asm/mutex.h         |   12 ---
 arch/sparc/include/asm/Kbuild       |    1 
 arch/tile/include/asm/Kbuild        |    1 
 arch/um/include/asm/Kbuild          |    1 
 arch/unicore32/include/asm/mutex.h  |   20 -----
 arch/x86/include/asm/mutex.h        |    5 -
 arch/x86/include/asm/mutex_32.h     |  110 ------------------------------
 arch/x86/include/asm/mutex_64.h     |  127 ----------------------------------
 arch/xtensa/include/asm/mutex.h     |    9 --
 include/asm-generic/mutex-dec.h     |   88 ------------------------
 include/asm-generic/mutex-null.h    |   19 -----
 include/asm-generic/mutex-xchg.h    |  120 --------------------------------
 include/asm-generic/mutex.h         |    9 --
 38 files changed, 1026 deletions(-)

--- a/arch/alpha/include/asm/mutex.h
+++ /dev/null
@@ -1,9 +0,0 @@
-/*
- * Pull in the generic implementation for the mutex fastpath.
- *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
- */
-
-#include <asm-generic/mutex-dec.h>
--- a/arch/arc/include/asm/mutex.h
+++ /dev/null
@@ -1,18 +0,0 @@
-/*
- * Copyright (C) 2004, 2007-2010, 2011-2012 Synopsys, Inc. (www.synopsys.com)
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- */
-
-/*
- * xchg() based mutex fast path maintains a state of 0 or 1, as opposed to
- * atomic dec based which can "count" any number of lock contenders.
- * This ideally needs to be fixed in core, but for now switching to dec ver.
- */
-#if defined(CONFIG_SMP) && (CONFIG_NR_CPUS > 2)
-#include <asm-generic/mutex-dec.h>
-#else
-#include <asm-generic/mutex-xchg.h>
-#endif
--- a/arch/arm/include/asm/mutex.h
+++ /dev/null
@@ -1,21 +0,0 @@
-/*
- * arch/arm/include/asm/mutex.h
- *
- * ARM optimized mutex locking primitives
- *
- * Please look into asm-generic/mutex-xchg.h for a formal definition.
- */
-#ifndef _ASM_MUTEX_H
-#define _ASM_MUTEX_H
-/*
- * On pre-ARMv6 hardware this results in a swp-based implementation,
- * which is the most efficient. For ARMv6+, we have exclusive memory
- * accessors and use atomic_dec to avoid the extra xchg operations
- * on the locking slowpaths.
- */
-#if __LINUX_ARM_ARCH__ < 6
-#include <asm-generic/mutex-xchg.h>
-#else
-#include <asm-generic/mutex-dec.h>
-#endif
-#endif	/* _ASM_MUTEX_H */
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -26,7 +26,6 @@ generic-y += mm-arch-hooks.h
 generic-y += mman.h
 generic-y += msgbuf.h
 generic-y += msi.h
-generic-y += mutex.h
 generic-y += pci.h
 generic-y += poll.h
 generic-y += preempt.h
--- a/arch/avr32/include/asm/mutex.h
+++ /dev/null
@@ -1,9 +0,0 @@
-/*
- * Pull in the generic implementation for the mutex fastpath.
- *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
- */
-
-#include <asm-generic/mutex-dec.h>
--- a/arch/blackfin/include/asm/Kbuild
+++ b/arch/blackfin/include/asm/Kbuild
@@ -24,7 +24,6 @@ generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
 generic-y += mman.h
 generic-y += msgbuf.h
-generic-y += mutex.h
 generic-y += param.h
 generic-y += percpu.h
 generic-y += pgalloc.h
--- a/arch/c6x/include/asm/mutex.h
+++ /dev/null
@@ -1,6 +0,0 @@
-#ifndef _ASM_C6X_MUTEX_H
-#define _ASM_C6X_MUTEX_H
-
-#include <asm-generic/mutex-null.h>
-
-#endif /* _ASM_C6X_MUTEX_H */
--- a/arch/cris/include/asm/mutex.h
+++ /dev/null
@@ -1,9 +0,0 @@
-/*
- * Pull in the generic implementation for the mutex fastpath.
- *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
- */
-
-#include <asm-generic/mutex-dec.h>
--- a/arch/frv/include/asm/mutex.h
+++ /dev/null
@@ -1,9 +0,0 @@
-/*
- * Pull in the generic implementation for the mutex fastpath.
- *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
- */
-
-#include <asm-generic/mutex-dec.h>
--- a/arch/h8300/include/asm/mutex.h
+++ /dev/null
@@ -1,9 +0,0 @@
-/*
- * Pull in the generic implementation for the mutex fastpath.
- *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
- */
-
-#include <asm-generic/mutex-dec.h>
--- a/arch/hexagon/include/asm/mutex.h
+++ /dev/null
@@ -1,8 +0,0 @@
-/*
- * Pull in the generic implementation for the mutex fastpath.
- *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
- */
-#include <asm-generic/mutex-xchg.h>
--- a/arch/ia64/include/asm/mutex.h
+++ /dev/null
@@ -1,90 +0,0 @@
-/*
- * ia64 implementation of the mutex fastpath.
- *
- * Copyright (C) 2006 Ken Chen <kenneth.w.chen@intel.com>
- *
- */
-
-#ifndef _ASM_MUTEX_H
-#define _ASM_MUTEX_H
-
-/**
- *  __mutex_fastpath_lock - try to take the lock by moving the count
- *                          from 1 to a 0 value
- *  @count: pointer of type atomic_t
- *  @fail_fn: function to call if the original value was not 1
- *
- * Change the count from 1 to a value lower than 1, and call <fail_fn> if
- * it wasn't 1 originally. This function MUST leave the value lower than
- * 1 even when the "1" assertion wasn't true.
- */
-static inline void
-__mutex_fastpath_lock(atomic_t *count, void (*fail_fn)(atomic_t *))
-{
-	if (unlikely(ia64_fetchadd4_acq(count, -1) != 1))
-		fail_fn(count);
-}
-
-/**
- *  __mutex_fastpath_lock_retval - try to take the lock by moving the count
- *                                 from 1 to a 0 value
- *  @count: pointer of type atomic_t
- *
- * Change the count from 1 to a value lower than 1. This function returns 0
- * if the fastpath succeeds, or -1 otherwise.
- */
-static inline int
-__mutex_fastpath_lock_retval(atomic_t *count)
-{
-	if (unlikely(ia64_fetchadd4_acq(count, -1) != 1))
-		return -1;
-	return 0;
-}
-
-/**
- *  __mutex_fastpath_unlock - try to promote the count from 0 to 1
- *  @count: pointer of type atomic_t
- *  @fail_fn: function to call if the original value was not 0
- *
- * Try to promote the count from 0 to 1. If it wasn't 0, call <fail_fn>.
- * In the failure case, this function is allowed to either set the value to
- * 1, or to set it to a value lower than 1.
- *
- * If the implementation sets it to a value of lower than 1, then the
- * __mutex_slowpath_needs_to_unlock() macro needs to return 1, it needs
- * to return 0 otherwise.
- */
-static inline void
-__mutex_fastpath_unlock(atomic_t *count, void (*fail_fn)(atomic_t *))
-{
-	int ret = ia64_fetchadd4_rel(count, 1);
-	if (unlikely(ret < 0))
-		fail_fn(count);
-}
-
-#define __mutex_slowpath_needs_to_unlock()		1
-
-/**
- * __mutex_fastpath_trylock - try to acquire the mutex, without waiting
- *
- *  @count: pointer of type atomic_t
- *  @fail_fn: fallback function
- *
- * Change the count from 1 to a value lower than 1, and return 0 (failure)
- * if it wasn't 1 originally, or return 1 (success) otherwise. This function
- * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
- * Additionally, if the value was < 0 originally, this function must not leave
- * it to 0 on failure.
- *
- * If the architecture has no effective trylock variant, it should call the
- * <fail_fn> spinlock-based trylock variant unconditionally.
- */
-static inline int
-__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
-{
-	if (atomic_read(count) == 1 && cmpxchg_acq(count, 1, 0) == 1)
-		return 1;
-	return 0;
-}
-
-#endif
--- a/arch/m32r/include/asm/mutex.h
+++ /dev/null
@@ -1,9 +0,0 @@
-/*
- * Pull in the generic implementation for the mutex fastpath.
- *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
- */
-
-#include <asm-generic/mutex-dec.h>
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -20,7 +20,6 @@ generic-y += local64.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
 generic-y += mman.h
-generic-y += mutex.h
 generic-y += percpu.h
 generic-y += preempt.h
 generic-y += resource.h
--- a/arch/metag/include/asm/Kbuild
+++ b/arch/metag/include/asm/Kbuild
@@ -27,7 +27,6 @@ generic-y += local64.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
 generic-y += msgbuf.h
-generic-y += mutex.h
 generic-y += param.h
 generic-y += pci.h
 generic-y += percpu.h
--- a/arch/microblaze/include/asm/mutex.h
+++ /dev/null
@@ -1 +0,0 @@
-#include <asm-generic/mutex-dec.h>
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -9,7 +9,6 @@ generic-y += irq_work.h
 generic-y += local64.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
-generic-y += mutex.h
 generic-y += parport.h
 generic-y += percpu.h
 generic-y += preempt.h
--- a/arch/mn10300/include/asm/mutex.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/* MN10300 Mutex fastpath
- *
- * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (dhowells@redhat.com)
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public Licence
- * as published by the Free Software Foundation; either version
- * 2 of the Licence, or (at your option) any later version.
- *
- *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
- */
-#include <asm-generic/mutex-null.h>
--- a/arch/nios2/include/asm/mutex.h
+++ /dev/null
@@ -1 +0,0 @@
-#include <asm-generic/mutex-dec.h>
--- a/arch/openrisc/include/asm/mutex.h
+++ /dev/null
@@ -1,27 +0,0 @@
-/*
- * OpenRISC Linux
- *
- * Linux architectural port borrowing liberally from similar works of
- * others.  All original copyrights apply as per the original source
- * declaration.
- *
- * OpenRISC implementation:
- * Copyright (C) 2003 Matjaz Breskvar <phoenix@bsemi.com>
- * Copyright (C) 2010-2011 Jonas Bonn <jonas@southpole.se>
- * et al.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- */
-
-/*
- * Pull in the generic implementation for the mutex fastpath.
- *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
- */
-
-#include <asm-generic/mutex-dec.h>
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -16,7 +16,6 @@ generic-y += local.h
 generic-y += local64.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
-generic-y += mutex.h
 generic-y += param.h
 generic-y += percpu.h
 generic-y += poll.h
--- a/arch/powerpc/include/asm/mutex.h
+++ /dev/null
@@ -1,132 +0,0 @@
-/*
- * Optimised mutex implementation of include/asm-generic/mutex-dec.h algorithm
- */
-#ifndef _ASM_POWERPC_MUTEX_H
-#define _ASM_POWERPC_MUTEX_H
-
-static inline int __mutex_cmpxchg_lock(atomic_t *v, int old, int new)
-{
-	int t;
-
-	__asm__ __volatile__ (
-"1:	lwarx	%0,0,%1		# mutex trylock\n\
-	cmpw	0,%0,%2\n\
-	bne-	2f\n"
-	PPC405_ERR77(0,%1)
-"	stwcx.	%3,0,%1\n\
-	bne-	1b"
-	PPC_ACQUIRE_BARRIER
-	"\n\
-2:"
-	: "=&r" (t)
-	: "r" (&v->counter), "r" (old), "r" (new)
-	: "cc", "memory");
-
-	return t;
-}
-
-static inline int __mutex_dec_return_lock(atomic_t *v)
-{
-	int t;
-
-	__asm__ __volatile__(
-"1:	lwarx	%0,0,%1		# mutex lock\n\
-	addic	%0,%0,-1\n"
-	PPC405_ERR77(0,%1)
-"	stwcx.	%0,0,%1\n\
-	bne-	1b"
-	PPC_ACQUIRE_BARRIER
-	: "=&r" (t)
-	: "r" (&v->counter)
-	: "cc", "memory");
-
-	return t;
-}
-
-static inline int __mutex_inc_return_unlock(atomic_t *v)
-{
-	int t;
-
-	__asm__ __volatile__(
-	PPC_RELEASE_BARRIER
-"1:	lwarx	%0,0,%1		# mutex unlock\n\
-	addic	%0,%0,1\n"
-	PPC405_ERR77(0,%1)
-"	stwcx.	%0,0,%1 \n\
-	bne-	1b"
-	: "=&r" (t)
-	: "r" (&v->counter)
-	: "cc", "memory");
-
-	return t;
-}
-
-/**
- *  __mutex_fastpath_lock - try to take the lock by moving the count
- *                          from 1 to a 0 value
- *  @count: pointer of type atomic_t
- *  @fail_fn: function to call if the original value was not 1
- *
- * Change the count from 1 to a value lower than 1, and call <fail_fn> if
- * it wasn't 1 originally. This function MUST leave the value lower than
- * 1 even when the "1" assertion wasn't true.
- */
-static inline void
-__mutex_fastpath_lock(atomic_t *count, void (*fail_fn)(atomic_t *))
-{
-	if (unlikely(__mutex_dec_return_lock(count) < 0))
-		fail_fn(count);
-}
-
-/**
- *  __mutex_fastpath_lock_retval - try to take the lock by moving the count
- *                                 from 1 to a 0 value
- *  @count: pointer of type atomic_t
- *
- * Change the count from 1 to a value lower than 1. This function returns 0
- * if the fastpath succeeds, or -1 otherwise.
- */
-static inline int
-__mutex_fastpath_lock_retval(atomic_t *count)
-{
-	if (unlikely(__mutex_dec_return_lock(count) < 0))
-		return -1;
-	return 0;
-}
-
-/**
- *  __mutex_fastpath_unlock - try to promote the count from 0 to 1
- *  @count: pointer of type atomic_t
- *  @fail_fn: function to call if the original value was not 0
- *
- * Try to promote the count from 0 to 1. If it wasn't 0, call <fail_fn>.
- * In the failure case, this function is allowed to either set the value to
- * 1, or to set it to a value lower than 1.
- */
-static inline void
-__mutex_fastpath_unlock(atomic_t *count, void (*fail_fn)(atomic_t *))
-{
-	if (unlikely(__mutex_inc_return_unlock(count) <= 0))
-		fail_fn(count);
-}
-
-#define __mutex_slowpath_needs_to_unlock()		1
-
-/**
- * __mutex_fastpath_trylock - try to acquire the mutex, without waiting
- *
- *  @count: pointer of type atomic_t
- *  @fail_fn: fallback function
- *
- * Change the count from 1 to 0, and return 1 (success), or if the count
- * was not 1, then return 0 (failure).
- */
-static inline int
-__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
-{
-	if (likely(atomic_read(count) == 1 && __mutex_cmpxchg_lock(count, 1, 0) == 1))
-		return 1;
-	return 0;
-}
-
-#endif
--- a/arch/s390/include/asm/mutex.h
+++ /dev/null
@@ -1,9 +0,0 @@
-/*
- * Pull in the generic implementation for the mutex fastpath.
- *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
- */
-
-#include <asm-generic/mutex-dec.h>
--- a/arch/score/include/asm/mutex.h
+++ /dev/null
@@ -1,6 +0,0 @@
-#ifndef _ASM_SCORE_MUTEX_H
-#define _ASM_SCORE_MUTEX_H
-
-#include <asm-generic/mutex-dec.h>
-
-#endif /* _ASM_SCORE_MUTEX_H */
--- a/arch/sh/include/asm/mutex-llsc.h
+++ /dev/null
@@ -1,109 +0,0 @@
-/*
- * arch/sh/include/asm/mutex-llsc.h
- *
- * SH-4A optimized mutex locking primitives
- *
- * Please look into asm-generic/mutex-xchg.h for a formal definition.
- */
-#ifndef __ASM_SH_MUTEX_LLSC_H
-#define __ASM_SH_MUTEX_LLSC_H
-
-/*
- * Attempting to lock a mutex on SH4A is done like in ARMv6+ architecure.
- * with a bastardized atomic decrement (it is not a reliable atomic decrement
- * but it satisfies the defined semantics for our purpose, while being
- * smaller and faster than a real atomic decrement or atomic swap.
- * The idea is to attempt  decrementing the lock value only once. If once
- * decremented it isn't zero, or if its store-back fails due to a dispute
- * on the exclusive store, we simply bail out immediately through the slow
- * path where the lock will be reattempted until it succeeds.
- */
-static inline void
-__mutex_fastpath_lock(atomic_t *count, void (*fail_fn)(atomic_t *))
-{
-	int __done, __res;
-
-	__asm__ __volatile__ (
-		"movli.l	@%2, %0	\n"
-		"add		#-1, %0	\n"
-		"movco.l	%0, @%2	\n"
-		"movt		%1	\n"
-		: "=&z" (__res), "=&r" (__done)
-		: "r" (&(count)->counter)
-		: "t");
-
-	if (unlikely(!__done || __res != 0))
-		fail_fn(count);
-}
-
-static inline int
-__mutex_fastpath_lock_retval(atomic_t *count)
-{
-	int __done, __res;
-
-	__asm__ __volatile__ (
-		"movli.l	@%2, %0	\n"
-		"add		#-1, %0	\n"
-		"movco.l	%0, @%2	\n"
-		"movt		%1	\n"
-		: "=&z" (__res), "=&r" (__done)
-		: "r" (&(count)->counter)
-		: "t");
-
-	if (unlikely(!__done || __res != 0))
-		__res = -1;
-
-	return __res;
-}
-
-static inline void
-__mutex_fastpath_unlock(atomic_t *count, void (*fail_fn)(atomic_t *))
-{
-	int __done, __res;
-
-	__asm__ __volatile__ (
-		"movli.l	@%2, %0	\n\t"
-		"add		#1, %0	\n\t"
-		"movco.l	%0, @%2 \n\t"
-		"movt		%1	\n\t"
-		: "=&z" (__res), "=&r" (__done)
-		: "r" (&(count)->counter)
-		: "t");
-
-	if (unlikely(!__done || __res <= 0))
-		fail_fn(count);
-}
-
-/*
- * If the unlock was done on a contended lock, or if the unlock simply fails
- * then the mutex remains locked.
- */
-#define __mutex_slowpath_needs_to_unlock()	1
-
-/*
- * For __mutex_fastpath_trylock we do an atomic decrement and check the
- * result and put it in the __res variable.
- */
-static inline int
-__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
-{
-	int __res, __orig;
-
-	__asm__ __volatile__ (
-		"1: movli.l	@%2, %0		\n\t"
-		"dt		%0		\n\t"
-		"movco.l	%0,@%2		\n\t"
-		"bf		1b		\n\t"
-		"cmp/eq		#0,%0		\n\t"
-		"bt		2f		\n\t"
-		"mov		#0, %1		\n\t"
-		"bf		3f		\n\t"
-		"2: mov		#1, %1		\n\t"
-		"3:				"
-		: "=&z" (__orig), "=&r" (__res)
-		: "r" (&count->counter)
-		: "t");
-
-	return __res;
-}
-#endif /* __ASM_SH_MUTEX_LLSC_H */
--- a/arch/sh/include/asm/mutex.h
+++ /dev/null
@@ -1,12 +0,0 @@
-/*
- * Pull in the generic implementation for the mutex fastpath.
- *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
- */
-#if defined(CONFIG_CPU_SH4A)
-#include <asm/mutex-llsc.h>
-#else
-#include <asm-generic/mutex-dec.h>
-#endif
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -14,7 +14,6 @@ generic-y += local64.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
 generic-y += module.h
-generic-y += mutex.h
 generic-y += preempt.h
 generic-y += rwsem.h
 generic-y += serial.h
--- a/arch/tile/include/asm/Kbuild
+++ b/arch/tile/include/asm/Kbuild
@@ -21,7 +21,6 @@ generic-y += local64.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
 generic-y += msgbuf.h
-generic-y += mutex.h
 generic-y += param.h
 generic-y += parport.h
 generic-y += poll.h
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -17,7 +17,6 @@ generic-y += irq_work.h
 generic-y += kdebug.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
-generic-y += mutex.h
 generic-y += param.h
 generic-y += pci.h
 generic-y += percpu.h
--- a/arch/unicore32/include/asm/mutex.h
+++ /dev/null
@@ -1,20 +0,0 @@
-/*
- * linux/arch/unicore32/include/asm/mutex.h
- *
- * Code specific to PKUnity SoC and UniCore ISA
- *
- * Copyright (C) 2001-2010 GUAN Xue-tao
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * UniCore optimized mutex locking primitives
- *
- * Please look into asm-generic/mutex-xchg.h for a formal definition.
- */
-#ifndef __UNICORE_MUTEX_H__
-#define __UNICORE_MUTEX_H__
-
-# include <asm-generic/mutex-xchg.h>
-#endif
--- a/arch/x86/include/asm/mutex.h
+++ /dev/null
@@ -1,5 +0,0 @@
-#ifdef CONFIG_X86_32
-# include <asm/mutex_32.h>
-#else
-# include <asm/mutex_64.h>
-#endif
--- a/arch/x86/include/asm/mutex_32.h
+++ /dev/null
@@ -1,110 +0,0 @@
-/*
- * Assembly implementation of the mutex fastpath, based on atomic
- * decrement/increment.
- *
- * started by Ingo Molnar:
- *
- *  Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
- */
-#ifndef _ASM_X86_MUTEX_32_H
-#define _ASM_X86_MUTEX_32_H
-
-#include <asm/alternative.h>
-
-/**
- *  __mutex_fastpath_lock - try to take the lock by moving the count
- *                          from 1 to a 0 value
- *  @count: pointer of type atomic_t
- *  @fn: function to call if the original value was not 1
- *
- * Change the count from 1 to a value lower than 1, and call <fn> if it
- * wasn't 1 originally. This function MUST leave the value lower than 1
- * even when the "1" assertion wasn't true.
- */
-#define __mutex_fastpath_lock(count, fail_fn)			\
-do {								\
-	unsigned int dummy;					\
-								\
-	typecheck(atomic_t *, count);				\
-	typecheck_fn(void (*)(atomic_t *), fail_fn);		\
-								\
-	asm volatile(LOCK_PREFIX "   decl (%%eax)\n"		\
-		     "   jns 1f	\n"				\
-		     "   call " #fail_fn "\n"			\
-		     "1:\n"					\
-		     : "=a" (dummy)				\
-		     : "a" (count)				\
-		     : "memory", "ecx", "edx");			\
-} while (0)
-
-
-/**
- *  __mutex_fastpath_lock_retval - try to take the lock by moving the count
- *                                 from 1 to a 0 value
- *  @count: pointer of type atomic_t
- *
- * Change the count from 1 to a value lower than 1. This function returns 0
- * if the fastpath succeeds, or -1 otherwise.
- */
-static inline int __mutex_fastpath_lock_retval(atomic_t *count)
-{
-	if (unlikely(atomic_dec_return(count) < 0))
-		return -1;
-	else
-		return 0;
-}
-
-/**
- *  __mutex_fastpath_unlock - try to promote the mutex from 0 to 1
- *  @count: pointer of type atomic_t
- *  @fail_fn: function to call if the original value was not 0
- *
- * try to promote the mutex from 0 to 1. if it wasn't 0, call <fail_fn>.
- * In the failure case, this function is allowed to either set the value
- * to 1, or to set it to a value lower than 1.
- *
- * If the implementation sets it to a value of lower than 1, the
- * __mutex_slowpath_needs_to_unlock() macro needs to return 1, it needs
- * to return 0 otherwise.
- */
-#define __mutex_fastpath_unlock(count, fail_fn)			\
-do {								\
-	unsigned int dummy;					\
-								\
-	typecheck(atomic_t *, count);				\
-	typecheck_fn(void (*)(atomic_t *), fail_fn);		\
-								\
-	asm volatile(LOCK_PREFIX "   incl (%%eax)\n"		\
-		     "   jg	1f\n"				\
-		     "   call " #fail_fn "\n"			\
-		     "1:\n"					\
-		     : "=a" (dummy)				\
-		     : "a" (count)				\
-		     : "memory", "ecx", "edx");			\
-} while (0)
-
-#define __mutex_slowpath_needs_to_unlock()	1
-
-/**
- * __mutex_fastpath_trylock - try to acquire the mutex, without waiting
- *
- *  @count: pointer of type atomic_t
- *  @fail_fn: fallback function
- *
- * Change the count from 1 to a value lower than 1, and return 0 (failure)
- * if it wasn't 1 originally, or return 1 (success) otherwise. This function
- * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
- * Additionally, if the value was < 0 originally, this function must not leave
- * it to 0 on failure.
- */
-static inline int __mutex_fastpath_trylock(atomic_t *count,
-					   int (*fail_fn)(atomic_t *))
-{
-	/* cmpxchg because it never induces a false contention state. */
-	if (likely(atomic_read(count) == 1 && atomic_cmpxchg(count, 1, 0) == 1))
-		return 1;
-
-	return 0;
-}
-
-#endif /* _ASM_X86_MUTEX_32_H */
--- a/arch/x86/include/asm/mutex_64.h
+++ /dev/null
@@ -1,127 +0,0 @@
-/*
- * Assembly implementation of the mutex fastpath, based on atomic
- * decrement/increment.
- *
- * started by Ingo Molnar:
- *
- *  Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
- */
-#ifndef _ASM_X86_MUTEX_64_H
-#define _ASM_X86_MUTEX_64_H
-
-/**
- * __mutex_fastpath_lock - decrement and call function if negative
- * @v: pointer of type atomic_t
- * @fail_fn: function to call if the result is negative
- *
- * Atomically decrements @v and calls <fail_fn> if the result is negative.
- */
-#ifdef CC_HAVE_ASM_GOTO
-static inline void __mutex_fastpath_lock(atomic_t *v,
-					 void (*fail_fn)(atomic_t *))
-{
-	asm_volatile_goto(LOCK_PREFIX "   decl %0\n"
-			  "   jns %l[exit]\n"
-			  : : "m" (v->counter)
-			  : "memory", "cc"
-			  : exit);
-	fail_fn(v);
-exit:
-	return;
-}
-#else
-#define __mutex_fastpath_lock(v, fail_fn)			\
-do {								\
-	unsigned long dummy;					\
-								\
-	typecheck(atomic_t *, v);				\
-	typecheck_fn(void (*)(atomic_t *), fail_fn);		\
-								\
-	asm volatile(LOCK_PREFIX "   decl (%%rdi)\n"		\
-		     "   jns 1f		\n"			\
-		     "   call " #fail_fn "\n"			\
-		     "1:"					\
-		     : "=D" (dummy)				\
-		     : "D" (v)					\
-		     : "rax", "rsi", "rdx", "rcx",		\
-		       "r8", "r9", "r10", "r11", "memory");	\
-} while (0)
-#endif
-
-/**
- *  __mutex_fastpath_lock_retval - try to take the lock by moving the count
- *                                 from 1 to a 0 value
- *  @count: pointer of type atomic_t
- *
- * Change the count from 1 to a value lower than 1. This function returns 0
- * if the fastpath succeeds, or -1 otherwise.
- */
-static inline int __mutex_fastpath_lock_retval(atomic_t *count)
-{
-	if (unlikely(atomic_dec_return(count) < 0))
-		return -1;
-	else
-		return 0;
-}
-
-/**
- * __mutex_fastpath_unlock - increment and call function if nonpositive
- * @v: pointer of type atomic_t
- * @fail_fn: function to call if the result is nonpositive
- *
- * Atomically increments @v and calls <fail_fn> if the result is nonpositive.
- */
-#ifdef CC_HAVE_ASM_GOTO
-static inline void __mutex_fastpath_unlock(atomic_t *v,
-					   void (*fail_fn)(atomic_t *))
-{
-	asm_volatile_goto(LOCK_PREFIX "   incl %0\n"
-			  "   jg %l[exit]\n"
-			  : : "m" (v->counter)
-			  : "memory", "cc"
-			  : exit);
-	fail_fn(v);
-exit:
-	return;
-}
-#else
-#define __mutex_fastpath_unlock(v, fail_fn)			\
-do {								\
-	unsigned long dummy;					\
-								\
-	typecheck(atomic_t *, v);				\
-	typecheck_fn(void (*)(atomic_t *), fail_fn);		\
-								\
-	asm volatile(LOCK_PREFIX "   incl (%%rdi)\n"		\
-		     "   jg 1f\n"				\
-		     "   call " #fail_fn "\n"			\
-		     "1:"					\
-		     : "=D" (dummy)				\
-		     : "D" (v)					\
-		     : "rax", "rsi", "rdx", "rcx",		\
-		       "r8", "r9", "r10", "r11", "memory");	\
-} while (0)
-#endif
-
-#define __mutex_slowpath_needs_to_unlock()	1
-
-/**
- * __mutex_fastpath_trylock - try to acquire the mutex, without waiting
- *
- *  @count: pointer of type atomic_t
- *  @fail_fn: fallback function
- *
- * Change the count from 1 to 0 and return 1 (success), or return 0 (failure)
- * if it wasn't 1 originally. [the fallback function is never used on
- * x86_64, because all x86_64 CPUs have a CMPXCHG instruction.]
- */
-static inline int __mutex_fastpath_trylock(atomic_t *count,
-					   int (*fail_fn)(atomic_t *))
-{
-	if (likely(atomic_read(count) == 1 && atomic_cmpxchg(count, 1, 0) == 1))
-		return 1;
-
-	return 0;
-}
-
-#endif /* _ASM_X86_MUTEX_64_H */
--- a/arch/xtensa/include/asm/mutex.h
+++ /dev/null
@@ -1,9 +0,0 @@
-/*
- * Pull in the generic implementation for the mutex fastpath.
- *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
- */
-
-#include <asm-generic/mutex-dec.h>
--- a/include/asm-generic/mutex-dec.h
+++ /dev/null
@@ -1,88 +0,0 @@
-/*
- * include/asm-generic/mutex-dec.h
- *
- * Generic implementation of the mutex fastpath, based on atomic
- * decrement/increment.
- */
-#ifndef _ASM_GENERIC_MUTEX_DEC_H
-#define _ASM_GENERIC_MUTEX_DEC_H
-
-/**
- *  __mutex_fastpath_lock - try to take the lock by moving the count
- *                          from 1 to a 0 value
- *  @count: pointer of type atomic_t
- *  @fail_fn: function to call if the original value was not 1
- *
- * Change the count from 1 to a value lower than 1, and call <fail_fn> if
- * it wasn't 1 originally. This function MUST leave the value lower than
- * 1 even when the "1" assertion wasn't true.
- */
-static inline void
-__mutex_fastpath_lock(atomic_t *count, void (*fail_fn)(atomic_t *))
-{
-	if (unlikely(atomic_dec_return_acquire(count) < 0))
-		fail_fn(count);
-}
-
-/**
- *  __mutex_fastpath_lock_retval - try to take the lock by moving the count
- *                                 from 1 to a 0 value
- *  @count: pointer of type atomic_t
- *
- * Change the count from 1 to a value lower than 1. This function returns 0
- * if the fastpath succeeds, or -1 otherwise.
- */
-static inline int
-__mutex_fastpath_lock_retval(atomic_t *count)
-{
-	if (unlikely(atomic_dec_return_acquire(count) < 0))
-		return -1;
-	return 0;
-}
-
-/**
- *  __mutex_fastpath_unlock - try to promote the count from 0 to 1
- *  @count: pointer of type atomic_t
- *  @fail_fn: function to call if the original value was not 0
- *
- * Try to promote the count from 0 to 1. If it wasn't 0, call <fail_fn>.
- * In the failure case, this function is allowed to either set the value to
- * 1, or to set it to a value lower than 1.
- *
- * If the implementation sets it to a value of lower than 1, then the
- * __mutex_slowpath_needs_to_unlock() macro needs to return 1, it needs
- * to return 0 otherwise.
- */
-static inline void
-__mutex_fastpath_unlock(atomic_t *count, void (*fail_fn)(atomic_t *))
-{
-	if (unlikely(atomic_inc_return_release(count) <= 0))
-		fail_fn(count);
-}
-
-#define __mutex_slowpath_needs_to_unlock()		1
-
-/**
- * __mutex_fastpath_trylock - try to acquire the mutex, without waiting
- *
- *  @count: pointer of type atomic_t
- *  @fail_fn: fallback function
- *
- * Change the count from 1 to a value lower than 1, and return 0 (failure)
- * if it wasn't 1 originally, or return 1 (success) otherwise. This function
- * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
- * Additionally, if the value was < 0 originally, this function must not leave
- * it to 0 on failure.
- *
- * If the architecture has no effective trylock variant, it should call the
- * <fail_fn> spinlock-based trylock variant unconditionally.
- */
-static inline int
-__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
-{
-	if (likely(atomic_read(count) == 1 && atomic_cmpxchg_acquire(count, 1, 0) == 1))
-		return 1;
-	return 0;
-}
-
-#endif
--- a/include/asm-generic/mutex-null.h
+++ /dev/null
@@ -1,19 +0,0 @@
-/*
- * include/asm-generic/mutex-null.h
- *
- * Generic implementation of the mutex fastpath, based on NOP :-)
- *
- * This is used by the mutex-debugging infrastructure, but it can also
- * be used by architectures that (for whatever reason) want to use the
- * spinlock based slowpath.
- */
-#ifndef _ASM_GENERIC_MUTEX_NULL_H
-#define _ASM_GENERIC_MUTEX_NULL_H
-
-#define __mutex_fastpath_lock(count, fail_fn)		fail_fn(count)
-#define __mutex_fastpath_lock_retval(count)		(-1)
-#define __mutex_fastpath_unlock(count, fail_fn)		fail_fn(count)
-#define __mutex_fastpath_trylock(count, fail_fn)	fail_fn(count)
-#define __mutex_slowpath_needs_to_unlock()		1
-
-#endif
--- a/include/asm-generic/mutex-xchg.h
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * include/asm-generic/mutex-xchg.h
- *
- * Generic implementation of the mutex fastpath, based on xchg().
- *
- * NOTE: An xchg based implementation might be less optimal than an atomic
- *       decrement/increment based implementation. If your architecture
- *       has a reasonable atomic dec/inc then you should probably use
- *	 asm-generic/mutex-dec.h instead, or you could open-code an
- *	 optimized version in asm/mutex.h.
- */
-#ifndef _ASM_GENERIC_MUTEX_XCHG_H
-#define _ASM_GENERIC_MUTEX_XCHG_H
-
-/**
- *  __mutex_fastpath_lock - try to take the lock by moving the count
- *                          from 1 to a 0 value
- *  @count: pointer of type atomic_t
- *  @fail_fn: function to call if the original value was not 1
- *
- * Change the count from 1 to a value lower than 1, and call <fail_fn> if it
- * wasn't 1 originally. This function MUST leave the value lower than 1
- * even when the "1" assertion wasn't true.
- */
-static inline void
-__mutex_fastpath_lock(atomic_t *count, void (*fail_fn)(atomic_t *))
-{
-	if (unlikely(atomic_xchg(count, 0) != 1))
-		/*
-		 * We failed to acquire the lock, so mark it contended
-		 * to ensure that any waiting tasks are woken up by the
-		 * unlock slow path.
-		 */
-		if (likely(atomic_xchg_acquire(count, -1) != 1))
-			fail_fn(count);
-}
-
-/**
- *  __mutex_fastpath_lock_retval - try to take the lock by moving the count
- *                                 from 1 to a 0 value
- *  @count: pointer of type atomic_t
- *
- * Change the count from 1 to a value lower than 1. This function returns 0
- * if the fastpath succeeds, or -1 otherwise.
- */
-static inline int
-__mutex_fastpath_lock_retval(atomic_t *count)
-{
-	if (unlikely(atomic_xchg_acquire(count, 0) != 1))
-		if (likely(atomic_xchg(count, -1) != 1))
-			return -1;
-	return 0;
-}
-
-/**
- *  __mutex_fastpath_unlock - try to promote the mutex from 0 to 1
- *  @count: pointer of type atomic_t
- *  @fail_fn: function to call if the original value was not 0
- *
- * try to promote the mutex from 0 to 1. if it wasn't 0, call <function>
- * In the failure case, this function is allowed to either set the value to
- * 1, or to set it to a value lower than one.
- * If the implementation sets it to a value of lower than one, the
- * __mutex_slowpath_needs_to_unlock() macro needs to return 1, it needs
- * to return 0 otherwise.
- */
-static inline void
-__mutex_fastpath_unlock(atomic_t *count, void (*fail_fn)(atomic_t *))
-{
-	if (unlikely(atomic_xchg_release(count, 1) != 0))
-		fail_fn(count);
-}
-
-#define __mutex_slowpath_needs_to_unlock()		0
-
-/**
- * __mutex_fastpath_trylock - try to acquire the mutex, without waiting
- *
- *  @count: pointer of type atomic_t
- *  @fail_fn: spinlock based trylock implementation
- *
- * Change the count from 1 to a value lower than 1, and return 0 (failure)
- * if it wasn't 1 originally, or return 1 (success) otherwise. This function
- * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
- * Additionally, if the value was < 0 originally, this function must not leave
- * it to 0 on failure.
- *
- * If the architecture has no effective trylock variant, it should call the
- * <fail_fn> spinlock-based trylock variant unconditionally.
- */
-static inline int
-__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
-{
-	int prev;
-
-	if (atomic_read(count) != 1)
-		return 0;
-
-	prev = atomic_xchg_acquire(count, 0);
-	if (unlikely(prev < 0)) {
-		/*
-		 * The lock was marked contended so we must restore that
-		 * state. If while doing so we get back a prev value of 1
-		 * then we just own it.
-		 *
-		 * [ In the rare case of the mutex going to 1, to 0, to -1
-		 *   and then back to 0 in this few-instructions window,
-		 *   this has the potential to trigger the slowpath for the
-		 *   owner's unlock path needlessly, but that's not a problem
-		 *   in practice. ]
-		 */
-		prev = atomic_xchg_acquire(count, prev);
-		if (prev < 0)
-			prev = 0;
-	}
-
-	return prev;
-}
-
-#endif
--- a/include/asm-generic/mutex.h
+++ /dev/null
@@ -1,9 +0,0 @@
-#ifndef __ASM_GENERIC_MUTEX_H
-#define __ASM_GENERIC_MUTEX_H
-/*
- * Pull in the generic implementation for the mutex fastpath,
- * which is a reasonable default on many architectures.
- */
-
-#include <asm-generic/mutex-dec.h>
-#endif /* __ASM_GENERIC_MUTEX_H */

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH -v4 4/8] locking/mutex: Allow MUTEX_SPIN_ON_OWNER when DEBUG_MUTEXES
  2016-10-07 14:52 [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex Peter Zijlstra
                   ` (2 preceding siblings ...)
  2016-10-07 14:52 ` [PATCH -v4 3/8] locking/mutex: Kill arch specific code Peter Zijlstra
@ 2016-10-07 14:52 ` Peter Zijlstra
  2016-10-07 14:52 ` [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation Peter Zijlstra
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-07 14:52 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Peter Zijlstra
  Cc: Chris Wilson, Daniel Vetter

[-- Attachment #1: peterz-locking-mutex-enable-spin_on_owner-debug.patch --]
[-- Type: text/plain, Size: 557 bytes --]

Now that mutex::count and mutex::owner are the same field, we can
allow SPIN_ON_OWNER while DEBUG_MUTEX.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/Kconfig.locks |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -225,7 +225,7 @@ config ARCH_SUPPORTS_ATOMIC_RMW
 
 config MUTEX_SPIN_ON_OWNER
 	def_bool y
-	depends on SMP && !DEBUG_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
+	depends on SMP && ARCH_SUPPORTS_ATOMIC_RMW
 
 config RWSEM_SPIN_ON_OWNER
        def_bool y

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation
  2016-10-07 14:52 [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex Peter Zijlstra
                   ` (3 preceding siblings ...)
  2016-10-07 14:52 ` [PATCH -v4 4/8] locking/mutex: Allow MUTEX_SPIN_ON_OWNER when DEBUG_MUTEXES Peter Zijlstra
@ 2016-10-07 14:52 ` Peter Zijlstra
  2016-10-13 15:14   ` Will Deacon
                     ` (3 more replies)
  2016-10-07 14:52 ` [PATCH -v4 6/8] locking/mutex: Restructure wait loop Peter Zijlstra
                   ` (4 subsequent siblings)
  9 siblings, 4 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-07 14:52 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Peter Zijlstra
  Cc: Chris Wilson, Daniel Vetter

[-- Attachment #1: peterz-locking-mutex-steal.patch --]
[-- Type: text/plain, Size: 9153 bytes --]

Implement lock handoff to avoid lock starvation.

Lock starvation is possible because mutex_lock() allows lock stealing,
where a running (or optimistic spinning) task beats the woken waiter
to the acquire.

Lock stealing is an important performance optimization because waiting
for a waiter to wake up and get runtime can take a significant time,
during which everyboy would stall on the lock.

The down-side is of course that it allows for starvation.

This patch has the waiter requesting a handoff if it fails to acquire
the lock upon waking. This re-introduces some of the wait time,
because once we do a handoff we have to wait for the waiter to wake up
again.

A future patch will add a round of optimistic spinning to attempt to
alleviate this penalty, but if that turns out to not be enough, we can
add a counter and only request handoff after multiple failed wakeups.

There are a few tricky implementation details:

 - accepting a handoff must only be done in the wait-loop. Since the
   handoff condition is owner == current, it can easily cause
   recursive locking trouble.

 - accepting the handoff must be careful to provide the ACQUIRE
   semantics.

 - having the HANDOFF bit set on unlock requires care, we must not
   clear the owner.

 - we must be careful to not leave HANDOFF set after we've acquired
   the lock. The tricky scenario is setting the HANDOFF bit on an
   unlocked mutex.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/mutex.c |  142 +++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 119 insertions(+), 23 deletions(-)

--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -54,8 +54,10 @@ EXPORT_SYMBOL(__mutex_init);
  * bits to store extra state.
  *
  * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
+ * Bit1 indicates unlock needs to hand the lock to the top-waiter
  */
 #define MUTEX_FLAG_WAITERS	0x01
+#define MUTEX_FLAG_HANDOFF	0x02
 
 #define MUTEX_FLAGS		0x03
 
@@ -71,20 +73,48 @@ static inline unsigned long __owner_flag
 
 /*
  * Actual trylock that will work on any unlocked state.
+ *
+ * When setting the owner field, we must preserve the low flag bits.
+ *
+ * Be careful with @handoff, only set that in a wait-loop (where you set
+ * HANDOFF) to avoid recursive lock attempts.
  */
-static inline bool __mutex_trylock(struct mutex *lock)
+static inline bool __mutex_trylock(struct mutex *lock, const bool handoff)
 {
 	unsigned long owner, curr = (unsigned long)current;
 
 	owner = atomic_long_read(&lock->owner);
 	for (;;) { /* must loop, can race against a flag */
-		unsigned long old;
+		unsigned long old, flags = __owner_flags(owner);
+
+		if (__owner_task(owner)) {
+			if (handoff && unlikely(__owner_task(owner) == current)) {
+				/*
+				 * Provide ACQUIRE semantics for the lock-handoff.
+				 *
+				 * We cannot easily use load-acquire here, since
+				 * the actual load is a failed cmpxchg, which
+				 * doesn't imply any barriers.
+				 *
+				 * Also, this is a fairly unlikely scenario, and
+				 * this contains the cost.
+				 */
+				smp_mb(); /* ACQUIRE */
+				return true;
+			}
 
-		if (__owner_task(owner))
 			return false;
+		}
 
-		old = atomic_long_cmpxchg_acquire(&lock->owner, owner,
-						  curr | __owner_flags(owner));
+		/*
+		 * We set the HANDOFF bit, we must make sure it doesn't live
+		 * past the point where we acquire it. This would be possible
+		 * if we (accidentally) set the bit on an unlocked mutex.
+		 */
+		if (handoff)
+			flags &= ~MUTEX_FLAG_HANDOFF;
+
+		old = atomic_long_cmpxchg_acquire(&lock->owner, owner, curr | flags);
 		if (old == owner)
 			return true;
 
@@ -134,6 +164,39 @@ static inline void __mutex_clear_flag(st
 	atomic_long_andnot(flag, &lock->owner);
 }
 
+static inline bool __mutex_waiter_is_first(struct mutex *lock, struct mutex_waiter *waiter)
+{
+	return list_first_entry(&lock->wait_list, struct mutex_waiter, list) == waiter;
+}
+
+/*
+ * Give up ownership to a specific task, when @task = NULL, this is equivalent
+ * to a regular unlock. Clears HANDOFF, preserves WAITERS. Provides RELEASE
+ * semantics like a regular unlock, the __mutex_trylock() provides matching
+ * ACQUIRE semantics for the handoff.
+ */
+static void __mutex_handoff(struct mutex *lock, struct task_struct *task)
+{
+	unsigned long owner = atomic_long_read(&lock->owner);
+
+	for (;;) {
+		unsigned long old, new;
+
+#ifdef CONFIG_DEBUG_MUTEXES
+		DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current);
+#endif
+
+		new = (owner & MUTEX_FLAG_WAITERS);
+		new |= (unsigned long)task;
+
+		old = atomic_long_cmpxchg_release(&lock->owner, owner, new);
+		if (old == owner)
+			break;
+
+		owner = old;
+	}
+}
+
 #ifndef CONFIG_DEBUG_LOCK_ALLOC
 /*
  * We split the mutex lock/unlock logic into separate fastpath and
@@ -398,7 +461,7 @@ static bool mutex_optimistic_spin(struct
 			break;
 
 		/* Try to acquire the mutex if it is unlocked. */
-		if (__mutex_trylock(lock)) {
+		if (__mutex_trylock(lock, false)) {
 			osq_unlock(&lock->osq);
 			return true;
 		}
@@ -523,6 +586,7 @@ __mutex_lock_common(struct mutex *lock,
 	struct task_struct *task = current;
 	struct mutex_waiter waiter;
 	unsigned long flags;
+	bool first = false;
 	int ret;
 
 	if (use_ww_ctx) {
@@ -534,7 +598,8 @@ __mutex_lock_common(struct mutex *lock,
 	preempt_disable();
 	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
 
-	if (__mutex_trylock(lock) || mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
+	if (__mutex_trylock(lock, false) ||
+	    mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
 		/* got the lock, yay! */
 		lock_acquired(&lock->dep_map, ip);
 		if (use_ww_ctx) {
@@ -551,7 +616,7 @@ __mutex_lock_common(struct mutex *lock,
 	/*
 	 * After waiting to acquire the wait_lock, try again.
 	 */
-	if (__mutex_trylock(lock))
+	if (__mutex_trylock(lock, false))
 		goto skip_wait;
 
 	debug_mutex_lock_common(lock, &waiter);
@@ -561,13 +626,13 @@ __mutex_lock_common(struct mutex *lock,
 	list_add_tail(&waiter.list, &lock->wait_list);
 	waiter.task = task;
 
-	if (list_first_entry(&lock->wait_list, struct mutex_waiter, list) == &waiter)
+	if (__mutex_waiter_is_first(lock, &waiter))
 		__mutex_set_flag(lock, MUTEX_FLAG_WAITERS);
 
 	lock_contended(&lock->dep_map, ip);
 
 	for (;;) {
-		if (__mutex_trylock(lock))
+		if (__mutex_trylock(lock, first))
 			break;
 
 		/*
@@ -586,17 +651,20 @@ __mutex_lock_common(struct mutex *lock,
 		}
 
 		__set_task_state(task, state);
-
-		/* didn't get the lock, go to sleep: */
 		spin_unlock_mutex(&lock->wait_lock, flags);
 		schedule_preempt_disabled();
 		spin_lock_mutex(&lock->wait_lock, flags);
+
+		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
+			first = true;
+			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
+		}
 	}
 	__set_task_state(task, TASK_RUNNING);
 
 	mutex_remove_waiter(lock, &waiter, task);
 	if (likely(list_empty(&lock->wait_list)))
-		__mutex_clear_flag(lock, MUTEX_FLAG_WAITERS);
+		__mutex_clear_flag(lock, MUTEX_FLAGS);
 
 	debug_mutex_free_waiter(&waiter);
 
@@ -724,33 +792,61 @@ EXPORT_SYMBOL_GPL(__ww_mutex_lock_interr
  */
 static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip)
 {
+	struct task_struct *next = NULL;
 	unsigned long owner, flags;
 	WAKE_Q(wake_q);
 
 	mutex_release(&lock->dep_map, 1, ip);
 
 	/*
-	 * Release the lock before (potentially) taking the spinlock
-	 * such that other contenders can get on with things ASAP.
+	 * Release the lock before (potentially) taking the spinlock such that
+	 * other contenders can get on with things ASAP.
+	 *
+	 * Except when HANDOFF, in that case we must not clear the owner field,
+	 * but instead set it to the top waiter.
 	 */
-	owner = atomic_long_fetch_and_release(MUTEX_FLAGS, &lock->owner);
-	if (!__owner_flags(owner))
-		return;
+	owner = atomic_long_read(&lock->owner);
+	for (;;) {
+		unsigned long old;
+
+#ifdef CONFIG_DEBUG_MUTEXES
+		DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current);
+#endif
+
+		if (owner & MUTEX_FLAG_HANDOFF)
+			break;
+
+		old = atomic_long_cmpxchg_release(&lock->owner, owner,
+						  __owner_flags(owner));
+		if (old == owner) {
+			if (owner & MUTEX_FLAG_WAITERS)
+				break;
+
+			return;
+		}
+
+		owner = old;
+	}
 
 	spin_lock_mutex(&lock->wait_lock, flags);
 	debug_mutex_unlock(lock);
-
 	if (!list_empty(&lock->wait_list)) {
 		/* get the first entry from the wait-list: */
 		struct mutex_waiter *waiter =
-				list_entry(lock->wait_list.next,
-					   struct mutex_waiter, list);
+			list_first_entry(&lock->wait_list,
+					 struct mutex_waiter, list);
+
+		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		wake_q_add(&wake_q, waiter->task);
+		wake_q_add(&wake_q, next);
 	}
 
+	if (owner & MUTEX_FLAG_HANDOFF)
+		__mutex_handoff(lock, next);
+
 	spin_unlock_mutex(&lock->wait_lock, flags);
+
 	wake_up_q(&wake_q);
 }
 
@@ -853,7 +949,7 @@ __ww_mutex_lock_interruptible_slowpath(s
  */
 int __sched mutex_trylock(struct mutex *lock)
 {
-	bool locked = __mutex_trylock(lock);
+	bool locked = __mutex_trylock(lock, false);
 
 	if (locked)
 		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH -v4 6/8] locking/mutex: Restructure wait loop
  2016-10-07 14:52 [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex Peter Zijlstra
                   ` (4 preceding siblings ...)
  2016-10-07 14:52 ` [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation Peter Zijlstra
@ 2016-10-07 14:52 ` Peter Zijlstra
  2016-10-13 15:17   ` Will Deacon
  2016-10-17 23:16   ` [PATCH -v4 6/8] locking/mutex: Restructure wait loop Waiman Long
  2016-10-07 14:52 ` [PATCH -v4 7/8] locking/mutex: Simplify some ww_mutex code in __mutex_lock_common() Peter Zijlstra
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-07 14:52 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Peter Zijlstra
  Cc: Chris Wilson, Daniel Vetter

[-- Attachment #1: peterz-locking-mutex-wait_lock-opt.patch --]
[-- Type: text/plain, Size: 2558 bytes --]

Doesn't really matter yet, but pull the HANDOFF and trylock out from
under the wait_lock.

The intention is to add an optimistic spin loop here, which requires
we do not hold the wait_lock, so shuffle code around in preparation.

Also clarify the purpose of taking the wait_lock in the wait loop, its
tempting to want to avoid it altogether, but the cancellation cases
need to to avoid losing wakeups.

Suggested-by: Waiman Long <waiman.long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/mutex.c |   30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
 
 	lock_contended(&lock->dep_map, ip);
 
+	set_task_state(task, state);
 	for (;;) {
+		/*
+		 * Once we hold wait_lock, we're serialized against
+		 * mutex_unlock() handing the lock off to us, do a trylock
+		 * before testing the error conditions to make sure we pick up
+		 * the handoff.
+		 */
 		if (__mutex_trylock(lock, first))
-			break;
+			goto acquired;
 
 		/*
-		 * got a signal? (This code gets eliminated in the
-		 * TASK_UNINTERRUPTIBLE case.)
+		 * Check for signals and wound conditions while holding
+		 * wait_lock. This ensures the lock cancellation is ordered
+		 * against mutex_unlock() and wake-ups do not go missing.
 		 */
 		if (unlikely(signal_pending_state(state, task))) {
 			ret = -EINTR;
@@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
 				goto err;
 		}
 
-		__set_task_state(task, state);
 		spin_unlock_mutex(&lock->wait_lock, flags);
 		schedule_preempt_disabled();
-		spin_lock_mutex(&lock->wait_lock, flags);
 
 		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
 			first = true;
 			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
 		}
+
+		set_task_state(task, state);
+		/*
+		 * Here we order against unlock; we must either see it change
+		 * state back to RUNNING and fall through the next schedule(),
+		 * or we must see its unlock and acquire.
+		 */
+		if (__mutex_trylock(lock, first))
+			break;
+
+		spin_lock_mutex(&lock->wait_lock, flags);
 	}
+	spin_lock_mutex(&lock->wait_lock, flags);
+acquired:
 	__set_task_state(task, TASK_RUNNING);
 
 	mutex_remove_waiter(lock, &waiter, task);
@@ -682,6 +701,7 @@ __mutex_lock_common(struct mutex *lock,
 	return 0;
 
 err:
+	__set_task_state(task, TASK_RUNNING);
 	mutex_remove_waiter(lock, &waiter, task);
 	spin_unlock_mutex(&lock->wait_lock, flags);
 	debug_mutex_free_waiter(&waiter);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH -v4 7/8] locking/mutex: Simplify some ww_mutex code in __mutex_lock_common()
  2016-10-07 14:52 [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex Peter Zijlstra
                   ` (5 preceding siblings ...)
  2016-10-07 14:52 ` [PATCH -v4 6/8] locking/mutex: Restructure wait loop Peter Zijlstra
@ 2016-10-07 14:52 ` Peter Zijlstra
  2016-10-07 14:52 ` [PATCH -v4 8/8] locking/mutex: Enable optimistic spinning of woken waiter Peter Zijlstra
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-07 14:52 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Peter Zijlstra
  Cc: Chris Wilson, Daniel Vetter, Waiman Long

[-- Attachment #1: waiman_long-locking_mutex-simplify_some_ww_mutex_code_in___mutex_lock_common.patch --]
[-- Type: text/plain, Size: 1942 bytes --]

From: Waiman Long <Waiman.Long@hpe.com>

This patch removes some of the redundant ww_mutex code in
__mutex_lock_common().

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Imre Deak <imre.deak@intel.com>
Cc: Jason Low <jason.low2@hpe.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Will Deacon <Will.Deacon@arm.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/mutex.c |   13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -580,10 +580,11 @@ __mutex_lock_common(struct mutex *lock,
 	struct mutex_waiter waiter;
 	unsigned long flags;
 	bool first = false;
+	struct ww_mutex *ww;
 	int ret;
 
 	if (use_ww_ctx) {
-		struct ww_mutex *ww = container_of(lock, struct ww_mutex, base);
+		ww = container_of(lock, struct ww_mutex, base);
 		if (unlikely(ww_ctx == READ_ONCE(ww->ctx)))
 			return -EALREADY;
 	}
@@ -595,12 +596,8 @@ __mutex_lock_common(struct mutex *lock,
 	    mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
 		/* got the lock, yay! */
 		lock_acquired(&lock->dep_map, ip);
-		if (use_ww_ctx) {
-			struct ww_mutex *ww;
-			ww = container_of(lock, struct ww_mutex, base);
-
+		if (use_ww_ctx)
 			ww_mutex_set_context_fastpath(ww, ww_ctx);
-		}
 		preempt_enable();
 		return 0;
 	}
@@ -680,10 +677,8 @@ __mutex_lock_common(struct mutex *lock,
 	/* got the lock - cleanup and rejoice! */
 	lock_acquired(&lock->dep_map, ip);
 
-	if (use_ww_ctx) {
-		struct ww_mutex *ww = container_of(lock, struct ww_mutex, base);
+	if (use_ww_ctx)
 		ww_mutex_set_context_slowpath(ww, ww_ctx);
-	}
 
 	spin_unlock_mutex(&lock->wait_lock, flags);
 	preempt_enable();

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH -v4 8/8] locking/mutex: Enable optimistic spinning of woken waiter
  2016-10-07 14:52 [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex Peter Zijlstra
                   ` (6 preceding siblings ...)
  2016-10-07 14:52 ` [PATCH -v4 7/8] locking/mutex: Simplify some ww_mutex code in __mutex_lock_common() Peter Zijlstra
@ 2016-10-07 14:52 ` Peter Zijlstra
  2016-10-13 15:28   ` Will Deacon
  2016-10-17 23:21   ` Waiman Long
  2016-10-07 15:20 ` [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex Linus Torvalds
  2016-10-11 18:42 ` Jason Low
  9 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-07 14:52 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Peter Zijlstra
  Cc: Chris Wilson, Daniel Vetter, Waiman Long

[-- Attachment #1: waiman_long-locking_mutex-enable_optimistic_spinning_of_woken_waiter.patch --]
[-- Type: text/plain, Size: 5142 bytes --]

From: Waiman Long <Waiman.Long@hpe.com>

This patch makes the waiter that sets the HANDOFF flag start spinning
instead of sleeping until the handoff is complete or the owner
sleeps. Otherwise, the handoff will cause the optimistic spinners to
abort spinning as the handed-off owner may not be running.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Imre Deak <imre.deak@intel.com>
Cc: Jason Low <jason.low2@hpe.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Will Deacon <Will.Deacon@arm.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/mutex.c |   77 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 54 insertions(+), 23 deletions(-)

--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -416,24 +416,39 @@ static inline int mutex_can_spin_on_owne
  *
  * Returns true when the lock was taken, otherwise false, indicating
  * that we need to jump to the slowpath and sleep.
+ *
+ * The waiter flag is set to true if the spinner is a waiter in the wait
+ * queue. The waiter-spinner will spin on the lock directly and concurrently
+ * with the spinner at the head of the OSQ, if present, until the owner is
+ * changed to itself.
  */
 static bool mutex_optimistic_spin(struct mutex *lock,
-				  struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
+				  struct ww_acquire_ctx *ww_ctx,
+				  const bool use_ww_ctx, const bool waiter)
 {
 	struct task_struct *task = current;
 
-	if (!mutex_can_spin_on_owner(lock))
-		goto done;
+	if (!waiter) {
+		/*
+		 * The purpose of the mutex_can_spin_on_owner() function is
+		 * to eliminate the overhead of osq_lock() and osq_unlock()
+		 * in case spinning isn't possible. As a waiter-spinner
+		 * is not going to take OSQ lock anyway, there is no need
+		 * to call mutex_can_spin_on_owner().
+		 */
+		if (!mutex_can_spin_on_owner(lock))
+			goto fail;
 
-	/*
-	 * In order to avoid a stampede of mutex spinners trying to
-	 * acquire the mutex all at once, the spinners need to take a
-	 * MCS (queued) lock first before spinning on the owner field.
-	 */
-	if (!osq_lock(&lock->osq))
-		goto done;
+		/*
+		 * In order to avoid a stampede of mutex spinners trying to
+		 * acquire the mutex all at once, the spinners need to take a
+		 * MCS (queued) lock first before spinning on the owner field.
+		 */
+		if (!osq_lock(&lock->osq))
+			goto fail;
+	}
 
-	while (true) {
+	for (;;) {
 		struct task_struct *owner;
 
 		if (use_ww_ctx && ww_ctx->acquired > 0) {
@@ -449,7 +464,7 @@ static bool mutex_optimistic_spin(struct
 			 * performed the optimistic spinning cannot be done.
 			 */
 			if (READ_ONCE(ww->ctx))
-				break;
+				goto fail_unlock;
 		}
 
 		/*
@@ -457,15 +472,20 @@ static bool mutex_optimistic_spin(struct
 		 * release the lock or go to sleep.
 		 */
 		owner = __mutex_owner(lock);
-		if (owner && !mutex_spin_on_owner(lock, owner))
-			break;
+		if (owner) {
+			if (waiter && owner == task) {
+				smp_mb(); /* ACQUIRE */
+				break;
+			}
 
-		/* Try to acquire the mutex if it is unlocked. */
-		if (__mutex_trylock(lock, false)) {
-			osq_unlock(&lock->osq);
-			return true;
+			if (!mutex_spin_on_owner(lock, owner))
+				goto fail_unlock;
 		}
 
+		/* Try to acquire the mutex if it is unlocked. */
+		if (__mutex_trylock(lock, waiter))
+			break;
+
 		/*
 		 * The cpu_relax() call is a compiler barrier which forces
 		 * everything in this loop to be re-loaded. We don't need
@@ -475,8 +495,17 @@ static bool mutex_optimistic_spin(struct
 		cpu_relax_lowlatency();
 	}
 
-	osq_unlock(&lock->osq);
-done:
+	if (!waiter)
+		osq_unlock(&lock->osq);
+
+	return true;
+
+
+fail_unlock:
+	if (!waiter)
+		osq_unlock(&lock->osq);
+
+fail:
 	/*
 	 * If we fell out of the spin path because of need_resched(),
 	 * reschedule now, before we try-lock the mutex. This avoids getting
@@ -495,7 +524,8 @@ static bool mutex_optimistic_spin(struct
 }
 #else
 static bool mutex_optimistic_spin(struct mutex *lock,
-				  struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
+				  struct ww_acquire_ctx *ww_ctx,
+				  const bool use_ww_ctx, const bool waiter)
 {
 	return false;
 }
@@ -600,7 +630,7 @@ __mutex_lock_common(struct mutex *lock,
 	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
 
 	if (__mutex_trylock(lock, false) ||
-	    mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
+	    mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx, false)) {
 		/* got the lock, yay! */
 		lock_acquired(&lock->dep_map, ip);
 		if (use_ww_ctx)
@@ -669,7 +699,8 @@ __mutex_lock_common(struct mutex *lock,
 		 * state back to RUNNING and fall through the next schedule(),
 		 * or we must see its unlock and acquire.
 		 */
-		if (__mutex_trylock(lock, first))
+		if ((first && mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx, true)) ||
+		     __mutex_trylock(lock, first))
 			break;
 
 		spin_lock_mutex(&lock->wait_lock, flags);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex
  2016-10-07 14:52 [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex Peter Zijlstra
                   ` (7 preceding siblings ...)
  2016-10-07 14:52 ` [PATCH -v4 8/8] locking/mutex: Enable optimistic spinning of woken waiter Peter Zijlstra
@ 2016-10-07 15:20 ` Linus Torvalds
  2016-10-11 18:42 ` Jason Low
  9 siblings, 0 replies; 59+ messages in thread
From: Linus Torvalds @ 2016-10-07 15:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Jason Low, Ding Tianhong, Thomas Gleixner,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Jason Low, Chris Wilson, Daniel Vetter

On Fri, Oct 7, 2016 at 7:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Since you all should not be sending patches during the merge window, I figured
> I should to keep you all occupied with something.

Thanks..  I guess.

No real review of the code itself from me (I'll leave that to Waiman
and the usual gang), but I have to say that I like how it looks from
30000 ft.  Getting rid of the arch-specific code is lovely.

                      Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-10-07 14:52 ` [PATCH -v4 1/8] locking/drm: Kill mutex trickery Peter Zijlstra
@ 2016-10-07 15:43   ` Peter Zijlstra
  2016-10-07 15:58     ` Linus Torvalds
                       ` (2 more replies)
  2016-10-18 12:48   ` Peter Zijlstra
  1 sibling, 3 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-07 15:43 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney
  Cc: Chris Wilson, Daniel Vetter, Rob Clark

On Fri, Oct 07, 2016 at 04:52:44PM +0200, Peter Zijlstra wrote:
> Poking at lock internals is not cool. Since I'm going to change the
> implementation this will break, take it out.


So something like the below would serve as a replacement for your
previous hacks. Is this API something acceptable to people? Ingo,
Thomas?

---
 include/linux/mutex.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 4d3bccabbea5..afcff2c85957 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -189,4 +189,29 @@ extern void mutex_unlock(struct mutex *lock);
 
 extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);
 
+enum mutex_trylock_recursive_enum {
+	mutex_trylock_failed = 0,
+	mutex_trylock_success = 1,
+	mutex_trylock_recursive,
+};
+
+/**
+ * mutex_trylock_recursive - trylock variant that allows recursive locking
+ * @lock: mutex to be locked
+ *
+ *
+ * Returns:
+ *  mutex_trylock_failed    - trylock failed,
+ *  mutex_trylock_success   - lock acquired,
+ *  mutex_trylock_recursive - we already owned the lock.
+ */
+static inline enum mutex_trylock_recursive_enum
+mutex_trylock_recursive(struct mutex *lock)
+{
+	if (unlikely(__mutex_owner(lock) == current))
+		return mutex_trylock_recursive;
+
+	return mutex_trylock(lock);
+}
+
 #endif /* __LINUX_MUTEX_H */

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-10-07 15:43   ` Peter Zijlstra
@ 2016-10-07 15:58     ` Linus Torvalds
  2016-10-07 16:13       ` Peter Zijlstra
  2016-10-08 11:58     ` Thomas Gleixner
  2016-11-09 10:38     ` Peter Zijlstra
  2 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2016-10-07 15:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Jason Low, Ding Tianhong, Thomas Gleixner,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Chris Wilson, Daniel Vetter, Rob Clark

On Fri, Oct 7, 2016 at 8:43 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So something like the below would serve as a replacement for your
> previous hacks. Is this API something acceptable to people? Ingo,
> Thomas?

Ugh. I think the concept is fine, but can we place make these enum's
be all upper case or something to make them really stand out visually.

Also, it's a bit subtle how this depends on the success/failure enums
to just have the same values as the return value of the non-try mutex.
So a comment about why that particular numbering is important above
the enum definition, please?

You _kind_ of imply that magic numbering requirementby explicitly
using " = 0" and " = 1", but quite frankly, in many ways that's really
nasty, because I can see somebody looking at that enum definition and
saying "that makes no sense: it sets *two* of the three values
explicitly, and the values it sets explicitly are the same that it
would get implicitly _anyway_, so I'll just clean this up".

So I really think that enum needs a comment on the forced choice of
values, and making them all caps would make the users more obvious, I
think.

The other choice would be to just make the choices be negative (==
recursive), zero (== failed) or positive (== got lock), which allows
for the same value re-use for the non-recursive case, and you could
avoid the enum entirely.

                  Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-10-07 15:58     ` Linus Torvalds
@ 2016-10-07 16:13       ` Peter Zijlstra
  2016-10-07 21:58         ` Waiman Long
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-07 16:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Jason Low, Ding Tianhong, Thomas Gleixner,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Chris Wilson, Daniel Vetter, Rob Clark

On Fri, Oct 07, 2016 at 08:58:43AM -0700, Linus Torvalds wrote:
> Ugh. I think the concept is fine, but can we place make these enum's
> be all upper case or something to make them really stand out visually.

OK.

> The other choice would be to just make the choices be negative (==
> recursive), zero (== failed) or positive (== got lock), which allows
> for the same value re-use for the non-recursive case, and you could
> avoid the enum entirely.

I thought about that, but liked the enum better for having to then spell
it out.

I'll go make the enum shout and add comment as you suggest.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-10-07 16:13       ` Peter Zijlstra
@ 2016-10-07 21:58         ` Waiman Long
  0 siblings, 0 replies; 59+ messages in thread
From: Waiman Long @ 2016-10-07 21:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Jason Low, Ding Tianhong, Thomas Gleixner,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Chris Wilson, Daniel Vetter, Rob Clark

On 10/07/2016 12:13 PM, Peter Zijlstra wrote:
> On Fri, Oct 07, 2016 at 08:58:43AM -0700, Linus Torvalds wrote:
>> The other choice would be to just make the choices be negative (==
>> recursive), zero (== failed) or positive (== got lock), which allows
>> for the same value re-use for the non-recursive case, and you could
>> avoid the enum entirely.
> I thought about that, but liked the enum better for having to then spell
> it out.
>
> I'll go make the enum shout and add comment as you suggest.

I like the idea of having a tri-state returned value (<0, 0, >0). I 
don't mind having the enum, but just making mutex_trylock_recursive 
equal to -1 will be great.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-10-07 15:43   ` Peter Zijlstra
  2016-10-07 15:58     ` Linus Torvalds
@ 2016-10-08 11:58     ` Thomas Gleixner
  2016-10-08 14:01       ` Peter Zijlstra
  2016-11-09 10:38     ` Peter Zijlstra
  2 siblings, 1 reply; 59+ messages in thread
From: Thomas Gleixner @ 2016-10-08 11:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Chris Wilson, Daniel Vetter, Rob Clark

On Fri, 7 Oct 2016, Peter Zijlstra wrote:
> On Fri, Oct 07, 2016 at 04:52:44PM +0200, Peter Zijlstra wrote:
> > Poking at lock internals is not cool. Since I'm going to change the
> > implementation this will break, take it out.
> 
> 
> So something like the below would serve as a replacement for your
> previous hacks. Is this API something acceptable to people? Ingo,
> Thomas?
> 
> ---
>  include/linux/mutex.h | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/include/linux/mutex.h b/include/linux/mutex.h
> index 4d3bccabbea5..afcff2c85957 100644
> --- a/include/linux/mutex.h
> +++ b/include/linux/mutex.h
> @@ -189,4 +189,29 @@ extern void mutex_unlock(struct mutex *lock);
>  
>  extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);
>  
> +enum mutex_trylock_recursive_enum {
> +	mutex_trylock_failed = 0,
> +	mutex_trylock_success = 1,
> +	mutex_trylock_recursive,

Upper case enum symbols please, if at all.

> +};
> +
> +/**
> + * mutex_trylock_recursive - trylock variant that allows recursive locking
> + * @lock: mutex to be locked
> + *
> + *
> + * Returns:
> + *  mutex_trylock_failed    - trylock failed,
> + *  mutex_trylock_success   - lock acquired,
> + *  mutex_trylock_recursive - we already owned the lock.
> + */
> +static inline enum mutex_trylock_recursive_enum
> +mutex_trylock_recursive(struct mutex *lock)
> +{
> +	if (unlikely(__mutex_owner(lock) == current))
> +		return mutex_trylock_recursive;
> +
> +	return mutex_trylock(lock);
> +}

Hmm. I'm not a great fan of this, because that requires an conditional
unlock mechanism.

       res = trylock_recursive(lock);
       if (res == FAILED)
       	       goto out;
       .....

       if (res == SUCCESS)
       	       unlock(lock);

While if you actually keep track of recursion you can do:
  
      if (!trylock_recursive(lock))
		goto out;

      ....

      unlock_recursive(lock);

or even:

     lock_recursive(lock);

     unlock_recursive(lock);

That's making lock/trylock and unlock symetric, so its obvious in the
source what's going on and the recursion tracking allows for better
debugability.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-10-08 11:58     ` Thomas Gleixner
@ 2016-10-08 14:01       ` Peter Zijlstra
  2016-10-08 14:11         ` Thomas Gleixner
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-08 14:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Chris Wilson, Daniel Vetter, Rob Clark

On Sat, Oct 08, 2016 at 01:58:07PM +0200, Thomas Gleixner wrote:
> Hmm. I'm not a great fan of this, because that requires an conditional
> unlock mechanism.
> 
>        res = trylock_recursive(lock);
>        if (res == FAILED)
>        	       goto out;
>        .....
> 
>        if (res == SUCCESS)
>        	       unlock(lock);
> 
> While if you actually keep track of recursion you can do:
>   
>       if (!trylock_recursive(lock))
> 		goto out;
> 
>       ....
> 
>       unlock_recursive(lock);
> 
> or even:
> 
>      lock_recursive(lock);
> 
>      unlock_recursive(lock);
> 
> That's making lock/trylock and unlock symetric, so its obvious in the
> source what's going on and the recursion tracking allows for better
> debugability.

Hurm,. so I thought that in general we disliked recursive locking
because it quickly turns in to a horrible mess.

Adding such primitives makes it 'easy' to use recursive locking and then
where does it stop?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-10-08 14:01       ` Peter Zijlstra
@ 2016-10-08 14:11         ` Thomas Gleixner
  2016-10-08 16:42           ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Thomas Gleixner @ 2016-10-08 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Chris Wilson, Daniel Vetter, Rob Clark

On Sat, 8 Oct 2016, Peter Zijlstra wrote:
> On Sat, Oct 08, 2016 at 01:58:07PM +0200, Thomas Gleixner wrote:
> > Hmm. I'm not a great fan of this, because that requires an conditional
> > unlock mechanism.
> > 
> >        res = trylock_recursive(lock);
> >        if (res == FAILED)
> >        	       goto out;
> >        .....
> > 
> >        if (res == SUCCESS)
> >        	       unlock(lock);
> > 
> > While if you actually keep track of recursion you can do:
> >   
> >       if (!trylock_recursive(lock))
> > 		goto out;
> > 
> >       ....
> > 
> >       unlock_recursive(lock);
> > 
> > or even:
> > 
> >      lock_recursive(lock);
> > 
> >      unlock_recursive(lock);
> > 
> > That's making lock/trylock and unlock symetric, so its obvious in the
> > source what's going on and the recursion tracking allows for better
> > debugability.
> 
> Hurm,. so I thought that in general we disliked recursive locking
> because it quickly turns in to a horrible mess.
> 
> Adding such primitives makes it 'easy' to use recursive locking and then
> where does it stop?

Well, when you add just trylock_recursive then people are going to use it
anyway no matter whether it is easy or not.

So if we decide to provide something which supports recursive locking for
mutexes then we are better off doing it with a proper set of functions and
not just a single undebugable wrapper.

Thanks,

	tglx

 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-10-08 14:11         ` Thomas Gleixner
@ 2016-10-08 16:42           ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-08 16:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Chris Wilson, Daniel Vetter, Rob Clark

On Sat, Oct 08, 2016 at 04:11:25PM +0200, Thomas Gleixner wrote:

> Well, when you add just trylock_recursive then people are going to use it
> anyway no matter whether it is easy or not.
> 
> So if we decide to provide something which supports recursive locking for
> mutexes then we are better off doing it with a proper set of functions and
> not just a single undebugable wrapper.

So ideally I'd say, no recursive stuff at all. But that means the GEM
people need to either do custom hacks (and you know that will spread),
or need to be convinced to rework their locking somehow.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex
  2016-10-07 14:52 [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex Peter Zijlstra
                   ` (8 preceding siblings ...)
  2016-10-07 15:20 ` [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex Linus Torvalds
@ 2016-10-11 18:42 ` Jason Low
  9 siblings, 0 replies; 59+ messages in thread
From: Jason Low @ 2016-10-11 18:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: jason.low2, Linus Torvalds, Waiman Long, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Chris Wilson, Daniel Vetter

On Fri, 2016-10-07 at 16:52 +0200, Peter Zijlstra wrote:
> Hi all,
> 
> Since you all should not be sending patches during the merge window, I figured
> I should to keep you all occupied with something.
> 
> Please review, test and otherwise try to break these here patches.
> 
> I would like to get these patches into -tip (and -next) once the merge window
> closes, so please spend these quiet days staring at this stuff.
> 
> Small changes only, mostly the handoff logic as suggested by Waiman last time.

I tested these new patches on the 8 socket system with the high_systime
workload which stresses mutexes.

The average throughput with and without the patches were very similar at
lower levels of contention. At high contention, the throughput with the
patches were slightly lower, but only by a small amount, which I think
is expected due to the additional waiter wait time, ect...

-------------------------------------------------
|  users      | avg throughput | avg throughput |
              | without patch  | with patch     |
-------------------------------------------------
| 10 - 90     |   13,989 JPM   |   13,920 JPM   |
-------------------------------------------------
| 100 - 900   |   76,362 JPM   |   76,298 JPM   |
-------------------------------------------------
| 1000 - 1900 |   77,146 JPM   |   76,061 JPM   |
-------------------------------------------------

Tested-by: Jason Low <jason.low2@hpe.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 2/8] locking/mutex: Rework mutex::owner
  2016-10-07 14:52 ` [PATCH -v4 2/8] locking/mutex: Rework mutex::owner Peter Zijlstra
@ 2016-10-12 17:59   ` Davidlohr Bueso
  2016-10-12 19:52     ` Jason Low
  2016-10-13 15:18   ` Will Deacon
  1 sibling, 1 reply; 59+ messages in thread
From: Davidlohr Bueso @ 2016-10-12 17:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Daniel Vetter

On Fri, 07 Oct 2016, Peter Zijlstra wrote:
>+/*
>+ * Optimistic trylock that only works in the uncontended case. Make sure to
>+ * follow with a __mutex_trylock() before failing.
>+ */
>+static __always_inline bool __mutex_trylock_fast(struct mutex *lock)
>+{
>+	unsigned long curr = (unsigned long)current;
>+
>+	if (!atomic_long_cmpxchg_acquire(&lock->owner, 0UL, curr))
>+		return true;

Do we want to do a ccas check for !lock->owner? Although I can see a possible
case of 'optimizing for the contended' reasons for nay.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 2/8] locking/mutex: Rework mutex::owner
  2016-10-12 17:59   ` Davidlohr Bueso
@ 2016-10-12 19:52     ` Jason Low
  0 siblings, 0 replies; 59+ messages in thread
From: Jason Low @ 2016-10-12 19:52 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: jason.low2, Peter Zijlstra, Linus Torvalds, Waiman Long,
	Ding Tianhong, Thomas Gleixner, Will Deacon, Ingo Molnar,
	Imre Deak, Linux Kernel Mailing List, Tim Chen, Terry Rudd,
	Paul E. McKenney, Chris Wilson, Daniel Vetter

On Wed, 2016-10-12 at 10:59 -0700, Davidlohr Bueso wrote:
> On Fri, 07 Oct 2016, Peter Zijlstra wrote:
> >+/*
> >+ * Optimistic trylock that only works in the uncontended case. Make sure to
> >+ * follow with a __mutex_trylock() before failing.
> >+ */
> >+static __always_inline bool __mutex_trylock_fast(struct mutex *lock)
> >+{
> >+	unsigned long curr = (unsigned long)current;
> >+
> >+	if (!atomic_long_cmpxchg_acquire(&lock->owner, 0UL, curr))
> >+		return true;
> 
> Do we want to do a ccas check for !lock->owner? Although I can see a possible
> case of 'optimizing for the contended' reasons for nay.

Since this is the fast path version that gets used in mutex_lock(),
ect..., I think it would make sense to keep it like it is so that we
optimize it for the "common" case. This trylock function is more likely
to succeed as it is used for the initial attempt to get the mutex, so I
think we could avoid the ccas in the trylock_fast().

Jason

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation
  2016-10-07 14:52 ` [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation Peter Zijlstra
@ 2016-10-13 15:14   ` Will Deacon
  2016-10-17  9:22     ` Peter Zijlstra
  2016-10-17 18:45   ` Waiman Long
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 59+ messages in thread
From: Will Deacon @ 2016-10-13 15:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Daniel Vetter

Hi Peter,

Just one comment below.

On Fri, Oct 07, 2016 at 04:52:48PM +0200, Peter Zijlstra wrote:
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -54,8 +54,10 @@ EXPORT_SYMBOL(__mutex_init);
>   * bits to store extra state.
>   *
>   * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
> + * Bit1 indicates unlock needs to hand the lock to the top-waiter
>   */
>  #define MUTEX_FLAG_WAITERS	0x01
> +#define MUTEX_FLAG_HANDOFF	0x02
>  
>  #define MUTEX_FLAGS		0x03
>  
> @@ -71,20 +73,48 @@ static inline unsigned long __owner_flag
>  
>  /*
>   * Actual trylock that will work on any unlocked state.
> + *
> + * When setting the owner field, we must preserve the low flag bits.
> + *
> + * Be careful with @handoff, only set that in a wait-loop (where you set
> + * HANDOFF) to avoid recursive lock attempts.
>   */
> -static inline bool __mutex_trylock(struct mutex *lock)
> +static inline bool __mutex_trylock(struct mutex *lock, const bool handoff)
>  {
>  	unsigned long owner, curr = (unsigned long)current;
>  
>  	owner = atomic_long_read(&lock->owner);
>  	for (;;) { /* must loop, can race against a flag */
> -		unsigned long old;
> +		unsigned long old, flags = __owner_flags(owner);
> +
> +		if (__owner_task(owner)) {
> +			if (handoff && unlikely(__owner_task(owner) == current)) {
> +				/*
> +				 * Provide ACQUIRE semantics for the lock-handoff.
> +				 *
> +				 * We cannot easily use load-acquire here, since
> +				 * the actual load is a failed cmpxchg, which
> +				 * doesn't imply any barriers.
> +				 *
> +				 * Also, this is a fairly unlikely scenario, and
> +				 * this contains the cost.
> +				 */
> +				smp_mb(); /* ACQUIRE */

As we discussed on another thread recently, a failed cmpxchg_acquire
will always give you ACQUIRE semantics in practice. Maybe we should update
the documentation to allow this? The only special case is the full-barrier
version.

Will

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
  2016-10-07 14:52 ` [PATCH -v4 6/8] locking/mutex: Restructure wait loop Peter Zijlstra
@ 2016-10-13 15:17   ` Will Deacon
  2016-10-17 10:44     ` Peter Zijlstra
  2016-10-19 17:34     ` Peter Zijlstra
  2016-10-17 23:16   ` [PATCH -v4 6/8] locking/mutex: Restructure wait loop Waiman Long
  1 sibling, 2 replies; 59+ messages in thread
From: Will Deacon @ 2016-10-13 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Daniel Vetter

Hi Peter,

I'm struggling to get my head around the handoff code after this change...

On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
>  
>  	lock_contended(&lock->dep_map, ip);
>  
> +	set_task_state(task, state);
>  	for (;;) {
> +		/*
> +		 * Once we hold wait_lock, we're serialized against
> +		 * mutex_unlock() handing the lock off to us, do a trylock
> +		 * before testing the error conditions to make sure we pick up
> +		 * the handoff.
> +		 */
>  		if (__mutex_trylock(lock, first))
> -			break;
> +			goto acquired;
>  
>  		/*
> -		 * got a signal? (This code gets eliminated in the
> -		 * TASK_UNINTERRUPTIBLE case.)
> +		 * Check for signals and wound conditions while holding
> +		 * wait_lock. This ensures the lock cancellation is ordered
> +		 * against mutex_unlock() and wake-ups do not go missing.
>  		 */
>  		if (unlikely(signal_pending_state(state, task))) {
>  			ret = -EINTR;
> @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
>  				goto err;
>  		}
>  
> -		__set_task_state(task, state);
>  		spin_unlock_mutex(&lock->wait_lock, flags);
>  		schedule_preempt_disabled();
> -		spin_lock_mutex(&lock->wait_lock, flags);
>  
>  		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
>  			first = true;
>  			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
>  		}
> +
> +		set_task_state(task, state);

With this change, we no longer hold the lock wit_hen we set the task
state, and it's ordered strictly *after* setting the HANDOFF flag.
Doesn't that mean that the unlock code can see the HANDOFF flag, issue
the wakeup, but then we come in and overwrite the task state?

I'm struggling to work out whether that's an issue, but it certainly
feels odd and is a change from the previous behaviour.

Will

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 2/8] locking/mutex: Rework mutex::owner
  2016-10-07 14:52 ` [PATCH -v4 2/8] locking/mutex: Rework mutex::owner Peter Zijlstra
  2016-10-12 17:59   ` Davidlohr Bueso
@ 2016-10-13 15:18   ` Will Deacon
  1 sibling, 0 replies; 59+ messages in thread
From: Will Deacon @ 2016-10-13 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Daniel Vetter

On Fri, Oct 07, 2016 at 04:52:45PM +0200, Peter Zijlstra wrote:
> The current mutex implementation has an atomic lock word and a
> non-atomic owner field.
> 
> This disparity leads to a number of issues with the current mutex code
> as it means that we can have a locked mutex without an explicit owner
> (because the owner field has not been set, or already cleared).
> 
> This leads to a number of weird corner cases, esp. between the
> optimistic spinning and debug code. Where the optimistic spinning
> code needs the owner field updated inside the lock region, the debug
> code is more relaxed because the whole lock is serialized by the
> wait_lock.
> 
> Also, the spinning code itself has a few corner cases where we need to
> deal with a held lock without an owner field.
> 
> Furthermore, it becomes even more of a problem when trying to fix
> starvation cases in the current code. We end up stacking special case
> on special case.
> 
> To solve this rework the basic mutex implementation to be a single
> atomic word that contains the owner and uses the low bits for extra
> state.
> 
> This matches how PI futexes and rt_mutex already work. By having the
> owner an integral part of the lock state a lot of the problems
> dissapear and we get a better option to deal with starvation cases,
> direct owner handoff.
> 
> Changing the basic mutex does however invalidate all the arch specific
> mutex code; this patch leaves that unused in-place, a later patch will
> remove that.
> 
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/mutex-debug.h  |   24 --
>  include/linux/mutex.h        |   46 +++--
>  kernel/locking/mutex-debug.c |   13 -
>  kernel/locking/mutex-debug.h |   10 -
>  kernel/locking/mutex.c       |  371 ++++++++++++++++++-------------------------
>  kernel/locking/mutex.h       |   26 ---
>  kernel/sched/core.c          |    2 
>  7 files changed, 187 insertions(+), 305 deletions(-)

Looks good to me:

Reviewed-by: Will Deacon <will.deacon@arm.com>

Will

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 8/8] locking/mutex: Enable optimistic spinning of woken waiter
  2016-10-07 14:52 ` [PATCH -v4 8/8] locking/mutex: Enable optimistic spinning of woken waiter Peter Zijlstra
@ 2016-10-13 15:28   ` Will Deacon
  2016-10-17  9:32     ` Peter Zijlstra
  2016-10-17 23:21   ` Waiman Long
  1 sibling, 1 reply; 59+ messages in thread
From: Will Deacon @ 2016-10-13 15:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Daniel Vetter

On Fri, Oct 07, 2016 at 04:52:51PM +0200, Peter Zijlstra wrote:
> From: Waiman Long <Waiman.Long@hpe.com>
> 
> This patch makes the waiter that sets the HANDOFF flag start spinning
> instead of sleeping until the handoff is complete or the owner
> sleeps. Otherwise, the handoff will cause the optimistic spinners to
> abort spinning as the handed-off owner may not be running.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Imre Deak <imre.deak@intel.com>
> Cc: Jason Low <jason.low2@hpe.com>
> Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
> Cc: Ding Tianhong <dingtianhong@huawei.com>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Will Deacon <Will.Deacon@arm.com>
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/locking/mutex.c |   77 ++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 54 insertions(+), 23 deletions(-)
> 
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -416,24 +416,39 @@ static inline int mutex_can_spin_on_owne
>   *
>   * Returns true when the lock was taken, otherwise false, indicating
>   * that we need to jump to the slowpath and sleep.
> + *
> + * The waiter flag is set to true if the spinner is a waiter in the wait
> + * queue. The waiter-spinner will spin on the lock directly and concurrently
> + * with the spinner at the head of the OSQ, if present, until the owner is
> + * changed to itself.
>   */
>  static bool mutex_optimistic_spin(struct mutex *lock,
> -				  struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
> +				  struct ww_acquire_ctx *ww_ctx,
> +				  const bool use_ww_ctx, const bool waiter)
>  {
>  	struct task_struct *task = current;
>  
> -	if (!mutex_can_spin_on_owner(lock))
> -		goto done;
> +	if (!waiter) {
> +		/*
> +		 * The purpose of the mutex_can_spin_on_owner() function is
> +		 * to eliminate the overhead of osq_lock() and osq_unlock()
> +		 * in case spinning isn't possible. As a waiter-spinner
> +		 * is not going to take OSQ lock anyway, there is no need
> +		 * to call mutex_can_spin_on_owner().
> +		 */
> +		if (!mutex_can_spin_on_owner(lock))
> +			goto fail;
>  
> -	/*
> -	 * In order to avoid a stampede of mutex spinners trying to
> -	 * acquire the mutex all at once, the spinners need to take a
> -	 * MCS (queued) lock first before spinning on the owner field.
> -	 */
> -	if (!osq_lock(&lock->osq))
> -		goto done;
> +		/*
> +		 * In order to avoid a stampede of mutex spinners trying to
> +		 * acquire the mutex all at once, the spinners need to take a
> +		 * MCS (queued) lock first before spinning on the owner field.
> +		 */
> +		if (!osq_lock(&lock->osq))
> +			goto fail;
> +	}
>  
> -	while (true) {
> +	for (;;) {
>  		struct task_struct *owner;
>  
>  		if (use_ww_ctx && ww_ctx->acquired > 0) {
> @@ -449,7 +464,7 @@ static bool mutex_optimistic_spin(struct
>  			 * performed the optimistic spinning cannot be done.
>  			 */
>  			if (READ_ONCE(ww->ctx))
> -				break;
> +				goto fail_unlock;
>  		}
>  
>  		/*
> @@ -457,15 +472,20 @@ static bool mutex_optimistic_spin(struct
>  		 * release the lock or go to sleep.
>  		 */
>  		owner = __mutex_owner(lock);
> -		if (owner && !mutex_spin_on_owner(lock, owner))
> -			break;
> +		if (owner) {
> +			if (waiter && owner == task) {
> +				smp_mb(); /* ACQUIRE */

Hmm, is this barrier actually needed? This only happens on the handoff path,
and we take the wait_lock immediately after this succeeds anyway. That
control dependency, coupled with the acquire semantics of the spin_lock,
should be sufficient, no?

Will

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation
  2016-10-13 15:14   ` Will Deacon
@ 2016-10-17  9:22     ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-17  9:22 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Daniel Vetter

On Thu, Oct 13, 2016 at 04:14:47PM +0100, Will Deacon wrote:

> > +		if (__owner_task(owner)) {
> > +			if (handoff && unlikely(__owner_task(owner) == current)) {
> > +				/*
> > +				 * Provide ACQUIRE semantics for the lock-handoff.
> > +				 *
> > +				 * We cannot easily use load-acquire here, since
> > +				 * the actual load is a failed cmpxchg, which
> > +				 * doesn't imply any barriers.
> > +				 *
> > +				 * Also, this is a fairly unlikely scenario, and
> > +				 * this contains the cost.
> > +				 */
> > +				smp_mb(); /* ACQUIRE */
> 
> As we discussed on another thread recently, a failed cmpxchg_acquire
> will always give you ACQUIRE semantics in practice. Maybe we should update
> the documentation to allow this? The only special case is the full-barrier
> version.

So on PPC we do:

static __always_inline unsigned long
__cmpxchg_u32_acquire(u32 *p, unsigned long old, unsigned long new)
{
        unsigned long prev;

        __asm__ __volatile__ (
"1:     lwarx   %0,0,%2         # __cmpxchg_u32_acquire\n"
"       cmpw    0,%0,%3\n"
"       bne-    2f\n"
        PPC405_ERR77(0, %2)
"       stwcx.  %4,0,%2\n"
"       bne-    1b\n"
        PPC_ACQUIRE_BARRIER
        "\n"
"2:"
        : "=&r" (prev), "+m" (*p)
        : "r" (p), "r" (old), "r" (new)
        : "cc", "memory");

        return prev;
}

which I read to skip over the ACQUIRE_BARRIER on fail.


Similarly, we _could_ make the generic version skip the barrier entirely
(we currently do not it seems).


And while I agree that it makes semantic sense, in that we always issue
the LOAD, and since we defined the ACQUIRE to apply to the LOADs only,
and we always issue the LOAD, we should also always provide ACQUIRE
semantics. I'm not entirely convinced we should go there just yet. It
would make failed cmpxchg_acquire()'s more expensive, and this really is
the only place we care about those.


So I would propose for now we keep these explicit barriers; both here
and the other place you mentioned, but keep this in mind.

Also, I don't feel we need more complexity in this patch set just now.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 8/8] locking/mutex: Enable optimistic spinning of woken waiter
  2016-10-13 15:28   ` Will Deacon
@ 2016-10-17  9:32     ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-17  9:32 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Daniel Vetter

On Thu, Oct 13, 2016 at 04:28:01PM +0100, Will Deacon wrote:
> On Fri, Oct 07, 2016 at 04:52:51PM +0200, Peter Zijlstra wrote:

> > @@ -457,15 +472,20 @@ static bool mutex_optimistic_spin(struct
> >  		 * release the lock or go to sleep.
> >  		 */
> >  		owner = __mutex_owner(lock);
> > -		if (owner && !mutex_spin_on_owner(lock, owner))
> > -			break;
> > +		if (owner) {
> > +			if (waiter && owner == task) {
> > +				smp_mb(); /* ACQUIRE */
> 
> Hmm, is this barrier actually needed? This only happens on the handoff path,
> and we take the wait_lock immediately after this succeeds anyway. That
> control dependency, coupled with the acquire semantics of the spin_lock,
> should be sufficient, no?

Yes, I think you're right. But like said in that earlier email, I'd like
to keep this for now.

Once this code has settled we can reconsider this.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
  2016-10-13 15:17   ` Will Deacon
@ 2016-10-17 10:44     ` Peter Zijlstra
  2016-10-17 13:24       ` Peter Zijlstra
  2016-10-19 17:34     ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-17 10:44 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Daniel Vetter

On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> Hi Peter,
> 
> I'm struggling to get my head around the handoff code after this change...
> 
> On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> > --- a/kernel/locking/mutex.c
> > +++ b/kernel/locking/mutex.c
> > @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> >  
> >  	lock_contended(&lock->dep_map, ip);
> >  
> > +	set_task_state(task, state);
> >  	for (;;) {
> > +		/*
> > +		 * Once we hold wait_lock, we're serialized against
> > +		 * mutex_unlock() handing the lock off to us, do a trylock
> > +		 * before testing the error conditions to make sure we pick up
> > +		 * the handoff.
> > +		 */
> >  		if (__mutex_trylock(lock, first))
> > -			break;
> > +			goto acquired;
> >  
> >  		/*
> > -		 * got a signal? (This code gets eliminated in the
> > -		 * TASK_UNINTERRUPTIBLE case.)
> > +		 * Check for signals and wound conditions while holding
> > +		 * wait_lock. This ensures the lock cancellation is ordered
> > +		 * against mutex_unlock() and wake-ups do not go missing.
> >  		 */
> >  		if (unlikely(signal_pending_state(state, task))) {
> >  			ret = -EINTR;
> > @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> >  				goto err;
> >  		}
> >  
> > -		__set_task_state(task, state);
> >  		spin_unlock_mutex(&lock->wait_lock, flags);
> >  		schedule_preempt_disabled();
> > -		spin_lock_mutex(&lock->wait_lock, flags);
> >  
> >  		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> >  			first = true;
> >  			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> >  		}
> > +
> > +		set_task_state(task, state);
> 
> With this change, we no longer hold the lock wit_hen we set the task
> state, and it's ordered strictly *after* setting the HANDOFF flag.
> Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> the wakeup, but then we come in and overwrite the task state?
> 
> I'm struggling to work out whether that's an issue, but it certainly
> feels odd and is a change from the previous behaviour.

Right, so I think the code is fine, since in that case the
__mutex_trylock() must see the handoff and we'll break the loop and
(re)set the state to RUNNING.

But you're right in that its slightly odd. I'll reorder them and put the
set_task_state() above the !first thing.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
  2016-10-17 10:44     ` Peter Zijlstra
@ 2016-10-17 13:24       ` Peter Zijlstra
  2016-10-17 13:45         ` Boqun Feng
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-17 13:24 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Daniel Vetter

On Mon, Oct 17, 2016 at 12:44:49PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> > Hi Peter,
> > 
> > I'm struggling to get my head around the handoff code after this change...
> > 
> > On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> > > --- a/kernel/locking/mutex.c
> > > +++ b/kernel/locking/mutex.c
> > > @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> > >  
> > >  	lock_contended(&lock->dep_map, ip);
> > >  
> > > +	set_task_state(task, state);
> > >  	for (;;) {
> > > +		/*
> > > +		 * Once we hold wait_lock, we're serialized against
> > > +		 * mutex_unlock() handing the lock off to us, do a trylock
> > > +		 * before testing the error conditions to make sure we pick up
> > > +		 * the handoff.
> > > +		 */
> > >  		if (__mutex_trylock(lock, first))
> > > -			break;
> > > +			goto acquired;
> > >  
> > >  		/*
> > > -		 * got a signal? (This code gets eliminated in the
> > > -		 * TASK_UNINTERRUPTIBLE case.)
> > > +		 * Check for signals and wound conditions while holding
> > > +		 * wait_lock. This ensures the lock cancellation is ordered
> > > +		 * against mutex_unlock() and wake-ups do not go missing.
> > >  		 */
> > >  		if (unlikely(signal_pending_state(state, task))) {
> > >  			ret = -EINTR;
> > > @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> > >  				goto err;
> > >  		}
> > >  
> > > -		__set_task_state(task, state);
> > >  		spin_unlock_mutex(&lock->wait_lock, flags);
> > >  		schedule_preempt_disabled();
> > > -		spin_lock_mutex(&lock->wait_lock, flags);
> > >  
> > >  		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> > >  			first = true;
> > >  			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> > >  		}
> > > +
> > > +		set_task_state(task, state);
> > 
> > With this change, we no longer hold the lock wit_hen we set the task
> > state, and it's ordered strictly *after* setting the HANDOFF flag.
> > Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> > the wakeup, but then we come in and overwrite the task state?
> > 
> > I'm struggling to work out whether that's an issue, but it certainly
> > feels odd and is a change from the previous behaviour.
> 
> Right, so I think the code is fine, since in that case the
> __mutex_trylock() must see the handoff and we'll break the loop and
> (re)set the state to RUNNING.
> 
> But you're right in that its slightly odd. I'll reorder them and put the
> set_task_state() above the !first thing.


Humm,.. we might actually rely on this order, since the MB implied by
set_task_state() is the only thing that separates the store of
__mutex_set_flag() from the load of __mutex_trylock(), and those should
be ordered I think.

Argh, completely messed up my brain. I'll not touch it and think on this
again tomorrow.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
  2016-10-17 13:24       ` Peter Zijlstra
@ 2016-10-17 13:45         ` Boqun Feng
  2016-10-17 15:49           ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Boqun Feng @ 2016-10-17 13:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Linus Torvalds, Waiman Long, Jason Low,
	Ding Tianhong, Thomas Gleixner, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Daniel Vetter

[-- Attachment #1: Type: text/plain, Size: 3271 bytes --]

On Mon, Oct 17, 2016 at 03:24:08PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 17, 2016 at 12:44:49PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> > > Hi Peter,
> > > 
> > > I'm struggling to get my head around the handoff code after this change...
> > > 
> > > On Fri, Oct 07, 2016 at 04:52:49PM +0200, Peter Zijlstra wrote:
> > > > --- a/kernel/locking/mutex.c
> > > > +++ b/kernel/locking/mutex.c
> > > > @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> > > >  
> > > >  	lock_contended(&lock->dep_map, ip);
> > > >  
> > > > +	set_task_state(task, state);
> > > >  	for (;;) {
> > > > +		/*
> > > > +		 * Once we hold wait_lock, we're serialized against
> > > > +		 * mutex_unlock() handing the lock off to us, do a trylock
> > > > +		 * before testing the error conditions to make sure we pick up
> > > > +		 * the handoff.
> > > > +		 */
> > > >  		if (__mutex_trylock(lock, first))
> > > > -			break;
> > > > +			goto acquired;
> > > >  
> > > >  		/*
> > > > -		 * got a signal? (This code gets eliminated in the
> > > > -		 * TASK_UNINTERRUPTIBLE case.)
> > > > +		 * Check for signals and wound conditions while holding
> > > > +		 * wait_lock. This ensures the lock cancellation is ordered
> > > > +		 * against mutex_unlock() and wake-ups do not go missing.
> > > >  		 */
> > > >  		if (unlikely(signal_pending_state(state, task))) {
> > > >  			ret = -EINTR;
> > > > @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> > > >  				goto err;
> > > >  		}
> > > >  
> > > > -		__set_task_state(task, state);
> > > >  		spin_unlock_mutex(&lock->wait_lock, flags);
> > > >  		schedule_preempt_disabled();
> > > > -		spin_lock_mutex(&lock->wait_lock, flags);
> > > >  
> > > >  		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> > > >  			first = true;
> > > >  			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> > > >  		}
> > > > +
> > > > +		set_task_state(task, state);
> > > 
> > > With this change, we no longer hold the lock wit_hen we set the task
> > > state, and it's ordered strictly *after* setting the HANDOFF flag.
> > > Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> > > the wakeup, but then we come in and overwrite the task state?
> > > 
> > > I'm struggling to work out whether that's an issue, but it certainly
> > > feels odd and is a change from the previous behaviour.
> > 
> > Right, so I think the code is fine, since in that case the
> > __mutex_trylock() must see the handoff and we'll break the loop and
> > (re)set the state to RUNNING.
> > 
> > But you're right in that its slightly odd. I'll reorder them and put the
> > set_task_state() above the !first thing.
> 
> 
> Humm,.. we might actually rely on this order, since the MB implied by
> set_task_state() is the only thing that separates the store of
> __mutex_set_flag() from the load of __mutex_trylock(), and those should
> be ordered I think.
> 

But __mutex_set_flag() and __mutex_trylock() actually touch the same
atomic word? So we don't need extra things to order them?

Regards,
Boqun

> Argh, completely messed up my brain. I'll not touch it and think on this
> again tomorrow.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
  2016-10-17 13:45         ` Boqun Feng
@ 2016-10-17 15:49           ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-17 15:49 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Will Deacon, Linus Torvalds, Waiman Long, Jason Low,
	Ding Tianhong, Thomas Gleixner, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Daniel Vetter

On Mon, Oct 17, 2016 at 09:45:01PM +0800, Boqun Feng wrote:
> But __mutex_set_flag() and __mutex_trylock() actually touch the same
> atomic word? So we don't need extra things to order them?

Right.. in any case brain is confused. I'll look again at it later.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation
  2016-10-07 14:52 ` [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation Peter Zijlstra
  2016-10-13 15:14   ` Will Deacon
@ 2016-10-17 18:45   ` Waiman Long
  2016-10-17 19:07     ` Waiman Long
  2016-10-18 12:36     ` Peter Zijlstra
  2016-12-27 13:55   ` Chris Wilson
  2017-01-09 11:52   ` [PATCH] locking/mutex: Clear mutex-handoff flag on interrupt Chris Wilson
  3 siblings, 2 replies; 59+ messages in thread
From: Waiman Long @ 2016-10-17 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Jason Low, Ding Tianhong, Thomas Gleixner,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Jason Low, Chris Wilson, Daniel Vetter

On 10/07/2016 10:52 AM, Peter Zijlstra wrote:
>   /*
>    * Actual trylock that will work on any unlocked state.
> + *
> + * When setting the owner field, we must preserve the low flag bits.
> + *
> + * Be careful with @handoff, only set that in a wait-loop (where you set
> + * HANDOFF) to avoid recursive lock attempts.
>    */
> -static inline bool __mutex_trylock(struct mutex *lock)
> +static inline bool __mutex_trylock(struct mutex *lock, const bool handoff)
>   {
>   	unsigned long owner, curr = (unsigned long)current;
>
>   	owner = atomic_long_read(&lock->owner);
>   	for (;;) { /* must loop, can race against a flag */
> -		unsigned long old;
> +		unsigned long old, flags = __owner_flags(owner);
> +
> +		if (__owner_task(owner)) {
> +			if (handoff&&  unlikely(__owner_task(owner) == current)) {
> +				/*
> +				 * Provide ACQUIRE semantics for the lock-handoff.
> +				 *
> +				 * We cannot easily use load-acquire here, since
> +				 * the actual load is a failed cmpxchg, which
> +				 * doesn't imply any barriers.
> +				 *
> +				 * Also, this is a fairly unlikely scenario, and
> +				 * this contains the cost.
> +				 */

I am not so sure about your comment here. I guess you are referring to 
the atomic_long_cmpxchg_acquire below for the failed cmpxchg. However, 
it is also possible that the path can be triggered on the first round 
without cmpxchg. Maybe we can do a load_acquire on the owner again to 
satisfy this requirement without a smp_mb().

> +				smp_mb(); /* ACQUIRE */
> +				return true;
> +			}
>
> -		if (__owner_task(owner))
>   			return false;
> +		}
>
> -		old = atomic_long_cmpxchg_acquire(&lock->owner, owner,
> -						  curr | __owner_flags(owner));
> +		/*
> +		 * We set the HANDOFF bit, we must make sure it doesn't live
> +		 * past the point where we acquire it. This would be possible
> +		 * if we (accidentally) set the bit on an unlocked mutex.
> +		 */
> +		if (handoff)
> +			flags&= ~MUTEX_FLAG_HANDOFF;
> +
> +		old = atomic_long_cmpxchg_acquire(&lock->owner, owner, curr | flags);
>   		if (old == owner)
>   			return true;
>
>

Other than that, the code is fine.

Reviewed-by: Waiman Long <Waiman.Long@hpe.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation
  2016-10-17 18:45   ` Waiman Long
@ 2016-10-17 19:07     ` Waiman Long
  2016-10-18 13:02       ` Peter Zijlstra
  2016-10-18 12:36     ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Waiman Long @ 2016-10-17 19:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Jason Low, Ding Tianhong, Thomas Gleixner,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Jason Low, Chris Wilson, Daniel Vetter

On 10/17/2016 02:45 PM, Waiman Long wrote:
> On 10/07/2016 10:52 AM, Peter Zijlstra wrote:
>>   /*
>>    * Actual trylock that will work on any unlocked state.
>> + *
>> + * When setting the owner field, we must preserve the low flag bits.
>> + *
>> + * Be careful with @handoff, only set that in a wait-loop (where you 
>> set
>> + * HANDOFF) to avoid recursive lock attempts.
>>    */
>> -static inline bool __mutex_trylock(struct mutex *lock)
>> +static inline bool __mutex_trylock(struct mutex *lock, const bool 
>> handoff)
>>   {
>>       unsigned long owner, curr = (unsigned long)current;
>>
>>       owner = atomic_long_read(&lock->owner);
>>       for (;;) { /* must loop, can race against a flag */
>> -        unsigned long old;
>> +        unsigned long old, flags = __owner_flags(owner);
>> +
>> +        if (__owner_task(owner)) {
>> +            if (handoff&&  unlikely(__owner_task(owner) == current)) {
>> +                /*
>> +                 * Provide ACQUIRE semantics for the lock-handoff.
>> +                 *
>> +                 * We cannot easily use load-acquire here, since
>> +                 * the actual load is a failed cmpxchg, which
>> +                 * doesn't imply any barriers.
>> +                 *
>> +                 * Also, this is a fairly unlikely scenario, and
>> +                 * this contains the cost.
>> +                 */
>
> I am not so sure about your comment here. I guess you are referring to 
> the atomic_long_cmpxchg_acquire below for the failed cmpxchg. However, 
> it is also possible that the path can be triggered on the first round 
> without cmpxchg. Maybe we can do a load_acquire on the owner again to 
> satisfy this requirement without a smp_mb().
>
>> +                smp_mb(); /* ACQUIRE */
>> +                return true;
>> +            }
>>
>> -        if (__owner_task(owner))
>>               return false;
>> +        }
>>
>> -        old = atomic_long_cmpxchg_acquire(&lock->owner, owner,
>> -                          curr | __owner_flags(owner));
>> +        /*
>> +         * We set the HANDOFF bit, we must make sure it doesn't live
>> +         * past the point where we acquire it. This would be possible
>> +         * if we (accidentally) set the bit on an unlocked mutex.
>> +         */
>> +        if (handoff)
>> +            flags&= ~MUTEX_FLAG_HANDOFF;
>> +
>> +        old = atomic_long_cmpxchg_acquire(&lock->owner, owner, curr 
>> | flags);
>>           if (old == owner)
>>               return true;
>>
>>
>
> Other than that, the code is fine.
>
> Reviewed-by: Waiman Long <Waiman.Long@hpe.com>
>

One more thing, I think it may be worthwhile to add another comment 
about what happens when the HANDOFF bit was set while we take the error 
path (goto err). As the actual handoff is serialized by the wait_lock, 
the code will still do the right thing. Either the next one in the queue 
will be handed off or it will be unlocked if the queue is empty.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
  2016-10-07 14:52 ` [PATCH -v4 6/8] locking/mutex: Restructure wait loop Peter Zijlstra
  2016-10-13 15:17   ` Will Deacon
@ 2016-10-17 23:16   ` Waiman Long
  2016-10-18 13:14     ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Waiman Long @ 2016-10-17 23:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Jason Low, Ding Tianhong, Thomas Gleixner,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Jason Low, Chris Wilson, Daniel Vetter

On 10/07/2016 10:52 AM, Peter Zijlstra wrote:
> Doesn't really matter yet, but pull the HANDOFF and trylock out from
> under the wait_lock.
>
> The intention is to add an optimistic spin loop here, which requires
> we do not hold the wait_lock, so shuffle code around in preparation.
>
> Also clarify the purpose of taking the wait_lock in the wait loop, its
> tempting to want to avoid it altogether, but the cancellation cases
> need to to avoid losing wakeups.
>
> Suggested-by: Waiman Long<waiman.long@hpe.com>
> Signed-off-by: Peter Zijlstra (Intel)<peterz@infradead.org>
> ---
>   kernel/locking/mutex.c |   30 +++++++++++++++++++++++++-----
>   1 file changed, 25 insertions(+), 5 deletions(-)

> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
>
>   	lock_contended(&lock->dep_map, ip);
>
> +	set_task_state(task, state);

Do we want to set the state here? I am not sure if it is OK to set the 
task state without ever calling schedule().

>   	for (;;) {
> +		/*
> +		 * Once we hold wait_lock, we're serialized against
> +		 * mutex_unlock() handing the lock off to us, do a trylock
> +		 * before testing the error conditions to make sure we pick up
> +		 * the handoff.
> +		 */
>   		if (__mutex_trylock(lock, first))
> -			break;
> +			goto acquired;
>
>   		/*
> -		 * got a signal? (This code gets eliminated in the
> -		 * TASK_UNINTERRUPTIBLE case.)
> +		 * Check for signals and wound conditions while holding
> +		 * wait_lock. This ensures the lock cancellation is ordered
> +		 * against mutex_unlock() and wake-ups do not go missing.
>   		 */
>   		if (unlikely(signal_pending_state(state, task))) {
>   			ret = -EINTR;
> @@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
>   				goto err;
>   		}
>
> -		__set_task_state(task, state);
>   		spin_unlock_mutex(&lock->wait_lock, flags);
>   		schedule_preempt_disabled();
> -		spin_lock_mutex(&lock->wait_lock, flags);
>
>   		if (!first&&  __mutex_waiter_is_first(lock,&waiter)) {
>   			first = true;
>   			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
>   		}
> +
> +		set_task_state(task, state);

I would suggest keep the __set_task_state() above and change 
set_task_state(task, state) to set_task_state(task, TASK_RUNNING) to 
provide the memory barrier. Then we don't need adding __set_task_state() 
calls below.

> +		/*
> +		 * Here we order against unlock; we must either see it change
> +		 * state back to RUNNING and fall through the next schedule(),
> +		 * or we must see its unlock and acquire.
> +		 */
> +		if (__mutex_trylock(lock, first))
> +			break;
> +

I don't think we need a trylock here since we are going to do it at the 
top of the loop within wait_lock anyway.

> +		spin_lock_mutex(&lock->wait_lock, flags);
>   	}
> +	spin_lock_mutex(&lock->wait_lock, flags);
> +acquired:
>   	__set_task_state(task, TASK_RUNNING);
>
>   	mutex_remove_waiter(lock,&waiter, task);
> @@ -682,6 +701,7 @@ __mutex_lock_common(struct mutex *lock,
>   	return 0;
>
>   err:
> +	__set_task_state(task, TASK_RUNNING);
>   	mutex_remove_waiter(lock,&waiter, task);
>   	spin_unlock_mutex(&lock->wait_lock, flags);
>   	debug_mutex_free_waiter(&waiter);
>
>

Cheers,
Longman

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 8/8] locking/mutex: Enable optimistic spinning of woken waiter
  2016-10-07 14:52 ` [PATCH -v4 8/8] locking/mutex: Enable optimistic spinning of woken waiter Peter Zijlstra
  2016-10-13 15:28   ` Will Deacon
@ 2016-10-17 23:21   ` Waiman Long
  2016-10-18 12:19     ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Waiman Long @ 2016-10-17 23:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Jason Low, Ding Tianhong, Thomas Gleixner,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Jason Low, Chris Wilson, Daniel Vetter

On 10/07/2016 10:52 AM, Peter Zijlstra wrote:
> @@ -600,7 +630,7 @@ __mutex_lock_common(struct mutex *lock,
>   	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
>
>   	if (__mutex_trylock(lock, false) ||
> -	    mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
> +	    mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx, false)) {
>   		/* got the lock, yay! */
>   		lock_acquired(&lock->dep_map, ip);
>   		if (use_ww_ctx)
> @@ -669,7 +699,8 @@ __mutex_lock_common(struct mutex *lock,
>   		 * state back to RUNNING and fall through the next schedule(),
>   		 * or we must see its unlock and acquire.
>   		 */
> -		if (__mutex_trylock(lock, first))
> +		if ((first&&  mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx, true)) ||
> +		     __mutex_trylock(lock, first))

Do we need a __mutex_trylock() here? mutex_optimistic_spin() will do the 
trylock and we have one at the top of the loop.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 8/8] locking/mutex: Enable optimistic spinning of woken waiter
  2016-10-17 23:21   ` Waiman Long
@ 2016-10-18 12:19     ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-18 12:19 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Jason Low, Ding Tianhong, Thomas Gleixner,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Jason Low, Chris Wilson, Daniel Vetter

On Mon, Oct 17, 2016 at 07:21:28PM -0400, Waiman Long wrote:
> On 10/07/2016 10:52 AM, Peter Zijlstra wrote:
> >@@ -600,7 +630,7 @@ __mutex_lock_common(struct mutex *lock,
> >  	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
> >
> >  	if (__mutex_trylock(lock, false) ||
> >-	    mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
> >+	    mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx, false)) {
> >  		/* got the lock, yay! */
> >  		lock_acquired(&lock->dep_map, ip);
> >  		if (use_ww_ctx)
> >@@ -669,7 +699,8 @@ __mutex_lock_common(struct mutex *lock,
> >  		 * state back to RUNNING and fall through the next schedule(),
> >  		 * or we must see its unlock and acquire.
> >  		 */
> >-		if (__mutex_trylock(lock, first))
> >+		if ((first&&  mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx, true)) ||
> >+		     __mutex_trylock(lock, first))
> 
> Do we need a __mutex_trylock() here? mutex_optimistic_spin() will do the
> trylock and we have one at the top of the loop.

Yes, mutex_optimistic_spin() can be a no-op.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation
  2016-10-17 18:45   ` Waiman Long
  2016-10-17 19:07     ` Waiman Long
@ 2016-10-18 12:36     ` Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-18 12:36 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Jason Low, Ding Tianhong, Thomas Gleixner,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Jason Low, Chris Wilson, Daniel Vetter

On Mon, Oct 17, 2016 at 02:45:50PM -0400, Waiman Long wrote:
> On 10/07/2016 10:52 AM, Peter Zijlstra wrote:
> >  /*
> >   * Actual trylock that will work on any unlocked state.
> >+ *
> >+ * When setting the owner field, we must preserve the low flag bits.
> >+ *
> >+ * Be careful with @handoff, only set that in a wait-loop (where you set
> >+ * HANDOFF) to avoid recursive lock attempts.
> >   */
> >-static inline bool __mutex_trylock(struct mutex *lock)
> >+static inline bool __mutex_trylock(struct mutex *lock, const bool handoff)
> >  {
> >  	unsigned long owner, curr = (unsigned long)current;
> >
> >  	owner = atomic_long_read(&lock->owner);
> >  	for (;;) { /* must loop, can race against a flag */
> >-		unsigned long old;
> >+		unsigned long old, flags = __owner_flags(owner);
> >+
> >+		if (__owner_task(owner)) {
> >+			if (handoff&&  unlikely(__owner_task(owner) == current)) {
> >+				/*
> >+				 * Provide ACQUIRE semantics for the lock-handoff.
> >+				 *
> >+				 * We cannot easily use load-acquire here, since
> >+				 * the actual load is a failed cmpxchg, which
> >+				 * doesn't imply any barriers.
> >+				 *
> >+				 * Also, this is a fairly unlikely scenario, and
> >+				 * this contains the cost.
> >+				 */
> 
> I am not so sure about your comment here. I guess you are referring to the
> atomic_long_cmpxchg_acquire below for the failed cmpxchg. However, it is
> also possible that the path can be triggered on the first round without
> cmpxchg. Maybe we can do a load_acquire on the owner again to satisfy this
> requirement without a smp_mb().

Yes, I refer to the atomic_long_cmpxchg_acquire() below. If that cmpxchg
fails, no barriers are implied.

Yes we could fix that, but that would make all cmpxchg_acquire() loops
more expensive. And only fixing the initial load doesn't help, since we
still need to deal with the cmpxchg case failing.

We _could_ re-issue the load I suppose, because since if owner==current,
nobody else is going to change it, but I feel slightly uneasy with that.
Also, its a relative slow path, so for now, lets keep it simple and
explicit like so.

> 
> >+				smp_mb(); /* ACQUIRE */
> >+				return true;
> >+			}
> >
> >-		if (__owner_task(owner))
> >  			return false;
> >+		}
> >
> >-		old = atomic_long_cmpxchg_acquire(&lock->owner, owner,
> >-						  curr | __owner_flags(owner));
> >+		/*
> >+		 * We set the HANDOFF bit, we must make sure it doesn't live
> >+		 * past the point where we acquire it. This would be possible
> >+		 * if we (accidentally) set the bit on an unlocked mutex.
> >+		 */
> >+		if (handoff)
> >+			flags&= ~MUTEX_FLAG_HANDOFF;
> >+
> >+		old = atomic_long_cmpxchg_acquire(&lock->owner, owner, curr | flags);
> >  		if (old == owner)
> >  			return true;
> >
> >
> 
> Other than that, the code is fine.
> 
> Reviewed-by: Waiman Long <Waiman.Long@hpe.com>

Thanks!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-10-07 14:52 ` [PATCH -v4 1/8] locking/drm: Kill mutex trickery Peter Zijlstra
  2016-10-07 15:43   ` Peter Zijlstra
@ 2016-10-18 12:48   ` Peter Zijlstra
  2016-10-18 12:57     ` Peter Zijlstra
  2016-10-18 12:57     ` Chris Wilson
  1 sibling, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-18 12:48 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low
  Cc: Chris Wilson, Daniel Vetter, Rob Clark

On Fri, Oct 07, 2016 at 04:52:44PM +0200, Peter Zijlstra wrote:
> Poking at lock internals is not cool. Since I'm going to change the
> implementation this will break, take it out.
> 
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Rob Clark <robdclark@gmail.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  drivers/gpu/drm/i915/i915_gem_shrinker.c |   26 +++-----------------------
>  drivers/gpu/drm/msm/msm_gem_shrinker.c   |   23 +++--------------------
>  2 files changed, 6 insertions(+), 43 deletions(-)

OK, so it appears that i915 changed their locking around and got rid of
this thing entirely. Much appreciated Chris!!

Rob, is there any chance you can do the same for msm?

Not having to provide a replacement function and opening the door to
recursive locking would be ever so good.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-10-18 12:48   ` Peter Zijlstra
@ 2016-10-18 12:57     ` Peter Zijlstra
  2016-11-11 11:22       ` Daniel Vetter
  2016-10-18 12:57     ` Chris Wilson
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-18 12:57 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low
  Cc: Chris Wilson, Daniel Vetter, Rob Clark

On Tue, Oct 18, 2016 at 02:48:41PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 07, 2016 at 04:52:44PM +0200, Peter Zijlstra wrote:
> > Poking at lock internals is not cool. Since I'm going to change the
> > implementation this will break, take it out.
> > 
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Rob Clark <robdclark@gmail.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  drivers/gpu/drm/i915/i915_gem_shrinker.c |   26 +++-----------------------
> >  drivers/gpu/drm/msm/msm_gem_shrinker.c   |   23 +++--------------------
> >  2 files changed, 6 insertions(+), 43 deletions(-)
> 
> OK, so it appears that i915 changed their locking around and got rid of
> this thing entirely. Much appreciated Chris!!

Hmm, I might have spoken too soon. My patch conflicted and I seem to
have read too much in the Changelog of 3b4e896f14b1 ("drm/i915: Remove
unused no-shrinker-steal").

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-10-18 12:48   ` Peter Zijlstra
  2016-10-18 12:57     ` Peter Zijlstra
@ 2016-10-18 12:57     ` Chris Wilson
  1 sibling, 0 replies; 59+ messages in thread
From: Chris Wilson @ 2016-10-18 12:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Daniel Vetter, Rob Clark

On Tue, Oct 18, 2016 at 02:48:41PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 07, 2016 at 04:52:44PM +0200, Peter Zijlstra wrote:
> > Poking at lock internals is not cool. Since I'm going to change the
> > implementation this will break, take it out.
> > 
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Rob Clark <robdclark@gmail.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  drivers/gpu/drm/i915/i915_gem_shrinker.c |   26 +++-----------------------
> >  drivers/gpu/drm/msm/msm_gem_shrinker.c   |   23 +++--------------------
> >  2 files changed, 6 insertions(+), 43 deletions(-)
> 
> OK, so it appears that i915 changed their locking around and got rid of
> this thing entirely. Much appreciated Chris!!

It's not dead yet! Sorry.

It's close though, in the next cycle we may be at a point where we don't
rely on recursion locking in the shrinker.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation
  2016-10-17 19:07     ` Waiman Long
@ 2016-10-18 13:02       ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-18 13:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Jason Low, Ding Tianhong, Thomas Gleixner,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Jason Low, Chris Wilson, Daniel Vetter

On Mon, Oct 17, 2016 at 03:07:54PM -0400, Waiman Long wrote:
> One more thing, I think it may be worthwhile to add another comment about
> what happens when the HANDOFF bit was set while we take the error path (goto
> err). As the actual handoff is serialized by the wait_lock, the code will
> still do the right thing. Either the next one in the queue will be handed
> off or it will be unlocked if the queue is empty.

Doesn't the next patch add just such a comment?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
  2016-10-17 23:16   ` [PATCH -v4 6/8] locking/mutex: Restructure wait loop Waiman Long
@ 2016-10-18 13:14     ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-18 13:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Jason Low, Ding Tianhong, Thomas Gleixner,
	Will Deacon, Ingo Molnar, Imre Deak, Linux Kernel Mailing List,
	Davidlohr Bueso, Tim Chen, Terry Rudd, Paul E. McKenney,
	Jason Low, Chris Wilson, Daniel Vetter

On Mon, Oct 17, 2016 at 07:16:50PM -0400, Waiman Long wrote:
> >+++ b/kernel/locking/mutex.c
> >@@ -631,13 +631,21 @@ __mutex_lock_common(struct mutex *lock,
> >
> >  	lock_contended(&lock->dep_map, ip);
> >
> >+	set_task_state(task, state);
> 
> Do we want to set the state here? I am not sure if it is OK to set the task
> state without ever calling schedule().

That's entirely fine, note how we'll set it back to RUNNING at the end.

> >  	for (;;) {
> >+		/*
> >+		 * Once we hold wait_lock, we're serialized against
> >+		 * mutex_unlock() handing the lock off to us, do a trylock
> >+		 * before testing the error conditions to make sure we pick up
> >+		 * the handoff.
> >+		 */
> >  		if (__mutex_trylock(lock, first))
> >-			break;
> >+			goto acquired;
> >
> >  		/*
> >-		 * got a signal? (This code gets eliminated in the
> >-		 * TASK_UNINTERRUPTIBLE case.)
> >+		 * Check for signals and wound conditions while holding
> >+		 * wait_lock. This ensures the lock cancellation is ordered
> >+		 * against mutex_unlock() and wake-ups do not go missing.
> >  		 */
> >  		if (unlikely(signal_pending_state(state, task))) {
> >  			ret = -EINTR;
> >@@ -650,16 +658,27 @@ __mutex_lock_common(struct mutex *lock,
> >  				goto err;
> >  		}
> >
> >-		__set_task_state(task, state);
> >  		spin_unlock_mutex(&lock->wait_lock, flags);
> >  		schedule_preempt_disabled();
> >-		spin_lock_mutex(&lock->wait_lock, flags);
> >
> >  		if (!first&&  __mutex_waiter_is_first(lock,&waiter)) {
> >  			first = true;
> >  			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> >  		}
> >+
> >+		set_task_state(task, state);
> 
> I would suggest keep the __set_task_state() above and change
> set_task_state(task, state) to set_task_state(task, TASK_RUNNING) to provide
> the memory barrier. Then we don't need adding __set_task_state() calls
> below.

set_task_state(RUNNING) doesn't make sense, ever.

See the comment near set_task_state() for the reason it has a barrier.

We need it here because when we do that trylock (or optimistic spin) we
need to have set the state and done a barrier, otherwise we can miss a
wakeup and get stuck.

> >+		/*
> >+		 * Here we order against unlock; we must either see it change
> >+		 * state back to RUNNING and fall through the next schedule(),
> >+		 * or we must see its unlock and acquire.
> >+		 */
> >+		if (__mutex_trylock(lock, first))
> >+			break;
> >+
> 
> I don't think we need a trylock here since we are going to do it at the top
> of the loop within wait_lock anyway.

The idea was to avoid the wait-time of that lock acquire, also, this is
a place-holder for the optimistic spin site for the next patch.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop
  2016-10-13 15:17   ` Will Deacon
  2016-10-17 10:44     ` Peter Zijlstra
@ 2016-10-19 17:34     ` Peter Zijlstra
  2016-10-24  1:57       ` ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop) Davidlohr Bueso
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2016-10-19 17:34 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Daniel Vetter

On Thu, Oct 13, 2016 at 04:17:21PM +0100, Will Deacon wrote:
> >  		if (!first && __mutex_waiter_is_first(lock, &waiter)) {
> >  			first = true;
> >  			__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
> >  		}
> > +
> > +		set_task_state(task, state);
> 
> With this change, we no longer hold the lock wit_hen we set the task
> state, and it's ordered strictly *after* setting the HANDOFF flag.
> Doesn't that mean that the unlock code can see the HANDOFF flag, issue
> the wakeup, but then we come in and overwrite the task state?
> 
> I'm struggling to work out whether that's an issue, but it certainly
> feels odd and is a change from the previous behaviour.

OK, so after a discussion on IRC the problem appears to have been
unfamiliarity with the basic sleep/wakeup scheme. Mutex used to be the
odd duck out for being fully serialized by wait_lock.

The below adds a few words on how the 'normal' sleep/wakeup scheme
works.


---
Subject: sched: Better explain sleep/wakeup
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Oct 19 15:45:27 CEST 2016

There were a few questions wrt how sleep-wakeup works. Try and explain
it more.

Requested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |   52 ++++++++++++++++++++++++++++++++++----------------
 kernel/sched/core.c   |   15 +++++++-------
 2 files changed, 44 insertions(+), 23 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
 #define set_task_state(tsk, state_value)			\
 	do {							\
 		(tsk)->task_state_change = _THIS_IP_;		\
-		smp_store_mb((tsk)->state, (state_value));		\
+		smp_store_mb((tsk)->state, (state_value));	\
 	} while (0)
 
-/*
- * set_current_state() includes a barrier so that the write of current->state
- * is correctly serialised wrt the caller's subsequent test of whether to
- * actually sleep:
- *
- *	set_current_state(TASK_UNINTERRUPTIBLE);
- *	if (do_i_need_to_sleep())
- *		schedule();
- *
- * If the caller does not need such serialisation then use __set_current_state()
- */
 #define __set_current_state(state_value)			\
 	do {							\
 		current->task_state_change = _THIS_IP_;		\
@@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
 #define set_current_state(state_value)				\
 	do {							\
 		current->task_state_change = _THIS_IP_;		\
-		smp_store_mb(current->state, (state_value));		\
+		smp_store_mb(current->state, (state_value));	\
 	} while (0)
 
 #else
 
+/*
+ * @tsk had better be current, or you get to keep the pieces.
+ *
+ * The only reason is that computing current can be more expensive than
+ * using a pointer that's already available.
+ *
+ * Therefore, see set_current_state().
+ */
 #define __set_task_state(tsk, state_value)		\
 	do { (tsk)->state = (state_value); } while (0)
 #define set_task_state(tsk, state_value)		\
@@ -299,11 +296,34 @@ extern char ___assert_task_state[1 - 2*!
  * is correctly serialised wrt the caller's subsequent test of whether to
  * actually sleep:
  *
+ *   for (;;) {
  *	set_current_state(TASK_UNINTERRUPTIBLE);
- *	if (do_i_need_to_sleep())
- *		schedule();
+ *	if (!need_sleep)
+ *		break;
+ *
+ *	schedule();
+ *   }
+ *   __set_current_state(TASK_RUNNING);
+ *
+ * If the caller does not need such serialisation (because, for instance, the
+ * condition test and condition change and wakeup are under the same lock) then
+ * use __set_current_state().
+ *
+ * The above is typically ordered against the wakeup, which does:
+ *
+ *	need_sleep = false;
+ *	wake_up_state(p, TASK_UNINTERRUPTIBLE);
+ *
+ * Where wake_up_state() (and all other wakeup primitives) imply enough
+ * barriers to order the store of the variable against wakeup.
+ *
+ * Wakeup will do: if (@state & p->state) p->state = TASK_RUNNING, that is,
+ * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
+ * TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
+ *
+ * This is obviously fine, since they both store the exact same value.
  *
- * If the caller does not need such serialisation then use __set_current_state()
+ * Also see the comments of try_to_wake_up().
  */
 #define __set_current_state(state_value)		\
 	do { current->state = (state_value); } while (0)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2000,14 +2000,15 @@ static void ttwu_queue(struct task_struc
  * @state: the mask of task states that can be woken
  * @wake_flags: wake modifier flags (WF_*)
  *
- * Put it on the run-queue if it's not already there. The "current"
- * thread is always on the run-queue (except when the actual
- * re-schedule is in progress), and as such you're allowed to do
- * the simpler "current->state = TASK_RUNNING" to mark yourself
- * runnable without the overhead of this.
+ * If (@state & @p->state) @p->state = TASK_RUNNING.
  *
- * Return: %true if @p was woken up, %false if it was already running.
- * or @state didn't match @p's state.
+ * If the task was not queued/runnable, also place it back on a runqueue.
+ *
+ * Atomic against schedule() which would dequeue a task, also see
+ * set_current_state().
+ *
+ * Return: %true if @p->state changes (an actual wakeup was done),
+ *	   %false otherwise.
  */
 static int
 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)
  2016-10-19 17:34     ` Peter Zijlstra
@ 2016-10-24  1:57       ` Davidlohr Bueso
  2016-10-24 13:26         ` Kent Overstreet
  2016-10-24 14:27         ` Kent Overstreet
  0 siblings, 2 replies; 59+ messages in thread
From: Davidlohr Bueso @ 2016-10-24  1:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Linus Torvalds, Waiman Long, Jason Low,
	Ding Tianhong, Thomas Gleixner, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Daniel Vetter,
	kent.overstreet, axboe, linux-bcache

On Wed, 19 Oct 2016, Peter Zijlstra wrote:

>Subject: sched: Better explain sleep/wakeup
>From: Peter Zijlstra <peterz@infradead.org>
>Date: Wed Oct 19 15:45:27 CEST 2016
>
>There were a few questions wrt how sleep-wakeup works. Try and explain
>it more.
>
>Requested-by: Will Deacon <will.deacon@arm.com>
>Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>---
> include/linux/sched.h |   52 ++++++++++++++++++++++++++++++++++----------------
> kernel/sched/core.c   |   15 +++++++-------
> 2 files changed, 44 insertions(+), 23 deletions(-)
>
>--- a/include/linux/sched.h
>+++ b/include/linux/sched.h
>@@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
> #define set_task_state(tsk, state_value)			\
> 	do {							\
> 		(tsk)->task_state_change = _THIS_IP_;		\
>-		smp_store_mb((tsk)->state, (state_value));		\
>+		smp_store_mb((tsk)->state, (state_value));	\
> 	} while (0)
>
>-/*
>- * set_current_state() includes a barrier so that the write of current->state
>- * is correctly serialised wrt the caller's subsequent test of whether to
>- * actually sleep:
>- *
>- *	set_current_state(TASK_UNINTERRUPTIBLE);
>- *	if (do_i_need_to_sleep())
>- *		schedule();
>- *
>- * If the caller does not need such serialisation then use __set_current_state()
>- */
> #define __set_current_state(state_value)			\
> 	do {							\
> 		current->task_state_change = _THIS_IP_;		\
>@@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
> #define set_current_state(state_value)				\
> 	do {							\
> 		current->task_state_change = _THIS_IP_;		\
>-		smp_store_mb(current->state, (state_value));		\
>+		smp_store_mb(current->state, (state_value));	\
> 	} while (0)
>
> #else
>
>+/*
>+ * @tsk had better be current, or you get to keep the pieces.

That reminds me we were getting rid of the set_task_state() calls. Bcache was
pending, being only user in the kernel that doesn't actually use current; but
instead breaks newly (yet blocked/uninterruptible) created garbage collection
kthread. I cannot figure out why this is done (ie purposely accounting the
load avg. Furthermore gc kicks in in very specific scenarios obviously, such
as as by the allocator task, so I don't see why bcache gc should want to be
interruptible.

Kent, Jens, can we get rid of this?

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 76f7534d1dd1..6e3c358b5759 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1798,7 +1798,6 @@ int bch_gc_thread_start(struct cache_set *c)
        if (IS_ERR(c->gc_thread))
                return PTR_ERR(c->gc_thread);
 
-       set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
        return 0;
 }

Thanks,
Davidlohr

>+ *
>+ * The only reason is that computing current can be more expensive than
>+ * using a pointer that's already available.
>+ *
>+ * Therefore, see set_current_state().
>+ */
> #define __set_task_state(tsk, state_value)		\
> 	do { (tsk)->state = (state_value); } while (0)
> #define set_task_state(tsk, state_value)		\
>@@ -299,11 +296,34 @@ extern char ___assert_task_state[1 - 2*!
>  * is correctly serialised wrt the caller's subsequent test of whether to
>  * actually sleep:
>  *
>+ *   for (;;) {
>  *	set_current_state(TASK_UNINTERRUPTIBLE);
>- *	if (do_i_need_to_sleep())
>- *		schedule();
>+ *	if (!need_sleep)
>+ *		break;
>+ *
>+ *	schedule();
>+ *   }
>+ *   __set_current_state(TASK_RUNNING);
>+ *
>+ * If the caller does not need such serialisation (because, for instance, the
>+ * condition test and condition change and wakeup are under the same lock) then
>+ * use __set_current_state().
>+ *
>+ * The above is typically ordered against the wakeup, which does:
>+ *
>+ *	need_sleep = false;
>+ *	wake_up_state(p, TASK_UNINTERRUPTIBLE);
>+ *
>+ * Where wake_up_state() (and all other wakeup primitives) imply enough
>+ * barriers to order the store of the variable against wakeup.
>+ *
>+ * Wakeup will do: if (@state & p->state) p->state = TASK_RUNNING, that is,
>+ * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
>+ * TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
>+ *
>+ * This is obviously fine, since they both store the exact same value.
>  *
>- * If the caller does not need such serialisation then use __set_current_state()
>+ * Also see the comments of try_to_wake_up().
>  */
> #define __set_current_state(state_value)		\
> 	do { current->state = (state_value); } while (0)
>--- a/kernel/sched/core.c
>+++ b/kernel/sched/core.c
>@@ -2000,14 +2000,15 @@ static void ttwu_queue(struct task_struc
>  * @state: the mask of task states that can be woken
>  * @wake_flags: wake modifier flags (WF_*)
>  *
>- * Put it on the run-queue if it's not already there. The "current"
>- * thread is always on the run-queue (except when the actual
>- * re-schedule is in progress), and as such you're allowed to do
>- * the simpler "current->state = TASK_RUNNING" to mark yourself
>- * runnable without the overhead of this.
>+ * If (@state & @p->state) @p->state = TASK_RUNNING.
>  *
>- * Return: %true if @p was woken up, %false if it was already running.
>- * or @state didn't match @p's state.
>+ * If the task was not queued/runnable, also place it back on a runqueue.
>+ *
>+ * Atomic against schedule() which would dequeue a task, also see
>+ * set_current_state().
>+ *
>+ * Return: %true if @p->state changes (an actual wakeup was done),
>+ *	   %false otherwise.
>  */
> static int
> try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)
  2016-10-24  1:57       ` ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop) Davidlohr Bueso
@ 2016-10-24 13:26         ` Kent Overstreet
  2016-10-24 14:27         ` Kent Overstreet
  1 sibling, 0 replies; 59+ messages in thread
From: Kent Overstreet @ 2016-10-24 13:26 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Will Deacon, Linus Torvalds, Waiman Long,
	Jason Low, Ding Tianhong, Thomas Gleixner, Ingo Molnar,
	Imre Deak, Linux Kernel Mailing List, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Daniel Vetter, axboe,
	linux-bcache

On Sun, Oct 23, 2016 at 06:57:26PM -0700, Davidlohr Bueso wrote:
> On Wed, 19 Oct 2016, Peter Zijlstra wrote:
> 
> > Subject: sched: Better explain sleep/wakeup
> > From: Peter Zijlstra <peterz@infradead.org>
> > Date: Wed Oct 19 15:45:27 CEST 2016
> > 
> > There were a few questions wrt how sleep-wakeup works. Try and explain
> > it more.
> > 
> > Requested-by: Will Deacon <will.deacon@arm.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> > include/linux/sched.h |   52 ++++++++++++++++++++++++++++++++++----------------
> > kernel/sched/core.c   |   15 +++++++-------
> > 2 files changed, 44 insertions(+), 23 deletions(-)
> > 
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
> > #define set_task_state(tsk, state_value)			\
> > 	do {							\
> > 		(tsk)->task_state_change = _THIS_IP_;		\
> > -		smp_store_mb((tsk)->state, (state_value));		\
> > +		smp_store_mb((tsk)->state, (state_value));	\
> > 	} while (0)
> > 
> > -/*
> > - * set_current_state() includes a barrier so that the write of current->state
> > - * is correctly serialised wrt the caller's subsequent test of whether to
> > - * actually sleep:
> > - *
> > - *	set_current_state(TASK_UNINTERRUPTIBLE);
> > - *	if (do_i_need_to_sleep())
> > - *		schedule();
> > - *
> > - * If the caller does not need such serialisation then use __set_current_state()
> > - */
> > #define __set_current_state(state_value)			\
> > 	do {							\
> > 		current->task_state_change = _THIS_IP_;		\
> > @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
> > #define set_current_state(state_value)				\
> > 	do {							\
> > 		current->task_state_change = _THIS_IP_;		\
> > -		smp_store_mb(current->state, (state_value));		\
> > +		smp_store_mb(current->state, (state_value));	\
> > 	} while (0)
> > 
> > #else
> > 
> > +/*
> > + * @tsk had better be current, or you get to keep the pieces.
> 
> That reminds me we were getting rid of the set_task_state() calls. Bcache was
> pending, being only user in the kernel that doesn't actually use current; but
> instead breaks newly (yet blocked/uninterruptible) created garbage collection
> kthread. I cannot figure out why this is done (ie purposely accounting the
> load avg. Furthermore gc kicks in in very specific scenarios obviously, such
> as as by the allocator task, so I don't see why bcache gc should want to be
> interruptible.
> 
> Kent, Jens, can we get rid of this?
> 
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 76f7534d1dd1..6e3c358b5759 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -1798,7 +1798,6 @@ int bch_gc_thread_start(struct cache_set *c)
>        if (IS_ERR(c->gc_thread))
>                return PTR_ERR(c->gc_thread);
> 
> -       set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
>        return 0;
> }

Actually, that code looks broken, or at least stupid. Let me do a proper fix...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)
  2016-10-24  1:57       ` ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop) Davidlohr Bueso
  2016-10-24 13:26         ` Kent Overstreet
@ 2016-10-24 14:27         ` Kent Overstreet
  2016-10-25 16:55           ` Eric Wheeler
  1 sibling, 1 reply; 59+ messages in thread
From: Kent Overstreet @ 2016-10-24 14:27 UTC (permalink / raw)
  To: Davidlohr Bueso, Eric Wheeler; +Cc: Linux Kernel Mailing List, linux-bcache

On Sun, Oct 23, 2016 at 06:57:26PM -0700, Davidlohr Bueso wrote:
> On Wed, 19 Oct 2016, Peter Zijlstra wrote:
> 
> > Subject: sched: Better explain sleep/wakeup
> > From: Peter Zijlstra <peterz@infradead.org>
> > Date: Wed Oct 19 15:45:27 CEST 2016
> > 
> > There were a few questions wrt how sleep-wakeup works. Try and explain
> > it more.
> > 
> > Requested-by: Will Deacon <will.deacon@arm.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> > include/linux/sched.h |   52 ++++++++++++++++++++++++++++++++++----------------
> > kernel/sched/core.c   |   15 +++++++-------
> > 2 files changed, 44 insertions(+), 23 deletions(-)
> > 
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
> > #define set_task_state(tsk, state_value)			\
> > 	do {							\
> > 		(tsk)->task_state_change = _THIS_IP_;		\
> > -		smp_store_mb((tsk)->state, (state_value));		\
> > +		smp_store_mb((tsk)->state, (state_value));	\
> > 	} while (0)
> > 
> > -/*
> > - * set_current_state() includes a barrier so that the write of current->state
> > - * is correctly serialised wrt the caller's subsequent test of whether to
> > - * actually sleep:
> > - *
> > - *	set_current_state(TASK_UNINTERRUPTIBLE);
> > - *	if (do_i_need_to_sleep())
> > - *		schedule();
> > - *
> > - * If the caller does not need such serialisation then use __set_current_state()
> > - */
> > #define __set_current_state(state_value)			\
> > 	do {							\
> > 		current->task_state_change = _THIS_IP_;		\
> > @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
> > #define set_current_state(state_value)				\
> > 	do {							\
> > 		current->task_state_change = _THIS_IP_;		\
> > -		smp_store_mb(current->state, (state_value));		\
> > +		smp_store_mb(current->state, (state_value));	\
> > 	} while (0)
> > 
> > #else
> > 
> > +/*
> > + * @tsk had better be current, or you get to keep the pieces.
> 
> That reminds me we were getting rid of the set_task_state() calls. Bcache was
> pending, being only user in the kernel that doesn't actually use current; but
> instead breaks newly (yet blocked/uninterruptible) created garbage collection
> kthread. I cannot figure out why this is done (ie purposely accounting the
> load avg. Furthermore gc kicks in in very specific scenarios obviously, such
> as as by the allocator task, so I don't see why bcache gc should want to be
> interruptible.
> 
> Kent, Jens, can we get rid of this?

Here's a patch that just fixes the way gc gets woken up. Eric, you want to take
this?

-- >8 --
Subject: [PATCH] bcache: Make gc wakeup saner

This lets us ditch a set_task_state() call.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 drivers/md/bcache/bcache.h  |  4 ++--
 drivers/md/bcache/btree.c   | 39 ++++++++++++++++++++-------------------
 drivers/md/bcache/btree.h   |  3 +--
 drivers/md/bcache/request.c |  4 +---
 drivers/md/bcache/super.c   |  2 ++
 5 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 6b420a55c7..c3ea03c9a1 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -425,7 +425,7 @@ struct cache {
 	 * until a gc finishes - otherwise we could pointlessly burn a ton of
 	 * cpu
 	 */
-	unsigned		invalidate_needs_gc:1;
+	unsigned		invalidate_needs_gc;
 
 	bool			discard; /* Get rid of? */
 
@@ -593,8 +593,8 @@ struct cache_set {
 
 	/* Counts how many sectors bio_insert has added to the cache */
 	atomic_t		sectors_to_gc;
+	wait_queue_head_t	gc_wait;
 
-	wait_queue_head_t	moving_gc_wait;
 	struct keybuf		moving_gc_keys;
 	/* Number of moving GC bios in flight */
 	struct semaphore	moving_in_flight;
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 81d3db40cd..2efdce0724 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1757,32 +1757,34 @@ static void bch_btree_gc(struct cache_set *c)
 	bch_moving_gc(c);
 }
 
-static int bch_gc_thread(void *arg)
+static bool gc_should_run(struct cache_set *c)
 {
-	struct cache_set *c = arg;
 	struct cache *ca;
 	unsigned i;
 
-	while (1) {
-again:
-		bch_btree_gc(c);
+	for_each_cache(ca, c, i)
+		if (ca->invalidate_needs_gc)
+			return true;
 
-		set_current_state(TASK_INTERRUPTIBLE);
-		if (kthread_should_stop())
-			break;
+	if (atomic_read(&c->sectors_to_gc) < 0)
+		return true;
 
-		mutex_lock(&c->bucket_lock);
+	return false;
+}
 
-		for_each_cache(ca, c, i)
-			if (ca->invalidate_needs_gc) {
-				mutex_unlock(&c->bucket_lock);
-				set_current_state(TASK_RUNNING);
-				goto again;
-			}
+static int bch_gc_thread(void *arg)
+{
+	struct cache_set *c = arg;
 
-		mutex_unlock(&c->bucket_lock);
+	while (1) {
+		wait_event_interruptible(c->gc_wait,
+			   kthread_should_stop() || gc_should_run(c));
 
-		schedule();
+		if (kthread_should_stop())
+			break;
+
+		set_gc_sectors(c);
+		bch_btree_gc(c);
 	}
 
 	return 0;
@@ -1790,11 +1792,10 @@ static int bch_gc_thread(void *arg)
 
 int bch_gc_thread_start(struct cache_set *c)
 {
-	c->gc_thread = kthread_create(bch_gc_thread, c, "bcache_gc");
+	c->gc_thread = kthread_run(bch_gc_thread, c, "bcache_gc");
 	if (IS_ERR(c->gc_thread))
 		return PTR_ERR(c->gc_thread);
 
-	set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
 	return 0;
 }
 
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
index 5c391fa01b..9b80417cd5 100644
--- a/drivers/md/bcache/btree.h
+++ b/drivers/md/bcache/btree.h
@@ -260,8 +260,7 @@ void bch_initial_mark_key(struct cache_set *, int, struct bkey *);
 
 static inline void wake_up_gc(struct cache_set *c)
 {
-	if (c->gc_thread)
-		wake_up_process(c->gc_thread);
+	wake_up(&c->gc_wait);
 }
 
 #define MAP_DONE	0
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 40ffe5e424..a37c1776f2 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -196,10 +196,8 @@ static void bch_data_insert_start(struct closure *cl)
 	struct data_insert_op *op = container_of(cl, struct data_insert_op, cl);
 	struct bio *bio = op->bio, *n;
 
-	if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0) {
-		set_gc_sectors(op->c);
+	if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0)
 		wake_up_gc(op->c);
-	}
 
 	if (op->bypass)
 		return bch_data_invalidate(cl);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 849ad441cd..66669c8f41 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1491,6 +1491,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 	mutex_init(&c->bucket_lock);
 	init_waitqueue_head(&c->btree_cache_wait);
 	init_waitqueue_head(&c->bucket_wait);
+	init_waitqueue_head(&c->gc_wait);
 	sema_init(&c->uuid_write_mutex, 1);
 
 	spin_lock_init(&c->btree_gc_time.lock);
@@ -1550,6 +1551,7 @@ static void run_cache_set(struct cache_set *c)
 
 	for_each_cache(ca, c, i)
 		c->nbuckets += ca->sb.nbuckets;
+	set_gc_sectors(c);
 
 	if (CACHE_SYNC(&c->sb)) {
 		LIST_HEAD(journal);
-- 
2.9.3

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)
  2016-10-24 14:27         ` Kent Overstreet
@ 2016-10-25 16:55           ` Eric Wheeler
  2016-10-25 17:45             ` Kent Overstreet
  0 siblings, 1 reply; 59+ messages in thread
From: Eric Wheeler @ 2016-10-25 16:55 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Davidlohr Bueso, Linux Kernel Mailing List, linux-bcache

On Mon, 24 Oct 2016, Kent Overstreet wrote:
> On Sun, Oct 23, 2016 at 06:57:26PM -0700, Davidlohr Bueso wrote:
> > On Wed, 19 Oct 2016, Peter Zijlstra wrote:
> > 
> > > Subject: sched: Better explain sleep/wakeup
> > > From: Peter Zijlstra <peterz@infradead.org>
> > > Date: Wed Oct 19 15:45:27 CEST 2016
> > > 
> > > There were a few questions wrt how sleep-wakeup works. Try and explain
> > > it more.
> > > 
> > > Requested-by: Will Deacon <will.deacon@arm.com>
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > > include/linux/sched.h |   52 ++++++++++++++++++++++++++++++++++----------------
> > > kernel/sched/core.c   |   15 +++++++-------
> > > 2 files changed, 44 insertions(+), 23 deletions(-)
> > > 
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
> > > #define set_task_state(tsk, state_value)			\
> > > 	do {							\
> > > 		(tsk)->task_state_change = _THIS_IP_;		\
> > > -		smp_store_mb((tsk)->state, (state_value));		\
> > > +		smp_store_mb((tsk)->state, (state_value));	\
> > > 	} while (0)
> > > 
> > > -/*
> > > - * set_current_state() includes a barrier so that the write of current->state
> > > - * is correctly serialised wrt the caller's subsequent test of whether to
> > > - * actually sleep:
> > > - *
> > > - *	set_current_state(TASK_UNINTERRUPTIBLE);
> > > - *	if (do_i_need_to_sleep())
> > > - *		schedule();
> > > - *
> > > - * If the caller does not need such serialisation then use __set_current_state()
> > > - */
> > > #define __set_current_state(state_value)			\
> > > 	do {							\
> > > 		current->task_state_change = _THIS_IP_;		\
> > > @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
> > > #define set_current_state(state_value)				\
> > > 	do {							\
> > > 		current->task_state_change = _THIS_IP_;		\
> > > -		smp_store_mb(current->state, (state_value));		\
> > > +		smp_store_mb(current->state, (state_value));	\
> > > 	} while (0)
> > > 
> > > #else
> > > 
> > > +/*
> > > + * @tsk had better be current, or you get to keep the pieces.
> > 
> > That reminds me we were getting rid of the set_task_state() calls. Bcache was
> > pending, being only user in the kernel that doesn't actually use current; but
> > instead breaks newly (yet blocked/uninterruptible) created garbage collection
> > kthread. I cannot figure out why this is done (ie purposely accounting the
> > load avg. Furthermore gc kicks in in very specific scenarios obviously, such
> > as as by the allocator task, so I don't see why bcache gc should want to be
> > interruptible.
> > 
> > Kent, Jens, can we get rid of this?
> 
> Here's a patch that just fixes the way gc gets woken up. Eric, you want to take
> this?

Sure, I'll put it up with my -rc2 pull request to Jens.  

A couple of sanity checks (for my understanding at least):

Why does bch_data_insert_start() no longer need to call 
set_gc_sectors(op->c) now that bch_cache_set_alloc and bch_gc_thread do?  

Does bch_cache_set_alloc() even need to call set_gc_sectors since 
bch_gc_thread() does before calling bch_btree_gc?

Also I'm curious, why change invalidate_needs_gc from a bitfield? 

-Eric

> 
> -- >8 --
> Subject: [PATCH] bcache: Make gc wakeup saner
> 
> This lets us ditch a set_task_state() call.
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> ---
>  drivers/md/bcache/bcache.h  |  4 ++--
>  drivers/md/bcache/btree.c   | 39 ++++++++++++++++++++-------------------
>  drivers/md/bcache/btree.h   |  3 +--
>  drivers/md/bcache/request.c |  4 +---
>  drivers/md/bcache/super.c   |  2 ++
>  5 files changed, 26 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 6b420a55c7..c3ea03c9a1 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -425,7 +425,7 @@ struct cache {
>  	 * until a gc finishes - otherwise we could pointlessly burn a ton of
>  	 * cpu
>  	 */
> -	unsigned		invalidate_needs_gc:1;
> +	unsigned		invalidate_needs_gc;
>  
>  	bool			discard; /* Get rid of? */
>  
> @@ -593,8 +593,8 @@ struct cache_set {
>  
>  	/* Counts how many sectors bio_insert has added to the cache */
>  	atomic_t		sectors_to_gc;
> +	wait_queue_head_t	gc_wait;
>  
> -	wait_queue_head_t	moving_gc_wait;
>  	struct keybuf		moving_gc_keys;
>  	/* Number of moving GC bios in flight */
>  	struct semaphore	moving_in_flight;
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 81d3db40cd..2efdce0724 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -1757,32 +1757,34 @@ static void bch_btree_gc(struct cache_set *c)
>  	bch_moving_gc(c);
>  }
>  
> -static int bch_gc_thread(void *arg)
> +static bool gc_should_run(struct cache_set *c)
>  {
> -	struct cache_set *c = arg;
>  	struct cache *ca;
>  	unsigned i;
>  
> -	while (1) {
> -again:
> -		bch_btree_gc(c);
> +	for_each_cache(ca, c, i)
> +		if (ca->invalidate_needs_gc)
> +			return true;
>  
> -		set_current_state(TASK_INTERRUPTIBLE);
> -		if (kthread_should_stop())
> -			break;
> +	if (atomic_read(&c->sectors_to_gc) < 0)
> +		return true;
>  
> -		mutex_lock(&c->bucket_lock);
> +	return false;
> +}
>  
> -		for_each_cache(ca, c, i)
> -			if (ca->invalidate_needs_gc) {
> -				mutex_unlock(&c->bucket_lock);
> -				set_current_state(TASK_RUNNING);
> -				goto again;
> -			}
> +static int bch_gc_thread(void *arg)
> +{
> +	struct cache_set *c = arg;
>  
> -		mutex_unlock(&c->bucket_lock);
> +	while (1) {
> +		wait_event_interruptible(c->gc_wait,
> +			   kthread_should_stop() || gc_should_run(c));
>  
> -		schedule();
> +		if (kthread_should_stop())
> +			break;
> +
> +		set_gc_sectors(c);
> +		bch_btree_gc(c);
>  	}
>  
>  	return 0;
> @@ -1790,11 +1792,10 @@ static int bch_gc_thread(void *arg)
>  
>  int bch_gc_thread_start(struct cache_set *c)
>  {
> -	c->gc_thread = kthread_create(bch_gc_thread, c, "bcache_gc");
> +	c->gc_thread = kthread_run(bch_gc_thread, c, "bcache_gc");
>  	if (IS_ERR(c->gc_thread))
>  		return PTR_ERR(c->gc_thread);
>  
> -	set_task_state(c->gc_thread, TASK_INTERRUPTIBLE);
>  	return 0;
>  }
>  
> diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
> index 5c391fa01b..9b80417cd5 100644
> --- a/drivers/md/bcache/btree.h
> +++ b/drivers/md/bcache/btree.h
> @@ -260,8 +260,7 @@ void bch_initial_mark_key(struct cache_set *, int, struct bkey *);
>  
>  static inline void wake_up_gc(struct cache_set *c)
>  {
> -	if (c->gc_thread)
> -		wake_up_process(c->gc_thread);
> +	wake_up(&c->gc_wait);
>  }
>  
>  #define MAP_DONE	0
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 40ffe5e424..a37c1776f2 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -196,10 +196,8 @@ static void bch_data_insert_start(struct closure *cl)
>  	struct data_insert_op *op = container_of(cl, struct data_insert_op, cl);
>  	struct bio *bio = op->bio, *n;
>  
> -	if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0) {
> -		set_gc_sectors(op->c);
> +	if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0)
>  		wake_up_gc(op->c);
> -	}
>  
>  	if (op->bypass)
>  		return bch_data_invalidate(cl);
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 849ad441cd..66669c8f41 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1491,6 +1491,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
>  	mutex_init(&c->bucket_lock);
>  	init_waitqueue_head(&c->btree_cache_wait);
>  	init_waitqueue_head(&c->bucket_wait);
> +	init_waitqueue_head(&c->gc_wait);
>  	sema_init(&c->uuid_write_mutex, 1);
>  
>  	spin_lock_init(&c->btree_gc_time.lock);
> @@ -1550,6 +1551,7 @@ static void run_cache_set(struct cache_set *c)
>  
>  	for_each_cache(ca, c, i)
>  		c->nbuckets += ca->sb.nbuckets;
> +	set_gc_sectors(c);
>  
>  	if (CACHE_SYNC(&c->sb)) {
>  		LIST_HEAD(journal);
> -- 
> 2.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)
  2016-10-25 16:55           ` Eric Wheeler
@ 2016-10-25 17:45             ` Kent Overstreet
  0 siblings, 0 replies; 59+ messages in thread
From: Kent Overstreet @ 2016-10-25 17:45 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Davidlohr Bueso, Linux Kernel Mailing List, linux-bcache

On Tue, Oct 25, 2016 at 09:55:21AM -0700, Eric Wheeler wrote:
> Sure, I'll put it up with my -rc2 pull request to Jens.  
> 
> A couple of sanity checks (for my understanding at least):
> 
> Why does bch_data_insert_start() no longer need to call 
> set_gc_sectors(op->c) now that bch_cache_set_alloc and bch_gc_thread do?  

Before, the gc thread wasn't being explicitly signalled that it was time to run
- it was just being woken up, meaning that a spurious wakeup would cause gc to
run. At best it was sketchy and fragile, for multiple reasons - wait_event() is
a much better mechanism.

So with wait_event() gc is explicitly checking if it's time to run - the wakeup
doesn't really make anything happen by itself, it just pokes the gc thread so
it's able to notice that it should run.

When you're signalling a thread this way - we're in effect setting a global
variable that says "gc should run now", then kicking the gc thread so it can
check the "gc should run" variable. But wakeups aren't synchronous - our call to
wake_up() doesn't make the gc thread check that variable before it returns, all
we know when the wake_up() call returns is that the gc thread is going to check
that variable some point in the future. So we can't set the "gc should run"
varible, wake up the gc thread, and then set it back to "gc shouldn't run yet" -
what'll happen most of the time is that the gc thread won't run before we set
the variable back to "gc shouldn't run yet", it'll never see that it was
supposed to run and it'll go back to sleep.

So the way you make this work is you have the gc thread has to set the variable
back to "gc shouldn't run yet" _after_ it's seen it and decided to run.

> Does bch_cache_set_alloc() even need to call set_gc_sectors since 
> bch_gc_thread() does before calling bch_btree_gc?

Yes, because the gc thread only resets the counter when it's decided to run - we
don't want it to run right away at startup.

> Also I'm curious, why change invalidate_needs_gc from a bitfield? 

Bitfields are particularly unsafe for multiple threads to access - the compiler
has to emit instructions to do read/modify/write, which will clobber adjacent
data. A bare int is also not in _general_ safe for multiple threads to access
without a lock, but for what it's doing here it's fine.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-10-07 15:43   ` Peter Zijlstra
  2016-10-07 15:58     ` Linus Torvalds
  2016-10-08 11:58     ` Thomas Gleixner
@ 2016-11-09 10:38     ` Peter Zijlstra
  2 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-11-09 10:38 UTC (permalink / raw)
  To: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney
  Cc: Chris Wilson, Daniel Vetter, Rob Clark

On Fri, Oct 07, 2016 at 05:43:51PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 07, 2016 at 04:52:44PM +0200, Peter Zijlstra wrote:
> > Poking at lock internals is not cool. Since I'm going to change the
> > implementation this will break, take it out.
> 
> 
> So something like the below would serve as a replacement for your
> previous hacks. Is this API something acceptable to people?

Compile tested only.. Daniel reminded me in another thread.

---
Subject: locking/mutex,drm: Introduce mutex_trylock_recursive()
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 7 Oct 2016 17:43:51 +0200

By popular DRM demand, introduce mutex_trylock_recursive() to fix up the
two GEM users.

Without this it is very easy for these drivers to get stuck in
low-memory situations and trigger OOM. Work is in progress to remove the
need for this in at least i915.

Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Rob Clark <robdclark@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/gpu/drm/i915/i915_gem_shrinker.c | 15 ++++++++++++---
 drivers/gpu/drm/msm/msm_gem_shrinker.c   | 16 ++++++++++++----
 include/linux/mutex.h                    | 31 +++++++++++++++++++++++++++++++
 scripts/checkpatch.pl                    |  6 ++++++
 4 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
index e9bd2a81d03a..5543d993a50e 100644
--- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
+++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
@@ -227,11 +227,20 @@ unsigned long i915_gem_shrink_all(struct drm_i915_private *dev_priv)
 
 static bool i915_gem_shrinker_lock(struct drm_device *dev, bool *unlock)
 {
-	if (!mutex_trylock(&dev->struct_mutex))
+	switch (mutex_trylock_recursive(&dev->struct_mutex)) {
+	case MUTEX_TRYLOCK_FAILED:
 		return false;
 
-	*unlock = true;
-	return true;
+	case MUTEX_TRYLOCK_SUCCESS:
+		*unlock = true;
+		return true;
+
+	case MUTEX_TRYLOCK_RECURSIVE:
+		*unlock = false;
+		return true;
+	}
+
+	BUG();
 }
 
 static unsigned long
diff --git a/drivers/gpu/drm/msm/msm_gem_shrinker.c b/drivers/gpu/drm/msm/msm_gem_shrinker.c
index 6d2e885bd58e..62b8cc653823 100644
--- a/drivers/gpu/drm/msm/msm_gem_shrinker.c
+++ b/drivers/gpu/drm/msm/msm_gem_shrinker.c
@@ -20,13 +20,21 @@
 
 static bool msm_gem_shrinker_lock(struct drm_device *dev, bool *unlock)
 {
-	if (!mutex_trylock(&dev->struct_mutex))
+	switch (mutex_trylock_recursive(&dev->struct_mutex)) {
+	case MUTEX_TRYLOCK_FAILED:
 		return false;
 
-	*unlock = true;
-	return true;
-}
+	case MUTEX_TRYLOCK_SUCCESS:
+		*unlock = true;
+		return true;
 
+	case MUTEX_TRYLOCK_RECURSIVE:
+		*unlock = false;
+		return true;
+	}
+
+	BUG();
+}
 
 static unsigned long
 msm_gem_shrinker_count(struct shrinker *shrinker, struct shrink_control *sc)
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 4d3bccabbea5..6a902f0a2148 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -189,4 +189,35 @@ extern void mutex_unlock(struct mutex *lock);
 
 extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);
 
+/*
+ * These values are chosen such that FAIL and SUCCESS match the
+ * values of the regular mutex_trylock().
+ */
+enum mutex_trylock_recursive_enum {
+	MUTEX_TRYLOCK_FAILED    = 0,
+	MUTEX_TRYLOCK_SUCCESS   = 1,
+	MUTEX_TRYLOCK_RECURSIVE,
+};
+
+/**
+ * mutex_trylock_recursive - trylock variant that allows recursive locking
+ * @lock: mutex to be locked
+ *
+ * This function should not be used, _ever_. It is purely for hysterical GEM
+ * raisins, and once those are gone this will be removed.
+ *
+ * Returns:
+ *  MUTEX_TRYLOCK_FAILED    - trylock failed,
+ *  MUTEX_TRYLOCK_SUCCESS   - lock acquired,
+ *  MUTEX_TRYLOCK_RECURSIVE - we already owned the lock.
+ */
+static inline __deprecated __must_check enum mutex_trylock_recursive_enum
+mutex_trylock_recursive(struct mutex *lock)
+{
+	if (unlikely(__mutex_owner(lock) == current))
+		return MUTEX_TRYLOCK_RECURSIVE;
+
+	return mutex_trylock(lock);
+}
+
 #endif /* __LINUX_MUTEX_H */
diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index a8368d1c4348..23f462f64a3f 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -6076,6 +6076,12 @@ sub process {
 			}
 		}
 
+# check for mutex_trylock_recursive usage
+		if ($line =~ /mutex_trylock_recursive/) {
+			ERROR("LOCKING",
+			      "recursive locking is bad, do not use this ever.\n" . $herecurr);
+		}
+
 # check for lockdep_set_novalidate_class
 		if ($line =~ /^.\s*lockdep_set_novalidate_class\s*\(/ ||
 		    $line =~ /__lockdep_no_validate__\s*\)/ ) {

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-10-18 12:57     ` Peter Zijlstra
@ 2016-11-11 11:22       ` Daniel Vetter
  2016-11-11 11:38         ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Daniel Vetter @ 2016-11-11 11:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Rob Clark

On Tue, Oct 18, 2016 at 2:57 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Oct 18, 2016 at 02:48:41PM +0200, Peter Zijlstra wrote:
>> On Fri, Oct 07, 2016 at 04:52:44PM +0200, Peter Zijlstra wrote:
>> > Poking at lock internals is not cool. Since I'm going to change the
>> > implementation this will break, take it out.
>> >
>> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> > Cc: Rob Clark <robdclark@gmail.com>
>> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> > ---
>> >  drivers/gpu/drm/i915/i915_gem_shrinker.c |   26 +++-----------------------
>> >  drivers/gpu/drm/msm/msm_gem_shrinker.c   |   23 +++--------------------
>> >  2 files changed, 6 insertions(+), 43 deletions(-)
>>
>> OK, so it appears that i915 changed their locking around and got rid of
>> this thing entirely. Much appreciated Chris!!
>
> Hmm, I might have spoken too soon. My patch conflicted and I seem to
> have read too much in the Changelog of 3b4e896f14b1 ("drm/i915: Remove
> unused no-shrinker-steal").

Once all your locking rework is assembled it might be good to have a
topic branch I could pull in. Both for testing and to handle conflicts
before it goes boom in the merge window ;-) Not necessary ofc, but I
think it'd be useful.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-11-11 11:22       ` Daniel Vetter
@ 2016-11-11 11:38         ` Peter Zijlstra
  2016-11-12 10:58           ` Ingo Molnar
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2016-11-11 11:38 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Chris Wilson, Rob Clark

On Fri, Nov 11, 2016 at 12:22:02PM +0100, Daniel Vetter wrote:

> Once all your locking rework is assembled it might be good to have a
> topic branch I could pull in. Both for testing and to handle conflicts
> before it goes boom in the merge window ;-) Not necessary ofc, but I
> think it'd be useful.

Everything except the trylock_recursive is already in tip/locking/core.

Ingo, is that all there is in that branch?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-11-11 11:38         ` Peter Zijlstra
@ 2016-11-12 10:58           ` Ingo Molnar
  2016-11-14 14:04             ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2016-11-12 10:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Vetter, Linus Torvalds, Waiman Long, Jason Low,
	Ding Tianhong, Thomas Gleixner, Will Deacon, Ingo Molnar,
	Imre Deak, Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen,
	Terry Rudd, Paul E. McKenney, Jason Low, Chris Wilson, Rob Clark


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Nov 11, 2016 at 12:22:02PM +0100, Daniel Vetter wrote:
> 
> > Once all your locking rework is assembled it might be good to have a
> > topic branch I could pull in. Both for testing and to handle conflicts
> > before it goes boom in the merge window ;-) Not necessary ofc, but I
> > think it'd be useful.
> 
> Everything except the trylock_recursive is already in tip/locking/core.
> 
> Ingo, is that all there is in that branch?

It has other bits as well - but I can create a separate topic branch for this, 
which would be c7faee2109f9 plus trylock_recursive().

Could you send a final version of trylock_recursive() patch for me to apply?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-11-12 10:58           ` Ingo Molnar
@ 2016-11-14 14:04             ` Peter Zijlstra
  2016-11-14 14:27               ` Ingo Molnar
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2016-11-14 14:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Daniel Vetter, Linus Torvalds, Waiman Long, Jason Low,
	Ding Tianhong, Thomas Gleixner, Will Deacon, Ingo Molnar,
	Imre Deak, Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen,
	Terry Rudd, Paul E. McKenney, Jason Low, Chris Wilson, Rob Clark

On Sat, Nov 12, 2016 at 11:58:49AM +0100, Ingo Molnar wrote:

> Could you send a final version of trylock_recursive() patch for me to apply?

The latest lives here:

 lkml.kernel.org/r/20161109103813.GN3157@twins.programming.kicks-ass.net

But I would really like someone to actually test that before you stick
it in.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 1/8] locking/drm: Kill mutex trickery
  2016-11-14 14:04             ` Peter Zijlstra
@ 2016-11-14 14:27               ` Ingo Molnar
  0 siblings, 0 replies; 59+ messages in thread
From: Ingo Molnar @ 2016-11-14 14:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Vetter, Linus Torvalds, Waiman Long, Jason Low,
	Ding Tianhong, Thomas Gleixner, Will Deacon, Ingo Molnar,
	Imre Deak, Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen,
	Terry Rudd, Paul E. McKenney, Jason Low, Chris Wilson, Rob Clark


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Nov 12, 2016 at 11:58:49AM +0100, Ingo Molnar wrote:
> 
> > Could you send a final version of trylock_recursive() patch for me to apply?
> 
> The latest lives here:
> 
>  lkml.kernel.org/r/20161109103813.GN3157@twins.programming.kicks-ass.net
> 
> But I would really like someone to actually test that before you stick
> it in.

Agreed ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation
  2016-10-07 14:52 ` [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation Peter Zijlstra
  2016-10-13 15:14   ` Will Deacon
  2016-10-17 18:45   ` Waiman Long
@ 2016-12-27 13:55   ` Chris Wilson
  2017-01-09 11:52   ` [PATCH] locking/mutex: Clear mutex-handoff flag on interrupt Chris Wilson
  3 siblings, 0 replies; 59+ messages in thread
From: Chris Wilson @ 2016-12-27 13:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Waiman Long, Jason Low, Ding Tianhong,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Imre Deak,
	Linux Kernel Mailing List, Davidlohr Bueso, Tim Chen, Terry Rudd,
	Paul E. McKenney, Jason Low, Daniel Vetter

On Fri, Oct 07, 2016 at 04:52:48PM +0200, Peter Zijlstra wrote:
> Implement lock handoff to avoid lock starvation.
> 
> Lock starvation is possible because mutex_lock() allows lock stealing,
> where a running (or optimistic spinning) task beats the woken waiter
> to the acquire.
> 
> Lock stealing is an important performance optimization because waiting
> for a waiter to wake up and get runtime can take a significant time,
> during which everyboy would stall on the lock.
> 
> The down-side is of course that it allows for starvation.
> 
> This patch has the waiter requesting a handoff if it fails to acquire
> the lock upon waking. This re-introduces some of the wait time,
> because once we do a handoff we have to wait for the waiter to wake up
> again.
> 
> A future patch will add a round of optimistic spinning to attempt to
> alleviate this penalty, but if that turns out to not be enough, we can
> add a counter and only request handoff after multiple failed wakeups.
> 
> There are a few tricky implementation details:
> 
>  - accepting a handoff must only be done in the wait-loop. Since the
>    handoff condition is owner == current, it can easily cause
>    recursive locking trouble.
> 
>  - accepting the handoff must be careful to provide the ACQUIRE
>    semantics.
> 
>  - having the HANDOFF bit set on unlock requires care, we must not
>    clear the owner.
> 
>  - we must be careful to not leave HANDOFF set after we've acquired
>    the lock. The tricky scenario is setting the HANDOFF bit on an
>    unlocked mutex.

There's a hole along the interruptible path - we leave the HANDOFF bit
set, even though the first waiter returns with -EINTR. The unlock then
sees the HANDOFF, assigns it to the next waiter, but that waiter does a
racy check to decide if it is first, decides it is not and so skips the
trylock and also returns with -EINTR. (i.e. invalidating the

                /*
                 * Here we order against unlock; we must either see it change
                 * state back to RUNNING and fall through the next schedule(),
                 * or we must see its unlock and acquire.
                 */

as we may not reach the next schedule). Repeating the
__mutex_waiter_is_first() after acquiring the wait_lock is sufficient,
as is clearing the HANDOFF bit before -EINTR.

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 9b349619f431..6f7e3bf0d595 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -684,6 +684,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
                 * against mutex_unlock() and wake-ups do not go missing.
                 */
                if (unlikely(signal_pending_state(state, task))) {
+                       if (first)
+                               __mutex_clear_flag(lock, MUTEX_FLAG_HANDOFF);
                        ret = -EINTR;
                        goto err;
                }

Though I expect you will be able to find a better solution.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH] locking/mutex: Clear mutex-handoff flag on interrupt
  2016-10-07 14:52 ` [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation Peter Zijlstra
                     ` (2 preceding siblings ...)
  2016-12-27 13:55   ` Chris Wilson
@ 2017-01-09 11:52   ` Chris Wilson
  2017-01-11 16:43     ` Peter Zijlstra
  3 siblings, 1 reply; 59+ messages in thread
From: Chris Wilson @ 2017-01-09 11:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Chris Wilson, Peter Zijlstra, Waiman Long, Andrew Morton,
	Linus Torvalds, Paul E . McKenney, Thomas Gleixner

If we abort the mutex_lock() due to an interrupt, or other error from
ww_mutex, we need to relinquish the handoff flag if we applied it.
Otherwise, we may cause missed wakeups as the current owner may try to
handoff to a new thread that is not expecting the handoff and so sleep
thinking the lock is already claimed (and since the owner unlocked there
may never be a new wakeup).

Fixes: 9d659ae14b54 ("locking/mutex: Add lock handoff to avoid starvation")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Waiman Long <Waiman.Long@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
---
 kernel/locking/mutex.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 9b349619f431..087bbc082ad7 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -738,6 +738,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 err:
 	__set_task_state(task, TASK_RUNNING);
 	mutex_remove_waiter(lock, &waiter, task);
+	if (first)
+		__mutex_clear_flag(lock, MUTEX_FLAG_HANDOFF);
 	spin_unlock_mutex(&lock->wait_lock, flags);
 	debug_mutex_free_waiter(&waiter);
 	mutex_release(&lock->dep_map, 1, ip);
-- 
2.11.0

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] locking/mutex: Clear mutex-handoff flag on interrupt
  2017-01-09 11:52   ` [PATCH] locking/mutex: Clear mutex-handoff flag on interrupt Chris Wilson
@ 2017-01-11 16:43     ` Peter Zijlstra
  2017-01-11 16:57       ` Chris Wilson
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2017-01-11 16:43 UTC (permalink / raw)
  To: Chris Wilson
  Cc: linux-kernel, Waiman Long, Andrew Morton, Linus Torvalds,
	Paul E . McKenney, Thomas Gleixner

On Mon, Jan 09, 2017 at 11:52:03AM +0000, Chris Wilson wrote:
> If we abort the mutex_lock() due to an interrupt, or other error from

s/interrupt/signal/, right?

> ww_mutex, we need to relinquish the handoff flag if we applied it.
> Otherwise, we may cause missed wakeups as the current owner may try to
> handoff to a new thread that is not expecting the handoff and so sleep
> thinking the lock is already claimed (and since the owner unlocked there
> may never be a new wakeup).

Isn't that the exact same scenario as Nicolai fixed here:

  http://lkml.kernel.org/r/1482346000-9927-3-git-send-email-nhaehnle@gmail.com

Did you, like Nicolai, find this by inspection, or can you reproduce?


FWIW, I have the below patch that should also solve this problem afaict.


---
Subject: locking/mutex: Fix mutex handoff
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Jan 11 14:17:48 CET 2017

While reviewing the ww_mutex patches, I noticed that it was still
possible to (incorrectly) succeed for (incorrect) code like:

	mutex_lock(&a);
	mutex_lock(&a);

This was possible if the second mutex_lock() would block (as expected)
but then receive a spurious wakeup. At that point it would find itself
at the front of the queue, request a handoff and instantly claim
ownership and continue, since owner would point to itself.

Avoid this scenario and simplify the code by introducing a third low
bit to signal handoff pickup. So once we request handoff, unlock
clears the handoff bit and sets the pickup bit along with the new
owner.

This also removes the need for the .handoff argument to
__mutex_trylock(), since that becomes superfluous with PICKUP.

In order to guarantee enough low bits, ensure task_struct alignment is
at least L1_CACHE_BYTES (which seems a good ideal regardless).

Fixes: 9d659ae14b54 ("locking/mutex: Add lock handoff to avoid starvation")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/mutex.h  |    2 
 kernel/fork.c          |    6 +-
 kernel/locking/mutex.c |  106 +++++++++++++++++++++++--------------------------
 3 files changed, 56 insertions(+), 58 deletions(-)

--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -65,7 +65,7 @@ struct mutex {
 
 static inline struct task_struct *__mutex_owner(struct mutex *lock)
 {
-	return (struct task_struct *)(atomic_long_read(&lock->owner) & ~0x03);
+	return (struct task_struct *)(atomic_long_read(&lock->owner) & ~0x07);
 }
 
 /*
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -432,11 +432,13 @@ void __init fork_init(void)
 	int i;
 #ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
 #ifndef ARCH_MIN_TASKALIGN
-#define ARCH_MIN_TASKALIGN	L1_CACHE_BYTES
+#define ARCH_MIN_TASKALIGN	0
 #endif
+	int align = min_t(int, L1_CACHE_BYTES, ARCH_MIN_TASKALIGN);
+
 	/* create a slab on which task_structs can be allocated */
 	task_struct_cachep = kmem_cache_create("task_struct",
-			arch_task_struct_size, ARCH_MIN_TASKALIGN,
+			arch_task_struct_size, align,
 			SLAB_PANIC|SLAB_NOTRACK|SLAB_ACCOUNT, NULL);
 #endif
 
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -50,16 +50,17 @@ EXPORT_SYMBOL(__mutex_init);
 /*
  * @owner: contains: 'struct task_struct *' to the current lock owner,
  * NULL means not owned. Since task_struct pointers are aligned at
- * ARCH_MIN_TASKALIGN (which is at least sizeof(void *)), we have low
- * bits to store extra state.
+ * at least L1_CACHE_BYTES, we have low bits to store extra state.
  *
  * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
  * Bit1 indicates unlock needs to hand the lock to the top-waiter
+ * Bit2 indicates handoff has been done and we're waiting for pickup.
  */
 #define MUTEX_FLAG_WAITERS	0x01
 #define MUTEX_FLAG_HANDOFF	0x02
+#define MUTEX_FLAG_PICKUP	0x04
 
-#define MUTEX_FLAGS		0x03
+#define MUTEX_FLAGS		0x07
 
 static inline struct task_struct *__owner_task(unsigned long owner)
 {
@@ -72,38 +73,29 @@ static inline unsigned long __owner_flag
 }
 
 /*
- * Actual trylock that will work on any unlocked state.
- *
- * When setting the owner field, we must preserve the low flag bits.
- *
- * Be careful with @handoff, only set that in a wait-loop (where you set
- * HANDOFF) to avoid recursive lock attempts.
+ * Trylock variant that retuns the owning task on failure.
  */
-static inline bool __mutex_trylock(struct mutex *lock, const bool handoff)
+static inline struct task_struct *__mutex_trylock_or_owner(struct mutex *lock)
 {
 	unsigned long owner, curr = (unsigned long)current;
 
 	owner = atomic_long_read(&lock->owner);
 	for (;;) { /* must loop, can race against a flag */
 		unsigned long old, flags = __owner_flags(owner);
+		unsigned long task = owner & ~MUTEX_FLAGS;
 
-		if (__owner_task(owner)) {
-			if (handoff && unlikely(__owner_task(owner) == current)) {
-				/*
-				 * Provide ACQUIRE semantics for the lock-handoff.
-				 *
-				 * We cannot easily use load-acquire here, since
-				 * the actual load is a failed cmpxchg, which
-				 * doesn't imply any barriers.
-				 *
-				 * Also, this is a fairly unlikely scenario, and
-				 * this contains the cost.
-				 */
-				smp_mb(); /* ACQUIRE */
-				return true;
-			}
+		if (task) {
+			if (likely(task != curr))
+				break;
+
+			if (likely(!(flags & MUTEX_FLAG_PICKUP)))
+				break;
 
-			return false;
+			flags &= ~MUTEX_FLAG_PICKUP;
+		} else {
+#ifdef CONFIG_DEBUG_MUTEXES
+			DEBUG_LOCKS_WARN_ON(flags & MUTEX_FLAG_PICKUP);
+#endif
 		}
 
 		/*
@@ -111,15 +103,24 @@ static inline bool __mutex_trylock(struc
 		 * past the point where we acquire it. This would be possible
 		 * if we (accidentally) set the bit on an unlocked mutex.
 		 */
-		if (handoff)
-			flags &= ~MUTEX_FLAG_HANDOFF;
+		flags &= ~MUTEX_FLAG_HANDOFF;
 
 		old = atomic_long_cmpxchg_acquire(&lock->owner, owner, curr | flags);
 		if (old == owner)
-			return true;
+			return NULL;
 
 		owner = old;
 	}
+
+	return __owner_task(owner);
+}
+
+/*
+ * Actual trylock that will work on any unlocked state.
+ */
+static inline bool __mutex_trylock(struct mutex *lock)
+{
+	return !__mutex_trylock_or_owner(lock);
 }
 
 #ifndef CONFIG_DEBUG_LOCK_ALLOC
@@ -171,9 +172,9 @@ static inline bool __mutex_waiter_is_fir
 
 /*
  * Give up ownership to a specific task, when @task = NULL, this is equivalent
- * to a regular unlock. Clears HANDOFF, preserves WAITERS. Provides RELEASE
- * semantics like a regular unlock, the __mutex_trylock() provides matching
- * ACQUIRE semantics for the handoff.
+ * to a regular unlock. Sets PICKUP on a handoff, clears HANDOF, preserves
+ * WAITERS. Provides RELEASE semantics like a regular unlock, the
+ * __mutex_trylock() provides a matching ACQUIRE semantics for the handoff.
  */
 static void __mutex_handoff(struct mutex *lock, struct task_struct *task)
 {
@@ -184,10 +185,13 @@ static void __mutex_handoff(struct mutex
 
 #ifdef CONFIG_DEBUG_MUTEXES
 		DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current);
+		DEBUG_LOCKS_WARN_ON(owner & MUTEX_FLAG_PICKUP);
 #endif
 
 		new = (owner & MUTEX_FLAG_WAITERS);
 		new |= (unsigned long)task;
+		if (task)
+			new |= MUTEX_FLAG_PICKUP;
 
 		old = atomic_long_cmpxchg_release(&lock->owner, owner, new);
 		if (old == owner)
@@ -435,8 +439,6 @@ static bool mutex_optimistic_spin(struct
 				  struct ww_acquire_ctx *ww_ctx,
 				  const bool use_ww_ctx, const bool waiter)
 {
-	struct task_struct *task = current;
-
 	if (!waiter) {
 		/*
 		 * The purpose of the mutex_can_spin_on_owner() function is
@@ -476,24 +478,17 @@ static bool mutex_optimistic_spin(struct
 				goto fail_unlock;
 		}
 
+		/* Try to acquire the mutex... */
+		owner = __mutex_trylock_or_owner(lock);
+		if (!owner)
+			break;
+
 		/*
-		 * If there's an owner, wait for it to either
+		 * There's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */
-		owner = __mutex_owner(lock);
-		if (owner) {
-			if (waiter && owner == task) {
-				smp_mb(); /* ACQUIRE */
-				break;
-			}
-
-			if (!mutex_spin_on_owner(lock, owner))
-				goto fail_unlock;
-		}
-
-		/* Try to acquire the mutex if it is unlocked. */
-		if (__mutex_trylock(lock, waiter))
-			break;
+		if (!mutex_spin_on_owner(lock, owner))
+			goto fail_unlock;
 
 		/*
 		 * The cpu_relax() call is a compiler barrier which forces
@@ -637,7 +632,7 @@ __mutex_lock_common(struct mutex *lock,
 	preempt_disable();
 	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
 
-	if (__mutex_trylock(lock, false) ||
+	if (__mutex_trylock(lock) ||
 	    mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx, false)) {
 		/* got the lock, yay! */
 		lock_acquired(&lock->dep_map, ip);
@@ -651,7 +646,7 @@ __mutex_lock_common(struct mutex *lock,
 	/*
 	 * After waiting to acquire the wait_lock, try again.
 	 */
-	if (__mutex_trylock(lock, false))
+	if (__mutex_trylock(lock))
 		goto skip_wait;
 
 	debug_mutex_lock_common(lock, &waiter);
@@ -674,7 +669,7 @@ __mutex_lock_common(struct mutex *lock,
 		 * before testing the error conditions to make sure we pick up
 		 * the handoff.
 		 */
-		if (__mutex_trylock(lock, first))
+		if (__mutex_trylock(lock))
 			goto acquired;
 
 		/*
@@ -707,8 +702,8 @@ __mutex_lock_common(struct mutex *lock,
 		 * state back to RUNNING and fall through the next schedule(),
 		 * or we must see its unlock and acquire.
 		 */
-		if ((first && mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx, true)) ||
-		     __mutex_trylock(lock, first))
+		if (__mutex_trylock(lock) ||
+		    (first && mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx, true)))
 			break;
 
 		spin_lock_mutex(&lock->wait_lock, flags);
@@ -865,6 +860,7 @@ static noinline void __sched __mutex_unl
 
 #ifdef CONFIG_DEBUG_MUTEXES
 		DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current);
+		DEBUG_LOCKS_WARN_ON(owner & MUTEX_FLAG_PICKUP);
 #endif
 
 		if (owner & MUTEX_FLAG_HANDOFF)
@@ -1003,7 +999,7 @@ __ww_mutex_lock_interruptible_slowpath(s
  */
 int __sched mutex_trylock(struct mutex *lock)
 {
-	bool locked = __mutex_trylock(lock, false);
+	bool locked = __mutex_trylock(lock);
 
 	if (locked)
 		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] locking/mutex: Clear mutex-handoff flag on interrupt
  2017-01-11 16:43     ` Peter Zijlstra
@ 2017-01-11 16:57       ` Chris Wilson
  2017-01-12 20:58         ` Chris Wilson
  0 siblings, 1 reply; 59+ messages in thread
From: Chris Wilson @ 2017-01-11 16:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Waiman Long, Andrew Morton, Linus Torvalds,
	Paul E . McKenney, Thomas Gleixner

On Wed, Jan 11, 2017 at 05:43:02PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 09, 2017 at 11:52:03AM +0000, Chris Wilson wrote:
> > If we abort the mutex_lock() due to an interrupt, or other error from
> 
> s/interrupt/signal/, right?

Yes. EINTR is ingrained.
 
> > ww_mutex, we need to relinquish the handoff flag if we applied it.
> > Otherwise, we may cause missed wakeups as the current owner may try to
> > handoff to a new thread that is not expecting the handoff and so sleep
> > thinking the lock is already claimed (and since the owner unlocked there
> > may never be a new wakeup).
> 
> Isn't that the exact same scenario as Nicolai fixed here:
> 
>   http://lkml.kernel.org/r/1482346000-9927-3-git-send-email-nhaehnle@gmail.com
> 
> Did you, like Nicolai, find this by inspection, or can you reproduce?

Looks like it should be. It takes about 20 minutes of running a stress
test, but it is very reliably hit on the current kernel.

> FWIW, I have the below patch that should also solve this problem afaict.

Thanks, I shall see if makes my machines happy.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] locking/mutex: Clear mutex-handoff flag on interrupt
  2017-01-11 16:57       ` Chris Wilson
@ 2017-01-12 20:58         ` Chris Wilson
  0 siblings, 0 replies; 59+ messages in thread
From: Chris Wilson @ 2017-01-12 20:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Waiman Long, Andrew Morton, Linus Torvalds,
	Paul E . McKenney, Thomas Gleixner

On Wed, Jan 11, 2017 at 04:57:32PM +0000, Chris Wilson wrote:
> On Wed, Jan 11, 2017 at 05:43:02PM +0100, Peter Zijlstra wrote:
> > FWIW, I have the below patch that should also solve this problem afaict.
> 
> Thanks, I shall see if makes my machines happy.

I have had a few machines run for over 24 hours on this patch.
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2017-01-12 20:58 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-07 14:52 [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex Peter Zijlstra
2016-10-07 14:52 ` [PATCH -v4 1/8] locking/drm: Kill mutex trickery Peter Zijlstra
2016-10-07 15:43   ` Peter Zijlstra
2016-10-07 15:58     ` Linus Torvalds
2016-10-07 16:13       ` Peter Zijlstra
2016-10-07 21:58         ` Waiman Long
2016-10-08 11:58     ` Thomas Gleixner
2016-10-08 14:01       ` Peter Zijlstra
2016-10-08 14:11         ` Thomas Gleixner
2016-10-08 16:42           ` Peter Zijlstra
2016-11-09 10:38     ` Peter Zijlstra
2016-10-18 12:48   ` Peter Zijlstra
2016-10-18 12:57     ` Peter Zijlstra
2016-11-11 11:22       ` Daniel Vetter
2016-11-11 11:38         ` Peter Zijlstra
2016-11-12 10:58           ` Ingo Molnar
2016-11-14 14:04             ` Peter Zijlstra
2016-11-14 14:27               ` Ingo Molnar
2016-10-18 12:57     ` Chris Wilson
2016-10-07 14:52 ` [PATCH -v4 2/8] locking/mutex: Rework mutex::owner Peter Zijlstra
2016-10-12 17:59   ` Davidlohr Bueso
2016-10-12 19:52     ` Jason Low
2016-10-13 15:18   ` Will Deacon
2016-10-07 14:52 ` [PATCH -v4 3/8] locking/mutex: Kill arch specific code Peter Zijlstra
2016-10-07 14:52 ` [PATCH -v4 4/8] locking/mutex: Allow MUTEX_SPIN_ON_OWNER when DEBUG_MUTEXES Peter Zijlstra
2016-10-07 14:52 ` [PATCH -v4 5/8] locking/mutex: Add lock handoff to avoid starvation Peter Zijlstra
2016-10-13 15:14   ` Will Deacon
2016-10-17  9:22     ` Peter Zijlstra
2016-10-17 18:45   ` Waiman Long
2016-10-17 19:07     ` Waiman Long
2016-10-18 13:02       ` Peter Zijlstra
2016-10-18 12:36     ` Peter Zijlstra
2016-12-27 13:55   ` Chris Wilson
2017-01-09 11:52   ` [PATCH] locking/mutex: Clear mutex-handoff flag on interrupt Chris Wilson
2017-01-11 16:43     ` Peter Zijlstra
2017-01-11 16:57       ` Chris Wilson
2017-01-12 20:58         ` Chris Wilson
2016-10-07 14:52 ` [PATCH -v4 6/8] locking/mutex: Restructure wait loop Peter Zijlstra
2016-10-13 15:17   ` Will Deacon
2016-10-17 10:44     ` Peter Zijlstra
2016-10-17 13:24       ` Peter Zijlstra
2016-10-17 13:45         ` Boqun Feng
2016-10-17 15:49           ` Peter Zijlstra
2016-10-19 17:34     ` Peter Zijlstra
2016-10-24  1:57       ` ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop) Davidlohr Bueso
2016-10-24 13:26         ` Kent Overstreet
2016-10-24 14:27         ` Kent Overstreet
2016-10-25 16:55           ` Eric Wheeler
2016-10-25 17:45             ` Kent Overstreet
2016-10-17 23:16   ` [PATCH -v4 6/8] locking/mutex: Restructure wait loop Waiman Long
2016-10-18 13:14     ` Peter Zijlstra
2016-10-07 14:52 ` [PATCH -v4 7/8] locking/mutex: Simplify some ww_mutex code in __mutex_lock_common() Peter Zijlstra
2016-10-07 14:52 ` [PATCH -v4 8/8] locking/mutex: Enable optimistic spinning of woken waiter Peter Zijlstra
2016-10-13 15:28   ` Will Deacon
2016-10-17  9:32     ` Peter Zijlstra
2016-10-17 23:21   ` Waiman Long
2016-10-18 12:19     ` Peter Zijlstra
2016-10-07 15:20 ` [PATCH -v4 0/8] locking/mutex: Rewrite basic mutex Linus Torvalds
2016-10-11 18:42 ` Jason Low

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).