linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/8] locking/core patches
@ 2014-02-10 19:58 Peter Zijlstra
  2014-02-10 19:58 ` [PATCH 1/8] locking: Move mcs_spinlock.h into kernel/locking/ Peter Zijlstra
                   ` (9 more replies)
  0 siblings, 10 replies; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-10 19:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Low, Waiman Long, Peter Zijlstra, mingo, paulmck, torvalds,
	tglx, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

Hi all,

I would propose merging the following patches...

The first set is mostly from Jason and tweaks the mutex adaptive
spinning; AIM7 (reaim) throughput numbers:

PRE:  100   2000.04  21564.90 2721.29 311.99     3.12       0.01     0.00     99
POST: 100   2000.04  42603.85 5142.80 311.99     3.12       0.00     0.00     99

The second set is the qrwlock, although mostly rewritten by me. I didn't do
much with it other than boot and build a kernel, but I like these patches
because of the much better worst-case performance.



* [PATCH 1/8] locking: Move mcs_spinlock.h into kernel/locking/
  2014-02-10 19:58 [PATCH 0/8] locking/core patches Peter Zijlstra
@ 2014-02-10 19:58 ` Peter Zijlstra
  2014-02-10 19:58 ` [PATCH 2/8] mutex: In mutex_can_spin_on_owner(), return false if task need_resched() Peter Zijlstra
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-10 19:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Low, Waiman Long, Peter Zijlstra, mingo, paulmck, torvalds,
	tglx, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

[-- Attachment #1: move-mcs_lock.patch --]
[-- Type: text/plain, Size: 8103 bytes --]

The mcs_spinlock code is not meant (or suitable) as a generic locking
primitive; therefore move it out of the normal include path and into
kernel/locking/.

This way the locking primitives implemented there can use it as part
of their implementation, but we do not risk it getting used
inappropriately elsewhere.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/mcs_spinlock.h  |  114 ------------------------------------------
 kernel/locking/mcs_spinlock.h |  114 ++++++++++++++++++++++++++++++++++++++++++
 kernel/locking/mutex.c        |    2 
 3 files changed, 115 insertions(+), 115 deletions(-)

--- a/include/linux/mcs_spinlock.h
+++ /dev/null
@@ -1,114 +0,0 @@
-/*
- * MCS lock defines
- *
- * This file contains the main data structure and API definitions of MCS lock.
- *
- * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
- * with the desirable properties of being fair, and with each cpu trying
- * to acquire the lock spinning on a local variable.
- * It avoids expensive cache bouncings that common test-and-set spin-lock
- * implementations incur.
- */
-#ifndef __LINUX_MCS_SPINLOCK_H
-#define __LINUX_MCS_SPINLOCK_H
-
-#include <asm/mcs_spinlock.h>
-
-struct mcs_spinlock {
-	struct mcs_spinlock *next;
-	int locked; /* 1 if lock acquired */
-};
-
-#ifndef arch_mcs_spin_lock_contended
-/*
- * Using smp_load_acquire() provides a memory barrier that ensures
- * subsequent operations happen after the lock is acquired.
- */
-#define arch_mcs_spin_lock_contended(l)					\
-do {									\
-	while (!(smp_load_acquire(l)))					\
-		arch_mutex_cpu_relax();					\
-} while (0)
-#endif
-
-#ifndef arch_mcs_spin_unlock_contended
-/*
- * smp_store_release() provides a memory barrier to ensure all
- * operations in the critical section has been completed before
- * unlocking.
- */
-#define arch_mcs_spin_unlock_contended(l)				\
-	smp_store_release((l), 1)
-#endif
-
-/*
- * Note: the smp_load_acquire/smp_store_release pair is not
- * sufficient to form a full memory barrier across
- * cpus for many architectures (except x86) for mcs_unlock and mcs_lock.
- * For applications that need a full barrier across multiple cpus
- * with mcs_unlock and mcs_lock pair, smp_mb__after_unlock_lock() should be
- * used after mcs_lock.
- */
-
-/*
- * In order to acquire the lock, the caller should declare a local node and
- * pass a reference of the node to this function in addition to the lock.
- * If the lock has already been acquired, then this will proceed to spin
- * on this node->locked until the previous lock holder sets the node->locked
- * in mcs_spin_unlock().
- *
- * We don't inline mcs_spin_lock() so that perf can correctly account for the
- * time spent in this lock function.
- */
-static inline
-void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
-{
-	struct mcs_spinlock *prev;
-
-	/* Init node */
-	node->locked = 0;
-	node->next   = NULL;
-
-	prev = xchg(lock, node);
-	if (likely(prev == NULL)) {
-		/*
-		 * Lock acquired, don't need to set node->locked to 1. Threads
-		 * only spin on its own node->locked value for lock acquisition.
-		 * However, since this thread can immediately acquire the lock
-		 * and does not proceed to spin on its own node->locked, this
-		 * value won't be used. If a debug mode is needed to
-		 * audit lock status, then set node->locked value here.
-		 */
-		return;
-	}
-	ACCESS_ONCE(prev->next) = node;
-
-	/* Wait until the lock holder passes the lock down. */
-	arch_mcs_spin_lock_contended(&node->locked);
-}
-
-/*
- * Releases the lock. The caller should pass in the corresponding node that
- * was used to acquire the lock.
- */
-static inline
-void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
-{
-	struct mcs_spinlock *next = ACCESS_ONCE(node->next);
-
-	if (likely(!next)) {
-		/*
-		 * Release the lock by setting it to NULL
-		 */
-		if (likely(cmpxchg(lock, node, NULL) == node))
-			return;
-		/* Wait until the next pointer is set */
-		while (!(next = ACCESS_ONCE(node->next)))
-			arch_mutex_cpu_relax();
-	}
-
-	/* Pass lock to next waiter. */
-	arch_mcs_spin_unlock_contended(&next->locked);
-}
-
-#endif /* __LINUX_MCS_SPINLOCK_H */
--- /dev/null
+++ b/kernel/locking/mcs_spinlock.h
@@ -0,0 +1,114 @@
+/*
+ * MCS lock defines
+ *
+ * This file contains the main data structure and API definitions of MCS lock.
+ *
+ * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
+ * with the desirable properties of being fair, and with each cpu trying
+ * to acquire the lock spinning on a local variable.
+ * It avoids expensive cache bouncings that common test-and-set spin-lock
+ * implementations incur.
+ */
+#ifndef __LINUX_MCS_SPINLOCK_H
+#define __LINUX_MCS_SPINLOCK_H
+
+#include <asm/mcs_spinlock.h>
+
+struct mcs_spinlock {
+	struct mcs_spinlock *next;
+	int locked; /* 1 if lock acquired */
+};
+
+#ifndef arch_mcs_spin_lock_contended
+/*
+ * Using smp_load_acquire() provides a memory barrier that ensures
+ * subsequent operations happen after the lock is acquired.
+ */
+#define arch_mcs_spin_lock_contended(l)					\
+do {									\
+	while (!(smp_load_acquire(l)))					\
+		arch_mutex_cpu_relax();					\
+} while (0)
+#endif
+
+#ifndef arch_mcs_spin_unlock_contended
+/*
+ * smp_store_release() provides a memory barrier to ensure all
+ * operations in the critical section has been completed before
+ * unlocking.
+ */
+#define arch_mcs_spin_unlock_contended(l)				\
+	smp_store_release((l), 1)
+#endif
+
+/*
+ * Note: the smp_load_acquire/smp_store_release pair is not
+ * sufficient to form a full memory barrier across
+ * cpus for many architectures (except x86) for mcs_unlock and mcs_lock.
+ * For applications that need a full barrier across multiple cpus
+ * with mcs_unlock and mcs_lock pair, smp_mb__after_unlock_lock() should be
+ * used after mcs_lock.
+ */
+
+/*
+ * In order to acquire the lock, the caller should declare a local node and
+ * pass a reference of the node to this function in addition to the lock.
+ * If the lock has already been acquired, then this will proceed to spin
+ * on this node->locked until the previous lock holder sets the node->locked
+ * in mcs_spin_unlock().
+ *
+ * We don't inline mcs_spin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+static inline
+void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *prev;
+
+	/* Init node */
+	node->locked = 0;
+	node->next   = NULL;
+
+	prev = xchg(lock, node);
+	if (likely(prev == NULL)) {
+		/*
+		 * Lock acquired, don't need to set node->locked to 1. Threads
+		 * only spin on its own node->locked value for lock acquisition.
+		 * However, since this thread can immediately acquire the lock
+		 * and does not proceed to spin on its own node->locked, this
+		 * value won't be used. If a debug mode is needed to
+		 * audit lock status, then set node->locked value here.
+		 */
+		return;
+	}
+	ACCESS_ONCE(prev->next) = node;
+
+	/* Wait until the lock holder passes the lock down. */
+	arch_mcs_spin_lock_contended(&node->locked);
+}
+
+/*
+ * Releases the lock. The caller should pass in the corresponding node that
+ * was used to acquire the lock.
+ */
+static inline
+void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *next = ACCESS_ONCE(node->next);
+
+	if (likely(!next)) {
+		/*
+		 * Release the lock by setting it to NULL
+		 */
+		if (likely(cmpxchg(lock, node, NULL) == node))
+			return;
+		/* Wait until the next pointer is set */
+		while (!(next = ACCESS_ONCE(node->next)))
+			arch_mutex_cpu_relax();
+	}
+
+	/* Pass lock to next waiter. */
+	arch_mcs_spin_unlock_contended(&next->locked);
+}
+
+#endif /* __LINUX_MCS_SPINLOCK_H */
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -25,7 +25,7 @@
 #include <linux/spinlock.h>
 #include <linux/interrupt.h>
 #include <linux/debug_locks.h>
-#include <linux/mcs_spinlock.h>
+#include "mcs_spinlock.h"
 
 /*
  * In the DEBUG case we are using the "NULL fastpath" for mutexes,

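For reference, the API that moves with this header is used as in the
following minimal sketch; the caller and the lock variable here are
illustrative, not part of the patch:

#include "mcs_spinlock.h"

static struct mcs_spinlock *example_lock;	/* tail pointer, NULL when free */

static void example_critical_section(void)
{
	struct mcs_spinlock node;	/* one queue node per contender, on the stack */

	/*
	 * Queue up and spin on our own node->locked until the previous
	 * holder hands the lock over; real callers (e.g. the mutex spin
	 * path) run this with preemption disabled.
	 */
	mcs_spin_lock(&example_lock, &node);

	/* ... critical section ... */

	/* Pass the lock to the next queued CPU, if any. */
	mcs_spin_unlock(&example_lock, &node);
}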



* [PATCH 2/8] mutex: In mutex_can_spin_on_owner(), return false if task need_resched()
  2014-02-10 19:58 [PATCH 0/8] locking/core patches Peter Zijlstra
  2014-02-10 19:58 ` [PATCH 1/8] locking: Move mcs_spinlock.h into kernel/locking/ Peter Zijlstra
@ 2014-02-10 19:58 ` Peter Zijlstra
  2014-02-10 21:02   ` Peter Zijlstra
  2014-02-10 19:58 ` [PATCH 3/8] mutex: Modify the way optimistic spinners are queued Peter Zijlstra
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-10 19:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Low, Waiman Long, Peter Zijlstra, mingo, paulmck, torvalds,
	tglx, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

[-- Attachment #1: jason_low-mutex-in_mutex_can_spin_on_owner_return_false_if_task_need_resched.patch --]
[-- Type: text/plain, Size: 999 bytes --]

The mutex_can_spin_on_owner() function should also return false if the
task needs to be rescheduled, so that we avoid entering the MCS queue
at all when the task would have to reschedule anyway.

Cc: chegu_vinod@hp.com
Cc: paulmck@linux.vnet.ibm.com
Cc: Waiman.Long@hp.com
Cc: torvalds@linux-foundation.org
Cc: tglx@linutronix.de
Cc: riel@redhat.com
Cc: akpm@linux-foundation.org
Cc: davidlohr@hp.com
Cc: hpa@zytor.com
Cc: andi@firstfloor.org
Cc: aswin@hp.com
Cc: mingo@kernel.org
Cc: scott.norton@hp.com
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1390936396-3962-2-git-send-email-jason.low2@hp.com
---
 kernel/locking/mutex.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -166,6 +166,9 @@ static inline int mutex_can_spin_on_owne
 	struct task_struct *owner;
 	int retval = 1;
 
+	if (need_resched())
+		return 0;
+
 	rcu_read_lock();
 	owner = ACCESS_ONCE(lock->owner);
 	if (owner)




* [PATCH 3/8] mutex: Modify the way optimistic spinners are queued
  2014-02-10 19:58 [PATCH 0/8] locking/core patches Peter Zijlstra
  2014-02-10 19:58 ` [PATCH 1/8] locking: Move mcs_spinlock.h into kernel/locking/ Peter Zijlstra
  2014-02-10 19:58 ` [PATCH 2/8] mutex: In mutex_can_spin_on_owner(), return false if task need_resched() Peter Zijlstra
@ 2014-02-10 19:58 ` Peter Zijlstra
  2014-02-11  1:33   ` Jason Low
  2014-02-10 19:58 ` [PATCH 4/8] mutex: Unlock the mutex without the wait_lock Peter Zijlstra
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-10 19:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Low, Waiman Long, Peter Zijlstra, mingo, paulmck, torvalds,
	tglx, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

[-- Attachment #1: jason_low-mutex-modify_the_way_optimistic_spinners_are_queued.patch --]
[-- Type: text/plain, Size: 3199 bytes --]

The mutex->spin_mlock was introduced in order to ensure that only one thread
spins for lock acquisition at a time, reducing cache line contention. When
lock->owner is NULL and lock->count is still not 1, the spinner(s) will
continually release and re-acquire the lock->spin_mlock. This can generate
quite a bit of overhead/contention, and can also delay the spinner from
getting the lock.

This patch modifies the way optimistic spinners are queued: queue once before
entering the optimistic spinning loop, as opposed to acquiring the MCS lock
before every call to mutex_spin_on_owner(). In situations where the spinner
requires a few extra spins before obtaining the lock, there will then be only
one spinner trying to get the lock, and we avoid the overhead of unnecessarily
unlocking and locking the spin_mlock.

Cc: tglx@linutronix.de
Cc: riel@redhat.com
Cc: akpm@linux-foundation.org
Cc: davidlohr@hp.com
Cc: hpa@zytor.com
Cc: andi@firstfloor.org
Cc: aswin@hp.com
Cc: mingo@kernel.org
Cc: scott.norton@hp.com
Cc: chegu_vinod@hp.com
Cc: Waiman.Long@hp.com
Cc: paulmck@linux.vnet.ibm.com
Cc: torvalds@linux-foundation.org
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1390936396-3962-3-git-send-email-jason.low2@hp.com
---
 kernel/locking/mutex.c |   17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -403,9 +403,9 @@ __mutex_lock_common(struct mutex *lock,
 	if (!mutex_can_spin_on_owner(lock))
 		goto slowpath;
 
+	mcs_spin_lock(&lock->mcs_lock);
 	for (;;) {
 		struct task_struct *owner;
-		struct mcs_spinlock  node;
 
 		if (use_ww_ctx && ww_ctx->acquired > 0) {
 			struct ww_mutex *ww;
@@ -420,19 +420,16 @@ __mutex_lock_common(struct mutex *lock,
 			 * performed the optimistic spinning cannot be done.
 			 */
 			if (ACCESS_ONCE(ww->ctx))
-				goto slowpath;
+				break;
 		}
 
 		/*
 		 * If there's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */
-		mcs_spin_lock(&lock->mcs_lock, &node);
 		owner = ACCESS_ONCE(lock->owner);
-		if (owner && !mutex_spin_on_owner(lock, owner)) {
-			mcs_spin_unlock(&lock->mcs_lock, &node);
-			goto slowpath;
-		}
+		if (owner && !mutex_spin_on_owner(lock, owner))
+			break;
 
 		if ((atomic_read(&lock->count) == 1) &&
 		    (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
@@ -445,11 +442,10 @@ __mutex_lock_common(struct mutex *lock,
 			}
 
 			mutex_set_owner(lock);
-			mcs_spin_unlock(&lock->mcs_lock, &node);
+			mcs_spin_unlock(&lock->mcs_lock);
 			preempt_enable();
 			return 0;
 		}
-		mcs_spin_unlock(&lock->mcs_lock, &node);
 
 		/*
 		 * When there's no owner, we might have preempted between the
@@ -458,7 +454,7 @@ __mutex_lock_common(struct mutex *lock,
 		 * the owner complete.
 		 */
 		if (!owner && (need_resched() || rt_task(task)))
-			goto slowpath;
+			break;
 
 		/*
 		 * The cpu_relax() call is a compiler barrier which forces
@@ -468,6 +464,7 @@ __mutex_lock_common(struct mutex *lock,
 		 */
 		arch_mutex_cpu_relax();
 	}
+	mcs_spin_unlock(&lock->mcs_lock);
 slowpath:
 #endif
 	spin_lock_mutex(&lock->wait_lock, flags);

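Putting the hunks together, the optimistic-spin path after this patch is
structured roughly as follows (a simplified sketch, ww-mutex handling left
out; the dropped MCS node argument is spotted later in this thread and the
call becomes osq_lock()/osq_unlock() in patch 5):

	if (!mutex_can_spin_on_owner(lock))
		goto slowpath;

	mcs_spin_lock(&lock->mcs_lock);		/* queue once, up front */
	for (;;) {
		struct task_struct *owner;

		owner = ACCESS_ONCE(lock->owner);
		if (owner && !mutex_spin_on_owner(lock, owner))
			break;			/* stop spinning */

		if ((atomic_read(&lock->count) == 1) &&
		    (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
			mutex_set_owner(lock);
			mcs_spin_unlock(&lock->mcs_lock);
			preempt_enable();
			return 0;		/* acquired while spinning */
		}

		if (!owner && (need_resched() || rt_task(task)))
			break;

		arch_mutex_cpu_relax();
	}
	mcs_spin_unlock(&lock->mcs_lock);	/* single unqueue on the way out */
slowpath: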



* [PATCH 4/8] mutex: Unlock the mutex without the wait_lock
  2014-02-10 19:58 [PATCH 0/8] locking/core patches Peter Zijlstra
                   ` (2 preceding siblings ...)
  2014-02-10 19:58 ` [PATCH 3/8] mutex: Modify the way optimistic spinners are queued Peter Zijlstra
@ 2014-02-10 19:58 ` Peter Zijlstra
  2014-02-10 19:58 ` [PATCH 5/8] locking, mutex: Cancelable MCS lock for adaptive spinning Peter Zijlstra
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-10 19:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Low, Waiman Long, Peter Zijlstra, mingo, paulmck, torvalds,
	tglx, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

[-- Attachment #1: jason_low-mutex-unlock_the_mutex_without_the_wait_lock.patch --]
[-- Type: text/plain, Size: 1980 bytes --]

When running workloads that have high mutex contention on an 8-socket
machine, mutex spinners would often spin for a long time with no lock owner.

The main reason this occurs is in __mutex_unlock_common_slowpath(): when
__mutex_slowpath_needs_to_unlock(), the owner needs to acquire the
mutex->wait_lock before releasing the mutex (setting lock->count to 1). When
the wait_lock is contended, this delays the mutex from being released.
We should be able to release the mutex without holding the wait_lock.

Cc: chegu_vinod@hp.com
Cc: paulmck@linux.vnet.ibm.com
Cc: Waiman.Long@hp.com
Cc: torvalds@linux-foundation.org
Cc: tglx@linutronix.de
Cc: riel@redhat.com
Cc: akpm@linux-foundation.org
Cc: davidlohr@hp.com
Cc: hpa@zytor.com
Cc: andi@firstfloor.org
Cc: aswin@hp.com
Cc: mingo@kernel.org
Cc: scott.norton@hp.com
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1390936396-3962-4-git-send-email-jason.low2@hp.com
---
 kernel/locking/mutex.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -673,10 +673,6 @@ __mutex_unlock_common_slowpath(atomic_t
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 	unsigned long flags;
 
-	spin_lock_mutex(&lock->wait_lock, flags);
-	mutex_release(&lock->dep_map, nested, _RET_IP_);
-	debug_mutex_unlock(lock);
-
 	/*
 	 * some architectures leave the lock unlocked in the fastpath failure
 	 * case, others need to leave it locked. In the later case we have to
@@ -685,6 +681,10 @@ __mutex_unlock_common_slowpath(atomic_t
 	if (__mutex_slowpath_needs_to_unlock())
 		atomic_set(&lock->count, 1);
 
+	spin_lock_mutex(&lock->wait_lock, flags);
+	mutex_release(&lock->dep_map, nested, _RET_IP_);
+	debug_mutex_unlock(lock);
+
 	if (!list_empty(&lock->wait_list)) {
 		/* get the first entry from the wait-list: */
 		struct mutex_waiter *waiter =




* [PATCH 5/8] locking, mutex: Cancelable MCS lock for adaptive spinning
  2014-02-10 19:58 [PATCH 0/8] locking/core patches Peter Zijlstra
                   ` (3 preceding siblings ...)
  2014-02-10 19:58 ` [PATCH 4/8] mutex: Unlock the mutex without the wait_lock Peter Zijlstra
@ 2014-02-10 19:58 ` Peter Zijlstra
  2014-02-10 21:15   ` Jason Low
  2014-02-25 19:56   ` Jason Low
  2014-02-10 19:58 ` [PATCH 6/8] mutex: Extra reschedule point Peter Zijlstra
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-10 19:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Low, Waiman Long, Peter Zijlstra, mingo, paulmck, torvalds,
	tglx, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

[-- Attachment #1: peter_zijlstra-_patch_v2_5_5__mutex-give_spinners_a_chance_to.patch --]
[-- Type: text/plain, Size: 7669 bytes --]

Since we want a task waiting for a mutex_lock() to go to sleep and
reschedule on need_resched(), we must be able to abort the
mcs_spin_lock() around the adaptive spin.

Therefore, implement a cancelable MCS lock.

Cc: riel@redhat.com
Cc: akpm@linux-foundation.org
Cc: davidlohr@hp.com
Cc: hpa@zytor.com
Cc: andi@firstfloor.org
Cc: aswin@hp.com
Cc: scott.norton@hp.com
Cc: Jason Low <jason.low2@hp.com>
Cc: chegu_vinod@hp.com
Cc: mingo@kernel.org
Cc: paulmck@linux.vnet.ibm.com
Cc: Waiman.Long@hp.com
Cc: torvalds@linux-foundation.org
Cc: tglx@linutronix.de
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/mutex.h         |    4 -
 kernel/locking/Makefile       |    2 
 kernel/locking/mcs_spinlock.c |  167 ++++++++++++++++++++++++++++++++++++++++++
 kernel/locking/mcs_spinlock.h |   15 +++
 kernel/locking/mutex.c        |   10 +-
 5 files changed, 191 insertions(+), 7 deletions(-)

--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -46,7 +46,7 @@
  * - detects multi-task circular deadlocks and prints out all affected
  *   locks and tasks (and only those tasks)
  */
-struct mcs_spinlock;
+struct optimistic_spin_queue;
 struct mutex {
 	/* 1: unlocked, 0: locked, negative: locked, possible waiters */
 	atomic_t		count;
@@ -56,7 +56,7 @@ struct mutex {
 	struct task_struct	*owner;
 #endif
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	struct mcs_spinlock	*mcs_lock;	/* Spinner MCS lock */
+	struct optimistic_spin_queue	*osq;	/* Spinner MCS lock */
 #endif
 #ifdef CONFIG_DEBUG_MUTEXES
 	const char 		*name;
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -1,5 +1,5 @@
 
-obj-y += mutex.o semaphore.o rwsem.o lglock.o
+obj-y += mutex.o semaphore.o rwsem.o lglock.o mcs_spinlock.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = -pg
--- /dev/null
+++ b/kernel/locking/mcs_spinlock.c
@@ -0,0 +1,167 @@
+
+#include <linux/percpu.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include "mcs_spinlock.h"
+
+#ifdef CONFIG_SMP
+
+/*
+ * An MCS like lock especially tailored for optimistic spinning for sleeping
+ * lock implementations (mutex, rwsem, etc).
+ *
+ * Using a single mcs node per CPU is safe because sleeping locks should not be
+ * called from interrupt context and we have preemption disabled while
+ * spinning.
+ */
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_queue, osq_node);
+
+/*
+ * Get a stable @node->next pointer, either for unlock() or unqueue() purposes.
+ * Can return NULL in case we were the last queued and we updated @lock instead.
+ */
+static inline struct optimistic_spin_queue *
+osq_wait_next(struct optimistic_spin_queue **lock,
+	      struct optimistic_spin_queue *node,
+	      struct optimistic_spin_queue *prev)
+{
+	struct optimistic_spin_queue *next = NULL;
+
+	for (;;) {
+		if (*lock == node && cmpxchg(lock, node, prev) == node) {
+			/*
+			 * We were the last queued, we moved @lock back. @prev
+			 * will now observe @lock and will complete its
+			 * unlock()/unqueue().
+			 */
+			break;
+		}
+
+		/*
+		 * We must xchg() the @node->next value, because if we were to
+		 * leave it in, a concurrent unlock()/unqueue() from
+		 * @node->next might complete Step-A and think its @prev is
+		 * still valid.
+		 *
+		 * If the concurrent unlock()/unqueue() wins the race, we'll
+		 * wait for either @lock to point to us, through its Step-B, or
+		 * wait for a new @node->next from its Step-C.
+		 */
+		if (node->next) {
+			next = xchg(&node->next, NULL);
+			if (next)
+				break;
+		}
+
+		arch_mutex_cpu_relax();
+	}
+
+	return next;
+}
+
+bool osq_lock(struct optimistic_spin_queue **lock)
+{
+	struct optimistic_spin_queue *node = this_cpu_ptr(&osq_node);
+	struct optimistic_spin_queue *prev, *next;
+
+	node->locked = 0;
+	node->next = NULL;
+
+	node->prev = prev = xchg(lock, node);
+	if (likely(prev == NULL))
+		return true;
+
+	ACCESS_ONCE(prev->next) = node;
+
+	/*
+	 * Normally @prev is untouchable after the above store; because at that
+	 * moment unlock can proceed and wipe the node element from stack.
+	 *
+	 * However, since our nodes are static per-cpu storage, we're
+	 * guaranteed their existence -- this allows us to apply
+	 * cmpxchg in an attempt to undo our queueing.
+	 */
+
+	while (!smp_load_acquire(&node->locked)) {
+		/*
+		 * If we need to reschedule bail... so we can block.
+		 */
+		if (need_resched())
+			goto unqueue;
+
+		arch_mutex_cpu_relax();
+	}
+	return true;
+
+unqueue:
+	/*
+	 * Step - A  -- stabilize @prev
+	 *
+	 * Undo our @prev->next assignment; this will make @prev's
+	 * unlock()/unqueue() wait for a next pointer since @lock points to us
+	 * (or later).
+	 */
+
+	for (;;) {
+		if (prev->next == node &&
+		    cmpxchg(&prev->next, node, NULL) == node)
+			break;
+
+		/*
+		 * We can only fail the cmpxchg() racing against an unlock(),
+		 * in which case we should observe @node->locked becomming
+		 * true.
+		 */
+		if (smp_load_acquire(&node->locked))
+			return true;
+
+		/*
+		 * Or we race against a concurrent unqueue()'s step-B, in which
+		 * case its step-C will write us a new @node->prev pointer.
+		 */
+		prev = ACCESS_ONCE(node->prev);
+	}
+
+	/*
+	 * Step - B -- stabilize @next
+	 *
+	 * Similar to unlock(), wait for @node->next or move @lock from @node
+	 * back to @prev.
+	 */
+
+	next = osq_wait_next(lock, node, prev);
+	if (!next)
+		return false;
+
+	/*
+	 * Step - C -- unlink
+	 *
+	 * @prev is stable because its still waiting for a new @prev->next
+	 * pointer, @next is stable because our @node->next pointer is NULL and
+	 * it will wait in Step-A.
+	 */
+
+	ACCESS_ONCE(next->prev) = prev;
+	ACCESS_ONCE(prev->next) = next;
+
+	return false;
+}
+
+void osq_unlock(struct optimistic_spin_queue **lock)
+{
+	struct optimistic_spin_queue *node = this_cpu_ptr(&osq_node);
+	struct optimistic_spin_queue *next;
+
+	/*
+	 * Fast path for the uncontended case.
+	 */
+	if (likely(cmpxchg(lock, node, NULL) == node))
+		return;
+
+	next = osq_wait_next(lock, node, NULL);
+	if (next)
+		ACCESS_ONCE(next->locked) = 1;
+}
+
+#endif
+
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -111,4 +111,19 @@ void mcs_spin_unlock(struct mcs_spinlock
 	arch_mcs_spin_unlock_contended(&next->locked);
 }
 
+/*
+ * Cancellable version of the MCS lock above.
+ *
+ * Intended for adaptive spinning of sleeping locks:
+ * mutex_lock()/rwsem_down_{read,write}() etc.
+ */
+
+struct optimistic_spin_queue {
+	struct optimistic_spin_queue *next, *prev;
+	int locked; /* 1 if lock acquired */
+};
+
+extern bool osq_lock(struct optimistic_spin_queue **lock);
+extern void osq_unlock(struct optimistic_spin_queue **lock);
+
 #endif /* __LINUX_MCS_SPINLOCK_H */
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -53,7 +53,7 @@ __mutex_init(struct mutex *lock, const c
 	INIT_LIST_HEAD(&lock->wait_list);
 	mutex_clear_owner(lock);
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	lock->mcs_lock = NULL;
+	lock->osq = NULL;
 #endif
 
 	debug_mutex_init(lock, name, key);
@@ -403,7 +403,9 @@ __mutex_lock_common(struct mutex *lock,
 	if (!mutex_can_spin_on_owner(lock))
 		goto slowpath;
 
-	mcs_spin_lock(&lock->mcs_lock);
+	if (!osq_lock(&lock->osq))
+		goto slowpath;
+
 	for (;;) {
 		struct task_struct *owner;
 
@@ -442,7 +444,7 @@ __mutex_lock_common(struct mutex *lock,
 			}
 
 			mutex_set_owner(lock);
-			mcs_spin_unlock(&lock->mcs_lock);
+			osq_unlock(&lock->osq);
 			preempt_enable();
 			return 0;
 		}
@@ -464,7 +466,7 @@ __mutex_lock_common(struct mutex *lock,
 		 */
 		arch_mutex_cpu_relax();
 	}
-	mcs_spin_unlock(&lock->mcs_lock);
+	osq_unlock(&lock->osq);
 slowpath:
 #endif
 	spin_lock_mutex(&lock->wait_lock, flags);


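The intended usage, mirroring the mutex.c hunks above, looks roughly like
the sketch below; try_acquire() and keep_spinning() are placeholders for
the lock-specific logic, not real kernel functions:

	if (!osq_lock(&lock->osq))	/* false: need_resched() hit while queued */
		goto slowpath;

	while (keep_spinning(lock)) {
		if (try_acquire(lock)) {
			osq_unlock(&lock->osq);
			preempt_enable();
			return 0;	/* acquired while spinning */
		}
		arch_mutex_cpu_relax();
	}
	osq_unlock(&lock->osq);
slowpath:
	/* fall back to the blocking slow path; osq_lock() already unqueued us
	 * if it returned false */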


* [PATCH 6/8] mutex: Extra reschedule point
  2014-02-10 19:58 [PATCH 0/8] locking/core patches Peter Zijlstra
                   ` (4 preceding siblings ...)
  2014-02-10 19:58 ` [PATCH 5/8] locking, mutex: Cancelable MCS lock for adaptive spinning Peter Zijlstra
@ 2014-02-10 19:58 ` Peter Zijlstra
  2014-02-10 22:59   ` Andrew Morton
  2014-02-10 19:58 ` [PATCH 7/8] locking: Introduce qrwlock Peter Zijlstra
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-10 19:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Low, Waiman Long, Peter Zijlstra, mingo, paulmck, torvalds,
	tglx, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

[-- Attachment #1: peterz-mutex-need_resched.patch --]
[-- Type: text/plain, Size: 716 bytes --]

Add an extra reschedule point in an attempt to avoid getting rescheduled
the moment we've acquired the lock.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/locking/mutex.c |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -468,6 +468,13 @@ __mutex_lock_common(struct mutex *lock,
 	}
 	osq_unlock(&lock->osq);
 slowpath:
+	/*
+	 * If we fell out of the spin path because of need_resched(),
+	 * reschedule now, before we try-lock the mutex. This avoids getting
+	 * scheduled out right after we obtained the mutex.
+	 */
+	if (unlikely(need_resched()))
+		schedule_preempt_disabled();
 #endif
 	spin_lock_mutex(&lock->wait_lock, flags);
 




* [PATCH 7/8] locking: Introduce qrwlock
  2014-02-10 19:58 [PATCH 0/8] locking/core patches Peter Zijlstra
                   ` (5 preceding siblings ...)
  2014-02-10 19:58 ` [PATCH 6/8] mutex: Extra reschedule point Peter Zijlstra
@ 2014-02-10 19:58 ` Peter Zijlstra
  2014-02-11 18:17   ` Waiman Long
  2014-02-10 19:58 ` [PATCH 8/8] x86,locking: Enable qrwlock Peter Zijlstra
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-10 19:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Low, Waiman Long, Peter Zijlstra, mingo, paulmck, torvalds,
	tglx, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod, Waiman Long

[-- Attachment #1: qrwlock1.patch --]
[-- Type: text/plain, Size: 10798 bytes --]

This rwlock uses the arch_spinlock_t as a waitqueue, and, assuming the
arch_spinlock_t is a fair lock (ticket, MCS, etc.), the resulting
rwlock is a fair lock.

It fits in the same 8 bytes as the regular rwlock_t by folding the
reader and writer count into a single integer, using the remaining 4
bytes for the arch_spinlock_t.

Architectures that can do single-copy atomic byte stores can optimize
queue_write_unlock() with a 0 write to the LSB (the write count).

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
[peterz: near complete rewrite]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/asm-generic/qrwlock.h       |  174 ++++++++++++++++++++++++++++++++++++
 include/asm-generic/qrwlock_types.h |   16 +++
 kernel/Kconfig.locks                |    7 +
 kernel/locking/Makefile             |    1 
 kernel/locking/qrwlock.c            |  133 +++++++++++++++++++++++++++
 5 files changed, 331 insertions(+)

--- /dev/null
+++ b/include/asm-generic/qrwlock.h
@@ -0,0 +1,174 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QRWLOCK_H
+#define __ASM_GENERIC_QRWLOCK_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
+#include <asm-generic/qrwlock_types.h>
+
+/*
+ * Writer states & reader shift and bias
+ */
+#define	_QW_WAITING	1		/* A writer is waiting	   */
+#define	_QW_LOCKED	0xff		/* A writer holds the lock */
+#define	_QW_WMASK	0xff		/* Writer mask		   */
+#define	_QR_SHIFT	8		/* Reader count shift	   */
+#define _QR_BIAS	(1U << _QR_SHIFT)
+
+/*
+ * External function declarations
+ */
+extern void queue_read_lock_slowpath(struct qrwlock *lock);
+extern void queue_write_lock_slowpath(struct qrwlock *lock);
+
+/**
+ * queue_read_can_lock- would read_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline int queue_read_can_lock(struct qrwlock *lock)
+{
+	return !(atomic_read(&lock->cnts) & _QW_WMASK);
+}
+
+/**
+ * queue_write_can_lock- would write_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline int queue_write_can_lock(struct qrwlock *lock)
+{
+	return !atomic_read(&lock->cnts);
+}
+
+/**
+ * queue_read_trylock - try to acquire read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int queue_read_trylock(struct qrwlock *lock)
+{
+	u32 cnts;
+
+	cnts = atomic_read(&lock->cnts);
+	if (likely(!(cnts & _QW_WMASK))) {
+		cnts = (u32)atomic_add_return(_QR_BIAS, &lock->cnts);
+		if (likely(!(cnts & _QW_WMASK)))
+			return 1;
+		atomic_sub(_QR_BIAS, &lock->cnts);
+	}
+	return 0;
+}
+
+/**
+ * queue_write_trylock - try to acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int queue_write_trylock(struct qrwlock *lock)
+{
+	u32 cnts;
+
+	cnts = atomic_read(&lock->cnts);
+	if (unlikely(cnts))
+		return 0;
+
+	return likely(atomic_cmpxchg(&lock->cnts,
+				     cnts, cnts | _QW_LOCKED) == cnts);
+}
+/**
+ * queue_read_lock - acquire read lock of a queue rwlock
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline void queue_read_lock(struct qrwlock *lock)
+{
+	u32 cnts;
+
+	cnts = atomic_add_return(_QR_BIAS, &lock->cnts);
+	if (likely(!(cnts & _QW_WMASK)))
+		return;
+
+	/* The slowpath will decrement the reader count, if necessary. */
+	queue_read_lock_slowpath(lock);
+}
+
+/**
+ * queue_write_lock - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_write_lock(struct qrwlock *lock)
+{
+	/* Optimize for the unfair lock case where the fair flag is 0. */
+	if (atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0)
+		return;
+
+	queue_write_lock_slowpath(lock);
+}
+
+/**
+ * queue_read_unlock - release read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_read_unlock(struct qrwlock *lock)
+{
+	/*
+	 * Atomically decrement the reader count
+	 */
+	smp_mb__before_atomic_dec();
+	atomic_sub(_QR_BIAS, &lock->cnts);
+}
+
+#ifndef queue_write_unlock
+/**
+ * queue_write_unlock - release write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_write_unlock(struct qrwlock *lock)
+{
+	/*
+	 * If the writer field is atomic, it can be cleared directly.
+	 * Otherwise, an atomic subtraction will be used to clear it.
+	 */
+	smp_mb__before_atomic_dec();
+	atomic_sub(_QW_LOCKED, &lock->cnts);
+}
+#endif
+
+typedef struct qrwlock arch_rwlock_t;
+
+#define	__ARCH_RW_LOCK_UNLOCKED {		\
+	.cnts = ATOMIC_INIT(0),			\
+	.lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
+}
+
+/*
+ * Remapping rwlock architecture specific functions to the corresponding
+ * queue rwlock functions.
+ */
+#define arch_read_can_lock(l)	queue_read_can_lock(l)
+#define arch_write_can_lock(l)	queue_write_can_lock(l)
+#define arch_read_lock(l)	queue_read_lock(l)
+#define arch_write_lock(l)	queue_write_lock(l)
+#define arch_read_trylock(l)	queue_read_trylock(l)
+#define arch_write_trylock(l)	queue_write_trylock(l)
+#define arch_read_unlock(l)	queue_read_unlock(l)
+#define arch_write_unlock(l)	queue_write_unlock(l)
+
+#endif /* __ASM_GENERIC_QRWLOCK_H */
--- /dev/null
+++ b/include/asm-generic/qrwlock_types.h
@@ -0,0 +1,16 @@
+#ifndef __ASM_GENERIC_QRWLOCK_TYPES_H
+#define __ASM_GENERIC_QRWLOCK_TYPES_H
+
+#include <linux/atomic.h>
+#include <linux/spinlock_types.h>
+
+/*
+ * The queue read/write lock data structure
+ */
+
+struct qrwlock {
+	atomic_t		cnts;
+	arch_spinlock_t		lock;
+};
+
+#endif /* __ASM_GENERIC_QRWLOCK_TYPES_H */
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
 config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_RWLOCK
+	bool
+
+config QUEUE_RWLOCK
+	def_bool y if ARCH_USE_QUEUE_RWLOCK
+	depends on SMP
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock
 obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
 obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
 obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
+obj-$(CONFIG_QUEUE_RWLOCK) += qrwlock.o
--- /dev/null
+++ b/kernel/locking/qrwlock.c
@@ -0,0 +1,133 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <asm/qrwlock.h>
+
+/**
+ * rspin_until_writer_unlock - inc reader count & spin until writer is gone
+ * @lock  : Pointer to queue rwlock structure
+ * @writer: Current queue rwlock writer status byte
+ *
+ * In interrupt context or at the head of the queue, the reader will just
+ * increment the reader count & wait until the writer releases the lock.
+ */
+static __always_inline void
+rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
+{
+	while ((cnts & _QW_WMASK) == _QW_LOCKED) {
+		arch_mutex_cpu_relax();
+		cnts = smp_load_acquire((u32 *)&lock->cnts);
+	}
+}
+
+/**
+ * queue_read_lock_slowpath - acquire read lock of a queue rwlock
+ * @lock: Pointer to queue rwlock structure
+ */
+void queue_read_lock_slowpath(struct qrwlock *lock)
+{
+	u32 cnts;
+
+	/*
+	 * Readers come here when they cannot get the lock without waiting
+	 */
+	if (unlikely(in_interrupt())) {
+		/*
+		 * Readers in interrupt context will spin until the lock is
+		 * available without waiting in the queue.
+		 */
+		cnts = smp_load_acquire((u32 *)&lock->cnts);
+		rspin_until_writer_unlock(lock, cnts);
+		return;
+	}
+	atomic_sub(_QR_BIAS, &lock->cnts);
+
+	/*
+	 * Put the reader into the wait queue
+	 */
+	arch_spin_lock(&lock->lock);
+
+	/*
+	 * At the head of the wait queue now, wait until the writer state
+	 * goes to 0 and then try to increment the reader count and get
+	 * the lock. It is possible that an incoming writer may steal the
+	 * lock in the interim, so it is necessary to check the writer byte
+	 * to make sure that the write lock isn't taken.
+	 */
+	while (atomic_read(&lock->cnts) & _QW_WMASK)
+		arch_mutex_cpu_relax();
+
+	cnts = atomic_add_return(_QR_BIAS, &lock->cnts) - _QR_BIAS;
+	rspin_until_writer_unlock(lock, cnts);
+
+	/*
+	 * Signal the next one in queue to become queue head
+	 */
+	arch_spin_unlock(&lock->lock);
+}
+EXPORT_SYMBOL(queue_read_lock_slowpath);
+
+/**
+ * queue_write_lock_slowpath - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+void queue_write_lock_slowpath(struct qrwlock *lock)
+{
+	u32 cnts;
+
+	/* Put the writer into the wait queue */
+	arch_spin_lock(&lock->lock);
+
+	/* Try to acquire the lock directly if no reader is present */
+	if (!atomic_read(&lock->cnts) &&
+	    (atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0))
+		goto unlock;
+
+	/*
+	 * Set the waiting flag to notify readers that a writer is pending,
+	 * or wait for a previous writer to go away.
+	 */
+	for (;;) {
+		cnts = atomic_read(&lock->cnts);
+		if (!(cnts & _QW_WMASK) &&
+		    (atomic_cmpxchg(&lock->cnts, cnts,
+				    cnts | _QW_WAITING) == cnts))
+			break;
+
+		arch_mutex_cpu_relax();
+	}
+
+	/* When no more readers, set the locked flag */
+	for (;;) {
+		cnts = atomic_read(&lock->cnts);
+		if ((cnts == _QW_WAITING) &&
+		    (atomic_cmpxchg(&lock->cnts, _QW_WAITING,
+				    _QW_LOCKED) == _QW_WAITING))
+			break;
+
+		arch_mutex_cpu_relax();
+	}
+unlock:
+	arch_spin_unlock(&lock->lock);
+}
+EXPORT_SYMBOL(queue_write_lock_slowpath);


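For reference, the layout defined above packs the writer state into the low
byte of lock->cnts (_QW_WAITING / _QW_LOCKED) and the reader count into the
upper bits (in units of _QR_BIAS), with lock->lock as the fair waitqueue.
A minimal usage sketch of the generic API; the lock variable here is
illustrative only:

	static struct qrwlock example_lock = __ARCH_RW_LOCK_UNLOCKED;

	static void example_reader(void)
	{
		queue_read_lock(&example_lock);
		/* read-side critical section; cnts now holds >= _QR_BIAS */
		queue_read_unlock(&example_lock);
	}

	static void example_writer(void)
	{
		queue_write_lock(&example_lock);
		/* write-side critical section; low byte of cnts is _QW_LOCKED */
		queue_write_unlock(&example_lock);
	}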


* [PATCH 8/8] x86,locking: Enable qrwlock
  2014-02-10 19:58 [PATCH 0/8] locking/core patches Peter Zijlstra
                   ` (6 preceding siblings ...)
  2014-02-10 19:58 ` [PATCH 7/8] locking: Introduce qrwlock Peter Zijlstra
@ 2014-02-10 19:58 ` Peter Zijlstra
  2014-02-10 23:02 ` [PATCH 0/8] locking/core patches Andrew Morton
  2014-02-26 21:40 ` Paul E. McKenney
  9 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-10 19:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Low, Waiman Long, Peter Zijlstra, mingo, paulmck, torvalds,
	tglx, riel, akpm, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod, Waiman Long

[-- Attachment #1: qrwlock2.patch --]
[-- Type: text/plain, Size: 2266 bytes --]

Make x86 use the fair qrwlock implementation of rwlock_t.

Implement a custom queue_write_unlock() (a plain byte store of 0) for best
performance.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
[peterz: near complete rewrite]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/Kconfig                      |    1 +
 arch/x86/include/asm/qrwlock.h        |   18 ++++++++++++++++++
 arch/x86/include/asm/spinlock.h       |    2 ++
 arch/x86/include/asm/spinlock_types.h |    4 ++++
 4 files changed, 25 insertions(+)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -119,6 +119,7 @@ config X86
 	select MODULES_USE_ELF_RELA if X86_64
 	select CLONE_BACKWARDS if X86_32
 	select ARCH_USE_BUILTIN_BSWAP
+	select ARCH_USE_QUEUE_RWLOCK
 	select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
 	select OLD_SIGACTION if X86_32
 	select COMPAT_OLD_SIGACTION if IA32_EMULATION
--- /dev/null
+++ b/arch/x86/include/asm/qrwlock.h
@@ -0,0 +1,18 @@
+#ifndef _ASM_X86_QRWLOCK_H
+#define _ASM_X86_QRWLOCK_H
+
+#include <asm-generic/qrwlock_types.h>
+
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+#define queue_write_unlock queue_write_unlock
+static inline void queue_write_unlock(struct qrwlock *lock)
+{
+        barrier();
+        ACCESS_ONCE(*(u8 *)&lock->cnts) = 0;
+}
+#endif
+
+#include <asm-generic/qrwlock.h>
+
+#endif /* _ASM_X86_QRWLOCK_H */
+
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -188,6 +188,7 @@ static inline void arch_spin_unlock_wait
 		cpu_relax();
 }
 
+#ifndef CONFIG_QUEUE_RWLOCK
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
@@ -270,6 +271,7 @@ static inline void arch_write_unlock(arc
 	asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0"
 		     : "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory");
 }
+#endif /* CONFIG_QUEUE_RWLOCK */
 
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
 #define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -34,6 +34,10 @@ typedef struct arch_spinlock {
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
 
+#ifdef CONFIG_QUEUE_RWLOCK
+#include <asm/qrwlock.h>
+#else
 #include <asm/rwlock.h>
+#endif
 
 #endif /* _ASM_X86_SPINLOCK_TYPES_H */




* Re: [PATCH 2/8] mutex: In mutex_can_spin_on_owner(), return false if task need_resched()
  2014-02-10 19:58 ` [PATCH 2/8] mutex: In mutex_can_spin_on_owner(), return false if task need_resched() Peter Zijlstra
@ 2014-02-10 21:02   ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-10 21:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jason Low, Waiman Long, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Mon, Feb 10, 2014 at 08:58:22PM +0100, Peter Zijlstra wrote:

Bah, I forgot Quilt eats the From: headers and I forgot to re-add them.
They're still present in the queue, just lost in mailing :/

> The mutex_can_spin_on_owner() function should also return false if the
> task needs to be rescheduled to avoid entering the MCS queue when it
> needs to reschedule.
> 
> Cc: chegu_vinod@hp.com
> Cc: paulmck@linux.vnet.ibm.com
> Cc: Waiman.Long@hp.com
> Cc: torvalds@linux-foundation.org
> Cc: tglx@linutronix.de
> Cc: riel@redhat.com
> Cc: akpm@linux-foundation.org
> Cc: davidlohr@hp.com
> Cc: hpa@zytor.com
> Cc: andi@firstfloor.org
> Cc: aswin@hp.com
> Cc: mingo@kernel.org
> Cc: scott.norton@hp.com
> Signed-off-by: Jason Low <jason.low2@hp.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1390936396-3962-2-git-send-email-jason.low2@hp.com
> ---
>  kernel/locking/mutex.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -166,6 +166,9 @@ static inline int mutex_can_spin_on_owne
>  	struct task_struct *owner;
>  	int retval = 1;
>  
> +	if (need_resched())
> +		return 0;
> +
>  	rcu_read_lock();
>  	owner = ACCESS_ONCE(lock->owner);
>  	if (owner)
> 
> 


* Re: [PATCH 5/8] locking, mutex: Cancelable MCS lock for adaptive spinning
  2014-02-10 19:58 ` [PATCH 5/8] locking, mutex: Cancelable MCS lock for adaptive spinning Peter Zijlstra
@ 2014-02-10 21:15   ` Jason Low
  2014-02-10 21:32     ` Peter Zijlstra
  2014-02-25 19:56   ` Jason Low
  1 sibling, 1 reply; 35+ messages in thread
From: Jason Low @ 2014-02-10 21:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Waiman Long, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Mon, 2014-02-10 at 20:58 +0100, Peter Zijlstra wrote:
> +void osq_unlock(struct optimistic_spin_queue **lock)
> +{
> +	struct optimistic_spin_queue *node = this_cpu_ptr(&osq_node);
> +	struct optimistic_spin_queue *next;
> +
> +	/*
> +	 * Fast path for the uncontended case.
> +	 */
> +	if (likely(cmpxchg(lock, node, NULL) == node))
> +		return;

Can we also add the following code here, as I'm noticing next != NULL
is the much more likely scenario on my box:

        next = xchg(&node->next, NULL);
        if (next) {
                ACCESS_ONCE(next->locked) = 1;
                return;
        }


> +	next = osq_wait_next(lock, node, NULL);
> +	if (next)
> +		ACCESS_ONCE(next->locked) = 1;
> +}



* Re: [PATCH 5/8] locking, mutex: Cancelable MCS lock for adaptive spinning
  2014-02-10 21:15   ` Jason Low
@ 2014-02-10 21:32     ` Peter Zijlstra
  2014-02-10 22:04       ` Jason Low
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-10 21:32 UTC (permalink / raw)
  To: Jason Low
  Cc: linux-kernel, Waiman Long, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Mon, Feb 10, 2014 at 01:15:59PM -0800, Jason Low wrote:
> On Mon, 2014-02-10 at 20:58 +0100, Peter Zijlstra wrote:
> > +void osq_unlock(struct optimistic_spin_queue **lock)
> > +{
> > +	struct optimistic_spin_queue *node = this_cpu_ptr(&osq_node);
> > +	struct optimistic_spin_queue *next;
> > +
> > +	/*
> > +	 * Fast path for the uncontended case.
> > +	 */
> > +	if (likely(cmpxchg(lock, node, NULL) == node))
> > +		return;
> 
> Can we can also add the following code here as I'm noticing next != NULL
> is the much more likely scenario on my box:
> 
>         next = xchg(&node->next, NULL);
>         if (next) {
>                 ACCESS_ONCE(next->locked) = 1;
>                 return;

Is adding that really much faster than the relatively straight path
osq_wait_next() would walk to hit the same exit?

The only reason I pulled out the above cmpxchg() is because it's the
uncontended fast path, which seems like a special enough case.

> > +	next = osq_wait_next(lock, node, NULL);
> > +	if (next)
> > +		ACCESS_ONCE(next->locked) = 1;
> > +}
> 


* Re: [PATCH 5/8] locking, mutex: Cancelable MCS lock for adaptive spinning
  2014-02-10 21:32     ` Peter Zijlstra
@ 2014-02-10 22:04       ` Jason Low
  2014-02-11  9:18         ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Low @ 2014-02-10 22:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Waiman Long, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Mon, 2014-02-10 at 22:32 +0100, Peter Zijlstra wrote:
> On Mon, Feb 10, 2014 at 01:15:59PM -0800, Jason Low wrote:
> > On Mon, 2014-02-10 at 20:58 +0100, Peter Zijlstra wrote:
> > > +void osq_unlock(struct optimistic_spin_queue **lock)
> > > +{
> > > +	struct optimistic_spin_queue *node = this_cpu_ptr(&osq_node);
> > > +	struct optimistic_spin_queue *next;
> > > +
> > > +	/*
> > > +	 * Fast path for the uncontended case.
> > > +	 */
> > > +	if (likely(cmpxchg(lock, node, NULL) == node))
> > > +		return;
> > 
> > Can we can also add the following code here as I'm noticing next != NULL
> > is the much more likely scenario on my box:
> > 
> >         next = xchg(&node->next, NULL);
> >         if (next) {
> >                 ACCESS_ONCE(next->locked) = 1;
> >                 return;
> 
> Is adding that really much faster than the relatively straight path
> oqs_wait_next() would walk to bit the same exit?
> 
> The only reason I pulled out the above cmpxchg() is because its the
> uncontended fast path, which seems like a special enough case.

So it would avoid two extra checks, (*lock == node) and (node->next), in the
osq_wait_next() path, which aren't necessary when node->next != NULL.

And I think node->next != NULL can be considered a special enough case
after the cmpxchg() fails because in the contended case, we're expecting
the node->next to be pointing at something. The only times node->next is
NULL after cmpxchg() fails are during a very small race window with the
osq_lock(), and when the next node is unqueuing due to need_resched,
which is also a very small window.

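For clarity, osq_unlock() with the proposed extra fast path folded in would
read roughly as follows (a sketch of the suggestion under discussion, not
code merged in this thread):

void osq_unlock(struct optimistic_spin_queue **lock)
{
	struct optimistic_spin_queue *node = this_cpu_ptr(&osq_node);
	struct optimistic_spin_queue *next;

	/*
	 * Fast path for the uncontended case.
	 */
	if (likely(cmpxchg(lock, node, NULL) == node))
		return;

	/*
	 * Second fast path: a successor has already queued behind us and
	 * published itself in node->next, so just pass it the lock.
	 */
	next = xchg(&node->next, NULL);
	if (next) {
		ACCESS_ONCE(next->locked) = 1;
		return;
	}

	/*
	 * Otherwise wait for the racing osq_lock()/unqueue() to settle.
	 */
	next = osq_wait_next(lock, node, NULL);
	if (next)
		ACCESS_ONCE(next->locked) = 1;
}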


* Re: [PATCH 6/8] mutex: Extra reschedule point
  2014-02-10 19:58 ` [PATCH 6/8] mutex: Extra reschedule point Peter Zijlstra
@ 2014-02-10 22:59   ` Andrew Morton
  0 siblings, 0 replies; 35+ messages in thread
From: Andrew Morton @ 2014-02-10 22:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Jason Low, Waiman Long, mingo, paulmck, torvalds,
	tglx, riel, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

On Mon, 10 Feb 2014 20:58:26 +0100 Peter Zijlstra <peterz@infradead.org> wrote:

> Add in an extra reschedule in an attempt to avoid getting reschedule
> the moment we've acquired the lock.
> 
> ...
>
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -468,6 +468,13 @@ __mutex_lock_common(struct mutex *lock,
>  	}
>  	osq_unlock(&lock->osq);
>  slowpath:
> +	/*
> +	 * If we fell out of the spin path because of need_resched(),
> +	 * reschedule now, before we try-lock the mutex. This avoids getting
> +	 * scheduled out right after we obtained the mutex.
> +	 */
> +	if (unlikely(need_resched()))
> +		schedule_preempt_disabled();

need_resched() already does unlikely().


* Re: [PATCH 0/8] locking/core patches
  2014-02-10 19:58 [PATCH 0/8] locking/core patches Peter Zijlstra
                   ` (7 preceding siblings ...)
  2014-02-10 19:58 ` [PATCH 8/8] x86,locking: Enable qrwlock Peter Zijlstra
@ 2014-02-10 23:02 ` Andrew Morton
  2014-02-11  7:17   ` Peter Zijlstra
  2014-02-25 19:26   ` Jason Low
  2014-02-26 21:40 ` Paul E. McKenney
  9 siblings, 2 replies; 35+ messages in thread
From: Andrew Morton @ 2014-02-10 23:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Jason Low, Waiman Long, mingo, paulmck, torvalds,
	tglx, riel, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

On Mon, 10 Feb 2014 20:58:20 +0100 Peter Zijlstra <peterz@infradead.org> wrote:

> Hi all,
> 
> I would propose merging the following patches...
> 
> The first set is mostly from Jason and tweaks the mutex adaptive
> spinning, AIM7 throughput numbers:
> 
> PRE:  100   2000.04  21564.90 2721.29 311.99     3.12       0.01     0.00     99
> POST: 100   2000.04  42603.85 5142.80 311.99     3.12       0.00     0.00     99

What do these columns represent?  I'm guessing the large improvement
was in context switches?

> The second set is the qrwlock, although mostly rewritten by me. I didn't do
> much with it other than boot and build a kernel. But I like them because
> of the much better worst case preformance.


* Re: [PATCH 3/8] mutex: Modify the way optimistic spinners are queued
  2014-02-10 19:58 ` [PATCH 3/8] mutex: Modify the way optimistic spinners are queued Peter Zijlstra
@ 2014-02-11  1:33   ` Jason Low
  2014-02-11  7:20     ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Low @ 2014-02-11  1:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Waiman Long, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Mon, 2014-02-10 at 20:58 +0100, Peter Zijlstra wrote:
> Cc: tglx@linutronix.de
> Cc: riel@redhat.com
> Cc: akpm@linux-foundation.org
> Cc: davidlohr@hp.com
> Cc: hpa@zytor.com
> Cc: andi@firstfloor.org
> Cc: aswin@hp.com
> Cc: mingo@kernel.org
> Cc: scott.norton@hp.com
> Cc: chegu_vinod@hp.com
> Cc: Waiman.Long@hp.com
> Cc: paulmck@linux.vnet.ibm.com
> Cc: torvalds@linux-foundation.org
> Signed-off-by: Jason Low <jason.low2@hp.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1390936396-3962-3-git-send-email-jason.low2@hp.com
> ---
>  kernel/locking/mutex.c |   17 +++++++----------
>  1 file changed, 7 insertions(+), 10 deletions(-)
> 
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -403,9 +403,9 @@ __mutex_lock_common(struct mutex *lock,
>  	if (!mutex_can_spin_on_owner(lock))
>  		goto slowpath;
>  
> +	mcs_spin_lock(&lock->mcs_lock);

Where did the mcs node go? :)



* Re: [PATCH 0/8] locking/core patches
  2014-02-10 23:02 ` [PATCH 0/8] locking/core patches Andrew Morton
@ 2014-02-11  7:17   ` Peter Zijlstra
  2014-02-11  8:03     ` Andrew Morton
  2014-02-25 19:26   ` Jason Low
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-11  7:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Jason Low, Waiman Long, mingo, paulmck, torvalds,
	tglx, riel, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

On Mon, Feb 10, 2014 at 03:02:30PM -0800, Andrew Morton wrote:
> On Mon, 10 Feb 2014 20:58:20 +0100 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Hi all,
> > 
> > I would propose merging the following patches...
> > 
> > The first set is mostly from Jason and tweaks the mutex adaptive
> > spinning, AIM7 throughput numbers:
> > 

                         Jobs/min/ Jobs/sec/ Time:   Time:  Time:   Time:           Running child time
         Forks  Jobs/min   child    child parent  childU childS  std_dev   JTI   :max  :min

> > PRE:  100   2000.04  21564.90 2721.29 311.99     3.12       0.01     0.00     99
> > POST: 100   2000.04  42603.85 5142.80 311.99     3.12       0.00     0.00     99
> 
> What do these columns represent?  I'm guessing the large improvement
> was in context switches?

I pasted the header from reaim above; I'm not entirely sure what the
bloody thing does and I hate that it takes hours to get these numbers :/

Bloody stupid benchmark if you ask me.



* Re: [PATCH 3/8] mutex: Modify the way optimistic spinners are queued
  2014-02-11  1:33   ` Jason Low
@ 2014-02-11  7:20     ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-11  7:20 UTC (permalink / raw)
  To: Jason Low
  Cc: linux-kernel, Waiman Long, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Mon, Feb 10, 2014 at 05:33:18PM -0800, Jason Low wrote:

> > +	mcs_spin_lock(&lock->mcs_lock);
> 
> Where did the mcs node go? :)

Bugger, that's what I get for not compiling each patch in the series..
:-/

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/8] locking/core patches
  2014-02-11  7:17   ` Peter Zijlstra
@ 2014-02-11  8:03     ` Andrew Morton
  2014-02-11  8:45       ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2014-02-11  8:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Jason Low, Waiman Long, mingo, paulmck, torvalds,
	tglx, riel, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

On Tue, 11 Feb 2014 08:17:00 +0100 Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Feb 10, 2014 at 03:02:30PM -0800, Andrew Morton wrote:
> > On Mon, 10 Feb 2014 20:58:20 +0100 Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > Hi all,
> > > 
> > > I would propose merging the following patches...
> > > 
> > > The first set is mostly from Jason and tweaks the mutex adaptive
> > > spinning, AIM7 throughput numbers:
> > > 
> 
>                          Jobs/min/ Jobs/sec/ Time:   Time:  Time:   Time:           Running child time
>          Forks  Jobs/min   child    child parent  childU childS  std_dev   JTI   :max  :min
> 
> > > PRE:  100   2000.04  21564.90 2721.29 311.99     3.12       0.01     0.00     99
> > > POST: 100   2000.04  42603.85 5142.80 311.99     3.12       0.00     0.00     99
> > 
> > What do these columns represent?  I'm guessing the large improvement
> > was in context switches?
> 
> I pasted the header from reaim above;

hmpf.  I wonder what's the difference between Jobs/min, Jobs/min(child)
and Jobs/sec(child), which is not Jobs/min(child) / 60.

> I'm not entirely sure what the
> bloody thing does and I hate that it takes hours to get these numbers :/
>
> Bloody stupid benchmark if you ask me.

heh, yes, it's stupid how long many benchmarks take.  Ditch it.  A
change like this should be testable with a 30-line microbenchmark which
runs in 5 seconds tops.
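
A minimal sketch of that kind of microbenchmark, assuming a plain
pthread mutex stands in for the kernel-side mutex (illustrative only,
not something from this series):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define NITERS   1000000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long shared;

static void *hammer(void *arg)
{
	int i;

	(void)arg;
	/* take and release the contended lock NITERS times */
	for (i = 0; i < NITERS; i++) {
		pthread_mutex_lock(&lock);
		shared++;
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, hammer, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	printf("%lu lock/unlock pairs\n", shared);
	return 0;
}

Build with "gcc -O2 -pthread", run it under time(1) before and after the
patches, and bump NTHREADS to match the machine.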

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/8] locking/core patches
  2014-02-11  8:03     ` Andrew Morton
@ 2014-02-11  8:45       ` Ingo Molnar
  2014-02-11  8:57         ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2014-02-11  8:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, linux-kernel, Jason Low, Waiman Long, paulmck,
	torvalds, tglx, riel, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 11 Feb 2014 08:17:00 +0100 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Mon, Feb 10, 2014 at 03:02:30PM -0800, Andrew Morton wrote:
> > > On Mon, 10 Feb 2014 20:58:20 +0100 Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > Hi all,
> > > > 
> > > > I would propose merging the following patches...
> > > > 
> > > > The first set is mostly from Jason and tweaks the mutex adaptive
> > > > spinning, AIM7 throughput numbers:
> > > > 
> > 
> >                          Jobs/min/ Jobs/sec/ Time:   Time:  Time:   Time:           Running child time
> >          Forks  Jobs/min   child    child parent  childU childS  std_dev   JTI   :max  :min
> > 
> > > > PRE:  100   2000.04  21564.90 2721.29 311.99     3.12       0.01     0.00     99
> > > > POST: 100   2000.04  42603.85 5142.80 311.99     3.12       0.00     0.00     99
> > > 
> > > What do these columns represent?  I'm guessing the large improvement
> > > was in context switches?
> > 
> > I pasted the header from reaim above;
> 
> hmpf.  I wonder what's the difference between Jobs/min, Jobs/min(child)
> and Jobs/sec(child), which is not Jobs/min(child) / 60.
> 
> > I'm not entirely sure what the bloody thing does and I hate that 
> > it takes hours to get these numbers :/
> >
> > Bloody stupid benchmark if you ask me.
> 
> heh, yes, it's stupid how long many benchmarks take.  Ditch it.  A 
> change like this should be testable with a 30-line microbenchmark 
> which runs in 5 seconds tops.

Another very nice option would be to stick the relevant workload 
patterns into 'perf bench', calibrate it to emit similar figures (and 
double check the speedup is similar as well) and thus make it an AIM7 
work-alike microbenchmark.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/8] locking/core patches
  2014-02-11  8:45       ` Ingo Molnar
@ 2014-02-11  8:57         ` Peter Zijlstra
  2014-02-11 21:37           ` Waiman Long
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-11  8:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, linux-kernel, Jason Low, Waiman Long, paulmck,
	torvalds, tglx, riel, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

On Tue, Feb 11, 2014 at 09:45:02AM +0100, Ingo Molnar wrote:
> > heh, yes, it's stupid how long many benchmarks take.  Ditch it.  A 
> > change like this should be testable with a 30-line microbenchmark 
> > which runs in 5 seconds tops.
> 
> Another very nice option would be to stick the relevant workload 
> patterns into 'perf bench', calibrate it to emit similar figures (and 
> double check the speedup is similar as well) and thus make it an AIM7 
> work-alike microbenchmark.

/me stares at Jason and Waiman.. :-)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] locking, mutex: Cancelable MCS lock for adaptive spinning
  2014-02-10 22:04       ` Jason Low
@ 2014-02-11  9:18         ` Peter Zijlstra
  2014-02-11  9:38           ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-11  9:18 UTC (permalink / raw)
  To: Jason Low
  Cc: linux-kernel, Waiman Long, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Mon, Feb 10, 2014 at 02:04:22PM -0800, Jason Low wrote:
> On Mon, 2014-02-10 at 22:32 +0100, Peter Zijlstra wrote:
> > Is adding that really much faster than the relatively straight path
> > oqs_wait_next() would walk to bit the same exit?
> > 
> > The only reason I pulled out the above cmpxchg() is because its the
> > uncontended fast path, which seems like a special enough case.
> 
> So it would avoid 2 extra checks (*lock == node) and (node->next) in the
> oqs_wait_next() path, which aren't necessary when node->next != NULL.
> 
> And I think node->next != NULL can be considered a special enough case
> after the cmpxchg() fails because in the contended case, we're expecting
> the node->next to be pointing at something. The only times node->next is
> NULL after cmpxchg() fails are during a very small race window with the
> osq_lock(), and when the next node is unqueuing due to need_resched,
> which is also a very small window.

True all; now if only we had a useful benchmark so we could test if it
makes a difference or not :-)
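
For reference, the shape of the release path being discussed is roughly
the following; an illustrative sketch using C11 atomics and simplified,
made-up names (osq_unlock_sketch, osq_wait_next_sketch), not the actual
code from the series:

#include <stdatomic.h>
#include <stddef.h>

struct node {
	struct node *_Atomic next;
	_Atomic int locked;
};

struct osq {
	struct node *_Atomic tail;	/* last queued spinner, or NULL */
};

/* Simplified wait-for-successor path (no unqueue/cancel handling). */
static struct node *osq_wait_next_sketch(struct osq *lock, struct node *node)
{
	struct node *expected = node;
	struct node *next;

	/* Still the tail: clearing it means there is no successor. */
	if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
		return NULL;

	/* A successor exists; wait for it to link itself in. */
	while (!(next = atomic_load(&node->next)))
		;	/* cpu_relax() in the kernel */
	return next;
}

static void osq_unlock_sketch(struct osq *lock, struct node *node)
{
	struct node *expected = node;
	struct node *next;

	/* Uncontended fast path: we are still the tail. */
	if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
		return;

	/*
	 * The case discussed above: if our successor has already linked
	 * itself in, hand over directly and skip the extra tail check in
	 * the wait-for-successor path.
	 */
	next = atomic_exchange(&node->next, NULL);
	if (next) {
		atomic_store(&next->locked, 1);
		return;
	}

	next = osq_wait_next_sketch(lock, node);
	if (next)
		atomic_store(&next->locked, 1);
}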

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] locking, mutex: Cancelable MCS lock for adaptive spinning
  2014-02-11  9:18         ` Peter Zijlstra
@ 2014-02-11  9:38           ` Ingo Molnar
  0 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2014-02-11  9:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jason Low, linux-kernel, Waiman Long, paulmck, torvalds, tglx,
	riel, akpm, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Feb 10, 2014 at 02:04:22PM -0800, Jason Low wrote:
> > On Mon, 2014-02-10 at 22:32 +0100, Peter Zijlstra wrote:
> > > Is adding that really much faster than the relatively straight path
> > > oqs_wait_next() would walk to bit the same exit?
> > > 
> > > The only reason I pulled out the above cmpxchg() is because its the
> > > uncontended fast path, which seems like a special enough case.
> > 
> > So it would avoid 2 extra checks (*lock == node) and (node->next) in the
> > oqs_wait_next() path, which aren't necessary when node->next != NULL.
> > 
> > And I think node->next != NULL can be considered a special enough case
> > after the cmpxchg() fails because in the contended case, we're expecting
> > the node->next to be pointing at something. The only times node->next is
> > NULL after cmpxchg() fails are during a very small race window with the
> > osq_lock(), and when the next node is unqueuing due to need_resched,
> > which is also a very small window.
> 
> True all; now if only we had a useful benchmark so we could test if 
> it makes a difference or not :-)

Having useful 'perf bench lock' sub-test(s) that mimic the AIM7 
workload (and other workloads that exercise locking) would address 
that concern to a large degree.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 7/8] locking: Introduce qrwlock
  2014-02-10 19:58 ` [PATCH 7/8] locking: Introduce qrwlock Peter Zijlstra
@ 2014-02-11 18:17   ` Waiman Long
  2014-02-11 20:12     ` Waiman Long
  0 siblings, 1 reply; 35+ messages in thread
From: Waiman Long @ 2014-02-11 18:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Jason Low, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

Peter,

I wrote a test program that repetitively takes a single rwlock with a
programmable number of threads and counts their execution times. Each
thread takes the lock 5M times on a 4-socket 40-core Westmere-EX
system. I bound all the threads to different CPUs with the following
3 configurations:

  1) Both CPUs and lock are in the same node
  2) CPUs and lock are in different nodes
  3) Half of the CPUs are in same node as the lock & the other half
     are remote
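
A rough sketch of such a harness, with pthread_rwlock_t standing in for
the lock under test and pthread_setaffinity_np() doing the CPU binding
(illustrative only, not the actual locktest program):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NLOOPS 5000000			/* 5M lock/unlock pairs per thread */

static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
static int reads_per_write;		/* R:W ratio; 0 = write lock only */

struct worker {
	pthread_t tid;
	int cpu;			/* CPU this thread is bound to */
	long ms;			/* measured execution time */
};

static void *worker_fn(void *arg)
{
	struct worker *w = arg;
	struct timespec t0, t1;
	cpu_set_t set;
	int i;

	/* bind to the configured CPU (same node as the lock, or remote) */
	CPU_ZERO(&set);
	CPU_SET(w->cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NLOOPS; i++) {
		if (reads_per_write && (i % (reads_per_write + 1)))
			pthread_rwlock_rdlock(&rwlock);
		else
			pthread_rwlock_wrlock(&rwlock);
		pthread_rwlock_unlock(&rwlock);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	w->ms = (t1.tv_sec - t0.tv_sec) * 1000 +
		(t1.tv_nsec - t0.tv_nsec) / 1000000;
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 2;
	struct worker *w = calloc(nthreads, sizeof(*w));
	int i;

	reads_per_write = argc > 2 ? atoi(argv[2]) : 0;

	for (i = 0; i < nthreads; i++) {
		w[i].cpu = i;		/* choose local/remote CPU numbers here */
		pthread_create(&w[i].tid, NULL, worker_fn, &w[i]);
	}
	for (i = 0; i < nthreads; i++) {
		pthread_join(w[i].tid, NULL);
		printf("thread %d (cpu %d): %ld ms\n", i, w[i].cpu, w[i].ms);
	}
	free(w);
	return 0;
}

Pinning the threads (and placing the lock) on chosen nodes gives the
local, remote and mixed configurations above.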

The results of the test run are as follows (execution times in ms,
smaller is better, qrwlock uses MCS lock for queuing):

R:W ratio = 0:1 (write lock only)
---------------------------------
  # of            rwlock mean/std          qrwlock mean/std
threads           Local  ,    Remote          Local  ,  Remote
-------         ---------------------    ----------------------
    1           75/   0,   76/   0      44/  0,   44/  0
    2          924/   6,  818/  11     233/ 22,  685/  1
    3         1680/  67, 1676/  50     422/ 24,  640/305
    4         1727/ 402, 1661/ 620     563/ 85,  694/384
    5         2634/ 861, 2746/ 426     787/276,  829/401
    6         2969/1196, 2892/ 398     810/307,  968/451
    7         3532/1512, 3137/1109     936/387, 1060/512
    8         3979/1698, 4190/1526    1134/590, 1120/503

R:W ratio = 1:1
---------------
  # of            rwlock mean/std          qrwlock mean/std
threads           Local  ,    Remote          Local  ,  Remote
-------         ---------------------    ----------------------
    1           82/   0,   81/   0      65/  0,   65/  0
    2          844/   6,  829/   1     752/  4,  894/  2
    3         1683/  65, 1628/  29     980/ 14, 1046/ 24
    4         2661/  65, 1747/ 609    1356/ 16, 1405/ 13
    5         2227/ 868, 2887/ 395    1625/ 32, 1679/ 29
    6         2553/ 968, 3924/1319    1935/ 18, 1986/ 65
    7         3392/1325, 3674/1430    2302/ 41, 2313/ 83
    8         3822/1545, 5163/2103    2654/ 31, 2654/ 67

R:W ratio = 10:1
----------------
  # of            rwlock mean/std          qrwlock mean/std
threads           Local  ,    Remote          Local  ,  Remote
-------         ---------------------    ----------------------
    1           86/   0,   86/   0      83/  0,   83/  0
    2          656/   2,  640/   3     546/  2,  635/  1
    3         1154/  39, 1053/  64     934/ 27,  968/ 31
    4         1810/  26, 1535/ 399    1357/ 34, 1382/ 52
    5         1994/ 495, 2618/  27    1713/ 52, 1785/ 47
    6         2371/ 488, 3099/ 105    2093/ 80, 2131/ 77
    7         2846/ 751, 3068/ 689    2495/103, 2527/107
    8         3392/1009, 3566/1017    2899/130, 2928/126

For reference, I also made the same measurement for ticket spinlock with
essentially no variation in execution time between different threads:

  # of          spinlock mean
threads         Local, Remote
-------         -------------
    1            70,   70
    2           757,  779
    3          1892, 1858
    4          2818, 2794
    5          3829, 3851
    6          4957, 5069
    7          6209, 6251
    8          7524, 7564

In summary,

1) The qrwlock is always faster than the rwlock, with the exception
    of about 8% slowdown with a 1:1 read-write lock ratio and 2 contending
    threads on a remote lock. In the most common case of a 1-socket
    system, qrwlock will not be slower than rwlock.

2) qrwlock has much less variation in execution times (more consistent
    performance) when compared with rwlock.

3) For rwlock with 2-3 contending threads, remote lock actually
    performs slightly better than local lock.

4) For qrwlock, local lock performs slightly better than remote lock
    with less variation in execution time.

As for the use of a ticket lock for queuing purposes, the spinlock
performance figures above seem to suggest that it will perform worse
than the MCS lock with even a few contending threads. This is especially
a problem with contending threads coming from different nodes.

Below are the performance data with half local and half remote CPUs:

# of threads = 8
R/W ratio     rwlock mean/std        qrwlock mean/std
---------    ---------------        ----------------
   0:1            3476/868           1304/484
   1:1           3452/975           2984/777
  10:1            3215/914           2963/107

The rwlock actually performs slightly better when CPUs have different access
times to the lock. The qrwlock, on the other hand, performs slightly worse
with much larger variation in execution times.

In the case of ticket spinlock with asymmetric access time (half
local CPUs, half remote CPUs), the performance data were:

# of threads         spinlock mean
------------         -------------
      2             4570
      4             9927
      6            15197
      8            20573

The ticket spinlock seems to be 3-5X slower with asymmetric access
times, so it may not be such a good idea to replace the MCS lock with a
ticket lock in qrwlock. I will take some additional measurements with
your patch to see how it performs.

-Longman


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 7/8] locking: Introduce qrwlock
  2014-02-11 18:17   ` Waiman Long
@ 2014-02-11 20:12     ` Waiman Long
  2014-02-13 16:35       ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Waiman Long @ 2014-02-11 20:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Jason Low, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On 02/11/2014 01:17 PM, Waiman Long wrote:
> Peter,
>
> I had written a test program to repetitively take a single rwlock with
> programmable number of threads and count their execution times. Each
> thread takes the lock 5M times on a 4-socket 40-core Westmere-EX
> system. I bound all the threads to different CPUs with the following
> 3 configurations:
>
>  1) Both CPUs and lock are in the same node
>  2) CPUs and lock are in different nodes
>  3) Half of the CPUs are in same node as the lock & the other half
>     are remote
>
> The results of the test run are as follows (execution times in ms,
> smaller is better, qrwlock uses MCS lock for queuing):
>
> R:W ratio = 0:1 (write lock only)
> ---------------------------------
>  # of            rwlock mean/std          qrwlock mean/std
> threads           Local  ,    Remote          Local  ,  Remote
> -------         ---------------------    ----------------------
>    1           75/   0,   76/   0      44/  0,   44/  0
>    2          924/   6,  818/  11     233/ 22,  685/  1
>    3         1680/  67, 1676/  50     422/ 24,  640/305
>    4         1727/ 402, 1661/ 620     563/ 85,  694/384
>    5         2634/ 861, 2746/ 426     787/276,  829/401
>    6         2969/1196, 2892/ 398     810/307,  968/451
>    7         3532/1512, 3137/1109     936/387, 1060/512
>    8         3979/1698, 4190/1526    1134/590, 1120/503
>
> R:W ratio = 1:1
> ---------------
>  # of            rwlock mean/std          qrwlock mean/std
> threads           Local  ,    Remote          Local  ,  Remote
> -------         ---------------------    ----------------------
>    1           82/   0,   81/   0      65/  0,   65/  0
>    2          844/   6,  829/   1     752/  4,  894/  2
>    3         1683/  65, 1628/  29     980/ 14, 1046/ 24
>    4         2661/  65, 1747/ 609    1356/ 16, 1405/ 13
>    5         2227/ 868, 2887/ 395    1625/ 32, 1679/ 29
>    6         2553/ 968, 3924/1319    1935/ 18, 1986/ 65
>    7         3392/1325, 3674/1430    2302/ 41, 2313/ 83
>    8         3822/1545, 5163/2103    2654/ 31, 2654/ 67
>
> R:W ratio = 10:1
> ----------------
>  # of            rwlock mean/std          qrwlock mean/std
> threads           Local  ,    Remote          Local  ,  Remote
> -------         ---------------------    ----------------------
>    1           86/   0,   86/   0      83/  0,   83/  0
>    2          656/   2,  640/   3     546/  2,  635/  1
>    3         1154/  39, 1053/  64     934/ 27,  968/ 31
>    4         1810/  26, 1535/ 399    1357/ 34, 1382/ 52
>    5         1994/ 495, 2618/  27    1713/ 52, 1785/ 47
>    6         2371/ 488, 3099/ 105    2093/ 80, 2131/ 77
>    7         2846/ 751, 3068/ 689    2495/103, 2527/107
>    8         3392/1009, 3566/1017    2899/130, 2928/126
>
> For reference, I also made the same measurement for ticket spinlock with
> essentially no variation in execution time between different threads:
>
>  # of          spinlock mean
> threads         Local, Remote
> -------         -------------
>    1            70,   70
>    2           757,  779
>    3          1892, 1858
>    4          2818, 2794
>    5          3829, 3851
>    6          4957, 5069
>    7          6209, 6251
>    8          7524, 7564
>
> In summary,
>
> 1) The qrwlock is always faster than the rwlock with the exception
>    of about 8% slowdown with 1:1 read-write lock ratio and 2 contending
>    threads on a remote lock. In the most common case of 1-socket
>    system, qrwlock will not be slower than rwlock.
>
> 2) qrwlock has much less variation in execution times (more consistent
>    performance) when compared with rwlock.
>
> 3) For rwlock with 2-3 contending threads, remote lock actually
>    performs slightly better than local lock.
>
> 4) For qrwlock, local lock performs slightly better than remote lock
>    with less variation in execution time.
>
> As for the use of ticket lock for queuing purpose, the spinlock
> performance figures above seem to suggest that it will perform worse
> than MCS lock with even a few contending threads. This is especially
> a problem with contending threads coming from different nodes.
>
> Below are the performance data with half local and half remote CPUs:
>
> # of threads = 8
> R/W ratio     rwlock mean/std        qrwlock mean/std
> ---------    ---------------        ----------------
>   0:1            3476/868           1304/484
>   1:1           3452/975           2984/777
>  10:1            3215/914           2963/107
>
> The rwlock actually performs slightly better when CPUs have different 
> access
> times to the lock. The qrwlock, on the other hand, performs slightly 
> worse
> with much larger variation in execution times.
>
> In the case of ticket spinlock with asymmetric access time (half
> local CPUs, half remote CPUs), the performance data were:
>
> # of threads         spinlock mean
> ------------         -------------
>      2             4570
>      4             9927
>      6            15197
>      8            20573
>
> Ticket spinlock seems to be 3-5X slower with asymmetric access time. So
> it may not be such a good idea to replace the MCS lock by a ticket
> lock in qrwlock. I will take some additional measurement with your patch
> to see how it perform.
>
> -Longman
>
I used the same locktest program to repetitively take a single rwlock
with a programmable number of threads and count their execution times.
Each thread takes the lock 5M times on a 4-socket 40-core Westmere-EX
system. I bound all the threads to different CPUs with the following
3 configurations:

  1) Both CPUs and lock are in the same node
  2) CPUs and lock are in different nodes
  3) Half of the CPUs are in same node as the lock & the other half
     are remote

Two types of qrwlock are tested:
  1) Use MCS lock
  2) Use ticket lock

The results of the test run are as follows (execution times in ms,
smaller is better):

R:W ratio = 0:1 (write lock only)
---------------------------------
  # of          ticket lock mean/std      MCS lock mean/std
threads           Local  ,    Remote          Local  ,  Remote
-------         ---------------------    ----------------------
    1           44/   0,   43/   0      44/  0,   44/  0
    2          376/  21,  504/   3     231/ 21,  233/ 22
    3          969/  66,  509/ 176     363/ 88,  359/107
    4         1689/ 103,  960/ 417     693/263,  502/141
    5         2330/ 167, 1575/ 668     769/273,  649/238
    6         2802/ 474, 2203/ 884     866/334,  864/329
    7         3390/ 499, 2812/1040    1065/461, 1075/316
    8         3725/ 613, 3359/1210    1257/552, 1288/493

R:W ratio = 1:1
---------------
  # of          ticket lock mean/std      MCS lock mean/std
threads           Local  ,    Remote          Local  ,  Remote
-------         ---------------------    ----------------------
    1           66/   0,   66/   0      66/  0,   65/  0
    2          824/   4,  609/  13     761/  2,  774/  4
    3         1440/   4, 1305/  57     926/ 19,  989/  7
    4         2080/  53, 1991/  58    1371/ 29, 1322/ 33
    5         2786/  71, 2635/ 116    1632/ 82, 1604/ 22
    6         3808/ 147, 3584/ 165    1938/102, 1912/ 98
    7         4795/ 107, 4593/ 116    2378/ 79, 2279/ 93
    8         5498/ 387, 5428/ 370    2614/ 84, 2602/ 63

R:W ratio = 10:1
----------------
  # of          ticket lock mean/std      MCS lock mean/std
threads           Local  ,    Remote          Local  ,  Remote
-------         ---------------------    ----------------------
    1           83/   0,   84/   0      83/  0,   83/  0
    2          577/  13,  421/  25     492/  9,  554/  1
    3         1246/  16, 1009/ 112     866/ 10,  902/ 49
    4         1956/  29, 1656/ 171    1240/ 33, 1340/ 43
    5         2788/  26, 2480/ 208    1661/ 78, 1685/ 59
    6         3723/  74, 3429/ 207    2068/107, 2064/ 86
    7         4629/ 197, 4295/ 303    2484/132, 2464/108
    8         5565/ 277, 5406/ 301    2903/161, 2864/135

There are variations between runs, especially for low thread counts.
With the ticket lock, remote lock access seems to benefit performance,
especially at low thread counts (2 or 3), with correspondingly larger
execution time variation. With the MCS lock, remote lock access generally
yields slightly worse performance. Overall, the only case where the
ticket lock performs better is with 2 contending threads on a remote
lock.

With half local CPUs and half remote CPUs, the performance data were:

R:W ratio = 0:1 (write lock only)
---------------------------------
# of threads     ticket lock mean/std      MCS lock mean/std
------------    --------------------    ----------------------
    2            518/ 22              294/ 11
    4            973/295             604/221
    6           1465/580             925/446
    8           1721/763            1155/528

R:W ratio = 1:1
---------------
# of threads     ticket lock mean/std      MCS lock mean/std
------------    --------------------    ----------------------
    2            578/  2             605/ 11
    4           1436/177            1518/182
    6           2254/327            2248/206
    8           3097/648            2536/838

R:W ratio = 10:1
----------------
# of threads     ticket lock mean/std      MCS lock mean/std
------------    --------------------    ----------------------
    2            632/  5             718/  3
    4           1783/153            1733/ 82
    6           2815/213            2612/110
    8           3911/380            3460/152

Asymmetric access time benefits the ticket lock, which performs slightly
better than the MCS lock at 2 & 4 threads (1:1) and 2 threads (10:1). For
the MCS lock, asymmetric access time makes the performance slightly worse.

-Longman



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/8] locking/core patches
  2014-02-11  8:57         ` Peter Zijlstra
@ 2014-02-11 21:37           ` Waiman Long
  0 siblings, 0 replies; 35+ messages in thread
From: Waiman Long @ 2014-02-11 21:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, linux-kernel, Jason Low, paulmck,
	torvalds, tglx, riel, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

On 02/11/2014 03:57 AM, Peter Zijlstra wrote:
> On Tue, Feb 11, 2014 at 09:45:02AM +0100, Ingo Molnar wrote:
>>> heh, yes, it's stupid how long many benchmarks take.  Ditch it.  A
>>> change like this should be testable with a 30-line microbenchmark
>>> which runs in 5 seconds tops.
>> Another very nice option would be to stick the relevant workload
>> patterns into 'perf bench', calibrate it to emit similar figures (and
>> double check the speedup is similar as well) and thus make it an AIM7
>> work-alike microbenchmark.
> /me stares at Jason and Waiman.. :-)

It shouldn't be too hard to get the mutex-exercising portion of AIM7
into perf. I will discuss this with Jason when we finish some of the
high-priority tasks we currently have.

-Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 7/8] locking: Introduce qrwlock
  2014-02-11 20:12     ` Waiman Long
@ 2014-02-13 16:35       ` Peter Zijlstra
  2014-02-13 17:26         ` Peter Zijlstra
  2014-02-14 18:48         ` Waiman Long
  0 siblings, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-13 16:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, Jason Low, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Tue, Feb 11, 2014 at 03:12:59PM -0500, Waiman Long wrote:
> Using the same locktest program to repetitively take a single rwlock with
> programmable number of threads and count their execution times. Each
> thread takes the lock 5M times on a 4-socket 40-core Westmere-EX
> system. I bound all the threads to different CPUs with the following
> 3 configurations:
> 
>  1) Both CPUs and lock are in the same node
>  2) CPUs and lock are in different nodes
>  3) Half of the CPUs are in same node as the lock & the other half
>     are remote

I can't find these configurations in the numbers below; especially the
first is interesting because most computers out there have no nodes.

> Two types of qrwlock are tested:
>  1) Use MCS lock
>  2) Use ticket lock

arch_spinlock_t; you forget that if you change that to an MCS style lock
this one goes along for free.

On that; I had a look at your qspinlock and got a massive headache, so
I rewrote it. Aside from being very messy code, it also suffered from a
few fairness issues in that it is possible (albeit highly unlikely) to
steal a lock instead of being properly queued, per your xchg() usage.

The below boots; but I've not done much else with it, so it will
probably explode in your face.

---
 arch/x86/Kconfig                      |    1 
 arch/x86/include/asm/spinlock.h       |    2 
 arch/x86/include/asm/spinlock_types.h |    4 
 b/arch/x86/include/asm/qspinlock.h    |   13 ++
 b/include/asm-generic/qspinlock.h     |  128 ++++++++++++++++++++++++
 b/kernel/locking/qspinlock.c          |  178 ++++++++++++++++++++++++++++++++++
 kernel/Kconfig.locks                  |    7 +
 kernel/locking/Makefile               |    1 
 kernel/locking/mcs_spinlock.h         |    1 
 9 files changed, 335 insertions(+)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -17,6 +17,7 @@ config X86_64
 	depends on 64BIT
 	select X86_DEV_DMA_OPS
 	select ARCH_USE_CMPXCHG_LOCKREF
+	select ARCH_USE_QUEUE_SPINLOCK
 
 ### Arch settings
 config X86
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,13 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+#define queue_spin_unlock queue_spin_unlock
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	barrier();
+	ACCESS_ONCE(*(u8 *)&lock->val) = 0;
+}
+#endif
+
+#endif /* _ASM_X86_QSPINLOCK_H */
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -43,6 +43,7 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifndef CONFIG_QUEUE_SPINLOCK
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -181,6 +182,7 @@ static __always_inline void arch_spin_lo
 {
 	arch_spin_lock(lock);
 }
+#endif /* !CONFIG_QUEUE_SPINLOCK */
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -11,6 +11,9 @@
 #define TICKET_SLOWPATH_FLAG	((__ticket_t)0)
 #endif
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock.h>
+#else
 #if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
 typedef u8  __ticket_t;
 typedef u16 __ticketpair_t;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #ifdef CONFIG_QUEUE_RWLOCK
 #include <asm/qrwlock.h>
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,128 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <asm/cmpxchg.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+#include <asm/byteorder.h>
+
+/*
+ * The queue spinlock data structure - a 32-bit word
+ * Bits 0-7 : Set if locked
+ * Bits 8-31: Queue code
+ */
+typedef struct qspinlock {
+	atomic_t	val;
+} arch_spinlock_t;
+
+#define _QSPINLOCK_LOCKED	1U
+#define _QLOCKED_MASK		0xffU
+#define	_QCODE_OFFSET		8
+
+#include <asm/qspinlock.h>
+
+/*
+ * External function declarations
+ */
+extern void queue_spin_lock_slowpath(struct qspinlock *lock);
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+	return atomic_read(&lock->val) & _QLOCKED_MASK;
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+	return !queue_spin_is_locked(&lock);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+	return atomic_read(&lock->val) >> _QCODE_OFFSET;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+	return !atomic_read(&lock->val) &&
+	       atomic_cmpxchg(&lock->val, 0, _QSPINLOCK_LOCKED) == 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+	if (likely(atomic_cmpxchg(&lock->val, 0, _QSPINLOCK_LOCKED) == 0))
+		return;
+	queue_spin_lock_slowpath(lock);
+}
+
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+#ifndef queue_spin_unlock
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	smp_mb__before_atomic_dec();
+	atomic_sub(_QSPINLOCK_LOCKED, &lock->val);
+}
+#endif
+
+/*
+ * Initializier
+ */
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0), }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l)		queue_spin_is_locked(l)
+#define arch_spin_is_contended(l)	queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)	queue_spin_value_unlocked(l)
+#define arch_spin_lock(l)		queue_spin_lock(l)
+#define arch_spin_trylock(l)		queue_spin_trylock(l)
+#define arch_spin_unlock(l)		queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f)	queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -224,6 +224,13 @@ config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES
 
+config ARCH_USE_QUEUE_SPINLOCK
+	bool
+
+config QUEUE_SPINLOCK
+	def_bool y if ARCH_USE_QUEUE_SPINLOCK
+	depends on SMP && !PARAVIRT_SPINLOCKS
+
 config ARCH_USE_QUEUE_RWLOCK
 	bool
 
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -23,4 +23,5 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock
 obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
 obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
 obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
 obj-$(CONFIG_QUEUE_RWLOCK) += qrwlock.o
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -17,6 +17,7 @@
 struct mcs_spinlock {
 	struct mcs_spinlock *next;
 	int locked; /* 1 if lock acquired */
+	int count; /* see qspinlock.c */
 };
 
 #ifndef arch_mcs_spin_lock_contended
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,178 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <asm-generic/qspinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock with twists
+ * to make it fit the following constraints:
+ * 1. A max spinlock size of 4 bytes
+ * 2. Good fastpath performance
+ * 3. No change in the locking APIs
+ *
+ * The queue spinlock fastpath is as simple as it can get, all the heavy
+ * lifting is done in the lock slowpath. The main idea behind this queue
+ * spinlock implementation is to keep the spinlock size at 4 bytes while
+ * at the same time implement a queue structure to queue up the waiting
+ * lock spinners.
+ *
+ * Since preemption is disabled before getting the lock, a given CPU will
+ * only need to use one queue node structure in a non-interrupt context.
+ * A percpu queue node structure will be allocated for this purpose and the
+ * cpu number will be put into the queue spinlock structure to indicate the
+ * tail of the queue.
+ */
+
+#include "mcs_spinlock.h"
+
+/*
+ * Exactly fills one cacheline on 64bit.
+ */
+static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
+
+/*
+ * The 24-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-23: CPU number + 1   (4M - 1 CPUs)
+ *
+ * A queue node code of 0 indicates that no one is waiting for the lock.
+ * As the value 0 cannot be used as a valid CPU number. We need to add
+ * 1 to it before putting it into the queue code.
+ */
+
+static inline u32 encode(int cpu, int idx)
+{
+	u32 code;
+
+        code  = (cpu + 1) << 10;
+	code |= idx << 8; /* assume < 4 */
+
+	return code;
+}
+
+static inline struct mcs_spinlock *decode(u32 code)
+{
+	int cpu = (code >> 10) - 1;
+	int idx = (code >> 8) & 0x3;
+
+	return per_cpu_ptr(&mcs_nodes[idx], cpu);
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock)
+{
+	struct mcs_spinlock *prev, *next, *node = this_cpu_ptr(mcs_nodes);
+	int idx = node->count++;
+	u32 val, new, old;
+
+	u32 code = encode(smp_processor_id(), idx);
+
+	node += idx;
+	node->locked = 0;
+	node->next = NULL;
+
+	/*
+	 * trylock || xchg(lock, node)
+	 *
+	 * 0,0 -> 0,1 ; trylock
+	 * p,x -> n,x ; prev = xchg(lock, node)
+	 */
+	val = atomic_read(&lock->val);
+	for (;;) {
+		new = _QSPINLOCK_LOCKED;
+		if (val)
+			new = code | (val & new);
+
+		old = atomic_cmpxchg(&lock->val, val, new);
+		if (old == val)
+			break;
+
+		val = old;
+	}
+
+	/*
+	 * we won the trylock; forget about queueing.
+	 */
+	if (new == _QSPINLOCK_LOCKED) {
+		this_cpu_dec(mcs_nodes[0].count);
+		return;
+	}
+
+	/*
+	 * if there was a previous node; link it and wait.
+	 */
+	if (old & ~_QSPINLOCK_LOCKED) {
+		prev = decode(old);
+		ACCESS_ONCE(prev->next) = node;
+
+		arch_mcs_spin_lock_contended(&node->locked);
+	}
+
+	/*
+	 * We're at the head of the waitqueue, wait for the owner to go away.
+	 */
+	while ((val = atomic_read(&lock->val)) & _QSPINLOCK_LOCKED)
+		arch_mutex_cpu_relax();
+
+	/*
+	 * n,0 -> 0,1 : lock, uncontended
+	 * x,0 -> x,1 : lock, contended
+	 */
+	for (;;) {
+		new = _QSPINLOCK_LOCKED;
+		if (val != code)
+			new |= val;
+
+		old = atomic_cmpxchg(&lock->val, val, new);
+		if (old == val)
+			break;
+
+		val = old;
+	}
+
+	/*
+	 * contended path; wait for next, release.
+	 */
+	if (new != _QSPINLOCK_LOCKED) {
+		while (!(next = ACCESS_ONCE(node->next)))
+			arch_mutex_cpu_relax();
+
+		arch_mcs_spin_unlock_contended(&next->locked);
+	}
+
+	/*
+	 * release the node
+	 */
+	this_cpu_dec(mcs_nodes[0].count);
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
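
As a sanity check of the (cpu, idx) packing above, here is the encode /
decode pair copied into a standalone userspace snippet (illustrative
only; the shifts are the ones from the patch):

#include <assert.h>
#include <stdio.h>

static unsigned int encode(int cpu, int idx)
{
	return ((unsigned int)(cpu + 1) << 10) | ((unsigned int)idx << 8);
}

static void decode(unsigned int code, int *cpu, int *idx)
{
	*cpu = (int)(code >> 10) - 1;
	*idx = (code >> 8) & 0x3;
}

int main(void)
{
	int cpu, idx;

	for (cpu = 0; cpu < 4096; cpu++) {
		for (idx = 0; idx < 4; idx++) {
			unsigned int code = encode(cpu, idx);
			int c, i;

			/* the locked byte (bits 0-7) must stay clear */
			assert((code & 0xff) == 0);
			decode(code, &c, &i);
			assert(c == cpu && i == idx);
		}
	}
	printf("encode/decode round-trip OK\n");
	return 0;
}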

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 7/8] locking: Introduce qrwlock
  2014-02-13 16:35       ` Peter Zijlstra
@ 2014-02-13 17:26         ` Peter Zijlstra
  2014-02-14 19:01           ` Waiman Long
  2014-02-14 18:48         ` Waiman Long
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-13 17:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, Jason Low, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Thu, Feb 13, 2014 at 05:35:46PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 11, 2014 at 03:12:59PM -0500, Waiman Long wrote:
> > Using the same locktest program to repetitively take a single rwlock with
> > programmable number of threads and count their execution times. Each
> > thread takes the lock 5M times on a 4-socket 40-core Westmere-EX
> > system. I bound all the threads to different CPUs with the following
> > 3 configurations:
> > 
> >  1) Both CPUs and lock are in the same node
> >  2) CPUs and lock are in different nodes
> >  3) Half of the CPUs are in same node as the lock & the other half
> >     are remote
> 
> I can't find these configurations in the below numbers; esp the first is
> interesting because most computers out there have no nodes.
> 
> > Two types of qrwlock are tested:
> >  1) Use MCS lock
> >  2) Use ticket lock
> 
> arch_spinlock_t; you forget that if you change that to an MCS style lock
> this one goes along for free.

Furthermore; comparing the current rwlock to the ticket-rwlock already
shows an improvement, so on that aspect it's worth it as well.

And there's also the paravirt people to consider; a fair rwlock will
make them unhappy; and I'm hoping that their current paravirt ticket
stuff is sufficient to deal with the ticket-rwlock without them having
to come and wreck things again.

Similarly; qspinlock needs paravirt support.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 7/8] locking: Introduce qrwlock
  2014-02-13 16:35       ` Peter Zijlstra
  2014-02-13 17:26         ` Peter Zijlstra
@ 2014-02-14 18:48         ` Waiman Long
  1 sibling, 0 replies; 35+ messages in thread
From: Waiman Long @ 2014-02-14 18:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Jason Low, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On 02/13/2014 11:35 AM, Peter Zijlstra wrote:
> On Tue, Feb 11, 2014 at 03:12:59PM -0500, Waiman Long wrote:
>> Using the same locktest program to repetitively take a single rwlock with
>> programmable number of threads and count their execution times. Each
>> thread takes the lock 5M times on a 4-socket 40-core Westmere-EX
>> system. I bound all the threads to different CPUs with the following
>> 3 configurations:
>>
>>   1) Both CPUs and lock are in the same node
>>   2) CPUs and lock are in different nodes
>>   3) Half of the CPUs are in same node as the lock&  the other half
>>      are remote
> I can't find these configurations in the below numbers; esp the first is
> interesting because most computers out there have no nodes.

I have local and remote numbers in the measurement data that I sent
out. The local ones are when both the CPUs and the lock are in the same
node; the remote ones are when they are in different nodes.

>> Two types of qrwlock are tested:
>>   1) Use MCS lock
>>   2) Use ticket lock
> arch_spinlock_t; you forget that if you change that to an MCS style lock
> this one goes along for free.

Yes, I am aware of that. I am not saying that it is a bad idea to use
arch_spinlock_t. I will be happy if your version of the qrwlock patch
gets merged. I am just saying that it may be a better idea to use the
MCS lock directly, especially in the case where the spinlock is not
converted to use an MCS-style lock. I would be even happier if that
happens.


>
> On that; I had a look at your qspinlock and got a massive head-ache so I
> rewrote it. Aside from being very mess code it also suffered from a few
> fairness issues in that it is possible (albeit highly unlikely) to steal
> a lock instead of being properly queued; per your xchg() usage.
>
> The below boots; but I've not done much else with it, so it will
> probably explode in your face.

Thanks for looking into my qspinlock patch. I will take a look at your
changes and incorporate them to make it fairer. I have already rewritten
it along the same lines as your version of the qrwlock patch. I have
done some performance testing at low contention levels using my
microbenchmark. The qspinlock was indeed slower than the ticket lock
with 2-4 contending tasks; the break-even point is at 5 contending
tasks. To fix this performance deficit, I added an optimized
x86-specific contention path for 2 contending tasks so that it would
perform better than the ticket lock. It will still be somewhat slower
for 3-4 contending tasks, but the 2-contending-task case is probably
the most common.

With that change, I would say that my qspinlock patch should be good
enough as a replacement for the ticket spinlock on x86. I will send out
an updated qspinlock patch in a day or two when I finish my testing.

-Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 7/8] locking: Introduce qrwlock
  2014-02-13 17:26         ` Peter Zijlstra
@ 2014-02-14 19:01           ` Waiman Long
  0 siblings, 0 replies; 35+ messages in thread
From: Waiman Long @ 2014-02-14 19:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Jason Low, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On 02/13/2014 12:26 PM, Peter Zijlstra wrote:
> On Thu, Feb 13, 2014 at 05:35:46PM +0100, Peter Zijlstra wrote:
>> On Tue, Feb 11, 2014 at 03:12:59PM -0500, Waiman Long wrote:
>>> Using the same locktest program to repetitively take a single rwlock with
>>> programmable number of threads and count their execution times. Each
>>> thread takes the lock 5M times on a 4-socket 40-core Westmere-EX
>>> system. I bound all the threads to different CPUs with the following
>>> 3 configurations:
>>>
>>>   1) Both CPUs and lock are in the same node
>>>   2) CPUs and lock are in different nodes
>>>   3) Half of the CPUs are in same node as the lock&  the other half
>>>      are remote
>> I can't find these configurations in the below numbers; esp the first is
>> interesting because most computers out there have no nodes.
>>
>>> Two types of qrwlock are tested:
>>>   1) Use MCS lock
>>>   2) Use ticket lock
>> arch_spinlock_t; you forget that if you change that to an MCS style lock
>> this one goes along for free.
> Furthermore; comparing the current rwlock to the ticket-rwlock already
> shows an improvement, so on that aspect its worth it as well.

As I said in my previous email, I am not against your change.

> And there's also the paravirt people to consider; a fair rwlock will
> make them unhappy; and I'm hoping that their current paravirt ticket
> stuff is sufficient to deal with the ticket-rwlock without them having
> to come and wreck things again.

Actually, my original qrwlock patch has an unfair option. With some
minor changes, it can be made unfair pretty easily, so we can use the
paravirt config macro to switch it to unfair if that is what the
virtualization people want.

> Similarly; qspinlock needs paravirt support.
>
>

The current paravirt code has hard-coded the use of the ticket spinlock.
That is why I have to disable my qspinlock code if paravirt is enabled.

I have been thinking about that paravirt support. Since the waiting
tasks are queued up, by maintaining some kind of heartbeat signal it is
possible to make a waiting task jump the queue if the previous one in
the queue doesn't seem to be alive. I will work on that next, once I am
done with the current qspinlock patch.

-Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/8] locking/core patches
  2014-02-10 23:02 ` [PATCH 0/8] locking/core patches Andrew Morton
  2014-02-11  7:17   ` Peter Zijlstra
@ 2014-02-25 19:26   ` Jason Low
  1 sibling, 0 replies; 35+ messages in thread
From: Jason Low @ 2014-02-25 19:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, linux-kernel, Waiman Long, mingo, paulmck,
	torvalds, tglx, riel, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

On Mon, 2014-02-10 at 15:02 -0800, Andrew Morton wrote:
> On Mon, 10 Feb 2014 20:58:20 +0100 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Hi all,
> > 
> > I would propose merging the following patches...
> > 
> > The first set is mostly from Jason and tweaks the mutex adaptive
> > spinning, AIM7 throughput numbers:
> > 
> > PRE:  100   2000.04  21564.90 2721.29 311.99     3.12       0.01     0.00     99
> > POST: 100   2000.04  42603.85 5142.80 311.99     3.12       0.00     0.00     99
> 
> What do these columns represent?  I'm guessing the large improvement
> was in context switches?

Hello,

I also re-tested the mutex patches 1-6 on my 2 and 8 socket machines
with the high_systime and fserver AIM7 workloads (run on disk). The
workloads are able to generate contention on the
&EXT4_SB(inode->i_sb)->s_orphan_lock mutex. Below are the % improvements
in throughput with the patches on a recent tip kernel. The main benefits
were on the larger box and with higher numbers of users.

Note: the -0.7% drop in performance for fserver at 10-90 users on the
2-socket machine was mainly due to "[PATCH 6/8] mutex: Extra reschedule
point". Without patch 6, there was almost no % difference in throughput
between the baseline kernel and the kernel with patches 1-5.


8 socket machine:

--------------------------
       	 fserver   
--------------------------
users     | % improvement
          | in throughput
          | with patches
--------------------------
1000-2000 |  +29.2%
--------------------------
100-900   |  +10.0%
--------------------------
10-90     |   +0.4%


--------------------------
       high_systime
--------------------------
users     | % improvement
          | in throughput
          | with patches
--------------------------
1000-2000 |  +34.9%
--------------------------
100-900   |  +49.2%
--------------------------
10-90     |   +3.1%



2 socket machine:

--------------------------
         fserver   
--------------------------
users     | % improvement
          | in throughput
          | with patches
--------------------------
1000-2000 |   +1.8%
--------------------------
100-900   |   +0.0%
--------------------------
10-90     |   -0.7%


--------------------------
       high_systime
--------------------------
users     | % improvement
          | in throughput
          | with patches
--------------------------
1000-2000 |   +0.8%
--------------------------
100-900   |   +0.4%
--------------------------
10-90     |   +0.0%





^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] locking, mutex: Cancelable MCS lock for adaptive spinning
  2014-02-10 19:58 ` [PATCH 5/8] locking, mutex: Cancelable MCS lock for adaptive spinning Peter Zijlstra
  2014-02-10 21:15   ` Jason Low
@ 2014-02-25 19:56   ` Jason Low
  2014-02-26  9:22     ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Jason Low @ 2014-02-25 19:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Waiman Long, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Mon, 2014-02-10 at 20:58 +0100, Peter Zijlstra wrote:

> +unqueue:
> +	/*
> +	 * Step - A  -- stabilize @prev
> +	 *
> +	 * Undo our @prev->next assignment; this will make @prev's
> +	 * unlock()/unqueue() wait for a next pointer since @lock points to us
> +	 * (or later).
> +	 */
> +
> +	for (;;) {
> +		if (prev->next == node &&
> +		    cmpxchg(&prev->next, node, NULL) == node)
> +			break;
> +
> +		/*
> +		 * We can only fail the cmpxchg() racing against an unlock(),
> +		 * in which case we should observe @node->locked becomming
> +		 * true.
> +		 */
> +		if (smp_load_acquire(&node->locked))
> +			return true;
> +
> +		/*
> +		 * Or we race against a concurrent unqueue()'s step-B, in which
> +		 * case its step-C will write us a new @node->prev pointer.
> +		 */
> +		prev = ACCESS_ONCE(node->prev);

Should we also add an arch_mutex_cpu_relax() to this loop?



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] locking, mutex: Cancelable MCS lock for adaptive spinning
  2014-02-25 19:56   ` Jason Low
@ 2014-02-26  9:22     ` Peter Zijlstra
  2014-02-26 17:45       ` Jason Low
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2014-02-26  9:22 UTC (permalink / raw)
  To: Jason Low
  Cc: linux-kernel, Waiman Long, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Tue, Feb 25, 2014 at 11:56:19AM -0800, Jason Low wrote:
> On Mon, 2014-02-10 at 20:58 +0100, Peter Zijlstra wrote:
> 
> > +unqueue:
> > +	/*
> > +	 * Step - A  -- stabilize @prev
> > +	 *
> > +	 * Undo our @prev->next assignment; this will make @prev's
> > +	 * unlock()/unqueue() wait for a next pointer since @lock points to us
> > +	 * (or later).
> > +	 */
> > +
> > +	for (;;) {
> > +		if (prev->next == node &&
> > +		    cmpxchg(&prev->next, node, NULL) == node)
> > +			break;
> > +
> > +		/*
> > +		 * We can only fail the cmpxchg() racing against an unlock(),
> > +		 * in which case we should observe @node->locked becomming
> > +		 * true.
> > +		 */
> > +		if (smp_load_acquire(&node->locked))
> > +			return true;

I've stuck one in right about here, so that we don't unduly delay the
cmpxchg() after the load of prev.

> > +
> > +		/*
> > +		 * Or we race against a concurrent unqueue()'s step-B, in which
> > +		 * case its step-C will write us a new @node->prev pointer.
> > +		 */
> > +		prev = ACCESS_ONCE(node->prev);
> 
> Should we also add an arch_mutex_cpu_relax() to this loop?
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] locking, mutex: Cancelable MCS lock for adaptive spinning
  2014-02-26  9:22     ` Peter Zijlstra
@ 2014-02-26 17:45       ` Jason Low
  0 siblings, 0 replies; 35+ messages in thread
From: Jason Low @ 2014-02-26 17:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Waiman Long, mingo, paulmck, torvalds, tglx, riel,
	akpm, davidlohr, hpa, andi, aswin, scott.norton, chegu_vinod

On Wed, 2014-02-26 at 10:22 +0100, Peter Zijlstra wrote:
> On Tue, Feb 25, 2014 at 11:56:19AM -0800, Jason Low wrote:
> > On Mon, 2014-02-10 at 20:58 +0100, Peter Zijlstra wrote:
> > > +	for (;;) {
> > > +		if (prev->next == node &&
> > > +		    cmpxchg(&prev->next, node, NULL) == node)
> > > +			break;
> > > +
> > > +		/*
> > > +		 * We can only fail the cmpxchg() racing against an unlock(),
> > > +		 * in which case we should observe @node->locked becomming
> > > +		 * true.
> > > +		 */
> > > +		if (smp_load_acquire(&node->locked))
> > > +			return true;
> 
> I've stuck on in right about here. So that we don't unduly delay the
> cmpxchg() after the load of prev.

Ok.

Reviewed-by: Jason Low <jason.low2@hp.com>



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/8] locking/core patches
  2014-02-10 19:58 [PATCH 0/8] locking/core patches Peter Zijlstra
                   ` (8 preceding siblings ...)
  2014-02-10 23:02 ` [PATCH 0/8] locking/core patches Andrew Morton
@ 2014-02-26 21:40 ` Paul E. McKenney
  9 siblings, 0 replies; 35+ messages in thread
From: Paul E. McKenney @ 2014-02-26 21:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Jason Low, Waiman Long, mingo, torvalds, tglx,
	riel, akpm, davidlohr, hpa, andi, aswin, scott.norton,
	chegu_vinod

On Mon, Feb 10, 2014 at 08:58:20PM +0100, Peter Zijlstra wrote:
> Hi all,
> 
> I would propose merging the following patches...
> 
> The first set is mostly from Jason and tweaks the mutex adaptive
> spinning, AIM7 throughput numbers:
> 
> PRE:  100   2000.04  21564.90 2721.29 311.99     3.12       0.01     0.00     99
> POST: 100   2000.04  42603.85 5142.80 311.99     3.12       0.00     0.00     99
> 
> The second set is the qrwlock, although mostly rewritten by me. I didn't do
> much with it other than boot and build a kernel. But I like them because
> of the much better worst case preformance.

This series passes a short locktorture test when based on top of current
tip/core/locking.

But don't read too much into this...  This was in an 8-CPU KVM guest on
x86, and locktorture is still a bit on the lame side.  But you have to
start somewhere!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2014-02-26 21:40 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-10 19:58 [PATCH 0/8] locking/core patches Peter Zijlstra
2014-02-10 19:58 ` [PATCH 1/8] locking: Move mcs_spinlock.h into kernel/locking/ Peter Zijlstra
2014-02-10 19:58 ` [PATCH 2/8] mutex: In mutex_can_spin_on_owner(), return false if task need_resched() Peter Zijlstra
2014-02-10 21:02   ` Peter Zijlstra
2014-02-10 19:58 ` [PATCH 3/8] mutex: Modify the way optimistic spinners are queued Peter Zijlstra
2014-02-11  1:33   ` Jason Low
2014-02-11  7:20     ` Peter Zijlstra
2014-02-10 19:58 ` [PATCH 4/8] mutex: Unlock the mutex without the wait_lock Peter Zijlstra
2014-02-10 19:58 ` [PATCH 5/8] locking, mutex: Cancelable MCS lock for adaptive spinning Peter Zijlstra
2014-02-10 21:15   ` Jason Low
2014-02-10 21:32     ` Peter Zijlstra
2014-02-10 22:04       ` Jason Low
2014-02-11  9:18         ` Peter Zijlstra
2014-02-11  9:38           ` Ingo Molnar
2014-02-25 19:56   ` Jason Low
2014-02-26  9:22     ` Peter Zijlstra
2014-02-26 17:45       ` Jason Low
2014-02-10 19:58 ` [PATCH 6/8] mutex: Extra reschedule point Peter Zijlstra
2014-02-10 22:59   ` Andrew Morton
2014-02-10 19:58 ` [PATCH 7/8] locking: Introduce qrwlock Peter Zijlstra
2014-02-11 18:17   ` Waiman Long
2014-02-11 20:12     ` Waiman Long
2014-02-13 16:35       ` Peter Zijlstra
2014-02-13 17:26         ` Peter Zijlstra
2014-02-14 19:01           ` Waiman Long
2014-02-14 18:48         ` Waiman Long
2014-02-10 19:58 ` [PATCH 8/8] x86,locking: Enable qrwlock Peter Zijlstra
2014-02-10 23:02 ` [PATCH 0/8] locking/core patches Andrew Morton
2014-02-11  7:17   ` Peter Zijlstra
2014-02-11  8:03     ` Andrew Morton
2014-02-11  8:45       ` Ingo Molnar
2014-02-11  8:57         ` Peter Zijlstra
2014-02-11 21:37           ` Waiman Long
2014-02-25 19:26   ` Jason Low
2014-02-26 21:40 ` Paul E. McKenney
