* [PATCH 0/4] locking/qrwlock: Improve qrwlock performance
@ 2015-07-06 15:43 Waiman Long
  2015-07-06 15:43 ` [PATCH 1/4] locking/qrwlock: Better optimization for interrupt context readers Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Waiman Long @ 2015-07-06 15:43 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, Thomas Gleixner
  Cc: linux-arch, linux-kernel, Will Deacon, Scott J Norton,
	Douglas Hatch, Waiman Long

In converting some existing spinlocks to rwlocks, it was found that
the write lock slowpath performance isn't as good as that of the
qspinlock. With a workload that added a large number of inodes to the
superblock and was rate-limited by the inode_sb_list_lock, converting
that spinlock into a write lock increased the workload's runtime from
about 22s to 36s.

This patch series tries to squeeze out as much performance as possible
to close the gap between qspinlock and qrwlock. With all the patches
applied, the workload runtime improves to about 24-25s, which is much
better than before, though still a bit slower than with the spinlock.

With this patch series in place, we can start converting some spinlocks
back to rwlocks where it makes sense and the lock size increase isn't
a concern.

Waiman Long (4):
  locking/qrwlock: Better optimization for interrupt context readers
  locking/qrwlock: Reduce reader/writer to reader lock transfer latency
  locking/qrwlock: Reduce writer to writer lock transfer latency
  locking/qrwlock: Use direct MCS lock/unlock in slowpath

 arch/x86/include/asm/qrwlock.h      |    4 +
 include/asm-generic/qrwlock.h       |    4 +-
 include/asm-generic/qrwlock_types.h |   26 ++++-
 kernel/locking/qrwlock.c            |  185 ++++++++++++++++++++++++++--------
 kernel/locking/qspinlock.c          |    9 +-
 5 files changed, 174 insertions(+), 54 deletions(-)



* [PATCH 1/4] locking/qrwlock: Better optimization for interrupt context readers
  2015-07-06 15:43 [PATCH 0/4] locking/qrwlock: Improve qrwlock performance Waiman Long
@ 2015-07-06 15:43 ` Waiman Long
  2015-07-06 15:43 ` [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2015-07-06 15:43 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, Thomas Gleixner
  Cc: linux-arch, linux-kernel, Will Deacon, Scott J Norton,
	Douglas Hatch, Waiman Long

The qrwlock is fair in process context, but becomes unfair in interrupt
context to support use cases like the tasklist_lock.

The current code isn't well documented as to what happens in interrupt
context. The rspin_until_writer_unlock() function will only spin if
the writer has actually gotten the lock. If the writer is still in the
waiting state, the increment of the reader count will keep the writer
in the waiting state and the new interrupt context reader will get the
lock and return immediately. The current code, however, does an
additional read of the lock value which is not necessary, as that
information is already available in the fast path. This may sometimes
cause an additional cacheline transfer when the lock is highly
contended.
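
For reference, the helper in question looks roughly like this in the
current kernel/locking/qrwlock.c (sketch); it only spins while a
writer actually holds the lock:

        static __always_inline void
        rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
        {
                /* _QW_WAITING or no writer at all: don't spin */
                while ((cnts & _QW_WMASK) == _QW_LOCKED) {
                        cpu_relax_lowlatency();
                        cnts = smp_load_acquire((u32 *)&lock->cnts);
                }
        }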

This patch passes the lock value obtained in the fast path down to the
slow path to eliminate the additional read. It also documents the
behavior of interrupt context readers more clearly.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 include/asm-generic/qrwlock.h |    4 ++--
 kernel/locking/qrwlock.c      |   13 +++++++------
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 6383d54..865d021 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -36,7 +36,7 @@
 /*
  * External function declarations
  */
-extern void queue_read_lock_slowpath(struct qrwlock *lock);
+extern void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts);
 extern void queue_write_lock_slowpath(struct qrwlock *lock);
 
 /**
@@ -105,7 +105,7 @@ static inline void queue_read_lock(struct qrwlock *lock)
 		return;
 
 	/* The slowpath will decrement the reader count, if necessary. */
-	queue_read_lock_slowpath(lock);
+	queue_read_lock_slowpath(lock, cnts);
 }
 
 /**
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 6c5da48..81bae99 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -62,20 +62,21 @@ rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
 /**
  * queue_read_lock_slowpath - acquire read lock of a queue rwlock
  * @lock: Pointer to queue rwlock structure
+ * @cnts: Current qrwlock lock value
  */
-void queue_read_lock_slowpath(struct qrwlock *lock)
+void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 {
-	u32 cnts;
-
 	/*
 	 * Readers come here when they cannot get the lock without waiting
 	 */
 	if (unlikely(in_interrupt())) {
 		/*
-		 * Readers in interrupt context will spin until the lock is
-		 * available without waiting in the queue.
+		 * Readers in interrupt context will get the lock immediately
+		 * if the writer is just waiting (not holding the lock yet).
+		 * The rspin_until_writer_unlock() function returns immediately
+		 * in this case. Otherwise, they will spin until the lock
+		 * is available without waiting in the queue.
 		 */
-		cnts = smp_load_acquire((u32 *)&lock->cnts);
 		rspin_until_writer_unlock(lock, cnts);
 		return;
 	}
-- 
1.7.1



* [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency
  2015-07-06 15:43 [PATCH 0/4] locking/qrwlock: Improve qrwlock performance Waiman Long
  2015-07-06 15:43 ` [PATCH 1/4] locking/qrwlock: Better optimization for interrupt context readers Waiman Long
@ 2015-07-06 15:43 ` Waiman Long
  2015-07-06 18:23   ` Will Deacon
  2015-07-06 15:43 ` [PATCH 3/4] locking/qrwlock: Reduce writer to writer " Waiman Long
  2015-07-06 15:43 ` [PATCH 4/4] locking/qrwlock: Use direct MCS lock/unlock in slowpath Waiman Long
  3 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2015-07-06 15:43 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, Thomas Gleixner
  Cc: linux-arch, linux-kernel, Will Deacon, Scott J Norton,
	Douglas Hatch, Waiman Long

Currently, a reader will first check to make sure that the writer mode
byte is cleared before incrementing the reader count. That waiting is
not really necessary. It increases the latency of the reader/writer
to reader transition and reduces reader performance.

This patch eliminates that waiting. It also has the side effect
of reducing the chance of writer lock stealing and improving the
fairness of the lock. Using a locking microbenchmark, a 10-thread, 5M
locking loop of mostly readers (RW ratio = 10,000:1) has the following
performance numbers on a Haswell-EX box:

        Kernel          Locking Rate (Kops/s)
        ------          ---------------------
        4.1.1               15,063,081
        Patched 4.1.1       17,241,552

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 kernel/locking/qrwlock.c |   12 ++++--------
 1 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 81bae99..ecd2d19 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -88,15 +88,11 @@ void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 	arch_spin_lock(&lock->lock);
 
 	/*
-	 * At the head of the wait queue now, wait until the writer state
-	 * goes to 0 and then try to increment the reader count and get
-	 * the lock. It is possible that an incoming writer may steal the
-	 * lock in the interim, so it is necessary to check the writer byte
-	 * to make sure that the write lock isn't taken.
+	 * At the head of the wait queue now, increment the reader count
+	 * and wait until the writer, if it has the lock, has gone away.
+	 * At ths stage, it is not possible for a writer to remain in the
+	 * waiting state (_QW_WAITING). So there won't be any deadlock.
 	 */
-	while (atomic_read(&lock->cnts) & _QW_WMASK)
-		cpu_relax_lowlatency();
-
 	cnts = atomic_add_return(_QR_BIAS, &lock->cnts) - _QR_BIAS;
 	rspin_until_writer_unlock(lock, cnts);
 
-- 
1.7.1



* [PATCH 3/4] locking/qrwlock: Reduce writer to writer lock transfer latency
  2015-07-06 15:43 [PATCH 0/4] locking/qrwlock: Improve qrwlock performance Waiman Long
  2015-07-06 15:43 ` [PATCH 1/4] locking/qrwlock: Better optimization for interrupt context readers Waiman Long
  2015-07-06 15:43 ` [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency Waiman Long
@ 2015-07-06 15:43 ` Waiman Long
  2015-07-06 15:43 ` [PATCH 4/4] locking/qrwlock: Use direct MCS lock/unlock in slowpath Waiman Long
  3 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2015-07-06 15:43 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, Thomas Gleixner
  Cc: linux-arch, linux-kernel, Will Deacon, Scott J Norton,
	Douglas Hatch, Waiman Long

In most cases, a writer acquires the lock in two steps - first setting
the writer mode byte to _QW_WAITING and then to _QW_LOCKED - so two
atomic operations are required. This two-step dance is only needed if
readers are present. This patch modifies the logic so that a writer
will keep trying to acquire the lock in a single step until it sees
some readers.
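
The two-step dance referred to above is roughly the following (a
sketch of the two retry loops in the existing slowpath, with l being
the byte-level view of the lock):

        /* step 1: announce a pending writer by setting the mode byte */
        if (!READ_ONCE(l->wmode) &&
            (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
                break;

        /* step 2: once the readers drain, upgrade to the locked state */
        if ((atomic_read(&lock->cnts) == _QW_WAITING) &&
            (atomic_cmpxchg(&lock->cnts, _QW_WAITING,
                            _QW_LOCKED) == _QW_WAITING))
                break;

This patch only lets the writer retry the direct 0 -> _QW_LOCKED
transition for longer before falling back to that dance.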

Using a locking microbenchmark, a 10-thread, 5M locking loop of only
writers has the following performance numbers on a Haswell-EX box:

        Kernel          Locking Rate (Kops/s)
        ------          ---------------------
        4.1.1               11,939,648
        Patched 4.1.1       12,906,593

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 kernel/locking/qrwlock.c |   20 +++++++++++++-------
 1 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index ecd2d19..87e2d6b 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -109,15 +109,22 @@ EXPORT_SYMBOL(queue_read_lock_slowpath);
  */
 void queue_write_lock_slowpath(struct qrwlock *lock)
 {
-	u32 cnts;
-
 	/* Put the writer into the wait queue */
 	arch_spin_lock(&lock->lock);
 
 	/* Try to acquire the lock directly if no reader is present */
-	if (!atomic_read(&lock->cnts) &&
-	    (atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0))
-		goto unlock;
+	for (;;) {
+		u32 cnts = atomic_read(&lock->cnts);
+
+		if (!cnts) {
+			cnts = atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED);
+			if (cnts == 0)
+				goto unlock;
+		}
+		if (cnts & ~_QW_WMASK)
+			break;	/* Reader is present */
+		cpu_relax_lowlatency();
+	}
 
 	/*
 	 * Set the waiting flag to notify readers that a writer is pending,
@@ -135,8 +142,7 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
 
 	/* When no more readers, set the locked flag */
 	for (;;) {
-		cnts = atomic_read(&lock->cnts);
-		if ((cnts == _QW_WAITING) &&
+		if ((atomic_read(&lock->cnts) == _QW_WAITING) &&
 		    (atomic_cmpxchg(&lock->cnts, _QW_WAITING,
 				    _QW_LOCKED) == _QW_WAITING))
 			break;
-- 
1.7.1



* [PATCH 4/4] locking/qrwlock: Use direct MCS lock/unlock in slowpath
  2015-07-06 15:43 [PATCH 0/4] locking/qrwlock: Improve qrwlock performance Waiman Long
                   ` (2 preceding siblings ...)
  2015-07-06 15:43 ` [PATCH 3/4] locking/qrwlock: Reduce writer to writer " Waiman Long
@ 2015-07-06 15:43 ` Waiman Long
  2015-07-07 11:24   ` Peter Zijlstra
  3 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2015-07-06 15:43 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, Thomas Gleixner
  Cc: linux-arch, linux-kernel, Will Deacon, Scott J Norton,
	Douglas Hatch, Waiman Long

Lock waiting in the qrwlock uses the spinlock (qspinlock for x86)
as the waiting queue. This is slower than using the MCS lock directly
because the extra level of indirection causes more atomic operations
to be used, as well as two waiting threads spinning on the lock
cacheline instead of only one.

This patch changes the lock waiting code to use the MCS lock/unlock
directly on bare metal, while keeping the spinlock in VM guests to
take advantage of the pvqspinlock and unfair lock functionality of the
qspinlock code.
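
In rough terms, both slowpaths go from queueing on the embedded
spinlock to queueing on shared per-cpu MCS nodes (sketch, bare metal
case):

        /* before: every waiter funnels through the embedded spinlock,
         * which on x86 is itself a qspinlock with its own MCS queue */
        arch_spin_lock(&lock->lock);
        ...
        arch_spin_unlock(&lock->lock);

        /* after: use the MCS nodes directly; qlock()/qunlock() fall
         * back to the spinlock when running under a hypervisor */
        tail = qlock(lock, &node);
        ...
        qunlock(lock, node, tail);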

With this patch, we saw further improvements in reader and writer
performance on a Haswell-EX box using a locking microbenchmark.

Before patch:

        Locker          Locking Rate (Kops/s)
        ------          ---------------------
        reader              17,241,552
        writer              12,906,593

After patch:

        Locker          Locking Rate (Kops/s)
        ------          ---------------------
        reader              17,436,786
        writer              14,394,326

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/include/asm/qrwlock.h      |    4 +
 include/asm-generic/qrwlock_types.h |   26 ++++++-
 kernel/locking/qrwlock.c            |  142 +++++++++++++++++++++++++++++------
 kernel/locking/qspinlock.c          |    9 +-
 4 files changed, 149 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/qrwlock.h b/arch/x86/include/asm/qrwlock.h
index ae0e241..7fab5ad 100644
--- a/arch/x86/include/asm/qrwlock.h
+++ b/arch/x86/include/asm/qrwlock.h
@@ -3,6 +3,10 @@
 
 #include <asm-generic/qrwlock_types.h>
 
+#if defined(CONFIG_HYPERVISOR_GUEST) && !defined(static_cpu_has_hypervisor)
+#define static_cpu_has_hypervisor	static_cpu_has(X86_FEATURE_HYPERVISOR)
+#endif
+
 #ifndef CONFIG_X86_PPRO_FENCE
 #define queue_write_unlock queue_write_unlock
 static inline void queue_write_unlock(struct qrwlock *lock)
diff --git a/include/asm-generic/qrwlock_types.h b/include/asm-generic/qrwlock_types.h
index 4d76f24..8efd4b9 100644
--- a/include/asm-generic/qrwlock_types.h
+++ b/include/asm-generic/qrwlock_types.h
@@ -3,19 +3,37 @@
 
 #include <linux/types.h>
 #include <asm/spinlock_types.h>
+#include <asm/byteorder.h>
 
 /*
  * The queue read/write lock data structure
  */
 
 typedef struct qrwlock {
-	atomic_t		cnts;
-	arch_spinlock_t		lock;
+	union {
+		atomic_t	cnts;
+		struct {
+#ifdef __LITTLE_ENDIAN
+			u8	wmode;		/* Writer mode   */
+			u8	rcnts[3];	/* Reader counts */
+#else
+			u8	rcnts[3];	/* Reader counts */
+			u8	wmode;		/* Writer mode   */
+#endif
+		};
+	};
+	union {
+		u32		tail;
+		arch_spinlock_t	lock;
+	};
 } arch_rwlock_t;
 
-#define	__ARCH_RW_LOCK_UNLOCKED {		\
+/*
+ * Assuming that the spinlock is also initialized to 0.
+ */
+#define __ARCH_RW_LOCK_UNLOCKED {		\
 	.cnts = ATOMIC_INIT(0),			\
-	.lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
+	.tail = 0,				\
 }
 
 #endif /* __ASM_GENERIC_QRWLOCK_TYPES_H */
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 87e2d6b..d3e40c1 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -11,7 +11,7 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
  *
- * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
  *
  * Authors: Waiman Long <waiman.long@hp.com>
  */
@@ -21,26 +21,117 @@
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
 #include <asm/qrwlock.h>
+#include "mcs_spinlock.h"
 
 /*
- * This internal data structure is used for optimizing access to some of
- * the subfields within the atomic_t cnts.
+ * Use the internal qspinlock MCS nodes, if available
  */
-struct __qrwlock {
-	union {
-		atomic_t cnts;
-		struct {
-#ifdef __LITTLE_ENDIAN
-			u8 wmode;	/* Writer mode   */
-			u8 rcnts[3];	/* Reader counts */
+#ifdef CONFIG_QUEUED_SPINLOCKS
+extern struct mcs_spinlock _mcs_qnodes[];
 #else
-			u8 rcnts[3];	/* Reader counts */
-			u8 wmode;	/* Writer mode   */
+static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, _mcs_qnodes[4]);
 #endif
-		};
-	};
-	arch_spinlock_t	lock;
-};
+
+#ifndef static_cpu_has_hypervisor
+#define static_cpu_has_hypervisor	false
+#endif
+
+/*
+ * The following qlock()/qunlock() functions are simplified versions of the
+ * locking code in qspinlock.c. In bare metal, the MCS locking and unlocking
+ * code will be used to minimize the performance difference between qspinlock
+ * and qwriter lock. In VM guest, however, the qspinlock functions will be
+ * called to take advantage of the pvqspinlock and unfair lock functionality
+ * present in the qspinlock code at the expense of a bit of performance.
+ */
+#define TAIL_CPU_OFFSET		2
+#define TAIL_IDX_MASK		3
+
+static inline u32 encode_tail(int cpu, int idx)
+{
+	return ((cpu + 1) << TAIL_CPU_OFFSET) | (idx & TAIL_IDX_MASK);
+}
+
+static inline struct mcs_spinlock *decode_tail(u32 tail)
+{
+	int cpu = (tail >> TAIL_CPU_OFFSET) - 1;
+	int idx = tail & TAIL_IDX_MASK;
+
+	return per_cpu_ptr(&_mcs_qnodes[idx], cpu);
+}
+
+/*
+ * Enter the MCS lock waiting queue
+ */
+static inline u32 qlock(struct qrwlock *lock, struct mcs_spinlock **nptr)
+{
+	struct mcs_spinlock *prev, *node;
+	u32 old, tail;
+	int idx;
+
+	/*
+	 * Use arch_spin_lock() if under hypervisor
+	 */
+	if (static_cpu_has_hypervisor) {
+		arch_spin_lock(&lock->lock);
+		return 0;
+	}
+
+	/*
+	 * MCS node initialization
+	 */
+	node = this_cpu_ptr(&_mcs_qnodes[0]);
+	idx = node->count++;
+	tail = encode_tail(smp_processor_id(), idx);
+	node += idx;
+	node->locked = 0;
+	node->next = NULL;
+
+	old = xchg(&lock->tail, tail);
+
+	if (old) {
+		prev = decode_tail(old);
+		WRITE_ONCE(prev->next, node);
+		arch_mcs_spin_lock_contended(&node->locked);
+	}
+
+	/* Got the lock now */
+	*nptr = node;
+	return tail;
+}
+
+/*
+ * Exit the MCS lock waiting queue
+ */
+static inline void
+qunlock(struct qrwlock *lock, struct mcs_spinlock *node, u32 tail)
+{
+	struct mcs_spinlock *next;
+
+	/*
+	 * Use arch_spin_unlock() if under hypervisor
+	 */
+	if (static_cpu_has_hypervisor) {
+		arch_spin_unlock(&lock->lock);
+		return;
+	}
+
+
+	if ((READ_ONCE(lock->tail) == tail) &&
+	    (cmpxchg(&lock->tail, tail, 0) == tail))
+		goto release;
+	/*
+	 * Contended path; wait for next, release.
+	 */
+	while (!(next = READ_ONCE(node->next)))
+		cpu_relax();
+	arch_mcs_spin_unlock_contended(&next->locked);
+release:
+	/*
+	 * Release the node
+	 */
+	this_cpu_dec(_mcs_qnodes[0].count);
+}
 
 /**
  * rspin_until_writer_unlock - inc reader count & spin until writer is gone
@@ -66,6 +157,10 @@ rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
  */
 void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 {
+	struct mcs_spinlock *node = NULL;
+	u32 tail;
+
+
 	/*
 	 * Readers come here when they cannot get the lock without waiting
 	 */
@@ -85,7 +180,7 @@ void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 	/*
 	 * Put the reader into the wait queue
 	 */
-	arch_spin_lock(&lock->lock);
+	tail = qlock(lock, &node);
 
 	/*
 	 * At the head of the wait queue now, increment the reader count
@@ -99,7 +194,7 @@ void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 	/*
 	 * Signal the next one in queue to become queue head
 	 */
-	arch_spin_unlock(&lock->lock);
+	qunlock(lock, node, tail);
 }
 EXPORT_SYMBOL(queue_read_lock_slowpath);
 
@@ -109,8 +204,9 @@ EXPORT_SYMBOL(queue_read_lock_slowpath);
  */
 void queue_write_lock_slowpath(struct qrwlock *lock)
 {
+	struct mcs_spinlock *node = NULL;
 	/* Put the writer into the wait queue */
-	arch_spin_lock(&lock->lock);
+	u32 tail = qlock(lock, &node);
 
 	/* Try to acquire the lock directly if no reader is present */
 	for (;;) {
@@ -131,10 +227,8 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
 	 * or wait for a previous writer to go away.
 	 */
 	for (;;) {
-		struct __qrwlock *l = (struct __qrwlock *)lock;
-
-		if (!READ_ONCE(l->wmode) &&
-		   (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
+		if (!READ_ONCE(lock->wmode) &&
+		   (cmpxchg(&lock->wmode, 0, _QW_WAITING) == 0))
 			break;
 
 		cpu_relax_lowlatency();
@@ -150,6 +244,6 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
 		cpu_relax_lowlatency();
 	}
 unlock:
-	arch_spin_unlock(&lock->lock);
+	qunlock(lock, node, tail);
 }
 EXPORT_SYMBOL(queue_write_lock_slowpath);
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 38c4920..3812498 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -81,8 +81,9 @@
  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
  *
  * PV doubles the storage and uses the second cacheline for PV state.
+ * The MCS nodes are also shared with qrwlock.
  */
-static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);
+DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, _mcs_qnodes[MAX_NODES]);
 
 /*
  * We must be able to distinguish between no-tail and the tail at 0:0,
@@ -107,7 +108,7 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 	int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
 	int idx = (tail &  _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
 
-	return per_cpu_ptr(&mcs_nodes[idx], cpu);
+	return per_cpu_ptr(&_mcs_qnodes[idx], cpu);
 }
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
@@ -358,7 +359,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * queuing.
 	 */
 queue:
-	node = this_cpu_ptr(&mcs_nodes[0]);
+	node = this_cpu_ptr(&_mcs_qnodes[0]);
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
 
@@ -446,7 +447,7 @@ release:
 	/*
 	 * release the node
 	 */
-	this_cpu_dec(mcs_nodes[0].count);
+	this_cpu_dec(_mcs_qnodes[0].count);
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
-- 
1.7.1



* Re: [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency
  2015-07-06 15:43 ` [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency Waiman Long
@ 2015-07-06 18:23   ` Will Deacon
  2015-07-06 19:49     ` Waiman Long
  0 siblings, 1 reply; 19+ messages in thread
From: Will Deacon @ 2015-07-06 18:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, Thomas Gleixner,
	linux-arch, linux-kernel, Scott J Norton, Douglas Hatch

Hi Waiman,

On Mon, Jul 06, 2015 at 04:43:04PM +0100, Waiman Long wrote:
> Currently, a reader will check first to make sure that the writer mode
> byte is cleared before incrementing the reader count. That waiting is
> not really necessary. It increases the latency in the reader/writer
> to reader transition and reduces readers performance.
> 
> This patch eliminates that waiting. It also has the side effect
> of reducing the chance of writer lock stealing and improving the
> fairness of the lock. Using a locking microbenchmark, a 10-threads 5M
> locking loop of mostly readers (RW ratio = 10,000:1) has the following
> performance numbers in a Haswell-EX box:
> 
>         Kernel          Locking Rate (Kops/s)
>         ------          ---------------------
>         4.1.1               15,063,081
>         Patched 4.1.1       17,241,552
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>

I've just finished rebasing my arm64 qrwlock stuff, but I think it will
conflict with these patches. Do you mind if I post them for review anyway,
so we can at least co-ordinate our efforts?

> ---
>  kernel/locking/qrwlock.c |   12 ++++--------
>  1 files changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> index 81bae99..ecd2d19 100644
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> @@ -88,15 +88,11 @@ void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
>  	arch_spin_lock(&lock->lock);
>  
>  	/*
> -	 * At the head of the wait queue now, wait until the writer state
> -	 * goes to 0 and then try to increment the reader count and get
> -	 * the lock. It is possible that an incoming writer may steal the
> -	 * lock in the interim, so it is necessary to check the writer byte
> -	 * to make sure that the write lock isn't taken.
> +	 * At the head of the wait queue now, increment the reader count
> +	 * and wait until the writer, if it has the lock, has gone away.
> +	 * At ths stage, it is not possible for a writer to remain in the
> +	 * waiting state (_QW_WAITING). So there won't be any deadlock.
>  	 */
> -	while (atomic_read(&lock->cnts) & _QW_WMASK)
> -		cpu_relax_lowlatency();

Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
rmode until it hits zero?

Will


* Re: [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency
  2015-07-06 18:23   ` Will Deacon
@ 2015-07-06 19:49     ` Waiman Long
  2015-07-07  9:17       ` Will Deacon
  0 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2015-07-06 19:49 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, Thomas Gleixner,
	linux-arch, linux-kernel, Scott J Norton, Douglas Hatch

On 07/06/2015 02:23 PM, Will Deacon wrote:
> Hi Waiman,
>
> On Mon, Jul 06, 2015 at 04:43:04PM +0100, Waiman Long wrote:
>> Currently, a reader will check first to make sure that the writer mode
>> byte is cleared before incrementing the reader count. That waiting is
>> not really necessary. It increases the latency in the reader/writer
>> to reader transition and reduces readers performance.
>>
>> This patch eliminates that waiting. It also has the side effect
>> of reducing the chance of writer lock stealing and improving the
>> fairness of the lock. Using a locking microbenchmark, a 10-threads 5M
>> locking loop of mostly readers (RW ratio = 10,000:1) has the following
>> performance numbers in a Haswell-EX box:
>>
>>          Kernel          Locking Rate (Kops/s)
>>          ------          ---------------------
>>          4.1.1               15,063,081
>>          Patched 4.1.1       17,241,552
>>
>> Signed-off-by: Waiman Long<Waiman.Long@hp.com>
> I've just finished rebasing my arm64 qrwlock stuff, but I think it will
> conflict with these patches. Do you mind if I post them for review anyway,
> so we can at least co-ordinate our efforts?

Yes, sure. I would also like to coordinate my changes with yours to 
minimize conflict. BTW, I just got 2 tip-bot messages about the commits:

    locking/qrwlock:  Better optimization for interrupt context readers
    locking/qrwlock:  Rename functions to queued_*()

So I need to rebase my patches also.

>> ---
>>   kernel/locking/qrwlock.c |   12 ++++--------
>>   1 files changed, 4 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
>> index 81bae99..ecd2d19 100644
>> --- a/kernel/locking/qrwlock.c
>> +++ b/kernel/locking/qrwlock.c
>> @@ -88,15 +88,11 @@ void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
>>   	arch_spin_lock(&lock->lock);
>>
>>   	/*
>> -	 * At the head of the wait queue now, wait until the writer state
>> -	 * goes to 0 and then try to increment the reader count and get
>> -	 * the lock. It is possible that an incoming writer may steal the
>> -	 * lock in the interim, so it is necessary to check the writer byte
>> -	 * to make sure that the write lock isn't taken.
>> +	 * At the head of the wait queue now, increment the reader count
>> +	 * and wait until the writer, if it has the lock, has gone away.
>> +	 * At ths stage, it is not possible for a writer to remain in the
>> +	 * waiting state (_QW_WAITING). So there won't be any deadlock.
>>   	 */
>> -	while (atomic_read(&lock->cnts)&  _QW_WMASK)
>> -		cpu_relax_lowlatency();
> Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
> from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
> rmode until it hits zero?

No, this is how we make the lock fair so that an incoming stream of
later readers won't block a writer from getting the lock.

Cheers,
Longman


* Re: [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency
  2015-07-06 19:49     ` Waiman Long
@ 2015-07-07  9:17       ` Will Deacon
  2015-07-07 11:17         ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Will Deacon @ 2015-07-07  9:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, Thomas Gleixner,
	linux-arch, linux-kernel, Scott J Norton, Douglas Hatch

On Mon, Jul 06, 2015 at 08:49:33PM +0100, Waiman Long wrote:
> On 07/06/2015 02:23 PM, Will Deacon wrote:
> > I've just finished rebasing my arm64 qrwlock stuff, but I think it will
> > conflict with these patches. Do you mind if I post them for review anyway,
> > so we can at least co-ordinate our efforts?
> 
> Yes, sure. I would also like to coordinate my changes with yours to 
> minimize conflict. BTW, I just got 2 tip-bot messages about the commits:
> 
>     locking/qrwlock:  Better optimization for interrupt context readers
>     locking/qrwlock:  Rename functions to queued_*()
> 
> So I need to rebase my patches also.

Yeah, I've been carrying those two on my branch as well, but everything
should rebase cleanly.

> >> ---
> >>   kernel/locking/qrwlock.c |   12 ++++--------
> >>   1 files changed, 4 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> >> index 81bae99..ecd2d19 100644
> >> --- a/kernel/locking/qrwlock.c
> >> +++ b/kernel/locking/qrwlock.c
> >> @@ -88,15 +88,11 @@ void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> >>   	arch_spin_lock(&lock->lock);
> >>
> >>   	/*
> >> -	 * At the head of the wait queue now, wait until the writer state
> >> -	 * goes to 0 and then try to increment the reader count and get
> >> -	 * the lock. It is possible that an incoming writer may steal the
> >> -	 * lock in the interim, so it is necessary to check the writer byte
> >> -	 * to make sure that the write lock isn't taken.
> >> +	 * At the head of the wait queue now, increment the reader count
> >> +	 * and wait until the writer, if it has the lock, has gone away.
> >> +	 * At ths stage, it is not possible for a writer to remain in the
> >> +	 * waiting state (_QW_WAITING). So there won't be any deadlock.
> >>   	 */
> >> -	while (atomic_read(&lock->cnts)&  _QW_WMASK)
> >> -		cpu_relax_lowlatency();
> > Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
> > from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
> > rmode until it hits zero?
> 
> No, this is how we make the lock fair so that an incoming streams of 
> later readers won't block a writer from getting the lock.

But won't those readers effectively see that the lock is held for write
(because we set wmode to _QW_LOCKED before the existing reader had drained)
and therefore fall down the slow-path and get held up on the spinlock?

Will


* Re: [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency
  2015-07-07  9:17       ` Will Deacon
@ 2015-07-07 11:17         ` Peter Zijlstra
  2015-07-07 11:49           ` Will Deacon
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2015-07-07 11:17 UTC (permalink / raw)
  To: Will Deacon
  Cc: Waiman Long, Ingo Molnar, Arnd Bergmann, Thomas Gleixner,
	linux-arch, linux-kernel, Scott J Norton, Douglas Hatch

On Tue, Jul 07, 2015 at 10:17:11AM +0100, Will Deacon wrote:
> > > Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
> > > from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
> > > rmode until it hits zero?
> > 
> > No, this is how we make the lock fair so that an incoming streams of 
> > later readers won't block a writer from getting the lock.
> 
> But won't those readers effectively see that the lock is held for write
> (because we set wmode to _QW_LOCKED before the existing reader had drained)
> and therefore fall down the slow-path and get held up on the spinlock?

Yes, that's the entire point. Once there's a writer pending, new readers
should queue too.


* Re: [PATCH 4/4] locking/qrwlock: Use direct MCS lock/unlock in slowpath
  2015-07-06 15:43 ` [PATCH 4/4] locking/qrwlock: Use direct MCS lock/unlock in slowpath Waiman Long
@ 2015-07-07 11:24   ` Peter Zijlstra
  2015-07-07 21:59     ` Waiman Long
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2015-07-07 11:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Arnd Bergmann, Thomas Gleixner, linux-arch,
	linux-kernel, Will Deacon, Scott J Norton, Douglas Hatch

On Mon, Jul 06, 2015 at 11:43:06AM -0400, Waiman Long wrote:
> Lock waiting in the qrwlock uses the spinlock (qspinlock for x86)
> as the waiting queue. This is slower than using MCS lock directly
> because of the extra level of indirection causing more atomics to
> be used as well as 2 waiting threads spinning on the lock cacheline
> instead of only one.

This needs a better explanation. Didn't we find with the qspinlock thing
that the pending spinner improved performance on light loads?

Taking it out seems counterintuitive; we would very much like these two
to be the same.

> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c

> +static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, _mcs_qnodes[4]);


> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -81,8 +81,9 @@
>   * Exactly fits one 64-byte cacheline on a 64-bit architecture.
>   *
>   * PV doubles the storage and uses the second cacheline for PV state.
> + * The MCS nodes are also shared with qrwlock.
>   */
> -static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);
> +DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, _mcs_qnodes[MAX_NODES]);

Except you don't... 


* Re: [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency
  2015-07-07 11:17         ` Peter Zijlstra
@ 2015-07-07 11:49           ` Will Deacon
  2015-07-07 14:30             ` Waiman Long
  0 siblings, 1 reply; 19+ messages in thread
From: Will Deacon @ 2015-07-07 11:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, Arnd Bergmann, Thomas Gleixner,
	linux-arch, linux-kernel, Scott J Norton, Douglas Hatch

On Tue, Jul 07, 2015 at 12:17:31PM +0100, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 10:17:11AM +0100, Will Deacon wrote:
> > > > Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
> > > > from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
> > > > rmode until it hits zero?
> > > 
> > > No, this is how we make the lock fair so that an incoming streams of 
> > > later readers won't block a writer from getting the lock.
> > 
> > But won't those readers effectively see that the lock is held for write
> > (because we set wmode to _QW_LOCKED before the existing reader had drained)
> > and therefore fall down the slow-path and get held up on the spinlock?
> 
> Yes, that's the entire point. Once there's a writer pending, new readers
> should queue too.

Agreed. My point was that we can achieve the same result without
a separate _QW_WAITING flag afaict.

Will



* Re: [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency
  2015-07-07 11:49           ` Will Deacon
@ 2015-07-07 14:30             ` Waiman Long
  2015-07-07 17:27               ` Will Deacon
  0 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2015-07-07 14:30 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, Thomas Gleixner,
	linux-arch, linux-kernel, Scott J Norton, Douglas Hatch

On 07/07/2015 07:49 AM, Will Deacon wrote:
> On Tue, Jul 07, 2015 at 12:17:31PM +0100, Peter Zijlstra wrote:
>> On Tue, Jul 07, 2015 at 10:17:11AM +0100, Will Deacon wrote:
>>>>> Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
>>>>> from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
>>>>> rmode until it hits zero?
>>>> No, this is how we make the lock fair so that an incoming streams of
>>>> later readers won't block a writer from getting the lock.
>>> But won't those readers effectively see that the lock is held for write
>>> (because we set wmode to _QW_LOCKED before the existing reader had drained)
>>> and therefore fall down the slow-path and get held up on the spinlock?
>> Yes, that's the entire point. Once there's a writer pending, new readers
>> should queue too.
> Agreed. My point was that we can achieve the same result without
> a separate _QW_WAITING flag afaict.
>
> Will
>

_QW_WAITING and _QW_LOCKED have different semantics and are necessary
for the proper handshake between readers and the writer. We set
_QW_WAITING when readers own the lock and the writer is waiting for
the readers to go away. The _QW_WAITING flag forces new readers into
the queue while the writer is waiting. We set _QW_LOCKED when a writer
owns the lock, and it can only be set atomically when no reader is
present. Without the intermediate _QW_WAITING step, a continuous
stream of incoming readers (which keeps the reader count from ever
reaching 0) could deny a writer the lock indefinitely.
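
Roughly, on the reader side (a sketch of the existing fast path):

        cnts = atomic_add_return(_QR_BIAS, &lock->cnts);
        if (likely(!(cnts & _QW_WMASK)))
                return;                        /* read lock acquired */
        queue_read_lock_slowpath(lock, cnts);  /* writer seen: queue */

Only a writer that actually holds the lock (wmode == _QW_LOCKED) makes
rspin_until_writer_unlock() spin; _QW_WAITING diverts new readers into
the queue while still letting an interrupt context reader that has
already added its _QR_BIAS take the lock.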

Cheers,
Longman


* Re: [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency
  2015-07-07 14:30             ` Waiman Long
@ 2015-07-07 17:27               ` Will Deacon
  2015-07-07 18:10                 ` Will Deacon
  0 siblings, 1 reply; 19+ messages in thread
From: Will Deacon @ 2015-07-07 17:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, Thomas Gleixner,
	linux-arch, linux-kernel, Scott J Norton, Douglas Hatch

On Tue, Jul 07, 2015 at 03:30:22PM +0100, Waiman Long wrote:
> On 07/07/2015 07:49 AM, Will Deacon wrote:
> > On Tue, Jul 07, 2015 at 12:17:31PM +0100, Peter Zijlstra wrote:
> >> On Tue, Jul 07, 2015 at 10:17:11AM +0100, Will Deacon wrote:
> >>>>> Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
> >>>>> from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
> >>>>> rmode until it hits zero?
> >>>> No, this is how we make the lock fair so that an incoming streams of
> >>>> later readers won't block a writer from getting the lock.
> >>> But won't those readers effectively see that the lock is held for write
> >>> (because we set wmode to _QW_LOCKED before the existing reader had drained)
> >>> and therefore fall down the slow-path and get held up on the spinlock?
> >> Yes, that's the entire point. Once there's a writer pending, new readers
> >> should queue too.
> > Agreed. My point was that we can achieve the same result without
> > a separate _QW_WAITING flag afaict.
> 
> _QW_WAITING and _QW_LOCKED has different semantics and are necessary for 
> the proper handshake between readers and writer. We set _QW_WAITING when 
> readers own the lock and the writer is waiting for the readers to go 
> away. The _QW_WAITING flag will force new readers to go to queuing while 
> the writer is waiting. We set _QW_LOCKED when a writer own the lock and 
> it can only be set atomically when no reader is present. Without the 
> intermediate _QW_WAITING step, a continuous stream of incoming readers 
> (which make the reader count never 0) could deny a writer from getting 
> the lock indefinitely.

It's probably best if I try to implement something and we can either pick
holes in the patch or I'll realise why I'm wrong in the process :)

Will


* Re: [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency
  2015-07-07 17:27               ` Will Deacon
@ 2015-07-07 18:10                 ` Will Deacon
  2015-07-07 21:29                   ` Waiman Long
  0 siblings, 1 reply; 19+ messages in thread
From: Will Deacon @ 2015-07-07 18:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, Thomas Gleixner,
	linux-arch, linux-kernel, Scott J Norton, Douglas Hatch

On Tue, Jul 07, 2015 at 06:27:18PM +0100, Will Deacon wrote:
> On Tue, Jul 07, 2015 at 03:30:22PM +0100, Waiman Long wrote:
> > On 07/07/2015 07:49 AM, Will Deacon wrote:
> > > On Tue, Jul 07, 2015 at 12:17:31PM +0100, Peter Zijlstra wrote:
> > >> On Tue, Jul 07, 2015 at 10:17:11AM +0100, Will Deacon wrote:
> > >>>>> Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
> > >>>>> from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
> > >>>>> rmode until it hits zero?
> > >>>> No, this is how we make the lock fair so that an incoming streams of
> > >>>> later readers won't block a writer from getting the lock.
> > >>> But won't those readers effectively see that the lock is held for write
> > >>> (because we set wmode to _QW_LOCKED before the existing reader had drained)
> > >>> and therefore fall down the slow-path and get held up on the spinlock?
> > >> Yes, that's the entire point. Once there's a writer pending, new readers
> > >> should queue too.
> > > Agreed. My point was that we can achieve the same result without
> > > a separate _QW_WAITING flag afaict.
> > 
> > _QW_WAITING and _QW_LOCKED has different semantics and are necessary for 
> > the proper handshake between readers and writer. We set _QW_WAITING when 
> > readers own the lock and the writer is waiting for the readers to go 
> > away. The _QW_WAITING flag will force new readers to go to queuing while 
> > the writer is waiting. We set _QW_LOCKED when a writer own the lock and 
> > it can only be set atomically when no reader is present. Without the 
> > intermediate _QW_WAITING step, a continuous stream of incoming readers 
> > (which make the reader count never 0) could deny a writer from getting 
> > the lock indefinitely.
> 
> It's probably best if I try to implement something and we can either pick
> holes in the patch or I'll realise why I'm wrong in the process :)

Hmm, wasn't very enlightening. What's wrong with the following?

Will

--->8

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index deb9e8b0eb9e..be8dc5c6fdbd 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -27,7 +27,6 @@
 /*
  * Writer states & reader shift and bias
  */
-#define        _QW_WAITING     1               /* A writer is waiting     */
 #define        _QW_LOCKED      0xff            /* A writer holds the lock */
 #define        _QW_WMASK       0xff            /* Writer mask             */
 #define        _QR_SHIFT       8               /* Reader count shift      */
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 9f644933f6d4..4006aa1fbd0b 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -127,28 +127,23 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
        }
 
        /*
-        * Set the waiting flag to notify readers that a writer is pending,
-        * or wait for a previous writer to go away.
+        * Wait for a previous writer to go away, then set the locked
+        * flag to notify future readers/writers that we are pending.
         */
        for (;;) {
                struct __qrwlock *l = (struct __qrwlock *)lock;
 
                if (!READ_ONCE(l->wmode) &&
-                  (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
+                  (cmpxchg(&l->wmode, 0, _QW_LOCKED) == 0))
                        break;
 
                cpu_relax_lowlatency();
        }
 
-       /* When no more readers, set the locked flag */
-       for (;;) {
-               if ((atomic_read(&lock->cnts) == _QW_WAITING) &&
-                   (atomic_cmpxchg(&lock->cnts, _QW_WAITING,
-                                   _QW_LOCKED) == _QW_WAITING))
-                       break;
-
+       /* Wait for the readers to drain */
+       while (smp_load_acquire((u32 *)&lock->cnts) & ~_QW_WMASK)
                cpu_relax_lowlatency();
-       }
+
 unlock:
        arch_spin_unlock(&lock->lock);
 }


* Re: [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency
  2015-07-07 18:10                 ` Will Deacon
@ 2015-07-07 21:29                   ` Waiman Long
  2015-07-08  9:52                     ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2015-07-07 21:29 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, Thomas Gleixner,
	linux-arch, linux-kernel, Scott J Norton, Douglas Hatch

On 07/07/2015 02:10 PM, Will Deacon wrote:
> On Tue, Jul 07, 2015 at 06:27:18PM +0100, Will Deacon wrote:
>> On Tue, Jul 07, 2015 at 03:30:22PM +0100, Waiman Long wrote:
>>> On 07/07/2015 07:49 AM, Will Deacon wrote:
>>>> On Tue, Jul 07, 2015 at 12:17:31PM +0100, Peter Zijlstra wrote:
>>>>> On Tue, Jul 07, 2015 at 10:17:11AM +0100, Will Deacon wrote:
>>>>>>>> Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
>>>>>>>> from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
>>>>>>>> rmode until it hits zero?
>>>>>>> No, this is how we make the lock fair so that an incoming streams of
>>>>>>> later readers won't block a writer from getting the lock.
>>>>>> But won't those readers effectively see that the lock is held for write
>>>>>> (because we set wmode to _QW_LOCKED before the existing reader had drained)
>>>>>> and therefore fall down the slow-path and get held up on the spinlock?
>>>>> Yes, that's the entire point. Once there's a writer pending, new readers
>>>>> should queue too.
>>>> Agreed. My point was that we can achieve the same result without
>>>> a separate _QW_WAITING flag afaict.
>>> _QW_WAITING and _QW_LOCKED has different semantics and are necessary for
>>> the proper handshake between readers and writer. We set _QW_WAITING when
>>> readers own the lock and the writer is waiting for the readers to go
>>> away. The _QW_WAITING flag will force new readers to go to queuing while
>>> the writer is waiting. We set _QW_LOCKED when a writer own the lock and
>>> it can only be set atomically when no reader is present. Without the
>>> intermediate _QW_WAITING step, a continuous stream of incoming readers
>>> (which make the reader count never 0) could deny a writer from getting
>>> the lock indefinitely.
>> It's probably best if I try to implement something and we can either pick
>> holes in the patch or I'll realise why I'm wrong in the process :)
> Hmm, wasn't very enlightening. What's wrong with the following?
>
> Will
>
> --->8
>
> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> index deb9e8b0eb9e..be8dc5c6fdbd 100644
> --- a/include/asm-generic/qrwlock.h
> +++ b/include/asm-generic/qrwlock.h
> @@ -27,7 +27,6 @@
>   /*
>    * Writer states&  reader shift and bias
>    */
> -#define        _QW_WAITING     1               /* A writer is waiting     */
>   #define        _QW_LOCKED      0xff            /* A writer holds the lock */
>   #define        _QW_WMASK       0xff            /* Writer mask             */
>   #define        _QR_SHIFT       8               /* Reader count shift      */
> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> index 9f644933f6d4..4006aa1fbd0b 100644
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> @@ -127,28 +127,23 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
>          }
>
>          /*
> -        * Set the waiting flag to notify readers that a writer is pending,
> -        * or wait for a previous writer to go away.
> +        * Wait for a previous writer to go away, then set the locked
> +        * flag to notify future readers/writers that we are pending.
>           */
>          for (;;) {
>                  struct __qrwlock *l = (struct __qrwlock *)lock;
>
>                  if (!READ_ONCE(l->wmode)&&
> -                  (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
> +                  (cmpxchg(&l->wmode, 0, _QW_LOCKED) == 0))
>                          break;
>
>                  cpu_relax_lowlatency();
>          }
>
> -       /* When no more readers, set the locked flag */
> -       for (;;) {
> -               if ((atomic_read(&lock->cnts) == _QW_WAITING)&&
> -                   (atomic_cmpxchg(&lock->cnts, _QW_WAITING,
> -                                   _QW_LOCKED) == _QW_WAITING))
> -                       break;
> -
> +       /* Wait for the readers to drain */
> +       while (smp_load_acquire((u32 *)&lock->cnts)&  ~_QW_WMASK)
>                  cpu_relax_lowlatency();
> -       }
> +
>   unlock:
>          arch_spin_unlock(&lock->lock);
>   }

That changes the handshaking protocol. In this case, the readers will
have to decrement their reader count to enable the writer to continue,
and the interrupt context reader code has to be changed. This gives
preference to the writer and puts readers at a disadvantage. I prefer
the current setting; with yours, you can't tell whether the writer has
the lock or not from a snapshot of the lock value. You need the whole
time sequence to figure that out, which is more prone to error.
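
A sketch of the hazard with the proposed change (wmode set to
_QW_LOCKED while readers still hold the lock):

        writer (at queue head)            interrupt context reader
        ----------------------            ------------------------
        cmpxchg wmode, 0 -> _QW_LOCKED
        spin until reader count drains    atomic_add_return(_QR_BIAS, ...)
                                          sees wmode == _QW_LOCKED
                                          spins in rspin_until_writer_unlock()

Neither side can make progress unless the interrupt context reader
first backs off (drops its _QR_BIAS) and retries, which is the change
to the interrupt context reader code I was referring to above.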

Cheers,
Longman


* Re: [PATCH 4/4] locking/qrwlock: Use direct MCS lock/unlock in slowpath
  2015-07-07 11:24   ` Peter Zijlstra
@ 2015-07-07 21:59     ` Waiman Long
  2015-07-07 22:13       ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2015-07-07 21:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnd Bergmann, Thomas Gleixner, linux-arch,
	linux-kernel, Will Deacon, Scott J Norton, Douglas Hatch

On 07/07/2015 07:24 AM, Peter Zijlstra wrote:
> On Mon, Jul 06, 2015 at 11:43:06AM -0400, Waiman Long wrote:
>> Lock waiting in the qrwlock uses the spinlock (qspinlock for x86)
>> as the waiting queue. This is slower than using MCS lock directly
>> because of the extra level of indirection causing more atomics to
>> be used as well as 2 waiting threads spinning on the lock cacheline
>> instead of only one.
> This needs a better explanation. Didn't we find with the qspinlock thing
> that the pending spinner improved performance on light loads?
>
> Taking it out seems counter intuitive, we could very much like these two
> the be the same.

Yes, for the lightly loaded case, using raw_spin_lock should have an
advantage. It is a different matter when the lock is highly contended;
in that case, the extra indirection through the qspinlock makes it
slower. I struggled over whether to duplicate the locking code in
qrwlock, so I sent this patch out to test the water. I won't insist if
you think this is not a good idea, but I do want to get the previous
two patches in, which should not be controversial.

Cheers,
Longman


* Re: [PATCH 4/4] locking/qrwlock: Use direct MCS lock/unlock in slowpath
  2015-07-07 21:59     ` Waiman Long
@ 2015-07-07 22:13       ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2015-07-07 22:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Arnd Bergmann, Thomas Gleixner, linux-arch,
	linux-kernel, Will Deacon, Scott J Norton, Douglas Hatch

On Tue, Jul 07, 2015 at 05:59:59PM -0400, Waiman Long wrote:
> On 07/07/2015 07:24 AM, Peter Zijlstra wrote:
> >On Mon, Jul 06, 2015 at 11:43:06AM -0400, Waiman Long wrote:
> >>Lock waiting in the qrwlock uses the spinlock (qspinlock for x86)
> >>as the waiting queue. This is slower than using MCS lock directly
> >>because of the extra level of indirection causing more atomics to
> >>be used as well as 2 waiting threads spinning on the lock cacheline
> >>instead of only one.
> >This needs a better explanation. Didn't we find with the qspinlock thing
> >that the pending spinner improved performance on light loads?
> >
> >Taking it out seems counter intuitive, we could very much like these two
> >the be the same.
> 
> Yes, for lightly loaded case, using raw_spin_lock should have an advantage.
> It is a different matter when the lock is highly contended. In this case,
> having the indirection in qspinlock will make it slower.

But we should not optimize the lock for the complete saturation case;
if you encounter that, change the locking. The light contention case
is much more important.


* Re: [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency
  2015-07-07 21:29                   ` Waiman Long
@ 2015-07-08  9:52                     ` Peter Zijlstra
  2015-07-08 17:19                       ` Will Deacon
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2015-07-08  9:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Will Deacon, Ingo Molnar, Arnd Bergmann, Thomas Gleixner,
	linux-arch, linux-kernel, Scott J Norton, Douglas Hatch

On Tue, Jul 07, 2015 at 05:29:50PM -0400, Waiman Long wrote:
> On 07/07/2015 02:10 PM, Will Deacon wrote:

> >diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> >index deb9e8b0eb9e..be8dc5c6fdbd 100644
> >--- a/include/asm-generic/qrwlock.h
> >+++ b/include/asm-generic/qrwlock.h
> >@@ -27,7 +27,6 @@
> >  /*
> >   * Writer states&  reader shift and bias
> >   */
> >-#define        _QW_WAITING     1               /* A writer is waiting     */
> >  #define        _QW_LOCKED      0xff            /* A writer holds the lock */
> >  #define        _QW_WMASK       0xff            /* Writer mask             */
> >  #define        _QR_SHIFT       8               /* Reader count shift      */
> >diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> >index 9f644933f6d4..4006aa1fbd0b 100644
> >--- a/kernel/locking/qrwlock.c
> >+++ b/kernel/locking/qrwlock.c
> >@@ -127,28 +127,23 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
> >         }
> >
> >         /*
> >-        * Set the waiting flag to notify readers that a writer is pending,
> >-        * or wait for a previous writer to go away.
> >+        * Wait for a previous writer to go away, then set the locked
> >+        * flag to notify future readers/writers that we are pending.
> >          */
> >         for (;;) {
> >                 struct __qrwlock *l = (struct __qrwlock *)lock;
> >
> >                 if (!READ_ONCE(l->wmode)&&
> >-                  (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
> >+                  (cmpxchg(&l->wmode, 0, _QW_LOCKED) == 0))
> >                         break;
> >
> >                 cpu_relax_lowlatency();
> >         }
> >
> >-       /* When no more readers, set the locked flag */
> >-       for (;;) {
> >-               if ((atomic_read(&lock->cnts) == _QW_WAITING)&&
> >-                   (atomic_cmpxchg(&lock->cnts, _QW_WAITING,
> >-                                   _QW_LOCKED) == _QW_WAITING))
> >-                       break;
> >-
> >+       /* Wait for the readers to drain */
> >+       while (smp_load_acquire((u32 *)&lock->cnts)&  ~_QW_WMASK)
> >                 cpu_relax_lowlatency();
> >-       }
> >+
> >  unlock:
> >         arch_spin_unlock(&lock->lock);
> >  }
> 
> That changes the handshaking protocol. In this case, the readers will have
> to decrement its reader count to enable the writer to continue.

It already needs to, no?

> The interrupt context reader code has to be changed.

Agreed.

> This gives preference to writer and reader will be in a disadvantage.

I don't see that; everybody is still ordered by the wait queue / lock.

> I prefer the current setting as you won't know if the writer has the
> lock or not when you take a snapshot of the value of the lock. You
> need the whole time sequence in this case to figure it out and so will
> be more prone to error.

I still need to wake up, but I suspect we need to change
queue_read_{try,}lock() to use cmpxchg/inc_not_zero like things, which
is fine for ARM, but not so much for x86.
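
Roughly the difference (a sketch, assuming the current fast path):

        /* today: one unconditional xadd on x86 */
        cnts = atomic_add_return(_QR_BIAS, &lock->cnts);
        if (likely(!(cnts & _QW_WMASK)))
                return;
        queue_read_lock_slowpath(lock, cnts);

        /* inc-if-no-writer instead: a cmpxchg loop */
        for (;;) {
                cnts = atomic_read(&lock->cnts);
                if (cnts & _QW_WMASK) {
                        queue_read_lock_slowpath(lock, cnts);
                        return;
                }
                if (atomic_cmpxchg(&lock->cnts, cnts,
                                   cnts + _QR_BIAS) == cnts)
                        return;
        }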

So I think I agree with Waiman, but am willing to be shown differently.


* Re: [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency
  2015-07-08  9:52                     ` Peter Zijlstra
@ 2015-07-08 17:19                       ` Will Deacon
  0 siblings, 0 replies; 19+ messages in thread
From: Will Deacon @ 2015-07-08 17:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, Arnd Bergmann, Thomas Gleixner,
	linux-arch, linux-kernel, Scott J Norton, Douglas Hatch

On Wed, Jul 08, 2015 at 10:52:48AM +0100, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 05:29:50PM -0400, Waiman Long wrote:
> > I prefer the current setting as you won't know if the writer has the
> > lock or not when you take a snapshot of the value of the lock. You
> > need the whole time sequence in this case to figure it out and so will
> > be more prone to error.
> 
> I still need to wake up, but I suspect we need to change
> queue_read_{try,}lock() to use cmpxchg/inc_not_zero like things, which
> is fine for ARM, but not so much for x86.
> 
> So I think I agree with Waiman, but am willing to be shown differently.

That's fine; I just wanted to make sure I wasn't going round the twist!

Will


Thread overview: 19+ messages
2015-07-06 15:43 [PATCH 0/4] locking/qrwlock: Improve qrwlock performance Waiman Long
2015-07-06 15:43 ` [PATCH 1/4] locking/qrwlock: Better optimization for interrupt context readers Waiman Long
2015-07-06 15:43 ` [PATCH 2/4] locking/qrwlock: Reduce reader/writer to reader lock transfer latency Waiman Long
2015-07-06 18:23   ` Will Deacon
2015-07-06 19:49     ` Waiman Long
2015-07-07  9:17       ` Will Deacon
2015-07-07 11:17         ` Peter Zijlstra
2015-07-07 11:49           ` Will Deacon
2015-07-07 14:30             ` Waiman Long
2015-07-07 17:27               ` Will Deacon
2015-07-07 18:10                 ` Will Deacon
2015-07-07 21:29                   ` Waiman Long
2015-07-08  9:52                     ` Peter Zijlstra
2015-07-08 17:19                       ` Will Deacon
2015-07-06 15:43 ` [PATCH 3/4] locking/qrwlock: Reduce writer to writer " Waiman Long
2015-07-06 15:43 ` [PATCH 4/4] locking/qrwlock: Use direct MCS lock/unlock in slowpath Waiman Long
2015-07-07 11:24   ` Peter Zijlstra
2015-07-07 21:59     ` Waiman Long
2015-07-07 22:13       ` Peter Zijlstra
