* [PATCH v5 0/4] qrwlock: Introducing a queue read/write lock implementation
@ 2013-11-04 17:17 Waiman Long
  2013-11-04 17:17 ` [PATCH v5 1/4] qrwlock: A " Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Waiman Long @ 2013-11-04 17:17 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T,
	George Spelvin, Tim Chen, Aswin Chandramouleeswaran,
	Scott J Norton, Waiman Long

v4->v5:
 - Fix wrong definitions for QW_MASK_FAIR & QW_MASK_UNFAIR macros.
 - Add an optional patch 4 which should only be applied after the
   mcs_spinlock.h header file is merged.

v3->v4:
 - Optimize the fast path with better cold cache behavior and
   performance.
 - Remove some testing code.
 - Make x86 use queue rwlock with no user configuration.

v2->v3:
 - Make read lock stealing the default and fair rwlock an option with
   a different initializer.
 - In queue_read_lock_slowpath(), check irq_count() and force spinning
   and lock stealing in interrupt context.
 - Unify the fair and classic read-side code paths, and make the
   write side use cmpxchg with 2 different writer states. This slows
   down the write lock fastpath to make the read side more efficient,
   but it is still slightly faster than a spinlock.

v1->v2:
 - Improve lock fastpath performance.
 - Optionally provide classic read/write lock behavior for backward
   compatibility.
 - Use xadd instead of cmpxchg for the fair reader code path to make
   it immune to reader contention.
 - Run more performance testing.

As mentioned in the LWN article http://lwn.net/Articles/364583/,
the read/write lock suffers from an unfairness problem: it is
possible for a stream of incoming readers to block a waiting writer
from getting the lock for a long time. Also, a waiting reader/writer
contending a rwlock in local memory has a higher chance of acquiring
the lock than a reader/writer on a remote node.

This patch set introduces a queue-based read/write lock implementation
that can largely solve this unfairness problem if the lock owners
choose to use the fair variant of the lock.

The queue rwlock has two variants selected at initialization time
- unfair (with read lock stealing) and fair (to both readers and
writers). The unfair rwlock is the default.
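
For illustration, a minimal usage sketch (the lock names below are
made up, and the *_FAIR initializer is only introduced by patch 3 of
this series):

/* Default (unfair) queue rwlock - readers may steal the lock */
static DEFINE_RWLOCK(cache_lock);

/* Fair variant - readers queue up behind a waiting writer */
static DEFINE_RWLOCK_FAIR(journal_lock);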

The read lock slowpath will check if the reader is in an interrupt
context. If so, it will force lock spinning and stealing without
waiting in a queue. This is to ensure the read lock will be granted
as soon as possible.

Even the unfair rwlock is fairer than the current version, as
writers have a higher chance of getting the lock and the lock is
fair among the writers.

The queue write lock can also be used as a replacement for highly
contended ticket spinlocks if the increase in lock size is not
an issue.
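
For example, a minimal sketch of that usage (the lock and counter
names are made up), where the lock is only ever write-locked so it
behaves like a spinlock:

#include <linux/spinlock.h>

static DEFINE_RWLOCK(stats_lock);	/* only ever write-locked */
static unsigned long stats_count;

static void stats_inc(void)
{
	write_lock(&stats_lock);	/* exclusive, as with spin_lock() */
	stats_count++;
	write_unlock(&stats_lock);
}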

This patch set currently provides queue read/write lock support on
x86 architecture only. Support for other architectures can be added
later on once architecture specific support infrastructure is added
and proper testing is done.

The optional patch 4 has a dependency on the mcs_spinlock.h header
file, which has not been merged yet. So this patch should only be
applied after the mcs_spinlock.h header file is merged.

Waiman Long (4):
  qrwlock: A queue read/write lock implementation
  qrwlock x86: Enable x86 to use queue read/write lock
  qrwlock: Enable fair queue read/write lock
  qrwlock: Use the mcs_spinlock helper functions for MCS queuing

 arch/x86/Kconfig                      |    1 +
 arch/x86/include/asm/spinlock.h       |    2 +
 arch/x86/include/asm/spinlock_types.h |    4 +
 include/asm-generic/qrwlock.h         |  253 +++++++++++++++++++++++++++++++++
 include/linux/rwlock.h                |   15 ++
 include/linux/rwlock_types.h          |   13 ++
 kernel/Kconfig.locks                  |    7 +
 lib/Makefile                          |    1 +
 lib/qrwlock.c                         |  193 +++++++++++++++++++++++++
 lib/spinlock_debug.c                  |   19 +++
 10 files changed, 508 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qrwlock.h
 create mode 100644 lib/qrwlock.c



* [PATCH v5 1/4] qrwlock: A queue read/write lock implementation
  2013-11-04 17:17 [PATCH v5 0/4] qrwlock: Introducing a queue read/write lock implementation Waiman Long
@ 2013-11-04 17:17 ` Waiman Long
  2013-11-08 21:11   ` Paul E. McKenney
  2013-11-04 17:17 ` [PATCH v5 2/4] qrwlock x86: Enable x86 to use queue read/write lock Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Waiman Long @ 2013-11-04 17:17 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T,
	George Spelvin, Tim Chen, Aswin Chandramouleeswaran,
	Scott J Norton, Waiman Long

This patch introduces a new read/write lock implementation that puts
waiting readers and writers into a queue instead of actively contending
for the lock like the current read/write lock implementation. This will
improve performance in highly contended situations by reducing the
cache line bouncing effect.

The queue read/write lock (qrwlock) is mostly fair with respect to
the writers, even though there is still a slight chance of write
lock stealing.

Externally, there are two different types of readers - unfair (the
default) and fair. An unfair reader will try to steal the read lock
even if a writer is waiting, whereas a fair reader will wait in
the queue under this circumstance.  These variants are chosen at
initialization time by using different initializers. The new *_fair()
initializers are added for selecting the fair reader behavior.

Internally, there is a third type of reader which steals the lock
more aggressively than the unfair reader. It simply increments the
reader count and waits until the writer releases the lock. The
transition to aggressive reader happens in the read lock slowpath
when:
1. the reader is in an interrupt context;
2. a classic (unfair) reader comes to the head of the wait queue;
3. a fair reader comes to the head of the wait queue and sees the
   release of a write lock.

The fair queue rwlock is more deterministic in the sense that late
comers jumping ahead and stealing the lock is unlikely even though
there is still a very small chance for lock stealing to happen if
the readers or writers come at the right moment.  Other than that,
lock granting is done in a FIFO manner.  As a result, it is possible
to determine a maximum time period after which the waiting is over
and the lock can be acquired.

The queue read lock is safe to use in an interrupt context (softirq
or hardirq) as it will switch to become an aggressive reader in such
an environment, allowing recursive read locking. However, fair readers
do not support recursive read locking in a non-interrupt context
when a writer is waiting.
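
To illustrate the recursion that remains safe, here is a sketch (the
device, handler and lock names are made up) where a process-context
reader may be interrupted by a hardirq handler that takes the same
read lock; the handler becomes an aggressive reader instead of
queuing behind any waiting writer:

#include <linux/interrupt.h>
#include <linux/spinlock.h>

static DEFINE_RWLOCK(dev_state_lock);

static irqreturn_t dev_irq(int irq, void *data)
{
	read_lock(&dev_state_lock);	/* recursive read in hardirq: safe */
	/* ... inspect shared device state ... */
	read_unlock(&dev_state_lock);
	return IRQ_HANDLED;
}

static void dev_poll_state(void)
{
	read_lock(&dev_state_lock);	/* may be interrupted by dev_irq() */
	/* ... */
	read_unlock(&dev_state_lock);
}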

The only downside of the queue rwlock is the increase in the lock
structure size: 4 bytes for 32-bit systems and 12 bytes for 64-bit
systems.
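
A rough accounting of where those numbers come from (assuming
CONFIG_NR_CPUS < 65536 and a 4-byte legacy arch_rwlock_t):

  32-bit: 4 (cnts union) + 4 (waitq pointer)           =  8 bytes (+4)
  64-bit: 4 (cnts union) + 4 (padding) + 8 (waitq ptr) = 16 bytes (+12)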

This patch replaces the architecture-specific implementation of
rwlock with this generic version of the queue rwlock when the
ARCH_QUEUE_RWLOCK configuration parameter is set.

In terms of single-thread performance (no contention), a 256K
lock/unlock loop was run on 2.4GHz and 2.93GHz Westmere x86-64
CPUs. The following table shows the average time (in ns) for a single
lock/unlock sequence (including the looping and timing overhead):

Lock Type             2.4GHz   2.93GHz
---------             ------   -------
Ticket spinlock         14.9      12.3
Read lock               17.0      13.5
Write lock              17.0      13.5
Queue read lock         16.0      13.5
Queue fair read lock    16.0      13.5
Queue write lock         9.2       7.8
Queue fair write lock   17.5      14.5

The queue read lock is slightly slower than the ticket spinlock, but
is slightly faster than the regular read lock. The queue write lock,
however, is the fastest of all. It is almost twice as fast as the
regular write lock and about 1.5x as fast as the ticket spinlock.
The queue fair write lock, on the other hand, is slightly slower
than the regular write lock.
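
A minimal sketch of the kind of timing loop behind these numbers
(hypothetical; not the actual test harness used above):

#include <linux/hrtimer.h>	/* ktime_get() */
#include <linux/ktime.h>
#include <linux/printk.h>
#include <linux/spinlock.h>

static DEFINE_RWLOCK(bench_lock);

static void bench_read_lock(void)
{
	ktime_t start = ktime_get();
	int i;

	for (i = 0; i < 256 * 1024; i++) {
		read_lock(&bench_lock);
		read_unlock(&bench_lock);
	}
	pr_info("read lock/unlock: %lld ns for %d iterations\n",
		ktime_to_ns(ktime_sub(ktime_get(), start)), 256 * 1024);
}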

With lock contention, the speed of each individual lock/unlock function
is less important than the amount of contention-induced delays.

To investigate the performance characteristics of the queue rwlock
compared with the regular rwlock, Ingo's anon_vmas patch that converts
rwsem to rwlock was applied to a 3.12-rc2 kernel. This kernel was
then tested under the following 4 conditions:

1) Plain 3.12-rc2
2) Ingo's patch
3) Ingo's patch + unfair qrwlock (default)
4) Ingo's patch + fair qrwlock

The jobs per minute (JPM) results of AIM7's high_systime workload
at 1500 users on an 8-socket 80-core DL980 (HT off) were:

Kernel	JPM	%Change from (1)
------	---	----------------
  1	148265		-
  2	238715	       +61%
  3	242048	       +63%
  4	234881	       +58%

The use of the unfair qrwlock provides a small boost of 2%, while
using the fair qrwlock leads to a 3% decrease in performance. However,
looking at the perf profiles, we can clearly see that other bottlenecks
were constraining the performance improvement.

Perf profile of kernel (2):

  18.20%   reaim  [kernel.kallsyms]  [k] __write_lock_failed
   9.36%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   2.91%   reaim  [kernel.kallsyms]  [k] mspin_lock
   2.73%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
   2.23%      ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   1.29%   reaim  [kernel.kallsyms]  [k] __read_lock_failed
   1.21%    true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   1.14%   reaim  [kernel.kallsyms]  [k] zap_pte_range
   1.13%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock
   1.04%   reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner

Perf profile of kernel (3):

  10.57%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   7.98%   reaim  [kernel.kallsyms]  [k] queue_write_lock_slowpath
   5.83%   reaim  [kernel.kallsyms]  [k] mspin_lock
   2.86%      ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   2.71%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
   1.52%    true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   1.51%   reaim  [kernel.kallsyms]  [k] queue_read_lock_slowpath
   1.35%   reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
   1.12%   reaim  [kernel.kallsyms]  [k] zap_pte_range
   1.06%   reaim  [kernel.kallsyms]  [k] perf_event_aux_ctx
   1.01%   reaim  [kernel.kallsyms]  [k] perf_event_aux

Tim Chen also tested the qrwlock with Ingo's patch on a 4-socket
machine.  It was found that the 11% performance improvement was the
same with either the regular rwlock or the queue rwlock.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 include/asm-generic/qrwlock.h |  256 +++++++++++++++++++++++++++++++++++++++++
 kernel/Kconfig.locks          |    7 +
 lib/Makefile                  |    1 +
 lib/qrwlock.c                 |  247 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 511 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qrwlock.h
 create mode 100644 lib/qrwlock.c

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
new file mode 100644
index 0000000..78ad4a5
--- /dev/null
+++ b/include/asm-generic/qrwlock.h
@@ -0,0 +1,256 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QRWLOCK_H
+#define __ASM_GENERIC_QRWLOCK_H
+
+#include <linux/types.h>
+#include <asm/bitops.h>
+#include <asm/cmpxchg.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+#include <asm/byteorder.h>
+
+#if !defined(__LITTLE_ENDIAN) && !defined(__BIG_ENDIAN)
+#error "Missing either LITTLE_ENDIAN or BIG_ENDIAN definition."
+#endif
+
+#if (CONFIG_NR_CPUS < 65536)
+typedef u16 __nrcpu_t;
+typedef u32 __nrcpupair_t;
+#define	QRW_READER_BIAS	(1U << 16)
+#else
+typedef u32 __nrcpu_t;
+typedef u64 __nrcpupair_t;
+#define	QRW_READER_BIAS	(1UL << 32)
+#endif
+
+/*
+ * The queue read/write lock data structure
+ *
+ * Read lock stealing can only happen when there is at least one reader
+ * holding the read lock. When the fair flag is not set, it mimics the
+ * behavior of the regular rwlock at the expense that a perpetual stream
+ * of readers could starve a writer for a long period of time. That
+ * behavior, however, may be beneficial to a workload that is reader heavy
+ * with slow writers, and the writers can wait without undesirable consequence.
+ * This fair flag should only be set at initialization time.
+ *
+ * The layout of the structure is endian-sensitive to make sure that adding
+ * QRW_READER_BIAS to the rw field to increment the reader count won't
+ * disturb the writer and the fair fields.
+ */
+struct qrwnode {
+	struct qrwnode *next;
+	bool		wait;	/* Waiting flag */
+};
+
+typedef struct qrwlock {
+	union qrwcnts {
+		struct {
+#ifdef __LITTLE_ENDIAN
+			u8	  writer;	/* Writer state		*/
+			u8	  fair;		/* Fair rwlock flag	*/
+			__nrcpu_t readers;	/* # of active readers	*/
+#else
+			__nrcpu_t readers;	/* # of active readers	*/
+			u8	  fair;		/* Fair rwlock flag	*/
+			u8	  writer;	/* Writer state		*/
+#endif
+		};
+		__nrcpupair_t rw;		/* Reader/writer number pair */
+	} cnts;
+	struct qrwnode *waitq;			/* Tail of waiting queue */
+} arch_rwlock_t;
+
+/*
+ * Writer state values & mask
+ */
+#define	QW_WAITING	1			/* A writer is waiting	   */
+#define	QW_LOCKED	0xff			/* A writer holds the lock */
+#define QW_MASK_FAIR	((u8)~0)		/* Mask for fair reader    */
+#define QW_MASK_UNFAIR	((u8)~QW_WAITING)	/* Mask for unfair reader  */
+
+/*
+ * External function declarations
+ */
+extern void queue_read_lock_slowpath(struct qrwlock *lock);
+extern void queue_write_lock_slowpath(struct qrwlock *lock);
+
+/**
+ * queue_read_can_lock- would read_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline int queue_read_can_lock(struct qrwlock *lock)
+{
+	union qrwcnts rwcnts;
+
+	rwcnts.rw = ACCESS_ONCE(lock->cnts.rw);
+	return !rwcnts.writer || (!rwcnts.fair && rwcnts.readers);
+}
+
+/**
+ * queue_write_can_lock- would write_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline int queue_write_can_lock(struct qrwlock *lock)
+{
+	union qrwcnts rwcnts;
+
+	rwcnts.rw = ACCESS_ONCE(lock->cnts.rw);
+	return !rwcnts.writer && !rwcnts.readers;
+}
+
+/**
+ * queue_read_trylock - try to acquire read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int queue_read_trylock(struct qrwlock *lock)
+{
+	union qrwcnts cnts;
+	u8 wmask;
+
+	cnts.rw = ACCESS_ONCE(lock->cnts.rw);
+	wmask   = cnts.fair ? QW_MASK_FAIR : QW_MASK_UNFAIR;
+	if (likely(!(cnts.writer & wmask))) {
+		cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
+		if (likely(!(cnts.writer & wmask)))
+			return 1;
+		/*
+		 * Restore correct reader count
+		 * It had been found that two nearly consecutive atomic
+		 * operations (xadd & add) can cause significant cacheline
+		 * contention. By inserting a pause between these two atomic
+		 * operations, it can significantly reduce unintended
+		 * contention.
+		 */
+		cpu_relax();
+		add_smp(&lock->cnts.readers, -1);
+	}
+	return 0;
+}
+
+/**
+ * queue_write_trylock - try to acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int queue_write_trylock(struct qrwlock *lock)
+{
+	union qrwcnts old, new;
+
+	old.rw = ACCESS_ONCE(lock->cnts.rw);
+	if (likely(!old.writer && !old.readers)) {
+		new.rw = old.rw;
+		new.writer = QW_LOCKED;
+		if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
+			return 1;
+	}
+	return 0;
+}
+/**
+ * queue_read_lock - acquire read lock of a queue rwlock
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline void queue_read_lock(struct qrwlock *lock)
+{
+	union qrwcnts cnts;
+	u8 wmask;
+
+	cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
+	wmask   = cnts.fair ? QW_MASK_FAIR : QW_MASK_UNFAIR;
+	if (likely(!(cnts.writer & wmask)))
+		return;
+	/*
+	 * Slowpath will decrement the reader count, if necessary
+	 */
+	queue_read_lock_slowpath(lock);
+}
+
+/**
+ * queue_write_lock - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_write_lock(struct qrwlock *lock)
+{
+	union qrwcnts old;
+
+	/*
+	 * Optimize for the unfair lock case where the fair flag is 0.
+	 */
+	old.rw = cmpxchg(&lock->cnts.rw, 0, QW_LOCKED);
+	if (likely(old.rw == 0))
+		return;
+	if (likely(!old.writer && !old.readers)) {
+		union qrwcnts new;
+
+		new.rw = old.rw;
+		new.writer = QW_LOCKED;
+		if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
+			return;
+	}
+	queue_write_lock_slowpath(lock);
+}
+
+/**
+ * queue_read_unlock - release read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_read_unlock(struct qrwlock *lock)
+{
+	/*
+	 * Atomically decrement the reader count
+	 */
+	add_smp(&lock->cnts.readers, -1);
+}
+
+/**
+ * queue_write_unlock - release write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_write_unlock(struct qrwlock *lock)
+{
+	/*
+	 * Make sure that none of the critical section will be leaked out.
+	 */
+	smp_mb__before_clear_bit();
+	ACCESS_ONCE(lock->cnts.writer) = 0;
+	smp_mb__after_clear_bit();
+}
+
+/*
+ * Initializer
+ */
+#define	__ARCH_RW_LOCK_UNLOCKED	{ .cnts = { .rw = 0 }, .waitq = NULL }
+#define	__ARCH_RW_LOCK_UNLOCKED_FAIR	\
+	{ .cnts = { { .writer = 0, .fair = 1, .readers = 0 } }, .waitq = NULL }
+
+/*
+ * Remapping rwlock architecture specific functions to the corresponding
+ * queue rwlock functions.
+ */
+#define arch_read_can_lock(l)	queue_read_can_lock(l)
+#define arch_write_can_lock(l)	queue_write_can_lock(l)
+#define arch_read_lock(l)	queue_read_lock(l)
+#define arch_write_lock(l)	queue_write_lock(l)
+#define arch_read_trylock(l)	queue_read_trylock(l)
+#define arch_write_trylock(l)	queue_write_trylock(l)
+#define arch_read_unlock(l)	queue_read_unlock(l)
+#define arch_write_unlock(l)	queue_write_unlock(l)
+
+#endif /* __ASM_GENERIC_QRWLOCK_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..b665478 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
 config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_QUEUE_RWLOCK
+	bool
+
+config QUEUE_RWLOCK
+	def_bool y if ARCH_QUEUE_RWLOCK
+	depends on SMP
diff --git a/lib/Makefile b/lib/Makefile
index f3bb2cb..e3175db 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -189,3 +189,4 @@ quiet_cmd_build_OID_registry = GEN     $@
 clean-files	+= oid_registry_data.c
 
 obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
+obj-$(CONFIG_QUEUE_RWLOCK) += qrwlock.o
diff --git a/lib/qrwlock.c b/lib/qrwlock.c
new file mode 100644
index 0000000..a85b9e1
--- /dev/null
+++ b/lib/qrwlock.c
@@ -0,0 +1,247 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <asm-generic/qrwlock.h>
+
+/*
+ * Compared with the regular rwlock, the queue rwlock has the following
+ * advantages:
+ * 1. It is more deterministic for the fair variant. Even though there is
+ *    a slight chance of stealing the lock if one comes at the right moment,
+ *    the granting of the lock is mostly in FIFO order. Even the default
+ *    unfair variant is fairer at least among the writers.
+ * 2. It is faster in high contention situations.
+ *
+ * The only downside is that the lock is 4 bytes larger in 32-bit systems
+ * and 12 bytes larger in 64-bit systems.
+ *
+ * There are two queues for writers. The writer field of the lock is a
+ * one-slot wait queue. The writers that follow will have to wait in the
+ * combined reader/writer queue (waitq).
+ *
+ * Compared with the x86 ticket spinlock, the queue rwlock is faster in
+ * high contention situations. The write lock is also faster in single-thread
+ * operations. Therefore, the queue rwlock can be considered as a replacement
+ * for those spinlocks that are highly contended as long as an increase
+ * in lock size is not an issue.
+ */
+
+/**
+ * wait_in_queue - Add to queue and wait until it is at the head
+ * @lock: Pointer to queue rwlock structure
+ * @node: Node pointer to be added to the queue
+ *
+ * The use of smp_wmb() is to make sure that the other CPUs see the change
+ * ASAP.
+ */
+static __always_inline void
+wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
+{
+	struct qrwnode *prev;
+
+	node->next = NULL;
+	node->wait = true;
+	prev = xchg(&lock->waitq, node);
+	if (prev) {
+		prev->next = node;
+		smp_wmb();
+		/*
+		 * Wait until the waiting flag is off
+		 */
+		while (ACCESS_ONCE(node->wait))
+			cpu_relax();
+	}
+}
+
+/**
+ * signal_next - Signal the next one in queue to be at the head
+ * @lock: Pointer to queue rwlock structure
+ * @node: Node pointer to the current head of queue
+ */
+static __always_inline void
+signal_next(struct qrwlock *lock, struct qrwnode *node)
+{
+	struct qrwnode *next;
+
+	/*
+	 * Try to notify the next node first without disturbing the cacheline
+	 * of the lock. If that fails, check to see if it is the last node
+	 * and so should clear the wait queue.
+	 */
+	next = ACCESS_ONCE(node->next);
+	if (likely(next))
+		goto notify_next;
+
+	/*
+	 * Clear the wait queue if it is the last node
+	 */
+	if ((ACCESS_ONCE(lock->waitq) == node) &&
+	    (cmpxchg(&lock->waitq, node, NULL) == node))
+			return;
+	/*
+	 * Wait until the next one in queue set up the next field
+	 */
+	while (likely(!(next = ACCESS_ONCE(node->next))))
+		cpu_relax();
+	/*
+	 * The next one in queue is now at the head
+	 */
+notify_next:
+	barrier();
+	ACCESS_ONCE(next->wait) = false;
+	smp_wmb();
+}
+
+/**
+ * rspin_until_writer_unlock - inc reader count & spin until writer is gone
+ * @lock: Pointer to queue rwlock structure
+ *
+ * In interrupt context or at the head of the queue, the reader will just
+ * increment the reader count & wait until the writer releases the lock.
+ */
+static __always_inline void
+rspin_until_writer_unlock(struct qrwlock *lock, int inc)
+{
+	union qrwcnts cnts;
+
+	if (inc)
+		cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
+	else
+		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
+	while (cnts.writer == QW_LOCKED) {
+		cpu_relax();
+		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
+	}
+}
+
+/**
+ * queue_read_lock_slowpath - acquire read lock of a queue rwlock
+ * @lock: Pointer to queue rwlock structure
+ */
+void queue_read_lock_slowpath(struct qrwlock *lock)
+{
+	struct qrwnode node;
+	union qrwcnts cnts;
+
+	/*
+	 * Readers come here when it cannot get the lock without waiting
+	 */
+	if (unlikely(irq_count())) {
+		/*
+		 * Readers in interrupt context will spin until the lock is
+		 * available without waiting in the queue.
+		 */
+		rspin_until_writer_unlock(lock, 0);
+		return;
+	}
+	cnts.rw = xadd(&lock->cnts.rw, -QRW_READER_BIAS);
+
+	/*
+	 * Put the reader into the wait queue
+	 */
+	wait_in_queue(lock, &node);
+
+	/*
+	 * At the head of the wait queue now, try to increment the reader
+	 * count and get the lock.
+	 */
+	if (unlikely(cnts.fair)) {
+		/*
+		 * For fair reader, wait until the writer state goes to 0
+		 * before incrementing the reader count.
+		 */
+		while (ACCESS_ONCE(lock->cnts.writer))
+			cpu_relax();
+	}
+	rspin_until_writer_unlock(lock, 1);
+	signal_next(lock, &node);
+}
+EXPORT_SYMBOL(queue_read_lock_slowpath);
+
+/**
+ * queue_write_3step_lock - acquire write lock in 3 steps
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 otherwise
+ *
+ * Step 1 - Try to acquire the lock directly if no reader is present
+ * Step 2 - Set the waiting flag to notify readers that a writer is waiting
+ * Step 3 - When the readers field goes to 0, set the locked flag
+ *
+ * When not in fair mode, the readers actually ignore the second step.
+ * However, this is still necessary to force other writers to fall in line.
+ */
+static __always_inline int queue_write_3step_lock(struct qrwlock *lock)
+{
+	union qrwcnts old, new;
+
+	old.rw = ACCESS_ONCE(lock->cnts.rw);
+
+	/* Step 1 */
+	if (!old.writer & !old.readers) {
+		new.rw     = old.rw;
+		new.writer = QW_LOCKED;
+		if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
+			return 1;
+	}
+
+	/* Step 2 */
+	if (old.writer || (cmpxchg(&lock->cnts.writer, 0, QW_WAITING) != 0))
+		return 0;
+
+	/* Step 3 */
+	while (true) {
+		cpu_relax();
+		old.rw = ACCESS_ONCE(lock->cnts.rw);
+		if (!old.readers) {
+			new.rw     = old.rw;
+			new.writer = QW_LOCKED;
+			if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw)
+				== old.rw))
+				return 1;
+		}
+	}
+	/* Should never reach here */
+	return 0;
+}
+
+/**
+ * queue_write_lock_slowpath - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+void queue_write_lock_slowpath(struct qrwlock *lock)
+{
+	struct qrwnode node;
+
+	/*
+	 * Put the writer into the wait queue
+	 */
+	wait_in_queue(lock, &node);
+
+	/*
+	 * At the head of the wait queue now, call queue_write_3step_lock()
+	 * to acquire the lock until it is done.
+	 */
+	while (!queue_write_3step_lock(lock))
+		cpu_relax();
+	signal_next(lock, &node);
+}
+EXPORT_SYMBOL(queue_write_lock_slowpath);
-- 
1.7.1



* [PATCH v5 2/4] qrwlock x86: Enable x86 to use queue read/write lock
  2013-11-04 17:17 [PATCH v5 0/4] qrwlock: Introducing a queue read/write lock implementation Waiman Long
  2013-11-04 17:17 ` [PATCH v5 1/4] qrwlock: A " Waiman Long
@ 2013-11-04 17:17 ` Waiman Long
  2013-11-04 17:17 ` [PATCH v5 3/4] qrwlock: Enable fair " Waiman Long
  2013-11-04 17:17 ` [PATCH v5 4/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing Waiman Long
  3 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2013-11-04 17:17 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T,
	George Spelvin, Tim Chen, Aswin Chandramouleeswaran,
	Scott J Norton, Waiman Long

This patch makes the necessary changes at the x86 architecture-specific
layer so that, when the CONFIG_QUEUE_RWLOCK kernel option is enabled,
the read/write lock is replaced by the queue read/write lock.

It also enables CONFIG_ARCH_QUEUE_RWLOCK, which forces the use of
the queue read/write lock on x86, the architecture that tends to have
the largest NUMA machines. This patch will improve the scalability
of those large machines.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/Kconfig                      |    1 +
 arch/x86/include/asm/spinlock.h       |    2 ++
 arch/x86/include/asm/spinlock_types.h |    4 ++++
 3 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f67e839..4f447bb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -123,6 +123,7 @@ config X86
 	select COMPAT_OLD_SIGACTION if IA32_EMULATION
 	select RTC_LIB
 	select HAVE_DEBUG_STACKOVERFLOW
+	select QUEUE_RWLOCK
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index bf156de..8fb88c5 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -188,6 +188,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 		cpu_relax();
 }
 
+#ifndef CONFIG_QUEUE_RWLOCK
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
@@ -270,6 +271,7 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 	asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0"
 		     : "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory");
 }
+#endif /* CONFIG_QUEUE_RWLOCK */
 
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
 #define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..a585635 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -34,6 +34,10 @@ typedef struct arch_spinlock {
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
 
+#ifdef CONFIG_QUEUE_RWLOCK
+#include <asm-generic/qrwlock.h>
+#else
 #include <asm/rwlock.h>
+#endif
 
 #endif /* _ASM_X86_SPINLOCK_TYPES_H */
-- 
1.7.1



* [PATCH v5 3/4] qrwlock: Enable fair queue read/write lock
  2013-11-04 17:17 [PATCH v5 0/4] qrwlock: Introducing a queue read/write lock implementation Waiman Long
  2013-11-04 17:17 ` [PATCH v5 1/4] qrwlock: A " Waiman Long
  2013-11-04 17:17 ` [PATCH v5 2/4] qrwlock x86: Enable x86 to use queue read/write lock Waiman Long
@ 2013-11-04 17:17 ` Waiman Long
  2013-11-04 17:17 ` [PATCH v5 4/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing Waiman Long
  3 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2013-11-04 17:17 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T,
	George Spelvin, Tim Chen, Aswin Chandramouleeswaran,
	Scott J Norton, Waiman Long

By default, the queue rwlock is fair among writers and gives
preference to readers, allowing them to steal the lock even if a
writer is waiting. However, there is a desire to have a fair variant
of rwlock that is more deterministic. To enable this, fair variants
of the lock initializers are added by this patch to allow lock owners
to choose to use a fair rwlock. These newly added initializers all
have the _fair or _FAIR suffix to indicate the desire to use a fair
rwlock. If the QUEUE_RWLOCK config option is not selected, the fair
rwlock initializers will be the same as the regular ones.
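
A minimal usage sketch of the new dynamic initializer (the structure
and field names are made up):

#include <linux/spinlock.h>

struct my_object {
	rwlock_t	state_lock;
	/* ... */
};

static void my_object_init(struct my_object *obj)
{
	rwlock_init_fair(&obj->state_lock);	/* fair queue rwlock */
}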

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 include/linux/rwlock.h       |   15 +++++++++++++++
 include/linux/rwlock_types.h |   13 +++++++++++++
 lib/spinlock_debug.c         |   19 +++++++++++++++++++
 3 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h
index bc2994e..5f2628b 100644
--- a/include/linux/rwlock.h
+++ b/include/linux/rwlock.h
@@ -23,9 +23,24 @@ do {								\
 								\
 	__rwlock_init((lock), #lock, &__key);			\
 } while (0)
+
+# ifdef CONFIG_QUEUE_RWLOCK
+extern void __rwlock_init_fair(rwlock_t *lock, const char *name,
+			       struct lock_class_key *key);
+#  define rwlock_init_fair(lock)				\
+do {								\
+	static struct lock_class_key __key;			\
+								\
+	__rwlock_init_fair((lock), #lock, &__key);		\
+} while (0)
+# else
+#  define __rwlock_init_fair(l, n, k)	__rwlock_init(l, n, k)
+# endif /* CONFIG_QUEUE_RWLOCK */
 #else
 # define rwlock_init(lock)					\
 	do { *(lock) = __RW_LOCK_UNLOCKED(lock); } while (0)
+# define rwlock_init_fair(lock)				\
+	do { *(lock) = __RW_LOCK_UNLOCKED_FAIR(lock); } while (0)
 #endif
 
 #ifdef CONFIG_DEBUG_SPINLOCK
diff --git a/include/linux/rwlock_types.h b/include/linux/rwlock_types.h
index cc0072e..d27c812 100644
--- a/include/linux/rwlock_types.h
+++ b/include/linux/rwlock_types.h
@@ -37,12 +37,25 @@ typedef struct {
 				.owner = SPINLOCK_OWNER_INIT,		\
 				.owner_cpu = -1,			\
 				RW_DEP_MAP_INIT(lockname) }
+#define __RW_LOCK_UNLOCKED_FAIR(lockname)				\
+	(rwlock_t)	{	.raw_lock = __ARCH_RW_LOCK_UNLOCKED_FAIR,\
+				.magic = RWLOCK_MAGIC,			\
+				.owner = SPINLOCK_OWNER_INIT,		\
+				.owner_cpu = -1,			\
+				RW_DEP_MAP_INIT(lockname) }
 #else
 #define __RW_LOCK_UNLOCKED(lockname) \
 	(rwlock_t)	{	.raw_lock = __ARCH_RW_LOCK_UNLOCKED,	\
 				RW_DEP_MAP_INIT(lockname) }
+#define __RW_LOCK_UNLOCKED_FAIR(lockname) \
+	(rwlock_t)	{	.raw_lock = __ARCH_RW_LOCK_UNLOCKED_FAIR,\
+				RW_DEP_MAP_INIT(lockname) }
 #endif
 
 #define DEFINE_RWLOCK(x)	rwlock_t x = __RW_LOCK_UNLOCKED(x)
+#define DEFINE_RWLOCK_FAIR(x)	rwlock_t x = __RW_LOCK_UNLOCKED_FAIR(x)
 
+#ifndef	__ARCH_RW_LOCK_UNLOCKED_FAIR
+#define	__ARCH_RW_LOCK_UNLOCKED_FAIR	__ARCH_RW_LOCK_UNLOCKED
+#endif
 #endif /* __LINUX_RWLOCK_TYPES_H */
diff --git a/lib/spinlock_debug.c b/lib/spinlock_debug.c
index 0374a59..d6ef7ce 100644
--- a/lib/spinlock_debug.c
+++ b/lib/spinlock_debug.c
@@ -49,6 +49,25 @@ void __rwlock_init(rwlock_t *lock, const char *name,
 
 EXPORT_SYMBOL(__rwlock_init);
 
+#ifdef CONFIG_QUEUE_RWLOCK
+void __rwlock_init_fair(rwlock_t *lock, const char *name,
+			struct lock_class_key *key)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	/*
+	 * Make sure we are not reinitializing a held lock:
+	 */
+	debug_check_no_locks_freed((void *)lock, sizeof(*lock));
+	lockdep_init_map(&lock->dep_map, name, key, 0);
+#endif
+	lock->raw_lock = (arch_rwlock_t) __ARCH_RW_LOCK_UNLOCKED_FAIR;
+	lock->magic = RWLOCK_MAGIC;
+	lock->owner = SPINLOCK_OWNER_INIT;
+	lock->owner_cpu = -1;
+}
+EXPORT_SYMBOL(__rwlock_init_fair);
+#endif /* CONFIG_QUEUE_RWLOCK */
+
 static void spin_dump(raw_spinlock_t *lock, const char *msg)
 {
 	struct task_struct *owner = NULL;
-- 
1.7.1



* [PATCH v5 4/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing
  2013-11-04 17:17 [PATCH v5 0/4] qrwlock: Introducing a queue read/write lock implementation Waiman Long
                   ` (2 preceding siblings ...)
  2013-11-04 17:17 ` [PATCH v5 3/4] qrwlock: Enable fair " Waiman Long
@ 2013-11-04 17:17 ` Waiman Long
  2013-11-08 21:21   ` Paul E. McKenney
  3 siblings, 1 reply; 13+ messages in thread
From: Waiman Long @ 2013-11-04 17:17 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T,
	George Spelvin, Tim Chen, Aswin Chandramouleeswaran,
	Scott J Norton, Waiman Long

There is a pending patch in the rwsem patch series that adds generic
MCS locking helper functions to do MCS-style locking. This patch
enables the queue rwlock to use those generic MCS lock/unlock
primitives for internal queuing. This patch should only be merged
after the merging of that generic MCS locking patch.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 include/asm-generic/qrwlock.h |    7 +--
 lib/qrwlock.c                 |  140 +++++++++++++----------------------------
 2 files changed, 45 insertions(+), 102 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 78ad4a5..014e6e9 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -54,10 +54,7 @@ typedef u64 __nrcpupair_t;
  * QRW_READER_BIAS to the rw field to increment the reader count won't
  * disturb the writer and the fair fields.
  */
-struct qrwnode {
-	struct qrwnode *next;
-	bool		wait;	/* Waiting flag */
-};
+struct mcs_spinlock;
 
 typedef struct qrwlock {
 	union qrwcnts {
@@ -74,7 +71,7 @@ typedef struct qrwlock {
 		};
 		__nrcpupair_t rw;		/* Reader/writer number pair */
 	} cnts;
-	struct qrwnode *waitq;			/* Tail of waiting queue */
+	struct mcs_spinlock *waitq;		/* Tail of waiting queue */
 } arch_rwlock_t;
 
 /*
diff --git a/lib/qrwlock.c b/lib/qrwlock.c
index a85b9e1..6817853 100644
--- a/lib/qrwlock.c
+++ b/lib/qrwlock.c
@@ -20,6 +20,7 @@
 #include <linux/cpumask.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
+#include <linux/mcs_spinlock.h>
 #include <asm-generic/qrwlock.h>
 
 /*
@@ -46,87 +47,16 @@
  */
 
 /**
- * wait_in_queue - Add to queue and wait until it is at the head
- * @lock: Pointer to queue rwlock structure
- * @node: Node pointer to be added to the queue
- *
- * The use of smp_wmb() is to make sure that the other CPUs see the change
- * ASAP.
- */
-static __always_inline void
-wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
-{
-	struct qrwnode *prev;
-
-	node->next = NULL;
-	node->wait = true;
-	prev = xchg(&lock->waitq, node);
-	if (prev) {
-		prev->next = node;
-		smp_wmb();
-		/*
-		 * Wait until the waiting flag is off
-		 */
-		while (ACCESS_ONCE(node->wait))
-			cpu_relax();
-	}
-}
-
-/**
- * signal_next - Signal the next one in queue to be at the head
- * @lock: Pointer to queue rwlock structure
- * @node: Node pointer to the current head of queue
- */
-static __always_inline void
-signal_next(struct qrwlock *lock, struct qrwnode *node)
-{
-	struct qrwnode *next;
-
-	/*
-	 * Try to notify the next node first without disturbing the cacheline
-	 * of the lock. If that fails, check to see if it is the last node
-	 * and so should clear the wait queue.
-	 */
-	next = ACCESS_ONCE(node->next);
-	if (likely(next))
-		goto notify_next;
-
-	/*
-	 * Clear the wait queue if it is the last node
-	 */
-	if ((ACCESS_ONCE(lock->waitq) == node) &&
-	    (cmpxchg(&lock->waitq, node, NULL) == node))
-			return;
-	/*
-	 * Wait until the next one in queue set up the next field
-	 */
-	while (likely(!(next = ACCESS_ONCE(node->next))))
-		cpu_relax();
-	/*
-	 * The next one in queue is now at the head
-	 */
-notify_next:
-	barrier();
-	ACCESS_ONCE(next->wait) = false;
-	smp_wmb();
-}
-
-/**
  * rspin_until_writer_unlock - inc reader count & spin until writer is gone
  * @lock: Pointer to queue rwlock structure
+ * @cnts: Queue read/write lock counts structure
  *
  * In interrupt context or at the head of the queue, the reader will just
  * increment the reader count & wait until the writer releases the lock.
  */
 static __always_inline void
-rspin_until_writer_unlock(struct qrwlock *lock, int inc)
+rspin_until_writer_unlock(struct qrwlock *lock, union qrwcnts cnts)
 {
-	union qrwcnts cnts;
-
-	if (inc)
-		cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
-	else
-		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
 	while (cnts.writer == QW_LOCKED) {
 		cpu_relax();
 		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
@@ -139,7 +69,7 @@ rspin_until_writer_unlock(struct qrwlock *lock, int inc)
  */
 void queue_read_lock_slowpath(struct qrwlock *lock)
 {
-	struct qrwnode node;
+	struct mcs_spinlock node;
 	union qrwcnts cnts;
 
 	/*
@@ -150,7 +80,8 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
 		 * Readers in interrupt context will spin until the lock is
 		 * available without waiting in the queue.
 		 */
-		rspin_until_writer_unlock(lock, 0);
+		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
+		rspin_until_writer_unlock(lock, cnts);
 		return;
 	}
 	cnts.rw = xadd(&lock->cnts.rw, -QRW_READER_BIAS);
@@ -158,7 +89,7 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
 	/*
 	 * Put the reader into the wait queue
 	 */
-	wait_in_queue(lock, &node);
+	mcs_spin_lock(&lock->waitq, &node);
 
 	/*
 	 * At the head of the wait queue now, try to increment the reader
@@ -172,12 +103,36 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
 		while (ACCESS_ONCE(lock->cnts.writer))
 			cpu_relax();
 	}
-	rspin_until_writer_unlock(lock, 1);
-	signal_next(lock, &node);
+	/*
+	 * Increment reader count & wait until writer unlock
+	 */
+	cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
+	rspin_until_writer_unlock(lock, cnts);
+	mcs_spin_unlock(&lock->waitq, &node);
 }
 EXPORT_SYMBOL(queue_read_lock_slowpath);
 
 /**
+ * _write_trylock - try to acquire a write lock
+ * @lock : Pointer to queue rwlock structure
+ * @old  : Old value of the qrwcnts
+ * @new  : New value of the qrwcnts
+ * Return: 1 if lock acquired, 0 otherwise
+ *
+ * Put old & new as function arguments can force the compiler to generate
+ * better code with less stack memory access.
+ */
+static __always_inline int _write_trylock(struct qrwlock *lock,
+				union qrwcnts old, union qrwcnts new)
+{
+	new.rw     = old.rw;
+	new.writer = QW_LOCKED;
+	if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
+		return 1;
+	return 0;
+}
+
+/**
  * queue_write_3step_lock - acquire write lock in 3 steps
  * @lock : Pointer to queue rwlock structure
  * Return: 1 if lock acquired, 0 otherwise
@@ -194,33 +149,24 @@ static __always_inline int queue_write_3step_lock(struct qrwlock *lock)
 	union qrwcnts old, new;
 
 	old.rw = ACCESS_ONCE(lock->cnts.rw);
+	new.rw = 0;
 
 	/* Step 1 */
-	if (!old.writer & !old.readers) {
-		new.rw     = old.rw;
-		new.writer = QW_LOCKED;
-		if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
-			return 1;
-	}
+	if (!old.writer && !old.readers && _write_trylock(lock, old, new))
+		return 1;
 
 	/* Step 2 */
 	if (old.writer || (cmpxchg(&lock->cnts.writer, 0, QW_WAITING) != 0))
 		return 0;
 
 	/* Step 3 */
-	while (true) {
+	cpu_relax();
+	old.rw = ACCESS_ONCE(lock->cnts.rw);
+	while (old.readers || !_write_trylock(lock, old, new)) {
 		cpu_relax();
 		old.rw = ACCESS_ONCE(lock->cnts.rw);
-		if (!old.readers) {
-			new.rw     = old.rw;
-			new.writer = QW_LOCKED;
-			if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw)
-				== old.rw))
-				return 1;
-		}
 	}
-	/* Should never reach here */
-	return 0;
+	return 1;
 }
 
 /**
@@ -229,12 +175,12 @@ static __always_inline int queue_write_3step_lock(struct qrwlock *lock)
  */
 void queue_write_lock_slowpath(struct qrwlock *lock)
 {
-	struct qrwnode node;
+	struct mcs_spinlock node;
 
 	/*
 	 * Put the writer into the wait queue
 	 */
-	wait_in_queue(lock, &node);
+	mcs_spin_lock(&lock->waitq, &node);
 
 	/*
 	 * At the head of the wait queue now, call queue_write_3step_lock()
@@ -242,6 +188,6 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
 	 */
 	while (!queue_write_3step_lock(lock))
 		cpu_relax();
-	signal_next(lock, &node);
+	mcs_spin_unlock(&lock->waitq, &node);
 }
 EXPORT_SYMBOL(queue_write_lock_slowpath);
-- 
1.7.1



* Re: [PATCH v5 1/4] qrwlock: A queue read/write lock implementation
  2013-11-04 17:17 ` [PATCH v5 1/4] qrwlock: A " Waiman Long
@ 2013-11-08 21:11   ` Paul E. McKenney
  2013-11-08 22:36     ` Waiman Long
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2013-11-08 21:11 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Aswin Chandramouleeswaran,
	Scott J Norton

On Mon, Nov 04, 2013 at 12:17:17PM -0500, Waiman Long wrote:
> This patch introduces a new read/write lock implementation that puts
> waiting readers and writers into a queue instead of actively contending
> for the lock like the current read/write lock implementation. This will
> improve performance in highly contended situations by reducing the
> cache line bouncing effect.
> 
> The queue read/write lock (qrwlock) is mostly fair with respect to
> the writers, even though there is still a slight chance of write
> lock stealing.
> 
> Externally, there are two different types of readers - unfair (the
> default) and fair. An unfair reader will try to steal the read lock
> even if a writer is waiting, whereas a fair reader will wait in
> the queue under this circumstance.  These variants are chosen at
> initialization time by using different initializers. The new *_fair()
> initializers are added for selecting the fair reader behavior.
> 
> Internally, there is a third type of reader which steals the lock
> more aggressively than the unfair reader. It simply increments the
> reader count and waits until the writer releases the lock. The
> transition to aggressive reader happens in the read lock slowpath
> when:
> 1. the reader is in an interrupt context;
> 2. a classic (unfair) reader comes to the head of the wait queue;
> 3. a fair reader comes to the head of the wait queue and sees the
>    release of a write lock.
> 
> The fair queue rwlock is more deterministic in the sense that late
> comers jumping ahead and stealing the lock is unlikely even though
> there is still a very small chance for lock stealing to happen if
> the readers or writers come at the right moment.  Other than that,
> lock granting is done in a FIFO manner.  As a result, it is possible
> to determine a maximum time period after which the waiting is over
> and the lock can be acquired.
> 
> The queue read lock is safe to use in an interrupt context (softirq
> or hardirq) as it will switch to become an aggressive reader in such
> an environment, allowing recursive read locking. However, fair readers
> do not support recursive read locking in a non-interrupt context
> when a writer is waiting.
> 
> The only downside of the queue rwlock is the increase in the lock
> structure size: 4 bytes for 32-bit systems and 12 bytes for 64-bit
> systems.
> 
> This patch replaces the architecture-specific implementation of
> rwlock with this generic version of the queue rwlock when the
> ARCH_QUEUE_RWLOCK configuration parameter is set.
> 
> In terms of single-thread performance (no contention), a 256K
> lock/unlock loop was run on 2.4GHz and 2.93GHz Westmere x86-64
> CPUs. The following table shows the average time (in ns) for a single
> lock/unlock sequence (including the looping and timing overhead):
> 
> Lock Type             2.4GHz   2.93GHz
> ---------             ------   -------
> Ticket spinlock         14.9      12.3
> Read lock               17.0      13.5
> Write lock              17.0      13.5
> Queue read lock         16.0      13.5
> Queue fair read lock    16.0      13.5
> Queue write lock         9.2       7.8
> Queue fair write lock   17.5      14.5
> 
> The queue read lock is slightly slower than the ticket spinlock, but
> is slightly faster than the regular read lock. The queue write lock,
> however, is the fastest of all. It is almost twice as fast as the
> regular write lock and about 1.5x as fast as the ticket spinlock.
> The queue fair write lock, on the other hand, is slightly slower
> than the regular write lock.
> 
> With lock contention, the speed of each individual lock/unlock function
> is less important than the amount of contention-induced delays.
> 
> To investigate the performance characteristics of the queue rwlock
> compared with the regular rwlock, Ingo's anon_vmas patch that converts
> rwsem to rwlock was applied to a 3.12-rc2 kernel. This kernel was
> then tested under the following 4 conditions:
> 
> 1) Plain 3.12-rc2
> 2) Ingo's patch
> 3) Ingo's patch + unfair qrwlock (default)
> 4) Ingo's patch + fair qrwlock
> 
> The jobs per minute (JPM) results of AIM7's high_systime workload
> at 1500 users on an 8-socket 80-core DL980 (HT off) were:
> 
> Kernel	JPM	%Change from (1)
> ------	---	----------------
>   1	148265		-
>   2	238715	       +61%
>   3	242048	       +63%
>   4	234881	       +58%
> 
> The use of the unfair qrwlock provides a small boost of 2%, while
> using the fair qrwlock leads to a 3% decrease in performance. However,
> looking at the perf profiles, we can clearly see that other bottlenecks
> were constraining the performance improvement.
> 
> Perf profile of kernel (2):
> 
>   18.20%   reaim  [kernel.kallsyms]  [k] __write_lock_failed
>    9.36%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>    2.91%   reaim  [kernel.kallsyms]  [k] mspin_lock
>    2.73%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
>    2.23%      ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>    1.29%   reaim  [kernel.kallsyms]  [k] __read_lock_failed
>    1.21%    true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>    1.14%   reaim  [kernel.kallsyms]  [k] zap_pte_range
>    1.13%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock
>    1.04%   reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
> 
> Perf profile of kernel (3):
> 
>   10.57%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>    7.98%   reaim  [kernel.kallsyms]  [k] queue_write_lock_slowpath
>    5.83%   reaim  [kernel.kallsyms]  [k] mspin_lock
>    2.86%      ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>    2.71%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
>    1.52%    true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>    1.51%   reaim  [kernel.kallsyms]  [k] queue_read_lock_slowpath
>    1.35%   reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
>    1.12%   reaim  [kernel.kallsyms]  [k] zap_pte_range
>    1.06%   reaim  [kernel.kallsyms]  [k] perf_event_aux_ctx
>    1.01%   reaim  [kernel.kallsyms]  [k] perf_event_aux

But wouldn't kernel (4) be the one that was the most highly constrained?

(That said, yes, I get that _raw_spin_lock_irqsave() is some lock that
is unrelated to the qrwlock.)

> Tim Chen also tested the qrwlock with Ingo's patch on a 4-socket
> machine.  It was found that the 11% performance improvement was the
> same with either the regular rwlock or the queue rwlock.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>

Some memory-barrier issues with additional commentary below.

							Thanx, Paul

> ---
>  include/asm-generic/qrwlock.h |  256 +++++++++++++++++++++++++++++++++++++++++
>  kernel/Kconfig.locks          |    7 +
>  lib/Makefile                  |    1 +
>  lib/qrwlock.c                 |  247 +++++++++++++++++++++++++++++++++++++++
>  4 files changed, 511 insertions(+), 0 deletions(-)
>  create mode 100644 include/asm-generic/qrwlock.h
>  create mode 100644 lib/qrwlock.c
> 
> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> new file mode 100644
> index 0000000..78ad4a5
> --- /dev/null
> +++ b/include/asm-generic/qrwlock.h
> @@ -0,0 +1,256 @@
> +/*
> + * Queue read/write lock
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
> + *
> + * Authors: Waiman Long <waiman.long@hp.com>
> + */
> +#ifndef __ASM_GENERIC_QRWLOCK_H
> +#define __ASM_GENERIC_QRWLOCK_H
> +
> +#include <linux/types.h>
> +#include <asm/bitops.h>
> +#include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> +#include <asm/processor.h>
> +#include <asm/byteorder.h>
> +
> +#if !defined(__LITTLE_ENDIAN) && !defined(__BIG_ENDIAN)
> +#error "Missing either LITTLE_ENDIAN or BIG_ENDIAN definition."
> +#endif
> +
> +#if (CONFIG_NR_CPUS < 65536)
> +typedef u16 __nrcpu_t;
> +typedef u32 __nrcpupair_t;
> +#define	QRW_READER_BIAS	(1U << 16)
> +#else
> +typedef u32 __nrcpu_t;
> +typedef u64 __nrcpupair_t;
> +#define	QRW_READER_BIAS	(1UL << 32)
> +#endif
> +
> +/*
> + * The queue read/write lock data structure
> + *
> + * Read lock stealing can only happen when there is at least one reader
> + * holding the read lock. When the fair flag is not set, it mimics the
> + * behavior of the regular rwlock at the expense that a perpetual stream
> + * of readers could starve a writer for a long period of time. That
> + * behavior, however, may be beneficial to a workload that is reader heavy
> + * with slow writers, and the writers can wait without undesirable consequence.
> + * This fair flag should only be set at initialization time.
> + *
> + * The layout of the structure is endian-sensitive to make sure that adding
> + * QRW_READER_BIAS to the rw field to increment the reader count won't
> + * disturb the writer and the fair fields.
> + */
> +struct qrwnode {
> +	struct qrwnode *next;
> +	bool		wait;	/* Waiting flag */
> +};
> +
> +typedef struct qrwlock {
> +	union qrwcnts {
> +		struct {
> +#ifdef __LITTLE_ENDIAN
> +			u8	  writer;	/* Writer state		*/
> +			u8	  fair;		/* Fair rwlock flag	*/
> +			__nrcpu_t readers;	/* # of active readers	*/
> +#else
> +			__nrcpu_t readers;	/* # of active readers	*/
> +			u8	  fair;		/* Fair rwlock flag	*/
> +			u8	  writer;	/* Writer state		*/
> +#endif
> +		};
> +		__nrcpupair_t rw;		/* Reader/writer number pair */
> +	} cnts;
> +	struct qrwnode *waitq;			/* Tail of waiting queue */
> +} arch_rwlock_t;
> +
> +/*
> + * Writer state values & mask
> + */
> +#define	QW_WAITING	1			/* A writer is waiting	   */
> +#define	QW_LOCKED	0xff			/* A writer holds the lock */
> +#define QW_MASK_FAIR	((u8)~0)		/* Mask for fair reader    */
> +#define QW_MASK_UNFAIR	((u8)~QW_WAITING)	/* Mask for unfair reader  */
> +
> +/*
> + * External function declarations
> + */
> +extern void queue_read_lock_slowpath(struct qrwlock *lock);
> +extern void queue_write_lock_slowpath(struct qrwlock *lock);
> +
> +/**
> + * queue_read_can_lock- would read_trylock() succeed?
> + * @lock: Pointer to queue rwlock structure
> + */
> +static inline int queue_read_can_lock(struct qrwlock *lock)
> +{
> +	union qrwcnts rwcnts;
> +
> +	rwcnts.rw = ACCESS_ONCE(lock->cnts.rw);
> +	return !rwcnts.writer || (!rwcnts.fair && rwcnts.readers);
> +}
> +
> +/**
> + * queue_write_can_lock- would write_trylock() succeed?
> + * @lock: Pointer to queue rwlock structure
> + */
> +static inline int queue_write_can_lock(struct qrwlock *lock)
> +{
> +	union qrwcnts rwcnts;
> +
> +	rwcnts.rw = ACCESS_ONCE(lock->cnts.rw);
> +	return !rwcnts.writer && !rwcnts.readers;
> +}
> +
> +/**
> + * queue_read_trylock - try to acquire read lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + * Return: 1 if lock acquired, 0 if failed
> + */
> +static inline int queue_read_trylock(struct qrwlock *lock)
> +{
> +	union qrwcnts cnts;
> +	u8 wmask;
> +
> +	cnts.rw = ACCESS_ONCE(lock->cnts.rw);
> +	wmask   = cnts.fair ? QW_MASK_FAIR : QW_MASK_UNFAIR;
> +	if (likely(!(cnts.writer & wmask))) {
> +		cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);

On an unfair lock, this can momentarily make queue_read_can_lock() give
a false positive.  Not sure that this is a problem -- after all, the
return value from queue_read_can_lock() is immediately obsolete anyway.

> +		if (likely(!(cnts.writer & wmask)))
> +			return 1;
> +		/*
> +		 * Restore correct reader count
> +		 * It had been found that two nearly consecutive atomic
> +		 * operations (xadd & add) can cause significant cacheline
> +		 * contention. By inserting a pause between these two atomic
> +		 * operations, it can significantly reduce unintended
> +		 * contention.
> +		 */
> +		cpu_relax();
> +		add_smp(&lock->cnts.readers, -1);
> +	}
> +	return 0;
> +}
> +
> +/**
> + * queue_write_trylock - try to acquire write lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + * Return: 1 if lock acquired, 0 if failed
> + */
> +static inline int queue_write_trylock(struct qrwlock *lock)
> +{
> +	union qrwcnts old, new;
> +
> +	old.rw = ACCESS_ONCE(lock->cnts.rw);
> +	if (likely(!old.writer && !old.readers)) {
> +		new.rw = old.rw;
> +		new.writer = QW_LOCKED;
> +		if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
> +			return 1;
> +	}
> +	return 0;
> +}
> +/**
> + * queue_read_lock - acquire read lock of a queue rwlock
> + * @lock: Pointer to queue rwlock structure
> + */
> +static inline void queue_read_lock(struct qrwlock *lock)
> +{
> +	union qrwcnts cnts;
> +	u8 wmask;
> +
> +	cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
> +	wmask   = cnts.fair ? QW_MASK_FAIR : QW_MASK_UNFAIR;
> +	if (likely(!(cnts.writer & wmask)))
> +		return;
> +	/*
> +	 * Slowpath will decrement the reader count, if necessary
> +	 */
> +	queue_read_lock_slowpath(lock);
> +}
> +
> +/**
> + * queue_write_lock - acquire write lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + */
> +static inline void queue_write_lock(struct qrwlock *lock)
> +{
> +	union qrwcnts old;
> +
> +	/*
> +	 * Optimize for the unfair lock case where the fair flag is 0.
> +	 */
> +	old.rw = cmpxchg(&lock->cnts.rw, 0, QW_LOCKED);
> +	if (likely(old.rw == 0))
> +		return;
> +	if (likely(!old.writer && !old.readers)) {
> +		union qrwcnts new;
> +
> +		new.rw = old.rw;
> +		new.writer = QW_LOCKED;
> +		if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
> +			return;
> +	}
> +	queue_write_lock_slowpath(lock);
> +}
> +
> +/**
> + * queue_read_unlock - release read lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + */
> +static inline void queue_read_unlock(struct qrwlock *lock)
> +{
> +	/*
> +	 * Atomically decrement the reader count
> +	 */
> +	add_smp(&lock->cnts.readers, -1);
> +}
> +
> +/**
> + * queue_write_unlock - release write lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + */
> +static inline void queue_write_unlock(struct qrwlock *lock)
> +{
> +	/*
> +	 * Make sure that no access in the critical section leaks out.
> +	 */
> +	smp_mb__before_clear_bit();
> +	ACCESS_ONCE(lock->cnts.writer) = 0;
> +	smp_mb__after_clear_bit();

How about the new smp_store_release() for this write?  Looks to me that
smp_mb__before_clear_bit() and smp_mb__after_clear_bit() work by accident,
if they in fact do work for all architectures.
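
For reference, a minimal sketch of that suggestion (assuming the new
smp_store_release() primitive is available; this is not the posted code):

	static inline void queue_write_unlock(struct qrwlock *lock)
	{
		/*
		 * RELEASE semantics order every access in the critical
		 * section before the store that clears the writer byte.
		 */
		smp_store_release(&lock->cnts.writer, 0);
	}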

> +}
> +
> +/*
> + * Initializier
> + */
> +#define	__ARCH_RW_LOCK_UNLOCKED	{ .cnts = { .rw = 0 }, .waitq = NULL }
> +#define	__ARCH_RW_LOCK_UNLOCKED_FAIR	\
> +	{ .cnts = { { .writer = 0, .fair = 1, .readers = 0 } }, .waitq = NULL }
> +
> +/*
> + * Remapping rwlock architecture specific functions to the corresponding
> + * queue rwlock functions.
> + */
> +#define arch_read_can_lock(l)	queue_read_can_lock(l)
> +#define arch_write_can_lock(l)	queue_write_can_lock(l)
> +#define arch_read_lock(l)	queue_read_lock(l)
> +#define arch_write_lock(l)	queue_write_lock(l)
> +#define arch_read_trylock(l)	queue_read_trylock(l)
> +#define arch_write_trylock(l)	queue_write_trylock(l)
> +#define arch_read_unlock(l)	queue_read_unlock(l)
> +#define arch_write_unlock(l)	queue_write_unlock(l)
> +
> +#endif /* __ASM_GENERIC_QRWLOCK_H */
> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
> index d2b32ac..b665478 100644
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks
> @@ -223,3 +223,10 @@ endif
>  config MUTEX_SPIN_ON_OWNER
>  	def_bool y
>  	depends on SMP && !DEBUG_MUTEXES
> +
> +config ARCH_QUEUE_RWLOCK
> +	bool
> +
> +config QUEUE_RWLOCK
> +	def_bool y if ARCH_QUEUE_RWLOCK
> +	depends on SMP
> diff --git a/lib/Makefile b/lib/Makefile
> index f3bb2cb..e3175db 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -189,3 +189,4 @@ quiet_cmd_build_OID_registry = GEN     $@
>  clean-files	+= oid_registry_data.c
> 
>  obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
> +obj-$(CONFIG_QUEUE_RWLOCK) += qrwlock.o
> diff --git a/lib/qrwlock.c b/lib/qrwlock.c
> new file mode 100644
> index 0000000..a85b9e1
> --- /dev/null
> +++ b/lib/qrwlock.c
> @@ -0,0 +1,247 @@
> +/*
> + * Queue read/write lock
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
> + *
> + * Authors: Waiman Long <waiman.long@hp.com>
> + */
> +#include <linux/smp.h>
> +#include <linux/bug.h>
> +#include <linux/cpumask.h>
> +#include <linux/percpu.h>
> +#include <linux/hardirq.h>
> +#include <asm-generic/qrwlock.h>
> +
> +/*
> + * Compared with a regular rwlock, the queue rwlock has the following
> + * advantages:
> + * 1. It is more deterministic for the fair variant. Even though there is
> + *    a slight chance of stealing the lock if one comes at the right moment,
> + *    the granting of the lock is mostly in FIFO order. Even the default
> + *    unfair variant is fairer at least among the writers.
> + * 2. It is faster in high contention situations.

Sometimes, anyway!  (Referring to your performance results on top of
Ingo's patch.)

> + *
> + * The only downside is that the lock is 4 bytes larger in 32-bit systems
> + * and 12 bytes larger in 64-bit systems.
> + *
> + * There are two queues for writers. The writer field of the lock is a
> + * one-slot wait queue. The writers that follow will have to wait in the
> + * combined reader/writer queue (waitq).
> + *
> + * Compared with the x86 ticket spinlock, the queue rwlock is faster in high
> + * contention situations. The writer lock is also faster in single-thread
> + * operations. Therefore, the queue rwlock can be considered a replacement
> + * for those spinlocks that are highly contended as long as an increase
> + * in lock size is not an issue.
> + */
> +
> +/**
> + * wait_in_queue - Add to queue and wait until it is at the head
> + * @lock: Pointer to queue rwlock structure
> + * @node: Node pointer to be added to the queue
> + *
> + * The use of smp_wmb() is to make sure that the other CPUs see the change
> + * ASAP.
> + */
> +static __always_inline void
> +wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
> +{
> +	struct qrwnode *prev;
> +
> +	node->next = NULL;
> +	node->wait = true;
> +	prev = xchg(&lock->waitq, node);
> +	if (prev) {
> +		prev->next = node;
> +		smp_wmb();

This smp_wmb() desperately needs a comment.  Presumably it is ordering
the above "prev->next = node" with some later write, but what write?

Oh...  I see the header comment above.

Actually, memory barriers don't necessarily make things visible sooner.
They are instead used for ordering.  Or did you actually measure a
performance increase with this?  (Seems -highly- unlikely given smp_wmb()'s
definition on x86...)

> +		/*
> +		 * Wait until the waiting flag is off
> +		 */
> +		while (ACCESS_ONCE(node->wait))
> +			cpu_relax();
> +	}
> +}
> +
> +/**
> + * signal_next - Signal the next one in queue to be at the head
> + * @lock: Pointer to queue rwlock structure
> + * @node: Node pointer to the current head of queue
> + */
> +static __always_inline void
> +signal_next(struct qrwlock *lock, struct qrwnode *node)
> +{
> +	struct qrwnode *next;
> +
> +	/*
> +	 * Try to notify the next node first without disturbing the cacheline
> +	 * of the lock. If that fails, check to see if it is the last node
> +	 * and so should clear the wait queue.
> +	 */
> +	next = ACCESS_ONCE(node->next);
> +	if (likely(next))
> +		goto notify_next;
> +
> +	/*
> +	 * Clear the wait queue if it is the last node
> +	 */
> +	if ((ACCESS_ONCE(lock->waitq) == node) &&
> +	    (cmpxchg(&lock->waitq, node, NULL) == node))
> +			return;
> +	/*
> +	 * Wait until the next one in queue set up the next field
> +	 */
> +	while (likely(!(next = ACCESS_ONCE(node->next))))
> +		cpu_relax();
> +	/*
> +	 * The next one in queue is now at the head
> +	 */
> +notify_next:
> +	barrier();
> +	ACCESS_ONCE(next->wait) = false;
> +	smp_wmb();

Because smp_wmb() does not order reads, reads from the critical section
could leak out of the critical section.  A full memory barrier (smp_mb())
seems necessary to avoid this.

Yes, you do have full memory barriers implicit in various atomic operations,
but it appears to be possible to avoid them all in some situations.
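
A rough sketch of the barrier pairing that appears to be needed for the
handoff (my reading, not the patch itself):

	/* signaller side: order its accesses before the handoff */
	smp_mb();
	ACCESS_ONCE(next->wait) = false;

	/* waiter side: order the handoff before its own accesses */
	while (ACCESS_ONCE(node->wait))
		cpu_relax();
	smp_mb();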

> +}
> +
> +/**
> + * rspin_until_writer_unlock - inc reader count & spin until writer is gone
> + * @lock: Pointer to queue rwlock structure
> + *
> + * In interrupt context or at the head of the queue, the reader will just
> + * increment the reader count & wait until the writer releases the lock.
> + */
> +static __always_inline void
> +rspin_until_writer_unlock(struct qrwlock *lock, int inc)
> +{
> +	union qrwcnts cnts;
> +
> +	if (inc)
> +		cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
> +	else
> +		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
> +	while (cnts.writer == QW_LOCKED) {
> +		cpu_relax();
> +		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
> +	}
> +}
> +
> +/**
> + * queue_read_lock_slowpath - acquire read lock of a queue rwlock
> + * @lock: Pointer to queue rwlock structure
> + */
> +void queue_read_lock_slowpath(struct qrwlock *lock)
> +{
> +	struct qrwnode node;
> +	union qrwcnts cnts;
> +
> +	/*
> +	 * Readers come here when they cannot get the lock without waiting
> +	 */
> +	if (unlikely(irq_count())) {
> +		/*
> +		 * Readers in interrupt context will spin until the lock is
> +		 * available without waiting in the queue.
> +		 */
> +		rspin_until_writer_unlock(lock, 0);
> +		return;
> +	}
> +	cnts.rw = xadd(&lock->cnts.rw, -QRW_READER_BIAS);
> +
> +	/*
> +	 * Put the reader into the wait queue
> +	 */
> +	wait_in_queue(lock, &node);
> +
> +	/*
> +	 * At the head of the wait queue now, try to increment the reader
> +	 * count and get the lock.
> +	 */
> +	if (unlikely(cnts.fair)) {
> +		/*
> +		 * For fair reader, wait until the writer state goes to 0
> +		 * before incrementing the reader count.
> +		 */
> +		while (ACCESS_ONCE(lock->cnts.writer))
> +			cpu_relax();
> +	}
> +	rspin_until_writer_unlock(lock, 1);
> +	signal_next(lock, &node);
> +}
> +EXPORT_SYMBOL(queue_read_lock_slowpath);
> +
> +/**
> + * queue_write_3step_lock - acquire write lock in 3 steps
> + * @lock : Pointer to queue rwlock structure
> + * Return: 1 if lock acquired, 0 otherwise
> + *
> + * Step 1 - Try to acquire the lock directly if no reader is present
> + * Step 2 - Set the waiting flag to notify readers that a writer is waiting
> + * Step 3 - When the readers field goes to 0, set the locked flag
> + *
> + * When not in fair mode, the readers actually ignore the second step.
> + * However, this is still necessary to force other writers to fall in line.
> + */
> +static __always_inline int queue_write_3step_lock(struct qrwlock *lock)
> +{
> +	union qrwcnts old, new;
> +
> +	old.rw = ACCESS_ONCE(lock->cnts.rw);
> +
> +	/* Step 1 */
> +	if (!old.writer & !old.readers) {
> +		new.rw     = old.rw;
> +		new.writer = QW_LOCKED;
> +		if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
> +			return 1;
> +	}
> +
> +	/* Step 2 */
> +	if (old.writer || (cmpxchg(&lock->cnts.writer, 0, QW_WAITING) != 0))
> +		return 0;
> +
> +	/* Step 3 */
> +	while (true) {
> +		cpu_relax();
> +		old.rw = ACCESS_ONCE(lock->cnts.rw);

Suppose that there now is a writer, but no readers...

> +		if (!old.readers) {
> +			new.rw     = old.rw;
> +			new.writer = QW_LOCKED;
> +			if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw)
> +				== old.rw))

... can't this mistakenly hand out the lock to a second writer?

Ah, the trick is that we are at the head of the queue, so the only writer
we can possibly contend with is a prior holder of the lock.  Once that
writer leaves, no other writer can appear.  And the QW_WAITING bit
prevents new writers from immediately grabbing the lock.
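
The writer-byte states, as I read them (illustration only, not patch code):

	/*
	 *   0           no writer        - step 1/step 2 cmpxchg can succeed
	 *   QW_WAITING  queue-head writer parked, waiting for readers to drain
	 *   QW_LOCKED   a writer owns the lock
	 *
	 * A newly arriving writer does cmpxchg(&lock->cnts.rw, 0, QW_LOCKED)
	 * in its fastpath; that fails while the byte is QW_WAITING or
	 * QW_LOCKED, so it has to queue up behind the current head.
	 */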

> +				return 1;
> +		}
> +	}
> +	/* Should never reach here */
> +	return 0;
> +}
> +
> +/**
> + * queue_write_lock_slowpath - acquire write lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + */
> +void queue_write_lock_slowpath(struct qrwlock *lock)
> +{
> +	struct qrwnode node;
> +
> +	/*
> +	 * Put the writer into the wait queue
> +	 */
> +	wait_in_queue(lock, &node);
> +
> +	/*
> +	 * At the head of the wait queue now, call queue_write_3step_lock()
> +	 * to acquire the lock until it is done.
> +	 */
> +	while (!queue_write_3step_lock(lock))
> +		cpu_relax();

If we get here, queue_write_3step_lock() just executed a successful
cmpxchg(), which implies a full memory barrier.  This prevents the
critical section from leaking out, good!

> +	signal_next(lock, &node);
> +}
> +EXPORT_SYMBOL(queue_write_lock_slowpath);
> -- 
> 1.7.1
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 4/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing
  2013-11-04 17:17 ` [PATCH v5 4/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing Waiman Long
@ 2013-11-08 21:21   ` Paul E. McKenney
  2013-11-08 22:42     ` Waiman Long
  2013-11-09  1:17     ` Tim Chen
  0 siblings, 2 replies; 13+ messages in thread
From: Paul E. McKenney @ 2013-11-08 21:21 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Aswin Chandramouleeswaran",
	Scott J Norton

On Mon, Nov 04, 2013 at 12:17:20PM -0500, Waiman Long wrote:
> There is a pending patch in the rwsem patch series that adds generic
> MCS locking helper functions to do MCS-style locking. This patch
> will enable the queue rwlock to use that generic MCS lock/unlock
> primitives for internal queuing. This patch should only be merged
> after the merging of that generic MCS locking patch.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>

This one might address at least some of the earlier memory-barrier
issues, assuming that the MCS lock is properly memory-barriered.

Then again, maybe not.  Please see below.

							Thanx, Paul

> ---
>  include/asm-generic/qrwlock.h |    7 +--
>  lib/qrwlock.c                 |  140 +++++++++++++----------------------------
>  2 files changed, 45 insertions(+), 102 deletions(-)
> 
> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> index 78ad4a5..014e6e9 100644
> --- a/include/asm-generic/qrwlock.h
> +++ b/include/asm-generic/qrwlock.h
> @@ -54,10 +54,7 @@ typedef u64 __nrcpupair_t;
>   * QRW_READER_BIAS to the rw field to increment the reader count won't
>   * disturb the writer and the fair fields.
>   */
> -struct qrwnode {
> -	struct qrwnode *next;
> -	bool		wait;	/* Waiting flag */
> -};
> +struct mcs_spinlock;
> 
>  typedef struct qrwlock {
>  	union qrwcnts {
> @@ -74,7 +71,7 @@ typedef struct qrwlock {
>  		};
>  		__nrcpupair_t rw;		/* Reader/writer number pair */
>  	} cnts;
> -	struct qrwnode *waitq;			/* Tail of waiting queue */
> +	struct mcs_spinlock *waitq;		/* Tail of waiting queue */
>  } arch_rwlock_t;
> 
>  /*
> diff --git a/lib/qrwlock.c b/lib/qrwlock.c
> index a85b9e1..6817853 100644
> --- a/lib/qrwlock.c
> +++ b/lib/qrwlock.c
> @@ -20,6 +20,7 @@
>  #include <linux/cpumask.h>
>  #include <linux/percpu.h>
>  #include <linux/hardirq.h>
> +#include <linux/mcs_spinlock.h>
>  #include <asm-generic/qrwlock.h>
> 
>  /*
> @@ -46,87 +47,16 @@
>   */
> 
>  /**
> - * wait_in_queue - Add to queue and wait until it is at the head
> - * @lock: Pointer to queue rwlock structure
> - * @node: Node pointer to be added to the queue
> - *
> - * The use of smp_wmb() is to make sure that the other CPUs see the change
> - * ASAP.
> - */
> -static __always_inline void
> -wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
> -{
> -	struct qrwnode *prev;
> -
> -	node->next = NULL;
> -	node->wait = true;
> -	prev = xchg(&lock->waitq, node);
> -	if (prev) {
> -		prev->next = node;
> -		smp_wmb();
> -		/*
> -		 * Wait until the waiting flag is off
> -		 */
> -		while (ACCESS_ONCE(node->wait))
> -			cpu_relax();
> -	}
> -}
> -
> -/**
> - * signal_next - Signal the next one in queue to be at the head
> - * @lock: Pointer to queue rwlock structure
> - * @node: Node pointer to the current head of queue
> - */
> -static __always_inline void
> -signal_next(struct qrwlock *lock, struct qrwnode *node)
> -{
> -	struct qrwnode *next;
> -
> -	/*
> -	 * Try to notify the next node first without disturbing the cacheline
> -	 * of the lock. If that fails, check to see if it is the last node
> -	 * and so should clear the wait queue.
> -	 */
> -	next = ACCESS_ONCE(node->next);
> -	if (likely(next))
> -		goto notify_next;
> -
> -	/*
> -	 * Clear the wait queue if it is the last node
> -	 */
> -	if ((ACCESS_ONCE(lock->waitq) == node) &&
> -	    (cmpxchg(&lock->waitq, node, NULL) == node))
> -			return;
> -	/*
> -	 * Wait until the next one in queue set up the next field
> -	 */
> -	while (likely(!(next = ACCESS_ONCE(node->next))))
> -		cpu_relax();
> -	/*
> -	 * The next one in queue is now at the head
> -	 */
> -notify_next:
> -	barrier();
> -	ACCESS_ONCE(next->wait) = false;
> -	smp_wmb();
> -}
> -
> -/**
>   * rspin_until_writer_unlock - inc reader count & spin until writer is gone
>   * @lock: Pointer to queue rwlock structure
> + * @cnts: Queue read/write lock counts structure
>   *
>   * In interrupt context or at the head of the queue, the reader will just
>   * increment the reader count & wait until the writer releases the lock.
>   */
>  static __always_inline void
> -rspin_until_writer_unlock(struct qrwlock *lock, int inc)
> +rspin_until_writer_unlock(struct qrwlock *lock, union qrwcnts cnts)
>  {
> -	union qrwcnts cnts;
> -
> -	if (inc)
> -		cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
> -	else
> -		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
>  	while (cnts.writer == QW_LOCKED) {
>  		cpu_relax();
>  		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
> @@ -139,7 +69,7 @@ rspin_until_writer_unlock(struct qrwlock *lock, int inc)
>   */
>  void queue_read_lock_slowpath(struct qrwlock *lock)
>  {
> -	struct qrwnode node;
> +	struct mcs_spinlock node;
>  	union qrwcnts cnts;
> 
>  	/*
> @@ -150,7 +80,8 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
>  		 * Readers in interrupt context will spin until the lock is
>  		 * available without waiting in the queue.
>  		 */
> -		rspin_until_writer_unlock(lock, 0);
> +		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
> +		rspin_until_writer_unlock(lock, cnts);
>  		return;
>  	}
>  	cnts.rw = xadd(&lock->cnts.rw, -QRW_READER_BIAS);
> @@ -158,7 +89,7 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
>  	/*
>  	 * Put the reader into the wait queue
>  	 */
> -	wait_in_queue(lock, &node);
> +	mcs_spin_lock(&lock->waitq, &node);
> 
>  	/*
>  	 * At the head of the wait queue now, try to increment the reader
> @@ -172,12 +103,36 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
>  		while (ACCESS_ONCE(lock->cnts.writer))
>  			cpu_relax();
>  	}
> -	rspin_until_writer_unlock(lock, 1);
> -	signal_next(lock, &node);
> +	/*
> +	 * Increment reader count & wait until writer unlock
> +	 */
> +	cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
> +	rspin_until_writer_unlock(lock, cnts);
> +	mcs_spin_unlock(&lock->waitq, &node);

But mcs_spin_unlock() is only required to do a RELEASE barrier, which
could still allow critical-section leakage.

>  }
>  EXPORT_SYMBOL(queue_read_lock_slowpath);
> 
>  /**
> + * _write_trylock - try to acquire a write lock
> + * @lock : Pointer to queue rwlock structure
> + * @old  : Old value of the qrwcnts
> + * @new  : New value of the qrwcnts
> + * Return: 1 if lock acquired, 0 otherwise
> + *
> + * Passing old & new as function arguments can force the compiler to generate
> + * better code with less stack memory access.
> + */
> +static __always_inline int _write_trylock(struct qrwlock *lock,
> +				union qrwcnts old, union qrwcnts new)
> +{
> +	new.rw     = old.rw;
> +	new.writer = QW_LOCKED;
> +	if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
> +		return 1;
> +	return 0;
> +}
> +
> +/**
>   * queue_write_3step_lock - acquire write lock in 3 steps
>   * @lock : Pointer to queue rwlock structure
>   * Return: 1 if lock acquired, 0 otherwise
> @@ -194,33 +149,24 @@ static __always_inline int queue_write_3step_lock(struct qrwlock *lock)
>  	union qrwcnts old, new;
> 
>  	old.rw = ACCESS_ONCE(lock->cnts.rw);
> +	new.rw = 0;
> 
>  	/* Step 1 */
> -	if (!old.writer & !old.readers) {
> -		new.rw     = old.rw;
> -		new.writer = QW_LOCKED;
> -		if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
> -			return 1;
> -	}
> +	if (!old.writer && !old.readers && _write_trylock(lock, old, new))
> +		return 1;
> 
>  	/* Step 2 */
>  	if (old.writer || (cmpxchg(&lock->cnts.writer, 0, QW_WAITING) != 0))
>  		return 0;
> 
>  	/* Step 3 */
> -	while (true) {
> +	cpu_relax();
> +	old.rw = ACCESS_ONCE(lock->cnts.rw);
> +	while (old.readers || !_write_trylock(lock, old, new)) {
>  		cpu_relax();
>  		old.rw = ACCESS_ONCE(lock->cnts.rw);
> -		if (!old.readers) {
> -			new.rw     = old.rw;
> -			new.writer = QW_LOCKED;
> -			if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw)
> -				== old.rw))
> -				return 1;
> -		}
>  	}
> -	/* Should never reach here */
> -	return 0;
> +	return 1;

This one still seems properly barriered, good!

>  }
> 
>  /**
> @@ -229,12 +175,12 @@ static __always_inline int queue_write_3step_lock(struct qrwlock *lock)
>   */
>  void queue_write_lock_slowpath(struct qrwlock *lock)
>  {
> -	struct qrwnode node;
> +	struct mcs_spinlock node;
> 
>  	/*
>  	 * Put the writer into the wait queue
>  	 */
> -	wait_in_queue(lock, &node);
> +	mcs_spin_lock(&lock->waitq, &node);
> 
>  	/*
>  	 * At the head of the wait queue now, call queue_write_3step_lock()
> @@ -242,6 +188,6 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
>  	 */
>  	while (!queue_write_3step_lock(lock))
>  		cpu_relax();
> -	signal_next(lock, &node);
> +	mcs_spin_unlock(&lock->waitq, &node);
>  }
>  EXPORT_SYMBOL(queue_write_lock_slowpath);
> -- 
> 1.7.1
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 1/4] qrwlock: A queue read/write lock implementation
  2013-11-08 21:11   ` Paul E. McKenney
@ 2013-11-08 22:36     ` Waiman Long
  2013-11-08 23:51       ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Waiman Long @ 2013-11-08 22:36 UTC (permalink / raw)
  To: paulmck
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen, aswin,
	Scott J Norton

On 11/08/2013 04:11 PM, Paul E. McKenney wrote:
> On Mon, Nov 04, 2013 at 12:17:17PM -0500, Waiman Long wrote:
>> Kernel	JPM	%Change from (1)
>> ------	---	----------------
>>    1	148265		-
>>    2	238715	       +61%
>>    3	242048	       +63%
>>    4	234881	       +58%
>>
>> The use of unfair qrwlock provides a small boost of 2%, while using
>> fair qrwlock leads to 3% decrease of performance. However, looking
>> at the perf profiles, we can clearly see that other bottlenecks were
>> constraining the performance improvement.
>>
>> Perf profile of kernel (2):
>>
>>    18.20%   reaim  [kernel.kallsyms]  [k] __write_lock_failed
>>     9.36%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>     2.91%   reaim  [kernel.kallsyms]  [k] mspin_lock
>>     2.73%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
>>     2.23%      ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>     1.29%   reaim  [kernel.kallsyms]  [k] __read_lock_failed
>>     1.21%    true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>     1.14%   reaim  [kernel.kallsyms]  [k] zap_pte_range
>>     1.13%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock
>>     1.04%   reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
>>
>> Perf profile of kernel (3):
>>
>>    10.57%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>     7.98%   reaim  [kernel.kallsyms]  [k] queue_write_lock_slowpath
>>     5.83%   reaim  [kernel.kallsyms]  [k] mspin_lock
>>     2.86%      ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>     2.71%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
>>     1.52%    true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>     1.51%   reaim  [kernel.kallsyms]  [k] queue_read_lock_slowpath
>>     1.35%   reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
>>     1.12%   reaim  [kernel.kallsyms]  [k] zap_pte_range
>>     1.06%   reaim  [kernel.kallsyms]  [k] perf_event_aux_ctx
>>     1.01%   reaim  [kernel.kallsyms]  [k] perf_event_aux
> But wouldn't kernel (4) be the one that was the most highly constrained?
>
> (That said, yes, I get that _raw_spin_lock_irqsave() is some lock that
> is unrelated to the qrwlock.)

I think the performance data is a bit off as it was collected with a 
previous version that has a minor bug in it. I will rerun the test to 
get the new data.

>
> +/**
> + * queue_write_can_lock- would write_trylock() succeed?
> + * @lock: Pointer to queue rwlock structure
> + */
> +static inline int queue_write_can_lock(struct qrwlock *lock)
> +{
> +	union qrwcnts rwcnts;
> +
> +	rwcnts.rw = ACCESS_ONCE(lock->cnts.rw);
> +	return !rwcnts.writer&&  !rwcnts.readers;
> +}
> +
> +/**
> + * queue_read_trylock - try to acquire read lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + * Return: 1 if lock acquired, 0 if failed
> + */
> +static inline int queue_read_trylock(struct qrwlock *lock)
> +{
> +	union qrwcnts cnts;
> +	u8 wmask;
> +
> +	cnts.rw = ACCESS_ONCE(lock->cnts.rw);
> +	wmask   = cnts.fair ? QW_MASK_FAIR : QW_MASK_UNFAIR;
> +	if (likely(!(cnts.writer&  wmask))) {
> +		cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
> On an unfair lock, this can momentarily make queue_read_can_lock() give
> a false positive.  Not sure that this is a problem -- after all, the
> return value from queue_read_can_lock() is immediately obsolete anyway.

Yes, this is an issue. However, I don't think this is a big deal, as you 
said. Using cmpxchg may avoid this issue, but then there will be a fair 
chance of false collisions among readers. So it is probably something we 
may have to live with.
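
Roughly, such a cmpxchg-based read path might look like the following 
(sketch only, reusing the wmask computed in queue_read_trylock(); this is 
not part of the patch):

	union qrwcnts old, new;

	old.rw = ACCESS_ONCE(lock->cnts.rw);
	if (old.writer & wmask)
		return 0;
	new.rw = old.rw + QRW_READER_BIAS;
	/*
	 * Two readers that load the same old.rw race here, and one of
	 * them fails the cmpxchg even with no writer present -- the
	 * false collision mentioned above.
	 */
	return cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw;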

>> +/**
>> + * queue_write_unlock - release write lock of a queue rwlock
>> + * @lock : Pointer to queue rwlock structure
>> + */
>> +static inline void queue_write_unlock(struct qrwlock *lock)
>> +{
>> +	/*
>> +	 * Make sure that none of the critical section will be leaked out.
>> +	 */
>> +	smp_mb__before_clear_bit();
>> +	ACCESS_ONCE(lock->cnts.writer) = 0;
>> +	smp_mb__after_clear_bit();
> How about the new smp_store_release() for this write?  Looks to me that
> smp_mb__before_clear_bit() and smp_mb__after_clear_bit() work by accident,
> if they in fact do work for all architectures.

Yes, I am thinking about updating the patch to use smp_store_release() 
once it is in. I don't want to create such a dependency for this patch 
yet. Anyway, clearing a bit looks the same as clearing a byte to me.

>> +/*
>> + * Compared with regular rwlock, the queue rwlock has has the following
>> + * advantages:
>> + * 1. It is more deterministic for the fair variant. Even though there is
>> + *    a slight chance of stealing the lock if come at the right moment, the
>> + *    granting of the lock is mostly in FIFO order. Even the default unfair
>> + *    variant is fairer at least among the writers.
>> + * 2. It is faster in high contention situation.
> Sometimes, anyway!  (Referring to your performance results on top of
> Ingo's patch.)

Will adjust the wording by adding "usually".

>
> +/**
> + * wait_in_queue - Add to queue and wait until it is at the head
> + * @lock: Pointer to queue rwlock structure
> + * @node: Node pointer to be added to the queue
> + *
> + * The use of smp_wmb() is to make sure that the other CPUs see the change
> + * ASAP.
> + */
> +static __always_inline void
> +wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
> +{
> +	struct qrwnode *prev;
> +
> +	node->next = NULL;
> +	node->wait = true;
> +	prev = xchg(&lock->waitq, node);
> +	if (prev) {
> +		prev->next = node;
> +		smp_wmb();
> This smp_wmb() desperately needs a comment.  Presumably it is ordering
> the above "prev->next = node" with some later write, but what write?
>
> Oh...  I see the header comment above.
>
> Actually, memory barriers don't necessarily make things visible sooner.
> They are instead used for ordering.  Or did you actually measure a
> performance increase with this?  (Seems -highly- unlikely given smp_wmb()'s
> definition on x86...)

I had some incorrect assumptions about memory barriers. Anyway, this 
issue will be gone once I use the MCS lock/unlock code.

>> +		/*
>> +		 * Wait until the waiting flag is off
>> +		 */
>> +		while (ACCESS_ONCE(node->wait))
>> +			cpu_relax();
>> +	}
>> +}
>> +
>> +/**
>> + * signal_next - Signal the next one in queue to be at the head
>> + * @lock: Pointer to queue rwlock structure
>> + * @node: Node pointer to the current head of queue
>> + */
>> +static __always_inline void
>> +signal_next(struct qrwlock *lock, struct qrwnode *node)
>> +{
>> +	struct qrwnode *next;
>> +
>> +	/*
>> +	 * Try to notify the next node first without disturbing the cacheline
>> +	 * of the lock. If that fails, check to see if it is the last node
>> +	 * and so should clear the wait queue.
>> +	 */
>> +	next = ACCESS_ONCE(node->next);
>> +	if (likely(next))
>> +		goto notify_next;
>> +
>> +	/*
>> +	 * Clear the wait queue if it is the last node
>> +	 */
>> +	if ((ACCESS_ONCE(lock->waitq) == node)&&
>> +	    (cmpxchg(&lock->waitq, node, NULL) == node))
>> +			return;
>> +	/*
>> +	 * Wait until the next one in queue set up the next field
>> +	 */
>> +	while (likely(!(next = ACCESS_ONCE(node->next))))
>> +		cpu_relax();
>> +	/*
>> +	 * The next one in queue is now at the head
>> +	 */
>> +notify_next:
>> +	barrier();
>> +	ACCESS_ONCE(next->wait) = false;
>> +	smp_wmb();
> Because smp_wmb() does not order reads, reads from the critical section
> could leak out of the critical section.  A full memory barrier (smp_mb())
> seems necessary to avoid this.
>
> Yes, you do have full memory barriers implicit in various atomic operations,
> but it appears to be possible to avoid them all in some situations.

Yes, you are right.

>> +/**
>> + * queue_write_3step_lock - acquire write lock in 3 steps
>> + * @lock : Pointer to queue rwlock structure
>> + * Return: 1 if lock acquired, 0 otherwise
>> + *
>> + * Step 1 - Try to acquire the lock directly if no reader is present
>> + * Step 2 - Set the waiting flag to notify readers that a writer is waiting
>> + * Step 3 - When the readers field goes to 0, set the locked flag
>> + *
>> + * When not in fair mode, the readers actually ignore the second step.
>> + * However, this is still necessary to force other writers to fall in line.
>> + */
>> +static __always_inline int queue_write_3step_lock(struct qrwlock *lock)
>> +{
>> +	union qrwcnts old, new;
>> +
>> +	old.rw = ACCESS_ONCE(lock->cnts.rw);
>> +
>> +	/* Step 1 */
>> +	if (!old.writer&  !old.readers) {
>> +		new.rw     = old.rw;
>> +		new.writer = QW_LOCKED;
>> +		if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
>> +			return 1;
>> +	}
>> +
>> +	/* Step 2 */
>> +	if (old.writer || (cmpxchg(&lock->cnts.writer, 0, QW_WAITING) != 0))
>> +		return 0;
>> +
>> +	/* Step 3 */
>> +	while (true) {
>> +		cpu_relax();
>> +		old.rw = ACCESS_ONCE(lock->cnts.rw);
> Suppose that there now is a writer, but no readers...
>
>> +		if (!old.readers) {
>> +			new.rw     = old.rw;
>> +			new.writer = QW_LOCKED;
>> +			if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw)
>> +				== old.rw))
> ... can't this mistakenly hand out the lock to a second writer?
>
> Ah, the trick is that we are at the head of the queue, so the only writer
> we can possibly contend with is a prior holder of the lock.  Once that
> writer leaves, no other writer but can appear.  And the QW_WAITING bit
> prevents new writers from immediately grabbing the lock.

Yes, that is the point. The current code has two queues for writers - a 
one-position queue in the waiting bit and the MCS locking queue that is 
used by both readers and writers.
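
In other words, the flow is roughly like this (a sketch, not taken from 
the patch):

	/*
	 *   readers & writers
	 *         |
	 *         v
	 *   MCS queue (lock->waitq): tail -> ... -> head
	 *         |
	 *         +-- reader at the head: bump the reader count, spin
	 *         |   until the writer byte is clear
	 *         |
	 *         +-- writer at the head: cnts.writer 0 -> QW_WAITING,
	 *             wait for readers to drain, then set QW_LOCKED
	 */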

>> +				return 1;
>> +		}
>> +	}
>> +	/* Should never reach here */
>> +	return 0;
>> +}
>> +
>> +/**
>> + * queue_write_lock_slowpath - acquire write lock of a queue rwlock
>> + * @lock : Pointer to queue rwlock structure
>> + */
>> +void queue_write_lock_slowpath(struct qrwlock *lock)
>> +{
>> +	struct qrwnode node;
>> +
>> +	/*
>> +	 * Put the writer into the wait queue
>> +	 */
>> +	wait_in_queue(lock,&node);
>> +
>> +	/*
>> +	 * At the head of the wait queue now, call queue_write_3step_lock()
>> +	 * to acquire the lock until it is done.
>> +	 */
>> +	while (!queue_write_3step_lock(lock))
>> +		cpu_relax();
> If we get here, queue_write_3step_lock() just executed a successful
> cmpxchg(), which implies a full memory barrier.  This prevents the
> critical section from leaking out, good!
>
>> +	signal_next(lock,&node);
>> +}
>> +EXPORT_SYMBOL(queue_write_lock_slowpath);
>> -- 
>> 1.7.1
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

Thanks for the review.

-Longman

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 4/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing
  2013-11-08 21:21   ` Paul E. McKenney
@ 2013-11-08 22:42     ` Waiman Long
  2013-11-09  1:17     ` Tim Chen
  1 sibling, 0 replies; 13+ messages in thread
From: Waiman Long @ 2013-11-08 22:42 UTC (permalink / raw)
  To: paulmck
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen, aswin,
	Scott J Norton

On 11/08/2013 04:21 PM, Paul E. McKenney wrote:
> On Mon, Nov 04, 2013 at 12:17:20PM -0500, Waiman Long wrote:
>> There is a pending patch in the rwsem patch series that adds a generic
>> MCS locking helper functions to do MCS-style locking. This patch
>> will enable the queue rwlock to use that generic MCS lock/unlock
>> primitives for internal queuing. This patch should only be merged
>> after the merging of that generic MCS locking patch.
>>
>> Signed-off-by: Waiman Long<Waiman.Long@hp.com>
> This one does might address at least some of the earlier memory-barrier
> issues, at least assuming that the MCS lock is properly memory-barriered.
>
> Then again, maybe not.  Please see below.
>
> 							Thanx, Paul
>
>>   	/*
>>   	 * At the head of the wait queue now, try to increment the reader
>> @@ -172,12 +103,36 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
>>   		while (ACCESS_ONCE(lock->cnts.writer))
>>   			cpu_relax();
>>   	}
>> -	rspin_until_writer_unlock(lock, 1);
>> -	signal_next(lock,&node);
>> +	/*
>> +	 * Increment reader count&  wait until writer unlock
>> +	 */
>> +	cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
>> +	rspin_until_writer_unlock(lock, cnts);
>> +	mcs_spin_unlock(&lock->waitq,&node);
> But mcs_spin_unlock() is only required to do a RELEASE barrier, which
> could still allow critical-section leakage.

Yes, that is a problem. I will try to add an ACQUIRE barrier in reading 
the writer byte.
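
One possible shape for that (assuming the load-acquire primitive from the 
MCS lock series; this is just a sketch):

	/*
	 * In rspin_until_writer_unlock(), read the writer byte with ACQUIRE
	 * semantics so that the final "writer is gone" observation also
	 * orders the reader's critical section after it:
	 */
	while (smp_load_acquire(&lock->cnts.writer) == QW_LOCKED)
		cpu_relax();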

-Longman


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 1/4] qrwlock: A queue read/write lock implementation
  2013-11-08 22:36     ` Waiman Long
@ 2013-11-08 23:51       ` Paul E. McKenney
  2013-11-09  3:05         ` Waiman Long
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2013-11-08 23:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen, aswin,
	Scott J Norton

On Fri, Nov 08, 2013 at 05:36:12PM -0500, Waiman Long wrote:
> On 11/08/2013 04:11 PM, Paul E. McKenney wrote:
> >On Mon, Nov 04, 2013 at 12:17:17PM -0500, Waiman Long wrote:
> >>Kernel	JPM	%Change from (1)
> >>------	---	----------------
> >>   1	148265		-
> >>   2	238715	       +61%
> >>   3	242048	       +63%
> >>   4	234881	       +58%
> >>
> >>The use of unfair qrwlock provides a small boost of 2%, while using
> >>fair qrwlock leads to 3% decrease of performance. However, looking
> >>at the perf profiles, we can clearly see that other bottlenecks were
> >>constraining the performance improvement.
> >>
> >>Perf profile of kernel (2):
> >>
> >>   18.20%   reaim  [kernel.kallsyms]  [k] __write_lock_failed
> >>    9.36%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >>    2.91%   reaim  [kernel.kallsyms]  [k] mspin_lock
> >>    2.73%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
> >>    2.23%      ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >>    1.29%   reaim  [kernel.kallsyms]  [k] __read_lock_failed
> >>    1.21%    true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >>    1.14%   reaim  [kernel.kallsyms]  [k] zap_pte_range
> >>    1.13%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock
> >>    1.04%   reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
> >>
> >>Perf profile of kernel (3):
> >>
> >>   10.57%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >>    7.98%   reaim  [kernel.kallsyms]  [k] queue_write_lock_slowpath
> >>    5.83%   reaim  [kernel.kallsyms]  [k] mspin_lock
> >>    2.86%      ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >>    2.71%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
> >>    1.52%    true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >>    1.51%   reaim  [kernel.kallsyms]  [k] queue_read_lock_slowpath
> >>    1.35%   reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
> >>    1.12%   reaim  [kernel.kallsyms]  [k] zap_pte_range
> >>    1.06%   reaim  [kernel.kallsyms]  [k] perf_event_aux_ctx
> >>    1.01%   reaim  [kernel.kallsyms]  [k] perf_event_aux
> >But wouldn't kernel (4) be the one that was the most highly constrained?
> >
> >(That said, yes, I get that _raw_spin_lock_irqsave() is some lock that
> >is unrelated to the qrwlock.)
> 
> I think the performance data is a bit off as it was collected with a
> previous version that has a minor bug in it. I will rerun the test
> to get the new data.
> 
> >
> >+/**
> >+ * queue_write_can_lock- would write_trylock() succeed?
> >+ * @lock: Pointer to queue rwlock structure
> >+ */
> >+static inline int queue_write_can_lock(struct qrwlock *lock)
> >+{
> >+	union qrwcnts rwcnts;
> >+
> >+	rwcnts.rw = ACCESS_ONCE(lock->cnts.rw);
> >+	return !rwcnts.writer&&  !rwcnts.readers;
> >+}
> >+
> >+/**
> >+ * queue_read_trylock - try to acquire read lock of a queue rwlock
> >+ * @lock : Pointer to queue rwlock structure
> >+ * Return: 1 if lock acquired, 0 if failed
> >+ */
> >+static inline int queue_read_trylock(struct qrwlock *lock)
> >+{
> >+	union qrwcnts cnts;
> >+	u8 wmask;
> >+
> >+	cnts.rw = ACCESS_ONCE(lock->cnts.rw);
> >+	wmask   = cnts.fair ? QW_MASK_FAIR : QW_MASK_UNFAIR;
> >+	if (likely(!(cnts.writer&  wmask))) {
> >+		cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
> >On an unfair lock, this can momentarily make queue_read_can_lock() give
> >a false positive.  Not sure that this is a problem -- after all, the
> >return value from queue_read_can_lock() is immediately obsolete anyway.
> 
> Yes, this is an issue. However, I don't this is a big deal as you
> said. Using cmpxchg may avoid this issue, but then their will be a
> fair chance of false collision among readers. So it is probably
> something we may have to live with.
> 
> >>+/**
> >>+ * queue_write_unlock - release write lock of a queue rwlock
> >>+ * @lock : Pointer to queue rwlock structure
> >>+ */
> >>+static inline void queue_write_unlock(struct qrwlock *lock)
> >>+{
> >>+	/*
> >>+	 * Make sure that none of the critical section will be leaked out.
> >>+	 */
> >>+	smp_mb__before_clear_bit();
> >>+	ACCESS_ONCE(lock->cnts.writer) = 0;
> >>+	smp_mb__after_clear_bit();
> >How about the new smp_store_release() for this write?  Looks to me that
> >smp_mb__before_clear_bit() and smp_mb__after_clear_bit() work by accident,
> >if they in fact do work for all architectures.
> 
> Yes, I am thinking about updating the patch to use
> smp_store_release() once it is in. I don't want to create such a
> dependency for this patch yet. Anyway, clearing a bit looks the same
> as clearing a byte to me.
> 
> >>+/*
> >>+ * Compared with regular rwlock, the queue rwlock has has the following
> >>+ * advantages:
> >>+ * 1. It is more deterministic for the fair variant. Even though there is
> >>+ *    a slight chance of stealing the lock if come at the right moment, the
> >>+ *    granting of the lock is mostly in FIFO order. Even the default unfair
> >>+ *    variant is fairer at least among the writers.
> >>+ * 2. It is faster in high contention situation.
> >Sometimes, anyway!  (Referring to your performance results on top of
> >Ingo's patch.)
> 
> Will adjust the wording by adding "usually".
> 
> >
> >+/**
> >+ * wait_in_queue - Add to queue and wait until it is at the head
> >+ * @lock: Pointer to queue rwlock structure
> >+ * @node: Node pointer to be added to the queue
> >+ *
> >+ * The use of smp_wmb() is to make sure that the other CPUs see the change
> >+ * ASAP.
> >+ */
> >+static __always_inline void
> >+wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
> >+{
> >+	struct qrwnode *prev;
> >+
> >+	node->next = NULL;
> >+	node->wait = true;
> >+	prev = xchg(&lock->waitq, node);
> >+	if (prev) {
> >+		prev->next = node;
> >+		smp_wmb();
> >This smp_wmb() desperately needs a comment.  Presumably it is ordering
> >the above "prev->next = node" with some later write, but what write?
> >
> >Oh...  I see the header comment above.
> >
> >Actually, memory barriers don't necessarily make things visible sooner.
> >They are instead used for ordering.  Or did you actually measure a
> >performance increase with this?  (Seems -highly- unlikely given smp_wmb()'s
> >definition on x86...)
> 
> I have some incorrect assumptions about memory barrier. Anyway, this
> issue will be gone once I use the MCS lock/unlock code.

Here is a presentation that has some diagrams that might help:

http://www.rdrop.com/users/paulmck/scalability/paper/Scaling.2013.10.25c.pdf

So, for example, if X and Y are both initially zero:

	CPU 0			CPU 1

	ACCESS_ONCE(X) = 1;	r1 = ACCESS_ONCE(Y);
	smp_wmb();		smp_rmb();
	ACCESS_ONCE(Y) = 1;	r2 = ACCESS_ONCE(X);

Then the two memory barriers enforce a conditional ordering.  The
condition is whether or not CPU 0's store to Y is seen by CPU 1's
load from Y.  If it is, then the pair of memory barriers ensure that
CPU 1's load from X sees the result of CPU 0's store to X.  In other
words, BUG_ON(r1 == 1 && r2 == 0) will never fire.

In general, if a memory access after memory barrier A happens before
a memory access before memory barrier B, then the two memory barriers
will ensure that applicable accesses before memory barrier A happen
before applicable accesses after memory barrier B.

Does that help?

							Thanx, Paul

> >>+		/*
> >>+		 * Wait until the waiting flag is off
> >>+		 */
> >>+		while (ACCESS_ONCE(node->wait))
> >>+			cpu_relax();
> >>+	}
> >>+}
> >>+
> >>+/**
> >>+ * signal_next - Signal the next one in queue to be at the head
> >>+ * @lock: Pointer to queue rwlock structure
> >>+ * @node: Node pointer to the current head of queue
> >>+ */
> >>+static __always_inline void
> >>+signal_next(struct qrwlock *lock, struct qrwnode *node)
> >>+{
> >>+	struct qrwnode *next;
> >>+
> >>+	/*
> >>+	 * Try to notify the next node first without disturbing the cacheline
> >>+	 * of the lock. If that fails, check to see if it is the last node
> >>+	 * and so should clear the wait queue.
> >>+	 */
> >>+	next = ACCESS_ONCE(node->next);
> >>+	if (likely(next))
> >>+		goto notify_next;
> >>+
> >>+	/*
> >>+	 * Clear the wait queue if it is the last node
> >>+	 */
> >>+	if ((ACCESS_ONCE(lock->waitq) == node)&&
> >>+	    (cmpxchg(&lock->waitq, node, NULL) == node))
> >>+			return;
> >>+	/*
> >>+	 * Wait until the next one in queue set up the next field
> >>+	 */
> >>+	while (likely(!(next = ACCESS_ONCE(node->next))))
> >>+		cpu_relax();
> >>+	/*
> >>+	 * The next one in queue is now at the head
> >>+	 */
> >>+notify_next:
> >>+	barrier();
> >>+	ACCESS_ONCE(next->wait) = false;
> >>+	smp_wmb();
> >Because smp_wmb() does not order reads, reads from the critical section
> >could leak out of the critical section.  A full memory barrier (smp_mb())
> >seems necessary to avoid this.
> >
> >Yes, you do have full memory barriers implicit in various atomic operations,
> >but it appears to be possible to avoid them all in some situations.
> 
> Yes, you are right.
> 
> >>+/**
> >>+ * queue_write_3step_lock - acquire write lock in 3 steps
> >>+ * @lock : Pointer to queue rwlock structure
> >>+ * Return: 1 if lock acquired, 0 otherwise
> >>+ *
> >>+ * Step 1 - Try to acquire the lock directly if no reader is present
> >>+ * Step 2 - Set the waiting flag to notify readers that a writer is waiting
> >>+ * Step 3 - When the readers field goes to 0, set the locked flag
> >>+ *
> >>+ * When not in fair mode, the readers actually ignore the second step.
> >>+ * However, this is still necessary to force other writers to fall in line.
> >>+ */
> >>+static __always_inline int queue_write_3step_lock(struct qrwlock *lock)
> >>+{
> >>+	union qrwcnts old, new;
> >>+
> >>+	old.rw = ACCESS_ONCE(lock->cnts.rw);
> >>+
> >>+	/* Step 1 */
> >>+	if (!old.writer&  !old.readers) {
> >>+		new.rw     = old.rw;
> >>+		new.writer = QW_LOCKED;
> >>+		if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
> >>+			return 1;
> >>+	}
> >>+
> >>+	/* Step 2 */
> >>+	if (old.writer || (cmpxchg(&lock->cnts.writer, 0, QW_WAITING) != 0))
> >>+		return 0;
> >>+
> >>+	/* Step 3 */
> >>+	while (true) {
> >>+		cpu_relax();
> >>+		old.rw = ACCESS_ONCE(lock->cnts.rw);
> >Suppose that there now is a writer, but no readers...
> >
> >>+		if (!old.readers) {
> >>+			new.rw     = old.rw;
> >>+			new.writer = QW_LOCKED;
> >>+			if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw)
> >>+				== old.rw))
> >... can't this mistakenly hand out the lock to a second writer?
> >
> >Ah, the trick is that we are at the head of the queue, so the only writer
> >we can possibly contend with is a prior holder of the lock.  Once that
> >writer leaves, no other writer but can appear.  And the QW_WAITING bit
> >prevents new writers from immediately grabbing the lock.
> 
> Yes, that is the point. The current has 2 queue for writers - a one
> position queue in the waiting bit and the MCS locking queue that are
> used by both readers and writers.
> 
> >>+				return 1;
> >>+		}
> >>+	}
> >>+	/* Should never reach here */
> >>+	return 0;
> >>+}
> >>+
> >>+/**
> >>+ * queue_write_lock_slowpath - acquire write lock of a queue rwlock
> >>+ * @lock : Pointer to queue rwlock structure
> >>+ */
> >>+void queue_write_lock_slowpath(struct qrwlock *lock)
> >>+{
> >>+	struct qrwnode node;
> >>+
> >>+	/*
> >>+	 * Put the writer into the wait queue
> >>+	 */
> >>+	wait_in_queue(lock,&node);
> >>+
> >>+	/*
> >>+	 * At the head of the wait queue now, call queue_write_3step_lock()
> >>+	 * to acquire the lock until it is done.
> >>+	 */
> >>+	while (!queue_write_3step_lock(lock))
> >>+		cpu_relax();
> >If we get here, queue_write_3step_lock() just executed a successful
> >cmpxchg(), which implies a full memory barrier.  This prevents the
> >critical section from leaking out, good!
> >
> >>+	signal_next(lock,&node);
> >>+}
> >>+EXPORT_SYMBOL(queue_write_lock_slowpath);
> >>-- 
> >>1.7.1
> >>
> >--
> >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >Please read the FAQ at  http://www.tux.org/lkml/
> 
> Thank for the review.
> 
> -Longman
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 4/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing
  2013-11-08 21:21   ` Paul E. McKenney
  2013-11-08 22:42     ` Waiman Long
@ 2013-11-09  1:17     ` Tim Chen
  2013-11-09  3:07       ` Paul E. McKenney
  1 sibling, 1 reply; 13+ messages in thread
From: Tim Chen @ 2013-11-09  1:17 UTC (permalink / raw)
  To: paulmck
  Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Arnd Bergmann, linux-arch, x86, linux-kernel, Peter Zijlstra,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Linus Torvalds, Raghavendra K T, George Spelvin,
	Aswin Chandramouleeswaran",
	Scott J Norton

On Fri, 2013-11-08 at 13:21 -0800, Paul E. McKenney wrote:
> On Mon, Nov 04, 2013 at 12:17:20PM -0500, Waiman Long wrote:
> > There is a pending patch in the rwsem patch series that adds a generic
> > MCS locking helper functions to do MCS-style locking. This patch
> > will enable the queue rwlock to use that generic MCS lock/unlock
> > primitives for internal queuing. This patch should only be merged
> > after the merging of that generic MCS locking patch.
> > 
> > Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> 
> This one does might address at least some of the earlier memory-barrier
> issues, at least assuming that the MCS lock is properly memory-barriered.

Paul, I would appreciate it if you can take a look at the latest version
of the MCS lock with load-acquire and store-release to see if it is now
properly memory-barriered.

Thanks.

Tim


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 1/4] qrwlock: A queue read/write lock implementation
  2013-11-08 23:51       ` Paul E. McKenney
@ 2013-11-09  3:05         ` Waiman Long
  0 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2013-11-09  3:05 UTC (permalink / raw)
  To: paulmck
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen, aswin,
	Scott J Norton

On 11/08/2013 06:51 PM, Paul E. McKenney wrote:
> On Fri, Nov 08, 2013 at 05:36:12PM -0500, Waiman Long wrote:
>> I have some incorrect assumptions about memory barrier. Anyway, this
>> issue will be gone once I use the MCS lock/unlock code.
> Here is a presentation that has some diagrams that might help:
>
> http://www.rdrop.com/users/paulmck/scalability/paper/Scaling.2013.10.25c.pdf
>
> So, for example, if X and Y are both initially zero:
>
> 	CPU 0			CPU 1
>
> 	ACCESS_ONCE(X) = 1;	r1 = ACCESS_ONCE(Y);
> 	smp_wmb();		smp_rmb();
> 	ACCESS_ONCE(Y) = 1;	r2 = ACCESS_ONCE(X);
>
> Then the two memory barriers enforce a conditional ordering.  The
> condition is whether or not CPU 0's store to Y is seen by CPU 1's
> load from Y.  If it is, then the pair of memory barriers ensure that
> CPU 1's load from X sees the result of CPU 0's store to X.  In other
> words, BUG_ON(r1 == 1&&  r2 == 0) will never fire.
>
> In general, if a memory access after memory barrier A happens before
> a memory access before memory barrier B, then the two memory barriers
> will ensure that applicable accesses before memory barrier A happen
> before applicable accesses after memory barrier B.
>
> Does that help?
>
> 							Thanx, Paul
>
>

Thanks for the pointer. I understand the purpose of the memory barrier. I 
just thought that a memory barrier could also kind of flush the cached data 
to memory faster. Apparently that is not the case. Anyway, I now have a 
better understanding of what kind of barriers are needed in locking 
primitives from observing the conversation in this and related threads.

-Longman

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 4/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing
  2013-11-09  1:17     ` Tim Chen
@ 2013-11-09  3:07       ` Paul E. McKenney
  0 siblings, 0 replies; 13+ messages in thread
From: Paul E. McKenney @ 2013-11-09  3:07 UTC (permalink / raw)
  To: Tim Chen
  Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Arnd Bergmann, linux-arch, x86, linux-kernel, Peter Zijlstra,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Linus Torvalds, Raghavendra K T, George Spelvin,
	Aswin Chandramouleeswaran",
	Scott J Norton

On Fri, Nov 08, 2013 at 05:17:07PM -0800, Tim Chen wrote:
> On Fri, 2013-11-08 at 13:21 -0800, Paul E. McKenney wrote:
> > On Mon, Nov 04, 2013 at 12:17:20PM -0500, Waiman Long wrote:
> > > There is a pending patch in the rwsem patch series that adds a generic
> > > MCS locking helper functions to do MCS-style locking. This patch
> > > will enable the queue rwlock to use that generic MCS lock/unlock
> > > primitives for internal queuing. This patch should only be merged
> > > after the merging of that generic MCS locking patch.
> > > 
> > > Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> > 
> > This one does might address at least some of the earlier memory-barrier
> > issues, at least assuming that the MCS lock is properly memory-barriered.
> 
> Paul, will appreciate if you can take a look the latest version
> of MCS lock with load-acquire and store-release to see if it is now
> properly memory-barriered.

Don't worry, you cannot escape.  It is on my list.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2013-11-09  3:07 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-04 17:17 [PATCH v5 0/4] qrwlock: Introducing a queue read/write lock implementation Waiman Long
2013-11-04 17:17 ` [PATCH v5 1/4] qrwlock: A " Waiman Long
2013-11-08 21:11   ` Paul E. McKenney
2013-11-08 22:36     ` Waiman Long
2013-11-08 23:51       ` Paul E. McKenney
2013-11-09  3:05         ` Waiman Long
2013-11-04 17:17 ` [PATCH v5 2/4] qrwlock x86: Enable x86 to use queue read/write lock Waiman Long
2013-11-04 17:17 ` [PATCH v5 3/4] qrwlock: Enable fair " Waiman Long
2013-11-04 17:17 ` [PATCH v5 4/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing Waiman Long
2013-11-08 21:21   ` Paul E. McKenney
2013-11-08 22:42     ` Waiman Long
2013-11-09  1:17     ` Tim Chen
2013-11-09  3:07       ` Paul E. McKenney
