* [PATCH RFC 0/2] qrwlock: Introducing a queue read/write lock implementation
@ 2013-07-13  1:34 ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2013-07-13  1:34 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: Waiman Long, linux-arch, x86, linux-kernel, Peter Zijlstra,
	Steven Rostedt, Andrew Morton, Richard Weinberger,
	Catalin Marinas, Greg Kroah-Hartman, Matt Fleming, Herbert Xu,
	Akinobu Mita, Rusty Russell, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, Linus Torvalds,
	Chandramouleeswaran, Aswin, Norton, Scott J

This patch set introduces a queue-based read/write lock implementation
that is both faster and fairer than the current read/write lock. It can
also be used as a replacement for highly contended ticket spinlocks,
provided the increase in lock size is not an issue.

There is no change in the interface. Simply replacing the current
read/write lock with the queue read/write lock yields a faster and
more deterministic system.
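
For illustration, below is a minimal, hypothetical caller of the existing
rwlock API; none of it would need to change when the queue read/write lock
is selected. The lock, data, and function names are made up for the example.

#include <linux/spinlock.h>

static DEFINE_RWLOCK(example_lock);	/* hypothetical lock */
static int example_data;		/* hypothetical shared data */

static int example_read(void)
{
	int val;

	read_lock(&example_lock);
	val = example_data;		/* multiple readers may run here */
	read_unlock(&example_lock);
	return val;
}

static void example_write(int val)
{
	write_lock(&example_lock);	/* exclusive access */
	example_data = val;
	write_unlock(&example_lock);
}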

Signed-off-by: Waiman Long <Waiman.Long@hp.com>

Waiman Long (2):
  qrwlock: A queue read/write lock implementation
  x86 qrwlock: Enable x86 to use queue read/write lock

 arch/x86/Kconfig                      |    3 +
 arch/x86/include/asm/spinlock.h       |    2 +
 arch/x86/include/asm/spinlock_types.h |    4 +
 include/asm-generic/qrwlock.h         |  124 +++++++++++++++++
 lib/Kconfig                           |   11 ++
 lib/Makefile                          |    1 +
 lib/qrwlock.c                         |  246 +++++++++++++++++++++++++++++++++
 7 files changed, 391 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qrwlock.h
 create mode 100644 lib/qrwlock.c


^ permalink raw reply	[flat|nested] 49+ messages in thread


* [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-13  1:34 ` Waiman Long
@ 2013-07-13  1:34   ` Waiman Long
  -1 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2013-07-13  1:34 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: Waiman Long, linux-arch, x86, linux-kernel, Peter Zijlstra,
	Steven Rostedt, Andrew Morton, Richard Weinberger,
	Catalin Marinas, Greg Kroah-Hartman, Matt Fleming, Herbert Xu,
	Akinobu Mita, Rusty Russell, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, Linus Torvalds,
	Chandramouleeswaran, Aswin, Norton, Scott J

This patch introduces a read/write lock implementation that puts
waiting readers and writers into a queue instead of having them
actively contend for the lock as the regular read/write lock does. This
improves performance in highly contended situations by reducing cache
line bouncing.

In addition, the queue read/write lock is more deterministic: although
there is still a small chance of lock stealing when a reader or writer
arrives at just the right moment, the lock is otherwise granted in FIFO
order. The only downside is that the lock structure grows by 4 bytes on
32-bit systems and by 12 bytes on 64-bit systems.

This patch allows the architecture-specific read/write lock
implementation to be replaced by this generic queue read/write lock.
Two new config parameters are introduced:

1. QUEUE_RWLOCK
   A selectable option that builds the generic lock and uses it in
   place of the architecture-specific read/write lock.
2. ARCH_QUEUE_RWLOCK
   Must be defined in arch/$(arch)/Kconfig to make QUEUE_RWLOCK
   available (see the example below).
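
For example, patch 2/2 of this series enables the option for x86 with a
one-line Kconfig entry:

config ARCH_QUEUE_RWLOCK
	def_bool y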

In terms of single-thread performance (no contention), a 256K
lock/unlock loop was run on a 2.4GHz Westmere x86-64 CPU. The following
table shows the average time for a single lock/unlock sequence:

Lock Type		Time (ns)
---------		---------
Ticket spinlock		  15.7
Read lock		  17.0
Write lock		  17.2
Queue read lock		  31.1
Queue write lock	  13.6

While the queue read lock takes almost twice as long as a read lock
or spinlock, the queue write lock is the fastest of them all. The
execution time could probably be reduced a bit by inlining the lock
fast paths, as is done for the other locks.
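
For reference, here is a minimal sketch of this kind of single-thread
timing loop. It is only illustrative; the harness, function name, and use
of ktime are assumptions, not the actual benchmark code.

#include <linux/ktime.h>
#include <linux/spinlock.h>

static DEFINE_RWLOCK(bench_lock);	/* hypothetical lock under test */

static u64 bench_write_lock_ns(void)
{
	const unsigned long loops = 256 * 1024;
	unsigned long i;
	ktime_t start, end;

	start = ktime_get();
	for (i = 0; i < loops; i++) {
		write_lock(&bench_lock);	/* uncontended fast path */
		write_unlock(&bench_lock);
	}
	end = ktime_get();

	/* average time for a single lock/unlock sequence */
	return ktime_to_ns(ktime_sub(end, start)) / loops;
}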

To see how the queue write lock can be used as a replacement for a
ticket spinlock (just as an rwsem can be used as a replacement for a
mutex), the mb_cache_spinlock in fs/mbcache.c, which is a bottleneck in
the disk workload (ext4 FS) of the AIM7 benchmark, was converted to both
a queue write lock and a regular write lock. When running on an 8-socket,
80-core DL980 system, the performance improvements are shown in the
table below.

+-----------------+------------+------------+-----------+---------+
|  Configuration  |  Mean JPM  |  Mean JPM  | Mean JPM  | qrwlock |
|                 |Vanilla 3.10|3.10-qrwlock|3.10-rwlock| %Change |
+-----------------+-----------------------------------------------+
|                 |              User Range 10 - 100              |
+-----------------+-----------------------------------------------+
| 8 nodes, HT off |   441374   |   532774   |  637205   | +20.7%  |
| 8 nodes, HT on  |   449373   |   584387   |  641965   | +30.1%  |
+-----------------+-----------------------------------------------+
|                 |              User Range 200 - 1000            |
+-----------------+-----------------------------------------------+
| 8 nodes, HT off |   226870   |   354490   |  371593   | +56.3%  |
| 8 nodes, HT on  |   205997   |   314891   |  306378   | +52.9%  |
+-----------------+-----------------------------------------------+
|                 |              User Range 1100 - 2000           |
+-----------------+-----------------------------------------------+
| 8 nodes, HT off |   208234   |   321420   |  343694   | +54.4%  |
| 8 nodes, HT on  |   199612   |   297853   |  252569   | +49.2%  |
+-----------------+------------+------------+-----------+---------+

Apparently, the regular read/write lock performs even better than the
queue read/write lock in some cases. This is probably because
mb_cache_spinlock sits in a separate cache line from the data being
manipulated.
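
The conversion itself is mechanical. Below is a simplified, hypothetical
illustration of turning a spinlock-protected update into a write-lock
protected one; it is not the actual fs/mbcache.c change, and the names are
made up. With CONFIG_QUEUE_RWLOCK=y, the rwlock is implemented as a queue
read/write lock.

#include <linux/spinlock.h>

/* was: static DEFINE_SPINLOCK(example_cache_lock); */
static DEFINE_RWLOCK(example_cache_lock);

static void example_cache_update(void)
{
	/* was: spin_lock(&example_cache_lock); */
	write_lock(&example_cache_lock);

	/* ... modify the cache data structures ... */

	/* was: spin_unlock(&example_cache_lock); */
	write_unlock(&example_cache_lock);
}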

Looking at the fserver and new_fserver workloads (ext4 FS), where the
journal->j_state_lock (a read/write lock) is a bottleneck especially
when HT is on, we see a slightly different story. The j_state_lock is
an embedded read/write lock that shares a cache line with some of the
data being manipulated. Replacing it with a queue read/write lock gives
the following improvements.

+--------------------+-------------+--------------+---------------+
|      Workload      |mean % change|mean % change |mean % change  |
|                    |10-100 users |200-1000 users|1100-2000 users|
+--------------------+-------------+--------------+---------------+
|fserver (HT off)    |    +0.3%    |    -0.1%     |    +1.9%      |
|fserver (HT on)     |    -0.1%    |   +32.2%     |   +34.7%      |
|new_fserver (HT on) |    +0.8%    |    +0.9%     |    +0.9%      |
|new_fserver (HT off)|    -1.2%    |   +29.8%     |   +40.5%      |
+--------------------+-------------+--------------+---------------+

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 include/asm-generic/qrwlock.h |  124 +++++++++++++++++++++
 lib/Kconfig                   |   11 ++
 lib/Makefile                  |    1 +
 lib/qrwlock.c                 |  246 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 382 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qrwlock.h
 create mode 100644 lib/qrwlock.c

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
new file mode 100644
index 0000000..d758dd0
--- /dev/null
+++ b/include/asm-generic/qrwlock.h
@@ -0,0 +1,124 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QRWLOCK_H
+#define __ASM_GENERIC_QRWLOCK_H
+
+#include <linux/types.h>
+#include <asm/cmpxchg.h>
+#include <asm/barrier.h>
+
+#if (CONFIG_NR_CPUS < 65536)
+typedef u16 __nrcpu_t;
+typedef u32 __nrcpupair_t;
+#else
+typedef u32 __nrcpu_t;
+typedef u64 __nrcpupair_t;
+#endif
+
+/*
+ * The queue read/write lock data structure
+ * The reader stealing flag, if set, enables the reader at the head of the
+ * waiting queue to steal the read lock even if a writer is waiting. However,
+ * that may cause a writer to be starved by a perpetual stream of readers.
+ */
+struct qrwnode {
+	struct qrwnode *next;
+	bool		wait;	/* Waiting flag */
+};
+
+typedef struct qrwlock {
+	union {
+		__nrcpupair_t rw;		/* Reader/writer number pair */
+		struct {
+			u8	  writer;	/* Set if a writer is waiting */
+			u8	  rsteal;	/* Reader stealing flag */
+			__nrcpu_t readers;	/* Number of active readers */
+		};
+	};
+	struct qrwnode *waitq;			/* Tail of waiting queue */
+} arch_rwlock_t;
+
+/**
+ * queue_read_can_lock- would read_trylock() succeed?
+ * @lock: Pointer to queue read/writer lock structure
+ */
+static inline int queue_read_can_lock(struct qrwlock *lock)
+{
+	return lock->writer == 0;
+}
+
+/**
+ * queue_write_can_lock- would write_trylock() succeed?
+ * @lock: Pointer to queue read/writer lock structure
+ */
+static inline int queue_write_can_lock(struct qrwlock *lock)
+{
+	return lock->rw == 0;
+}
+
+/**
+ * queue_read_unlock - release read lock of a queue read/write lock
+ * @lock : Pointer to queue read/writer lock structure
+ */
+static inline void queue_read_unlock(struct qrwlock *lock)
+{
+	/*
+	 * Atomically decrement the reader count
+	 */
+	add_smp(&lock->readers, -1);
+}
+
+/**
+ * queue_write_unlock - release write lock of a queue read/write lock
+ * @lock : Pointer to queue read/writer lock structure
+ */
+static inline void queue_write_unlock(struct qrwlock *lock)
+{
+	ACCESS_ONCE(lock->writer) = 0;
+	smp_wmb();
+}
+
+/*
+ * External function declarations
+ */
+extern void queue_read_lock(struct qrwlock *);
+extern void queue_write_lock(struct qrwlock *);
+extern int  queue_read_trylock(struct qrwlock *);
+extern int  queue_write_trylock(struct qrwlock *);
+
+/*
+ * Initializer
+ */
+#define	__ARCH_RW_LOCK_UNLOCKED	{ { .rw = 0 }, .waitq = NULL }
+#define	__ARCH_RW_LOCK_UNLOCKED_RSTEAL	\
+	{ { .writer = 0, .rsteal = 1, .readers = 0 }, .waitq = NULL }
+
+/*
+ * Remapping read/write lock architecture specific functions to the
+ * corresponding queue read/write lock functions.
+ */
+#define arch_read_can_lock(l)	queue_read_can_lock(l)
+#define arch_write_can_lock(l)	queue_write_can_lock(l)
+#define arch_read_lock(l)	queue_read_lock(l)
+#define arch_write_lock(l)	queue_write_lock(l)
+#define arch_read_trylock(l)	queue_read_trylock(l)
+#define arch_write_trylock(l)	queue_write_trylock(l)
+#define arch_read_unlock(l)	queue_read_unlock(l)
+#define arch_write_unlock(l)	queue_write_unlock(l)
+
+#endif /* __ASM_GENERIC_QRWLOCK_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 35da513..de32799 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -412,6 +412,17 @@ config SIGNATURE
 	  Implementation is done using GnuPG MPI library
 
 #
+# Generic queue read/write lock
+#
+config QUEUE_RWLOCK
+	bool "Generic queue read/write lock"
+	depends on ARCH_QUEUE_RWLOCK
+	help
+	  Use a NUMA optimized queue read/write lock implementation. This
+	  improves performance under lock contention on systems with more
+	  than two sockets.
+
+#
 # libfdt files, only selected if needed.
 #
 config LIBFDT
diff --git a/lib/Makefile b/lib/Makefile
index 7baccfd..2888c17 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -187,3 +187,4 @@ quiet_cmd_build_OID_registry = GEN     $@
 clean-files	+= oid_registry_data.c
 
 obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
+obj-$(CONFIG_QUEUE_RWLOCK) += qrwlock.o
diff --git a/lib/qrwlock.c b/lib/qrwlock.c
new file mode 100644
index 0000000..a206fae
--- /dev/null
+++ b/lib/qrwlock.c
@@ -0,0 +1,246 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <asm-generic/qrwlock.h>
+
+/*
+ * Compared with a regular read/write lock, the queue read/write lock
+ * has the following advantages:
+ * 1. It is more deterministic. Even though there is a slight chance of
+ *    stealing the lock if one comes at the right moment, the granting of
+ *    the lock is mostly in FIFO order.
+ * 2. It is faster in high contention situations.
+ *
+ * The only downside is that the lock is 4 bytes larger on 32-bit systems
+ * and 12 bytes larger on 64-bit systems.
+ *
+ * There are two queues for writers. The writer field of the lock is a
+ * one-slot waiting queue. The writers that follow will have to wait
+ * in the combined reader/writer queue (waitq).
+ *
+ * Compared with the x86 ticket spinlock, the queue read/write lock is
+ * faster in high contention situations. The write lock is also faster in
+ * single-thread operations. Therefore, the queue read/write lock can be
+ * considered a replacement for those spinlocks that are highly contended,
+ * as long as an increase in lock size is not an issue.
+ */
+
+/**
+ * wait_in_queue - Add to queue and wait until it is at the head
+ * @lock: Pointer to queue read/writer lock structure
+ * @node: Node pointer to be added to the queue
+ */
+static __always_inline void
+wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
+{
+	struct qrwnode *prev;
+
+	node->next = NULL;
+	node->wait = true;
+	barrier();
+	prev = xchg(&lock->waitq, node);
+	if (prev) {
+		prev->next = node;
+		smp_wmb();
+		/*
+		 * Wait until the waiting flag is off
+		 */
+		while (ACCESS_ONCE(node->wait))
+			cpu_relax();
+	} else
+		node->wait = false;
+}
+
+/**
+ * signal_next - Signal the next one in queue to be at the head
+ * @lock: Pointer to queue read/writer lock structure
+ * @node: Node pointer to the current head of queue
+ */
+static __always_inline void
+signal_next(struct qrwlock *lock, struct qrwnode *node)
+{
+	struct qrwnode *next;
+
+	/*
+	 * Notify the next one in queue or clear the waiting queue
+	 */
+	if (ACCESS_ONCE(lock->waitq) == node) {
+		if (cmpxchg(&lock->waitq, node, NULL) == node)
+			return;
+	}
+	/*
+	 * Wait until the next one in queue sets up the next field
+	 */
+	while (!likely(next = ACCESS_ONCE(node->next)))
+		cpu_relax();
+	/*
+	 * The next one in queue is now at the head
+	 */
+	ACCESS_ONCE(next->wait) = false;
+	smp_wmb();
+}
+
+/**
+ * queue_read_lock - acquire read lock of a queue read/write lock
+ * @lock: Pointer to queue read/writer lock structure
+ */
+void queue_read_lock(struct qrwlock *lock)
+{
+	struct qrwlock old, new;
+	struct qrwnode node;
+
+	old.rw = ACCESS_ONCE(lock->rw);
+	while (likely(!old.writer)) {
+		/*
+		 * Atomically increment the reader count while writer is 0
+		 */
+		new.rw = old.rw;
+		new.readers++;
+
+		if (cmpxchg(&lock->rw, old.rw, new.rw) == old.rw)
+			return;
+		cpu_relax();
+		old.rw = ACCESS_ONCE(lock->rw);
+	}
+	/*
+	 * Slowpath
+	 * Put the reader into the waiting queue
+	 */
+	wait_in_queue(lock, &node);
+
+	/*
+	 * Now at the head of the wait queue, wait until no writer is pending,
+	 * or until the reader stealing flag is set and readers are present.
+	 * Then try to atomically increment the reader count.
+	 */
+	while (true) {
+		old.rw = ACCESS_ONCE(lock->rw);
+		if (old.writer && (!old.rsteal || !old.readers)) {
+			cpu_relax();
+			continue;
+		}
+		new.rw = old.rw;
+		new.readers++;
+		if (cmpxchg(&lock->rw, old.rw, new.rw) == old.rw)
+			break;
+		cpu_relax();
+	}
+	signal_next(lock, &node);
+}
+EXPORT_SYMBOL(queue_read_lock);
+
+
+/**
+ * queue_read_trylock - try to acquire read lock of a queue read/write lock
+ * @lock : Pointer to queue read/writer lock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+int queue_read_trylock(struct qrwlock *lock)
+{
+	struct qrwlock old, new;
+
+	old.rw = ACCESS_ONCE(lock->rw);
+	if (unlikely(old.writer))
+		return 0;
+	new.rw = old.rw;
+	new.readers++;
+
+	if (cmpxchg(&lock->rw, old.rw, new.rw) == old.rw)
+		return 1;
+	cpu_relax();
+	return 0;
+}
+EXPORT_SYMBOL(queue_read_trylock);
+
+/**
+ * queue_write_lock - acquire write lock of a queue read/write lock
+ * @lock : Pointer to queue read/writer lock structure
+ */
+void queue_write_lock(struct qrwlock *lock)
+{
+	struct qrwnode node, *next;
+
+	if (likely(!ACCESS_ONCE(lock->writer))) {
+		/*
+		 * Atomically set the writer to 1, then wait until reader
+		 * count goes to 0.
+		 */
+		if (xchg(&lock->writer, 1) == 0) {
+			while (ACCESS_ONCE(lock->readers))
+				cpu_relax();
+			return;
+		}
+		cpu_relax();
+	}
+	/*
+	 * Slowpath
+	 * Put the writer into the waiting queue
+	 */
+	wait_in_queue(lock, &node);
+
+	/*
+	 * Now at the head of the wait queue, wait until no writer is pending
+	 * and then atomically set the writer flag again.
+	 */
+	while (true) {
+		if (ACCESS_ONCE(lock->writer)) {
+			cpu_relax();
+			continue;
+		}
+		if (xchg(&lock->writer, 1) != 0) {
+			cpu_relax();
+			continue;
+		}
+		break;
+	}
+	/*
+	 * Wait until the reader count goes to zero
+	 */
+	while (ACCESS_ONCE(lock->readers))
+		cpu_relax();
+
+	signal_next(lock, &node);
+}
+EXPORT_SYMBOL(queue_write_lock);
+
+/**
+ * queue_write_trylock - try to acquire write lock of a queue read/write lock
+ * @lock : Pointer to queue read/writer lock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+int queue_write_trylock(struct qrwlock *lock)
+{
+	struct qrwlock old, new;
+
+	old.rw = ACCESS_ONCE(lock->rw);
+	if (!old.rw) {
+		/*
+		 * Atomically set the writer to 1 if readers = 0
+		 */
+		new.rw = old.rw;
+		new.writer = 1;
+		if (cmpxchg(&lock->rw, old.rw, new.rw) == old.rw)
+			return 1;
+		cpu_relax();
+	}
+	return 0;
+}
+EXPORT_SYMBOL(queue_write_trylock);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread


* [PATCH RFC 2/2] x86 qrwlock: Enable x86 to use queue read/write lock
  2013-07-13  1:34 ` Waiman Long
@ 2013-07-13  1:34   ` Waiman Long
  -1 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2013-07-13  1:34 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: Waiman Long, linux-arch, x86, linux-kernel, Peter Zijlstra,
	Steven Rostedt, Andrew Morton, Richard Weinberger,
	Catalin Marinas, Greg Kroah-Hartman, Matt Fleming, Herbert Xu,
	Akinobu Mita, Rusty Russell, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, Linus Torvalds,
	Chandramouleeswaran, Aswin, Norton, Scott J

This patch makes the necessary changes at the x86 architecture specific
layer to enable the presence of the CONFIG_QUEUE_RWLOCK kernel option
to replace the plain read/write lock by the queue read/write lock.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/Kconfig                      |    3 +++
 arch/x86/include/asm/spinlock.h       |    2 ++
 arch/x86/include/asm/spinlock_types.h |    4 ++++
 3 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b32ebf9..638dbaa 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2344,6 +2344,9 @@ config X86_DMA_REMAP
 	bool
 	depends on STA2X11
 
+config ARCH_QUEUE_RWLOCK
+	def_bool y
+
 source "net/Kconfig"
 
 source "drivers/Kconfig"
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..613a4ff 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -137,6 +137,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 		cpu_relax();
 }
 
+#ifndef CONFIG_QUEUE_RWLOCK
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
@@ -219,6 +220,7 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 	asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0"
 		     : "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory");
 }
+#endif /* CONFIG_QUEUE_RWLOCK */
 
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
 #define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index ad0ad07..afacd36 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -28,6 +28,10 @@ typedef struct arch_spinlock {
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
 
+#ifdef CONFIG_QUEUE_RWLOCK
+#include <asm-generic/qrwlock.h>
+#else
 #include <asm/rwlock.h>
+#endif
 
 #endif /* _ASM_X86_SPINLOCK_TYPES_H */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread


* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-13  1:34   ` Waiman Long
@ 2013-07-15 14:39     ` Steven Rostedt
  -1 siblings, 0 replies; 49+ messages in thread
From: Steven Rostedt @ 2013-07-15 14:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Andrew Morton,
	Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman,
	Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Chandramouleeswaran, Aswin, Norton, Scott J

On Fri, 2013-07-12 at 21:34 -0400, Waiman Long wrote:

> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>  include/asm-generic/qrwlock.h |  124 +++++++++++++++++++++
>  lib/Kconfig                   |   11 ++
>  lib/Makefile                  |    1 +
>  lib/qrwlock.c                 |  246 +++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 382 insertions(+), 0 deletions(-)
>  create mode 100644 include/asm-generic/qrwlock.h
>  create mode 100644 lib/qrwlock.c
> 
> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> new file mode 100644
> index 0000000..d758dd0
> --- /dev/null
> +++ b/include/asm-generic/qrwlock.h
> @@ -0,0 +1,124 @@
> +/*
> + * Queue read/write lock
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
> + *
> + * Authors: Waiman Long <waiman.long@hp.com>
> + */
> +#ifndef __ASM_GENERIC_QRWLOCK_H
> +#define __ASM_GENERIC_QRWLOCK_H
> +
> +#include <linux/types.h>
> +#include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> +
> +#if (CONFIG_NR_CPUS < 65536)
> +typedef u16 __nrcpu_t;
> +typedef u32 __nrcpupair_t;
> +#else
> +typedef u32 __nrcpu_t;
> +typedef u64 __nrcpupair_t;
> +#endif
> +
> +/*
> + * The queue read/write lock data structure
> + * The reader stealing flag, if sea,t will enable reader at the head of the

"sea,t"?

> + * waiting queue to steal the read lock even if a writer is waiting. However,
> + * that may cause starving of a writer by a perpetual stream of readers.
> + */
> +struct qrwnode {
> +	struct qrwnode *next;
> +	bool		wait;	/* Waiting flag */
> +};
> +
> +typedef struct qrwlock {
> +	union {
> +		__nrcpupair_t rw;		/* Reader/writer number pair */
> +		struct {
> +			u8	  writer;	/* Set if a writer is waiting */
> +			u8	  rsteal;	/* Reader stealing flag */
> +			__nrcpu_t readers;	/* Number of active readers */
> +		};
> +	};
> +	struct qrwnode *waitq;			/* Tail of waiting queue */
> +} arch_rwlock_t;
> +
> +/**
> + * queue_read_can_lock- would read_trylock() succeed?
> + * @lock: Pointer to queue read/writer lock structure
> + */
> +static inline int queue_read_can_lock(struct qrwlock *lock)
> +{
> +	return lock->writer == 0;
> +}
> +
> +/**
> + * queue_write_can_lock- would write_trylock() succeed?
> + * @lock: Pointer to queue read/writer lock structure
> + */
> +static inline int queue_write_can_lock(struct qrwlock *lock)
> +{
> +	return lock->rw == 0;
> +}
> +
> +/**
> + * queue_read_unlock - release read lock of a queue read/write lock
> + * @lock : Pointer to queue read/writer lock structure
> + */
> +static inline void queue_read_unlock(struct qrwlock *lock)
> +{
> +	/*
> +	 * Atomically decrement the reader count
> +	 */
> +	add_smp(&lock->readers, -1);
> +}
> +
> +/**
> + * queue_write_unlock - release write lock of a queue read/write lock
> + * @lock : Pointer to queue read/writer lock structure
> + */
> +static inline void queue_write_unlock(struct qrwlock *lock)
> +{
> +	ACCESS_ONCE(lock->writer) = 0;
> +	smp_wmb();
> +}
> +
> +/*
> + * External function declarations
> + */
> +extern void queue_read_lock(struct qrwlock *);
> +extern void queue_write_lock(struct qrwlock *);
> +extern int  queue_read_trylock(struct qrwlock *);
> +extern int  queue_write_trylock(struct qrwlock *);
> +
> +/*
> + * Initializier
> + */
> +#define	__ARCH_RW_LOCK_UNLOCKED	{ { .rw = 0 }, .waitq = NULL }
> +#define	__ARCH_RW_LOCK_UNLOCKED_RSTEAL	\
> +	{ { .writer = 0, .rsteal = 1, .readers = 0 }, waitq = NULL }
> +
> +/*
> + * Remapping read/write lock architecture specific functions to the
> + * corresponding queue read/write lock functions.
> + */
> +#define arch_read_can_lock(l)	queue_read_can_lock(l)
> +#define arch_write_can_lock(l)	queue_write_can_lock(l)
> +#define arch_read_lock(l)	queue_read_lock(l)
> +#define arch_write_lock(l)	queue_write_lock(l)
> +#define arch_read_trylock(l)	queue_read_trylock(l)
> +#define arch_write_trylock(l)	queue_write_trylock(l)
> +#define arch_read_unlock(l)	queue_read_unlock(l)
> +#define arch_write_unlock(l)	queue_write_unlock(l)
> +
> +#endif /* __ASM_GENERIC_QRWLOCK_H */
> diff --git a/lib/Kconfig b/lib/Kconfig
> index 35da513..de32799 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -412,6 +412,17 @@ config SIGNATURE
>  	  Implementation is done using GnuPG MPI library
>  
>  #
> +# Generic queue read/write lock
> +#
> +config QUEUE_RWLOCK
> +	bool "Generic queue read/write lock"
> +	depends on ARCH_QUEUE_RWLOCK
> +	help
> +	  Use a NUMA optimized queue read/write lock implementation. This
> +	  improves performance under lock contention on systems with more
> +	  than two sockets.
> +
> +#
>  # libfdt files, only selected if needed.
>  #
>  config LIBFDT
> diff --git a/lib/Makefile b/lib/Makefile
> index 7baccfd..2888c17 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -187,3 +187,4 @@ quiet_cmd_build_OID_registry = GEN     $@
>  clean-files	+= oid_registry_data.c
>  
>  obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
> +obj-$(CONFIG_QUEUE_RWLOCK) += qrwlock.o
> diff --git a/lib/qrwlock.c b/lib/qrwlock.c
> new file mode 100644
> index 0000000..a206fae
> --- /dev/null
> +++ b/lib/qrwlock.c
> @@ -0,0 +1,246 @@
> +/*
> + * Queue read/write lock
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
> + *
> + * Authors: Waiman Long <waiman.long@hp.com>
> + */
> +#include <linux/smp.h>
> +#include <linux/bug.h>
> +#include <linux/cpumask.h>
> +#include <linux/percpu.h>
> +#include <asm-generic/qrwlock.h>
> +
> +/*
> + * Compared with regular read/write lock, the queue read/write lock has
> + * has the following advantages:
> + * 1. It is more deterministic. Even though there is a slight chance of
> + *    stealing the lock if come at the right moment, the granting of the
> + *    lock is mostly in FIFO order.
> + * 2. It is faster in high contention situation.
> + *
> + * The only downside is that the lock is 4 bytes larger in 32-bit systems
> + * and 12 bytes larger in 64-bit systems.
> + *
> + * There are two queues for writers. The writer field of the lock is a
> + * one-slot waiting queue. The writers that follows will have to wait
> + * in the combined reader/writer queue (waitq).
> + *
> + * Compared with x86 ticket spinlock, the queue read/write lock is faster
> + * in high contention situation. The writer lock is also faster in single
> + * thread operations. Therefore, queue read/write lock can be considered as
> + * a replacement for those spinlocks that are highly contended as long as
> + * an increase in lock size is not an issue.
> + */
> +
> +/**
> + * wait_in_queue - Add to queue and wait until it is at the head
> + * @lock: Pointer to queue read/writer lock structure
> + * @node: Node pointer to be added to the queue
> + */
> +static __always_inline void
> +wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
> +{
> +	struct qrwnode *prev;
> +
> +	node->next = NULL;
> +	node->wait = true;
> +	barrier();
> +	prev = xchg(&lock->waitq, node);

"barrier()" isn't needed, as xchg() is a full blown smp_mb(), it also
acts as a compiler barrier.

> +	if (prev) {
> +		prev->next = node;
> +		smp_wmb();
> +		/*
> +		 * Wait until the waiting flag is off
> +		 */
> +		while (ACCESS_ONCE(node->wait))
> +			cpu_relax();
> +	} else
> +		node->wait = false;
> +}
> +
> +/**
> + * signal_next - Signal the next one in queue to be at the head
> + * @lock: Pointer to queue read/writer lock structure
> + * @node: Node pointer to the current head of queue
> + */
> +static __always_inline void
> +signal_next(struct qrwlock *lock, struct qrwnode *node)
> +{
> +	struct qrwnode *next;
> +
> +	/*
> +	 * Notify the next one in queue or clear the waiting queue
> +	 */
> +	if (ACCESS_ONCE(lock->waitq) == node) {
> +		if (cmpxchg(&lock->waitq, node, NULL) == node)
> +			return;
> +	}
> +	/*
> +	 * Wait until the next one in queue set up the next field
> +	 */
> +	while (!likely(next = ACCESS_ONCE(node->next)))
> +		cpu_relax();
> +	/*
> +	 * The next one in queue is now at the head
> +	 */
> +	ACCESS_ONCE(next->wait) = false;
> +	smp_wmb();
> +}
> +
> +/**
> + * queue_read_lock - acquire read lock of a queue read/write lock
> + * @lock: Pointer to queue read/writer lock structure
> + */
> +void queue_read_lock(struct qrwlock *lock)
> +{
> +	struct qrwlock old, new;
> +	struct qrwnode node;
> +
> +	old.rw = ACCESS_ONCE(lock->rw);
> +	while (likely(!old.writer)) {
> +		/*
> +		 * Atomically increment the reader count while writer is 0
> +		 */
> +		new.rw = old.rw;
> +		new.readers++;
> +
> +		if (cmpxchg(&lock->rw, old.rw, new.rw) == old.rw)
> +			return;
> +		cpu_relax();
> +		old.rw = ACCESS_ONCE(lock->rw);
> +	}
> +	/*
> +	 * Slowpath
> +	 * Put the reader into the waiting queue
> +	 */
> +	wait_in_queue(lock, &node);
> +
> +	/*
> +	 * At the head of the wait queue now, wait until no writer is pending
> +	 * or when the reader stealing flag is set and readers are present.
> +	 * Then try to atomically inc the reader number.
> +	 */
> +	while (true) {
> +		old.rw = ACCESS_ONCE(lock->rw);
> +		if (old.writer && (!old.rsteal || !old.readers)) {
> +			cpu_relax();
> +			continue;
> +		}
> +		new.rw = old.rw;
> +		new.readers++;
> +		if (cmpxchg(&lock->rw, old.rw, new.rw) == old.rw)
> +			break;
> +		cpu_relax();
> +	}
> +	signal_next(lock, &node);
> +}
> +EXPORT_SYMBOL(queue_read_lock);
> +
> +
> +/*
> + * queue_read_trylock - try to acquire read lock of a queue read/write lock
> + * @lock : Pointer to queue read/writer lock structure
> + * Return: 1 if lock acquired, 0 if failed
> + */
> +int queue_read_trylock(struct qrwlock *lock)
> +{
> +	struct qrwlock old, new;
> +
> +	old.rw = ACCESS_ONCE(lock->rw);
> +	if (unlikely(old.writer))
> +		return 0;
> +	new.rw = old.rw;
> +	new.readers++;
> +
> +	if (cmpxchg(&lock->rw, old.rw, new.rw) == old.rw)
> +		return 1;
> +	cpu_relax();

What's the cpu_relax() for? It's not in a loop.


> +	return 0;
> +}
> +EXPORT_SYMBOL(queue_read_trylock);
> +
> +/**
> + * queue_write_lock - acquire write lock of a queue read/write lock
> + * @lock : Pointer to queue read/writer lock structure
> + */
> +void queue_write_lock(struct qrwlock *lock)
> +{
> +	struct qrwnode node, *next;
> +
> +	if (likely(!ACCESS_ONCE(lock->writer))) {
> +		/*
> +		 * Atomically set the writer to 1, then wait until reader
> +		 * count goes to 0.
> +		 */
> +		if (xchg(&lock->writer, 1) == 0) {
> +			while (ACCESS_ONCE(lock->readers))
> +				cpu_relax();
> +			return;
> +		}
> +		cpu_relax();

Another cpu_relax() outside of a loop.

> +	}
> +	/*
> +	 * Slowpath
> +	 * Put the writer into the waiting queue
> +	 */
> +	wait_in_queue(lock, &node);
> +
> +	/*
> +	 * At the head of the wait queue now, wait until no writer is pending
> +	 * and then atomically set it again.
> +	 */
> +	while (true) {
> +		if (ACCESS_ONCE(lock->writer)) {
> +			cpu_relax();
> +			continue;
> +		}
> +		if (xchg(&lock->writer, 1) != 0) {
> +			cpu_relax();
> +			continue;
> +		}
> +		break;
> +	}
> +	/*
> +	 * Wait until the reader count go to zero
> +	 */
> +	while (ACCESS_ONCE(lock->readers))
> +		cpu_relax();
> +
> +	signal_next(lock, &node);
> +}
> +EXPORT_SYMBOL(queue_write_lock);
> +
> +/**
> + * queue_write_trylock - try to acquire write lock of a queue read/write lock
> + * @lock : Pointer to queue read/writer lock structure
> + * Return: 1 if lock acquired, 0 if failed
> + */
> +int queue_write_trylock(struct qrwlock *lock)
> +{
> +	struct qrwlock old, new;
> +
> +	old.rw = ACCESS_ONCE(lock->rw);
> +	if (!old.rw) {
> +		/*
> +		 * Atomically set the writer to 1 if readers = 0
> +		 */
> +		new.rw = old.rw;
> +		new.writer = 1;
> +		if (cmpxchg(&lock->rw, old.rw, new.rw) == old.rw)
> +			return 1;
> +		cpu_relax();

Again, a cpu_relax() with no loop.

> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL(queue_write_trylock);

I haven't seen anything bad about this with a quick review. But it
should have a more thorough review to check all corner cases.

-- Steve



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-15 14:39     ` Steven Rostedt
@ 2013-07-15 20:44       ` Waiman Long
  -1 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2013-07-15 20:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Andrew Morton,
	Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman,
	Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Chandramouleeswaran, Aswin, Norton, Scott J

On 07/15/2013 10:39 AM, Steven Rostedt wrote:
> On Fri, 2013-07-12 at 21:34 -0400, Waiman Long wrote:
>
>> Signed-off-by: Waiman Long<Waiman.Long@hp.com>
>> ---
>>
>> +/*
>> + * The queue read/write lock data structure
>> + * The reader stealing flag, if sea,t will enable reader at the head of the
> "sea,t"?

Should be "if set,". Thanks for spotting the typo. It will be fixed in
the next version.

>> +/**
>> + * wait_in_queue - Add to queue and wait until it is at the head
>> + * @lock: Pointer to queue read/writer lock structure
>> + * @node: Node pointer to be added to the queue
>> + */
>> +static __always_inline void
>> +wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
>> +{
>> +	struct qrwnode *prev;
>> +
>> +	node->next = NULL;
>> +	node->wait = true;
>> +	barrier();
>> +	prev = xchg(&lock->waitq, node);
> "barrier()" isn't needed, as xchg() is a full blown smp_mb(), it also
> acts as a compiler barrier.

Will remove barrier().

>> +/*
>> + * queue_read_trylock - try to acquire read lock of a queue read/write lock
>> + * @lock : Pointer to queue read/writer lock structure
>> + * Return: 1 if lock acquired, 0 if failed
>> + */
>> +int queue_read_trylock(struct qrwlock *lock)
>> +{
>> +	struct qrwlock old, new;
>> +
>> +	old.rw = ACCESS_ONCE(lock->rw);
>> +	if (unlikely(old.writer))
>> +		return 0;
>> +	new.rw = old.rw;
>> +	new.readers++;
>> +
>> +	if (cmpxchg(&lock->rw, old.rw, new.rw) == old.rw)
>> +		return 1;
>> +	cpu_relax();
> What's the cpu_relax() for? It's not in a loop.

I put a cpu_relax() after each cacheline contention event. You are right 
that we don't need a cpu_relax() in the trylock() function here.
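
For the next version it would then look roughly like this (untested
sketch, the same logic as the posted patch minus the stray cpu_relax()):

	int queue_read_trylock(struct qrwlock *lock)
	{
		struct qrwlock old, new;

		old.rw = ACCESS_ONCE(lock->rw);
		if (unlikely(old.writer))
			return 0;
		new.rw = old.rw;
		new.readers++;

		/* Single attempt, no spinning and no cpu_relax() */
		return cmpxchg(&lock->rw, old.rw, new.rw) == old.rw;
	}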

>
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(queue_read_trylock);
>> +
>> +/**
>> + * queue_write_lock - acquire write lock of a queue read/write lock
>> + * @lock : Pointer to queue read/writer lock structure
>> + */
>> +void queue_write_lock(struct qrwlock *lock)
>> +{
>> +	struct qrwnode node, *next;
>> +
>> +	if (likely(!ACCESS_ONCE(lock->writer))) {
>> +		/*
>> +		 * Atomically set the writer to 1, then wait until reader
>> +		 * count goes to 0.
>> +		 */
>> +		if (xchg(&lock->writer, 1) == 0) {
>> +			while (ACCESS_ONCE(lock->readers))
>> +				cpu_relax();
>> +			return;
>> +		}
>> +		cpu_relax();
> Another cpu_relax() outside of a loop.

I can remove that one too.

>> +
>> +/**
>> + * queue_write_trylock - try to acquire write lock of a queue read/write lock
>> + * @lock : Pointer to queue read/writer lock structure
>> + * Return: 1 if lock acquired, 0 if failed
>> + */
>> +int queue_write_trylock(struct qrwlock *lock)
>> +{
>> +	struct qrwlock old, new;
>> +
>> +	old.rw = ACCESS_ONCE(lock->rw);
>> +	if (!old.rw) {
>> +		/*
>> +		 * Atomically set the writer to 1 if readers = 0
>> +		 */
>> +		new.rw = old.rw;
>> +		new.writer = 1;
>> +		if (cmpxchg(&lock->rw, old.rw, new.rw) == old.rw)
>> +			return 1;
>> +		cpu_relax();
> Again the cpu_relax with no loop.

Ditto.
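
Something like this (again just an untested sketch):

	int queue_write_trylock(struct qrwlock *lock)
	{
		struct qrwlock old, new;

		old.rw = ACCESS_ONCE(lock->rw);
		if (old.rw)
			return 0;	/* active readers or a writer present */
		new.rw = old.rw;
		new.writer = 1;
		return cmpxchg(&lock->rw, old.rw, new.rw) == old.rw;
	}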

>> +	}
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(queue_write_trylock);
> I haven't seen anything bad about this with a quick review. But it
> should have a more thorough review to check all corner cases.
>
> -- Steve
>

Thanks for your time.

Regards,
Longman

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-13  1:34   ` Waiman Long
@ 2013-07-15 22:31     ` Thomas Gleixner
  -1 siblings, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2013-07-15 22:31 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86,
	linux-kernel, Peter Zijlstra, Steven Rostedt, Andrew Morton,
	Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman,
	Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Chandramouleeswaran, Aswin, Norton, Scott J

On Fri, 12 Jul 2013, Waiman Long wrote:

> This patch introduces a read/write lock implementation that put waiting
> readers and writers into a queue instead of actively contending the
> lock like the regular read/write lock. This will improve performance in
> highly contended situation by reducing the cache line bouncing effect.
> 
> In addition, the queue read/write lock is more deterministic even
> though there is still a smaller chance for lock stealing if the reader
> or writer comes at the right moment. Other than that, lock granting
> is done in a FIFO manner. The only downside is the size increase in
> the lock structure by 4 bytes for 32-bit systems and by 12 bytes for
> 64-bit systems.
> 
> This patch allows the replacement of architecture specific
> implementation of read/write lock by this generic version of queue
> read/write lock. Two new config parameters are introduced:
> 
> 1. QUEUE_RWLOCK
>    A select-able option that enables the building and replacement of
>    architecture specific read/write lock.
> 2. ARCH_QUEUE_RWLOCK
>    Have to be defined in arch/$(arch)/Kconfig to enable QUEUE_RWLOCK
> 
> In term of single-thread performance (no contention), a 256K
> lock/unlock loop was run on a 2.4GHz Westmere x86-64 CPU. The following
> table shows the average time for a single lock/unlock sequence:
> 
> Lock Type		Time (ns)
> ---------		---------
> Ticket spinlock		  15.7
> Read lock		  17.0
> Write lock		  17.2
> Queue read lock		  31.1
> Queue write lock	  13.6
> 
> While the queue read lock is almost double the time of a read lock
> or spinlock, the queue write lock is the fastest of them all. The
> execution time can probably be reduced a bit by allowing inlining of
> the lock fast paths like the other locks.

So you have the writer side faster than the ticket lock, but what
about the reader side? And what about the contended case? What about
the case of high-frequency readers with low-frequency writers?

> To see how the queue write lock can be used as a replacement for ticket
> spinlock (just like rwsem can be used as replacement of mutex), the
> mb_cache_spinlock in fs/mbcache.c, which is a bottleneck in the disk
> workload (ext4 FS) of the AIM7 benchmark, was converted to both a queue
> write lock and a regular write lock. When running on a 8-socket 80-core
> DL980 system, the performance improvement was shown in the table below.
> 
> +-----------------+------------+------------+-----------+---------+
> |  Configuration  |  Mean JPM  |  Mean JPM  | Mean JPM  | qrwlock |
> |                 |Vanilla 3.10|3.10-qrwlock|3.10-rwlock| %Change |
> +-----------------+-----------------------------------------------+
> |                 |              User Range 10 - 100              |
> +-----------------+-----------------------------------------------+
> | 8 nodes, HT off |   441374   |   532774   |  637205   | +20.7%  |
> | 8 nodes, HT on  |   449373   |   584387   |  641965   | +30.1%  |
> +-----------------+-----------------------------------------------+
> |                 |              User Range 200 - 1000            |
> +-----------------+-----------------------------------------------+
> | 8 nodes, HT off |   226870   |   354490   |  371593   | +56.3%  |
> | 8 nodes, HT on  |   205997   |   314891   |  306378   | +52.9%  |
> +-----------------+-----------------------------------------------+
> |                 |              User Range 1100 - 2000           |
> +-----------------+-----------------------------------------------+
> | 8 nodes, HT off |   208234   |   321420   |  343694   | +54.4%  |
> | 8 nodes, HT on  |   199612   |   297853   |  252569   | +49.2%  |
> +-----------------+------------+------------+-----------+---------+
> 
> Apparently, the regular read/write lock performs even better than
> the queue read/write lock in some cases.  This is probably due to the

The regular rwlock performs better in most cases. This is the full
list comparing both against the ticket lock.

    qrlock  	   rwlock
    +20.7  	   +44.4 
    +30.1  	   +42.9

    +56.3  	   +63.3
    +52.9  	   +48.8

    +54.4  	   +65.1
    +49.2  	   +26.5

So you try to sell that qrwlock as a replacement for ticket spinlocks,
while at the same time you omit the fact that we have an even better
implementation (except for the last test case) already in the
kernel. What's the point of this exercise?

> fact that mb_cache_spinlock is in a separate cache line from the data
> being manipulated.

This is not about probably. What has the fact that the spinlock is in
a different cache line to do with the replacement by a different
implementation? Exactly nothing.

All three candidates ticket spinlocks, qrwlocks and rwlocks are in a
different cache line than the data which is protected by it.

So what makes the performance difference between the three
implementations? Definitely not the cache line association as that is
the same for all implementations. And just for the record, we have
tools to measure that.

We really need proper explanations and not some guess-based drivel to
assess that.

Further down you write in the code:

> + * Compared with regular read/write lock, the queue read/write lock has
> + * has the following advantages:
> + * 1. It is more deterministic. Even though there is a slight chance of

Why is it more deterministic than the existing implementation?

> + *    stealing the lock if come at the right moment, the granting of the
> + *    lock is mostly in FIFO order.

> + * 2. It is faster in high contention situation.

Again, why is it faster?

> + * The only downside is that the lock is 4 bytes larger in 32-bit systems
> + * and 12 bytes larger in 64-bit systems.
> + *
> + * There are two queues for writers. The writer field of the lock is a
> + * one-slot waiting queue. The writers that follows will have to wait
> + * in the combined reader/writer queue (waitq).
> + *
> + * Compared with x86 ticket spinlock, the queue read/write lock is faster
> + * in high contention situation. The writer lock is also faster in single
> + * thread operations. Therefore, queue read/write lock can be considered as
> + * a replacement for those spinlocks that are highly contended as long as
> + * an increase in lock size is not an issue.

So is it faster in the general case or only for the high contention or
single thread operation cases?

And you still have not explained WHY it is faster. Can you please explain
properly WHY it is faster and WHY we can't apply the technique you
implemented for qrwlocks to writer-only locks (aka spinlocks) with a
smaller lock size?

> Looking at the fserver and new_fserver workloads (ext4 FS) where the
> journal->j_state_lock (a read/write lock) is a bottleneck especially
> when HT is on, we see a slight different story. The j_state_lock is
> an embedded read/write lock which is in the same cacheline as some
> of the data being manipulated. The replacement by a queue read/write
> lock gives the following improvement.

Again, the fact that the lock is embedded and in the same cache line
is not explaining anything here. Especially not why this is only
relevant when HT is enabled. Please provide profound arguments backed
by solid data collected with the proper tools.

> +--------------------+-------------+--------------+---------------+
> |      Workload      |mean % change|mean % change |mean % change  |
> |                    |10-100 users |200-1000 users|1100-2000 users|
> +--------------------+-------------+--------------+---------------+
> |fserver (HT off)    |    +0.3%    |    -0.1%     |    +1.9%      |
> |fserver (HT on)     |    -0.1%    |   +32.2%     |   +34.7%      |
> |new_fserver (HT on) |    +0.8%    |    +0.9%     |    +0.9%      |
> |new_fserver (HT off)|    -1.2%    |   +29.8%     |   +40.5%      |
> +--------------------+-------------+--------------+---------------+

So fserver gets a 34% improvement for HT ON, while new_fserver gets a
40% improvement for HT OFF. Any useful explanations except that the HT
on/off annotations are swapped? 

Without explanations this is completely useless, really.

And again we want to see the behaviour on machines which do not have a
gazillion of sockets before we enable this unconditionally, as you try
to do in the next patch. Just get it: 8 socket machines are nice toys,
but 99% of the users do not have one.

Aside of that, you are replacing all RW locks unconditionally by this
new fangled thing, but did you actually run tests which look at other
rwlock usage sites than the particular one you care about?

I really doubt that. Why? Because it depends on the workload. And
fserver triggers high-frequency writers on the j_state_lock. But
there are enough usage sites of rwlocks where the writer frequency is
extremely low. So how does the almost doubled reader time affect those
users of rwlocks?

And reading further down your patch, I can confirm my suspicion.

> +void queue_write_lock(struct qrwlock *lock)
> +{
....
> +	/*
> +	 * At the head of the wait queue now, wait until no writer is pending
> +	 * and then atomically set it again.
> +	 */

You are optimizing for the high frequency writer case. And that's not
the primary use case for rwlocks. That's the special use case for the
jbd2 journal_state_lock which CANNOT be generalized for all other
rwlock usage sites.

So how can you justify:

> +config QUEUE_RWLOCK
> +	bool "Generic queue read/write lock"
> +	depends on ARCH_QUEUE_RWLOCK
> +	help
> +	  Use a NUMA optimized queue read/write lock implementation. This
> +	  improves performance under lock contention on systems with more
> +	  than two sockets.
> +

And in Patch 2/2:

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2344,6 +2344,9 @@ config X86_DMA_REMAP
        bool
        depends on STA2X11
 
+config ARCH_QUEUE_RWLOCK
+       def_bool y
+

It's not justified at all, unless you provide enough proof that it
does not hurt workloads which have totally different rwlock usage
patterns. And proof means numbers, not some handwaving about it probably
being related to cache lines or whatever.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-15 22:31     ` Thomas Gleixner
@ 2013-07-16  1:19       ` Waiman Long
  -1 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2013-07-16  1:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86,
	linux-kernel, Peter Zijlstra, Steven Rostedt, Andrew Morton,
	Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman,
	Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Chandramouleeswaran, Aswin, Norton, Scott J

On 07/15/2013 06:31 PM, Thomas Gleixner wrote:
> On Fri, 12 Jul 2013, Waiman Long wrote:
>
>> In term of single-thread performance (no contention), a 256K
>> lock/unlock loop was run on a 2.4GHz Westmere x86-64 CPU. The following
>> table shows the average time for a single lock/unlock sequence:
>>
>> Lock Type		Time (ns)
>> ---------		---------
>> Ticket spinlock		  15.7
>> Read lock		  17.0
>> Write lock		  17.2
>> Queue read lock		  31.1
>> Queue write lock	  13.6
>>
>> While the queue read lock is almost double the time of a read lock
>> or spinlock, the queue write lock is the fastest of them all. The
>> execution time can probably be reduced a bit by allowing inlining of
>> the lock fast paths like the other locks.
> So you have the writer side faster than the ticket lock, but what
> about the reader side? And what about the contended case? What about
> high frequent readers with low frequent writers cases?

Yes, the queue read lock takes almost twice the time of a read lock. I am
trying to find a way to reduce that.

>> To see how the queue write lock can be used as a replacement for ticket
>> spinlock (just like rwsem can be used as replacement of mutex), the
>> mb_cache_spinlock in fs/mbcache.c, which is a bottleneck in the disk
>> workload (ext4 FS) of the AIM7 benchmark, was converted to both a queue
>> write lock and a regular write lock. When running on a 8-socket 80-core
>> DL980 system, the performance improvement was shown in the table below.
>>
>> +-----------------+------------+------------+-----------+---------+
>> |  Configuration  |  Mean JPM  |  Mean JPM  | Mean JPM  | qrwlock |
>> |                 |Vanilla 3.10|3.10-qrwlock|3.10-rwlock| %Change |
>> +-----------------+-----------------------------------------------+
>> |                 |              User Range 10 - 100              |
>> +-----------------+-----------------------------------------------+
>> | 8 nodes, HT off |   441374   |   532774   |  637205   | +20.7%  |
>> | 8 nodes, HT on  |   449373   |   584387   |  641965   | +30.1%  |
>> +-----------------+-----------------------------------------------+
>> |                 |              User Range 200 - 1000            |
>> +-----------------+-----------------------------------------------+
>> | 8 nodes, HT off |   226870   |   354490   |  371593   | +56.3%  |
>> | 8 nodes, HT on  |   205997   |   314891   |  306378   | +52.9%  |
>> +-----------------+-----------------------------------------------+
>> |                 |              User Range 1100 - 2000           |
>> +-----------------+-----------------------------------------------+
>> | 8 nodes, HT off |   208234   |   321420   |  343694   | +54.4%  |
>> | 8 nodes, HT on  |   199612   |   297853   |  252569   | +49.2%  |
>> +-----------------+------------+------------+-----------+---------+
>>
>> Apparently, the regular read/write lock performs even better than
>> the queue read/write lock in some cases.  This is probably due to the
> The regular rwlock performs better in most cases. This is the full
> list comparing both against the ticket lock.
>
>      qrlock  	   rwlock
>      +20.7  	   +44.4
>      +30.1  	   +42.9
>
>      +56.3  	   +63.3
>      +52.9  	   +48.8
>
>      +54.4  	   +65.1
>      +49.2  	   +26.5
>
> So you try to sell that qrwlock as a replacement for ticket spinlocks,
> while at the same time you omit the fact that we have an even better
> implementation (except for the last test case) already in the
> kernel. What's the point of this exercise?

The main point is that the regular rwlock is not fair while the queue
rwlock is almost as fair as the ticket spinlock. The LWN article
http://lwn.net/Articles/364583/ discusses eliminating rwlock
altogether precisely because of this unfairness, as it can cause livelock
in certain scenarios. I have also seen slides advising against using rwlock
because of this.
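
To illustrate the unfairness (a schematic scenario, not real code): with
the reader-preference rwlock a writer can spin indefinitely as long as
overlapping readers keep coming, e.g.

	CPU0              CPU1                 CPU2
	read_lock()
	                  write_lock()         <- spins, readers != 0
	                                       read_lock()  <- still succeeds
	read_unlock()
	read_lock()       still spinning       read_unlock()
	...               ...                  read_lock()

With the qrwlock, readers arriving after the writer are mostly queued
behind it instead.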

>> fact that mb_cache_spinlock is in a separate cache line from the data
>> being manipulated.
> This is not about probably. What has the fact that the spinlock is in
> a different cache line to do with the replacement by a different
> implementation? Exactly nothing.
>
> All three candidates ticket spinlocks, qrwlocks and rwlocks are in a
> different cache line than the data which is protected by it.
>
> So what makes the performance difference between the three
> implementations? Definitely not the cache line association as that is
> the same for all implementations. And just for the record, we have
> tools to measure that.
>
> We really need proper explanations and not some guess based drivel to
> assess that.

I will collect more data about that.

> Further down you write in the code:
>
>> + * Compared with regular read/write lock, the queue read/write lock has
>> + * has the following advantages:
>> + * 1. It is more deterministic. Even though there is a slight chance of
> Why is it more deterministic than the existing implementation?

Deterministic means that a process can acquire a lock within a
reasonable time period without being starved for a long time. The
qrwlock grants the lock in FIFO order in most cases. That is what I mean by
being more deterministic.

>
>> + *    stealing the lock if come at the right moment, the granting of the
>> + *    lock is mostly in FIFO order.
>> + * 2. It is faster in high contention situation.
> Again, why is it faster?

The current rwlock implementation suffers from a thundering herd
problem. When many readers are waiting for the lock held by a writer,
they will all jump in more or less at the same time when the writer
releases the lock. That is not the case with qrwlock. It has been shown
in many cases that avoiding this thundering herd problem can lead to
better performance.
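
Schematically (simplified, not the exact code paths), the difference is
where the waiters spin:

	/* regular rwlock: every waiting reader polls the shared lock word */
	while (!read_can_lock(lock))
		cpu_relax();		/* all waiters bounce one cache line */

	/* qrwlock: a queued waiter polls its own node, see wait_in_queue() */
	while (ACCESS_ONCE(node->wait))
		cpu_relax();		/* written only by its predecessor */

So on unlock only the waiter at the head of the queue is released, instead
of the whole herd hitting the lock cache line at once.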

>> + * The only downside is that the lock is 4 bytes larger in 32-bit systems
>> + * and 12 bytes larger in 64-bit systems.
>> + *
>> + * There are two queues for writers. The writer field of the lock is a
>> + * one-slot waiting queue. The writers that follows will have to wait
>> + * in the combined reader/writer queue (waitq).
>> + *
>> + * Compared with x86 ticket spinlock, the queue read/write lock is faster
>> + * in high contention situation. The writer lock is also faster in single
>> + * thread operations. Therefore, queue read/write lock can be considered as
>> + * a replacement for those spinlocks that are highly contended as long as
>> + * an increase in lock size is not an issue.
> So is it faster in the general case or only for the high contention or
> single thread operation cases?
>
> And you still miss to explain WHY it is faster. Can you please explain
> proper WHY it is faster and WHY we can't apply that technique you
> implemented for qrwlocks to writer only locks (aka spinlocks) with a
> smaller lock size?

I will try to collect more data to justify the usefulness of qrwlock.

>> Looking at the fserver and new_fserver workloads (ext4 FS) where the
>> journal->j_state_lock (a read/write lock) is a bottleneck especially
>> when HT is on, we see a slight different story. The j_state_lock is
>> an embedded read/write lock which is in the same cacheline as some
>> of the data being manipulated. The replacement by a queue read/write
>> lock gives the following improvement.
> Again, the fact that the lock is embedded and in the same cache line
> is not explaining anything here. Especially not why this is only
> relevant when HT is enabled. Please provide profound arguments backed
> by solid data collected with the proper tools.

It is the thundering herd problem that I mentioned above which can cause
a lot of cache line bouncing. If the process that got the lock tries to
access data around the lock, it can be slowed down by quite a lot.
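
A hypothetical layout (made-up field names, not the actual jbd2 structure)
shows why:

	struct some_journal {
		rwlock_t	 state_lock;	/* embedded lock ...         */
		unsigned long	 state;		/* ... shares a cache line   */
		struct list_head updates;	/* with data that the lock   */
	};					/* holder keeps writing      */

Every waiter spinning on state_lock keeps pulling that cache line away from
the lock holder that is updating state, so the critical section itself gets
slower.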

>> +--------------------+-------------+--------------+---------------+
>> |      Workload      |mean % change|mean % change |mean % change  |
>> |                    |10-100 users |200-1000 users|1100-2000 users|
>> +--------------------+-------------+--------------+---------------+
>> |fserver (HT off)    |    +0.3%    |    -0.1%     |    +1.9%      |
>> |fserver (HT on)     |    -0.1%    |   +32.2%     |   +34.7%      |
>> |new_fserver (HT on) |    +0.8%    |    +0.9%     |    +0.9%      |
>> |new_fserver (HT off)|    -1.2%    |   +29.8%     |   +40.5%      |
>> +--------------------+-------------+--------------+---------------+
> So fserver gets a 34% improvement for HT ON, while new_fserver gets a
> 40% improvement for HT OFF. Any useful explanations except that the HT
> on/off annotations are swapped?
>
> Without explanations this is completely useless, really.

The reason is that rwlock contention accounts for only 1-2% of total CPU
time when HT is off. When HT is on, the contention accounts for more than
20%. That is why there isn't that much difference with HT off. I should
have listed the perf data in the commit message. I will do that in the
next version.

> And again we want to see the behaviour on machines which do not have a
> gazillion of sockets before we enable this unconditionally, as you try
> to do in the next patch. Just get it: 8 socket machines are nice toys,
> but 99% of the users do not have one.
>
> Aside of that, you are replacing all RW locks unconditionally by this
> new fangled thing, but did you actually run tests which look at other
> rwlock usage sites than the particular one you care about?

Users have the choice of using the old rwlock or the queue rwlock by 
selecting or unselecting the QUEUE_RWLOCK config parameter. I am not 
forcing the unconditional replacement of rwlock by qrwlock.

I will collect more data for the other use cases.

> I really doubt that. Why? Because it depends on the workload. And
> fserver triggers high frequency writers on the j_state_lock. But
> there are enough usage sites of rwlocks where the writer frequency is
> extremly low. So how does the almost doubled reader time affect those
> users of rwlocks?
>
> And reading further down your patch, I can confirm my suspicion.
>
>> +void queue_write_lock(struct qrwlock *lock)
>> +{
> ....
>> +	/*
>> +	 * At the head of the wait queue now, wait until no writer is pending
>> +	 * and then atomically set it again.
>> +	 */
> You are optimizing for the high frequency writer case. And that's not
> the primary use case for rwlocks. That's the special use case for the
> jbd2 journal_state_lock which CANNOT be generalized for all other
> rwlock usage sites.

It is true that this lock is somewhat optimized for writers. For
reader-heavy code, the performance may not be as good as the rwlock in
uncontended cases. However, I do believe that the fairness attribute of
the qrwlock far outweighs the slight performance overhead of read
lock/unlock. Furthermore, the lock/unlock sequence contributes only a
very tiny percentage of total CPU time in uncontended cases. A slight
increase may not really have a material impact on performance. Again, as
promised, I will try to collect some more performance data for
reader-heavy usage cases.

Regards,
Longman

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
@ 2013-07-16  1:19       ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2013-07-16  1:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86,
	linux-kernel, Peter Zijlstra, Steven Rostedt, Andrew Morton,
	Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman,
	Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Chandramouleeswaran, Aswin, Norton, Scott J

On 07/15/2013 06:31 PM, Thomas Gleixner wrote:
> On Fri, 12 Jul 2013, Waiman Long wrote:
>
>> In term of single-thread performance (no contention), a 256K
>> lock/unlock loop was run on a 2.4GHz Westmere x86-64 CPU. The following
>> table shows the average time for a single lock/unlock sequence:
>>
>> Lock Type		Time (ns)
>> ---------		---------
>> Ticket spinlock		  15.7
>> Read lock		  17.0
>> Write lock		  17.2
>> Queue read lock		  31.1
>> Queue write lock	  13.6
>>
>> While the queue read lock is almost double the time of a read lock
>> or spinlock, the queue write lock is the fastest of them all. The
>> execution time can probably be reduced a bit by allowing inlining of
>> the lock fast paths like the other locks.
> So you have the writer side faster than the ticket lock, but what
> about the reader side? And what about the contended case? What about
> high frequent readers with low frequent writers cases?

Yes, the queue read lock takes almost twice the time of a read lock. I am
trying to find a way to reduce that.

>> To see how the queue write lock can be used as a replacement for ticket
>> spinlock (just like rwsem can be used as replacement of mutex), the
>> mb_cache_spinlock in fs/mbcache.c, which is a bottleneck in the disk
>> workload (ext4 FS) of the AIM7 benchmark, was converted to both a queue
>> write lock and a regular write lock. When running on a 8-socket 80-core
>> DL980 system, the performance improvement was shown in the table below.
>>
>> +-----------------+------------+------------+-----------+---------+
>> |  Configuration  |  Mean JPM  |  Mean JPM  | Mean JPM  | qrwlock |
>> |                 |Vanilla 3.10|3.10-qrwlock|3.10-rwlock| %Change |
>> +-----------------+-----------------------------------------------+
>> |                 |              User Range 10 - 100              |
>> +-----------------+-----------------------------------------------+
>> | 8 nodes, HT off |   441374   |   532774   |  637205   | +20.7%  |
>> | 8 nodes, HT on  |   449373   |   584387   |  641965   | +30.1%  |
>> +-----------------+-----------------------------------------------+
>> |                 |              User Range 200 - 1000            |
>> +-----------------+-----------------------------------------------+
>> | 8 nodes, HT off |   226870   |   354490   |  371593   | +56.3%  |
>> | 8 nodes, HT on  |   205997   |   314891   |  306378   | +52.9%  |
>> +-----------------+-----------------------------------------------+
>> |                 |              User Range 1100 - 2000           |
>> +-----------------+-----------------------------------------------+
>> | 8 nodes, HT off |   208234   |   321420   |  343694   | +54.4%  |
>> | 8 nodes, HT on  |   199612   |   297853   |  252569   | +49.2%  |
>> +-----------------+------------+------------+-----------+---------+
>>
>> Apparently, the regular read/write lock performs even better than
>> the queue read/write lock in some cases.  This is probably due to the
> The regular rwlock performs better in most cases. This is the full
> list comparing both against the ticket lock.
>
>      qrlock  	   rwlock
>      +20.7  	   +44.4
>      +30.1  	   +42.9
>
>      +56.3  	   +63.3
>      +52.9  	   +48.8
>
>      +54.4  	   +65.1
>      +49.2  	   +26.5
>
> So you try to sell that qrwlock as a replacement for ticket spinlocks,
> while at the same time you omit the fact that we have an even better
> implementation (except for the last test case) already in the
> kernel. What's the point of this exercise?

The main point is that the regular rwlock is not fair while the queue 
rwlock is close to as fair as the ticket spinlock. The LWN article 
http://lwn.net/Articles/364583/ mentioned about eliminating rwlock 
altogether precisely because of this unfairness as it can cause livelock 
in certain scenerio. I also saw slides to advise again using rwlock 
because of this.

>> fact that mb_cache_spinlock is in a separate cache line from the data
>> being manipulated.
> This is not about probably. What has the fact that the spinlock is in
> a different cache line to do with the replacement by a different
> implementation? Exactly nothing.
>
> All three candidates ticket spinlocks, qrwlocks and rwlocks are in a
> different cache line than the data which is protected by it.
>
> So what makes the performance difference between the three
> implementations? Definitely not the cache line association as that is
> the same for all implementations. And just for the record, we have
> tools to measure that.
>
> We really need proper explanations and not some guess based drivel to
> assess that.

I will collect more data about that.

> Further down you write in the code:
>
>> + * Compared with regular read/write lock, the queue read/write lock has
>> + * has the following advantages:
>> + * 1. It is more deterministic. Even though there is a slight chance of
> Why is it more deterministic than the existing implementation?

Deterministic means that that a process can acquire a lock within a 
reasonable time period without being starved for a long time. The 
qrwlock grants lock in FIFO order in most cases. That is what I mean by 
being more deterministic.

>
>> + *    stealing the lock if come at the right moment, the granting of the
>> + *    lock is mostly in FIFO order.
>> + * 2. It is faster in high contention situation.
> Again, why is it faster?

The current rwlock implementation suffers from a thundering herd 
problem. When many readers are waiting for the lock hold by a writer, 
they will all jump in more or less at the same time when the writer 
releases the lock. That is not the case with qrwlock. It has been shown 
in many cases that avoiding this thundering herd problem can lead to 
better performance.

>> + * The only downside is that the lock is 4 bytes larger in 32-bit systems
>> + * and 12 bytes larger in 64-bit systems.
>> + *
>> + * There are two queues for writers. The writer field of the lock is a
>> + * one-slot waiting queue. The writers that follows will have to wait
>> + * in the combined reader/writer queue (waitq).
>> + *
>> + * Compared with x86 ticket spinlock, the queue read/write lock is faster
>> + * in high contention situation. The writer lock is also faster in single
>> + * thread operations. Therefore, queue read/write lock can be considered as
>> + * a replacement for those spinlocks that are highly contended as long as
>> + * an increase in lock size is not an issue.
> So is it faster in the general case or only for the high contention or
> single thread operation cases?
>
> And you still miss to explain WHY it is faster. Can you please explain
> proper WHY it is faster and WHY we can't apply that technique you
> implemented for qrwlocks to writer only locks (aka spinlocks) with a
> smaller lock size?

I will try to collect more data to justify the usefulness of qrwlock.

>> Looking at the fserver and new_fserver workloads (ext4 FS) where the
>> journal->j_state_lock (a read/write lock) is a bottleneck especially
>> when HT is on, we see a slight different story. The j_state_lock is
>> an embedded read/write lock which is in the same cacheline as some
>> of the data being manipulated. The replacement by a queue read/write
>> lock gives the following improvement.
> Again, the fact that the lock is embedded and in the same cache line
> is not explaining anything here. Especially not why this is only
> relevant when HT is enabled. Please provide profound arguments backed
> by solid data collected with the proper tools.

It is the thundering herd problem that I mentioned above which can cause 
a lot of cacheline bouncing. If the process that got the lock try to 
access data around the lock, it can be slowed down by quite a lot.

>> +--------------------+-------------+--------------+---------------+
>> |      Workload      |mean % change|mean % change |mean % change  |
>> |                    |10-100 users |200-1000 users|1100-2000 users|
>> +--------------------+-------------+--------------+---------------+
>> |fserver (HT off)    |    +0.3%    |    -0.1%     |    +1.9%      |
>> |fserver (HT on)     |    -0.1%    |   +32.2%     |   +34.7%      |
>> |new_fserver (HT on) |    +0.8%    |    +0.9%     |    +0.9%      |
>> |new_fserver (HT off)|    -1.2%    |   +29.8%     |   +40.5%      |
>> +--------------------+-------------+--------------+---------------+
> So fserver gets a 34% improvement for HT ON, while new_fserver gets a
> 40% improvement for HT OFF. Any useful explanations except that the HT
> on/off annotations are swapped?
>
> Without explanations this is completely useless, really.

The reason is that rwlock contention accounts for only 1-2% of total CPU 
time when HT is off. When HT is on, the contention account for more than 
20%. That is why there isn't that much difference with HT off. I should 
have listed the perf data in the commit message. I will do that in the 
next version.

> And again we want to see the behaviour on machines which do not have a
> gazillion of sockets before we enable this unconditionally, as you try
> to do in the next patch. Just get it: 8 socket machines are nice toys,
> but 99% of the users do not have one.
>
> Aside of that, you are replacing all RW locks unconditionally by this
> new fangled thing, but did you actually run tests which look at other
> rwlock usage sites than the particular one you care about?

Users have the choice of using the old rwlock or the queue rwlock by 
selecting or unselecting the QUEUE_RWLOCK config parameter. I am not 
forcing the unconditional replacement of rwlock by qrwlock.

I will collect more data for the other use cases.

> I really doubt that. Why? Because it depends on the workload. And
> fserver triggers high frequency writers on the j_state_lock. But
> there are enough usage sites of rwlocks where the writer frequency is
> extremly low. So how does the almost doubled reader time affect those
> users of rwlocks?
>
> And reading further down your patch, I can confirm my suspicion.
>
>> +void queue_write_lock(struct qrwlock *lock)
>> +{
> ....
>> +	/*
>> +	 * At the head of the wait queue now, wait until no writer is pending
>> +	 * and then atomically set it again.
>> +	 */
> You are optimizing for the high frequency writer case. And that's not
> the primary use case for rwlocks. That's the special use case for the
> jbd2 journal_state_lock which CANNOT be generalized for all other
> rwlock usage sites.

It is true that this lock is kind of optimized for writers. For reader 
heavy code, the performance may not be as good as the rwlock for 
uncontended cases. However, I do believe that the fairness attribute of 
the qrwlock far outweigh the slight performance overhead of read 
lock/unlock. Furthermore, the lock/unlock sequence contributes only a 
very tiny percentage of total CPU time in uncontended cases. A slight 
increase may not really have a material impact on performance. Again, as 
promised, I will try to collect some more performance data for reader 
heavy usage cases.

Regards,
Longman

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-16  1:19       ` Waiman Long
@ 2013-07-18  7:42         ` Ingo Molnar
  -1 siblings, 0 replies; 49+ messages in thread
From: Ingo Molnar @ 2013-07-18  7:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Richard Weinberger, Catalin Marinas,
	Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita,
	Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Chandramouleeswaran, Aswin,
	Norton, Scott J


* Waiman Long <waiman.long@hp.com> wrote:

> >
> >>+ *    stealing the lock if come at the right moment, the granting of the
> >>+ *    lock is mostly in FIFO order.
> >>+ * 2. It is faster in high contention situation.
> >
> > Again, why is it faster?
> 
> The current rwlock implementation suffers from a thundering herd 
> problem. When many readers are waiting for the lock held by a writer, 
> they will all jump in more or less at the same time when the writer 
> releases the lock. That is not the case with qrwlock. It has been shown 
> in many cases that avoiding this thundering herd problem can lead to 
> better performance.

Btw., it's possible to further optimize this "writer releases the lock to 
multiple readers spinning" thundering herd scenario in the classic 
read_lock() case, without changing the queueing model.

Right now read_lock() fast path is a single atomic instruction. When a 
writer releases the lock then it makes it available to all readers and 
each reader will execute a LOCK DEC instruction which will succeed.

This is the relevant code in arch/x86/lib/rwlock.S [edited for 
readability]:

__read_lock_failed():

0:      LOCK_PREFIX
        READ_LOCK_SIZE(inc) (%__lock_ptr)

1:      rep; nop
        READ_LOCK_SIZE(cmp) $1, (%__lock_ptr)
        js      1b

        LOCK_PREFIX READ_LOCK_SIZE(dec) (%__lock_ptr)
        js      0b

        ret

This is where we could optimize: instead of signalling to each reader that 
it's fine to decrease the count and letting dozens of readers do that on 
the same cache-line, which ping-pongs around the numa cross-connect 
touching every other CPU as they execute the LOCK DEC instruction, we 
could let the _writer_ modify the count on unlock in essence, to the exact 
value that readers expect.

Since read_lock() can never abort this should be relatively 
straightforward: the INC above could be left out, and the writer side 
needs to detect that there are no other writers waiting and can set the 
count to 'reader locked' value - which the readers will detect without 
modifying the cache line:

__read_lock_failed():

0:      rep; nop
        READ_LOCK_SIZE(cmp) $1, (%__lock_ptr)
        js      0b

        ret

(Unless I'm missing something that is.)

That way the current write_unlock() followed by a 'thundering herd' of 
__read_lock_failed() atomic accesses is transformed into an efficient 
read-only broadcast of information with only a single update to the 
cacheline: the writer-updated cacheline propagates in parallel to every 
CPU and is cached there.

On typical hardware this will be broadcast to all CPUs as part of regular 
MESI invalidation bus traffic.

reader unlock will still have to modify the cacheline, so rwlocks will 
still have a fundamental scalability limit even in the read-only usecase.
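
For illustration, a rough userspace C11 sketch of the hand-off described
above (made-up bias value and names, not the real arch_rwlock_t layout;
it covers only the read slow path and the writer-side unlock, and assumes
the writer has already checked that no other writer is waiting):

#include <stdatomic.h>

#define BIAS 0x01000000                         /* lock word value when free */

struct sketch_rwlock { atomic_int cnt; };       /* initialized to BIAS */

static void sketch_read_lock(struct sketch_rwlock *l)
{
        /* Fast path: one atomic dec, exactly as today. */
        if (atomic_fetch_sub(&l->cnt, 1) > 0)
                return;
        /*
         * Slow path: do NOT undo the dec.  Spin read-only until the
         * writer hands the word back with this dec already accounted.
         */
        while (atomic_load(&l->cnt) < 0)
                ;       /* cpu_relax() */
}

static void sketch_write_unlock(struct sketch_rwlock *l)
{
        /*
         * Restoring the bias makes the count non-negative again and
         * already reflects the spinning readers, so they can proceed
         * without writing the cache line a second time.  (Assumes no
         * other writer is waiting, as noted above.)
         */
        atomic_fetch_add(&l->cnt, BIAS);
}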

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-16  1:19       ` Waiman Long
@ 2013-07-18 10:22         ` Thomas Gleixner
  -1 siblings, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2013-07-18 10:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86,
	linux-kernel, Peter Zijlstra, Steven Rostedt, Andrew Morton,
	Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman,
	Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Chandramouleeswaran, Aswin, Norton, Scott J

Waiman,

On Mon, 15 Jul 2013, Waiman Long wrote:
> On 07/15/2013 06:31 PM, Thomas Gleixner wrote:
> > On Fri, 12 Jul 2013, Waiman Long wrote:
> > > Apparently, the regular read/write lock performs even better than
> > > the queue read/write lock in some cases.  This is probably due to the
> > The regular rwlock performs better in most cases. This is the full
> > list comparing both against the ticket lock.
> > 
> >      qrlock  	   rwlock
> >      +20.7  	   +44.4
> >      +30.1  	   +42.9
> > 
> >      +56.3  	   +63.3
> >      +52.9  	   +48.8
> > 
> >      +54.4  	   +65.1
> >      +49.2  	   +26.5
> > 
> > So you try to sell that qrwlock as a replacement for ticket spinlocks,
> > while at the same time you omit the fact that we have an even better
> > implementation (except for the last test case) already in the
> > kernel. What's the point of this exercise?
> 
> The main point is that the regular rwlock is not fair while the
> queue rwlock is close to as fair as the ticket spinlock. The LWN
> article http://lwn.net/Articles/364583/ mentioned eliminating
> rwlock altogether precisely because of this unfairness, as it can
> cause livelock in certain scenarios. I also saw slides advising
> against using rwlock because of this.

I'm well aware of this. But that does not explain anything of what I
asked.

> > > + * has the following advantages:
> > > + * 1. It is more deterministic. Even though there is a slight chance
> > > of
> > Why is it more deterministic than the existing implementation?
> 
> Deterministic means that a process can acquire a lock within a
> reasonable time period without being starved for a long time. The qrwlock
> grants the lock in FIFO order in most cases. That is what I mean by being more
> deterministic.

That's exactly the kind of explanation we want to have in the code and
the changelog.

> > 
> > > + *    stealing the lock if come at the right moment, the granting of
> > > the
> > > + *    lock is mostly in FIFO order.
> > > + * 2. It is faster in high contention situation.
> > Again, why is it faster?
> 
> The current rwlock implementation suffers from a thundering herd problem.
> When many readers are waiting for the lock held by a writer, they will all
> jump in more or less at the same time when the writer releases the lock.
> That is not the case with qrwlock. It has been shown in many cases that
> avoiding this thundering herd problem can lead to better performance.

That makes sense and wants to be documented as well. You could have
avoided a lot of the discussion if you had included these details
right away.

> > > + * an increase in lock size is not an issue.
> > So is it faster in the general case or only for the high contention or
> > single thread operation cases?
> > 
> > And you still miss to explain WHY it is faster. Can you please explain
> > proper WHY it is faster and WHY we can't apply that technique you
> > implemented for qrwlocks to writer only locks (aka spinlocks) with a
> > smaller lock size?
> 
> I will try to collect more data to justify the usefulness of qrwlock.

And please provide a proper argument why we can't use the same
technique for spinlocks.

> > Aside of that, you are replacing all RW locks unconditionally by this
> > new fangled thing, but did you actually run tests which look at other
> > rwlock usage sites than the particular one you care about?
> 
> Users have the choice of using the old rwlock or the queue rwlock by
> selecting or unselecting the QUEUE_RWLOCK config parameter. I am not
> forcing the unconditional replacement of rwlock by qrwlock.

Looking at patch 2/2:

+config ARCH_QUEUE_RWLOCK
+       def_bool y

What's conditional about that? Where is the choice?

> > You are optimizing for the high frequency writer case. And that's not
> > the primary use case for rwlocks. That's the special use case for the
> > jbd2 journal_state_lock which CANNOT be generalized for all other
> > rwlock usage sites.
> 
> It is true that this lock is kind of optimized for writers. For
> reader heavy code, the performance may not be as good as the rwlock
> for uncontended cases. However, I do believe that the fairness
> attribute of the qrwlock far outweighs the slight performance
> overhead of read lock/unlock.  Furthermore, the lock/unlock sequence
> contributes only a very tiny percentage of total CPU time in
> uncontended cases. A slight increase may not really have a material
> impact on performance. Again, as promised, I will try to collect
> some more performance data for reader heavy usage cases.

Yes, please. We really need this information, and if it turns out that
it does not affect reader-heavy sites, I have no objections against
the technology itself.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-18  7:42         ` Ingo Molnar
@ 2013-07-18 13:40           ` Waiman Long
  -1 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2013-07-18 13:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Richard Weinberger, Catalin Marinas,
	Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita,
	Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Chandramouleeswaran, Aswin,
	Norton, Scott J

On 07/18/2013 03:42 AM, Ingo Molnar wrote:
> * Waiman Long<waiman.long@hp.com>  wrote:
>
>>>> + *    stealing the lock if come at the right moment, the granting of the
>>>> + *    lock is mostly in FIFO order.
>>>> + * 2. It is faster in high contention situation.
>>> Again, why is it faster?
>> The current rwlock implementation suffers from a thundering herd
>> problem. When many readers are waiting for the lock held by a writer,
>> they will all jump in more or less at the same time when the writer
>> releases the lock. That is not the case with qrwlock. It has been shown
>> in many cases that avoiding this thundering herd problem can lead to
>> better performance.
> Btw., it's possible to further optimize this "writer releases the lock to
> multiple readers spinning" thundering herd scenario in the classic
> read_lock() case, without changing the queueing model.
>
> Right now read_lock() fast path is a single atomic instruction. When a
> writer releases the lock then it makes it available to all readers and
> each reader will execute a LOCK DEC instruction which will succeed.
>
> This is the relevant code in arch/x86/lib/rwlock.S [edited for
> readability]:
>
> __read_lock_failed():
>
> 0:      LOCK_PREFIX
>          READ_LOCK_SIZE(inc) (%__lock_ptr)
>
> 1:      rep; nop
>          READ_LOCK_SIZE(cmp) $1, (%__lock_ptr)
>          js      1b
>
>          LOCK_PREFIX READ_LOCK_SIZE(dec) (%__lock_ptr)
>          js      0b
>
>          ret
>
> This is where we could optimize: instead of signalling to each reader that
> it's fine to decrease the count and letting dozens of readers do that on
> the same cache-line, which ping-pongs around the numa cross-connect
> touching every other CPU as they execute the LOCK DEC instruction, we
> could let the _writer_ modify the count on unlock in essence, to the exact
> value that readers expect.
>
> Since read_lock() can never abort this should be relatively
> straightforward: the INC above could be left out, and the writer side
> needs to detect that there are no other writers waiting and can set the
> count to 'reader locked' value - which the readers will detect without
> modifying the cache line:
>
> __read_lock_failed():
>
> 0:      rep; nop
>          READ_LOCK_SIZE(cmp) $1, (%__lock_ptr)
>          js      0b
>
>          ret
>
> (Unless I'm missing something that is.)
>
> That way the current write_unlock() followed by a 'thundering herd' of
> __read_lock_failed() atomic accesses is transformed into an efficient
> read-only broadcast of information with only a single update to the
> cacheline: the writer-updated cacheline propagates in parallel to every
> CPU and is cached there.
>
> On typical hardware this will be broadcast to all CPUs as part of regular
> MESI invalidation bus traffic.
>
> reader unlock will still have to modify the cacheline, so rwlocks will
> still have a fundamental scalability limit even in the read-only usecase.

I think that will work. The only drawback that I can see is the fairness 
argument. The current read/write lock implementation is unfair to the 
writer. That change will make it even more unfair to the writer and 
there is no easy way to detect a waiting writer unless we change the 
structure to add such a field. As a result, a steady stream of readers 
will have a higher chance of blocking out a writer indefinitely.
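
(Purely to illustrate what "add such a field" could look like - a
hypothetical layout, not part of the patch:)

#include <stdatomic.h>

/* Hypothetical rwlock layout with an explicit writer-waiting count. */
struct rwlock_with_wwait {
        atomic_int cnt;                 /* reader/writer count, as today      */
        atomic_int writers_waiting;     /* bumped by a writer before it spins */
};

With something like this, the unlocking writer could hand the lock word
straight to the spinning readers only when writers_waiting is zero, which
would bound the extra unfairness.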

Regards,
Longman

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-18 10:22         ` Thomas Gleixner
@ 2013-07-18 14:19           ` Waiman Long
  -1 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2013-07-18 14:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86,
	linux-kernel, Peter Zijlstra, Steven Rostedt, Andrew Morton,
	Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman,
	Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Chandramouleeswaran, Aswin, Norton, Scott J

On 07/18/2013 06:22 AM, Thomas Gleixner wrote:
> Waiman,
>
> On Mon, 15 Jul 2013, Waiman Long wrote:
>> On 07/15/2013 06:31 PM, Thomas Gleixner wrote:
>>> On Fri, 12 Jul 2013, Waiman Long wrote:
>>>> Apparently, the regular read/write lock performs even better than
>>>> the queue read/write lock in some cases.  This is probably due to the
>>> The regular rwlock performs better in most cases. This is the full
>>> list comparing both against the ticket lock.
>>>
>>>       qrlock  	   rwlock
>>>       +20.7  	   +44.4
>>>       +30.1  	   +42.9
>>>
>>>       +56.3  	   +63.3
>>>       +52.9  	   +48.8
>>>
>>>       +54.4  	   +65.1
>>>       +49.2  	   +26.5
>>>
>>> So you try to sell that qrwlock as a replacement for ticket spinlocks,
>>> while at the same time you omit the fact that we have an even better
>>> implementation (except for the last test case) already in the
>>> kernel. What's the point of this exercise?
>> The main point is that the regular rwlock is not fair while the
>> queue rwlock is close to as fair as the ticket spinlock. The LWN
>> article http://lwn.net/Articles/364583/ mentioned eliminating
>> rwlock altogether precisely because of this unfairness, as it can
>> cause livelock in certain scenarios. I also saw slides advising
>> against using rwlock because of this.
> I'm well aware of this. But that does not explain anything of what I
> asked.
>
>>>> + * has the following advantages:
>>>> + * 1. It is more deterministic. Even though there is a slight chance
>>>> of
>>> Why is it more deterministic than the existing implementation?
>> Deterministic means that a process can acquire a lock within a
>> reasonable time period without being starved for a long time. The qrwlock
>> grants the lock in FIFO order in most cases. That is what I mean by being more
>> deterministic.
> That's exactly the kind of explanation we want to have in the code and
> the changelog.

Sometimes I may take things for granted without explaining them in more 
detail. Thanks for reminding me, and I will try to include more data to 
support my arguments.

>>>> + *    stealing the lock if come at the right moment, the granting of
>>>> the
>>>> + *    lock is mostly in FIFO order.
>>>> + * 2. It is faster in high contention situation.
>>> Again, why is it faster?
>> The current rwlock implementation suffers from a thundering herd problem.
>> When many readers are waiting for the lock held by a writer, they will all
>> jump in more or less at the same time when the writer releases the lock.
>> That is not the case with qrwlock. It has been shown in many cases that
>> avoiding this thundering herd problem can lead to better performance.
> That makes sense and wants to be documented as well. You could have
> avoided a lot of the discussion if you had included these details
> right away.
>
>>>> + * an increase in lock size is not an issue.
>>> So is it faster in the general case or only for the high contention or
>>> single thread operation cases?
>>>
>>> And you still miss to explain WHY it is faster. Can you please explain
>>> proper WHY it is faster and WHY we can't apply that technique you
>>> implemented for qrwlocks to writer only locks (aka spinlocks) with a
>>> smaller lock size?
>> I will try to collect more data to justify the usefulness of qrwlock.
> And please provide a proper argument why we can't use the same
> technique for spinlocks.

Of course, we can use the same technique for spinlocks. Since we only 
need 1 bit for the lock, we could combine the lock bit with the queue 
address, at the cost of a little more overhead in terms of coding and speed. 
That will make the new lock 4 bytes in size for 32-bit code & 8 bytes 
for 64-bit code. That could solve a lot of the performance problems that we 
have with spinlocks. However, I am aware that increasing the size of the 
spinlock (for 64-bit systems) may break a lot of inherent alignment in 
many of the data structures. That is why I am not proposing such a 
change right now. But if there is enough interest, we could certainly go 
ahead and see how things go.
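
For illustration, a userspace C11 sketch of such a combined word (the
names, layout and trylock-only fast path are made up for this example):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define Q_LOCKED        ((uintptr_t)1)  /* bit 0 is the lock bit */

/*
 * Bit 0 holds the lock, the remaining bits hold the address of the
 * tail queue node (queue nodes must be at least 2-byte aligned so
 * that bit 0 is always free).
 */
struct sketch_qspinlock { _Atomic uintptr_t val; };

static bool sketch_qspin_trylock(struct sketch_qspinlock *l)
{
        uintptr_t expected = 0;
        /* Fast path: take the lock only when nobody holds it or queues. */
        return atomic_compare_exchange_strong(&l->val, &expected, Q_LOCKED);
}

static void sketch_qspin_unlock(struct sketch_qspinlock *l)
{
        /* Clear only the lock bit; a queued waiter stays encoded. */
        atomic_fetch_and(&l->val, ~Q_LOCKED);
}

Such a word is 4 bytes on 32-bit and 8 bytes on 64-bit, matching the sizes
mentioned above.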

>>> Aside of that, you are replacing all RW locks unconditionally by this
>>> new fangled thing, but did you actually run tests which look at other
>>> rwlock usage sites than the particular one you care about?
>> Users have the choice of using the old rwlock or the queue rwlock by
>> selecting or unselecting the QUEUE_RWLOCK config parameter. I am not
>> forcing the unconditional replacement of rwlock by qrwlock.
> Looking at patch 2/2:
>
> +config ARCH_QUEUE_RWLOCK
> +       def_bool y
>
> What's conditional about that? Where is the choice?

The queue read/write lock replaces the classic read/write lock in the 
arch-dependent layer. We will need to make changes to each architecture 
to make the queue read/write lock work. The presence of ARCH_QUEUE_RWLOCK 
just indicates that the changes needed to support the queue read/write 
lock have been made in that architecture. It doesn't mean that it is 
enabled by default. Users still need to choose it (QUEUE_RWLOCK) from the 
configuration menu. The ARCH_QUEUE_RWLOCK dependency does prevent the menu 
option from even showing up for those architectures that have not been 
changed to support this feature. As I don't have test machines for the 
other architectures to verify the changes, my patch set enables the queue 
read/write lock for x86 only.

>>> You are optimizing for the high frequency writer case. And that's not
>>> the primary use case for rwlocks. That's the special use case for the
>>> jbd2 journal_state_lock which CANNOT be generalized for all other
>>> rwlock usage sites.
>> It is true that this lock is kind of optimized for writers. For
>> reader heavy code, the performance may not be as good as the rwlock
>> for uncontended cases. However, I do believe that the fairness
>> attribute of the qrwlock far outweighs the slight performance
>> overhead of read lock/unlock.  Furthermore, the lock/unlock sequence
>> contributes only a very tiny percentage of total CPU time in
>> uncontended cases. A slight increase may not really have a material
>> impact on performance. Again, as promised, I will try to collect
>> some more performance data for reader heavy usage cases.
> Yes, please. We really need this information, and if it turns out that
> it does not affect reader-heavy sites, I have no objections against
> the technology itself.

I am collecting more performance data right now for the next revision of 
the patch. I have optimized the fast path of the lock so that a write lock 
now takes about 2/3 the time of a spinlock, and a read lock takes 1.5X the 
spinlock time instead of 2X.
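
For reference, a simplified userspace C11 sketch of the kind of fast paths
meant here (the layout and constants are made up for illustration; this is
not the actual patch code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define W_LOCKED        0xffU           /* low byte: writer owns the lock */

/* readers counted in the upper bits, writer state in the low byte */
struct sketch_qrwlock { _Atomic uint32_t cnts; };

static bool sketch_write_trylock(struct sketch_qrwlock *l)
{
        uint32_t expected = 0;
        /* Write fast path: a single cmpxchg on a completely free lock. */
        return atomic_compare_exchange_strong(&l->cnts, &expected, W_LOCKED);
}

static bool sketch_read_trylock(struct sketch_qrwlock *l)
{
        /* Read fast path: one fetch-add, backed out if a writer is there. */
        uint32_t cnts = atomic_fetch_add(&l->cnts, 1U << 8);
        if (!(cnts & W_LOCKED))
                return true;
        atomic_fetch_sub(&l->cnts, 1U << 8);
        return false;
}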

Regards,
Longman

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
@ 2013-07-18 14:19           ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2013-07-18 14:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86,
	linux-kernel, Peter Zijlstra, Steven Rostedt, Andrew Morton,
	Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman,
	Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Chandramouleeswaran, Aswin, Norton, Scott J

On 07/18/2013 06:22 AM, Thomas Gleixner wrote:
> Waiman,
>
> On Mon, 15 Jul 2013, Waiman Long wrote:
>> On 07/15/2013 06:31 PM, Thomas Gleixner wrote:
>>> On Fri, 12 Jul 2013, Waiman Long wrote:
>>>> Apparently, the regular read/write lock performs even better than
>>>> the queue read/write lock in some cases.  This is probably due to the
>>> The regular rwlock performs better in most cases. This is the full
>>> list comparing both against the ticket lock.
>>>
>>>       qrlock  	   rwlock
>>>       +20.7  	   +44.4
>>>       +30.1  	   +42.9
>>>
>>>       +56.3  	   +63.3
>>>       +52.9  	   +48.8
>>>
>>>       +54.4  	   +65.1
>>>       +49.2  	   +26.5
>>>
>>> So you try to sell that qrwlock as a replacement for ticket spinlocks,
>>> while at the same time you omit the fact that we have an even better
>>> implementation (except for the last test case) already in the
>>> kernel. What's the point of this exercise?
>> The main point is that the regular rwlock is not fair while the
>> queue rwlock is close to as fair as the ticket spinlock. The LWN
>> article http://lwn.net/Articles/364583/ mentioned about eliminating
>> rwlock altogether precisely because of this unfairness as it can
>> cause livelock in certain scenerio. I also saw slides to advise
>> again using rwlock because of this.
> I'm well aware of this. But that does not explain anything of what I
> asked.
>
>>>> + * has the following advantages:
>>>> + * 1. It is more deterministic. Even though there is a slight chance
>>>> of
>>> Why is it more deterministic than the existing implementation?
>> Deterministic means that that a process can acquire a lock within a
>> reasonable time period without being starved for a long time. The qrwlock
>> grants lock in FIFO order in most cases. That is what I mean by being more
>> deterministic.
> That's exactly the kind of explanation we want to have in the code and
> the changelog.

Sometimes I may take things for granted without explaining them in more 
details. Thank for reminding me and I will try to get more data in to 
support my arguments.

>>>> + *    stealing the lock if come at the right moment, the granting of
>>>> the
>>>> + *    lock is mostly in FIFO order.
>>>> + * 2. It is faster in high contention situation.
>>> Again, why is it faster?
>> The current rwlock implementation suffers from a thundering herd problem.
>> When many readers are waiting for the lock hold by a writer, they will all
>> jump in more or less at the same time when the writer releases the lock.
>> That is not the case with qrwlock. It has been shown in many cases that
>> avoiding this thundering herd problem can lead to better performance.
> That makes sense and wants to be documented as well. You could have
> avoided a lot of the discussion if you had included these details
> right away.
>
>>>> + * an increase in lock size is not an issue.
>>> So is it faster in the general case or only for the high contention or
>>> single thread operation cases?
>>>
>>> And you still miss to explain WHY it is faster. Can you please explain
>>> proper WHY it is faster and WHY we can't apply that technique you
>>> implemented for qrwlocks to writer only locks (aka spinlocks) with a
>>> smaller lock size?
>> I will try to collect more data to justify the usefulness of qrwlock.
> And please provide a proper argument why we can't use the same
> technique for spinlocks.

Of course, we can use the same technique for spinlock. Since we only 
need 1 bit for lock, we could combine the lock bit with the queue 
address with a little bit more overhead in term of coding and speed.  
That will make the new lock 4 bytes in size for 32-bit code & 8 bytes 
for 64-bit code. That could solve a lot of performance problem that we 
have with spinlock. However, I am aware that increasing the size of 
spinlock (for 64-bit systems) may break a lot of inherent alignment in 
many of the data structures. That is why I am not proposing such a 
change right now. But if there is enough interest, we could certainly go 
ahead and see how things go.

>>> Aside of that, you are replacing all RW locks unconditionally by this
>>> new fangled thing, but did you actually run tests which look at other
>>> rwlock usage sites than the particular one you care about?
>> Users have the choice of using the old rwlock or the queue rwlock by
>> selecting or unselecting the QUEUE_RWLOCK config parameter. I am not
>> forcing the unconditional replacement of rwlock by qrwlock.
> Looking at patch 2/2:
>
> +config ARCH_QUEUE_RWLOCK
> +       def_bool y
>
> What's conditional about that? Where is the choice?

The queue read/write lock replaces the classic read/write lock in the 
arch-dependent layer. We will need to make changes to each architecture 
to make queue read/write lock work. The presence of ARCH_QUEUE_RWLOCK 
just indicates that changes are made in that architecture to support the 
use of queue read/write lock. It doesn't mean that it is enabled by 
default. Users still need to choose it (QUEUE_RWLOCK) from the 
configuration menu. The ARCH_QUEUE_RWLOCK does prevent the menu option 
from even showing up for those architectures that have not been changed 
to support this feature. As I don't have test machines for the other 
architectures to verify the changes, my patch set only enables queue 
read/write lock for x86 only.

>>> You are optimizing for the high frequency writer case. And that's not
>>> the primary use case for rwlocks. That's the special use case for the
>>> jbd2 journal_state_lock which CANNOT be generalized for all other
>>> rwlock usage sites.
>> It is true that this lock is kind of optimized for writers. For
>> reader heavy code, the performance may not be as good as the rwlock
>> for uncontended cases. However, I do believe that the fairness
>> attribute of the qrwlock far outweigh the slight performance
>> overhead of read lock/unlock.  Furthermore, the lock/unlock sequence
>> contributes only a very tiny percentage of total CPU time in
>> uncontended cases. A slight increase may not really have a material
>> impact on performance. Again, as promised, I will try to collect
>> some more performance data for reader heavy usage cases.
> Yes, please. We really need this information and if it turns out, that
> it does not affect reader heavy sides, I have no objections against
> the technology itself.

I am collecting more performance data right now for the next revision of 
the patch. I have also optimized the fast path of the lock: a write lock 
now takes about 2/3 of the time of a spinlock, and a read lock takes 
about 1.5X the spinlock time instead of 2X.
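
For concreteness, the kind of fast path being measured here is a single
atomic operation on a word that packs a writer byte together with a
reader count. The following is only a rough user-space illustration
with made-up names and layout, not the actual code in this patch set;
the slow paths, where the queuing happens, are omitted:

#include <stdatomic.h>
#include <stdbool.h>

#define QW_LOCKED       0xffu              /* writer byte value when held */
#define QW_WMASK        0xffu              /* mask for the writer byte    */
#define QR_BIAS         (1u << 8)          /* one reader = +256           */

struct qrwlock_sketch {
        atomic_uint cnts;                  /* readers above bit 8, writer below */
};

/* Reader fast path: one atomic add; succeed if no writer byte was set. */
static bool read_trylock_fast(struct qrwlock_sketch *l)
{
        unsigned int cnts = atomic_fetch_add(&l->cnts, QR_BIAS);

        if (!(cnts & QW_WMASK))
                return true;
        atomic_fetch_sub(&l->cnts, QR_BIAS);  /* back out, take slow path */
        return false;
}

/* Writer fast path: one compare-and-swap from "completely free". */
static bool write_trylock_fast(struct qrwlock_sketch *l)
{
        unsigned int expected = 0;         /* no readers, no writer */

        return atomic_compare_exchange_strong(&l->cnts, &expected, QW_LOCKED);
}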

Regards,
Longman

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-18 13:40           ` Waiman Long
  (?)
@ 2013-07-19  8:40             ` Ingo Molnar
  -1 siblings, 0 replies; 49+ messages in thread
From: Ingo Molnar @ 2013-07-19  8:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Richard Weinberger, Catalin Marinas,
	Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita,
	Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Chandramouleeswaran, Aswin,
	Norton, Scott J


* Waiman Long <waiman.long@hp.com> wrote:

> On 07/18/2013 03:42 AM, Ingo Molnar wrote:
> >* Waiman Long<waiman.long@hp.com>  wrote:
> >
> >>>>+ *    stealing the lock if come at the right moment, the granting of the
> >>>>+ *    lock is mostly in FIFO order.
> >>>>+ * 2. It is faster in high contention situation.
> >>>Again, why is it faster?
> >>The current rwlock implementation suffers from a thundering herd
> >>problem. When many readers are waiting for the lock hold by a writer,
> >>they will all jump in more or less at the same time when the writer
> >>releases the lock. That is not the case with qrwlock. It has been shown
> >>in many cases that avoiding this thundering herd problem can lead to
> >>better performance.
> >Btw., it's possible to further optimize this "writer releases the lock to
> >multiple readers spinning" thundering herd scenario in the classic
> >read_lock() case, without changing the queueing model.
> >
> >Right now read_lock() fast path is a single atomic instruction. When a
> >writer releases the lock then it makes it available to all readers and
> >each reader will execute a LOCK DEC instruction which will succeed.
> >
> >This is the relevant code in arch/x86/lib/rwlock.S [edited for
> >readability]:
> >
> >__read_lock_failed():
> >
> >0:      LOCK_PREFIX
> >         READ_LOCK_SIZE(inc) (%__lock_ptr)
> >
> >1:      rep; nop
> >         READ_LOCK_SIZE(cmp) $1, (%__lock_ptr)
> >         js      1b
> >
> >         LOCK_PREFIX READ_LOCK_SIZE(dec) (%__lock_ptr)
> >         js      0b
> >
> >         ret
> >
> >This is where we could optimize: instead of signalling to each reader that
> >it's fine to decrease the count and letting dozens of readers do that on
> >the same cache-line, which ping-pongs around the numa cross-connect
> >touching every other CPU as they execute the LOCK DEC instruction, we
> >could let the _writer_ modify the count on unlock in essence, to the exact
> >value that readers expect.
> >
> >Since read_lock() can never abort this should be relatively
> >straightforward: the INC above could be left out, and the writer side
> >needs to detect that there are no other writers waiting and can set the
> >count to 'reader locked' value - which the readers will detect without
> >modifying the cache line:
> >
> >__read_lock_failed():
> >
> >0:      rep; nop
> >         READ_LOCK_SIZE(cmp) $1, (%__lock_ptr)
> >         js      0b
> >
> >         ret
> >
> >(Unless I'm missing something that is.)
> >
> >That way the current write_unlock() followed by a 'thundering herd' of
> >__read_lock_failed() atomic accesses is transformed into an efficient
> >read-only broadcast of information with only a single update to the
> >cacheline: the writer-updated cacheline propagates in parallel to every
> >CPU and is cached there.
> >
> >On typical hardware this will be broadcast to all CPUs as part of regular
> >MESI invalidation bus traffic.
> >
> >reader unlock will still have to modify the cacheline, so rwlocks will
> >still have a fundamental scalability limit even in the read-only usecase.
> 
> I think that will work. The only drawback that I can see is the fairness 
> argument. The current read/write lock implementation is unfair to the 
> writer. That change will make it even more unfair to the writer and 
> there is no easy way to detect a waiting writer unless we change the 
> structure to add such a field. As a result, a steady stream of readers 
> will have a higher chance of blocking out a writer indefinitely.

The effect will have to be measured - but I don't think it's particularly 
hard to tune the fairness balance between readers and writers: the change 
I suggested would only affect the case when a writer already holding the 
lock unlocks it.

But when a writer already holds the lock it can decide to pass that lock 
to another spinning writer instead of unlocking to all readers. This too 
should be relatively straightforward to implement because neither 
read_lock() nor write_lock() can abort and race.

Instead of doing:

static inline void arch_write_unlock(arch_rwlock_t *rw)
{
        asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0"
                     : "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory");
}

the current owner could check whether there are other writers waiting and 
could drop into a slowpath that passes ownership to one of the writers via 
toggling bit 30 or so. This reduces the max number of writers by a factor 
of 2.

But I'd implement this only if it proves to be a problem in practice.

I'd strongly suggest to first address the thundering herd problem of 
write_unlock() and see how it affects scalability - before totally 
replacing it all with a new, fundamentally heavier locking primitive!
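
To make that concrete, here is a toy C11 model of the write_unlock()
side idea, with illustrative names and values. It is not the x86 rwlock
code, and it deliberately ignores waiting writers, which is exactly the
detection step mentioned in the quoted text above:

#include <stdatomic.h>

#define RW_BIAS 0x01000000                 /* word == RW_BIAS: unlocked */

void toy_read_lock(atomic_int *l)
{
        /* Fast path: a single atomic RMW, like the LOCK DEC above. */
        if (atomic_fetch_sub(l, 1) > 0)
                return;

        /*
         * Proposed slow path: leave our decrement in place and only
         * *read* until the writer republishes the count.  (Today's slow
         * path re-increments and then retries the RMW, which is the
         * thundering herd.)
         */
        while (atomic_load(l) < 1)
                ;                          /* pure loads, line stays shared */
}

void toy_read_unlock(atomic_int *l)
{
        atomic_fetch_add(l, 1);
}

void toy_write_lock(atomic_int *l)
{
        int expected;

        do {                               /* wait until completely free */
                expected = RW_BIAS;
        } while (!atomic_compare_exchange_weak(l, &expected, 0));
}

void toy_write_unlock(atomic_int *l)
{
        /*
         * Adding the bias back "sets the count to the value readers
         * expect": the decrements of the spinning readers are already
         * folded in, so all of them become lock holders at once without
         * issuing another RMW of their own.
         */
        atomic_fetch_add(l, RW_BIAS);
}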

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-19  8:40             ` Ingo Molnar
  (?)
@ 2013-07-19 15:30               ` Waiman Long
  -1 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2013-07-19 15:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Richard Weinberger, Catalin Marinas,
	Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita,
	Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Chandramouleeswaran, Aswin,
	Norton, Scott J, George Spelvin

On 07/19/2013 04:40 AM, Ingo Molnar wrote:
> * Waiman Long<waiman.long@hp.com>  wrote:
>
>> On 07/18/2013 03:42 AM, Ingo Molnar wrote:
>>> * Waiman Long<waiman.long@hp.com>   wrote:
>>>
>>>>>> + *    stealing the lock if come at the right moment, the granting of the
>>>>>> + *    lock is mostly in FIFO order.
>>>>>> + * 2. It is faster in high contention situation.
>>>>> Again, why is it faster?
>>>> The current rwlock implementation suffers from a thundering herd
>>>> problem. When many readers are waiting for the lock hold by a writer,
>>>> they will all jump in more or less at the same time when the writer
>>>> releases the lock. That is not the case with qrwlock. It has been shown
>>>> in many cases that avoiding this thundering herd problem can lead to
>>>> better performance.
>>> Btw., it's possible to further optimize this "writer releases the lock to
>>> multiple readers spinning" thundering herd scenario in the classic
>>> read_lock() case, without changing the queueing model.
>>>
>>> Right now read_lock() fast path is a single atomic instruction. When a
>>> writer releases the lock then it makes it available to all readers and
>>> each reader will execute a LOCK DEC instruction which will succeed.
>>>
>>> This is the relevant code in arch/x86/lib/rwlock.S [edited for
>>> readability]:
>>>
>>> __read_lock_failed():
>>>
>>> 0:      LOCK_PREFIX
>>>          READ_LOCK_SIZE(inc) (%__lock_ptr)
>>>
>>> 1:      rep; nop
>>>          READ_LOCK_SIZE(cmp) $1, (%__lock_ptr)
>>>          js      1b
>>>
>>>          LOCK_PREFIX READ_LOCK_SIZE(dec) (%__lock_ptr)
>>>          js      0b
>>>
>>>          ret
>>>
>>> This is where we could optimize: instead of signalling to each reader that
>>> it's fine to decrease the count and letting dozens of readers do that on
>>> the same cache-line, which ping-pongs around the numa cross-connect
>>> touching every other CPU as they execute the LOCK DEC instruction, we
>>> could let the _writer_ modify the count on unlock in essence, to the exact
>>> value that readers expect.
>>>
>>> Since read_lock() can never abort this should be relatively
>>> straightforward: the INC above could be left out, and the writer side
>>> needs to detect that there are no other writers waiting and can set the
>>> count to 'reader locked' value - which the readers will detect without
>>> modifying the cache line:
>>>
>>> __read_lock_failed():
>>>
>>> 0:      rep; nop
>>>          READ_LOCK_SIZE(cmp) $1, (%__lock_ptr)
>>>          js      0b
>>>
>>>          ret
>>>
>>> (Unless I'm missing something that is.)
>>>
>>> That way the current write_unlock() followed by a 'thundering herd' of
>>> __read_lock_failed() atomic accesses is transformed into an efficient
>>> read-only broadcast of information with only a single update to the
>>> cacheline: the writer-updated cacheline propagates in parallel to every
>>> CPU and is cached there.
>>>
>>> On typical hardware this will be broadcast to all CPUs as part of regular
>>> MESI invalidation bus traffic.
>>>
>>> reader unlock will still have to modify the cacheline, so rwlocks will
>>> still have a fundamental scalability limit even in the read-only usecase.
>> I think that will work. The only drawback that I can see is the fairness
>> argument. The current read/write lock implementation is unfair to the
>> writer. That change will make it even more unfair to the writer and
>> there is no easy way to detect a waiting writer unless we change the
>> structure to add such a field. As a result, a steady stream of readers
>> will have a higher chance of blocking out a writer indefinitely.
> The effect will have to be measured - but I don't think it's particularly
> hard to tune the fairness balance between readers and writers: the change
> I suggested would only affect the case when a writer already holding the
> lock unlocks it.
>
> But when a writer already holds the lock it can decide to pass that lock
> to another writer-spinning instead of unlocking to all readers. This too
> should be relatively straightforward to implement because neither
> read_lock() nor write_lock() can abort and race.
>
> Instead of doing:
>
> static inline void arch_write_unlock(arch_rwlock_t *rw)
> {
>          asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0"
>                       : "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory");
> }
>
> the current owner could check whether there are other writers waiting and
> could drop into a slowpath that passes ownership to one of the writers via
> toggling bit 30 or so. This reduces the max number of writers by a factor
> of 2.
>
> But I'd implement this only if it proves to be a problem in practice.
>
> I'd strongly suggest to first address the thundering herd problem of
> write_unlock() and see how it affects scalability - before totally
> replacing it all with a new, fundamentally heavier locking primitive!
>
> Thanks,
>
> 	Ingo
I have run some performance tests using the fserver and new_fserver 
benchmarks (on ext4 filesystems) of the AIM7 test suite on an 80-core 
DL980 with HT on. The following kernels were used:

1. Modified 3.10.1 kernel with mb_cache_spinlock in fs/mbcache.c
    replaced by a rwlock
2. Modified 3.10.1 kernel + modified __read_lock_failed code as suggested
    by Ingo
3. Modified 3.10.1 kernel + queue read/write lock
4. Modified 3.10.1 kernel + queue read/write lock in classic read/write
    lock behavior

The last one is with the read lock stealing flag set in the qrwlock
structure to give priority to readers and behave more like the classic
read/write lock with less fairness.

The following table shows the averaged results in the 200-1000
user ranges:

+-----------------+--------+--------+--------+--------+
|  Kernel         |    1   |    2   |    3   |   4    |
+-----------------+--------+--------+--------+--------+
| fserver JPM     | 245598 | 274457 | 403348 | 411941 |
| % change from 1 |   0%   | +11.8% | +64.2% | +67.7% |
+-----------------+--------+--------+--------+--------+
| new-fserver JPM | 231549 | 269807 | 399093 | 399418 |
| % change from 1 |   0%   | +16.5% | +72.4% | +72.5% |
+-----------------+--------+--------+--------+--------+

The following table shows the averaged results in the 1100-2000
user ranges:

+-----------------+--------+--------+--------+--------+
|  Kernel         |    1   |    2   |    3   |   4    |
+-----------------+--------+--------+--------+--------+
| fserver JPM     | 230295 | 263562 | 347161 | 333908 |
| % change from 1 |   0%   | +14.5% | +50.8% | +45.0% |
+-----------------+--------+--------+--------+--------+
| new-fserver JPM | 241154 | 274571 | 338166 | 369550 |
| % change from 1 |   0%   | +13.9% | +40.2% | +53.2% |
+-----------------+--------+--------+--------+--------+

For these 2 benchmarks, the read/write lock bottleneck is in the
j_state_lock of the journal_s structure in include/linux/jbd2.h.

The following table shows the amount of CPU time spent on the slowpaths
of the read and write lock for each of the different kernels listed
above as reported by perf with the fserver benchmark at 1500 users:

+-----------------+--------+--------+--------+--------+
|  Kernel         |    1   |    2   |    3   |   4    |
+-----------------+--------+--------+--------+--------+
| Read slowpath   | 16.9%  | 12.2%  | 11.7%  | 15.3%  |
| Write slowpath  | 13.3%  | 11.5%  | 0.02%  | 0.09%  |
+-----------------+--------+--------+--------+--------+

The small write slowpath numbers indicate that it is a reader-heavy
lock, with writers probably holding the lock a lot longer than the
readers.

It can be seen that Ingo's changes do improve performance by about
10-15% in this reader-heavy, high-contention scenario. They also reduce
the read slowpath time relative to the write slowpath. However, the
queue read/write lock can improve performance by more than 50%. The
queue read/write lock is also much fairer to the writers, as their
waiting time drops significantly.

In low-contention situations, all four kernels perform similarly, with
1-2% performance variation, so it is hard to draw a conclusion.

Regards,
Longman



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-18 14:19           ` Waiman Long
  (?)
@ 2013-07-21  5:42             ` Raghavendra K T
  -1 siblings, 0 replies; 49+ messages in thread
From: Raghavendra K T @ 2013-07-21  5:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Richard Weinberger, Catalin Marinas,
	Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita,
	Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Chandramouleeswaran, Aswin,
	Norton, Scott J

On 07/18/2013 07:49 PM, Waiman Long wrote:
> On 07/18/2013 06:22 AM, Thomas Gleixner wrote:
>> Waiman,
>>
>> On Mon, 15 Jul 2013, Waiman Long wrote:
>>> On 07/15/2013 06:31 PM, Thomas Gleixner wrote:
>>>> On Fri, 12 Jul 2013, Waiman Long wrote:
[...]
>>
>>>>> + * an increase in lock size is not an issue.
>>>> So is it faster in the general case or only for the high contention or
>>>> single thread operation cases?
>>>>
>>>> And you still miss to explain WHY it is faster. Can you please explain
>>>> proper WHY it is faster and WHY we can't apply that technique you
>>>> implemented for qrwlocks to writer only locks (aka spinlocks) with a
>>>> smaller lock size?
>>> I will try to collect more data to justify the usefulness of qrwlock.
>> And please provide a proper argument why we can't use the same
>> technique for spinlocks.
>
> Of course, we can use the same technique for spinlock. Since we only
> need 1 bit for lock, we could combine the lock bit with the queue
> address with a little bit more overhead in term of coding and speed.
> That will make the new lock 4 bytes in size for 32-bit code & 8 bytes
> for 64-bit code. That could solve a lot of performance problem that we
> have with spinlock. However, I am aware that increasing the size of
> spinlock (for 64-bit systems) may break a lot of inherent alignment in
> many of the data structures. That is why I am not proposing such a
> change right now. But if there is enough interest, we could certainly go
> ahead and see how things go.

Setting aside the lock-size question: for spinlocks, is it that the
fastpath overhead of qlocks is less significant in low-contention
scenarios?

Also, let me know if you have a POC implementation for the spinlocks
that you can share. I am happy to test it.

Sorry, different context: apart from AIM7 fserver, is there any other
benchmark that exercises this qrwlock series (to help in the testing)?


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-19 15:30               ` Waiman Long
  (?)
@ 2013-07-22 10:34                 ` Ingo Molnar
  -1 siblings, 0 replies; 49+ messages in thread
From: Ingo Molnar @ 2013-07-22 10:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Richard Weinberger, Catalin Marinas,
	Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita,
	Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Chandramouleeswaran, Aswin,
	Norton, Scott J, George Spelvin


* Waiman Long <waiman.long@hp.com> wrote:

> I had run some performance tests using the fserver and new_fserver 
> benchmarks (on ext4 filesystems) of the AIM7 test suite on a 80-core 
> DL980 with HT on. The following kernels were used:
> 
> 1. Modified 3.10.1 kernel with mb_cache_spinlock in fs/mbcache.c
>    replaced by a rwlock
> 2. Modified 3.10.1 kernel + modified __read_lock_failed code as suggested
>    by Ingo
> 3. Modified 3.10.1 kernel + queue read/write lock
> 4. Modified 3.10.1 kernel + queue read/write lock in classic read/write
>    lock behavior
> 
> The last one is with the read lock stealing flag set in the qrwlock
> structure to give priority to readers and behave more like the classic
> read/write lock with less fairness.
> 
> The following table shows the averaged results in the 200-1000
> user ranges:
> 
> +-----------------+--------+--------+--------+--------+
> |  Kernel         |    1   |    2   |    3   |   4    |
> +-----------------+--------+--------+--------+--------+
> | fserver JPM     | 245598 | 274457 | 403348 | 411941 |
> | % change from 1 |   0%   | +11.8% | +64.2% | +67.7% |
> +-----------------+--------+--------+--------+--------+
> | new-fserver JPM | 231549 | 269807 | 399093 | 399418 |
> | % change from 1 |   0%   | +16.5% | +72.4% | +72.5% |
> +-----------------+--------+--------+--------+--------+

So it's not just herding that is a problem.

I'm wondering, how sensitive is this particular benchmark to fairness? 
I.e. do the 200-1000 simulated users each perform the same number of ops, 
so that any smearing of execution time via unfairness gets amplified?

I.e. does steady-state throughput go up by 60%+ too with your changes?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-21  5:42             ` Raghavendra K T
  (?)
@ 2013-07-23 23:54               ` Waiman Long
  -1 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2013-07-23 23:54 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Richard Weinberger, Catalin Marinas,
	Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita,
	Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Chandramouleeswaran, Aswin,
	Norton, Scott J

On 07/21/2013 01:42 AM, Raghavendra K T wrote:
> On 07/18/2013 07:49 PM, Waiman Long wrote:
>> On 07/18/2013 06:22 AM, Thomas Gleixner wrote:
>>> Waiman,
>>>
>>> On Mon, 15 Jul 2013, Waiman Long wrote:
>>>> On 07/15/2013 06:31 PM, Thomas Gleixner wrote:
>>>>> On Fri, 12 Jul 2013, Waiman Long wrote:
> [...]
>>>
>>>>>> + * an increase in lock size is not an issue.
>>>>> So is it faster in the general case or only for the high 
>>>>> contention or
>>>>> single thread operation cases?
>>>>>
>>>>> And you still miss to explain WHY it is faster. Can you please 
>>>>> explain
>>>>> proper WHY it is faster and WHY we can't apply that technique you
>>>>> implemented for qrwlocks to writer only locks (aka spinlocks) with a
>>>>> smaller lock size?
>>>> I will try to collect more data to justify the usefulness of qrwlock.
>>> And please provide a proper argument why we can't use the same
>>> technique for spinlocks.
>>
>> Of course, we can use the same technique for spinlock. Since we only
>> need 1 bit for lock, we could combine the lock bit with the queue
>> address with a little bit more overhead in term of coding and speed.
>> That will make the new lock 4 bytes in size for 32-bit code & 8 bytes
>> for 64-bit code. That could solve a lot of performance problem that we
>> have with spinlock. However, I am aware that increasing the size of
>> spinlock (for 64-bit systems) may break a lot of inherent alignment in
>> many of the data structures. That is why I am not proposing such a
>> change right now. But if there is enough interest, we could certainly go
>> ahead and see how things go.
>
> keeping apart the lock size part, for spinlocks, is it that
>  fastpath overhead is less significant in low contention scenarios for
> qlocks?

Fastpath speed is an important consideration for accepting changes to a 
lock, especially if the critical section is short. That is the 
impression I have gotten so far. When the critical section is long, 
however, the speed of the fastpath will be less important.

> Also let me know if you have POC implementation for the spinlocks that
> you can share. I am happy to test that.

I don't have any POC implementation for the spinlocks, as I am aware 
that any increase in spinlock size will make it hard to get merged. I 
could put one together after I finish the current set of patches that I 
am working on.
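
For illustration only, here is a rough user-space sketch (C11 atomics) of
the idea quoted above - folding the lock bit and an MCS-style queue tail
into one pointer-sized word. It is not the missing POC and not kernel
code; the names, layout and memory ordering are simplifications made up
for this example.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct qnode {
	_Atomic(struct qnode *) next;
	atomic_bool		wait;		/* spin while true */
};

/*
 * Bit 0 of the word is the lock bit; the remaining bits hold the address
 * of the queue tail node (qnodes are at least pointer-aligned, so bit 0
 * of their address is always free).
 */
typedef struct { atomic_uintptr_t val; } qlock_t;

#define QLOCKED		((uintptr_t)1)

static bool qlock_trylock(qlock_t *l)
{
	uintptr_t idle = 0;

	/* Fast path: the word is completely idle (no owner, no queue). */
	return atomic_compare_exchange_strong(&l->val, &idle, QLOCKED);
}

static void qlock_lock(qlock_t *l, struct qnode *me)
{
	uintptr_t old, new;
	struct qnode *prev, *next;

	if (qlock_trylock(l))
		return;

	atomic_store(&me->next, NULL);
	atomic_store(&me->wait, true);

	/* Swap ourselves in as the new queue tail, preserving the lock bit. */
	old = atomic_load(&l->val);
	do {
		new = (uintptr_t)me | (old & QLOCKED);
	} while (!atomic_compare_exchange_weak(&l->val, &old, new));

	prev = (struct qnode *)(old & ~QLOCKED);
	if (prev) {
		/* Queue behind the previous waiter and wait for our turn. */
		atomic_store(&prev->next, me);
		while (atomic_load(&me->wait))
			;			/* cpu_relax() in real code */
	}

	/* We are at the head of the queue: wait for the owner to go away. */
	while (atomic_load(&l->val) & QLOCKED)
		;				/* cpu_relax() in real code */

	/* Grab the lock bit; if we are also the tail, empty the queue. */
	old = atomic_load(&l->val);
	do {
		if ((old & ~QLOCKED) == (uintptr_t)me)
			new = QLOCKED;
		else
			new = old | QLOCKED;
	} while (!atomic_compare_exchange_weak(&l->val, &old, new));

	/* Hand the head position to our successor, if we had one. */
	if ((old & ~QLOCKED) != (uintptr_t)me) {
		while (!(next = atomic_load(&me->next)))
			;			/* successor not linked yet */
		atomic_store(&next->wait, false);
	}
}

static void qlock_unlock(qlock_t *l)
{
	atomic_fetch_and(&l->val, ~QLOCKED);
}

The point of the sketch is only that the whole lock stays one
pointer-sized word and the uncontended path is still a single
compare-and-swap, which is what keeps it 4 bytes on 32-bit and 8 bytes
on 64-bit.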

> sorry. different context:
> apart from AIM7 fserver, is there any other benchmark to exercise this
> qrwlock series? (to help in the testing).
>
For the AIM7 test suite, the fserver & new_fserver workloads with ext4 
are the best ones for exercising the qrwlock series, but you do need a 
lot of cores to see the effect. I haven't tried to find other suitable 
benchmark tests yet.

Actually, improving fserver and new_fserver performance is not my 
primary objective. My primary goal is to have a fair rwlock 
implementation that can be used to replace selected spinlocks that are 
under high contention, without losing the fairness attribute of the 
ticket spinlock, just like the replacement of mutex by rwsem.

Regards,
Longman

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-22 10:34                 ` Ingo Molnar
  (?)
@ 2013-07-24  0:03                   ` Waiman Long
  -1 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2013-07-24  0:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Richard Weinberger, Catalin Marinas,
	Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita,
	Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Chandramouleeswaran, Aswin,
	Norton, Scott J, George Spelvin

On 07/22/2013 06:34 AM, Ingo Molnar wrote:
* Waiman Long <waiman.long@hp.com> wrote:
>
>> I had run some performance tests using the fserver and new_fserver
>> benchmarks (on ext4 filesystems) of the AIM7 test suite on a 80-core
>> DL980 with HT on. The following kernels were used:
>>
>> 1. Modified 3.10.1 kernel with mb_cache_spinlock in fs/mbcache.c
>>     replaced by a rwlock
>> 2. Modified 3.10.1 kernel + modified __read_lock_failed code as suggested
>>     by Ingo
>> 3. Modified 3.10.1 kernel + queue read/write lock
>> 4. Modified 3.10.1 kernel + queue read/write lock in classic read/write
>>     lock behavior
>>
>> The last one is with the read lock stealing flag set in the qrwlock
>> structure to give priority to readers and behave more like the classic
>> read/write lock with less fairness.
>>
>> The following table shows the averaged results in the 200-1000
>> user ranges:
>>
>> +-----------------+--------+--------+--------+--------+
>> |  Kernel         |    1   |    2   |    3   |   4    |
>> +-----------------+--------+--------+--------+--------+
>> | fserver JPM     | 245598 | 274457 | 403348 | 411941 |
>> | % change from 1 |   0%   | +11.8% | +64.2% | +67.7% |
>> +-----------------+--------+--------+--------+--------+
>> | new-fserver JPM | 231549 | 269807 | 399093 | 399418 |
>> | % change from 1 |   0%   | +16.5% | +72.4% | +72.5% |
>> +-----------------+--------+--------+--------+--------+
> So it's not just herding that is a problem.
>
> I'm wondering, how sensitive is this particular benchmark to fairness?
> I.e. do the 200-1000 simulated users each perform the same number of ops,
> so that any smearing of execution time via unfairness gets amplified?
>
> I.e. does steady-state throughput go up by 60%+ too with your changes?

For this particular benchmark, there is an interplay of different locks 
that determines the overall performance of the system. Yes, I got a 
steady-state performance gain of 60%+ with the qrwlock change together 
with the modified mbcache.c. Without the modified mbcache.c file, the 
performance gain drops to 20-30%. I am still trying to find out more 
about the performance variations in different situations.

Regards,
Longman

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-19 21:11       ` George Spelvin
@ 2013-07-19 21:35         ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2013-07-19 21:35 UTC (permalink / raw)
  To: George Spelvin; +Cc: JBeulich, linux-arch, linux-kernel, mingo, tglx

On 07/19/2013 05:11 PM, George Spelvin wrote:
>> What I have in mind is to have 2 separate rwlock initializers - one for
>> fair and one for reader-bias behavior. So the lock owners can decide
>> what behavior they want with a one-line change.
> That's definitely a nicer patch, if it will work.  I was imagining that,
> even for a single (type of) lock, only a few uses require reader bias
> (because they might be recursive, or are in an interrupt), but you'd
> want most read_lock sites to be fair.

Yes, fair rwlock will be the default.

> Deciding on a per-lock basis means that a single potentially recursive
> call site keeps you from using fair queueing anywhere on that lock.
>
> I was hoping that the number of necessary unfair calls would
> be small enough that making the read_lock default fair and
> only marking the unfair call sites would be enough.
>
> But I don't really know until doing a survey of the calls.
I think so. The queue read/write lock, if merged, will be an optional 
feature that people can try out to see whether it causes any problem 
with any of the existing rwlocks. So far, I haven't encountered any 
problem in my testing.

BTW, I also tried my version of the rwlock without the waiting queue. In 
the high-contention case, it performs slightly better than the 
__read_lock_failed changes suggested by Ingo, at least for the 
reader-bias variant. It is still not as good as the full version with 
the waiting queue. I should be able to provide more performance data 
next week.

Regards,
Longman

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-19 15:43     ` Waiman Long
@ 2013-07-19 21:11       ` George Spelvin
  2013-07-19 21:35         ` Waiman Long
  0 siblings, 1 reply; 49+ messages in thread
From: George Spelvin @ 2013-07-19 21:11 UTC (permalink / raw)
  To: linux, waiman.long; +Cc: JBeulich, linux-arch, linux-kernel, mingo, tglx

> What I have in mind is to have 2 separate rwlock initializers - one for 
> fair and one for reader-bias behavior. So the lock owners can decide 
> what behavior they want with a one-line change.

That's definitely a nicer patch, if it will work.  I was imagining that,
even for a single (type of) lock, only a few uses require reader bias
(because they might be recursive, or are in an interrupt), but you'd
want most read_lock sites to be fair.

Deciding on a per-lock basis means that a single potentially recursive
call site keeps you from using fair queueing anywhere on that lock.

I was hoping that the number of necessary unfair calls would
be small enough that making the read_lock default fair and
only marking the unfair call sites would be enough.

But I don't really know until doing a survey of the calls.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-18 18:46   ` George Spelvin
@ 2013-07-19 15:43     ` Waiman Long
  2013-07-19 21:11       ` George Spelvin
  0 siblings, 1 reply; 49+ messages in thread
From: Waiman Long @ 2013-07-19 15:43 UTC (permalink / raw)
  To: George Spelvin; +Cc: JBeulich, linux-arch, linux-kernel, mingo, tglx

On 07/18/2013 02:46 PM, George Spelvin wrote:
>> Thanks for the revision; I will make such a change in the next version of
>> my patch.
> I'm relying on you to correct any technical errors in my description.
> I just meant "something more like this", not impose that exact wording.
>
>> As I said in my response to Ingo, that change will make the lock more
>> unfair to the writers. However, I can run some tests to find out the
>> performance impact of doing it that way on the benchmarks that I used.
> You can do a lot with the basic structure if you're willing to
> go to XADD in the _lock_failed handlers.
>
> Adding writer priority is possible, but at least some call sites have
> to be reader-priority because they're recursive or are in interrupts,
> and readers don't disable interrupts.
>
> The basic solution is for writers to first detect if they're the *first*
>> writer.  If a write lock fails, look and see if it's > -RW_LOCK_BIAS.
>
>> If not, drop it and spin reading until it's >= 0, then try again.
> (Possibly with XADD this time to combine the acquire and add.)
>
> But when we *are* the first writer, subtract an additional -RW_LOCK_BIAS/2.
>> This tells pending readers "writer waiting, back off!".
>
>
> What readers do when they see that bit set in rwlock->lock-1 depends on
> whether they're fair or unfair:
>
>> - Fair readers re-increment the lock and spin waiting for it to become >
>    -RW_LOCK_BIAS/2, at which point they try to reacquire.
> - Unfair readers say "ah, that means I can go ahead, thank you!".
>
>
> A waiting writer waits for the lock to equal -RW_LOCK_BIAS/2, meaning
> all readers have left.  Then it adds -RW_LOCK_BIAS/2, meaning "write
> lock taken" and goes ahead.
>
> At this point, there is a thundering horde while the held-off readers
> get back in line.  But hopefully it overlaps with the writer's lock tenure.
>
> When the writer drops its lock, the waiting readers go ahead, and the
> spinning writers advance in a thundering horde to contend for first place.
> (Again, the latency for this is hopefully covered by the readers'
> lock tenure.)

Thanks for the suggestion. What you have proposed is somewhat similar 
to what my new code is doing with readers/writers spinning on the 
cacheline without the waiting queue. I need to run some tests to see if 
it can really help.

> I count 531 calls to read_lock in the kernel (10 of them are in
> lib/locking-selftest.c, called RL).  I don't have any idea how difficult
> it would be to divide them into read_lock_fair and read_lock_unfair.

What I have in mind is to have 2 separate rwlock initializers - one for 
fair and one for reader-bias behavior. So the lock owners can decide 
what behavior they want with a one-line change.
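
A rough sketch of what that could look like, purely for illustration -
the names below are invented for this example and are not the ones used
in the posted patch:

/*
 * Illustrative only: the fair/reader-bias decision lives in the lock
 * definition, so converting a call site really is a one-line change.
 */
struct qrwlock_ex {
	/* ... wait-queue and reader/writer state as in the patch ... */
	int	rsteal;		/* non-zero: readers may steal (reader bias) */
};

#define __QRWLOCK_FAIR_INIT		{ .rsteal = 0 }
#define __QRWLOCK_READER_BIAS_INIT	{ .rsteal = 1 }

#define DEFINE_QRWLOCK_FAIR(name)	\
	struct qrwlock_ex name = __QRWLOCK_FAIR_INIT
#define DEFINE_QRWLOCK_READER_BIAS(name)	\
	struct qrwlock_ex name = __QRWLOCK_READER_BIAS_INIT

/* A lock that may be taken recursively for read keeps the reader bias: */
static DEFINE_QRWLOCK_READER_BIAS(example_recursive_lock);

/* Everything else can opt into fair queuing with the same one-line change: */
static DEFINE_QRWLOCK_FAIR(example_ordinary_lock);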

Regards,
Longman

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-18 13:43 ` Waiman Long
@ 2013-07-18 18:46   ` George Spelvin
  2013-07-19 15:43     ` Waiman Long
  0 siblings, 1 reply; 49+ messages in thread
From: George Spelvin @ 2013-07-18 18:46 UTC (permalink / raw)
  To: linux, waiman.long; +Cc: JBeulich, linux-arch, linux-kernel, mingo, tglx

> Thanks for the revision; I will make such a change in the next version of
> my patch.

I'm relying on you to correct any technical errors in my description.
I just meant "something more like this", not impose that exact wording.

> As I said in my response to Ingo, that change will make the lock more 
> unfair to the writers. However, I can run some tests to find out the 
> performance impact of doing it that way on the benchmarks that I used.

You can do a lot with the basic structure if you're willing to
go to XADD in the _lock_failed handlers.

Adding writer priority is possible, but at least some call sites have
to be reader-priority because they're recursive or are in interrupts,
and readers don't disable interrupts.

The basic solution is for writers to first detect if they're the *first*
writer.  If a write lock fails, look and see if it's > -RW_LOCK_BIAS.

If not, drop it and spin reading until it's >= 0, then try again.
(Possibly with XADD this time to combine the acquire and add.)

But when we *are* the first writer, subtract an additional -RW_LOCK_BIAS/2.
This tells pending readers "writer waiting, back off!".


What readers do when they see that bit set in rwlock->lock-1 depends on
whether they're fair or unfair:

- Fair readers re-increment the lock and spin waiting for it to become >
  -RW_LOCK_BIAS/2, at which point they try to reacquire.
- Unfair readers say "ah, that means I can go ahead, thank you!".


A waiting writer waits for the lock to equal -RW_LOCK_BIAS/2, meaning
all readers have left.  Then it adds -RW_LOCK_BIAS/2, meaning "write
lock taken" and goes ahead.

At this point, there is a thundering horde while the held-off readers
get back in line.  But hopefully it overlaps with the writer's lock tenure.

When the writer drops its lock, the waiting readers go ahead, and the
spinning writers advance in a thundering horde to contend for first place.
(Again, the latency for this is hopefully covered by the readers'
lock tenure.)
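
A self-consistent sketch of this kind of scheme in C11 atomics follows;
the bit layout and constants are chosen for clarity of the example and
are deliberately simpler than the RW_LOCK_BIAS arithmetic above, so the
details differ:

#include <stdatomic.h>

#define WRITER_PENDING	0x40000000u	/* "writer waiting, back off" */
#define WRITER_LOCKED	0x80000000u	/* write lock actually held */

typedef struct { _Atomic unsigned int val; } xrwlock_t;
/* low bits: number of active readers */

static void x_write_lock(xrwlock_t *l)
{
	unsigned int v;

	/* Become the (single) pending writer. */
	for (;;) {
		v = atomic_load(&l->val);
		if (!(v & (WRITER_PENDING | WRITER_LOCKED)) &&
		    atomic_compare_exchange_weak(&l->val, &v,
						 v | WRITER_PENDING))
			break;		/* we are the first writer */
	}
	/* Wait for the active readers to drain, then take the lock. */
	for (;;) {
		v = WRITER_PENDING;	/* pending, zero readers */
		if (atomic_compare_exchange_weak(&l->val, &v, WRITER_LOCKED))
			return;
	}
}

static void x_write_unlock(xrwlock_t *l)
{
	/* No readers can enter while WRITER_LOCKED is set, so val is exact. */
	atomic_store(&l->val, 0);
}

/* Fair readers defer to a pending writer as well as an active one. */
static void x_read_lock_fair(xrwlock_t *l)
{
	unsigned int v;

	for (;;) {
		v = atomic_load(&l->val);
		if (!(v & (WRITER_PENDING | WRITER_LOCKED)) &&
		    atomic_compare_exchange_weak(&l->val, &v, v + 1))
			return;
	}
}

/*
 * Unfair readers only wait for an *active* writer; a recursive or
 * interrupt-context reader must use this variant, or it could deadlock
 * behind a writer that is waiting for it.
 */
static void x_read_lock_unfair(xrwlock_t *l)
{
	unsigned int v;

	for (;;) {
		v = atomic_load(&l->val);
		if (!(v & WRITER_LOCKED) &&
		    atomic_compare_exchange_weak(&l->val, &v, v + 1))
			return;
	}
}

static void x_read_unlock(xrwlock_t *l)
{
	atomic_fetch_sub(&l->val, 1u);
}

Writers that lose the race to become the pending writer just keep
spinning in the first loop, which is the "thundering horde" of writers
described above; queueing them is the obvious next refinement.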


I count 531 calls to read_lock in the kernel (10 of them are in
lib/locking-selftest.c, called RL).  I don't have any idea how difficult
it would be to divide them into read_lock_fair and read_lock_unfair.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
  2013-07-18 12:55 [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation George Spelvin
@ 2013-07-18 13:43 ` Waiman Long
  2013-07-18 18:46   ` George Spelvin
  0 siblings, 1 reply; 49+ messages in thread
From: Waiman Long @ 2013-07-18 13:43 UTC (permalink / raw)
  To: George Spelvin; +Cc: JBeulich, linix-kernel, linux-arch, mingo, tglx

On 07/18/2013 08:55 AM, George Spelvin wrote:
> In the interest of more useful Kconfig help, could I recommend the
> following text:
>
> config QUEUE_RWLOCK
> 	bool "Generic queue read/write lock"
> 	depends on ARCH_QUEUE_RWLOCK
> 	help
> 	  Use an alternative reader-writer lock (rwlock) implementation,
> 	  optimized for larger NUMA systems.  These locks use more memory,
> 	  but perform better under high contention.  (Specifically, readers
> 	  waiting for a writer to release the lock will be queued rather
> 	  than all spinning on the same cache line.)
>
> 	  The kernel will operate correctly either way; this only
> 	  affects performance.
>
> 	  For common desktop and server systems with only one
> 	  or two CPU sockets, the performance benefits are not worth
> 	  the additional memory; say N here.
>
> My goal is to give someone stumbling across this question for the first
> time in "make oldconfig" the information htey need to answer it.

Thanks for the revision; I will make such a change in the next version of 
my patch.

> That said, I think Ingo's idea for simplifying the waiting reader side
> is excellent and should be tried before bifurcating the implementation.
>
> Looking at the lock system, it *seems* like that patch to __read_lock_failed
> is literally the *only* thing that needs changing, since the write
> lock/unlock is all done with relative add/sub anyway.  But I keep thinking
> "there must have been a reason why it wasn't done that way in the first
> place".

As I said in my response to Ingo, that change will make the lock more 
unfair to the writers. However, I can run some tests to find out the 
performance impact of doing it that way on the benchmarks that I used.

Regards,
Longman

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
@ 2013-07-18 13:18 George Spelvin
  0 siblings, 0 replies; 49+ messages in thread
From: George Spelvin @ 2013-07-18 13:18 UTC (permalink / raw)
  To: linux; +Cc: linux-kernel

In the interest of more useful Kconfig help, could I recommend the
following text:

config QUEUE_RWLOCK
	bool "Generic queue read/write lock"
	depends on ARCH_QUEUE_RWLOCK
	help
	  Use an alternative reader-writer lock (rwlock) implementation,
	  optimized for larger NUMA systems.  These locks use more memory,
	  but perform better under high contention.  (Specifically, readers
	  waiting for a writer to release the lock will be queued rather
	  than all spinning on the same cache line.)

	  The kernel will operate correctly either way; this only
	  affects performance.

	  For common desktop and server systems with only one
	  or two CPU sockets, the performance benefits are not worth
	  the additional memory; say N here.

My goal is to give someone stumbling across this question for the first
time in "make oldconfig" the information htey need to answer it.


That said, I think Ingo's idea for simplifying the waiting reader side
is excellent and should be tried before bifurcating the implementation.

Looking at the lock system, it *seems* like that patch to __read_lock_failed
is literally the *only* thing that needs changing, since the write
lock/unlock is all done with relative add/sub anyway.  But I keep thinking
"there must have been a reason why it wasn't done that way in the first
place".

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation
@ 2013-07-18 12:55 George Spelvin
  2013-07-18 13:43 ` Waiman Long
  0 siblings, 1 reply; 49+ messages in thread
From: George Spelvin @ 2013-07-18 12:55 UTC (permalink / raw)
  To: Waiman.Long; +Cc: JBeulich, linix-kernel, linux, linux-arch, mingo, tglx

In the interest of more useful Kconfig help, could I recommend the
following text:

config QUEUE_RWLOCK
	bool "Generic queue read/write lock"
	depends on ARCH_QUEUE_RWLOCK
	help
	  Use an alternative reader-writer lock (rwlock) implementation,
	  optimized for larger NUMA systems.  These locks use more memory,
	  but perform better under high contention.  (Specifically, readers
	  waiting for a writer to release the lock will be queued rather
	  than all spinning on the same cache line.)

	  The kernel will operate correctly either way; this only
	  affects performance.

	  For common desktop and server systems with only one
	  or two CPU sockets, the performance benefits are not worth
	  the additional memory; say N here.

My goal is to give someone stumbling across this question for the first
time in "make oldconfig" the information htey need to answer it.


That said, I think Ingo's idea for simplifying the waiting reader side
is excellent and should be tried before bifurcating the implementation.

Looking at the lock system, it *seems* like that patch to __read_lock_failed
is literally the *only* thing that needs changing, since the write
lock/unlock is all done with relative add/sub anyway.  But I keep thinking
"there must have been a reason why it wasn't done that way in the first
place".

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2013-07-24  0:03 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-13  1:34 [PATCH RFC 0/2] qrwlock: Introducing a queue read/write lock implementation Waiman Long
2013-07-13  1:34 ` Waiman Long
2013-07-13  1:34 ` [PATCH RFC 1/2] qrwlock: A " Waiman Long
2013-07-13  1:34   ` Waiman Long
2013-07-15 14:39   ` Steven Rostedt
2013-07-15 14:39     ` Steven Rostedt
2013-07-15 20:44     ` Waiman Long
2013-07-15 20:44       ` Waiman Long
2013-07-15 22:31   ` Thomas Gleixner
2013-07-15 22:31     ` Thomas Gleixner
2013-07-16  1:19     ` Waiman Long
2013-07-16  1:19       ` Waiman Long
2013-07-18  7:42       ` Ingo Molnar
2013-07-18  7:42         ` Ingo Molnar
2013-07-18  7:42         ` Ingo Molnar
2013-07-18 13:40         ` Waiman Long
2013-07-18 13:40           ` Waiman Long
2013-07-18 13:40           ` Waiman Long
2013-07-19  8:40           ` Ingo Molnar
2013-07-19  8:40             ` Ingo Molnar
2013-07-19  8:40             ` Ingo Molnar
2013-07-19 15:30             ` Waiman Long
2013-07-19 15:30               ` Waiman Long
2013-07-19 15:30               ` Waiman Long
2013-07-22 10:34               ` Ingo Molnar
2013-07-22 10:34                 ` Ingo Molnar
2013-07-22 10:34                 ` Ingo Molnar
2013-07-24  0:03                 ` Waiman Long
2013-07-24  0:03                   ` Waiman Long
2013-07-24  0:03                   ` Waiman Long
2013-07-18 10:22       ` Thomas Gleixner
2013-07-18 10:22         ` Thomas Gleixner
2013-07-18 14:19         ` Waiman Long
2013-07-18 14:19           ` Waiman Long
2013-07-21  5:42           ` Raghavendra K T
2013-07-21  5:42             ` Raghavendra K T
2013-07-21  5:42             ` Raghavendra K T
2013-07-23 23:54             ` Waiman Long
2013-07-23 23:54               ` Waiman Long
2013-07-23 23:54               ` Waiman Long
2013-07-13  1:34 ` [PATCH RFC 2/2] x86 qrwlock: Enable x86 to use queue read/write lock Waiman Long
2013-07-13  1:34   ` Waiman Long
2013-07-18 12:55 [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation George Spelvin
2013-07-18 13:43 ` Waiman Long
2013-07-18 18:46   ` George Spelvin
2013-07-19 15:43     ` Waiman Long
2013-07-19 21:11       ` George Spelvin
2013-07-19 21:35         ` Waiman Long
2013-07-18 13:18 George Spelvin
