* [PATCH] x86_64 : support atomic ops with 64 bits integer values
@ 2008-08-16  7:39 Mathieu Desnoyers
  2008-08-16 15:04 ` H. Peter Anvin
  0 siblings, 1 reply; 27+ messages in thread
From: Mathieu Desnoyers @ 2008-08-16  7:39 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jeremy Fitzhardinge, Andrew Morton, Ingo Molnar, Linus Torvalds,
	Joe Perches, linux-kernel

The x86_64 add/sub atomic ops do not seem to accept integer values bigger
than 32 bits as immediates. Intel's add/sub documentation specifies they
have to be passed in registers.

http://gcc.gnu.org/onlinedocs/gcc-4.3.0/gcc/Machine-Constraints.html#Machine-Constraints

states :

e
    32-bit signed integer constant, or a symbolic reference known to fit that range (for immediate operands in sign-extending x86-64 instructions).
Z
    32-bit unsigned integer constant, or a symbolic reference known to fit that range (for immediate operands in zero-extending x86-64 instructions). 

Since add/sub does sign extension, using the "e" constraint seems appropriate.

It applies to 2.6.27-rc, 2.6.26, 2.6.25...
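
For illustration, a minimal (hypothetical) caller that trips the old "ir"
constraint : with a constant wider than 32 bits, gcc may pick the immediate
alternative and emit an addq that the assembler rejects, whereas "er" forces
the constant through a register.

	/* hypothetical example, not part of this patch */
	static atomic64_t big_counter = ATOMIC64_INIT(0);

	static void add_big_constant(void)
	{
		/* constant does not fit in a sign-extended 32-bit immediate */
		atomic64_add(0x100000000L, &big_counter);
	}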

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Joe Perches <joe@perches.com>
---
 include/asm-x86/atomic_64.h |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/include/asm-x86/atomic_64.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/atomic_64.h	2008-08-16 03:19:35.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86/atomic_64.h	2008-08-16 03:21:18.000000000 -0400
@@ -228,7 +228,7 @@ static inline void atomic64_add(long i, 
 {
 	asm volatile(LOCK_PREFIX "addq %1,%0"
 		     : "=m" (v->counter)
-		     : "ir" (i), "m" (v->counter));
+		     : "er" (i), "m" (v->counter));
 }
 
 /**
@@ -242,7 +242,7 @@ static inline void atomic64_sub(long i, 
 {
 	asm volatile(LOCK_PREFIX "subq %1,%0"
 		     : "=m" (v->counter)
-		     : "ir" (i), "m" (v->counter));
+		     : "er" (i), "m" (v->counter));
 }
 
 /**
@@ -260,7 +260,7 @@ static inline int atomic64_sub_and_test(
 
 	asm volatile(LOCK_PREFIX "subq %2,%0; sete %1"
 		     : "=m" (v->counter), "=qm" (c)
-		     : "ir" (i), "m" (v->counter) : "memory");
+		     : "er" (i), "m" (v->counter) : "memory");
 	return c;
 }
 
@@ -341,7 +341,7 @@ static inline int atomic64_add_negative(
 
 	asm volatile(LOCK_PREFIX "addq %2,%0; sets %1"
 		     : "=m" (v->counter), "=qm" (c)
-		     : "ir" (i), "m" (v->counter) : "memory");
+		     : "er" (i), "m" (v->counter) : "memory");
 	return c;
 }
 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [PATCH] x86_64 : support atomic ops with 64 bits integer values
  2008-08-16  7:39 [PATCH] x86_64 : support atomic ops with 64 bits integer values Mathieu Desnoyers
@ 2008-08-16 15:04 ` H. Peter Anvin
  2008-08-16 15:43   ` Mathieu Desnoyers
  0 siblings, 1 reply; 27+ messages in thread
From: H. Peter Anvin @ 2008-08-16 15:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jeremy Fitzhardinge, Andrew Morton, Ingo Molnar, Linus Torvalds,
	Joe Perches, linux-kernel

Mathieu Desnoyers wrote:
> x86_64 add/sub atomic ops does not seems to accept integer values bigger
> than 32 bits as immediates. Intel's add/sub documentation specifies they
> have to be passed as registers.

This is correct; this is in fact true for all instructions except "mov".

Whether it's sign- or zero-extending is sometimes subtle, but not in 
these cases.

Do you happen to know if this is a manifest bug in the current kernel 
(i.e. if there is anywhere we're using more than ±2 GB as a constant to 
these functions?)

Either way, I'll queue this up to tip:x86/urgent if Ingo hasn't already 
since this is a pure bug fix.

	-hpa


* Re: [PATCH] x86_64 : support atomic ops with 64 bits integer values
  2008-08-16 15:04 ` H. Peter Anvin
@ 2008-08-16 15:43   ` Mathieu Desnoyers
  2008-08-16 17:30     ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Mathieu Desnoyers @ 2008-08-16 15:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jeremy Fitzhardinge, Andrew Morton, Ingo Molnar, Linus Torvalds,
	Joe Perches, linux-kernel

* H. Peter Anvin (hpa@zytor.com) wrote:
> Mathieu Desnoyers wrote:
>> x86_64 add/sub atomic ops does not seems to accept integer values bigger
>> than 32 bits as immediates. Intel's add/sub documentation specifies they
>> have to be passed as registers.
>
> This is correct; this is in fact true for all instructions except "mov".
>
> Whether it's sign- or zero-extending is sometimes subtle, but not in these 
> cases.
>
> Do you happen to know if this is a manifest bug in the current kernel (i.e. 
> if there is anywhere we're using more than ±2 GB as a constant to these 
> functions?)
>

No, I did not hit this in current kernel code, and the effect is quite
easy to detect : the assembler spits out an error.

I have hit this problem when trying to implement a better rwlock design
than is currently in the mainline kernel (I know the RT kernel has a
hard time with rwlocks), and had to play with add/sub of large values.
The idea is to bring down the interrupt latency caused by rwlocks shared
between fast read-side interrupt handlers and slow thread context
read-sides (tasklist_lock is the perfect example). In that case, the
worst-case interrupt latency is caused by the irq-disabled writer lock
when contended by the slow readers. I will probably post an RFC about
this in the near future.

Mathieu

> Either way, I'll queue this up to tip:x86/urgent if Ingo hasn't already 
> since this is a pure bug fix.
>
> 	-hpa

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [PATCH] x86_64 : support atomic ops with 64 bits integer values
  2008-08-16 15:43   ` Mathieu Desnoyers
@ 2008-08-16 17:30     ` Linus Torvalds
  2008-08-16 21:19       ` [RFC PATCH] Fair rwlock Mathieu Desnoyers
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2008-08-16 17:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, Andrew Morton, Ingo Molnar,
	Joe Perches, linux-kernel



On Sat, 16 Aug 2008, Mathieu Desnoyers wrote:
> 
> I have hit this problem when tying to implement a better rwlock design
> than is currently in the mainline kernel (I know the RT kernel has a
> hard time with rwlocks)

Have you looked at my sleeping rwlock trial thing?

It's very different from a spinning one, but I think the fast path should 
be identical, and that's the one I tried to make fairly optimal.

See 

	http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git;a=summary

for a git tree. The sleeping version has two extra words for the sleep 
events, but those would be irrelevant for the spinning version.

The fastpath is

	movl $4,%eax
	lock ; xaddl %eax,(%rdi)
	testl $3,%eax
	jne __my_rwlock_rdlock

for the read-lock (the two low bits are contention bits, so you can make 
contention have any behaviour you want - including fairish, prefer-reads, 
or prefer-writes).

The write fastpath is

	xorl %eax,%eax
	movl $1,%edx
	lock ; cmpxchgl %edx,(%rdi)
	jne __my_rwlock_wrlock

and the "unlock" case is actually unnecessarily complex in my 
implementation, because it needs to

 - wake things up in case of a conflict (not true of a spinning version, 
   of course)
 - it's pthreads-compatible, so the same function needs to handle both a 
   read-unlock and a write-unlock.

but a spinning version should be much simpler.

Anyway, I haven't tried turning it into a spinning version, but it was 
very much designed to

 - work with both 32-bit and 64-bit x86 by making the fastpath only do 
   32-bit locked accesses
 - have any number of pending readers/writers (which is not a big deal for 
   a spinning one, but at least there are no CPU count overflows).
 - and because it is designed for sleeping, I'm pretty sure that you can 
   easily drop interrupts in the contention path, to make 
   write_lock_irq[save]() be reasonable.

In particular, the third bullet is the important one: because it's 
designed to have a "contention" path that has _extra_ information for the 
contended case, you could literally make the extra information have things 
like a list of pending writers, so that you can drop interrupts on one 
CPU, while adding information to let the reader side know that if the 
read-lock happens on that CPU, it needs to be able to continue in order to 
not deadlock.
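
As a rough sketch (names made up, this is not code from the rwlock.git tree),
the extra information could be as simple as:

	/* sketch only: fastpath word plus per-lock contention info */
	struct my_rwlock {
		unsigned int val;	/* xadd/cmpxchg fastpath word */
		int pending_writer_cpu;	/* -1 when no writer is waiting */
	};

	/* contended read-lock in irq context: let it through when the
	 * waiting writer is spinning on this very CPU, so it cannot
	 * deadlock against it */
	static int reader_may_bypass_contention(struct my_rwlock *lock)
	{
		return lock->pending_writer_cpu == smp_processor_id();
	}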

		Linus


* [RFC PATCH] Fair rwlock
  2008-08-16 17:30     ` Linus Torvalds
@ 2008-08-16 21:19       ` Mathieu Desnoyers
  2008-08-16 21:33         ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Mathieu Desnoyers @ 2008-08-16 21:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, Andrew Morton, Ingo Molnar,
	Joe Perches, linux-kernel

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Sat, 16 Aug 2008, Mathieu Desnoyers wrote:
> > 
> > I have hit this problem when tying to implement a better rwlock design
> > than is currently in the mainline kernel (I know the RT kernel has a
> > hard time with rwlocks)
> 
> Have you looked at my sleping rwlock trial thing?
> 
[...]
>  - and because it is designed for sleeping, I'm pretty sure that you can 
>    easily drop interrupts in the contention path, to make 
>    write_lock_irq[save]() be reasonable.
> 
> In particular, the third bullet is the important one: because it's 
> designed to have a "contention" path that has _extra_ information for the 
> contended case, you could literally make the extra information have things 
> like a list of pending writers, so that you can drop interrupts on one 
> CPU, while you adding information to let the reader side know that if the 
> read-lock happens on that CPU, it needs to be able to continue in order to 
> not deadlock.
> 
> 		Linus

No, I just had a look at it, thanks for the pointer!

Tweakable contention behavior seems interesting, but I don't think it
deals with the fact that on a mainline kernel, when an interrupt handler
comes in and asks for a read lock, it has to get it on the spot. (RT
kernels can get away with that using threaded interrupts, but that's a
completely different scheme.) Therefore, the impact is that interrupts
must be disabled around write lock usage, and we end up in a situation
where this interrupt-disable section can last for a long time, given
that it waits for every reader (including ones which do not disable
interrupts or softirqs) to complete.

Actually, I just used LTTng traces and eventually made a small patch to
lockdep to detect whenever a spinlock or a rwlock is used both with
interrupts enabled and disabled. Those sites are likely to produce very
high latencies and should IMHO be considered bogus. The basic bogus
scenario is to have a spinlock held on CPU A with interrupts enabled
getting interrupted, after which a softirq runs. On CPU B, the same lock
is acquired with interrupts off. We therefore disable interrupts on CPU B
for the duration of the softirq currently running on CPU A, which is
really not something that helps keep latencies short. My preliminary
results show that there are a lot of inconsistent spinlock/rwlock irq
on/off uses in the kernel.

This kind of scenario is pretty easy to fix for spinlocks (either move
the interrupt disable within the spinlock section if the spinlock is
never used by an interrupt handler, or make sure that every user has
interrupts disabled).

The problem comes with rwlocks : it is correct to have readers both with
and without irqs disabled, even when interrupt handlers use the read lock.
However, the write lock has to disable interrupts in that case, and we
suffer from the high latency I pointed out. The tasklist_lock is the
perfect example of this. In the following patch, I try to address this
issue.
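
For illustration, the usage pattern that causes the problem looks like this
(mainline rwlock API, tasklist_lock as the example, sketch only) :

	/* assumes <linux/sched.h> for tasklist_lock */

	/* slow reader in thread context, irqs stay enabled */
	static void slow_thread_reader(void)
	{
		read_lock(&tasklist_lock);
		/* long task list iteration */
		read_unlock(&tasklist_lock);
	}

	/* reader in an interrupt handler : must get the lock on the spot */
	static void irq_handler_reader(void)
	{
		read_lock(&tasklist_lock);
		read_unlock(&tasklist_lock);
	}

	/* rare writer : must disable irqs, and may keep them disabled for as
	 * long as the slowest thread context reader holds the lock */
	static void rare_writer(void)
	{
		write_lock_irq(&tasklist_lock);
		write_unlock_irq(&tasklist_lock);
	}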

The core idea is this :

This "fair" rwlock locks out the threads first, then disables softirqs,
then locks out the softirqs, then disables irqs and locks out irqs. Only
then is the writer allowed to modify the data structure.

A spinlock is used to ensure writer mutual exclusion.

The test module is available at :

http://ltt.polymtl.ca/svn/trunk/tests/kernel/test-fair-rwlock.c

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
---
 include/linux/fair-rwlock.h |   64 +++++++++++
 lib/Makefile                |    2 
 lib/fair-rwlock.c           |  244 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 310 insertions(+)

Index: linux-2.6-lttng/include/linux/fair-rwlock.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/fair-rwlock.h	2008-08-16 03:29:12.000000000 -0400
@@ -0,0 +1,64 @@
+/*
+ * Fair rwlock
+ *
+ * Allows more fairness than the standard rwlock for this kind of common
+ * scenario (tasklist_lock) :
+ *
+ * - rwlock shared between
+ *   - Rare update in thread context
+ *   - Frequent slow read in thread context (task list iteration)
+ *   - Fast interrupt handler read
+ *
+ * The slow write must therefore disable interrupts around the write lock,
+ * but will therefore add to the global interrupt latency; worst case being
+ * the duration of the slow read.
+ *
+ * This "fair" rwlock locks out the threads first, then disables softirqs,
+ * then locks out the softirqs, then disables irqs and locks out irqs. Only
+ * then is the writer allowed to modify the data structure.
+ *
+ * A spinlock is used to ensure writer mutual exclusion.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ * August 2008
+ */
+
+#include <linux/spinlock.h>
+#include <asm/atomic.h>
+
+struct fair_rwlock {
+	atomic_long_t value;
+	spinlock_t wlock;
+};
+
+/* Reader lock */
+
+/*
+ * many readers, from irq/softirq/thread context.
+ * protects against writers.
+ */
+void fair_read_lock(struct fair_rwlock *rwlock);
+void fair_read_unlock(struct fair_rwlock *rwlock);
+
+/* Writer Lock */
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against irq/softirq/thread readers.
+ */
+void fair_write_lock_irq(struct fair_rwlock *rwlock);
+void fair_write_unlock_irq(struct fair_rwlock *rwlock);
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against softirq/thread readers.
+ */
+void fair_write_lock_bh(struct fair_rwlock *rwlock);
+void fair_write_unlock_bh(struct fair_rwlock *rwlock);
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against thread readers.
+ */
+void fair_write_lock(struct fair_rwlock *rwlock);
+void fair_write_unlock(struct fair_rwlock *rwlock);
Index: linux-2.6-lttng/lib/Makefile
===================================================================
--- linux-2.6-lttng.orig/lib/Makefile	2008-08-16 03:22:14.000000000 -0400
+++ linux-2.6-lttng/lib/Makefile	2008-08-16 03:29:12.000000000 -0400
@@ -43,6 +43,8 @@ obj-$(CONFIG_DEBUG_PREEMPT) += smp_proce
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
 obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o
 
+obj-y += fair-rwlock.o
+
 ifneq ($(CONFIG_HAVE_DEC_LOCK),y)
   lib-y += dec_and_lock.o
 endif
Index: linux-2.6-lttng/lib/fair-rwlock.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/lib/fair-rwlock.c	2008-08-16 16:48:26.000000000 -0400
@@ -0,0 +1,244 @@
+/*
+ * Fair rwlock
+ *
+ * Allows more fairness than the standard rwlock for this kind of common
+ * scenario (tasklist_lock) :
+ *
+ * - rwlock shared between
+ *   - Rare update in thread context
+ *   - Frequent slow read in thread context (task list iteration)
+ *   - Fast interrupt handler read
+ *
+ * The slow write must therefore disable interrupts around the write lock,
+ * but will therefore add to the global interrupt latency; worst case being
+ * the duration of the slow read.
+ *
+ * This "fair" rwlock locks out the threads first, then disables softirqs,
+ * then locks out the softirqs, then disables irqs and locks out irqs. Only
+ * then is the writer allowed to modify the data structure.
+ *
+ * A spinlock is used to ensure writer mutual exclusion.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ * August 2008
+ *
+ * Distributed under GPLv2
+ *
+ * concurrency between :
+ * - thread
+ * - softirq handlers
+ * - hardirq handlers
+ *
+ * System-wide max
+ *
+ * - bits 0 .. log_2(NR_CPUS)-1 are the thread count
+ *   (max number of threads : NR_CPUS)
+ * - bit 3*log_2(NR_CPUS) is the writer lock against threads
+ *
+ * - bits log_2(NR_CPUS) .. 2*log_2(NR_CPUS)-1 are the softirq count
+ *   (max # of softirqs: NR_CPUS)
+ * - bit 3*log_2(NR_CPUS)+1 is the writer lock against softirqs
+ *
+ * - bits 2*log_2(NR_CPUS) .. 3*log_2(NR_CPUS)-1 are the hardirq count
+ *   (max # of hardirqs: NR_CPUS)
+ * - bit 3*log_2(NR_CPUS)+2 is the writer lock against hardirqs
+ *
+ * e.g. : NR_CPUS = 256
+ *
+ * THREAD_RMASK:  0x000000ff
+ * SOFTIRQ_RMASK: 0x0000ff00
+ * HARDIRQ_RMASK: 0x00ff0000
+ * THREAD_WMASK:  0x01000000
+ * SOFTIRQ_WMASK: 0x02000000
+ * HARDIRQ_WMASK: 0x04000000
+ */
+
+#include <linux/fair-rwlock.h>
+#include <linux/hardirq.h>
+#include <linux/module.h>
+
+#if (NR_CPUS > 512 && (BITS_PER_LONG == 32 || NR_CPUS > 1048576))
+#error "fair rwlock needs more bits per long to deal with that many CPUs"
+#endif
+
+#define THREAD_ROFFSET	1UL
+#define THREAD_RMASK	((NR_CPUS - 1) * THREAD_ROFFSET)
+#define SOFTIRQ_ROFFSET	(THREAD_RMASK + 1)
+#define SOFTIRQ_RMASK	((NR_CPUS - 1) * SOFTIRQ_ROFFSET)
+#define HARDIRQ_ROFFSET	((SOFTIRQ_RMASK | THREAD_RMASK) + 1)
+#define HARDIRQ_RMASK	((NR_CPUS - 1) * HARDIRQ_ROFFSET)
+
+#define THREAD_WMASK	((HARDIRQ_RMASK | SOFTIRQ_RMASK | THREAD_RMASK) + 1)
+#define SOFTIRQ_WMASK	(THREAD_WMASK << 1)
+#define HARDIRQ_WMASK	(SOFTIRQ_WMASK << 1)
+
+#ifdef FAIR_RWLOCK_DEBUG
+#define printk_dbg printk
+#else
+#define printk_dbg(fmt, args...)
+#endif
+
+/* Reader lock */
+
+static void _fair_read_lock_ctx(struct fair_rwlock *rwlock,
+		long roffset, long wmask)
+{
+	long value;
+
+	atomic_long_add(roffset, &rwlock->value);
+	do {
+		value = atomic_long_read(&rwlock->value);
+		/* Order rwlock->value read wrt following reads */
+		smp_rmb();
+	} while (value & wmask);
+	printk_dbg("lib reader got in with value %lX, wmask %lX\n",
+		value, wmask);
+}
+
+/*
+ * many readers, from irq/softirq/thread context.
+ * protects against writers.
+ */
+void fair_read_lock(struct fair_rwlock *rwlock)
+{
+	if (in_irq())
+		_fair_read_lock_ctx(rwlock, HARDIRQ_ROFFSET, HARDIRQ_WMASK);
+	else if (in_softirq())
+		_fair_read_lock_ctx(rwlock, SOFTIRQ_ROFFSET, SOFTIRQ_WMASK);
+	else {
+		preempt_disable();
+		_fair_read_lock_ctx(rwlock, THREAD_ROFFSET, THREAD_WMASK);
+	}
+}
+EXPORT_SYMBOL_GPL(fair_read_lock);
+
+void fair_read_unlock(struct fair_rwlock *rwlock)
+{
+	/* atomic_long_sub orders reads */
+	if (in_irq())
+		atomic_long_sub(HARDIRQ_ROFFSET, &rwlock->value);
+	else if (in_softirq())
+		atomic_long_sub(SOFTIRQ_ROFFSET, &rwlock->value);
+	else {
+		atomic_long_sub(THREAD_ROFFSET, &rwlock->value);
+		preempt_enable();
+	}
+}
+EXPORT_SYMBOL_GPL(fair_read_unlock);
+
+/* Writer lock */
+
+/* Lock out a specific execution context from the lock */
+static void _fair_write_lock_ctx(struct fair_rwlock *rwlock,
+		long rmask, long wmask)
+{
+	long value;
+
+	for (;;) {
+		value = atomic_long_read(&rwlock->value);
+		if (value & rmask) {
+			/* Order value reads */
+			smp_rmb();
+			continue;
+		}
+		if (atomic_long_cmpxchg(&rwlock->value, value, value | wmask)
+				== value)
+			break;
+	}
+	printk_dbg("lib writer got in with value %lX, new %lX, rmask %lX\n",
+		value, value | wmask, rmask);
+}
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against irq/softirq/thread readers.
+ */
+void fair_write_lock_irq(struct fair_rwlock *rwlock)
+{
+	spin_lock(&rwlock->wlock);
+
+	/* lock out threads */
+	_fair_write_lock_ctx(rwlock, THREAD_RMASK, THREAD_WMASK);
+
+	/* lock out softirqs */
+	local_bh_disable();
+	_fair_write_lock_ctx(rwlock, SOFTIRQ_RMASK, SOFTIRQ_WMASK);
+
+	/* lock out hardirqs */
+	local_irq_disable();
+	_fair_write_lock_ctx(rwlock, HARDIRQ_RMASK, HARDIRQ_WMASK);
+
+	/* atomic_long_cmpxchg orders writes */
+}
+EXPORT_SYMBOL_GPL(fair_write_lock_irq);
+
+void fair_write_unlock_irq(struct fair_rwlock *rwlock)
+{
+	/*
+	 * atomic_long_sub makes sure we commit the data before reenabling
+	 * the lock.
+	 */
+	atomic_long_sub(THREAD_WMASK | SOFTIRQ_WMASK | HARDIRQ_WMASK,
+			&rwlock->value);
+	local_irq_enable();
+	local_bh_enable();
+	spin_unlock(&rwlock->wlock);
+}
+EXPORT_SYMBOL_GPL(fair_write_unlock_irq);
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against softirq/thread readers.
+ */
+void fair_write_lock_bh(struct fair_rwlock *rwlock)
+{
+	spin_lock(&rwlock->wlock);
+
+	/* lock out threads */
+	_fair_write_lock_ctx(rwlock, THREAD_RMASK, THREAD_WMASK);
+
+	/* lock out softirqs */
+	local_bh_disable();
+	_fair_write_lock_ctx(rwlock, SOFTIRQ_RMASK, SOFTIRQ_WMASK);
+
+	/* atomic_long_cmpxchg orders writes */
+}
+EXPORT_SYMBOL_GPL(fair_write_lock_bh);
+
+void fair_write_unlock_bh(struct fair_rwlock *rwlock)
+{
+	/*
+	 * atomic_long_sub makes sure we commit the data before reenabling
+	 * the lock.
+	 */
+	atomic_long_sub(THREAD_WMASK | SOFTIRQ_WMASK, &rwlock->value);
+	local_bh_enable();
+	spin_unlock(&rwlock->wlock);
+}
+EXPORT_SYMBOL_GPL(fair_write_unlock_bh);
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against thread readers.
+ */
+void fair_write_lock(struct fair_rwlock *rwlock)
+{
+	spin_lock(&rwlock->wlock);
+
+	/* lock out threads */
+	_fair_write_lock_ctx(rwlock, THREAD_RMASK, THREAD_WMASK);
+
+	/* atomic_long_cmpxchg orders writes */
+}
+EXPORT_SYMBOL_GPL(fair_write_lock);
+
+void fair_write_unlock(struct fair_rwlock *rwlock)
+{
+	/*
+	 * atomic_long_sub makes sure we commit the data before reenabling
+	 * the lock.
+	 */
+	atomic_long_sub(THREAD_WMASK, &rwlock->value);
+	spin_unlock(&rwlock->wlock);
+}
+EXPORT_SYMBOL_GPL(fair_write_unlock);


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH] Fair rwlock
  2008-08-16 21:19       ` [RFC PATCH] Fair rwlock Mathieu Desnoyers
@ 2008-08-16 21:33         ` Linus Torvalds
  2008-08-17  7:53           ` [RFC PATCH] Fair low-latency rwlock v3 Mathieu Desnoyers
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2008-08-16 21:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, Andrew Morton, Ingo Molnar,
	Joe Perches, linux-kernel



On Sat, 16 Aug 2008, Mathieu Desnoyers wrote:
> 
> Tweakable contention behavior seems interesting, but I don't think it
> deals with the fact that on a mainline kernel, when an interrupt handler
> comes in and asks for a read lock, it has to get it on the spot.

Right. Which is exactly why I'd suggest using the extra space for saying 
"this CPU is busy-looping waiting for a write lock", and then the 
read-lock contention case can say

 - ok, there's a pending write lock holder on _my_ CPU, so I need to just 
   succeed right away despite the fact that there is contention.

In other words, there are a few cases:

 - you actually *got* the write lock

   interrupts obviously have to be disabled here, because any reader can 
   simply not get the lock.

 - you are waiting to get the write lock.

   Mark this CPU as "pending" in the rwlock (by making the extended thing 
   be a queue, or by simply only ever allowing a single pending CPU), and 
   re-enable interrupts while you wait for it. An interrupt that comes in 
   and wants a read-lock sees that it's pending, so it should then ignore 
   the contention bit, and only wait for the write bit to go away.

See? This way you only need to actually disable interrupts while holding 
literally the lock, not while waiting for it. While still giving priority 
to writers (except for the _one_ CPU, where new readers will have to get 
through).

So this way you can be fair, and not allow readers to starve a writer. The 
only reader that is allowed past a waiting writer is the reader on that 
same CPU.
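
Reusing the made-up my_rwlock from the earlier sketch, the write-lock slow
path could look roughly like this (the two helpers are hypothetical as well):

	/* hold irqs off only while actually holding the lock, never while
	 * waiting for it */
	static void sketch_write_lock_irq(struct my_rwlock *lock)
	{
		local_irq_disable();
		while (!sketch_write_trylock(lock)) {	/* hypothetical */
			/* mark this CPU as the pending writer, then wait
			 * with irqs enabled so local irq readers get in */
			lock->pending_writer_cpu = smp_processor_id();
			local_irq_enable();
			while (sketch_lock_is_held(lock))	/* hypothetical */
				cpu_relax();
			local_irq_disable();
			lock->pending_writer_cpu = -1;
		}
	}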

And notice how the fast-path needs no spinlock or anything else - it's 
still just a single locked instruction. In comparison, if I read your 
example code right, it is absolutely horrid and has an extra spinlock 
access for the fair_write_lock case.

			Linus


* [RFC PATCH] Fair low-latency rwlock v3
  2008-08-16 21:33         ` Linus Torvalds
@ 2008-08-17  7:53           ` Mathieu Desnoyers
  2008-08-17 16:17             ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Mathieu Desnoyers @ 2008-08-17  7:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, Andrew Morton, Ingo Molnar,
	Joe Perches, linux-kernel

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
[...]
> So this way you can be fair, and not allow readers to starve a writer. The 
> only reader that is allowed past a waiting writer is the reader on that 
> same CPU.
> 
> And notice how the fast-path needs no spinlock or anything else - it's 
> still just a single locked instruction. In comparison, if I read your 
> example code right, it is absolutely horrid and has an extra spinlock 
> access for the fair_write_lock case.
> 
> 			Linus

Hi Linus,

Using a writer subscription to the rwlock to make sure the readers stop taking
the lock when a writer is waiting for it is indeed a good way to ensure
fairness. I just fixed the issues you pointed out in my patch (spinlock removed,
added a write fastpath with a single atomic op) and also used the subscription
idea to get fairness for writers. Contention delay tests show that fairness is
achieved pretty well. For periodical writers, busy-looping readers and
periodical interrupt readers :

6 thread readers (busy-looping)
3 thread writers (1ms period)
2 periodical interrupt readers on 7/8 cpus (IPIs).

-  21us max. contention for writer
- 154us max. contention for thread readers
-  16us max. contention for interrupt readers

(benchmark details below)

I still perceive two potential problems with your approach. They are not related
to fairness between readers and writers, but rather to the effect on the
interrupt latency of the system. This is actually what my patch tries to address.

First there is a long interrupt handler scenario, as bad as disabling interrupts
for a long time period, which is not fixed by your solution :

CPU A
thread context, takes the read lock, gets interrupted, softirq runs.

CPU B
thread context, subscribes for the writer lock, busy loops waiting for CPU A.
(in your solution, interrupts are enabled here so interrupt handlers on CPU B
can come in)

CPU C
interrupt context, takes the read lock, contended because CPU B busy loops
waiting for the writer lock.

-> CPU C will have an interrupt handler running for the duration of the softirq
on CPU A, which will impact interrupt latency on CPU C.

Second, I tend to think it might be difficult to atomically set the writer
state and disable interrupts, unless we disable interrupts, cmpxchg the writer
state, and reenable interrupts in a loop if it fails, which does not seem very
neat.
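
i.e. something along these lines (sketch only, WRITER_BITS is a made-up
constant standing for the writer state bits) :

	static void subscribe_writer_irq(struct fair_rwlock *rwlock)
	{
		for (;;) {
			local_irq_disable();
			if (atomic_long_cmpxchg(&rwlock->value, 0,
						WRITER_BITS) == 0)
				break;		/* got it, irqs stay off */
			local_irq_enable();	/* contended : back off */
			cpu_relax();
		}
	}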


Fair low-latency rwlock v3

Changelog since v2 :
Add writer fairness in addition to fairness wrt interrupts and softirqs.
Added contention delays performance tests for thread and interrupt contexts to
changelog.

Changelog since v1 :
- No more spinlock to protect against concurrent writes, it is done within
  the existing atomic variable.
- Write fastpath with a single atomic op.

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Sat, 16 Aug 2008, Mathieu Desnoyers wrote:
> > 
> > I have hit this problem when tying to implement a better rwlock design
> > than is currently in the mainline kernel (I know the RT kernel has a
> > hard time with rwlocks)
> 
> Have you looked at my sleping rwlock trial thing?
> 
[...]
>  - and because it is designed for sleeping, I'm pretty sure that you can 
>    easily drop interrupts in the contention path, to make 
>    write_lock_irq[save]() be reasonable.
> 
> In particular, the third bullet is the important one: because it's 
> designed to have a "contention" path that has _extra_ information for the 
> contended case, you could literally make the extra information have things 
> like a list of pending writers, so that you can drop interrupts on one 
> CPU, while you adding information to let the reader side know that if the 
> read-lock happens on that CPU, it needs to be able to continue in order to 
> not deadlock.
> 
> 		Linus

No, I just had a look at it, thanks for the pointer!

Tweakable contention behavior seems interesting, but I don't think it
deals with the fact that on a mainline kernel, when an interrupt handler
comes in and asks for a read lock, it has to get it on the spot. (RT
kernels can get away with that using threaded interrupts, but that's a
completely different scheme.) Therefore, the impact is that interrupts
must be disabled around write lock usage, and we end up in a situation
where this interrupt-disable section can last for a long time, given
that it waits for every reader (including ones which do not disable
interrupts or softirqs) to complete.

Actually, I just used LTTng traces and eventually made a small patch to
lockdep to detect whenever a spinlock or a rwlock is used both with
interrupts enabled and disabled. Those sites are likely to produce very
high latencies and should IMHO be considered bogus. The basic bogus
scenario is to have a spinlock held on CPU A with interrupts enabled
getting interrupted, after which a softirq runs. On CPU B, the same lock
is acquired with interrupts off. We therefore disable interrupts on CPU B
for the duration of the softirq currently running on CPU A, which is
really not something that helps keep latencies short. My preliminary
results show that there are a lot of inconsistent spinlock/rwlock irq
on/off uses in the kernel.

This kind of scenario is pretty easy to fix for spinlocks (either move
the interrupt disable within the spinlock section if the spinlock is
never used by an interrupt handler, or make sure that every user has
interrupts disabled).

The problem comes with rwlocks : it is correct to have readers both with
and without irqs disabled, even when interrupt handlers use the read lock.
However, the write lock has to disable interrupts in that case, and we
suffer from the high latency I pointed out. The tasklist_lock is the
perfect example of this. In the following patch, I try to address this
issue.

The core idea is this :

This "fair" rwlock writer subscribes to the lock, which locks out the 
reader threads. Then, it waits until all reader threads have exited their
critical section and takes the mutex (1 bit within the "long"). Then it
disables softirqs, locks out the softirqs, waits for all softirqs to exit their
critical section, disables irqs and locks out irqs. It then waits for irqs to
exit their critical section. Only then is the writer allowed to modify the data
structure.

The writer fast path checks for a non-contended lock (all bits set to 0) and
does an atomic cmpxchg to set the subscription, mutex, softirq exclusion and irq
exclusion bits.

The reader does an atomic cmpxchg to check if there is a subscribed writer. If
not, it increments the reader count for its context (thread, softirq, irq).

The test module is available at :

http://ltt.polymtl.ca/svn/trunk/tests/kernel/test-fair-rwlock.c

** Performance tests

Dual quad-core Xeon 2.0GHz E5405

* Lock contention delays, per context, 30s test

6 thread readers (no delay loop)
3 thread writers (no delay loop)
2 periodical interrupt readers on 7/8 cpus (IPIs).

writer_thread/0 iterations : 2757002, max contention 561144 cycles
writer_thread/1 iterations : 2550357, max contention 606036 cycles
writer_thread/2 iterations : 2561024, max contention 593082 cycles
reader_thread/0 iterations : 199444, max contention 5776327560 cycles
reader_thread/1 iterations : 129680, max contention 5768277564 cycles
reader_thread/2 iterations : 76323, max contention 5775994128 cycles
reader_thread/3 iterations : 110571, max contention 4336139232 cycles
reader_thread/4 iterations : 120402, max contention 5513734818 cycles
reader_thread/5 iterations : 227301, max contention 4510503438 cycles
interrupt_reader_thread/0 iterations : 225
interrupt readers on CPU 0, max contention : 46056 cycles
interrupt readers on CPU 1, max contention : 32694 cycles
interrupt readers on CPU 2, max contention : 57432 cycles
interrupt readers on CPU 3, max contention : 29520 cycles
interrupt readers on CPU 4, max contention : 25908 cycles
interrupt readers on CPU 5, max contention : 36246 cycles
interrupt readers on CPU 6, max contention : 17916 cycles
interrupt readers on CPU 7, max contention : 34866 cycles

* Lock contention delays, per context, 600s test

6 thread readers (no delay loop)
3 thread writers (1ms period)
2 periodical interrupt readers on 7/8 cpus (IPIs).

writer_thread/0 iterations : 75001, max contention 42168 cycles
writer_thread/1 iterations : 75008, max contention 43674 cycles
writer_thread/2 iterations : 74999, max contention 43656 cycles
reader_thread/0 iterations : 256202476, max contention 307458 cycles
reader_thread/1 iterations : 270441472, max contention 261648 cycles
reader_thread/2 iterations : 258978159, max contention 171468 cycles
reader_thread/3 iterations : 247473522, max contention 97344 cycles
reader_thread/4 iterations : 285541056, max contention 136842 cycles
reader_thread/5 iterations : 269070814, max contention 134052 cycles
interrupt_reader_thread/0 iterations : 5772
interrupt readers on CPU 0, max contention : 13602 cycles
interrupt readers on CPU 1, max contention : 20082 cycles
interrupt readers on CPU 2, max contention : 12846 cycles
interrupt readers on CPU 3, max contention : 17892 cycles
interrupt readers on CPU 4, max contention : 31572 cycles
interrupt readers on CPU 5, max contention : 27444 cycles
interrupt readers on CPU 6, max contention : 16140 cycles
interrupt readers on CPU 7, max contention : 16026 cycles


Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
---
 include/linux/fair-rwlock.h |   51 ++++++
 lib/Makefile                |    2 
 lib/fair-rwlock.c           |  368 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 421 insertions(+)

Index: linux-2.6-lttng/include/linux/fair-rwlock.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/fair-rwlock.h	2008-08-17 03:41:00.000000000 -0400
@@ -0,0 +1,51 @@
+#ifndef _LINUX_FAIR_RWLOCK_H
+#define _LINUX_FAIR_RWLOCK_H
+
+/*
+ * Fair low-latency rwlock
+ *
+ * Allows writer fairness wrt readers and also minimally impact the irq latency
+ * of the system.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ * August 2008
+ */
+
+#include <asm/atomic.h>
+
+struct fair_rwlock {
+	atomic_long_t value;
+};
+
+/* Reader lock */
+
+/*
+ * many readers, from irq/softirq/thread context.
+ * protects against writers.
+ */
+void fair_read_lock(struct fair_rwlock *rwlock);
+void fair_read_unlock(struct fair_rwlock *rwlock);
+
+/* Writer Lock */
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against irq/softirq/thread readers.
+ */
+void fair_write_lock_irq(struct fair_rwlock *rwlock);
+void fair_write_unlock_irq(struct fair_rwlock *rwlock);
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against softirq/thread readers.
+ */
+void fair_write_lock_bh(struct fair_rwlock *rwlock);
+void fair_write_unlock_bh(struct fair_rwlock *rwlock);
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against thread readers.
+ */
+void fair_write_lock(struct fair_rwlock *rwlock);
+void fair_write_unlock(struct fair_rwlock *rwlock);
+#endif /* _LINUX_FAIR_RWLOCK_H */
Index: linux-2.6-lttng/lib/Makefile
===================================================================
--- linux-2.6-lttng.orig/lib/Makefile	2008-08-16 03:22:14.000000000 -0400
+++ linux-2.6-lttng/lib/Makefile	2008-08-16 03:29:12.000000000 -0400
@@ -43,6 +43,8 @@ obj-$(CONFIG_DEBUG_PREEMPT) += smp_proce
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
 obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o
 
+obj-y += fair-rwlock.o
+
 ifneq ($(CONFIG_HAVE_DEC_LOCK),y)
   lib-y += dec_and_lock.o
 endif
Index: linux-2.6-lttng/lib/fair-rwlock.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/lib/fair-rwlock.c	2008-08-17 03:44:44.000000000 -0400
@@ -0,0 +1,368 @@
+/*
+ * Fair low-latency rwlock
+ *
+ * Allows writer fairness wrt readers and also minimally impact the irq latency
+ * of the system.
+ *
+ * A typical case leading to long interrupt latencies :
+ *
+ * - rwlock shared between
+ *   - Rare update in thread context
+ *   - Frequent slow read in thread context (task list iteration)
+ *   - Fast interrupt handler read
+ *
+ * The slow write must therefore disable interrupts around the write lock,
+ * but will therefore add to the global interrupt latency; worst case being
+ * the duration of the slow read.
+ *
+ * This "fair" rwlock writer subscribes to the lock, which locks out the reader
+ * threads. Then, it waits until all reader threads have exited their critical
+ * section and takes the mutex (1 bit within the "long"). Then it disables
+ * softirqs, locks out the softirqs, waits for all softirqs to exit their
+ * critical section, disables irqs and locks out irqs. It then waits for irqs to
+ * exit their critical section. Only then is the writer allowed to modify the
+ * data structure.
+ *
+ * The writer fast path checks for a non-contended lock (all bits set to 0) and
+ * does an atomic cmpxchg to set the subscription, mutex, softirq exclusion and
+ * irq exclusion bits.
+ *
+ * The reader does an atomic cmpxchg to check if there is a subscribed writer.
+ * If not, it increments the reader count for its context (thread, softirq,
+ * irq).
+ *
+ * rwlock bits :
+ *
+ * - bits 0 .. log_2(NR_CPUS)-1 are the thread readers count
+ *   (max number of threads : NR_CPUS)
+ * - bits log_2(NR_CPUS) .. 2*log_2(NR_CPUS)-1 are the softirq readers count
+ *   (max # of softirqs: NR_CPUS)
+ * - bits 2*log_2(NR_CPUS) .. 3*log_2(NR_CPUS)-1 are the hardirq readers count
+ *   (max # of hardirqs: NR_CPUS)
+ *
+ * - bits 3*log_2(NR_CPUS) .. 4*log_2(NR_CPUS)-1 are the writer subscribers
+ *   count. Locks against reader threads if non zero.
+ *   (max # of writer subscribers : NR_CPUS)
+ * - bit 4*log_2(NR_CPUS) is the write mutex
+ * - bit 4*log_2(NR_CPUS)+1 is the writer lock against softirqs
+ * - bit 4*log_2(NR_CPUS)+2 is the writer lock against hardirqs
+ *
+ * e.g. : NR_CPUS = 16
+ *
+ * THREAD_RMASK:      0x0000000f
+ * SOFTIRQ_RMASK:     0x000000f0
+ * HARDIRQ_RMASK:     0x00000f00
+ * SUBSCRIBERS_WMASK: 0x0000f000
+ * WRITER_MUTEX:      0x00010000
+ * SOFTIRQ_WMASK:     0x00020000
+ * HARDIRQ_WMASK:     0x00040000
+ *
+ * Bits usage :
+ *
+ * nr cpus for thread read
+ * nr cpus for softirq read
+ * nr cpus for hardirq read
+ *
+ * IRQ-safe write lock :
+ * nr cpus for write subscribers, disables new thread readers if non zero
+ * if (nr thread read == 0 && write mutex == 0)
+ * 1 bit for write mutex
+ * (softirq off)
+ * 1 bit for softirq exclusion
+ * if (nr softirq read == 0)
+ * (hardirq off)
+ * 1 bit for hardirq exclusion
+ * if (nr hardirq read == 0)
+ * -> locked
+ *
+ * Copyright 2008 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+
+#include <linux/fair-rwlock.h>
+#include <linux/hardirq.h>
+#include <linux/module.h>
+
+#if (NR_CPUS > 64 && (BITS_PER_LONG == 32 || NR_CPUS > 32768))
+#error "fair rwlock needs more bits per long to deal with that many CPUs"
+#endif
+
+#define THREAD_ROFFSET	1UL
+#define THREAD_RMASK	((NR_CPUS - 1) * THREAD_ROFFSET)
+#define SOFTIRQ_ROFFSET	(THREAD_RMASK + 1)
+#define SOFTIRQ_RMASK	((NR_CPUS - 1) * SOFTIRQ_ROFFSET)
+#define HARDIRQ_ROFFSET	((SOFTIRQ_RMASK | THREAD_RMASK) + 1)
+#define HARDIRQ_RMASK	((NR_CPUS - 1) * HARDIRQ_ROFFSET)
+
+#define SUBSCRIBERS_WOFFSET	\
+	((HARDIRQ_RMASK | SOFTIRQ_RMASK | THREAD_RMASK) + 1)
+#define SUBSCRIBERS_WMASK	\
+	((NR_CPUS - 1) * SUBSCRIBERS_WOFFSET)
+#define WRITER_MUTEX		\
+	((SUBSCRIBERS_WMASK | HARDIRQ_RMASK | SOFTIRQ_RMASK | THREAD_RMASK) + 1)
+#define SOFTIRQ_WMASK	(WRITER_MUTEX << 1)
+#define SOFTIRQ_WOFFSET	SOFTIRQ_WMASK
+#define HARDIRQ_WMASK	(SOFTIRQ_WMASK << 1)
+#define HARDIRQ_WOFFSET	HARDIRQ_WMASK
+
+#ifdef FAIR_RWLOCK_DEBUG
+#define printk_dbg printk
+#else
+#define printk_dbg(fmt, args...)
+#endif
+
+/* Reader lock */
+
+static void _fair_read_lock_ctx(struct fair_rwlock *rwlock,
+		long roffset, long wmask)
+{
+	long value;
+
+	for (;;) {
+		value = atomic_long_read(&rwlock->value);
+		if (value & wmask) {
+			/* Order value reads */
+			smp_rmb();
+			continue;
+		}
+		if (atomic_long_cmpxchg(&rwlock->value, value, value + roffset)
+				== value)
+			break;
+	}
+
+	printk_dbg("lib reader got in with value %lX, wmask %lX\n",
+		value, wmask);
+}
+
+/*
+ * many readers, from irq/softirq/thread context.
+ * protects against writers.
+ */
+void fair_read_lock(struct fair_rwlock *rwlock)
+{
+	if (in_irq())
+		_fair_read_lock_ctx(rwlock, HARDIRQ_ROFFSET, HARDIRQ_WMASK);
+	else if (in_softirq())
+		_fair_read_lock_ctx(rwlock, SOFTIRQ_ROFFSET, SOFTIRQ_WMASK);
+	else {
+		preempt_disable();
+		_fair_read_lock_ctx(rwlock, THREAD_ROFFSET, SUBSCRIBERS_WMASK);
+	}
+}
+EXPORT_SYMBOL_GPL(fair_read_lock);
+
+void fair_read_unlock(struct fair_rwlock *rwlock)
+{
+	/* atomic_long_sub orders reads */
+	if (in_irq())
+		atomic_long_sub(HARDIRQ_ROFFSET, &rwlock->value);
+	else if (in_softirq())
+		atomic_long_sub(SOFTIRQ_ROFFSET, &rwlock->value);
+	else {
+		atomic_long_sub(THREAD_ROFFSET, &rwlock->value);
+		preempt_enable();
+	}
+}
+EXPORT_SYMBOL_GPL(fair_read_unlock);
+
+/* Writer lock */
+
+/*
+ * Lock out a specific execution context from the read lock. Wait for both the
+ * rmask and the wmask to be empty before proceeding to take the lock.
+ */
+static void _fair_write_lock_ctx_wait(struct fair_rwlock *rwlock,
+		long rmask, long wmask)
+{
+	long value;
+
+	for (;;) {
+		value = atomic_long_read(&rwlock->value);
+		if (value & (rmask | wmask)) {
+			/* Order value reads */
+			smp_rmb();
+			continue;
+		}
+		if (atomic_long_cmpxchg(&rwlock->value, value, value | wmask)
+				== value)
+			break;
+	}
+	printk_dbg("lib writer got in with value %lX, new %lX, rmask %lX\n",
+		value, value | wmask, rmask);
+}
+
+/*
+ * Lock out a specific execution context from the read lock. First lock the read
+ * context out of the lock, then wait for every reader to exit its critical
+ * section.
+ */
+static void _fair_write_lock_ctx_force(struct fair_rwlock *rwlock,
+		long rmask, long woffset)
+{
+	long value;
+
+	atomic_long_add(woffset, &rwlock->value);
+	do {
+		value = atomic_long_read(&rwlock->value);
+		/* Order rwlock->value read wrt following reads */
+		smp_rmb();
+	} while (value & rmask);
+	printk_dbg("lib writer got in with value %lX, woffset %lX, rmask %lX\n",
+		value, woffset, rmask);
+}
+
+
+/*
+ * Uncontended fastpath.
+ */
+static int fair_write_lock_irq_fast(struct fair_rwlock *rwlock)
+{
+	long value;
+
+
+	value = atomic_long_read(&rwlock->value);
+	if (likely(!value)) {
+		/* no other reader nor writer present, try to take the lock */
+		local_bh_disable();
+		local_irq_disable();
+		if (likely(atomic_long_cmpxchg(&rwlock->value, value,
+				value + (SUBSCRIBERS_WOFFSET | SOFTIRQ_WOFFSET
+					| HARDIRQ_WOFFSET | WRITER_MUTEX))
+						== value))
+			return 1;
+		local_irq_enable();
+		local_bh_enable();
+	}
+	return 0;
+}
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against irq/softirq/thread readers.
+ */
+void fair_write_lock_irq(struct fair_rwlock *rwlock)
+{
+	preempt_disable();
+
+	if (likely(fair_write_lock_irq_fast(rwlock)))
+		return;
+
+	/* lock out threads */
+	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
+
+	/* lock out other writers when no reader threads left */
+	_fair_write_lock_ctx_wait(rwlock, THREAD_RMASK, WRITER_MUTEX);
+
+	/* lock out softirqs */
+	local_bh_disable();
+	_fair_write_lock_ctx_force(rwlock, SOFTIRQ_RMASK, SOFTIRQ_WOFFSET);
+
+	/* lock out hardirqs */
+	local_irq_disable();
+	_fair_write_lock_ctx_force(rwlock, HARDIRQ_RMASK, HARDIRQ_WOFFSET);
+
+	/* atomic_long_cmpxchg orders writes */
+}
+EXPORT_SYMBOL_GPL(fair_write_lock_irq);
+
+void fair_write_unlock_irq(struct fair_rwlock *rwlock)
+{
+	/*
+	 * atomic_long_sub makes sure we commit the data before reenabling
+	 * the lock.
+	 */
+	atomic_long_sub(HARDIRQ_WOFFSET | SOFTIRQ_WOFFSET
+			| WRITER_MUTEX | SUBSCRIBERS_WOFFSET,
+			&rwlock->value);
+	local_irq_enable();
+	local_bh_enable();
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(fair_write_unlock_irq);
+
+/*
+ * Uncontended fastpath.
+ */
+static int fair_write_lock_bh_fast(struct fair_rwlock *rwlock)
+{
+	long value;
+
+	value = atomic_long_read(&rwlock->value);
+	if (likely(!value)) {
+		/* no other reader nor writer present, try to take the lock */
+		local_bh_disable();
+		if (likely(atomic_long_cmpxchg(&rwlock->value, value,
+					(value + SUBSCRIBERS_WOFFSET
+					+ SOFTIRQ_WOFFSET) | WRITER_MUTEX)
+							== value))
+			return 1;
+		local_bh_enable();
+	}
+	return 0;
+}
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against softirq/thread readers.
+ */
+void fair_write_lock_bh(struct fair_rwlock *rwlock)
+{
+	preempt_disable();
+
+	if (likely(fair_write_lock_bh_fast(rwlock)))
+		return;
+
+	/* lock out threads */
+	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
+
+	/* lock out other writers when no reader threads left */
+	_fair_write_lock_ctx_wait(rwlock, THREAD_RMASK, WRITER_MUTEX);
+
+	/* lock out softirqs */
+	local_bh_disable();
+	_fair_write_lock_ctx_force(rwlock, SOFTIRQ_RMASK, SOFTIRQ_WOFFSET);
+
+	/* atomic_long_cmpxchg orders writes */
+}
+EXPORT_SYMBOL_GPL(fair_write_lock_bh);
+
+void fair_write_unlock_bh(struct fair_rwlock *rwlock)
+{
+	/*
+	 * atomic_long_sub makes sure we commit the data before reenabling
+	 * the lock.
+	 */
+	atomic_long_sub(SOFTIRQ_WOFFSET | WRITER_MUTEX | SUBSCRIBERS_WOFFSET,
+			&rwlock->value);
+	local_bh_enable();
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(fair_write_unlock_bh);
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against thread readers.
+ */
+void fair_write_lock(struct fair_rwlock *rwlock)
+{
+	preempt_disable();
+
+	/* lock out threads */
+	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
+
+	/* lock out other writers when no reader threads left */
+	_fair_write_lock_ctx_wait(rwlock, THREAD_RMASK, WRITER_MUTEX);
+
+	/* atomic_long_cmpxchg orders writes */
+}
+EXPORT_SYMBOL_GPL(fair_write_lock);
+
+void fair_write_unlock(struct fair_rwlock *rwlock)
+{
+	/*
+	 * atomic_long_sub makes sure we commit the data before reenabling
+	 * the lock.
+	 */
+	atomic_long_sub(WRITER_MUTEX | SUBSCRIBERS_WOFFSET, &rwlock->value);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(fair_write_unlock);


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH] Fair low-latency rwlock v3
  2008-08-17  7:53           ` [RFC PATCH] Fair low-latency rwlock v3 Mathieu Desnoyers
@ 2008-08-17 16:17             ` Linus Torvalds
  2008-08-17 19:10               ` [RFC PATCH] Fair low-latency rwlock v5 Mathieu Desnoyers
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2008-08-17 16:17 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, Andrew Morton, Ingo Molnar,
	Joe Perches, linux-kernel



On Sun, 17 Aug 2008, Mathieu Desnoyers wrote:
> +/*
> + * Uncontended fastpath.
> + */
> +static int fair_write_lock_irq_fast(struct fair_rwlock *rwlock)

So first off, you should call this "trylock", since it doesn't necessarily 
get the lock at all. It has nothing to do with fast.

Secondly:

> +	value = atomic_long_read(&rwlock->value);
> +	if (likely(!value)) {
> +		/* no other reader nor writer present, try to take the lock */
> +		local_bh_disable();
> +		local_irq_disable();
> +		if (likely(atomic_long_cmpxchg(&rwlock->value, value,

This is actually potentially very slow. 

Why? If the lock is uncontended, but is not in the current CPU's caches, 
the read -> rmw operation generates multiple cache coherency protocol 
events. First it gets the line in shared mode (for the read), and then 
later it turns it into exclusive mode.

So if it's likely that the value is zero (or even if it's just the only 
case we really care about), then you really should do the

	atomic_long_cmpxchg(&rwlock->value, 0, newvalue);

thing as the _first_ access to the lock.

Yeah, yeah, that means that you need to do the local_bh_disable etc first 
too, and undo it if it fails, but the failure path should be the unusual 
one. 
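
IOW, the uncontended attempt boils down to something like this (sketch, using
the fair_rwlock value word from your patch):

	/* one locked op, and the cache line is grabbed exclusive right away */
	static int uncontended_write_trylock(struct fair_rwlock *rwlock,
					     long newvalue)
	{
		return atomic_long_cmpxchg(&rwlock->value, 0, newvalue) == 0;
	}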

			Linus


* [RFC PATCH] Fair low-latency rwlock v5
  2008-08-17 16:17             ` Linus Torvalds
@ 2008-08-17 19:10               ` Mathieu Desnoyers
  2008-08-17 21:30                 ` [RFC PATCH] Fair low-latency rwlock v5 (updated benchmarks) Mathieu Desnoyers
                                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Mathieu Desnoyers @ 2008-08-17 19:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, Andrew Morton, Ingo Molnar,
	Joe Perches, Paul E. McKenney, linux-kernel

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Sun, 17 Aug 2008, Mathieu Desnoyers wrote:
> > +/*
> > + * Uncontended fastpath.
> > + */
> > +static int fair_write_lock_irq_fast(struct fair_rwlock *rwlock)
> 
> So first off, you should call this "trylock", since it doesn't necessarily 
> get the lock at all. It has nothing to do with fast.
> 
> Secondly:
> 
> > +	value = atomic_long_read(&rwlock->value);
> > +	if (likely(!value)) {
> > +		/* no other reader nor writer present, try to take the lock */
> > +		local_bh_disable();
> > +		local_irq_disable();
> > +		if (likely(atomic_long_cmpxchg(&rwlock->value, value,
> 
> This is actually potentially very slow. 
> 
> Why? If the lock is uncontended, but is not in the current CPU's caches, 
> the read -> rmw operation generates multiple cache coherency protocol 
> events. First it gets the line in shared mode (for the read), and then 
> later it turns it into exclusive mode.
> 
> So if it's likely that the value is zero (or even if it's just the only 
> case we really care about), then you really should do the
> 
> 	atomic_long_cmpxchg(&rwlock->value, 0, newvalue);
> 
> thing as the _first_ access to the lock.
> 
> Yeah, yeah, that means that you need to do the local_bh_disable etc first 
> too, and undo it if it fails, but the failure path should be the unusual 
> one. 
> 
> 			Linus

Ah, you are right. This new version implements a "single cmpxchg"
uncontended code path, changes the _fast semantics to "trylock
uncontended", and also adds the trylock primitives.

Thanks for the input,

Mathieu


Fair low-latency rwlock v5

Fair low-latency rwlock provides fairness for writers against reader threads,
but lets softirq and irq readers in even when there are subscribed writers
waiting for the lock. Writers could starve reader threads.

Changelog since v4 :
rename _fast -> trylock uncontended
Remove the read from the uncontended read and write cases : just do a cmpxchg
expecting "sub_expect", 0 for "lock", 1 subscriber for a "trylock".
Use the value returned by the first cmpxchg as the following value instead of
doing an unnecessary read.
an unnecessary read.

Here is why read + rmw is wrong (quoting Linus) :

"This is actually potentially very slow.

Why? If the lock is uncontended, but is not in the current CPU's caches,
the read -> rmw operation generates multiple cache coherency protocol
events. First it gets the line in shared mode (for the read), and then
later it turns it into exclusive mode."

Changelog since v3 :
Added writer subscription and fair trylock. The writer subscription keeps the
reader threads out of the lock.

Changelog since v2 :
Add writer fairness in addition to fairness wrt interrupts and softirqs.
Added contention delays performance tests for thread and interrupt contexts to
changelog.

Changelog since v1 :
- No more spinlock to protect against concurrent writes, it is done within
  the existing atomic variable.
- Write fastpath with a single atomic op.

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Sat, 16 Aug 2008, Mathieu Desnoyers wrote:
> > 
> > I have hit this problem when tying to implement a better rwlock design
> > than is currently in the mainline kernel (I know the RT kernel has a
> > hard time with rwlocks)
> 
> Have you looked at my sleping rwlock trial thing?
> 
[...]
>  - and because it is designed for sleeping, I'm pretty sure that you can 
>    easily drop interrupts in the contention path, to make 
>    write_lock_irq[save]() be reasonable.
> 
> In particular, the third bullet is the important one: because it's 
> designed to have a "contention" path that has _extra_ information for the 
> contended case, you could literally make the extra information have things 
> like a list of pending writers, so that you can drop interrupts on one 
> CPU, while you adding information to let the reader side know that if the 
> read-lock happens on that CPU, it needs to be able to continue in order to 
> not deadlock.
> 
> 		Linus

No, I just had a look at it, thanks for the pointer!

Tweakable contention behavior seems interesting, but I don't think it
deals with the fact that on a mainline kernel, when an interrupt handler
comes in and asks for a read lock, it has to get it on the spot. (RT
kernels can get away with that using threaded interrupts, but that's a
completely different scheme.) Therefore, the impact is that interrupts
must be disabled around write lock usage, and we end up in a situation
where this interrupt-disable section can last for a long time, given
that it waits for every reader (including ones which do not disable
interrupts or softirqs) to complete.

Actually, I just used LTTng traces and eventually made a small patch to
lockdep to detect whenever a spinlock or a rwlock is used both with
interrupts enabled and disabled. Those sites are likely to produce very
high latencies and should IMHO be considered bogus. The basic bogus
scenario is to have a spinlock held on CPU A with interrupts enabled
getting interrupted, after which a softirq runs. On CPU B, the same lock
is acquired with interrupts off. We therefore disable interrupts on CPU B
for the duration of the softirq currently running on CPU A, which is
really not something that helps keep latencies short. My preliminary
results show that there are a lot of inconsistent spinlock/rwlock irq
on/off uses in the kernel.

This kind of scenario is pretty easy to fix for spinlocks (either move
the interrupt disable within the spinlock section if the spinlock is
never used by an interrupt handler, or make sure that every user has
interrupts disabled).

The problem comes with rwlocks : it is correct to have readers both with
and without irqs disabled, even when interrupt handlers use the read lock.
However, the write lock has to disable interrupts in that case, and we
suffer from the high latency I pointed out. The tasklist_lock is the
perfect example of this. In the following patch, I try to address this
issue.

The core idea is this :

This "fair" rwlock writer subscribes to the lock, which locks out the 
reader threads. Then, it waits until all reader threads have exited their
critical section and takes the mutex (1 bit within the "long"). Then it
disables softirqs, locks out the softirqs, waits for all softirqs to exit their
critical section, disables irqs and locks out irqs. It then waits for irqs to
exit their critical section. Only then is the writer allowed to modify the data
structure.

The writer fast path checks for a non-contended lock (all bits set to 0) and
does an atomic cmpxchg to set the subscription, mutex, softirq exclusion and irq
exclusion bits.

The reader does an atomic cmpxchg to check if there is a subscribed writer. If
not, it increments the reader count for its context (thread, softirq, irq).
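
As a quick illustration, a typical use of the API below looks like this
(the protected data and the two functions are hypothetical; the locking
primitives are the ones declared in fair-rwlock.h) :

#include <linux/fair-rwlock.h>

static struct fair_rwlock example_rwlock;	/* all bits 0 : uncontended */

/*
 * Reader : can run in thread, softirq or irq context; the lock picks the
 * proper reader count for the current context.
 */
static void example_read(void)
{
	fair_read_lock(&example_rwlock);
	/* ... iterate over the protected data ... */
	fair_read_unlock(&example_rwlock);
}

/*
 * Writer in thread context, excluding irq/softirq/thread readers. In the
 * contended path, interrupts only get disabled once the thread and softirq
 * readers have exited their critical sections.
 */
static void example_update(void)
{
	fair_write_lock_irq(&example_rwlock);
	/* ... modify the protected data ... */
	fair_write_unlock_irq(&example_rwlock);
}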

The test module is available at :

http://ltt.polymtl.ca/svn/trunk/tests/kernel/test-fair-rwlock.c

** Performance tests

Dual quad-core Xeon 2.0GHz E5405

* Lock contention delays, per context, 60s test

6 thread readers (no delay loop)
3 thread writers (10us period)
2 periodical interrupt readers on 7/8 cpus (IPIs).

writer_thread/0 iterations : 2980374, max contention 1007442 cycles
writer_thread/1 iterations : 2992675, max contention 917364 cycles
trylock_writer_thread/0 iterations : 13533386, successful iterations : 3357015
trylock_writer_thread/1 iterations : 13310980, successful iterations : 3278394
reader_thread/0 iterations : 13624664, max contention 1296450 cycles
reader_thread/1 iterations : 12183042, max contention 1524810 cycles
reader_thread/2 iterations : 13302985, max contention 1017834 cycles
reader_thread/3 iterations : 13398307, max contention 1016700 cycles
trylock_reader_thread/0 iterations : 60849620, successful iterations : 14708736
trylock_reader_thread/1 iterations : 63830891, successful iterations : 15827695
interrupt_reader_thread/0 iterations : 559
interrupt readers on CPU 0, max contention : 13914 cycles
interrupt readers on CPU 1, max contention : 16446 cycles
interrupt readers on CPU 2, max contention : 11604 cycles
interrupt readers on CPU 3, max contention : 14124 cycles
interrupt readers on CPU 4, max contention : 16494 cycles
interrupt readers on CPU 5, max contention : 12930 cycles
interrupt readers on CPU 6, max contention : 9432 cycles
interrupt readers on CPU 7, max contention : 14514 cycles
trylock_interrupt_reader_thread/0 iterations : 573
trylock interrupt readers on CPU 0, iterations 1249, successful iterations : 1135
trylock interrupt readers on CPU 1, iterations 1152, successful iterations : 1064
trylock interrupt readers on CPU 2, iterations 1239, successful iterations : 1135
trylock interrupt readers on CPU 3, iterations 1227, successful iterations : 1130
trylock interrupt readers on CPU 4, iterations 1029, successful iterations : 908
trylock interrupt readers on CPU 5, iterations 1161, successful iterations : 1063
trylock interrupt readers on CPU 6, iterations 612, successful iterations : 578
trylock interrupt readers on CPU 7, iterations 1007, successful iterations : 932


Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
---
 include/linux/fair-rwlock.h |   61 ++++
 lib/Makefile                |    2 
 lib/fair-rwlock.c           |  557 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 620 insertions(+)

Index: linux-2.6-lttng/include/linux/fair-rwlock.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/fair-rwlock.h	2008-08-17 11:48:42.000000000 -0400
@@ -0,0 +1,61 @@
+#ifndef _LINUX_FAIR_RWLOCK_H
+#define _LINUX_FAIR_RWLOCK_H
+
+/*
+ * Fair low-latency rwlock
+ *
+ * Allows writer fairness wrt readers and also minimally impacts the irq latency
+ * of the system.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ * August 2008
+ */
+
+#include <asm/atomic.h>
+
+struct fair_rwlock {
+	atomic_long_t value;
+};
+
+/* Reader lock */
+
+/*
+ * many readers, from irq/softirq/thread context.
+ * protects against writers.
+ */
+void fair_read_lock(struct fair_rwlock *rwlock);
+int fair_read_trylock(struct fair_rwlock *rwlock);
+void fair_read_unlock(struct fair_rwlock *rwlock);
+
+/* Writer Lock */
+
+/*
+ * Subscription for write lock. Must be used for write trylock only.
+ * Ensures fairness wrt reader threads.
+ */
+void fair_write_subscribe(struct fair_rwlock *rwlock);
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against irq/softirq/thread readers.
+ */
+void fair_write_lock_irq(struct fair_rwlock *rwlock);
+int fair_write_trylock_subscribed_irq(struct fair_rwlock *rwlock);
+void fair_write_unlock_irq(struct fair_rwlock *rwlock);
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against softirq/thread readers.
+ */
+void fair_write_lock_bh(struct fair_rwlock *rwlock);
+int fair_write_trylock_subscribed_bh(struct fair_rwlock *rwlock);
+void fair_write_unlock_bh(struct fair_rwlock *rwlock);
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against thread readers.
+ */
+void fair_write_lock(struct fair_rwlock *rwlock);
+int fair_write_trylock_subscribed(struct fair_rwlock *rwlock);
+void fair_write_unlock(struct fair_rwlock *rwlock);
+#endif /* _LINUX_FAIR_RWLOCK_H */
Index: linux-2.6-lttng/lib/Makefile
===================================================================
--- linux-2.6-lttng.orig/lib/Makefile	2008-08-16 03:22:14.000000000 -0400
+++ linux-2.6-lttng/lib/Makefile	2008-08-16 03:29:12.000000000 -0400
@@ -43,6 +43,8 @@ obj-$(CONFIG_DEBUG_PREEMPT) += smp_proce
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
 obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o
 
+obj-y += fair-rwlock.o
+
 ifneq ($(CONFIG_HAVE_DEC_LOCK),y)
   lib-y += dec_and_lock.o
 endif
Index: linux-2.6-lttng/lib/fair-rwlock.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/lib/fair-rwlock.c	2008-08-17 14:20:38.000000000 -0400
@@ -0,0 +1,557 @@
+/*
+ * Fair low-latency rwlock
+ *
+ * Allows writer fairness wrt readers and also minimally impacts the irq latency
+ * of the system.
+ *
+ * A typical case leading to long interrupt latencies :
+ *
+ * - rwlock shared between
+ *   - Rare update in thread context
+ *   - Frequent slow read in thread context (task list iteration)
+ *   - Fast interrupt handler read
+ *
+ * The slow write must therefore disable interrupts around the write lock,
+ * but will therefore add up to the global interrupt latency; worse case being
+ * the duration of the slow read.
+ *
+ * This "fair" rwlock writer subscribes to the lock, which locks out the reader
+ * threads. Then, it waits until all reader threads have exited their critical
+ * section and takes the mutex (1 bit within the "long"). Then it disables
+ * softirqs, locks out the softirqs, waits for all softirqs to exit their
+ * critical section, disables irqs and locks out irqs. It then waits for irqs to
+ * exit their critical section. Only then is the writer allowed to modify the
+ * data structure.
+ *
+ * The writer fast path checks for a non-contended lock (all bits set to 0) and
+ * does an atomic cmpxchg to set the subscription, mutex, softirq exclusion and
+ * irq exclusion bits.
+ *
+ * The reader does an atomic cmpxchg to check if there is a subscribed writer.
+ * If not, it increments the reader count for its context (thread, softirq,
+ * irq).
+ *
+ * rwlock bits :
+ *
+ * - bits 0 .. log_2(NR_CPUS)-1 are the thread readers count
+ *   (max number of threads : NR_CPUS)
+ * - bits log_2(NR_CPUS) .. 2*log_2(NR_CPUS)-1 are the softirq readers count
+ *   (max # of softirqs: NR_CPUS)
+ * - bits 2*log_2(NR_CPUS) .. 3*log_2(NR_CPUS)-1 are the hardirq readers count
+ *   (max # of hardirqs: NR_CPUS)
+ *
+ * - bits 3*log_2(NR_CPUS) .. 4*log_2(NR_CPUS)-1 are the writer subscribers
+ *   count. Locks against reader threads if non zero.
+ *   (max # of writer subscribers : NR_CPUS)
+ * - bit 4*log_2(NR_CPUS) is the write mutex
+ * - bit 4*log_2(NR_CPUS)+1 is the writer lock against softirqs
+ * - bit 4*log_2(NR_CPUS)+2 is the writer lock against hardirqs
+ *
+ * e.g. : NR_CPUS = 16
+ *
+ * THREAD_RMASK:      0x0000000f
+ * SOFTIRQ_RMASK:     0x000000f0
+ * HARDIRQ_RMASK:     0x00000f00
+ * SUBSCRIBERS_WMASK: 0x0000f000
+ * WRITER_MUTEX:      0x00010000
+ * SOFTIRQ_WMASK:     0x00020000
+ * HARDIRQ_WMASK:     0x00040000
+ *
+ * Bits usage :
+ *
+ * nr cpus for thread read
+ * nr cpus for softirq read
+ * nr cpus for hardirq read
+ *
+ * IRQ-safe write lock :
+ * nr cpus for write subscribers, disables new thread readers if non zero
+ * if (nr thread read == 0 && write mutex == 0)
+ * 1 bit for write mutex
+ * (softirq off)
+ * 1 bit for softirq exclusion
+ * if (nr softirq read == 0)
+ * (hardirq off)
+ * 1 bit for hardirq exclusion
+ * if (nr hardirq read == 0)
+ * -> locked
+ *
+ * Copyright 2008 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+
+#include <linux/fair-rwlock.h>
+#include <linux/hardirq.h>
+#include <linux/module.h>
+
+#if (NR_CPUS > 64 && (BITS_PER_LONG == 32 || NR_CPUS > 32768))
+#error "fair rwlock needs more bits per long to deal with that many CPUs"
+#endif
+
+#define THREAD_ROFFSET	1UL
+#define THREAD_RMASK	((NR_CPUS - 1) * THREAD_ROFFSET)
+#define SOFTIRQ_ROFFSET	(THREAD_RMASK + 1)
+#define SOFTIRQ_RMASK	((NR_CPUS - 1) * SOFTIRQ_ROFFSET)
+#define HARDIRQ_ROFFSET	((SOFTIRQ_RMASK | THREAD_RMASK) + 1)
+#define HARDIRQ_RMASK	((NR_CPUS - 1) * HARDIRQ_ROFFSET)
+
+#define SUBSCRIBERS_WOFFSET	\
+	((HARDIRQ_RMASK | SOFTIRQ_RMASK | THREAD_RMASK) + 1)
+#define SUBSCRIBERS_WMASK	\
+	((NR_CPUS - 1) * SUBSCRIBERS_WOFFSET)
+#define WRITER_MUTEX		\
+	((SUBSCRIBERS_WMASK | HARDIRQ_RMASK | SOFTIRQ_RMASK | THREAD_RMASK) + 1)
+#define SOFTIRQ_WMASK	(WRITER_MUTEX << 1)
+#define SOFTIRQ_WOFFSET	SOFTIRQ_WMASK
+#define HARDIRQ_WMASK	(SOFTIRQ_WMASK << 1)
+#define HARDIRQ_WOFFSET	HARDIRQ_WMASK
+
+#ifdef FAIR_RWLOCK_DEBUG
+#define printk_dbg printk
+#else
+#define printk_dbg(fmt, args...)
+#endif
+
+
+/*
+ * Uncontended fastpath.
+ */
+static long fair_read_trylock_uncontended(struct fair_rwlock *rwlock,
+	long expect_readers, long roffset)
+{
+	/* no other reader nor writer present, try to take the lock */
+	return atomic_long_cmpxchg(&rwlock->value, expect_readers,
+					(expect_readers + roffset));
+}
+
+/*
+ * Reader lock
+ *
+ * First try to get the uncontended lock. If it is non-zero (can be common,
+ * since we allow multiple readers), pass the returned cmpxchg value to the loop
+ * to try to get the reader lock.
+ *
+ * trylock will fail if a writer is subscribed or holds the lock, but will
+ * spin if there is concurrency to win the cmpxchg. It could happen if, for
+ * instance, other concurrent reads need to update the roffset or if a
+ * writer updated lock bits which do not contend with us. Since many
+ * concurrent readers is a common case, it makes sense not to fail if it
+ * happens.
+ *
+ * the non-trylock case will spin for both situations.
+ */
+
+static int _fair_read_lock_ctx(struct fair_rwlock *rwlock,
+		long roffset, long wmask, int trylock)
+{
+	long value, newvalue;
+
+	/*
+	 * Try uncontended lock.
+	 */
+	value = fair_read_trylock_uncontended(rwlock, 0, roffset);
+	if (likely(value == 0))
+		return 1;
+
+	for (;;) {
+		if (value & wmask) {
+			if (!trylock) {
+				/* Order value reads */
+				smp_rmb();
+				value = atomic_long_read(&rwlock->value);
+				continue;
+			} else
+				return 0;
+		}
+
+		newvalue = atomic_long_cmpxchg(&rwlock->value,
+				value, value + roffset);
+		if (likely(newvalue == value))
+			break;
+		else
+			value = newvalue;
+	}
+
+	printk_dbg("lib reader got in with value %lX, wmask %lX\n",
+		value, wmask);
+	return 1;
+}
+
+/*
+ * many readers, from irq/softirq/thread context.
+ * protects against writers.
+ */
+void fair_read_lock(struct fair_rwlock *rwlock)
+{
+	if (in_irq())
+		_fair_read_lock_ctx(rwlock, HARDIRQ_ROFFSET, HARDIRQ_WMASK, 0);
+	else if (in_softirq())
+		_fair_read_lock_ctx(rwlock, SOFTIRQ_ROFFSET, SOFTIRQ_WMASK, 0);
+	else {
+		preempt_disable();
+		_fair_read_lock_ctx(rwlock, THREAD_ROFFSET, SUBSCRIBERS_WMASK,
+					0);
+	}
+}
+EXPORT_SYMBOL_GPL(fair_read_lock);
+
+int fair_read_trylock(struct fair_rwlock *rwlock)
+{
+	int ret;
+
+	if (in_irq())
+		return _fair_read_lock_ctx(rwlock,
+					HARDIRQ_ROFFSET, HARDIRQ_WMASK, 1);
+	else if (in_softirq())
+		return _fair_read_lock_ctx(rwlock,
+					SOFTIRQ_ROFFSET, SOFTIRQ_WMASK, 1);
+	else {
+		preempt_disable();
+		ret = _fair_read_lock_ctx(rwlock,
+					THREAD_ROFFSET, SUBSCRIBERS_WMASK, 1);
+		if (!ret)
+			preempt_enable();
+		return ret;
+	}
+}
+EXPORT_SYMBOL_GPL(fair_read_trylock);
+
+void fair_read_unlock(struct fair_rwlock *rwlock)
+{
+	/* atomic_long_sub orders reads */
+	if (in_irq())
+		atomic_long_sub(HARDIRQ_ROFFSET, &rwlock->value);
+	else if (in_softirq())
+		atomic_long_sub(SOFTIRQ_ROFFSET, &rwlock->value);
+	else {
+		atomic_long_sub(THREAD_ROFFSET, &rwlock->value);
+		preempt_enable();
+	}
+}
+EXPORT_SYMBOL_GPL(fair_read_unlock);
+
+/* Writer lock */
+
+/*
+ * Lock out a specific execution context from the read lock. Wait for both the
+ * rmask and the wmask to be empty before proceeding to take the lock.
+ */
+static void _fair_write_lock_ctx_wait(struct fair_rwlock *rwlock,
+		long value, long rmask, long wmask)
+{
+	long newvalue;
+
+	for (;;) {
+		if (value & (rmask | wmask)) {
+			/* Order value reads */
+			smp_rmb();
+			value = atomic_long_read(&rwlock->value);
+			continue;
+		}
+		newvalue = atomic_long_cmpxchg(&rwlock->value,
+				value, value | wmask);
+		if (likely(newvalue == value))
+			break;
+		else
+			value = newvalue;
+	}
+	printk_dbg("lib writer got in with value %lX, new %lX, rmask %lX\n",
+		value, value | wmask, rmask);
+}
+
+/*
+ * Lock out a specific execution context from the read lock. First lock the read
+ * context out of the lock, then wait for every reader to exit their critical
+ * section.
+ */
+static void _fair_write_lock_ctx_force(struct fair_rwlock *rwlock,
+		long rmask, long woffset)
+{
+	long value;
+
+	atomic_long_add(woffset, &rwlock->value);
+	do {
+		value = atomic_long_read(&rwlock->value);
+		/* Order rwlock->value read wrt following reads */
+		smp_rmb();
+	} while (value & rmask);
+	printk_dbg("lib writer got in with value %lX, woffset %lX, rmask %lX\n",
+		value, woffset, rmask);
+}
+
+void fair_write_subscribe(struct fair_rwlock *rwlock)
+{
+	preempt_disable();
+
+	/* lock out threads */
+	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
+}
+EXPORT_SYMBOL_GPL(fair_write_subscribe);
+
+/*
+ * Uncontended fastpath.
+ */
+static long
+fair_write_trylock_irq_uncontended(struct fair_rwlock *rwlock,
+	long expect_sub, long sub_woffset)
+{
+	long value;
+
+	/* no other reader nor writer present, try to take the lock */
+	local_bh_disable();
+	local_irq_disable();
+	value = atomic_long_cmpxchg(&rwlock->value, expect_sub,
+			expect_sub + (sub_woffset | SOFTIRQ_WOFFSET
+				| HARDIRQ_WOFFSET | WRITER_MUTEX));
+	if (unlikely(value != expect_sub)) {
+		local_irq_enable();
+		local_bh_enable();
+	}
+	return value;
+}
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against irq/softirq/thread readers.
+ */
+void fair_write_lock_irq(struct fair_rwlock *rwlock)
+{
+	long value;
+
+	preempt_disable();
+	value = fair_write_trylock_irq_uncontended(rwlock,
+			0, SUBSCRIBERS_WOFFSET);
+	if (likely(!value))
+		return;
+
+	/* lock out threads */
+	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
+
+	/* lock out other writers when no reader threads left */
+	_fair_write_lock_ctx_wait(rwlock, value + SUBSCRIBERS_WOFFSET,
+		THREAD_RMASK, WRITER_MUTEX);
+
+	/* lock out softirqs */
+	local_bh_disable();
+	_fair_write_lock_ctx_force(rwlock, SOFTIRQ_RMASK, SOFTIRQ_WOFFSET);
+
+	/* lock out hardirqs */
+	local_irq_disable();
+	_fair_write_lock_ctx_force(rwlock, HARDIRQ_RMASK, HARDIRQ_WOFFSET);
+
+	/* atomic_long_cmpxchg orders writes */
+}
+EXPORT_SYMBOL_GPL(fair_write_lock_irq);
+
+/*
+ * Must already be subscribed. Will fail if there is contention caused by
+ * interrupt readers, softirq readers, other writers. Will also fail if there is
+ * concurrent cmpxchg update, even if it is only a writer subscription. It is
+ * not considered as a common case and therefore does not justify looping.
+ *
+ * Second cmpxchg deals with other write subscribers.
+ */
+int fair_write_trylock_subscribed_irq(struct fair_rwlock *rwlock)
+{
+	long value;
+
+	value = fair_write_trylock_irq_uncontended(rwlock,
+			SUBSCRIBERS_WOFFSET, 0);
+	if (likely(value == SUBSCRIBERS_WOFFSET))
+		return 1;
+
+	if (likely(!(value & ~SUBSCRIBERS_WMASK))) {
+		/* no other reader nor writer present, try to take the lock */
+		local_bh_disable();
+		local_irq_disable();
+		if (likely(atomic_long_cmpxchg(&rwlock->value, value,
+				value + (SOFTIRQ_WOFFSET
+					| HARDIRQ_WOFFSET | WRITER_MUTEX))
+						== value))
+			return 1;
+		local_irq_enable();
+		local_bh_enable();
+	}
+	return 0;
+
+}
+EXPORT_SYMBOL_GPL(fair_write_trylock_subscribed_irq);
+
+void fair_write_unlock_irq(struct fair_rwlock *rwlock)
+{
+	/*
+	 * atomic_long_sub makes sure we commit the data before reenabling
+	 * the lock.
+	 */
+	atomic_long_sub(HARDIRQ_WOFFSET | SOFTIRQ_WOFFSET
+			| WRITER_MUTEX | SUBSCRIBERS_WOFFSET,
+			&rwlock->value);
+	local_irq_enable();
+	local_bh_enable();
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(fair_write_unlock_irq);
+
+/*
+ * Uncontended fastpath.
+ */
+static long fair_write_trylock_bh_uncontended(struct fair_rwlock *rwlock,
+	long expect_sub, long sub_woffset)
+{
+	long value;
+
+	/* no other reader nor writer present, try to take the lock */
+	local_bh_disable();
+	value = atomic_long_cmpxchg(&rwlock->value, expect_sub,
+		expect_sub + (sub_woffset | SOFTIRQ_WOFFSET | WRITER_MUTEX));
+	if (unlikely(value != expect_sub))
+		local_bh_enable();
+	return value;
+}
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against softirq/thread readers.
+ */
+void fair_write_lock_bh(struct fair_rwlock *rwlock)
+{
+	long value;
+
+	preempt_disable();
+	value = fair_write_trylock_bh_uncontended(rwlock,
+			0, SUBSCRIBERS_WOFFSET);
+	if (likely(!value))
+		return;
+
+	/* lock out threads */
+	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
+
+	/* lock out other writers when no reader threads left */
+	_fair_write_lock_ctx_wait(rwlock, value + SUBSCRIBERS_WOFFSET,
+		THREAD_RMASK, WRITER_MUTEX);
+
+	/* lock out softirqs */
+	local_bh_disable();
+	_fair_write_lock_ctx_force(rwlock, SOFTIRQ_RMASK, SOFTIRQ_WOFFSET);
+
+	/* atomic_long_cmpxchg orders writes */
+}
+EXPORT_SYMBOL_GPL(fair_write_lock_bh);
+
+/*
+ * Must already be subscribed. Will fail if there is contention caused by
+ * softirq readers, other writers. Will also fail if there is concurrent cmpxchg
+ * update, even if it is only a writer subscription. It is not considered as a
+ * common case and therefore does not justify looping.
+ */
+int fair_write_trylock_subscribed_bh(struct fair_rwlock *rwlock)
+{
+	long value;
+
+	value = fair_write_trylock_bh_uncontended(rwlock,
+			SUBSCRIBERS_WOFFSET, 0);
+	if (likely(value == SUBSCRIBERS_WOFFSET))
+		return 1;
+
+	if (likely(!(value & ~SUBSCRIBERS_WMASK))) {
+		/* no other reader nor writer present, try to take the lock */
+		local_bh_disable();
+		local_irq_disable();
+		if (likely(atomic_long_cmpxchg(&rwlock->value, value,
+				value + (SOFTIRQ_WOFFSET | WRITER_MUTEX))
+						== value))
+			return 1;
+		local_irq_enable();
+		local_bh_enable();
+	}
+	return 0;
+
+}
+EXPORT_SYMBOL_GPL(fair_write_trylock_subscribed_bh);
+
+void fair_write_unlock_bh(struct fair_rwlock *rwlock)
+{
+	/*
+	 * atomic_long_sub makes sure we commit the data before reenabling
+	 * the lock.
+	 */
+	atomic_long_sub(SOFTIRQ_WOFFSET | WRITER_MUTEX | SUBSCRIBERS_WOFFSET,
+			&rwlock->value);
+	local_bh_enable();
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(fair_write_unlock_bh);
+
+/*
+ * Uncontended fastpath.
+ */
+static long fair_write_trylock_uncontended(struct fair_rwlock *rwlock,
+	long expect_sub, long sub_woffset)
+{
+	/* no other reader nor writer present, try to take the lock */
+	return atomic_long_cmpxchg(&rwlock->value, expect_sub,
+					(expect_sub + sub_woffset)
+					| WRITER_MUTEX);
+}
+
+/*
+ * Safe against other writers in thread context.
+ * Safe against thread readers.
+ */
+void fair_write_lock(struct fair_rwlock *rwlock)
+{
+	long value;
+
+	preempt_disable();
+	value = fair_write_trylock_uncontended(rwlock, 0, SUBSCRIBERS_WOFFSET);
+	if (likely(!value))
+		return;
+
+	/* lock out threads */
+	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
+
+	/* lock out other writers when no reader threads left */
+	_fair_write_lock_ctx_wait(rwlock, value + SUBSCRIBERS_WOFFSET,
+		THREAD_RMASK, WRITER_MUTEX);
+
+	/* atomic_long_cmpxchg orders writes */
+}
+EXPORT_SYMBOL_GPL(fair_write_lock);
+
+/*
+ * Must already be subscribed. Will fail if there is contention caused by
+ * other writers. Will also fail if there is concurrent cmpxchg update, even if
+ * it is only a writer subscription. It is not considered as a common case and
+ * therefore does not justify looping.
+ */
+int fair_write_trylock_subscribed(struct fair_rwlock *rwlock)
+{
+	long value;
+
+	value = fair_write_trylock_uncontended(rwlock, SUBSCRIBERS_WOFFSET, 0);
+	if (likely(value == SUBSCRIBERS_WOFFSET))
+		return 1;
+
+	if (likely(!(value & ~SUBSCRIBERS_WMASK))) {
+		/* no other reader nor writer present, try to take the lock */
+		local_bh_disable();
+		local_irq_disable();
+		if (likely(atomic_long_cmpxchg(&rwlock->value, value,
+				value | WRITER_MUTEX) == value))
+			return 1;
+		local_irq_enable();
+		local_bh_enable();
+	}
+	return 0;
+
+}
+EXPORT_SYMBOL_GPL(fair_write_trylock_subscribed);
+
+void fair_write_unlock(struct fair_rwlock *rwlock)
+{
+	/*
+	 * atomic_long_sub makes sure we commit the data before reenabling
+	 * the lock.
+	 */
+	atomic_long_sub(WRITER_MUTEX | SUBSCRIBERS_WOFFSET, &rwlock->value);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(fair_write_unlock);
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Fair low-latency rwlock v5 (updated benchmarks)
  2008-08-17 19:10               ` [RFC PATCH] Fair low-latency rwlock v5 Mathieu Desnoyers
@ 2008-08-17 21:30                 ` Mathieu Desnoyers
  2008-08-18 18:59                 ` [RFC PATCH] Fair low-latency rwlock v5 Linus Torvalds
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Mathieu Desnoyers @ 2008-08-17 21:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, Andrew Morton, Ingo Molnar,
	Joe Perches, Paul E. McKenney, linux-kernel

Here are some updated benchmarks of the rwlock v5, showing [min,avg,max]
delays in non-contended and contended cases :

The test module is available at :

http://ltt.polymtl.ca/svn/trunk/tests/kernel/test-fair-rwlock.c

** Performance tests

Dual quad-core Xeon 2.0GHz E5405

* Lock contention delays, per context, 60s test

** get_cycles calibration **
get_cycles takes [min,avg,max] 78,79,84 cycles, results calibrated on avg

** Single writer test, no contention **
writer_thread/0 iterations : 964054, lock delay [min,avg,max] 149,157,5933 cycles

** Single reader test, no contention **
reader_thread/0 iterations : 37591763, lock delay [min,avg,max] 59,65,21383 cycles

** Multiple readers test, no contention **
reader_thread/0 iterations : 9062919, lock delay [min,avg,max] 59,921,42593 cycles
reader_thread/1 iterations : 9014846, lock delay [min,avg,max] 59,928,42065 cycles
reader_thread/2 iterations : 9394114, lock delay [min,avg,max] 59,897,46877 cycles
reader_thread/3 iterations : 9459081, lock delay [min,avg,max] 59,896,38207 cycles

** High contention test **
4 thread readers (no delay loop)
2 thread trylock readers (no delay loop)
2 thread writers (10us period)
2 thread trylock writers (10us period)
1 periodical interrupt readers on 7/8 cpus (IPIs).
1 periodical interrupt trylock readers on 7/8 cpus (IPIs).

writer_thread/0 iterations : 2864146, lock delay [min,avg,max] 179,9493,102563 cycles
writer_thread/1 iterations : 2986492, lock delay [min,avg,max] 179,9332,66923 cycles
trylock_writer_thread/0 iterations : 12789892, successful iterations : 3257203
trylock_writer_thread/1 iterations : 12747023, successful iterations : 3268997
reader_thread/0 iterations : 13091562, lock delay [min,avg,max] 59,5806,141053 cycles
reader_thread/1 iterations : 12574027, lock delay [min,avg,max] 59,5706,141839 cycles
reader_thread/2 iterations : 12805706, lock delay [min,avg,max] 59,5738,138725 cycles
reader_thread/3 iterations : 13352731, lock delay [min,avg,max] 59,5606,137585 cycles
trylock_reader_thread/0 iterations : 63135422, successful iterations : 15204229
trylock_reader_thread/1 iterations : 62306024, successful iterations : 15123824
interrupt_reader_thread/0 iterations : 559
interrupt readers on CPU 0, lock delay [min,avg,max] 167,984,16271 cycles
interrupt readers on CPU 1, lock delay [min,avg,max] 179,1185,9125 cycles
interrupt readers on CPU 2, lock delay [min,avg,max] 155,940,8081 cycles
interrupt readers on CPU 3, lock delay [min,avg,max] 83,1158,13355 cycles
interrupt readers on CPU 4, lock delay [min,avg,max] 77,863,6371 cycles
interrupt readers on CPU 5, lock delay [min,avg,max] 173,1169,16331 cycles
interrupt readers on CPU 6, lock delay [min,avg,max] 155,834,8705 cycles
interrupt readers on CPU 7, lock delay [min,avg,max] 173,1159,10115 cycles
trylock_interrupt_reader_thread/0 iterations : 570
trylock interrupt readers on CPU 0, iterations 280, successful iterations : 239
trylock interrupt readers on CPU 1, iterations 590, successful iterations : 538
trylock interrupt readers on CPU 2, iterations 431, successful iterations : 372
trylock interrupt readers on CPU 3, iterations 605, successful iterations : 570
trylock interrupt readers on CPU 4, iterations 659, successful iterations : 570
trylock interrupt readers on CPU 5, iterations 616, successful iterations : 570
trylock interrupt readers on CPU 6, iterations 675, successful iterations : 570
trylock interrupt readers on CPU 7, iterations 597, successful iterations : 561

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Fair low-latency rwlock v5
  2008-08-17 19:10               ` [RFC PATCH] Fair low-latency rwlock v5 Mathieu Desnoyers
  2008-08-17 21:30                 ` [RFC PATCH] Fair low-latency rwlock v5 (updated benchmarks) Mathieu Desnoyers
@ 2008-08-18 18:59                 ` Linus Torvalds
  2008-08-18 23:25                 ` Paul E. McKenney
  2008-08-25 19:20                 ` [RFC PATCH] Fair low-latency rwlock v5 Peter Zijlstra
  3 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2008-08-18 18:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, Andrew Morton, Ingo Molnar,
	Joe Perches, Paul E. McKenney, linux-kernel



On Sun, 17 Aug 2008, Mathieu Desnoyers wrote:
> 
> Ah, you are right. This new version implements a "single cmpxchg"
> uncontended code path, changes the _fast semantic for "trylock
> uncontended" and also adds the trylock primitives.

Can you also finally

 - separate out the fastpaths more clearly. The slowpaths should not be 
   even visible in the I$ footprint. The fastpaths are probably better off 
   as a separate "fastpath.S" file to make sure that the compiler doesn't 
   inline the slowpath or do other insane things.

   If the fastpath isn't just a few instructions, there's something wrong. 
   It should be written in assembly, so that it's very clear what the 
   fastpath is. Because most of the time, the fastpath is all that really 
   matters (at least as long as there is some reasonable slow-path and 
   contention behaviour - the slowpath matters in that it shouldn't ever 
   be a _problem_).

 - don't call these "fair". They may be fairER than the regular rwlocks,
   but you'd really need to explain that, and true fairness you (a) can't
   really get without explicit queueing and (b) probably don't even want, 
   because the only starvation people tend to worry about is reader vs 
   writer, and writer-writer fairness tends to be unimportant. So it's not 
   that people want the (expensive) "fairness" it's really that they want 
   something _reasonably_ fair considering the normal worries.

   (ie if you have so much write activity that you get into write-write 
   fairness worries, you shouldn't be using a rwlock to begin with, so 
   that level of fairness is simply not very interesting).

   People react emotionally and too strongly to words like "fairness" 
   or "freedom". Different people have different ideas on what they mean, 
   and take them to pointless extremes. So just don't use the words, they 
   just detract from the real issue.

Also, the performance numbers on their own are pretty irrelevant, since 
there's nothing to compare them to. It would be much more relevant if you 
had a "this is what the standard rwlock" does as a comparison for each 
number.

				Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Fair low-latency rwlock v5
  2008-08-17 19:10               ` [RFC PATCH] Fair low-latency rwlock v5 Mathieu Desnoyers
  2008-08-17 21:30                 ` [RFC PATCH] Fair low-latency rwlock v5 (updated benchmarks) Mathieu Desnoyers
  2008-08-18 18:59                 ` [RFC PATCH] Fair low-latency rwlock v5 Linus Torvalds
@ 2008-08-18 23:25                 ` Paul E. McKenney
  2008-08-19  6:04                   ` Mathieu Desnoyers
  2008-08-25 19:20                 ` [RFC PATCH] Fair low-latency rwlock v5 Peter Zijlstra
  3 siblings, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2008-08-18 23:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, H. Peter Anvin, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel

On Sun, Aug 17, 2008 at 03:10:35PM -0400, Mathieu Desnoyers wrote:
> * Linus Torvalds (torvalds@linux-foundation.org) wrote:
> > 
> > 
> > On Sun, 17 Aug 2008, Mathieu Desnoyers wrote:
> > > +/*
> > > + * Uncontended fastpath.
> > > + */
> > > +static int fair_write_lock_irq_fast(struct fair_rwlock *rwlock)
> > 
> > So first off, you should call this "trylock", since it doesn't necessarily 
> > get the lock at all. It has nothing to do with fast.
> > 
> > Secondly:
> > 
> > > +	value = atomic_long_read(&rwlock->value);
> > > +	if (likely(!value)) {
> > > +		/* no other reader nor writer present, try to take the lock */
> > > +		local_bh_disable();
> > > +		local_irq_disable();
> > > +		if (likely(atomic_long_cmpxchg(&rwlock->value, value,
> > 
> > This is actually potentially very slow. 
> > 
> > Why? If the lock is uncontended, but is not in the current CPU's caches, 
> > the read -> rmw operation generates multiple cache coherency protocol 
> > events. First it gets the line in shared mode (for the read), and then 
> > later it turns it into exclusive mode.
> > 
> > So if it's likely that the value is zero (or even if it's just the only 
> > case we really care about), then you really should do the
> > 
> > 	atomic_long_cmpxchg(&rwlock->value, 0, newvalue);
> > 
> > thing as the _first_ access to the lock.
> > 
> > Yeah, yeah, that means that you need to do the local_bh_disable etc first 
> > too, and undo it if it fails, but the failure path should be the unusual 
> > one. 
> > 
> > 			Linus
> 
> Ah, you are right. This new version implements a "single cmpxchg"
> uncontended code path, changes the _fast semantic for "trylock
> uncontended" and also adds the trylock primitives.

Interesting approach!  Questions and comments interspersed below.

						Thanx, Paul

> Thanks for the input,
> 
> Mathieu
> 
> 
> Fair low-latency rwlock v5
> 
> Fair low-latency rwlock provides fairness for writers against reader threads,
> but lets softirq and irq readers in even when there are subscribed writers
> waiting for the lock. Writers could starve reader threads.
> 
> Changelog since v4 :
> rename _fast -> trylock uncontended
> Remove the read from the uncontended read and write cases : just do a cmpxchg
> expecting "sub_expect", 0 for "lock", 1 subscriber for a "trylock".
> Use the value returned by the first cmpxchg as following value instead of doing
> an unnecessary read.
> 
> Here is why read + rmw is wrong (quoting Linus) :
> 
> "This is actually potentially very slow.
> 
> Why? If the lock is uncontended, but is not in the current CPU's caches,
> the read -> rmw operation generates multiple cache coherency protocol
> events. First it gets the line in shared mode (for the read), and then
> later it turns it into exclusive mode."
> 
> Changelog since v3 :
> Added writer subscription and fair trylock. The writer subscription keeps the
> reader threads out of the lock.
> 
> Changelog since v2 :
> Add writer fairness in addition to fairness wrt interrupts and softirqs.
> Added contention delays performance tests for thread and interrupt contexts to
> changelog.
> 
> Changelog since v1 :
> - No more spinlock to protect against concurrent writes, it is done within
>   the existing atomic variable.
> - Write fastpath with a single atomic op.
> 
> * Linus Torvalds (torvalds@linux-foundation.org) wrote:
> > 
> > 
> > On Sat, 16 Aug 2008, Mathieu Desnoyers wrote:
> > > 
> > > I have hit this problem when tying to implement a better rwlock design
> > > than is currently in the mainline kernel (I know the RT kernel has a
> > > hard time with rwlocks)
> > 
> > Have you looked at my sleping rwlock trial thing?
> > 
> [...]
> >  - and because it is designed for sleeping, I'm pretty sure that you can 
> >    easily drop interrupts in the contention path, to make 
> >    write_lock_irq[save]() be reasonable.
> > 
> > In particular, the third bullet is the important one: because it's 
> > designed to have a "contention" path that has _extra_ information for the 
> > contended case, you could literally make the extra information have things 
> > like a list of pending writers, so that you can drop interrupts on one 
> > CPU, while you adding information to let the reader side know that if the 
> > read-lock happens on that CPU, it needs to be able to continue in order to 
> > not deadlock.
> > 
> > 		Linus
> 
> No, I just had a look at it, thanks for the pointer!
> 
> Tweakable contention behavior seems interesting, but I don't think it
> deals with the fact that on a mainline kernel, when an interrupt handler
> comes in and asks for a read lock, it has to get it on the spot. (RT
> kernels can get away with that using threaded threaded interrupts, but
> that's a completely different scheme). Therefore, the impact is that
> interrupts must be disabled around write lock usage, and we end up in
> the situation where this interrupt disable section can last for a long
> time, given it waits for every readers (including ones which does not
> disable interrupts nor softirqs) to complete.
> 
> Actually, I just used LTTng traces and eventually made a small patch to
> lockdep to detect whenever a spinlock or a rwlock is used both with
> interrupts enabled and disabled. Those sites are likely to produce very
> high latencies and should IMHO be considered as bogus. The basic bogus
> scenario is to have a spinlock held on CPU A with interrupts enabled
> being interrupted and then a softirq runs. On CPU B, the same lock is
> acquired with interrupts off. We therefore disable interrupts on CPU B
> for the duration of the softirq currently running on the CPU A, which is
> really not something that helps keeping short latencies. My preliminary
> results shows that there are a lot of inconsistent spinlock/rwlock irq
> on/off uses in the kernel.
> 
> This kind of scenario is pretty easy to fix for spinlocks (either move
> the interrupt disable within the spinlock section if the spinlock is
> never used by an interrupt handler or make sure that every users has
> interrupts disabled).
> 
> The problem comes with rwlocks : it is correct to have readers both with
> and without irq disable, even when interrupt handlers use the read lock.
> However, the write lock has to disable interrupt in that case, and we
> suffer from the high latency I pointed out. The tasklist_lock is the
> perfect example of this. In the following patch, I try to address this
> issue.
> 
> The core idea is this :
> 
> This "fair" rwlock writer subscribes to the lock, which locks out the 
> reader threads. Then, it takes waits until all reader threads exited their
> critical section and takes the mutex (1 bit within the "long"). Then it
> disables softirqs, locks out the softirqs, waits for all softirqs to exit their
> critical section, disables irqs and locks out irqs. It then waits for irqs to
> exit their critical section. Only then is the writer allowed to modify the data
> structure.
> 
> The writer fast path checks for a non-contended lock (all bits set to 0) and
> does an atomic cmpxchg to set the subscription, mutex, softirq exclusion and irq
> exclusion bits.
> 
> The reader does an atomic cmpxchg to check if there is a subscribed writer. If
> not, it increments the reader count for its context (thread, softirq, irq).
> 
> The test module is available at :
> 
> http://ltt.polymtl.ca/svn/trunk/tests/kernel/test-fair-rwlock.c
> 
> ** Performance tests
> 
> Dual quad-core Xeon 2.0GHz E5405
> 
> * Lock contention delays, per context, 60s test
> 
> 6 thread readers (no delay loop)
> 3 thread writers (10us period)
> 2 periodical interrupt readers on 7/8 cpus (IPIs).
> 
> writer_thread/0 iterations : 2980374, max contention 1007442 cycles
> writer_thread/1 iterations : 2992675, max contention 917364 cycles
> trylock_writer_thread/0 iterations : 13533386, successful iterations : 3357015
> trylock_writer_thread/1 iterations : 13310980, successful iterations : 3278394
> reader_thread/0 iterations : 13624664, max contention 1296450 cycles
> reader_thread/1 iterations : 12183042, max contention 1524810 cycles
> reader_thread/2 iterations : 13302985, max contention 1017834 cycles
> reader_thread/3 iterations : 13398307, max contention 1016700 cycles
> trylock_reader_thread/0 iterations : 60849620, successful iterations : 14708736
> trylock_reader_thread/1 iterations : 63830891, successful iterations : 15827695
> interrupt_reader_thread/0 iterations : 559
> interrupt readers on CPU 0, max contention : 13914 cycles
> interrupt readers on CPU 1, max contention : 16446 cycles
> interrupt readers on CPU 2, max contention : 11604 cycles
> interrupt readers on CPU 3, max contention : 14124 cycles
> interrupt readers on CPU 4, max contention : 16494 cycles
> interrupt readers on CPU 5, max contention : 12930 cycles
> interrupt readers on CPU 6, max contention : 9432 cycles
> interrupt readers on CPU 7, max contention : 14514 cycles
> trylock_interrupt_reader_thread/0 iterations : 573
> trylock interrupt readers on CPU 0, iterations 1249, successful iterations : 1135
> trylock interrupt readers on CPU 1, iterations 1152, successful iterations : 1064
> trylock interrupt readers on CPU 2, iterations 1239, successful iterations : 1135
> trylock interrupt readers on CPU 3, iterations 1227, successful iterations : 1130
> trylock interrupt readers on CPU 4, iterations 1029, successful iterations : 908
> trylock interrupt readers on CPU 5, iterations 1161, successful iterations : 1063
> trylock interrupt readers on CPU 6, iterations 612, successful iterations : 578
> trylock interrupt readers on CPU 7, iterations 1007, successful iterations : 932
> 
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> CC: Jeremy Fitzhardinge <jeremy@goop.org>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> ---
>  include/linux/fair-rwlock.h |   61 ++++
>  lib/Makefile                |    2 
>  lib/fair-rwlock.c           |  557 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 620 insertions(+)
> 
> Index: linux-2.6-lttng/include/linux/fair-rwlock.h
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6-lttng/include/linux/fair-rwlock.h	2008-08-17 11:48:42.000000000 -0400
> @@ -0,0 +1,61 @@
> +#ifndef _LINUX_FAIR_RWLOCK_H
> +#define _LINUX_FAIR_RWLOCK_H
> +
> +/*
> + * Fair low-latency rwlock
> + *
> + * Allows writer fairness wrt readers and also minimally impact the irq latency
> + * of the system.
> + *
> + * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> + * August 2008
> + */
> +
> +#include <asm/atomic.h>
> +
> +struct fair_rwlock {
> +	atomic_long_t value;
> +};
> +
> +/* Reader lock */
> +
> +/*
> + * many readers, from irq/softirq/thread context.
> + * protects against writers.
> + */
> +void fair_read_lock(struct fair_rwlock *rwlock);
> +int fair_read_trylock(struct fair_rwlock *rwlock);
> +void fair_read_unlock(struct fair_rwlock *rwlock);
> +
> +/* Writer Lock */
> +
> +/*
> + * Subscription for write lock. Must be used for write trylock only.
> + * Insures fairness wrt reader threads.
> + */
> +void fair_write_subscribe(struct fair_rwlock *rwlock);
> +
> +/*
> + * Safe against other writers in thread context.
> + * Safe against irq/softirq/thread readers.
> + */
> +void fair_write_lock_irq(struct fair_rwlock *rwlock);
> +int fair_write_trylock_subscribed_irq(struct fair_rwlock *rwlock);
> +void fair_write_unlock_irq(struct fair_rwlock *rwlock);
> +
> +/*
> + * Safe against other writers in thread context.
> + * Safe against softirq/thread readers.
> + */
> +void fair_write_lock_bh(struct fair_rwlock *rwlock);
> +int fair_write_trylock_subscribed_bh(struct fair_rwlock *rwlock);
> +void fair_write_unlock_bh(struct fair_rwlock *rwlock);
> +
> +/*
> + * Safe against other writers in thread context.
> + * Safe against thread readers.
> + */
> +void fair_write_lock(struct fair_rwlock *rwlock);
> +int fair_write_trylock_subscribed(struct fair_rwlock *rwlock);
> +void fair_write_unlock(struct fair_rwlock *rwlock);
> +#endif /* _LINUX_FAIR_RWLOCK_H */
> Index: linux-2.6-lttng/lib/Makefile
> ===================================================================
> --- linux-2.6-lttng.orig/lib/Makefile	2008-08-16 03:22:14.000000000 -0400
> +++ linux-2.6-lttng/lib/Makefile	2008-08-16 03:29:12.000000000 -0400
> @@ -43,6 +43,8 @@ obj-$(CONFIG_DEBUG_PREEMPT) += smp_proce
>  obj-$(CONFIG_DEBUG_LIST) += list_debug.o
>  obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o
> 
> +obj-y += fair-rwlock.o
> +
>  ifneq ($(CONFIG_HAVE_DEC_LOCK),y)
>    lib-y += dec_and_lock.o
>  endif
> Index: linux-2.6-lttng/lib/fair-rwlock.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6-lttng/lib/fair-rwlock.c	2008-08-17 14:20:38.000000000 -0400
> @@ -0,0 +1,557 @@
> +/*
> + * Fair low-latency rwlock
> + *
> + * Allows writer fairness wrt readers and also minimally impact the irq latency
> + * of the system.
> + *
> + * A typical case leading to long interrupt latencies :
> + *
> + * - rwlock shared between
> + *   - Rare update in thread context
> + *   - Frequent slow read in thread context (task list iteration)
> + *   - Fast interrupt handler read
> + *
> + * The slow write must therefore disable interrupts around the write lock,
> + * but will therefore add up to the global interrupt latency; worse case being
> + * the duration of the slow read.
> + *
> + * This "fair" rwlock writer subscribes to the lock, which locks out the reader
> + * threads. Then, it takes waits until all reader threads exited their critical
> + * section and takes the mutex (1 bit within the "long"). Then it disables
> + * softirqs, locks out the softirqs, waits for all softirqs to exit their
> + * critical section, disables irqs and locks out irqs. It then waits for irqs to
> + * exit their critical section. Only then is the writer allowed to modify the
> + * data structure.
> + *
> + * The writer fast path checks for a non-contended lock (all bits set to 0) and
> + * does an atomic cmpxchg to set the subscription, mutex, softirq exclusion and
> + * irq exclusion bits.
> + *
> + * The reader does an atomic cmpxchg to check if there is a subscribed writer.
> + * If not, it increments the reader count for its context (thread, softirq,
> + * irq).
> + *
> + * rwlock bits :
> + *
> + * - bits 0 .. log_2(NR_CPUS)-1) are the thread readers count
> + *   (max number of threads : NR_CPUS)
> + * - bits log_2(NR_CPUS) .. 2*log_2(NR_CPUS)-1 are the softirq readers count
> + *   (max # of softirqs: NR_CPUS)
> + * - bits 2*log_2(NR_CPUS) .. 3*log_2(NR_CPUS)-1 are the hardirq readers count
> + *   (max # of hardirqs: NR_CPUS)
> + *
> + * - bits 3*log_2(NR_CPUS) .. 4*log_2(NR_CPUS)-1 are the writer subscribers
> + *   count. Locks against reader threads if non zero.
> + *   (max # of writer subscribers : NR_CPUS)
> + * - bit 4*log_2(NR_CPUS) is the write mutex
> + * - bit 4*log_2(NR_CPUS)+1 is the writer lock against softirqs
> + * - bit 4*log_2(NR_CPUS)+2 is the writer lock against hardirqs

OK...  64 bits less three for the writer bits is 61 bits, divided by
four gives us 15 bits, which limits us to 32,768 CPUs (16,384 if we need
a guard bit separating the fields).  This should be enough for the next
few years.

However, if critical sections are to be preemptable (as they need to be
in -rt), then this limit applies to the number of -tasks- rather than the
number of CPUs.  32,768 is IMHO an uncomfortably low ceiling for the
number of tasks in a system, given that my laptop supports several
thousand without breaking a sweat.

So, how about combining the softirq and hardirq bits, giving the range 0
.. 2*log_2(NR_CPUS)-1 over to tasks?  This would give 30 bits, allowing
more than 10^9 tasks, which should be enough for the next few years.
Better yet, why not begin the bit allotment at the high end of the word,
allowing whatever is left over at the bottom for tasks?

Alternatively, "hardirq" and "softirq" runs in process context in -rt,
so the three read-side bitfields should be able to be combined into one
in that case.
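
For reference, a rough sketch of the bit budget being discussed here,
assuming BITS_PER_LONG == 64 and the current four equally-sized count
fields (the EXAMPLE_* names are only for illustration):

/* WRITER_MUTEX, SOFTIRQ_WMASK and HARDIRQ_WMASK are single bits */
#define EXAMPLE_WRITER_FLAG_BITS	3
/* thread, softirq and hardirq reader counts plus the subscriber count */
#define EXAMPLE_COUNT_FIELDS		4
/* (64 - 3) / 4 = 15 bits per field -> at most 32768 CPUs (or tasks) */
#define EXAMPLE_BITS_PER_FIELD	\
	((BITS_PER_LONG - EXAMPLE_WRITER_FLAG_BITS) / EXAMPLE_COUNT_FIELDS)
#define EXAMPLE_MAX_COUNT	(1UL << EXAMPLE_BITS_PER_FIELD)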

> + * e.g. : NR_CPUS = 16
> + *
> + * THREAD_RMASK:      0x0000000f
> + * SOFTIRQ_RMASK:     0x000000f0
> + * HARDIRQ_RMASK:     0x00000f00
> + * SUBSCRIBERS_WMASK: 0x0000f000
> + * WRITER_MUTEX:      0x00010000
> + * SOFTIRQ_WMASK:     0x00020000
> + * HARDIRQ_WMASK:     0x00040000
> + *
> + * Bits usage :
> + *
> + * nr cpus for thread read
> + * nr cpus for softirq read
> + * nr cpus for hardirq read
> + *
> + * IRQ-safe write lock :
> + * nr cpus for write subscribers, disables new thread readers if non zero
> + * if (nr thread read == 0 && write mutex == 0)
> + * 1 bit for write mutex
> + * (softirq off)
> + * 1 bit for softirq exclusion
> + * if (nr softirq read == 0)
> + * (hardirq off)
> + * 1 bit for hardirq exclusion
> + * if (nr hardirq read == 0)
> + * -> locked
> + *
> + * Copyright 2008 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> + */
> +
> +#include <linux/fair-rwlock.h>
> +#include <linux/hardirq.h>
> +#include <linux/module.h>
> +
> +#if (NR_CPUS > 64 && (BITS_PER_LONG == 32 || NR_CPUS > 32768))
> +#error "fair rwlock needs more bits per long to deal with that many CPUs"
> +#endif
> +
> +#define THREAD_ROFFSET	1UL
> +#define THREAD_RMASK	((NR_CPUS - 1) * THREAD_ROFFSET)

NR_CPUS is required to be a power of two?  News to me...

> +#define SOFTIRQ_ROFFSET	(THREAD_RMASK + 1)
> +#define SOFTIRQ_RMASK	((NR_CPUS - 1) * SOFTIRQ_ROFFSET)
> +#define HARDIRQ_ROFFSET	((SOFTIRQ_RMASK | THREAD_RMASK) + 1)
> +#define HARDIRQ_RMASK	((NR_CPUS - 1) * HARDIRQ_ROFFSET)
> +
> +#define SUBSCRIBERS_WOFFSET	\
> +	((HARDIRQ_RMASK | SOFTIRQ_RMASK | THREAD_RMASK) + 1)
> +#define SUBSCRIBERS_WMASK	\
> +	((NR_CPUS - 1) * SUBSCRIBERS_WOFFSET)
> +#define WRITER_MUTEX		\
> +	((SUBSCRIBERS_WMASK | HARDIRQ_RMASK | SOFTIRQ_RMASK | THREAD_RMASK) + 1)
> +#define SOFTIRQ_WMASK	(WRITER_MUTEX << 1)
> +#define SOFTIRQ_WOFFSET	SOFTIRQ_WMASK
> +#define HARDIRQ_WMASK	(SOFTIRQ_WMASK << 1)
> +#define HARDIRQ_WOFFSET	HARDIRQ_WMASK
> +
> +#ifdef FAIR_RWLOCK_DEBUG
> +#define printk_dbg printk
> +#else
> +#define printk_dbg(fmt, args...)
> +#endif
> +
> +
> +/*
> + * Uncontended fastpath.
> + */
> +static long fair_read_trylock_uncontended(struct fair_rwlock *rwlock,
> +	long expect_readers, long roffset)
> +{
> +	/* no other reader nor writer present, try to take the lock */
> +	return atomic_long_cmpxchg(&rwlock->value, expect_readers,
> +					(expect_readers + roffset));
> +}

Suggest inlining this one, as it is just a wrapper.  (Or is gcc smart
enough to figure that out on its own these days?)

> +
> +/*
> + * Reader lock
> + *
> + * First try to get the uncontended lock. If it is non-zero (can be common,
> + * since we allow multiple readers), pass the returned cmpxchg value to the loop
> + * to try to get the reader lock.
> + *
> + * trylock will fail if a writer is subscribed or holds the lock, but will
> + * spin if there is concurency to win the cmpxchg. It could happen if, for
> + * instance, other concurrent reads need to update the roffset or if a
> + * writer updated the lock bits which does not contend us. Since many
> + * concurrent readers is a common case, it makes sense not to fail is it
> + * happens.
> + *
> + * the non-trylock case will spin for both situations.
> + */
> +
> +static int _fair_read_lock_ctx(struct fair_rwlock *rwlock,
> +		long roffset, long wmask, int trylock)
> +{
> +	long value, newvalue;
> +
> +	/*
> +	 * Try uncontended lock.
> +	 */
> +	value = fair_read_trylock_uncontended(rwlock, 0, roffset);
> +	if (likely(value == 0))
> +		return 1;
> +
> +	for (;;) {
> +		if (value & wmask) {
> +			if (!trylock) {
> +				/* Order value reads */
> +				smp_rmb();

I don't understand the need for this smp_rmb() -- what other location
are we maintaining order with?  The CPU must maintain order of accesses
to a single variable (cache coherence requires this).

I suggest replacing the smp_rmb() with a barrier().  Not that optimizing
a spinloop is violently important...
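
Something along these lines for the re-read path, i.e. only a compiler
barrier before re-reading the lock word (sketch of the suggestion only):

		if (value & wmask) {
			if (!trylock) {
				/* compiler barrier; we only re-read the
				 * single lock word */
				barrier();
				value = atomic_long_read(&rwlock->value);
				continue;
			} else
				return 0;
		}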

> +				value = atomic_long_read(&rwlock->value);
> +				continue;
> +			} else
> +				return 0;
> +		}
> +
> +		newvalue = atomic_long_cmpxchg(&rwlock->value,
> +				value, value + roffset);

Why not use fair_read_trylock_uncontended()?  Or equivalently, why not
replace calls to fair_read_trylock_uncontended() with the bare cmpxchg?

> +		if (likely(newvalue == value))
> +			break;
> +		else
> +			value = newvalue;
> +	}
> +
> +	printk_dbg("lib reader got in with value %lX, wmask %lX\n",
> +		value, wmask);
> +	return 1;
> +}
> +
> +/*
> + * many readers, from irq/softirq/thread context.
> + * protects against writers.
> + */
> +void fair_read_lock(struct fair_rwlock *rwlock)
> +{
> +	if (in_irq())
> +		_fair_read_lock_ctx(rwlock, HARDIRQ_ROFFSET, HARDIRQ_WMASK, 0);
> +	else if (in_softirq())
> +		_fair_read_lock_ctx(rwlock, SOFTIRQ_ROFFSET, SOFTIRQ_WMASK, 0);
> +	else {
> +		preempt_disable();
> +		_fair_read_lock_ctx(rwlock, THREAD_ROFFSET, SUBSCRIBERS_WMASK,
> +					0);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(fair_read_lock);
> +
> +int fair_read_trylock(struct fair_rwlock *rwlock)
> +{
> +	int ret;
> +
> +	if (in_irq())
> +		return _fair_read_lock_ctx(rwlock,
> +					HARDIRQ_ROFFSET, HARDIRQ_WMASK, 1);
> +	else if (in_softirq())
> +		return _fair_read_lock_ctx(rwlock,
> +					SOFTIRQ_ROFFSET, SOFTIRQ_WMASK, 1);
> +	else {
> +		preempt_disable();
> +		ret = _fair_read_lock_ctx(rwlock,
> +					THREAD_ROFFSET, SUBSCRIBERS_WMASK, 1);
> +		if (!ret)
> +			preempt_enable();
> +		return ret;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(fair_read_trylock);
> +
> +void fair_read_unlock(struct fair_rwlock *rwlock)
> +{
> +	/* atomic_long_sub orders reads */
> +	if (in_irq())
> +		atomic_long_sub(HARDIRQ_ROFFSET, &rwlock->value);
> +	else if (in_softirq())
> +		atomic_long_sub(SOFTIRQ_ROFFSET, &rwlock->value);
> +	else {
> +		atomic_long_sub(THREAD_ROFFSET, &rwlock->value);
> +		preempt_enable();
> +	}
> +}
> +EXPORT_SYMBOL_GPL(fair_read_unlock);
> +
> +/* Writer lock */
> +
> +/*
> + * Lock out a specific execution context from the read lock. Wait for both the
> + * rmask and the wmask to be empty before proceeding to take the lock.
> + */
> +static void _fair_write_lock_ctx_wait(struct fair_rwlock *rwlock,
> +		long value, long rmask, long wmask)
> +{
> +	long newvalue;
> +
> +	for (;;) {
> +		if (value & (rmask | wmask)) {
> +			/* Order value reads */
> +			smp_rmb();

Suggest replacing the smp_rmb() with barrier() as noted above.

> +			value = atomic_long_read(&rwlock->value);
> +			continue;
> +		}
> +		newvalue = atomic_long_cmpxchg(&rwlock->value,
> +				value, value | wmask);
> +		if (likely(newvalue == value))
> +			break;
> +		else
> +			value = newvalue;
> +	}
> +	printk_dbg("lib writer got in with value %lX, new %lX, rmask %lX\n",
> +		value, value | wmask, rmask);
> +}
> +
> +/*
> + * Lock out a specific execution context from the read lock. First lock the read
> + * context out of the lock, then wait for every readers to exit their critical
> + * section.
> + */
> +static void _fair_write_lock_ctx_force(struct fair_rwlock *rwlock,
> +		long rmask, long woffset)
> +{
> +	long value;
> +
> +	atomic_long_add(woffset, &rwlock->value);
> +	do {
> +		value = atomic_long_read(&rwlock->value);
> +		/* Order rwlock->value read wrt following reads */
> +		smp_rmb();

Suggest replacing the smp_rmb() with barrier() as noted above.

> +	} while (value & rmask);
> +	printk_dbg("lib writer got in with value %lX, woffset %lX, rmask %lX\n",
> +		value, woffset, rmask);
> +}
> +
> +void fair_write_subscribe(struct fair_rwlock *rwlock)
> +{
> +	preempt_disable();
> +
> +	/* lock out threads */
> +	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
> +}
> +EXPORT_SYMBOL_GPL(fair_write_subscribe);
> +
> +/*
> + * Uncontended fastpath.
> + */
> +static long
> +fair_write_trylock_irq_uncontended(struct fair_rwlock *rwlock,
> +	long expect_sub, long sub_woffset)
> +{
> +	long value;
> +
> +	/* no other reader nor writer present, try to take the lock */
> +	local_bh_disable();
> +	local_irq_disable();
> +	value = atomic_long_cmpxchg(&rwlock->value, expect_sub,
> +			expect_sub + (sub_woffset | SOFTIRQ_WOFFSET
> +				| HARDIRQ_WOFFSET | WRITER_MUTEX));
> +	if (unlikely(value != expect_sub)) {
> +		local_irq_enable();
> +		local_bh_enable();
> +	}
> +	return value;
> +}
> +
> +/*
> + * Safe against other writers in thread context.
> + * Safe against irq/softirq/thread readers.
> + */
> +void fair_write_lock_irq(struct fair_rwlock *rwlock)
> +{
> +	long value;
> +
> +	preempt_disable();
> +	value = fair_write_trylock_irq_uncontended(rwlock,
> +			0, SUBSCRIBERS_WOFFSET);
> +	if (likely(!value))
> +		return;
> +
> +	/* lock out threads */
> +	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);

Why don't we use fair_write_subscribe() here?  Ah, because we have
already disabled preemption...

OK, so why do we have fair_write_subscribe() defined in the first place?

> +	/* lock out other writers when no reader threads left */
> +	_fair_write_lock_ctx_wait(rwlock, value + SUBSCRIBERS_WOFFSET,
> +		THREAD_RMASK, WRITER_MUTEX);

OK, I'll bite...  Why don't we also have to wait for the softirq and
hardirq readers?  I would expect THREAD_RMASK|SOFTIRQ_RMASK|HARDIRQ_RMASK
rather than just THREAD_RMASK above.

Never mind, I finally see the following pair of statements.

So, the idea is to allow recursive readers in irq when we also have
readers in the interrupted process context, right?  But isn't the
common case going to be no irq handlers read-holding the lock?  If
so, wouldn't we want to try for all of them at once if we could, using
a state machine?

Also, just to be sure I understand, writers waiting for other writers
would spin in the above _fair_write_lock_ctx_wait(), correct?

> +	/* lock out softirqs */
> +	local_bh_disable();

Why not just local_irq_disable() at this point and be done with it?

> +	_fair_write_lock_ctx_force(rwlock, SOFTIRQ_RMASK, SOFTIRQ_WOFFSET);
> +
> +	/* lock out hardirqs */
> +	local_irq_disable();
> +	_fair_write_lock_ctx_force(rwlock, HARDIRQ_RMASK, HARDIRQ_WOFFSET);
> +
> +	/* atomic_long_cmpxchg orders writes */
> +}
> +EXPORT_SYMBOL_GPL(fair_write_lock_irq);
> +
> +/*
> + * Must already be subscribed. Will fail if there is contention caused by
> + * interrupt readers, softirq readers, other writers. Will also fail if there is
> + * concurrent cmpxchg update, even if it is only a writer subscription. It is
> + * not considered as a common case and therefore does not jusfity looping.

"justify"?  (Sorry, couldn't resist...)

> + *
> + * Second cmpxchg deals with other write subscribers.
> + */
> +int fair_write_trylock_subscribed_irq(struct fair_rwlock *rwlock)
> +{
> +	long value;
> +
> +	value = fair_write_trylock_irq_uncontended(rwlock,
> +			SUBSCRIBERS_WOFFSET, 0);
> +	if (likely(value == SUBSCRIBERS_WOFFSET))
> +		return 1;
> +
> +	if (likely(!(value & ~SUBSCRIBERS_WMASK))) {
> +		/* no other reader nor writer present, try to take the lock */
> +		local_bh_disable();
> +		local_irq_disable();

Doesn't local_irq_disable() imply local_bh_disable()?  So why not just
do local_irq_disable()?

> +		if (likely(atomic_long_cmpxchg(&rwlock->value, value,
> +				value + (SOFTIRQ_WOFFSET
> +					| HARDIRQ_WOFFSET | WRITER_MUTEX))
> +						== value))
> +			return 1;
> +		local_irq_enable();
> +		local_bh_enable();
> +	}
> +	return 0;
> +
> +}
> +EXPORT_SYMBOL_GPL(fair_write_trylock_subscribed_irq);
> +
> +void fair_write_unlock_irq(struct fair_rwlock *rwlock)
> +{
> +	/*
> +	 * atomic_long_sub makes sure we commit the data before reenabling
> +	 * the lock.
> +	 */
> +	atomic_long_sub(HARDIRQ_WOFFSET | SOFTIRQ_WOFFSET
> +			| WRITER_MUTEX | SUBSCRIBERS_WOFFSET,
> +			&rwlock->value);
> +	local_irq_enable();
> +	local_bh_enable();
> +	preempt_enable();
> +}
> +EXPORT_SYMBOL_GPL(fair_write_unlock_irq);
> +
> +/*
> + * Uncontended fastpath.
> + */
> +static long fair_write_trylock_bh_uncontended(struct fair_rwlock *rwlock,
> +	long expect_sub, long sub_woffset)
> +{
> +	long value;
> +
> +	/* no other reader nor writer present, try to take the lock */
> +	local_bh_disable();
> +	value = atomic_long_cmpxchg(&rwlock->value, expect_sub,
> +		expect_sub + (sub_woffset | SOFTIRQ_WOFFSET | WRITER_MUTEX));
> +	if (unlikely(value != expect_sub))
> +		local_bh_enable();
> +	return value;
> +}
> +
> +/*
> + * Safe against other writers in thread context.
> + * Safe against softirq/thread readers.
> + */
> +void fair_write_lock_bh(struct fair_rwlock *rwlock)
> +{
> +	long value;
> +
> +	preempt_disable();
> +	value = fair_write_trylock_bh_uncontended(rwlock,
> +			0, SUBSCRIBERS_WOFFSET);
> +	if (likely(!value))
> +		return;
> +
> +	/* lock out threads */
> +	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
> +
> +	/* lock out other writers when no reader threads left */
> +	_fair_write_lock_ctx_wait(rwlock, value + SUBSCRIBERS_WOFFSET,
> +		THREAD_RMASK, WRITER_MUTEX);
> +
> +	/* lock out softirqs */
> +	local_bh_disable();
> +	_fair_write_lock_ctx_force(rwlock, SOFTIRQ_RMASK, SOFTIRQ_WOFFSET);

Seems like we would want to at least give an error if there were
hardirq readers...  (Can't think of a reasonable use case mixing
fair_write_lock_bh() with hardirq readers, but might well be simply a
momentary failure of imagination.)

> +	/* atomic_long_cmpxchg orders writes */
> +}
> +EXPORT_SYMBOL_GPL(fair_write_lock_bh);
> +
> +/*
> + * Must already be subscribed. Will fail if there is contention caused by
> + * softirq readers, other writers. Will also fail if there is concurrent cmpxchg
> + * update, even if it is only a writer subscription. It is not considered as a
> + * common case and therefore does not jusfity looping.
> + */
> +int fair_write_trylock_subscribed_bh(struct fair_rwlock *rwlock)
> +{
> +	long value;
> +
> +	value = fair_write_trylock_bh_uncontended(rwlock,
> +			SUBSCRIBERS_WOFFSET, 0);
> +	if (likely(value == SUBSCRIBERS_WOFFSET))
> +		return 1;
> +
> +	if (likely(!(value & ~SUBSCRIBERS_WMASK))) {
> +		/* no other reader nor writer present, try to take the lock */
> +		local_bh_disable();
> +		local_irq_disable();
> +		if (likely(atomic_long_cmpxchg(&rwlock->value, value,
> +				value + (SOFTIRQ_WOFFSET | WRITER_MUTEX))
> +						== value))
> +			return 1;
> +		local_irq_enable();
> +		local_bh_enable();
> +	}
> +	return 0;
> +
> +}
> +EXPORT_SYMBOL_GPL(fair_write_trylock_subscribed_bh);
> +
> +void fair_write_unlock_bh(struct fair_rwlock *rwlock)
> +{
> +	/*
> +	 * atomic_long_sub makes sure we commit the data before reenabling
> +	 * the lock.
> +	 */
> +	atomic_long_sub(SOFTIRQ_WOFFSET | WRITER_MUTEX | SUBSCRIBERS_WOFFSET,
> +			&rwlock->value);
> +	local_bh_enable();
> +	preempt_enable();
> +}
> +EXPORT_SYMBOL_GPL(fair_write_unlock_bh);
> +
> +/*
> + * Uncontended fastpath.
> + */
> +static long fair_write_trylock_uncontended(struct fair_rwlock *rwlock,
> +	long expect_sub, long sub_woffset)
> +{
> +	/* no other reader nor writer present, try to take the lock */
> +	return atomic_long_cmpxchg(&rwlock->value, expect_sub,
> +					(expect_sub + sub_woffset)
> +					| WRITER_MUTEX);
> +}
> +
> +/*
> + * Safe against other writers in thread context.
> + * Safe against thread readers.
> + */
> +void fair_write_lock(struct fair_rwlock *rwlock)
> +{
> +	long value;
> +
> +	preempt_disable();
> +	value = fair_write_trylock_uncontended(rwlock, 0, SUBSCRIBERS_WOFFSET);
> +	if (likely(!value))
> +		return;
> +
> +	/* lock out threads */
> +	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
> +
> +	/* lock out other writers when no reader threads left */
> +	_fair_write_lock_ctx_wait(rwlock, value + SUBSCRIBERS_WOFFSET,
> +		THREAD_RMASK, WRITER_MUTEX);

The theory here is that there shouldn't be any readers in bh or irq?
But suppose that we are calling fair_write_lock() with irqs already
disabled?  Wouldn't we still want to wait for bh and irq readers in
that case?

> +	/* atomic_long_cmpxchg orders writes */
> +}
> +EXPORT_SYMBOL_GPL(fair_write_lock);
> +
> +/*
> + * Must already be subscribed. Will fail if there is contention caused by
> + * other writers. Will also fail if there is concurrent cmpxchg update, even if
> + * it is only a writer subscription. It is not considered as a common case and
> + * therefore does not jusfity looping.
> + */
> +int fair_write_trylock_subscribed(struct fair_rwlock *rwlock)
> +{
> +	long value;
> +
> +	value = fair_write_trylock_uncontended(rwlock, SUBSCRIBERS_WOFFSET, 0);
> +	if (likely(value == SUBSCRIBERS_WOFFSET))
> +		return 1;
> +
> +	if (likely(!(value & ~SUBSCRIBERS_WMASK))) {
> +		/* no other reader nor writer present, try to take the lock */
> +		local_bh_disable();
> +		local_irq_disable();
> +		if (likely(atomic_long_cmpxchg(&rwlock->value, value,
> +				value | WRITER_MUTEX) == value))
> +			return 1;
> +		local_irq_enable();
> +		local_bh_enable();
> +	}
> +	return 0;
> +
> +}
> +EXPORT_SYMBOL_GPL(fair_write_trylock_subscribed);

I still find the _subscribed() stuff a bit scary...  Sample use case?

> +void fair_write_unlock(struct fair_rwlock *rwlock)
> +{
> +	/*
> +	 * atomic_long_sub makes sure we commit the data before reenabling
> +	 * the lock.
> +	 */
> +	atomic_long_sub(WRITER_MUTEX | SUBSCRIBERS_WOFFSET, &rwlock->value);
> +	preempt_enable();
> +}
> +EXPORT_SYMBOL_GPL(fair_write_unlock);
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Fair low-latency rwlock v5
  2008-08-18 23:25                 ` Paul E. McKenney
@ 2008-08-19  6:04                   ` Mathieu Desnoyers
  2008-08-19  7:33                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 27+ messages in thread
From: Mathieu Desnoyers @ 2008-08-19  6:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, H. Peter Anvin, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sun, Aug 17, 2008 at 03:10:35PM -0400, Mathieu Desnoyers wrote:
> > * Linus Torvalds (torvalds@linux-foundation.org) wrote:
> > > 
> > > 
> > > On Sun, 17 Aug 2008, Mathieu Desnoyers wrote:
> > > > +/*
> > > > + * Uncontended fastpath.
> > > > + */
> > > > +static int fair_write_lock_irq_fast(struct fair_rwlock *rwlock)
> > > 
> > > So first off, you should call this "trylock", since it doesn't necessarily 
> > > get the lock at all. It has nothing to do with fast.
> > > 
> > > Secondly:
> > > 
> > > > +	value = atomic_long_read(&rwlock->value);
> > > > +	if (likely(!value)) {
> > > > +		/* no other reader nor writer present, try to take the lock */
> > > > +		local_bh_disable();
> > > > +		local_irq_disable();
> > > > +		if (likely(atomic_long_cmpxchg(&rwlock->value, value,
> > > 
> > > This is actually potentially very slow. 
> > > 
> > > Why? If the lock is uncontended, but is not in the current CPU's caches, 
> > > the read -> rmw operation generates multiple cache coherency protocol 
> > > events. First it gets the line in shared mode (for the read), and then 
> > > later it turns it into exclusive mode.
> > > 
> > > So if it's likely that the value is zero (or even if it's just the only 
> > > case we really care about), then you really should do the
> > > 
> > > 	atomic_long_cmpxchg(&rwlock->value, 0, newvalue);
> > > 
> > > thing as the _first_ access to the lock.
> > > 
> > > Yeah, yeah, that means that you need to do the local_bh_disable etc first 
> > > too, and undo it if it fails, but the failure path should be the unusual 
> > > one. 
> > > 
> > > 			Linus
> > 
> > Ah, you are right. This new version implements a "single cmpxchg"
> > uncontended code path, changes the _fast semantic for "trylock
> > uncontended" and also adds the trylock primitives.
> 
> Interesting approach!  Questions and comments interspersed below.
> 
> 						Thanx, Paul
> 
> > Thanks for the input,
> > 
> > Mathieu
> > 
> > 
> > Fair low-latency rwlock v5
> > 
> > Fair low-latency rwlock provides fairness for writers against reader threads,
> > but lets softirq and irq readers in even when there are subscribed writers
> > waiting for the lock. Writers could starve reader threads.
> > 
> > Changelog since v4 :
> > rename _fast -> trylock uncontended
> > Remove the read from the uncontended read and write cases : just do a cmpxchg
> > expecting "sub_expect", 0 for "lock", 1 subscriber for a "trylock".
> > Use the value returned by the first cmpxchg as following value instead of doing
> > an unnecessary read.
> > 
> > Here is why read + rmw is wrong (quoting Linus) :
> > 
> > "This is actually potentially very slow.
> > 
> > Why? If the lock is uncontended, but is not in the current CPU's caches,
> > the read -> rmw operation generates multiple cache coherency protocol
> > events. First it gets the line in shared mode (for the read), and then
> > later it turns it into exclusive mode."
> > 
> > Changelog since v3 :
> > Added writer subscription and fair trylock. The writer subscription keeps the
> > reader threads out of the lock.
> > 
> > Changelog since v2 :
> > Add writer fairness in addition to fairness wrt interrupts and softirqs.
> > Added contention delays performance tests for thread and interrupt contexts to
> > changelog.
> > 
> > Changelog since v1 :
> > - No more spinlock to protect against concurrent writes, it is done within
> >   the existing atomic variable.
> > - Write fastpath with a single atomic op.
> > 
> > * Linus Torvalds (torvalds@linux-foundation.org) wrote:
> > > 
> > > 
> > > On Sat, 16 Aug 2008, Mathieu Desnoyers wrote:
> > > > 
> > > > I have hit this problem when tying to implement a better rwlock design
> > > > than is currently in the mainline kernel (I know the RT kernel has a
> > > > hard time with rwlocks)
> > > 
> > > Have you looked at my sleping rwlock trial thing?
> > > 
> > [...]
> > >  - and because it is designed for sleeping, I'm pretty sure that you can 
> > >    easily drop interrupts in the contention path, to make 
> > >    write_lock_irq[save]() be reasonable.
> > > 
> > > In particular, the third bullet is the important one: because it's 
> > > designed to have a "contention" path that has _extra_ information for the 
> > > contended case, you could literally make the extra information have things 
> > > like a list of pending writers, so that you can drop interrupts on one 
> > > CPU, while you adding information to let the reader side know that if the 
> > > read-lock happens on that CPU, it needs to be able to continue in order to 
> > > not deadlock.
> > > 
> > > 		Linus
> > 
> > No, I just had a look at it, thanks for the pointer!
> > 
> > Tweakable contention behavior seems interesting, but I don't think it
> > deals with the fact that on a mainline kernel, when an interrupt handler
> > comes in and asks for a read lock, it has to get it on the spot. (RT
> > kernels can get away with that using threaded threaded interrupts, but
> > that's a completely different scheme). Therefore, the impact is that
> > interrupts must be disabled around write lock usage, and we end up in
> > the situation where this interrupt disable section can last for a long
> > time, given it waits for every readers (including ones which does not
> > disable interrupts nor softirqs) to complete.
> > 
> > Actually, I just used LTTng traces and eventually made a small patch to
> > lockdep to detect whenever a spinlock or a rwlock is used both with
> > interrupts enabled and disabled. Those sites are likely to produce very
> > high latencies and should IMHO be considered as bogus. The basic bogus
> > scenario is to have a spinlock held on CPU A with interrupts enabled
> > being interrupted and then a softirq runs. On CPU B, the same lock is
> > acquired with interrupts off. We therefore disable interrupts on CPU B
> > for the duration of the softirq currently running on the CPU A, which is
> > really not something that helps keeping short latencies. My preliminary
> > results shows that there are a lot of inconsistent spinlock/rwlock irq
> > on/off uses in the kernel.
> > 
> > This kind of scenario is pretty easy to fix for spinlocks (either move
> > the interrupt disable within the spinlock section if the spinlock is
> > never used by an interrupt handler or make sure that every users has
> > interrupts disabled).
> > 
> > The problem comes with rwlocks : it is correct to have readers both with
> > and without irq disable, even when interrupt handlers use the read lock.
> > However, the write lock has to disable interrupt in that case, and we
> > suffer from the high latency I pointed out. The tasklist_lock is the
> > perfect example of this. In the following patch, I try to address this
> > issue.
> > 
> > The core idea is this :
> > 
> > This "fair" rwlock writer subscribes to the lock, which locks out the 
> > reader threads. Then, it takes waits until all reader threads exited their
> > critical section and takes the mutex (1 bit within the "long"). Then it
> > disables softirqs, locks out the softirqs, waits for all softirqs to exit their
> > critical section, disables irqs and locks out irqs. It then waits for irqs to
> > exit their critical section. Only then is the writer allowed to modify the data
> > structure.
> > 
> > The writer fast path checks for a non-contended lock (all bits set to 0) and
> > does an atomic cmpxchg to set the subscription, mutex, softirq exclusion and irq
> > exclusion bits.
> > 
> > The reader does an atomic cmpxchg to check if there is a subscribed writer. If
> > not, it increments the reader count for its context (thread, softirq, irq).
> > 
> > The test module is available at :
> > 
> > http://ltt.polymtl.ca/svn/trunk/tests/kernel/test-fair-rwlock.c
> > 
> > ** Performance tests
> > 
> > Dual quad-core Xeon 2.0GHz E5405
> > 
> > * Lock contention delays, per context, 60s test
> > 
> > 6 thread readers (no delay loop)
> > 3 thread writers (10us period)
> > 2 periodical interrupt readers on 7/8 cpus (IPIs).
> > 
> > writer_thread/0 iterations : 2980374, max contention 1007442 cycles
> > writer_thread/1 iterations : 2992675, max contention 917364 cycles
> > trylock_writer_thread/0 iterations : 13533386, successful iterations : 3357015
> > trylock_writer_thread/1 iterations : 13310980, successful iterations : 3278394
> > reader_thread/0 iterations : 13624664, max contention 1296450 cycles
> > reader_thread/1 iterations : 12183042, max contention 1524810 cycles
> > reader_thread/2 iterations : 13302985, max contention 1017834 cycles
> > reader_thread/3 iterations : 13398307, max contention 1016700 cycles
> > trylock_reader_thread/0 iterations : 60849620, successful iterations : 14708736
> > trylock_reader_thread/1 iterations : 63830891, successful iterations : 15827695
> > interrupt_reader_thread/0 iterations : 559
> > interrupt readers on CPU 0, max contention : 13914 cycles
> > interrupt readers on CPU 1, max contention : 16446 cycles
> > interrupt readers on CPU 2, max contention : 11604 cycles
> > interrupt readers on CPU 3, max contention : 14124 cycles
> > interrupt readers on CPU 4, max contention : 16494 cycles
> > interrupt readers on CPU 5, max contention : 12930 cycles
> > interrupt readers on CPU 6, max contention : 9432 cycles
> > interrupt readers on CPU 7, max contention : 14514 cycles
> > trylock_interrupt_reader_thread/0 iterations : 573
> > trylock interrupt readers on CPU 0, iterations 1249, successful iterations : 1135
> > trylock interrupt readers on CPU 1, iterations 1152, successful iterations : 1064
> > trylock interrupt readers on CPU 2, iterations 1239, successful iterations : 1135
> > trylock interrupt readers on CPU 3, iterations 1227, successful iterations : 1130
> > trylock interrupt readers on CPU 4, iterations 1029, successful iterations : 908
> > trylock interrupt readers on CPU 5, iterations 1161, successful iterations : 1063
> > trylock interrupt readers on CPU 6, iterations 612, successful iterations : 578
> > trylock interrupt readers on CPU 7, iterations 1007, successful iterations : 932
> > 
> > 
> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > CC: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > CC: Jeremy Fitzhardinge <jeremy@goop.org>
> > CC: Andrew Morton <akpm@linux-foundation.org>
> > CC: Ingo Molnar <mingo@elte.hu>
> > CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > ---
> >  include/linux/fair-rwlock.h |   61 ++++
> >  lib/Makefile                |    2 
> >  lib/fair-rwlock.c           |  557 ++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 620 insertions(+)
> > 
> > Index: linux-2.6-lttng/include/linux/fair-rwlock.h
> > ===================================================================
> > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> > +++ linux-2.6-lttng/include/linux/fair-rwlock.h	2008-08-17 11:48:42.000000000 -0400
> > @@ -0,0 +1,61 @@
> > +#ifndef _LINUX_FAIR_RWLOCK_H
> > +#define _LINUX_FAIR_RWLOCK_H
> > +
> > +/*
> > + * Fair low-latency rwlock
> > + *
> > + * Allows writer fairness wrt readers and also minimally impact the irq latency
> > + * of the system.
> > + *
> > + * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > + * August 2008
> > + */
> > +
> > +#include <asm/atomic.h>
> > +
> > +struct fair_rwlock {
> > +	atomic_long_t value;
> > +};
> > +
> > +/* Reader lock */
> > +
> > +/*
> > + * many readers, from irq/softirq/thread context.
> > + * protects against writers.
> > + */
> > +void fair_read_lock(struct fair_rwlock *rwlock);
> > +int fair_read_trylock(struct fair_rwlock *rwlock);
> > +void fair_read_unlock(struct fair_rwlock *rwlock);
> > +
> > +/* Writer Lock */
> > +
> > +/*
> > + * Subscription for write lock. Must be used for write trylock only.
> > + * Insures fairness wrt reader threads.
> > + */
> > +void fair_write_subscribe(struct fair_rwlock *rwlock);
> > +
> > +/*
> > + * Safe against other writers in thread context.
> > + * Safe against irq/softirq/thread readers.
> > + */
> > +void fair_write_lock_irq(struct fair_rwlock *rwlock);
> > +int fair_write_trylock_subscribed_irq(struct fair_rwlock *rwlock);
> > +void fair_write_unlock_irq(struct fair_rwlock *rwlock);
> > +
> > +/*
> > + * Safe against other writers in thread context.
> > + * Safe against softirq/thread readers.
> > + */
> > +void fair_write_lock_bh(struct fair_rwlock *rwlock);
> > +int fair_write_trylock_subscribed_bh(struct fair_rwlock *rwlock);
> > +void fair_write_unlock_bh(struct fair_rwlock *rwlock);
> > +
> > +/*
> > + * Safe against other writers in thread context.
> > + * Safe against thread readers.
> > + */
> > +void fair_write_lock(struct fair_rwlock *rwlock);
> > +int fair_write_trylock_subscribed(struct fair_rwlock *rwlock);
> > +void fair_write_unlock(struct fair_rwlock *rwlock);
> > +#endif /* _LINUX_FAIR_RWLOCK_H */
> > Index: linux-2.6-lttng/lib/Makefile
> > ===================================================================
> > --- linux-2.6-lttng.orig/lib/Makefile	2008-08-16 03:22:14.000000000 -0400
> > +++ linux-2.6-lttng/lib/Makefile	2008-08-16 03:29:12.000000000 -0400
> > @@ -43,6 +43,8 @@ obj-$(CONFIG_DEBUG_PREEMPT) += smp_proce
> >  obj-$(CONFIG_DEBUG_LIST) += list_debug.o
> >  obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o
> > 
> > +obj-y += fair-rwlock.o
> > +
> >  ifneq ($(CONFIG_HAVE_DEC_LOCK),y)
> >    lib-y += dec_and_lock.o
> >  endif
> > Index: linux-2.6-lttng/lib/fair-rwlock.c
> > ===================================================================
> > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> > +++ linux-2.6-lttng/lib/fair-rwlock.c	2008-08-17 14:20:38.000000000 -0400
> > @@ -0,0 +1,557 @@
> > +/*
> > + * Fair low-latency rwlock
> > + *
> > + * Allows writer fairness wrt readers and also minimally impact the irq latency
> > + * of the system.
> > + *
> > + * A typical case leading to long interrupt latencies :
> > + *
> > + * - rwlock shared between
> > + *   - Rare update in thread context
> > + *   - Frequent slow read in thread context (task list iteration)
> > + *   - Fast interrupt handler read
> > + *
> > + * The slow write must therefore disable interrupts around the write lock,
> > + * but will therefore add up to the global interrupt latency; worse case being
> > + * the duration of the slow read.
> > + *
> > + * This "fair" rwlock writer subscribes to the lock, which locks out the reader
> > + * threads. Then, it takes waits until all reader threads exited their critical
> > + * section and takes the mutex (1 bit within the "long"). Then it disables
> > + * softirqs, locks out the softirqs, waits for all softirqs to exit their
> > + * critical section, disables irqs and locks out irqs. It then waits for irqs to
> > + * exit their critical section. Only then is the writer allowed to modify the
> > + * data structure.
> > + *
> > + * The writer fast path checks for a non-contended lock (all bits set to 0) and
> > + * does an atomic cmpxchg to set the subscription, mutex, softirq exclusion and
> > + * irq exclusion bits.
> > + *
> > + * The reader does an atomic cmpxchg to check if there is a subscribed writer.
> > + * If not, it increments the reader count for its context (thread, softirq,
> > + * irq).
> > + *
> > + * rwlock bits :
> > + *
> > + * - bits 0 .. log_2(NR_CPUS)-1) are the thread readers count
> > + *   (max number of threads : NR_CPUS)
> > + * - bits log_2(NR_CPUS) .. 2*log_2(NR_CPUS)-1 are the softirq readers count
> > + *   (max # of softirqs: NR_CPUS)
> > + * - bits 2*log_2(NR_CPUS) .. 3*log_2(NR_CPUS)-1 are the hardirq readers count
> > + *   (max # of hardirqs: NR_CPUS)
> > + *
> > + * - bits 3*log_2(NR_CPUS) .. 4*log_2(NR_CPUS)-1 are the writer subscribers
> > + *   count. Locks against reader threads if non zero.
> > + *   (max # of writer subscribers : NR_CPUS)
> > + * - bit 4*log_2(NR_CPUS) is the write mutex
> > + * - bit 4*log_2(NR_CPUS)+1 is the writer lock against softirqs
> > + * - bit 4*log_2(NR_CPUS)+2 is the writer lock against hardirqs
> 
> OK...  64 bits less three for the writer bits is 61 bits, divided by
> four gives us 15 bits, which limits us to 32,768 CPUs (16,384 if we need
> a guard bit separating the fields).  This should be enough for the next
> few years.
> 
> However, if critical sections are to be preemptable (as they need to be
> in -rt, then this limit applies to the number of -tasks- rather than the
> number of CPUs.  32,768 is IMHO an uncomfortably low ceiling for the
> number of tasks in a system, given that my laptop supports several
> thousand without breaking a sweat.
> 

Agreed. I am careful about overdesigning for RT though, given that once
we get it right for the current thread vs softirq vs hardirq mess, the
solution for preemptable critical sections might come more naturally.

> So, how about combining the softirq and hardirq bits, giving the range 0
> .. 2*log_2(NR_CPUS)-1 over to tasks?  This would give 30 bits,
> allowing more than 10^9 tasks, which should be enough for the next few
> years.

See below about RT kernel.

> Better yet, why not begin the bit allotment at the high end of
> the word, allowing whatever is left over at the bottom for tasks?
> 

Technical detail : x86_64 cannot encode immediates wider than a
sign-extended 32 bits in most instructions; such a constant must first
be loaded into a register, which adds a few instructions. Therefore, I
prefer to use the lowest bits possible given the NR_CPUS configuration.
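
To make this concrete, here is a minimal user-space sketch (not kernel
code, the constants are made up) : with gcc's "er" constraint, a
constant that fits in a sign-extended 32-bit immediate is encoded
directly in the addq instruction, while a larger one forces gcc to
materialize it in a register first (an extra movabs).

#include <stdio.h>

static long lockword;

int main(void)
{
	/* low-bit mask : fits in an imm32, encoded as "lock addq $imm, mem" */
	asm volatile("lock addq %1, %0"
		     : "+m" (lockword)
		     : "er" (0x10000L));

	/* same offset shifted into the high bits : gcc must use movabs + register */
	asm volatile("lock addq %1, %0"
		     : "+m" (lockword)
		     : "er" (0x1000000000000L));

	printf("lockword = %lx\n", lockword);
	return 0;
}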

> Alternatively, "hardirq" and "softirq" runs in process context in -rt,
> so the three read-side bitfields should be able to be combined into one
> in that case.
> 

Actually, given that softirqs and hardirqs are really just tasks in -RT,
I guess threads could be given the whole 45 bits available for threads,
softirqs and irqs (0 .. 3*log_2(NR_CPUS)-1).

However, in that case, we would also want writer critical sections to
be as preemptible as reader critical sections. Therefore, it would be
best to split the 60 bits in two :

30 bits for reader threads
30 bits for writer threads

However, I wonder if the whole algorithm included in this rwlock
primitive to make sure that interrupt latency < softirq latency < thread
latency won't just vanish in -rt kernels. We might want to group those
threads by priority (e.g. grouping irq threads together and maybe
leaving softirq and normal threads in the same bunch), but we would then
impact softirq latency a little bit more.

The problem with this approach wrt RT kernels is that we cannot provide
enough "priority groups" (currently irq, softirq and threads in the
mainline kernel) for all the subtle priority levels of RT kernels. The
more groups we add, the fewer threads we allow on the system.

So basically, the relationship between the max number of threads (T)
and the number of reader priorities goes as follows on a 64-bit
machine :

T writers subscribed count bits
1 bit for writer mutex

for first priority group :
T reader count bits
(no need of reader exclusion bit because the writer subscribed count
bits and the writer mutex act as exclusion)

for each other priority group :
T reader count bits
1 reader exclusion bit (set by the writer)

We have the inequality :

64 >= (T + 1) + T + (NR_PRIORITIES - 1) * (T + 1)

64 >= (2T + 1) + (NR_PRIORITIES - 1) * (T + 1)
63 - 2T >= (NR_PRIORITIES - 1) * (T + 1)
((63 - 2T) / (T + 1)) + 1 >= NR_PRIORITIES

Therefore :

Thread bits  |  Max number of threads  |  Number of priorities
  31         |    2147483648           |          1
  20         |       1048576           |          2
  15         |         32768           |          3
  12         |          4096           |          4
   9         |           512           |          5
   8         |           256           |          6
   7         |           128           |          7
   6         |            64           |          8
   5         |            32           |          9
   4         |            16           |         10
   3         |             8           |         15

Starting from here, we have more priority groups than threads in the
system, which becomes somewhat pointless... :)
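
For reference, here is a trivial user-space sketch of the arithmetic
behind that inequality, evaluated for each possible field width (just
integer division, nothing more) :

#include <stdio.h>

int main(void)
{
	int t;	/* number of bits in each reader/writer count field */

	for (t = 31; t >= 3; t--) {
		/* ((63 - 2T) / (T + 1)) + 1 >= NR_PRIORITIES, integer division */
		int nr_priorities = (63 - 2 * t) / (t + 1) + 1;

		printf("%2d thread bits : max %10lu threads, %2d priorities\n",
		       t, 1UL << t, nr_priorities);
	}
	return 0;
}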

So currently, for the mainline kernel, I chose 3 priority levels
(thread, softirq, irq), which gives me a maximum of 32768 CPUs in the
system because I chose to disable preemption. However, we can think of
ways to tune that in the direction we prefer. We could also build a
hybrid : more bits for groups whose threads are preemptible (where we
need as many bits as the maximum number of threads) and fewer bits for
groups where preemption is disabled (where we only need enough bits to
count NR_CPUS).

Ideas are welcome...


> > + * e.g. : NR_CPUS = 16
> > + *
> > + * THREAD_RMASK:      0x0000000f
> > + * SOFTIRQ_RMASK:     0x000000f0
> > + * HARDIRQ_RMASK:     0x00000f00
> > + * SUBSCRIBERS_WMASK: 0x0000f000
> > + * WRITER_MUTEX:      0x00010000
> > + * SOFTIRQ_WMASK:     0x00020000
> > + * HARDIRQ_WMASK:     0x00040000
> > + *
> > + * Bits usage :
> > + *
> > + * nr cpus for thread read
> > + * nr cpus for softirq read
> > + * nr cpus for hardirq read
> > + *
> > + * IRQ-safe write lock :
> > + * nr cpus for write subscribers, disables new thread readers if non zero
> > + * if (nr thread read == 0 && write mutex == 0)
> > + * 1 bit for write mutex
> > + * (softirq off)
> > + * 1 bit for softirq exclusion
> > + * if (nr softirq read == 0)
> > + * (hardirq off)
> > + * 1 bit for hardirq exclusion
> > + * if (nr hardirq read == 0)
> > + * -> locked
> > + *
> > + * Copyright 2008 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > + */
> > +
> > +#include <linux/fair-rwlock.h>
> > +#include <linux/hardirq.h>
> > +#include <linux/module.h>
> > +
> > +#if (NR_CPUS > 64 && (BITS_PER_LONG == 32 || NR_CPUS > 32768))
> > +#error "fair rwlock needs more bits per long to deal with that many CPUs"
> > +#endif
> > +
> > +#define THREAD_ROFFSET	1UL
> > +#define THREAD_RMASK	((NR_CPUS - 1) * THREAD_ROFFSET)
> 
> NR_CPUS is required to be a power of two?  News to me...
> 

I thought it would have been sane to make it a power of two, but you
are right... I should use the neat "roundup_pow_of_two()" :)
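
Something along these lines (FAIR_RWLOCK_NR_CTX is a hypothetical name,
just a sketch of the fix, assuming <linux/log2.h>) :

#include <linux/log2.h>

/* per-context field width rounded up to a power of two */
#define FAIR_RWLOCK_NR_CTX	roundup_pow_of_two(NR_CPUS)

#define THREAD_ROFFSET	1UL
#define THREAD_RMASK	((FAIR_RWLOCK_NR_CTX - 1) * THREAD_ROFFSET)
/* ...and similarly for the softirq, hardirq and subscribers fields */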

> > +#define SOFTIRQ_ROFFSET	(THREAD_RMASK + 1)
> > +#define SOFTIRQ_RMASK	((NR_CPUS - 1) * SOFTIRQ_ROFFSET)
> > +#define HARDIRQ_ROFFSET	((SOFTIRQ_RMASK | THREAD_RMASK) + 1)
> > +#define HARDIRQ_RMASK	((NR_CPUS - 1) * HARDIRQ_ROFFSET)
> > +
> > +#define SUBSCRIBERS_WOFFSET	\
> > +	((HARDIRQ_RMASK | SOFTIRQ_RMASK | THREAD_RMASK) + 1)
> > +#define SUBSCRIBERS_WMASK	\
> > +	((NR_CPUS - 1) * SUBSCRIBERS_WOFFSET)
> > +#define WRITER_MUTEX		\
> > +	((SUBSCRIBERS_WMASK | HARDIRQ_RMASK | SOFTIRQ_RMASK | THREAD_RMASK) + 1)
> > +#define SOFTIRQ_WMASK	(WRITER_MUTEX << 1)
> > +#define SOFTIRQ_WOFFSET	SOFTIRQ_WMASK
> > +#define HARDIRQ_WMASK	(SOFTIRQ_WMASK << 1)
> > +#define HARDIRQ_WOFFSET	HARDIRQ_WMASK
> > +
> > +#ifdef FAIR_RWLOCK_DEBUG
> > +#define printk_dbg printk
> > +#else
> > +#define printk_dbg(fmt, args...)
> > +#endif
> > +
> > +
> > +/*
> > + * Uncontended fastpath.
> > + */
> > +static long fair_read_trylock_uncontended(struct fair_rwlock *rwlock,
> > +	long expect_readers, long roffset)
> > +{
> > +	/* no other reader nor writer present, try to take the lock */
> > +	return atomic_long_cmpxchg(&rwlock->value, expect_readers,
> > +					(expect_readers + roffset));
> > +}
> 
> Suggest inlining this one, as it is just a wrapper.  (Or is gcc smart
> enough to figure that out on its own these days?)
> 

gcc is smart nowadays : it gets inlined automatically when it is small
enough and called just once.

> > +
> > +/*
> > + * Reader lock
> > + *
> > + * First try to get the uncontended lock. If it is non-zero (can be common,
> > + * since we allow multiple readers), pass the returned cmpxchg value to the loop
> > + * to try to get the reader lock.
> > + *
> > + * trylock will fail if a writer is subscribed or holds the lock, but will
> > + * spin if there is concurency to win the cmpxchg. It could happen if, for
> > + * instance, other concurrent reads need to update the roffset or if a
> > + * writer updated the lock bits which does not contend us. Since many
> > + * concurrent readers is a common case, it makes sense not to fail is it
> > + * happens.
> > + *
> > + * the non-trylock case will spin for both situations.
> > + */
> > +
> > +static int _fair_read_lock_ctx(struct fair_rwlock *rwlock,
> > +		long roffset, long wmask, int trylock)
> > +{
> > +	long value, newvalue;
> > +
> > +	/*
> > +	 * Try uncontended lock.
> > +	 */
> > +	value = fair_read_trylock_uncontended(rwlock, 0, roffset);
> > +	if (likely(value == 0))
> > +		return 1;
> > +
> > +	for (;;) {
> > +		if (value & wmask) {
> > +			if (!trylock) {
> > +				/* Order value reads */
> > +				smp_rmb();
> 
> I don't understand the need for this smp_rmb() -- what other location
> are we maintaining order with?  The CPU must maintain order of accesses
> to a single variable (cache coherence requires this).
> 
> I suggest replacing the smp_rmb() with a barrier().  Not that optimizing
> a spinloop is violently important...
> 

Good point, will fix. The atomic_long_cmpxchg executed before we exit
the busy loop takes care of ordering.

> > +				value = atomic_long_read(&rwlock->value);
> > +				continue;
> > +			} else
> > +				return 0;
> > +		}
> > +
> > +		newvalue = atomic_long_cmpxchg(&rwlock->value,
> > +				value, value + roffset);
> 
> Why not use fair_read_trylock_uncontended()?  Or equivalently, why not
> replace calls to fair_read_trylock_uncontended() with the bare cmpxchg?
> 

Yep, I've done that in my fast/slowpath rewrite. Will be posted soonish.

> > +		if (likely(newvalue == value))
> > +			break;
> > +		else
> > +			value = newvalue;
> > +	}
> > +
> > +	printk_dbg("lib reader got in with value %lX, wmask %lX\n",
> > +		value, wmask);
> > +	return 1;
> > +}
> > +
> > +/*
> > + * many readers, from irq/softirq/thread context.
> > + * protects against writers.
> > + */
> > +void fair_read_lock(struct fair_rwlock *rwlock)
> > +{
> > +	if (in_irq())
> > +		_fair_read_lock_ctx(rwlock, HARDIRQ_ROFFSET, HARDIRQ_WMASK, 0);
> > +	else if (in_softirq())
> > +		_fair_read_lock_ctx(rwlock, SOFTIRQ_ROFFSET, SOFTIRQ_WMASK, 0);
> > +	else {
> > +		preempt_disable();
> > +		_fair_read_lock_ctx(rwlock, THREAD_ROFFSET, SUBSCRIBERS_WMASK,
> > +					0);
> > +	}
> > +}
> > +EXPORT_SYMBOL_GPL(fair_read_lock);
> > +
> > +int fair_read_trylock(struct fair_rwlock *rwlock)
> > +{
> > +	int ret;
> > +
> > +	if (in_irq())
> > +		return _fair_read_lock_ctx(rwlock,
> > +					HARDIRQ_ROFFSET, HARDIRQ_WMASK, 1);
> > +	else if (in_softirq())
> > +		return _fair_read_lock_ctx(rwlock,
> > +					SOFTIRQ_ROFFSET, SOFTIRQ_WMASK, 1);
> > +	else {
> > +		preempt_disable();
> > +		ret = _fair_read_lock_ctx(rwlock,
> > +					THREAD_ROFFSET, SUBSCRIBERS_WMASK, 1);
> > +		if (!ret)
> > +			preempt_enable();
> > +		return ret;
> > +	}
> > +}
> > +EXPORT_SYMBOL_GPL(fair_read_trylock);
> > +
> > +void fair_read_unlock(struct fair_rwlock *rwlock)
> > +{
> > +	/* atomic_long_sub orders reads */
> > +	if (in_irq())
> > +		atomic_long_sub(HARDIRQ_ROFFSET, &rwlock->value);
> > +	else if (in_softirq())
> > +		atomic_long_sub(SOFTIRQ_ROFFSET, &rwlock->value);
> > +	else {
> > +		atomic_long_sub(THREAD_ROFFSET, &rwlock->value);
> > +		preempt_enable();
> > +	}
> > +}
> > +EXPORT_SYMBOL_GPL(fair_read_unlock);
> > +
> > +/* Writer lock */
> > +
> > +/*
> > + * Lock out a specific execution context from the read lock. Wait for both the
> > + * rmask and the wmask to be empty before proceeding to take the lock.
> > + */
> > +static void _fair_write_lock_ctx_wait(struct fair_rwlock *rwlock,
> > +		long value, long rmask, long wmask)
> > +{
> > +	long newvalue;
> > +
> > +	for (;;) {
> > +		if (value & (rmask | wmask)) {
> > +			/* Order value reads */
> > +			smp_rmb();
> 
> Suggest replacing the smp_rmb() with barrier() as noted above.
> 

Yep.

> > +			value = atomic_long_read(&rwlock->value);
> > +			continue;
> > +		}
> > +		newvalue = atomic_long_cmpxchg(&rwlock->value,
> > +				value, value | wmask);
> > +		if (likely(newvalue == value))
> > +			break;
> > +		else
> > +			value = newvalue;
> > +	}
> > +	printk_dbg("lib writer got in with value %lX, new %lX, rmask %lX\n",
> > +		value, value | wmask, rmask);
> > +}
> > +
> > +/*
> > + * Lock out a specific execution context from the read lock. First lock the read
> > + * context out of the lock, then wait for every readers to exit their critical
> > + * section.
> > + */
> > +static void _fair_write_lock_ctx_force(struct fair_rwlock *rwlock,
> > +		long rmask, long woffset)
> > +{
> > +	long value;
> > +
> > +	atomic_long_add(woffset, &rwlock->value);
> > +	do {
> > +		value = atomic_long_read(&rwlock->value);
> > +		/* Order rwlock->value read wrt following reads */
> > +		smp_rmb();
> 
> Suggest replacing the smp_rmb() with barrier() as noted above.
> 

What would make sure that we have correct read ordering between
rwlock->value and the data we are trying to protect then ?

Given that we are a writer, we might argue that we are not interested in
ordering memory reads, only writes.

Yeah, given that all we need to know is when a reader has exited its
critical section (we are not actually interested in reading any other
data that reader might have written for us), it should be OK to just
use a barrier().
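
i.e. something like this in _fair_write_lock_ctx_force() (sketch of the
suggested change, not a tested patch) :

	atomic_long_add(woffset, &rwlock->value);
	do {
		value = atomic_long_read(&rwlock->value);
		/*
		 * No read ordering needed : we only spin until the reader
		 * counts drop to zero. barrier() just keeps the compiler
		 * from getting clever with the loop.
		 */
		barrier();
	} while (value & rmask);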

> > +	} while (value & rmask);
> > +	printk_dbg("lib writer got in with value %lX, woffset %lX, rmask %lX\n",
> > +		value, woffset, rmask);
> > +}
> > +
> > +void fair_write_subscribe(struct fair_rwlock *rwlock)
> > +{
> > +	preempt_disable();
> > +
> > +	/* lock out threads */
> > +	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
> > +}
> > +EXPORT_SYMBOL_GPL(fair_write_subscribe);
> > +
> > +/*
> > + * Uncontended fastpath.
> > + */
> > +static long
> > +fair_write_trylock_irq_uncontended(struct fair_rwlock *rwlock,
> > +	long expect_sub, long sub_woffset)
> > +{
> > +	long value;
> > +
> > +	/* no other reader nor writer present, try to take the lock */
> > +	local_bh_disable();
> > +	local_irq_disable();
> > +	value = atomic_long_cmpxchg(&rwlock->value, expect_sub,
> > +			expect_sub + (sub_woffset | SOFTIRQ_WOFFSET
> > +				| HARDIRQ_WOFFSET | WRITER_MUTEX));
> > +	if (unlikely(value != expect_sub)) {
> > +		local_irq_enable();
> > +		local_bh_enable();
> > +	}
> > +	return value;
> > +}
> > +
> > +/*
> > + * Safe against other writers in thread context.
> > + * Safe against irq/softirq/thread readers.
> > + */
> > +void fair_write_lock_irq(struct fair_rwlock *rwlock)
> > +{
> > +	long value;
> > +
> > +	preempt_disable();
> > +	value = fair_write_trylock_irq_uncontended(rwlock,
> > +			0, SUBSCRIBERS_WOFFSET);
> > +	if (likely(!value))
> > +		return;
> > +
> > +	/* lock out threads */
> > +	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
> 
> Why don't we use fair_write_subscribe() here?  Ah, because we have
> already disabled preemption...
> 
> OK, so why do we have fair_write_subscribe() defined in the first place?
> 

Actually, I changed it quite a bit in the new version. I merged the
subscription with the first "trylock" so we can do the two in a single
operation, which either gets the critical section quickly, or
subscribes.

> > +	/* lock out other writers when no reader threads left */
> > +	_fair_write_lock_ctx_wait(rwlock, value + SUBSCRIBERS_WOFFSET,
> > +		THREAD_RMASK, WRITER_MUTEX);
> 
> OK, I'll bite...  Why don't we also have to wait for the softirq and
> hardirq readers?  I would expect THREAD_RMASK|SOFTIRQ_RMASK|HARDIRQ_RMASK
> rather than just THREAD_RMASK above.
> 
> Never mind, I finally see the following pair of statements.
> 
> So, the idea is to allow recursive readers in irq when we also have
> readers in the interrupted process context, right?

Yes, but it's more than that. Here is a table summarizing what each
reader context is allowed to do :

IRQ reader context can access the reader lock when :

- Lock not contended
- Reader threads are holding the lock
- Writers are subscribed
- A writer is holding the WRITE_MUTEX (which excludes other writers)
- A writer is locking out softirqs.

SoftIRQ reader context can access the reader lock when :

- Lock not contended
- Reader threads are holding the lock
- Writers are subscribed
- A writer is holding the WRITE_MUTEX (which excludes other writers)

Thread reader context can access the reader lock when :

- Lock not contended
- Reader threads are holding the lock

Note that "Writers are subscribed" is absent from the conditions under
which a thread reader can obtain the lock, which implies that we favor
writers over thread readers. This is however not the case for IRQ and
softirq readers.
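
In terms of the lock word, this boils down to which writer bit each
reader context tests before incrementing its count. A condensed sketch,
reusing the mask definitions from the patch above :

/* the wmask each reader context spins or fails on in _fair_read_lock_ctx() */
static inline int irq_reader_may_enter(long v)
{
	return !(v & HARDIRQ_WMASK);	/* only the hardirq exclusion bit blocks it */
}

static inline int softirq_reader_may_enter(long v)
{
	return !(v & SOFTIRQ_WMASK);	/* blocked once the writer locks out softirqs */
}

static inline int thread_reader_may_enter(long v)
{
	return !(v & SUBSCRIBERS_WMASK);	/* any subscribed writer keeps new thread readers out */
}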

> But isn't the
> common case going to be no irq handlers read-holding the lock?  If
> so, wouldn't we want to try for all of them at once if we could, using
> a state machine?
> 

In fair_write_trylock_irq_uncontended(), which becomes the fastpath in
the new patch version, I disable softirqs and irqs and try to get the
lock in one go if it is not contended. If that fails, I start to lock
execution contexts out of the lock, one at a time.

I am not sure that I understand your comment completely though.


> Also, just to be sure I understand, writers waiting for other writers
> would spin in the above _fair_write_lock_ctx_wait(), correct?
> 

Right.

> > +	/* lock out softirqs */
> > +	local_bh_disable();
> 
> Why not just local_irq_disable() at this point and be done with it?
> 

If interrupts were disabled before calling :

> > +	_fair_write_lock_ctx_force(rwlock, SOFTIRQ_RMASK, SOFTIRQ_WOFFSET);

It would busy-loop, waiting for softirqs to exit their read-side
critical sections, while interrupts are disabled. Given that those
softirqs can run for longer than the duration for which we would like
to disable interrupts (the critical section itself can be long, and
softirqs could also be interrupted, more than once), it's better to let
softirqs exit their critical sections before disabling irqs.

> > +
> > +	/* lock out hardirqs */
> > +	local_irq_disable();
> > +	_fair_write_lock_ctx_force(rwlock, HARDIRQ_RMASK, HARDIRQ_WOFFSET);
> > +
> > +	/* atomic_long_cmpxchg orders writes */
> > +}
> > +EXPORT_SYMBOL_GPL(fair_write_lock_irq);
> > +
> > +/*
> > + * Must already be subscribed. Will fail if there is contention caused by
> > + * interrupt readers, softirq readers, other writers. Will also fail if there is
> > + * concurrent cmpxchg update, even if it is only a writer subscription. It is
> > + * not considered as a common case and therefore does not jusfity looping.
> 
> "justify"?  (Sorry, couldn't resist...)
> 

 :)

> > + *
> > + * Second cmpxchg deals with other write subscribers.
> > + */
> > +int fair_write_trylock_subscribed_irq(struct fair_rwlock *rwlock)
> > +{
> > +	long value;
> > +
> > +	value = fair_write_trylock_irq_uncontended(rwlock,
> > +			SUBSCRIBERS_WOFFSET, 0);
> > +	if (likely(value == SUBSCRIBERS_WOFFSET))
> > +		return 1;
> > +
> > +	if (likely(!(value & ~SUBSCRIBERS_WMASK))) {
> > +		/* no other reader nor writer present, try to take the lock */
> > +		local_bh_disable();
> > +		local_irq_disable();
> 
> Doesn't local_irq_disable() imply local_bh_disable()?  So why not just
> do local_irq_disable()?
> 

Because we do

> > +	local_irq_enable();
> > +	local_bh_enable();

in fair_write_unlock_irq().

Actually, if we could do, in the slow path :

local_bh_disable();
wait for bh to exit
local_irq_disable();
local_bh_enable();
wait for irq to exit

We could then remove the local_bh_enable() from the unlock and therefore
remove the local_bh_disable/enable pair from the fast path.
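
In code, that hypothetical slow path would look something like this
(sketch only; the SOFTIRQ_WOFFSET exclusion bit, not local_bh_disable(),
is what keeps softirq readers out by the time bhs are re-enabled) :

	/* hypothetical reordering -- only valid if the bh re-enable below is legal */
	local_bh_disable();
	_fair_write_lock_ctx_force(rwlock, SOFTIRQ_RMASK, SOFTIRQ_WOFFSET);
	local_irq_disable();
	local_bh_enable();	/* softirq readers already locked out by SOFTIRQ_WOFFSET */
	_fair_write_lock_ctx_force(rwlock, HARDIRQ_RMASK, HARDIRQ_WOFFSET);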

However, is it legal to reenable bottom-halves when interrupts are off,
or is there some check done like the preempt_check_resched() code which
forbids that ?

Considering what I see in _local_bh_enable_ip(), it seems wrong :

  WARN_ON_ONCE(in_irq() || irqs_disabled());

Ah, and there is a preempt_check_resched() call at the end of
_local_bh_enable_ip(). That explains it. I wonder if there is some way
around that limitation. Given this comment :

/*
 * Special-case - softirqs can safely be enabled in
 * cond_resched_softirq(), or by __do_softirq(),
 * without processing still-pending softirqs:
 */
void _local_bh_enable(void)

There seem to be a few special cases, which gives me hope (but no clear
solution). :/


> > +		if (likely(atomic_long_cmpxchg(&rwlock->value, value,
> > +				value + (SOFTIRQ_WOFFSET
> > +					| HARDIRQ_WOFFSET | WRITER_MUTEX))
> > +						== value))
> > +			return 1;
> > +		local_irq_enable();
> > +		local_bh_enable();
> > +	}
> > +	return 0;
> > +
> > +}
> > +EXPORT_SYMBOL_GPL(fair_write_trylock_subscribed_irq);
> > +
> > +void fair_write_unlock_irq(struct fair_rwlock *rwlock)
> > +{
> > +	/*
> > +	 * atomic_long_sub makes sure we commit the data before reenabling
> > +	 * the lock.
> > +	 */
> > +	atomic_long_sub(HARDIRQ_WOFFSET | SOFTIRQ_WOFFSET
> > +			| WRITER_MUTEX | SUBSCRIBERS_WOFFSET,
> > +			&rwlock->value);
> > +	local_irq_enable();
> > +	local_bh_enable();
> > +	preempt_enable();
> > +}
> > +EXPORT_SYMBOL_GPL(fair_write_unlock_irq);
> > +
> > +/*
> > + * Uncontended fastpath.
> > + */
> > +static long fair_write_trylock_bh_uncontended(struct fair_rwlock *rwlock,
> > +	long expect_sub, long sub_woffset)
> > +{
> > +	long value;
> > +
> > +	/* no other reader nor writer present, try to take the lock */
> > +	local_bh_disable();
> > +	value = atomic_long_cmpxchg(&rwlock->value, expect_sub,
> > +		expect_sub + (sub_woffset | SOFTIRQ_WOFFSET | WRITER_MUTEX));
> > +	if (unlikely(value != expect_sub))
> > +		local_bh_enable();
> > +	return value;
> > +}
> > +
> > +/*
> > + * Safe against other writers in thread context.
> > + * Safe against softirq/thread readers.
> > + */
> > +void fair_write_lock_bh(struct fair_rwlock *rwlock)
> > +{
> > +	long value;
> > +
> > +	preempt_disable();
> > +	value = fair_write_trylock_bh_uncontended(rwlock,
> > +			0, SUBSCRIBERS_WOFFSET);
> > +	if (likely(!value))
> > +		return;
> > +
> > +	/* lock out threads */
> > +	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
> > +
> > +	/* lock out other writers when no reader threads left */
> > +	_fair_write_lock_ctx_wait(rwlock, value + SUBSCRIBERS_WOFFSET,
> > +		THREAD_RMASK, WRITER_MUTEX);
> > +
> > +	/* lock out softirqs */
> > +	local_bh_disable();
> > +	_fair_write_lock_ctx_force(rwlock, SOFTIRQ_RMASK, SOFTIRQ_WOFFSET);
> 
> Seems like we would want to at least give an error if there were
> hardirq readers...  (Can't think of a reasonable use case mixing
> fair_write_lock_bh() with hardirq readers, but might well be simply a
> momentary failure of imagination.)
> 

Good idea, I'll add a check in the slowpath, but I'll leave it out of
the fast path to keep it, well, fast.

> > +	/* atomic_long_cmpxchg orders writes */
> > +}
> > +EXPORT_SYMBOL_GPL(fair_write_lock_bh);
> > +
> > +/*
> > + * Must already be subscribed. Will fail if there is contention caused by
> > + * softirq readers, other writers. Will also fail if there is concurrent cmpxchg
> > + * update, even if it is only a writer subscription. It is not considered as a
> > + * common case and therefore does not jusfity looping.
> > + */
> > +int fair_write_trylock_subscribed_bh(struct fair_rwlock *rwlock)
> > +{
> > +	long value;
> > +
> > +	value = fair_write_trylock_bh_uncontended(rwlock,
> > +			SUBSCRIBERS_WOFFSET, 0);
> > +	if (likely(value == SUBSCRIBERS_WOFFSET))
> > +		return 1;
> > +
> > +	if (likely(!(value & ~SUBSCRIBERS_WMASK))) {
> > +		/* no other reader nor writer present, try to take the lock */
> > +		local_bh_disable();
> > +		local_irq_disable();
> > +		if (likely(atomic_long_cmpxchg(&rwlock->value, value,
> > +				value + (SOFTIRQ_WOFFSET | WRITER_MUTEX))
> > +						== value))
> > +			return 1;
> > +		local_irq_enable();
> > +		local_bh_enable();
> > +	}
> > +	return 0;
> > +
> > +}
> > +EXPORT_SYMBOL_GPL(fair_write_trylock_subscribed_bh);
> > +
> > +void fair_write_unlock_bh(struct fair_rwlock *rwlock)
> > +{
> > +	/*
> > +	 * atomic_long_sub makes sure we commit the data before reenabling
> > +	 * the lock.
> > +	 */
> > +	atomic_long_sub(SOFTIRQ_WOFFSET | WRITER_MUTEX | SUBSCRIBERS_WOFFSET,
> > +			&rwlock->value);
> > +	local_bh_enable();
> > +	preempt_enable();
> > +}
> > +EXPORT_SYMBOL_GPL(fair_write_unlock_bh);
> > +
> > +/*
> > + * Uncontended fastpath.
> > + */
> > +static long fair_write_trylock_uncontended(struct fair_rwlock *rwlock,
> > +	long expect_sub, long sub_woffset)
> > +{
> > +	/* no other reader nor writer present, try to take the lock */
> > +	return atomic_long_cmpxchg(&rwlock->value, expect_sub,
> > +					(expect_sub + sub_woffset)
> > +					| WRITER_MUTEX);
> > +}
> > +
> > +/*
> > + * Safe against other writers in thread context.
> > + * Safe against thread readers.
> > + */
> > +void fair_write_lock(struct fair_rwlock *rwlock)
> > +{
> > +	long value;
> > +
> > +	preempt_disable();
> > +	value = fair_write_trylock_uncontended(rwlock, 0, SUBSCRIBERS_WOFFSET);
> > +	if (likely(!value))
> > +		return;
> > +
> > +	/* lock out threads */
> > +	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
> > +
> > +	/* lock out other writers when no reader threads left */
> > +	_fair_write_lock_ctx_wait(rwlock, value + SUBSCRIBERS_WOFFSET,
> > +		THREAD_RMASK, WRITER_MUTEX);
> 
> The theory here is that there shouldn't be any readers in bh or irq?
> But suppose that we are calling fair_write_lock() with irqs already
> disabled?  Wouldn't we still want to wait for bh and irq readers in
> that case?
> 

This fair_write_lock should _only_ be used if there are no bh or irq
readers for this lock. Therefore we do not disable interrupts. Actually,
I should add a WARN_ON here to be certain it is used properly. A reader
thread which reads this data structure with interrupts disabled is
considered as reading from interrupt context in my semantics
(implemented in the new patch version).

Actually, all my writer slow paths should have the following :

  WARN_ON_ONCE(in_interrupt() || irqs_disabled());

This checks that we are not in irq or softirq context (or running with
bhs disabled) and that irqs are not disabled. All these requirements
must be met to be able to correctly wait for contended reader threads
without increasing the irq or softirq latency.
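
For instance (sketch only, not the actual follow-up patch), in
fair_write_lock() as quoted above, the check would sit right after the
failed uncontended attempt :

void fair_write_lock(struct fair_rwlock *rwlock)
{
	long value;

	preempt_disable();
	value = fair_write_trylock_uncontended(rwlock, 0, SUBSCRIBERS_WOFFSET);
	if (likely(!value))
		return;

	/* slow path : must not run in irq/softirq context nor with irqs off */
	WARN_ON_ONCE(in_interrupt() || irqs_disabled());

	/* lock out threads, then other writers, as in the patch above */
	atomic_long_add(SUBSCRIBERS_WOFFSET, &rwlock->value);
	_fair_write_lock_ctx_wait(rwlock, value + SUBSCRIBERS_WOFFSET,
		THREAD_RMASK, WRITER_MUTEX);
}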

> > +	/* atomic_long_cmpxchg orders writes */
> > +}
> > +EXPORT_SYMBOL_GPL(fair_write_lock);
> > +
> > +/*
> > + * Must already be subscribed. Will fail if there is contention caused by
> > + * other writers. Will also fail if there is concurrent cmpxchg update, even if
> > + * it is only a writer subscription. It is not considered as a common case and
> > + * therefore does not jusfity looping.
> > + */
> > +int fair_write_trylock_subscribed(struct fair_rwlock *rwlock)
> > +{
> > +	long value;
> > +
> > +	value = fair_write_trylock_uncontended(rwlock, SUBSCRIBERS_WOFFSET, 0);
> > +	if (likely(value == SUBSCRIBERS_WOFFSET))
> > +		return 1;
> > +
> > +	if (likely(!(value & ~SUBSCRIBERS_WMASK))) {
> > +		/* no other reader nor writer present, try to take the lock */
> > +		local_bh_disable();
> > +		local_irq_disable();
> > +		if (likely(atomic_long_cmpxchg(&rwlock->value, value,
> > +				value | WRITER_MUTEX) == value))
> > +			return 1;
> > +		local_irq_enable();
> > +		local_bh_enable();
> > +	}
> > +	return 0;
> > +
> > +}
> > +EXPORT_SYMBOL_GPL(fair_write_trylock_subscribed);
> 
> I still find the _subscribed() stuff a bit scary...  Sample use case?
> 

It's updated in the following version; the use case is documented in a
comment in the code.

Thanks a lot for the review !

Mathieu

> > +void fair_write_unlock(struct fair_rwlock *rwlock)
> > +{
> > +	/*
> > +	 * atomic_long_sub makes sure we commit the data before reenabling
> > +	 * the lock.
> > +	 */
> > +	atomic_long_sub(WRITER_MUTEX | SUBSCRIBERS_WOFFSET, &rwlock->value);
> > +	preempt_enable();
> > +}
> > +EXPORT_SYMBOL_GPL(fair_write_unlock);
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Fair low-latency rwlock v5
  2008-08-19  6:04                   ` Mathieu Desnoyers
@ 2008-08-19  7:33                     ` Mathieu Desnoyers
  2008-08-19  9:06                       ` Mathieu Desnoyers
  2008-08-19 16:48                       ` Linus Torvalds
  0 siblings, 2 replies; 27+ messages in thread
From: Mathieu Desnoyers @ 2008-08-19  7:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, H. Peter Anvin, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel,
	Steven Rostedt

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
[...]
> The problem of this approach wrt RT kernels is that we cannot provide
> enough "priority groups" (current irq, softirq and threads in mainline
> kernel) for all the subtile priority levels of RT kernels. The more
> groups we add, the less threads we allow on the system.
> 
> So basically, the relationship between the max number of threads (T) and
> the number of reader priorities goes as follow on a 64 bits machine :
> 
> T writers subscribed count bits
> 1 bit for writer mutex
> 
> for first priority group :
> T reader count bits
> (no need of reader exclusion bit because the writer subscribed count
> bits and the writer mutex act as exclusion)
> 
> for each other priority group :
> T reader count bits
> 1 reader exclusion bit (set by the writer)
> 
> We have the inequality :
> 
> 64 >= (T + 1) + T + (NR_PRIORITIES - 1) * (T + 1)
> 
> 64 >= (2T + 1) + (NR_PRIORITIES - 1) * (T + 1)
> 63 - 2T >= (NR_PRIORITIES - 1) * (T + 1)
> ((63 - 2T) / (T + 1)) + 1 >= NR_PRIORITIES
> 
> Therefore :
> 
> Thread bits  |  Max number of threads  |  Number of priorities
>   31         |    2147483648           |          1
>   20         |       1048576           |          2
>   15         |         32768           |          3
>   12         |          4096           |          4
>    9         |           512           |          5
>    8         |           256           |          6
>    7         |           128           |          7
>    6         |            64           |          8
>    5         |            32           |          9
>    4         |            16           |         10
>    3         |             8           |         15
> 
> Starting from here, we have more priority groups than threads in the
> system, which becomes somewhat pointless... :)
> 
> So currently, for the mainline kernel, I chose 3 priority levels thread,
> softirq, irq), which gives me 32768 max CPU in the system because I
> choose to disable preemption. However, we can think of ways to tune that
> in the direction we prefer. We could also hybrid those : having more
> bits for some groups which have preemptable threads (for which we need
> a max. of nr. threads) and less bits for other groups where preemption
> is disabled (where we only need enough bits to cound NR_CPUS)
> 
> Ideas are welcome...
> 
> 

It strikes me that Intel has a nice (probably slow?) cmpxchg16b
instruction on x86_64. Therefore, we could atomically update 128 bits,
which gives the following table :

((127 - 2T) / (T + 1)) + 1 >= NR_PRIORITIES

Thread bits  |  Max number of threads  |  Number of priorities
63           |            2^63         |            1
42           |            2^42         |            2
31           |            2^31         |            3
24           |            2^24         |            4
20           |            2^20         |            5
17           |          131072         |            6
15           |           32768         |            7
13           |            8192         |            8
11           |            2048         |            9
10           |            1024         |           10
9            |             512         |           11
8            |             256         |           13
7            |             128         |           15
6            |              64         |           17
5            |              32         |           20
4            |              16         |           24

.. where we have more priorities than threads.

So I wonder if having on the order of 10 priorities, with the maximum
number of threads adapting dynamically to the number of priorities
available, could be interesting for the RT kernel ?

That would however depend on the very architecture-specific cmpxchg16b.
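
For reference, here is roughly what the 128-bit update primitive looks
like as user-space gcc inline asm (sketch only : it assumes a 16-byte
aligned lock word and ignores all the kernel plumbing) :

#include <stdio.h>

/* returns non-zero and stores @newv if *@ptr == *@old, else refreshes *@old */
static inline int cmpxchg16b(volatile unsigned __int128 *ptr,
			     unsigned __int128 *old, unsigned __int128 newv)
{
	unsigned long old_lo = (unsigned long)*old;
	unsigned long old_hi = (unsigned long)(*old >> 64);
	unsigned char ok;

	asm volatile("lock cmpxchg16b %1\n\t"
		     "sete %0"
		     : "=q" (ok), "+m" (*ptr), "+a" (old_lo), "+d" (old_hi)
		     : "b" ((unsigned long)newv), "c" ((unsigned long)(newv >> 64))
		     : "memory", "cc");

	*old = ((unsigned __int128)old_hi << 64) | old_lo;
	return ok;
}

int main(void)
{
	static volatile unsigned __int128 lockword;
	unsigned __int128 expect = 0;

	/* take an uncontended 128-bit "lock word" : set one bit in each half */
	if (cmpxchg16b(&lockword, &expect, ((unsigned __int128)1 << 64) | 1))
		printf("uncontended 128-bit cmpxchg succeeded\n");
	return 0;
}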

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Fair low-latency rwlock v5
  2008-08-19  7:33                     ` Mathieu Desnoyers
@ 2008-08-19  9:06                       ` Mathieu Desnoyers
  2008-08-19 16:48                       ` Linus Torvalds
  1 sibling, 0 replies; 27+ messages in thread
From: Mathieu Desnoyers @ 2008-08-19  9:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, H. Peter Anvin, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel,
	Steven Rostedt

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> * Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> [...]
> > The problem of this approach wrt RT kernels is that we cannot provide
> > enough "priority groups" (currently irq, softirq and threads in the
> > mainline kernel) for all the subtle priority levels of RT kernels. The
> > more groups we add, the fewer threads we allow on the system.
> > 
> > So basically, the relationship between the max number of threads (T) and
> > the number of reader priorities goes as follows on a 64-bit machine :
> > 
> > T writers subscribed count bits
> > 1 bit for writer mutex
> > 
> > for first priority group :
> > T reader count bits
> > (no need of reader exclusion bit because the writer subscribed count
> > bits and the writer mutex act as exclusion)
> > 
> > for each other priority group :
> > T reader count bits
> > 1 reader exclusion bit (set by the writer)
> > 
> > We have the inequality :
> > 
> > 64 >= (T + 1) + T + (NR_PRIORITIES - 1) * (T + 1)
> > 
> > 64 >= (2T + 1) + (NR_PRIORITIES - 1) * (T + 1)
> > 63 - 2T >= (NR_PRIORITIES - 1) * (T + 1)
> > ((63 - 2T) / (T + 1)) + 1 >= NR_PRIORITIES
> > 
> > Therefore :
> > 
> > Thread bits  |  Max number of threads  |  Number of priorities
> >   31         |    2147483648           |          1
> >   20         |       1048576           |          2
> >   15         |         32768           |          3
> >   12         |          4096           |          4
> >    9         |           512           |          5
> >    8         |           256           |          6
> >    7         |           128           |          7
> >    6         |            64           |          8
> >    5         |            32           |          9
> >    4         |            16           |         10
> >    3         |             8           |         15
> > 
> > Starting from here, we have more priority groups than threads in the
> > system, which becomes somewhat pointless... :)
> > 
> > So currently, for the mainline kernel, I chose 3 priority levels (thread,
> > softirq, irq), which gives me 32768 max CPUs in the system because I
> > chose to disable preemption. However, we can think of ways to tune that
> > in the direction we prefer. We could also make a hybrid of those : having
> > more bits for some groups which have preemptable threads (for which we
> > need a max. of nr. threads) and fewer bits for other groups where
> > preemption is disabled (where we only need enough bits to count NR_CPUS)
> > 
> > Ideas are welcome...
> > 
> > 
> 
> It strikes me that Intel has a nice (probably slow?) cmpxchg16b
> instruction on x86_64. Therefore, we could atomically update 128 bits,
> which gives the following table :
> 
> ((127 - 2T) / (T + 1)) + 1 >= NR_PRIORITIES
> 
> Thread bits  |  Max number of threads  |  Number of priorities
> 63           |            2^63         |            1
> 42           |            2^42         |            2
> 31           |            2^31         |            3
> 24           |            2^24         |            4
> 20           |            2^20         |            5
> 17           |          131072         |            6
> 15           |           32768         |            7
> 13           |            8192         |            8
> 11           |            2048         |            9
> 10           |            1024         |           10
> 9            |             512         |           11
> 8            |             256         |           13
> 7            |             128         |           15
> 6            |              64         |           17
> 5            |              32         |           20
> 4            |              16         |           24
> 
> .. where we have more priorities than threads.
> 
> So I wonder if having somewhere around 10 priorities, dynamically
> adapting the number of threads to the number of priorities available,
> could be interesting for the RT kernel ?
> 
> That would however depend on the very architecture-specific cmpxchg16b.
> 
> Mathieu
> 

Still thinking about RT :

The good news is : we don't really need all those bits to be updated
atomically. The bit groups which must be atomically updated are :

- Writer and first priority group
T writers subscribed count bits
1 bit for writer mutex
T reader count bits

- For each other priority group :
T reader count bits
1 reader exclusion bit (set by the writer)

I also noticed that the writer and the first reader priority group
happen to be at the same priority level. When the writer wants to exclude
readers from higher priority groups, it must take their priority,
exclude them using the reader exclusion bit, wait for all readers of
that group to be out of their critical section and then proceed to the
next priority group until all the groups are locked.

The reader of the first priority group must check that both the writer
mutex and the writers subscribed count bits are not set.

The readers in the other groups only have to check the reader exclusion
bit associated to their own group.

So if we simplify the problem a bit for RT, let's fix a maximum of 32768
threads on the system (15 bits). Let's also assume we have 19
priorities.

- Writer and first priority group (fits in 31 bits)
15 bits for writers subscribed count
1 bit for writer mutex
15 bits reader count

Then we have an array of 18 entries, each fitting in 16 bits :
15 reader count bits
1 reader exclusion bit

Therefore, the whole data would fit in 40 bytes.
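
Just to make the layout concrete, a minimal sketch of what this could
look like (illustrative only, 15 thread bits and 19 priorities assumed;
none of these names come from an actual patch) :

#include <stdint.h>

#define NR_PRIO		19

struct rt_prio_rwlock {
	/*
	 * Writer and first priority group, updated with a single
	 * 32-bit cmpxchg :
	 * bits  0..14 : writers subscribed count
	 * bit      15 : writer mutex
	 * bits 16..30 : reader count, first priority group
	 */
	uint32_t writer_and_prio0;
	/*
	 * One 16-bit word per remaining priority group, each updated
	 * with its own 16-bit cmpxchg :
	 * bits 0..14 : reader count
	 * bit     15 : reader exclusion bit (set by the writer)
	 */
	uint16_t prio[NR_PRIO - 1];
};	/* 4 + 18 * 2 = 40 bytes */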

The only thing we would not have is the ability to do a single cmpxchg
on all the bits in the writer fast path, making the writer common case
much much slower.

However, the reader side would still be lightning-fast as it would only
have to do a cmpxchg on its own 16- or 32-bit bitfield.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Fair low-latency rwlock v5
  2008-08-19  7:33                     ` Mathieu Desnoyers
  2008-08-19  9:06                       ` Mathieu Desnoyers
@ 2008-08-19 16:48                       ` Linus Torvalds
  2008-08-21 20:50                         ` [RFC PATCH] Writer-biased low-latency rwlock v8 Mathieu Desnoyers
  1 sibling, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2008-08-19 16:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, H. Peter Anvin, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel,
	Steven Rostedt



On Tue, 19 Aug 2008, Mathieu Desnoyers wrote:
> 
> It strikes me that Intel has a nice (probably slow?) cmpxchg16b
> instruction on x86_64. Therefore, we could atomically update 128 bits,
> which gives the following table :

Stop this crapola.

Take a look at the rwsem thing I pointed you to. After you understand 
that, come back.

The WHOLE POINT of that thing was to use only 32-bit atomics on the hot 
path. Don't even start thinking about cmpxchg16b. If you cannot do your 
atomics in 32-bit, they're broken.

Please. I realize that the rwsem implementation I did is subtle. But 
really. Spend the time to understand it.

			Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH] Writer-biased low-latency rwlock v8
  2008-08-19 16:48                       ` Linus Torvalds
@ 2008-08-21 20:50                         ` Mathieu Desnoyers
  2008-08-21 21:00                           ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Mathieu Desnoyers @ 2008-08-21 20:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul E. McKenney, H. Peter Anvin, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel,
	Steven Rostedt

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Tue, 19 Aug 2008, Mathieu Desnoyers wrote:
> > 
> > It strikes me that Intel has a nice (probably slow?) cmpxchg16b
> > instruction on x86_64. Therefore, we could atomically update 128 bits,
> > which gives the following table :
> 
> Stop this crapola.
> 
> Take a look at the rwsem thing I pointed you to. After you understand 
> that, come back.
> 
> The WHOLE POINT of that thing was to use only 32-bit atomics on the hot 
> path. Don't even start thinking about cmpxchg16b. If you cannot do your 
> atomics in 32-bit, they're broken.
> 
> Please. I realize that the rwsem implementation I did is subtle. But 
> really. Spend the time to understand it.
> 
> 			Linus

You are right. Before posting my new version, let's review your rwsem
implementation. Let's pseudo-code the fastpaths in C (I don't like
reviewing raw assembly when thinking conceptually). Comments are
inlined. My new patch follows.

* Read lock fast path

my_rwlock_rdlock :
int v = atomic_add_return(0x4, &lock);
if (v & 0x3)
  __my_rwlock_rdlock(lock);
return;

Ok, so the reader just says "hey, I'm there" by adding itself to the
reader count. Unlikely to overflow, but it could. Then it checks the
writer mask; if set, it calls the read slowpath. In my solution, I do
basically the same as the writer fast path you did : I expect all bits
to be 0. Therefore, if there are many readers at once, I use the slow
path even though there is no "real" contention on the lock. Given that I
need to limit the number of readers to a smaller number of bits, I need
to detect overflow before-the-fact (this is my new v8 implementation,
which will be posted shortly). BTW, you could accelerate the slow path
by passing "v" :

my_rwlock_rdlock :
int v = atomic_add_return(0x4, &lock);
if (v & 0x3)
  __my_rwlock_rdlock(lock, v);
return;


* Write lock fast path

my_rwlock_wrlock :
if (cmpxchg(&lock, 0, 0x1) != 0)
  __my_rwlock_wrlock(lock);
return;

Writer expects all bits to be 0 (lock uncontended). If true, take the
lock by setting 0x1, else take the slow path. Yes, that's more or less
what I have too. Here too, you could accelerate the write slow path by
passing v.

my_rwlock_wrlock :
int v = cmpxchg(&lock, 0, 0x1);
if (v)
  __my_rwlock_wrlock(lock, v);
return;


* Read and write unlock fastpath

my_rwlock_unlock :
int v = atomic_read(&lock);
if (!(v & 0x1)) {
  /* unlock reader */
  v = atomic_sub_return(0x4, &lock);
  if (v & 0x3)
    __my_rwlock_rdunlock(lock);
  return;
} else {
  /* unlock writer */
  if (!(v == 0x1))
    __my_rwlock_wrunlock(lock);
  if (cmpxchg(&lock, v, 0) != v)
    __my_rwlock_wrunlock(lock);
  return;
}

Hrm, why don't you split the unlocks ? You could remove the initial read
from the writer unlock. You could also accelerate the slow path by
passing v.

my_rwlock_rdunlock:
  /* unlock reader */
  int v = atomic_sub_return(0x4, &lock);
  if (v & 0x3)
    __my_rwlock_rdunlock(lock, v);
  return;

my_rwlock_wrunlock:
  /* unlock writer */
  int v = cmpxchg(&lock, 0x1, 0);
  if (v != 0x1)
    __my_rwlock_wrunlock(lock, v);
  return;


Ok, so I've looked at the slow paths, and I must say they look pretty
much like what I have, except that :

- I keep reader and writer "pending" counts for each execution context.
  This allows me to keep very low interrupt and softirq latency by
  moving a pending writer successively up in execution priorities.
  Because I need more counters, I need to deal with overflow beforehand
  by busy-looping if a counter overflow is detected.
- I don't need the contention bit, because the absence of contention is
  indicated by all the bits being 0. Given that I need to detect
  overflows, I have to do a cmpxchg instead of an xadd, which implies
  that I need to know the value beforehand and don't have the ability to
  use a mask like you do in the read fast path (see the sketch below).
  Given that I do a cmpxchg which also changes the reader count, I can't
  afford to only consider parts of the word in this cmpxchg.
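
Here is a condensed sketch of that difference (this is not the exact
code from the patch below, which additionally splits fast and slow
paths; the function names are only for illustration, the masks are the
ones defined in wbias-rwlock.h) :

/* xadd-style : once the add is done, it cannot be refused on overflow. */
static inline int rdlock_xadd_style(atomic_long_t *lock)
{
	long v = atomic_long_add_return(PTHREAD_ROFFSET, lock);

	return !(v & (PTHREAD_WMASK | NPTHREAD_WMASK | WRITER_MUTEX));
}

/* cmpxchg-style : check the would-be value before committing it. */
static inline int rdlock_cmpxchg_style(atomic_long_t *lock)
{
	long v, old;

	v = atomic_long_read(lock);
	for (;;) {
		if (v & (PTHREAD_WMASK | NPTHREAD_WMASK | WRITER_MUTEX)
		    || (v & PTHREAD_RMASK) == PTHREAD_RMASK)
			return 0;	/* writer present or count full */
		old = atomic_long_cmpxchg(lock, v, v + PTHREAD_ROFFSET);
		if (old == v)
			return 1;
		v = old;	/* lost the race, retry with the new value */
	}
}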

Note that my v8 now implements support for readers in any execution context :
- preemptable
- non-preemptable
- softirq
- irq


Writer-biased low-latency rwlock v8

Writer subscription low-latency rwlock favors writers against reader threads,
but lets softirq and irq readers in even when there are subscribed writers
waiting for the lock. Very frequent writers could starve reader threads.

Changelog since v7 :
Allow both preemptable and non-preemptable thread readers and writers.
Allow preemptable readers and writers to sleep using a wait queue. Use two bits
for the wait queue : WQ_ACTIVE (there is a thread waiting to be woken up) and
WQ_MUTEX (wait queue protection, single bit).

Changelog since v6 :
Made many single-instruction functions static inline, moved to the header.
Separated into slowpath and fastpath C file implementations. The fastpath is meant
to be reimplemented in assembly. Architectures which choose to implement an
optimized fast path must define HAVE_WBIAS_RWLOCK_FASTPATH, which will disable
the generic C version.
Rename to wbias-rwlock.c
Added support for IRQ/Softirq/preemption latency profiling, provided by a
separate patch.
Readers and writers will busy-loop if the counter for their context is full. We
can now have fewer bits reserved for almost unused contexts (e.g. softirq). This
lets us remove any NR_CPUS limitation.

Changelog since v5 :
Test irqs_disabled() in the reader to select the appropriate read context
when interrupts are disabled. Softirqs are already tested with in_softirq(),
which includes softirq disabled sections.
Change subscription behavior : A writer is subscribed until it gets into the
"mutex held" section. Reader threads test for subscription mask or writer mutex
to detect writers. It makes the trylock/subscription API much more efficient.
Updated benchmarks in changelog, with [min,avg,max] delays for each lock use
(contended and uncontended case). Also include "reader thread protection only"
scenario where the writer does not disable interrupts nor bottom-halves.

Changelog since v4 :
rename _fast -> trylock uncontended
Remove the read from the uncontended read and write cases : just do a cmpxchg
expecting "sub_expect", 0 for "lock", 1 subscriber for a "trylock".
Use the value returned by the first cmpxchg as following value instead of doing
an unnecessary read.

Here is why read + rmw is wrong (quoting Linus) :

"This is actually potentially very slow.

Why? If the lock is uncontended, but is not in the current CPU's caches,
the read -> rmw operation generates multiple cache coherency protocol
events. First it gets the line in shared mode (for the read), and then
later it turns it into exclusive mode."

Changelog since v3 :
Added writer subscription and trylock. The writer subscription keeps the reader
threads out of the lock.

Changelog since v2 :
Add writer bias in addition to low-latency wrt interrupts and softirqs.
Added contention delays performance tests for thread and interrupt contexts to
changelog.

Changelog since v1 :
- No more spinlock to protect against concurrent writes, it is done within
  the existing atomic variable.
- Write fastpath with a single atomic op.


I used LTTng traces and eventually made a small patch to lockdep to detect
whenever a spinlock or a rwlock is used both with interrupts enabled and
disabled. Those sites are likely to produce very high latencies and should IMHO
be considered bogus. The basic bogus scenario is a spinlock held on CPU A with
interrupts enabled being interrupted, after which a softirq runs. On CPU B, the
same lock is acquired with interrupts off. We therefore disable interrupts on
CPU B for the duration of the softirq currently running on CPU A, which is
really not something that helps keep latencies short. My preliminary results
show that there are a lot of inconsistent spinlock/rwlock irq on/off uses in
the kernel.

This kind of scenario is pretty easy to fix for spinlocks (either move
the interrupt disable within the spinlock section if the spinlock is
never used by an interrupt handler, or make sure that every user has
interrupts disabled).

The problem comes with rwlocks : it is correct to have readers both with
and without irqs disabled, even when interrupt handlers use the read lock.
However, the write lock has to disable interrupts in that case, and we
suffer from the high latency I pointed out. The tasklist_lock is the
perfect example of this. In the following patch, I try to address this
issue.

The core idea is this :

This writer biased rwlock writer subscribes to the preemptable lock, which
locks out the preemptable reader threads. It waits for all the preemptable
reader threads to exit their critical section. Then, it disables preemption,
subscribes for the non-preemptable lock. Then, it waits until all
non-preemptable reader threads exited their critical section and takes the mutex
(1 bit within the "long"). Then it disables softirqs, locks out the softirqs,
waits for all softirqs to exit their critical section, disables irqs and locks
out irqs. It then waits for irqs to exit their critical section. Only then is
the writer allowed to modify the data structure.

The writer fast path checks for a non-contended lock (all bits set to 0) and
does an atomic cmpxchg to set the subscriptions, writer mutex, softirq exclusion
and irq exclusion bits.

The reader does an atomic cmpxchg to check if there is any contention on the
lock. If not, it increments the reader count for its context (preemptable
thread, non-preemptable thread, softirq, irq).
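
To give an idea of the intended use, here is a minimal (untested) sketch
for the typical "rare thread writer / fast irq reader" case, using the
API from the patch below; the example structure and function names are
made up :

static DEFINE_WBIAS_RWLOCK(example_lock);
static LIST_HEAD(example_list);

/* Slow reader, preemptable thread context. */
void example_iterate(void)
{
	wbias_read_lock(&example_lock);
	/* ... walk example_list ... */
	wbias_read_unlock(&example_lock);
}

/* Fast reader, interrupt context. */
void example_irq_peek(void)
{
	wbias_read_lock_irq(&example_lock);
	/* ... quick lookup ... */
	wbias_read_unlock_irq(&example_lock);
}

/*
 * Rare writer : the highest priority reader context is irq, so the
 * _irq write flavor is used whatever context the writer runs in.
 */
void example_update(struct list_head *node)
{
	wbias_write_lock_irq(&example_lock);
	list_add(node, &example_list);
	wbias_write_unlock_irq(&example_lock);
}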

The test module is available at :

http://ltt.polymtl.ca/svn/trunk/tests/kernel/test-wbias-rwlock.c

** Performance tests

Dual quad-core Xeon 2.0GHz E5405

* Latency

This section presents the detailed breakdown of preemption, softirq and
interrupt latency generated by the writer-biased rwlocks. The "High
contention" section compares the "irqoff latency tracer" results between
standard Linux kernel rwlocks and the wbias-rwlocks.

get_cycles takes [min,avg,max] 72,75,78 cycles, results calibrated on avg

** Single writer test, no contention **
SINGLE_WRITER_TEST_DURATION 10s

IRQ latency for cpu 6 disabled 99490 times, [min,avg,max] 471,485,1527 cycles
SoftIRQ latency for cpu 6 disabled 99490 times, [min,avg,max] 693,704,3969 cycles
Preemption latency for cpu 6 disabled 99490 times, [min,avg,max] 909,917,4593 cycles


** Single trylock writer test, no contention **
SINGLE_WRITER_TEST_DURATION 10s

IRQ latency for cpu 2 disabled 10036 times, [min,avg,max] 393,396,849 cycles
SoftIRQ latency for cpu 2 disabled 10036 times, [min,avg,max] 609,614,1317 cycles
Preemption latency for cpu 2 disabled 10036 times, [min,avg,max] 825,826,1971 cycles


** Single reader test, no contention **
SINGLE_READER_TEST_DURATION 10s

Preemption latency for cpu 2 disabled 31596702 times, [min,avg,max] 502,508,54256 cycles


** Multiple readers test, no contention (4 readers, busy-loop) **
MULTIPLE_READERS_TEST_DURATION 10s
NR_READERS 4

Preemption latency for cpu 1 disabled 9302974 times, [min,avg,max] 502,2039,88060 cycles
Preemption latency for cpu 3 disabled 9270742 times, [min,avg,max] 508,2045,61342 cycles
Preemption latency for cpu 6 disabled 13331943 times, [min,avg,max] 508,1387,309088 cycles
Preemption latency for cpu 7 disabled 4781453 times, [min,avg,max] 508,4092,230752 cycles


** High contention test **
TEST_DURATION 60s
NR_WRITERS 2
NR_TRYLOCK_WRITERS 1
NR_READERS 4
NR_TRYLOCK_READERS 1
WRITER_DELAY 100us
TRYLOCK_WRITER_DELAY 1000us
TRYLOCK_WRITERS_FAIL_ITER 100
THREAD_READER_DELAY 0   /* busy loop */
INTERRUPT_READER_DELAY 100ms

Standard Linux rwlock

irqsoff latency trace v1.1.5 on 2.6.27-rc3-trace
--------------------------------------------------------------------
 latency: 2902 us, #3/3, CPU#5 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8)
    -----------------
    | task: wbiasrwlock_wri-4984 (uid:0 nice:-5 policy:0 rt_prio:0)
    -----------------
 => started at: _write_lock_irq
 => ended at:   _write_unlock_irq

#                _------=> CPU#            
#               / _-----=> irqs-off        
#              | / _----=> need-resched    
#              || / _---=> hardirq/softirq 
#              ||| / _--=> preempt-depth   
#              |||| /                      
#              |||||     delay             
#  cmd     pid ||||| time  |   caller      
#     \   /    |||||   \   |   /           
wbiasrwl-4984  5d..1    0us!: _write_lock_irq (0)
wbiasrwl-4984  5d..2 2902us : _write_unlock_irq (0)
wbiasrwl-4984  5d..3 2903us : trace_hardirqs_on (_write_unlock_irq)


Writer-biased rwlock, same test routine

irqsoff latency trace v1.1.5 on 2.6.27-rc3-trace
--------------------------------------------------------------------
 latency: 33 us, #3/3, CPU#7 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8)
    -----------------
    | task: events/7-27 (uid:0 nice:-5 policy:0 rt_prio:0)
    -----------------
 => started at: _spin_lock_irqsave
 => ended at:   _spin_unlock_irqrestore

#                _------=> CPU#            
#               / _-----=> irqs-off        
#              | / _----=> need-resched    
#              || / _---=> hardirq/softirq 
#              ||| / _--=> preempt-depth   
#              |||| /                      
#              |||||     delay             
#  cmd     pid ||||| time  |   caller      
#     \   /    |||||   \   |   /           
events/7-27    7d...    0us+: _spin_lock_irqsave (0)
events/7-27    7d..1   33us : _spin_unlock_irqrestore (0)
events/7-27    7d..2   33us : trace_hardirqs_on (_spin_unlock_irqrestore)

(latency unrelated to the tests, therefore irq latency <= 33us)

wbias rwlock instrumentation (below) shows that interrupt latency has been 14176
cycles, for a total of 7us.

Detailed writer-biased rwlock latency breakdown :

IRQ latency for cpu 0 disabled 1086419 times, [min,avg,max] 316,2833,14176 cycles
IRQ latency for cpu 1 disabled 1099517 times, [min,avg,max] 316,1820,8254 cycles
IRQ latency for cpu 3 disabled 159088 times, [min,avg,max] 316,1409,5632 cycles
IRQ latency for cpu 4 disabled 161 times, [min,avg,max] 340,1882,5206 cycles
SoftIRQ latency for cpu 0 disabled 1086419 times, [min,avg,max] 2212,5350,166402 cycles
SoftIRQ latency for cpu 1 disabled 1099517 times, [min,avg,max] 2230,4265,138988 cycles
SoftIRQ latency for cpu 3 disabled 159088 times, [min,avg,max] 2212,3319,14992 cycles
SoftIRQ latency for cpu 4 disabled 161 times, [min,avg,max] 2266,3802,7138 cycles
Preemption latency for cpu 3 disabled 59855 times, [min,avg,max] 5266,15706,53494 cycles
Preemption latency for cpu 4 disabled 72 times, [min,avg,max] 5728,14132,28042 cycles
Preemption latency for cpu 5 disabled 55586612 times, [min,avg,max] 196,2080,126526 cycles

Note : preemptable critical sections have been implemented after the previous
latency tests. It should be noted that the worst latency obtained with the wbias
rwlock comes from the busy-loop for the wait queue protection mutex (100us) :

IRQ latency for cpu 3 disabled 2822178 times, [min,avg,max] 256,8892,209926 cycles
disable : [<ffffffff803acff5>] rwlock_wait+0x265/0x2c0



* Lock contention delays, per context, 60s test
(number of cycles required to take the lock are benchmarked)


** get_cycles calibration **
get_cycles takes [min,avg,max] 72,77,78 cycles, results calibrated on avg


** Single writer test, no contention **
A bit slower here on the fast path, both for lock and unlock, probably due to
preempt disable + bh disable + irq disable in the wbias rwlock :

* Writer-biased rwlocks v8
writer_thread/0 iterations : 100348, lock delay [min,avg,max] 51,58,279 cycles
writer_thread/0 iterations : 100348, unlock delay [min,avg,max] 57,59,3633 cycles

* Standard rwlock, kernel 2.6.27-rc3
writer_thread/0 iterations : 100322, lock delay [min,avg,max] 37,40,4537 cycles
writer_thread/0 iterations : 100322, unlock delay [min,avg,max] 37,40,25435 cycles


** Single preemptable reader test, no contention **
The writer-biased rwlock's uncontended lock and unlock fast paths are twice as
fast. Note that the wbias rwlock supports preemptable readers, while the
standard rwlock disables preemption.

* Writer-biased rwlocks v8
preader_thread/0 iterations : 34204828, lock delay [min,avg,max] 21,23,48537 cycles
preader_thread/0 iterations : 34204828, unlock delay [min,avg,max] 15,22,22497 cycles

* Standard rwlock, kernel 2.6.27-rc3
preader_thread/0 iterations : 31993512, lock delay [min,avg,max] 37,39,264955 cycles
preader_thread/0 iterations : 31993512, unlock delay [min,avg,max] 37,42,151201 cycles


** Single non-preemptable reader test, no contention **
The wbias rwlock read is still twice as fast as the standard rwlock, even for
reads done in non-preemptable context.

* Writer-biased rwlocks v8
npreader_thread/0 iterations : 33631953, lock delay [min,avg,max] 21,23,14919 cycles
npreader_thread/0 iterations : 33631953, unlock delay [min,avg,max] 15,21,16635 cycles

* Standard rwlock, kernel 2.6.27-rc3
npreader_thread/0 iterations : 31639225, lock delay [min,avg,max] 37,39,127111 cycles
npreader_thread/0 iterations : 31639225, unlock delay [min,avg,max] 37,42,215587 cycles


** Multiple p(reemptable)/n(on-)p(reemptable) readers test, no contention **
This contended case, where multiple readers try to access the data structure in
a loop without any writer, shows that the standard rwlock average is a little
bit better than the wbias rwlock's. This can be explained by the fact that the
wbias rwlock cmpxchg operation, which is used to keep the count of active
readers, may fail if there is contention and must therefore be retried. The
fastpath actually expects the number of readers to be 0, which isn't the case
here.

* Writer-biased rwlocks v8
npreader_thread/0 iterations : 16781774, lock delay [min,avg,max] 33,421,33435 cycles
npreader_thread/0 iterations : 16781774, unlock delay [min,avg,max] 21,222,35925 cycles
npreader_thread/1 iterations : 16819056, lock delay [min,avg,max] 21,421,24957 cycles
npreader_thread/1 iterations : 16819056, unlock delay [min,avg,max] 21,222,37707 cycles
preader_thread/0 iterations : 17151363, lock delay [min,avg,max] 21,425,47745 cycles
preader_thread/0 iterations : 17151363, unlock delay [min,avg,max] 21,223,33207 cycles
preader_thread/1 iterations : 17248440, lock delay [min,avg,max] 21,423,689061 cycles
preader_thread/1 iterations : 17248440, unlock delay [min,avg,max] 15,222,685935 cycles

* Standard rwlock, kernel 2.6.27-rc3
npreader_thread/0 iterations : 19248438, lock delay [min,avg,max] 37,273,364459 cycles
npreader_thread/0 iterations : 19248438, unlock delay [min,avg,max] 43,216,272539 cycles
npreader_thread/1 iterations : 19251717, lock delay [min,avg,max] 37,242,365719 cycles
npreader_thread/1 iterations : 19251717, unlock delay [min,avg,max] 43,249,162847 cycles
preader_thread/0 iterations : 19557931, lock delay [min,avg,max] 37,250,334921 cycles
preader_thread/0 iterations : 19557931, unlock delay [min,avg,max] 37,245,266377 cycles
preader_thread/1 iterations : 19671318, lock delay [min,avg,max] 37,258,390913 cycles
preader_thread/1 iterations : 19671318, unlock delay [min,avg,max] 37,234,604507 cycles



** High contention test **

In high contention test :

TEST_DURATION 60s
NR_WRITERS 2
NR_TRYLOCK_WRITERS 1
NR_PREEMPTABLE_READERS 2
NR_NON_PREEMPTABLE_READERS 2
NR_TRYLOCK_READERS 1
WRITER_DELAY 100us
TRYLOCK_WRITER_DELAY 1000us
TRYLOCK_WRITERS_FAIL_ITER 100
THREAD_READER_DELAY 0   /* busy loop */
INTERRUPT_READER_DELAY 100ms


* Preemptable writers

* Writer-biased rwlocks v8
writer_thread/0 iterations : 556371, lock delay [min,avg,max] 153,9020,3138519 cycles
writer_thread/0 iterations : 556371, unlock delay [min,avg,max] 171,7277,1134309 cycles
writer_thread/1 iterations : 558540, lock delay [min,avg,max] 147,7047,3048447 cycles
writer_thread/1 iterations : 558540, unlock delay [min,avg,max] 171,8554,99633 cycles

* Standard rwlock, kernel 2.6.27-rc3
writer_thread/0 iterations : 222797, lock delay [min,avg,max] 127,336611,4710367 cycles
writer_thread/0 iterations : 222797, unlock delay [min,avg,max] 151,2009,714115 cycles
writer_thread/1 iterations : 6845, lock delay [min,avg,max] 139,17271138,352848961 cycles
writer_thread/1 iterations : 6845, unlock delay [min,avg,max] 217,93935,1991509 cycles


* Non-preemptable readers

* Writer-biased rwlocks v8
npreader_thread/0 iterations : 53472673, lock delay [min,avg,max] 21,804,119709 cycles
npreader_thread/0 iterations : 53472673, unlock delay [min,avg,max] 15,890,90213 cycles
npreader_thread/1 iterations : 53883582, lock delay [min,avg,max] 21,804,120225 cycles
npreader_thread/1 iterations : 53883582, unlock delay [min,avg,max] 15,876,82809 cycles

* Standard rwlock, kernel 2.6.27-rc3
npreader_thread/0 iterations : 68298472, lock delay [min,avg,max] 37,640,733423 cycles
npreader_thread/0 iterations : 68298472, unlock delay [min,avg,max] 37,565,672241 cycles
npreader_thread/1 iterations : 70331311, lock delay [min,avg,max] 37,603,393925 cycles
npreader_thread/1 iterations : 70331311, unlock delay [min,avg,max] 37,558,373477 cycles


* Preemptable readers

* Writer-biased rwlocks v8 (note : preemption enabled, explains higher numbers)
preader_thread/0 iterations : 40652556, lock delay [min,avg,max] 21,1237,5796567 cycles
preader_thread/0 iterations : 40652556, unlock delay [min,avg,max] 15,1196,1655289 cycles
preader_thread/1 iterations : 18260923, lock delay [min,avg,max] 21,4833,3181587 cycles
preader_thread/1 iterations : 18260923, unlock delay [min,avg,max] 21,1228,862395 cycles

* Standard rwlock, kernel 2.6.27-rc3 (note : std rwlock disables preemption)
preader_thread/0 iterations : 69683738, lock delay [min,avg,max] 37,622,742165 cycles
preader_thread/0 iterations : 69683738, unlock delay [min,avg,max] 37,566,1331971 cycles
preader_thread/1 iterations : 79758111, lock delay [min,avg,max] 37,543,757315 cycles
preader_thread/1 iterations : 79758111, unlock delay [min,avg,max] 37,427,813199 cycles


* Interrupt context readers

* Writer-biased rwlocks v8 (note : some of the higher unlock delays (38us) are
caused by the wakeup of the wait queue done at the exit of the critical section
if the waitqueue is active)
interrupt readers on CPU 0, lock delay [min,avg,max] 153,975,8607 cycles
interrupt readers on CPU 0, unlock delay [min,avg,max] 9,2186,45843 cycles
interrupt readers on CPU 1, lock delay [min,avg,max] 69,813,13557 cycles
interrupt readers on CPU 1, unlock delay [min,avg,max] 9,2323,76005 cycles
interrupt readers on CPU 2, lock delay [min,avg,max] 147,949,9111 cycles
interrupt readers on CPU 2, unlock delay [min,avg,max] 9,3232,64269 cycles
interrupt readers on CPU 3, lock delay [min,avg,max] 69,850,22137 cycles
interrupt readers on CPU 3, unlock delay [min,avg,max] 15,1760,46509 cycles
interrupt readers on CPU 4, lock delay [min,avg,max] 135,804,8559 cycles
interrupt readers on CPU 4, unlock delay [min,avg,max] 9,1898,57009 cycles
interrupt readers on CPU 6, lock delay [min,avg,max] 129,904,11877 cycles
interrupt readers on CPU 6, unlock delay [min,avg,max] 9,3009,81321 cycles
interrupt readers on CPU 7, lock delay [min,avg,max] 135,1239,21489 cycles
interrupt readers on CPU 7, unlock delay [min,avg,max] 9,2215,68991 cycles

* Standard rwlock, kernel 2.6.27-rc3 (note : these numbers are not that bad, but
they do not take interrupt latency into account. See the tests above for a
discussion of this issue)
interrupt readers on CPU 0, lock delay [min,avg,max] 55,573,4417 cycles
interrupt readers on CPU 0, unlock delay [min,avg,max] 43,529,1591 cycles
interrupt readers on CPU 1, lock delay [min,avg,max] 139,591,5731 cycles
interrupt readers on CPU 1, unlock delay [min,avg,max] 31,534,2395 cycles
interrupt readers on CPU 2, lock delay [min,avg,max] 127,671,6043 cycles
interrupt readers on CPU 2, unlock delay [min,avg,max] 37,401,1567 cycles
interrupt readers on CPU 3, lock delay [min,avg,max] 151,676,5569 cycles
interrupt readers on CPU 3, unlock delay [min,avg,max] 127,536,2797 cycles
interrupt readers on CPU 5, lock delay [min,avg,max] 127,531,15397 cycles
interrupt readers on CPU 5, unlock delay [min,avg,max] 31,323,1747 cycles
interrupt readers on CPU 6, lock delay [min,avg,max] 121,548,29125 cycles
interrupt readers on CPU 6, unlock delay [min,avg,max] 31,435,2089 cycles
interrupt readers on CPU 7, lock delay [min,avg,max] 37,613,5485 cycles
interrupt readers on CPU 7, unlock delay [min,avg,max] 49,541,1645 cycles

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
---
 include/linux/wbias-rwlock.h |  548 ++++++++++++++++++++++++++++++++++
 lib/Kconfig                  |    3 
 lib/Kconfig.debug            |    4 
 lib/Makefile                 |    7 
 lib/wbias-rwlock-fastpath.c  |  204 ++++++++++++
 lib/wbias-rwlock.c           |  679 +++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 1445 insertions(+)

Index: linux-2.6-lttng/include/linux/wbias-rwlock.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/wbias-rwlock.h	2008-08-21 16:33:58.000000000 -0400
@@ -0,0 +1,548 @@
+#ifndef _LINUX_WBIAS_RWLOCK_H
+#define _LINUX_WBIAS_RWLOCK_H
+
+/*
+ * Writer-biased low-latency rwlock
+ *
+ * Allows writer bias wrt reader threads and also minimally impacts the irq
+ * latency of the system.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ * August 2008
+ */
+
+#include <linux/hardirq.h>
+#include <linux/wait.h>
+
+#include <asm/atomic.h>
+
+/*
+ * Those are the maximum number of writers and readers for each context.
+ * Carefully chosen using a precise rule of thumb.
+ * The readers and writers will busy-loop when the count reaches its max,
+ * therefore no overflow is possible. The number of execution threads which can
+ * enter a specific lock or subscribe for it is MAX - 1 (because 0 has to be
+ * counted). Therefore, the minimum values for those MAX is 2, which allows 0 or
+ * 1 execution context to hold the lock or subscribe at any given time.
+ *
+ * General rule : there should be no reader from context lower than the priority
+ * of the writer. e.g. if a writer lock is taken in softirq context or when
+ * softirqs are disabled, only softirq and irq readers are allowed.
+ *
+ * A writer should use the lock primitives protecting it against the highest
+ * reader priority which reads the structure.
+ *
+ * Must keep 5 more bits for writer :
+ * WRITER_MUTEX, SOFTIRQ_WMASK, HARDIRQ_WOFFSET, WQ_MUTEX, WQ_ACTIVE
+ *
+ * PTHREAD stands for preemptable thread.
+ * NPTHREAD stands for non-preemptable thread.
+ *
+ * The ordering of WQ_MUTEX, WQ_ACTIVE wrt the checks done by the
+ * busy-looping/waiting threads is tricky. WQ_MUTEX protects the waitqueue and
+ * WQ_ACTIVE changes. WQ_MUTEX must be taken with interrupts off.
+ *
+ * Upon unlock, the following sequence is done :
+ * - atomically unlock and return the lock value, which contains the WQ_MUTEX
+ *   and WQ_ACTIVE bits at the moment of the unlock.
+ * - check if either WQ_ACTIVE or WQ_MUTEX is set, if so :
+ *   - take WQ_MUTEX
+ *   - wake a thread
+ *   - release WQ_MUTEX
+ *
+ * When a thread is ready to be added to the wait queue :
+ * - the last busy-looping iteration fails.
+ * - take the WQ_MUTEX
+ * - check again for the failed condition, since its status may have changed
+ *   since the busy-loop failed. If the condition now succeeds, return to
+ *   busy-looping after releasing the WQ_MUTEX.
+ * - Add the current thread to the wait queue, change state.
+ * - Atomically release WQ_MUTEX and set WQ_ACTIVE if the list passed from
+ *   inactive to active.
+ *
+ * Upon wakeup :
+ * - take WQ_MUTEX
+ * - Set state to running, remove from the wait queue.
+ * - Atomically release WQ_MUTEX and clear WQ_ACTIVE if the list passed to
+ *   inactive.
+ */
+
+#ifdef CONFIG_DEBUG_WBIAS_RWLOCK
+#define PTHREAD_RMAX	2
+#define NPTHREAD_RMAX	2
+#define SOFTIRQ_RMAX	2
+#define HARDIRQ_RMAX	2
+
+#define PTHREAD_WMAX	2
+#define NPTHREAD_WMAX	2
+#elif (BITS_PER_LONG == 32)
+#define PTHREAD_RMAX	64
+#define NPTHREAD_RMAX	16
+#define SOFTIRQ_RMAX	16
+#define HARDIRQ_RMAX	8
+
+#define PTHREAD_WMAX	32
+#define NPTHREAD_WMAX	16
+#else
+#define PTHREAD_RMAX	16384
+#define NPTHREAD_RMAX	2048
+#define SOFTIRQ_RMAX	512
+#define HARDIRQ_RMAX	256
+
+#define PTHREAD_WMAX	512
+#define NPTHREAD_WMAX	256
+#endif
+
+#define PTHREAD_ROFFSET		1UL
+#define PTHREAD_RMASK		((PTHREAD_RMAX - 1) * PTHREAD_ROFFSET)
+#define NPTHREAD_ROFFSET	(PTHREAD_RMASK + 1)
+#define NPTHREAD_RMASK		((NPTHREAD_RMAX - 1) * NPTHREAD_ROFFSET)
+#define SOFTIRQ_ROFFSET		((NPTHREAD_RMASK | PTHREAD_RMASK) + 1)
+#define SOFTIRQ_RMASK		((SOFTIRQ_RMAX - 1) * SOFTIRQ_ROFFSET)
+#define HARDIRQ_ROFFSET		\
+	((SOFTIRQ_RMASK | NPTHREAD_RMASK | PTHREAD_RMASK) + 1)
+#define HARDIRQ_RMASK		((HARDIRQ_RMAX - 1) * HARDIRQ_ROFFSET)
+
+#define PTHREAD_WOFFSET	\
+	((HARDIRQ_RMASK | SOFTIRQ_RMASK | NPTHREAD_RMASK | PTHREAD_RMASK) + 1)
+#define PTHREAD_WMASK	((PTHREAD_WMAX - 1) * PTHREAD_WOFFSET)
+#define NPTHREAD_WOFFSET ((PTHREAD_WMASK | HARDIRQ_RMASK | SOFTIRQ_RMASK | \
+			NPTHREAD_RMASK | PTHREAD_RMASK) + 1)
+#define NPTHREAD_WMASK	((NPTHREAD_WMAX - 1) * NPTHREAD_WOFFSET)
+#define WRITER_MUTEX	((PTHREAD_WMASK | NPTHREAD_WMASK | HARDIRQ_RMASK | \
+			SOFTIRQ_RMASK | NPTHREAD_RMASK | PTHREAD_RMASK) + 1)
+#define SOFTIRQ_WMASK		(WRITER_MUTEX << 1)
+#define SOFTIRQ_WOFFSET		SOFTIRQ_WMASK
+#define HARDIRQ_WMASK		(SOFTIRQ_WMASK << 1)
+#define HARDIRQ_WOFFSET		HARDIRQ_WMASK
+#define WQ_MUTEX		(HARDIRQ_WMASK << 1)
+#define WQ_ACTIVE		(WQ_MUTEX << 1)
+
+#define NR_PREEMPT_BUSY_LOOPS	100
+
+typedef struct wbias_rwlock {
+	atomic_long_t v;
+	wait_queue_head_t wq_read;	/* Preemptable readers wait queue */
+	wait_queue_head_t wq_write;	/* Preemptable writers wait queue */
+} wbias_rwlock_t;
+
+#define __RAW_WBIAS_RWLOCK_UNLOCKED	{ 0 }
+
+#define __WBIAS_RWLOCK_UNLOCKED(x)					\
+	{								\
+		.v = __RAW_WBIAS_RWLOCK_UNLOCKED,			\
+		.wq_read = __WAIT_QUEUE_HEAD_INITIALIZER((x).wq_read),	\
+		.wq_write = __WAIT_QUEUE_HEAD_INITIALIZER((x).wq_write),\
+	}
+
+#define DEFINE_WBIAS_RWLOCK(x)	wbias_rwlock_t x = __WBIAS_RWLOCK_UNLOCKED(x)
+
+/*
+ * Internal slow paths.
+ */
+extern int _wbias_read_lock_slow(wbias_rwlock_t *rwlock,
+		long v, long roffset, long rmask, long wmask, int trylock);
+extern int _wbias_read_lock_slow_preempt(wbias_rwlock_t *rwlock,
+		long v, long roffset, long rmask, long wmask, int trylock);
+extern void _wbias_write_lock_irq_slow(wbias_rwlock_t *rwlock, long v);
+extern int _wbias_write_trylock_irq_else_subscribe_slow(wbias_rwlock_t *rwlock,
+		long v);
+extern int _wbias_write_trylock_irq_subscribed_slow(wbias_rwlock_t *rwlock,
+		long v);
+extern void _wbias_write_lock_bh_slow(wbias_rwlock_t *rwlock, long v);
+extern int _wbias_write_trylock_bh_else_subscribe_slow(wbias_rwlock_t *rwlock,
+		long v);
+extern int _wbias_write_trylock_bh_subscribed_slow(wbias_rwlock_t *rwlock,
+		long v);
+extern void _wbias_write_lock_atomic_slow(wbias_rwlock_t *rwlock, long v);
+extern int
+_wbias_write_trylock_atomic_else_subscribe_slow(wbias_rwlock_t *rwlock,
+		long v);
+extern int _wbias_write_trylock_atomic_subscribed_slow(wbias_rwlock_t *rwlock,
+		long v);
+extern void _wbias_write_lock_slow(wbias_rwlock_t *rwlock, long v);
+extern int _wbias_write_trylock_else_subscribe_slow(wbias_rwlock_t *rwlock,
+		long v);
+extern int _wbias_write_trylock_subscribed_slow(wbias_rwlock_t *rwlock,
+		long v);
+extern void _wbias_rwlock_wakeup(wbias_rwlock_t *rwlock, long v);
+
+static inline void wbias_rwlock_preempt_check(wbias_rwlock_t *rwlock, long v)
+{
+	if (unlikely(v & (WQ_ACTIVE | WQ_MUTEX)))
+		_wbias_rwlock_wakeup(rwlock, v);
+}
+
+/*
+ * rwlock-specific latency tracing, maps to standard macros by default.
+ */
+#ifdef CONFIG_WBIAS_RWLOCK_LATENCY_TEST
+#include <linux/wbias-latency-trace.h>
+#else
+static inline void wbias_rwlock_profile_latency_reset(void)
+{ }
+static inline void wbias_rwlock_profile_latency_print(void)
+{ }
+
+#define wbias_rwlock_irq_save(flags)	local_irq_save(flags)
+#define wbias_rwlock_irq_restore(flags)	local_irq_restore(flags)
+#define wbias_rwlock_irq_disable()	local_irq_disable()
+#define wbias_rwlock_irq_enable()	local_irq_enable()
+#define wbias_rwlock_bh_disable()	local_bh_disable()
+#define wbias_rwlock_bh_enable()	local_bh_enable()
+#define wbias_rwlock_bh_enable_ip(ip)	local_bh_enable_ip(ip)
+#define wbias_rwlock_preempt_disable()	preempt_disable()
+#define wbias_rwlock_preempt_enable()	preempt_enable()
+#define wbias_rwlock_preempt_enable_no_resched()	\
+		preempt_enable_no_resched()
+#endif
+
+
+/*
+ * API
+ */
+
+
+/*
+ * This table represents which write lock should be taken depending on the
+ * highest priority read context and the write context used.
+ *
+ * e.g. given the highest priority context from which we take the read lock is
+ * interrupt context (IRQ) and that we always use the write lock from
+ * non-preemptable context (NP), we should use the IRQ version of the write
+ * lock.
+ *
+ * X means : don't !
+ *
+ * Highest priority reader context / Writer context   P     NP    BH    IRQ
+ * ------------------------------------------------------------------------
+ * P                                                | P     X     X     X
+ * NP                                               | NP    NP    X     X
+ * BH                                               | BH    BH    BH    X
+ * IRQ                                              | IRQ   IRQ   IRQ   IRQ
+ */
+
+
+/* Reader lock */
+
+/*
+ * many readers, from irq/softirq/thread context.
+ * protects against writers.
+ */
+
+/*
+ * Called from interrupt disabled or interrupt context.
+ */
+static inline void wbias_read_lock_irq(wbias_rwlock_t *rwlock)
+{
+	long v = atomic_long_cmpxchg(&rwlock->v, 0, HARDIRQ_ROFFSET);
+	if (likely(!v))
+		return;
+	_wbias_read_lock_slow(rwlock, v,
+		HARDIRQ_ROFFSET, HARDIRQ_RMASK, HARDIRQ_WMASK, 0);
+}
+
+static inline int wbias_read_trylock_irq(wbias_rwlock_t *rwlock)
+{
+	long v = atomic_long_cmpxchg(&rwlock->v, 0, HARDIRQ_ROFFSET);
+	if (likely(!v))
+		return 1;
+	return _wbias_read_lock_slow(rwlock, v,
+		HARDIRQ_ROFFSET, HARDIRQ_RMASK, HARDIRQ_WMASK, 1);
+}
+
+static inline void wbias_read_unlock_irq(wbias_rwlock_t *rwlock)
+{
+	long v;
+	v = atomic_long_sub_return(HARDIRQ_ROFFSET, &rwlock->v);
+	wbias_rwlock_preempt_check(rwlock, v);
+}
+
+
+/*
+ * Called from softirq context.
+ */
+
+static inline void wbias_read_lock_bh(wbias_rwlock_t *rwlock)
+{
+	long v = atomic_long_cmpxchg(&rwlock->v, 0, SOFTIRQ_ROFFSET);
+	if (likely(!v))
+		return;
+	_wbias_read_lock_slow(rwlock, v,
+		SOFTIRQ_ROFFSET, SOFTIRQ_RMASK, SOFTIRQ_WMASK, 0);
+}
+
+static inline int wbias_read_trylock_bh(wbias_rwlock_t *rwlock)
+{
+	long v = atomic_long_cmpxchg(&rwlock->v, 0, SOFTIRQ_ROFFSET);
+	if (likely(!v))
+		return 1;
+	return _wbias_read_lock_slow(rwlock, v,
+		SOFTIRQ_ROFFSET, SOFTIRQ_RMASK, SOFTIRQ_WMASK, 1);
+}
+
+static inline void wbias_read_unlock_bh(wbias_rwlock_t *rwlock)
+{
+	long v;
+	v = atomic_long_sub_return(SOFTIRQ_ROFFSET, &rwlock->v);
+	wbias_rwlock_preempt_check(rwlock, v);
+}
+
+
+/*
+ * Called from non-preemptable thread context.
+ */
+
+static inline void wbias_read_lock_inatomic(wbias_rwlock_t *rwlock)
+{
+	long v = atomic_long_cmpxchg(&rwlock->v, 0, NPTHREAD_ROFFSET);
+	if (likely(!v))
+		return;
+	_wbias_read_lock_slow(rwlock, v,
+		NPTHREAD_ROFFSET, NPTHREAD_RMASK,
+		NPTHREAD_WMASK | WRITER_MUTEX, 0);
+}
+
+static inline int wbias_read_trylock_inatomic(wbias_rwlock_t *rwlock)
+{
+	long v = atomic_long_cmpxchg(&rwlock->v, 0, NPTHREAD_ROFFSET);
+	if (likely(!v))
+		return 1;
+	return _wbias_read_lock_slow(rwlock, v,
+		NPTHREAD_ROFFSET, NPTHREAD_RMASK,
+		NPTHREAD_WMASK | WRITER_MUTEX, 1);
+}
+
+static inline void wbias_read_unlock_inatomic(wbias_rwlock_t *rwlock)
+{
+	long v;
+	v = atomic_long_sub_return(NPTHREAD_ROFFSET, &rwlock->v);
+	wbias_rwlock_preempt_check(rwlock, v);
+}
+
+
+/*
+ * Called from preemptable thread context.
+ */
+
+static inline void wbias_read_lock(wbias_rwlock_t *rwlock)
+{
+	long v = atomic_long_cmpxchg(&rwlock->v, 0, PTHREAD_ROFFSET);
+	if (likely(!v))
+		return;
+	_wbias_read_lock_slow_preempt(rwlock, v,
+		PTHREAD_ROFFSET, PTHREAD_RMASK,
+		PTHREAD_WMASK | NPTHREAD_WMASK | WRITER_MUTEX, 0);
+}
+
+static inline int wbias_read_trylock(wbias_rwlock_t *rwlock)
+{
+	long v = atomic_long_cmpxchg(&rwlock->v, 0, PTHREAD_ROFFSET);
+	if (likely(!v))
+		return 1;
+	return _wbias_read_lock_slow_preempt(rwlock, v,
+		PTHREAD_ROFFSET, PTHREAD_RMASK,
+		PTHREAD_WMASK | NPTHREAD_WMASK | WRITER_MUTEX, 1);
+}
+
+static inline void wbias_read_unlock(wbias_rwlock_t *rwlock)
+{
+	long v;
+	v = atomic_long_sub_return(PTHREAD_ROFFSET, &rwlock->v);
+	wbias_rwlock_preempt_check(rwlock, v);
+}
+
+
+/*
+ * Automatic selection of the right locking primitive depending on the
+ * current context. Should be used when the context cannot be statically
+ * determined or as a compatibility layer for the old rwlock.
+ */
+static inline void wbias_read_lock_generic(wbias_rwlock_t *rwlock)
+{
+	if (in_irq() || irqs_disabled())
+		wbias_read_lock_irq(rwlock);
+	else if (in_softirq())
+		wbias_read_lock_bh(rwlock);
+	else if (in_atomic())
+		wbias_read_lock_inatomic(rwlock);
+	else
+		wbias_read_lock(rwlock);
+}
+
+static inline int wbias_read_trylock_generic(wbias_rwlock_t *rwlock)
+{
+	if (in_irq() || irqs_disabled())
+		return wbias_read_trylock_irq(rwlock);
+	else if (in_softirq())
+		return wbias_read_trylock_bh(rwlock);
+	else if (in_atomic())
+		return wbias_read_trylock_inatomic(rwlock);
+	else
+		return wbias_read_trylock(rwlock);
+}
+
+static inline void wbias_read_unlock_generic(wbias_rwlock_t *rwlock)
+{
+	if (in_irq() || irqs_disabled())
+		wbias_read_unlock_irq(rwlock);
+	else if (in_softirq())
+		wbias_read_unlock_bh(rwlock);
+	else if (in_atomic())
+		wbias_read_unlock_inatomic(rwlock);
+	else
+		wbias_read_unlock(rwlock);
+}
+
+/* Writer Lock */
+
+
+/*
+ * Safe against other writers in same context.
+ * If used in irq handler or with irq off context, safe against :
+ * - irq readers.
+ * If used in softirq handler or softirq off context, safe against :
+ * - irq/softirq readers.
+ * If used in non-preemptable context, safe against :
+ * - irq/softirq/non-preemptable readers.
+ * If used in preemptable context, safe against :
+ * - irq/softirq/non-preemptable/preemptable readers.
+ *
+ * Write lock use :
+ *  wbias_write_lock_irq(&lock);
+ *  ...
+ *  wbias_write_unlock_irq(&lock);
+ *
+ * or
+ *
+ * int i;
+ *
+ * repeat:
+ *  if (wbias_write_trylock_irq_else_subscribe(&lock))
+ *    goto locked;
+ *  for (i = 0; i < NR_LOOPS; i++)
+ *    if (wbias_write_trylock_irq_subscribed(&lock))
+ *      goto locked;
+ *  do_failed_to_grab_lock(...);
+ *  wbias_write_unsubscribe(&lock);
+ *  goto repeat;
+ * locked:
+ *  ...
+ * unlock:
+ *  wbias_write_unlock_irq(&lock);
+ */
+void wbias_write_lock_irq(wbias_rwlock_t *rwlock);
+int wbias_write_trylock_irq_else_subscribe(wbias_rwlock_t *rwlock);
+int wbias_write_trylock_irq_subscribed(wbias_rwlock_t *rwlock);
+
+static inline void wbias_write_unlock_irq(wbias_rwlock_t *rwlock)
+{
+	long v;
+	/*
+	 * atomic_long_sub makes sure we commit the data before reenabling
+	 * the lock.
+	 */
+	v = atomic_long_sub_return(HARDIRQ_WOFFSET
+			| SOFTIRQ_WOFFSET | WRITER_MUTEX,
+			&rwlock->v);
+	wbias_rwlock_irq_enable();
+	wbias_rwlock_bh_enable();
+	wbias_rwlock_preempt_enable();
+	wbias_rwlock_preempt_check(rwlock, v);
+}
+
+/*
+ * Safe against other writers in same context.
+ * If used in softirq handler or softirq off context, safe against :
+ * - softirq readers.
+ * If used in non-preemptable context, safe against :
+ * - softirq/npthread readers.
+ * If used in preemptable context, safe against :
+ * - softirq/npthread/pthread readers.
+ */
+void wbias_write_lock_bh(wbias_rwlock_t *rwlock);
+int wbias_write_trylock_bh_else_subscribe(wbias_rwlock_t *rwlock);
+int wbias_write_trylock_bh_subscribed(wbias_rwlock_t *rwlock);
+
+static inline void wbias_write_unlock_bh(wbias_rwlock_t *rwlock)
+{
+	long v;
+	/*
+	 * atomic_long_sub makes sure we commit the data before reenabling
+	 * the lock.
+	 */
+	v = atomic_long_sub_return(SOFTIRQ_WOFFSET | WRITER_MUTEX, &rwlock->v);
+	wbias_rwlock_bh_enable();
+	wbias_rwlock_preempt_enable();
+	wbias_rwlock_preempt_check(rwlock, v);
+}
+
+/*
+ * Safe against other writers in same context.
+ * If used in non-preemptable context, safe against :
+ * - non-preemptable readers.
+ * If used in preemptable context, safe against :
+ * - non-preemptable and preemptable thread readers.
+ */
+void wbias_write_lock_atomic(wbias_rwlock_t *rwlock);
+int wbias_write_trylock_atomic_else_subscribe(wbias_rwlock_t *rwlock);
+int wbias_write_trylock_atomic_subscribed(wbias_rwlock_t *rwlock);
+
+static inline void wbias_write_unlock_atomic(wbias_rwlock_t *rwlock)
+{
+	long v;
+	/*
+	 * atomic_long_sub makes sure we commit the data before reenabling
+	 * the lock.
+	 */
+	v = atomic_long_sub_return(WRITER_MUTEX, &rwlock->v);
+	wbias_rwlock_preempt_enable();
+	wbias_rwlock_preempt_check(rwlock, v);
+}
+
+/*
+ * Safe against other writers in same context.
+ * Safe against preemptable thread readers.
+ */
+void wbias_write_lock(wbias_rwlock_t *rwlock);
+int wbias_write_trylock_else_subscribe(wbias_rwlock_t *rwlock);
+int wbias_write_trylock_subscribed(wbias_rwlock_t *rwlock);
+
+static inline void wbias_write_unlock(wbias_rwlock_t *rwlock)
+{
+	long v;
+	/*
+	 * atomic_long_sub makes sure we commit the data before reenabling
+	 * the lock.
+	 */
+	v = atomic_long_sub_return(WRITER_MUTEX, &rwlock->v);
+	wbias_rwlock_preempt_check(rwlock, v);
+}
+
+/*
+ * Unsubscribe writer after a failed trylock.
+ * Should not be called after a successful trylock, since it unsubscribes
+ * when the lock is successfully taken.
+ */
+static inline void wbias_write_unsubscribe_atomic(wbias_rwlock_t *rwlock)
+{
+	long v;
+	v = atomic_long_sub_return(PTHREAD_WOFFSET | NPTHREAD_WOFFSET,
+		&rwlock->v);
+	wbias_rwlock_preempt_enable();
+	wbias_rwlock_preempt_check(rwlock, v);
+}
+
+#define wbias_write_unsubscribe_irq(rwlock)	\
+	wbias_write_unsubscribe_atomic(rwlock)
+#define wbias_write_unsubscribe_bh(rwlock)	\
+	wbias_write_unsubscribe_atomic(rwlock)
+
+static inline void wbias_write_unsubscribe(wbias_rwlock_t *rwlock)
+{
+	long v;
+	v = atomic_long_sub_return(PTHREAD_WOFFSET, &rwlock->v);
+	wbias_rwlock_preempt_check(rwlock, v);
+}
+
+#endif /* _LINUX_WBIAS_RWLOCK_H */
Index: linux-2.6-lttng/lib/Makefile
===================================================================
--- linux-2.6-lttng.orig/lib/Makefile	2008-08-21 04:06:52.000000000 -0400
+++ linux-2.6-lttng/lib/Makefile	2008-08-21 04:07:01.000000000 -0400
@@ -43,6 +43,13 @@ obj-$(CONFIG_DEBUG_PREEMPT) += smp_proce
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
 obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o
 
+obj-y += wbias-rwlock.o
+ifneq ($(CONFIG_HAVE_WBIAS_RWLOCK_FASTPATH),y)
+obj-y += wbias-rwlock-fastpath.o
+endif
+obj-$(CONFIG_WBIAS_RWLOCK_LATENCY_TEST) += wbias-rwlock-latency-trace.o
+
+
 ifneq ($(CONFIG_HAVE_DEC_LOCK),y)
   lib-y += dec_and_lock.o
 endif
Index: linux-2.6-lttng/lib/wbias-rwlock.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/lib/wbias-rwlock.c	2008-08-21 16:28:34.000000000 -0400
@@ -0,0 +1,679 @@
+/*
+ * Writer-biased low-latency rwlock
+ *
+ * Allows writer bias wrt reader threads and also minimally impacts the irq
+ * and softirq latency of the system.
+ *
+ * A typical case leading to long interrupt latencies :
+ *
+ * - rwlock shared between
+ *   - Rare update in thread context
+ *   - Frequent slow read in thread context (task list iteration)
+ *   - Fast interrupt handler read
+ *
+ * The slow write must therefore disable interrupts around the write lock,
+ * but will therefore add to the global interrupt latency; worst case being
+ * the duration of the slow read.
+ *
+ * This writer biased rwlock writer subscribes to the preemptable lock, which
+ * locks out the preemptable reader threads. It waits for all the preemptable
+ * reader threads to exit their critical section. Then, it disables preemption,
+ * subscribes for the non-preemptable lock. Then, it waits until all
+ * non-preemptable reader threads exited their critical section and takes the
+ * mutex (1 bit within the "long"). Then it disables softirqs, locks out the
+ * softirqs, waits for all softirqs to exit their critical section, disables
+ * irqs and locks out irqs. It then waits for irqs to exit their critical
+ * section. Only then is the writer allowed to modify the data structure.
+ *
+ * The writer fast path checks for a non-contended lock (all bits set to 0) and
+ * does an atomic cmpxchg to set the subscriptions, writer mutex, softirq
+ * exclusion and irq exclusion bits.
+ *
+ * The reader does an atomic cmpxchg to check if there is any contention on the
+ * lock. If not, it increments the reader count for its context (preemptable
+ * thread, non-preemptable thread, softirq, irq).
+ *
+ * Copyright 2008 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+
+#include <linux/wbias-rwlock.h>
+#include <linux/wait.h>
+#include <linux/freezer.h>
+#include <linux/module.h>
+
+#ifdef WBIAS_RWLOCK_DEBUG
+#define printk_dbg printk
+#else
+#define printk_dbg(fmt, args...)
+#endif
+
+/*
+ * Lock out a specific execution context from the read lock. Wait for both the
+ * rmask and the wmask to be empty before proceeding to take the lock.
+ *
+ * Decrement wsuboffset when taking the wmask.
+ */
+static long _wbias_write_lock_ctx_wait(wbias_rwlock_t *rwlock,
+		long v, long rmask, long wmask, long woffset, long wsuboffset)
+{
+	long newv;
+
+	for (;;) {
+		if (v & (rmask | wmask)) {
+			/* Order v reads */
+			barrier();
+			v = atomic_long_read(&rwlock->v);
+			continue;
+		}
+		newv = atomic_long_cmpxchg(&rwlock->v,
+				v, v - wsuboffset + woffset);
+		if (likely(newv == v))
+			break;
+		else
+			v = newv;
+	}
+	printk_dbg("lib writer got in with v %lX, new %lX, rmask %lX\n",
+		v, v - wsuboffset + woffset, rmask);
+	return v - wsuboffset + woffset;
+}
+
+/*
+ * mask, v & mask_fill == mask_fill are the conditions for which we wait.
+ */
+static void rwlock_wait(wbias_rwlock_t *rwlock, long v, long mask,
+		long mask_fill, int check_mask_fill, int write)
+{
+	DECLARE_WAITQUEUE(wbias_rwlock_wq, current);
+	int wq_active;
+
+	/*
+	 * Busy-loop waiting for the waitqueue mutex.
+	 */
+	wbias_rwlock_irq_disable();
+	v = _wbias_write_lock_ctx_wait(rwlock, v, 0, WQ_MUTEX, WQ_MUTEX, 0);
+	/*
+	 * Before we go to sleep, check that the lock we were expecting
+	 * did not free between the moment we last checked for the lock and the
+	 * moment we took the WQ_MUTEX.
+	 */
+	if (!(v & mask || (check_mask_fill && (v & mask_fill) == mask_fill))) {
+		atomic_long_sub(WQ_MUTEX, &rwlock->v);
+		wbias_rwlock_irq_enable();
+		return;
+	}
+	/*
+	 * Got the waitqueue mutex, get into the wait queue.
+	 */
+	__set_current_state(TASK_UNINTERRUPTIBLE);
+	wq_active = waitqueue_active(&rwlock->wq_read)
+			|| waitqueue_active(&rwlock->wq_write);
+	/*
+	 * Only one thread will be woken up at a time.
+	 */
+	if (write)
+		add_wait_queue_exclusive_locked(&rwlock->wq_write,
+			&wbias_rwlock_wq);
+	else
+		__add_wait_queue(&rwlock->wq_read, &wbias_rwlock_wq);
+	if (!wq_active)
+		atomic_long_sub(WQ_MUTEX - WQ_ACTIVE, &rwlock->v);
+	else
+		atomic_long_sub(WQ_MUTEX, &rwlock->v);
+	wbias_rwlock_irq_enable();
+	try_to_freeze();
+	schedule();
+	/*
+	 * Woken up; Busy-loop waiting for the waitqueue mutex.
+	 */
+	wbias_rwlock_irq_disable();
+	v = atomic_long_read(&rwlock->v);
+	_wbias_write_lock_ctx_wait(rwlock, v, 0, WQ_MUTEX, WQ_MUTEX, 0);
+	__set_current_state(TASK_RUNNING);
+	if (write)
+		remove_wait_queue_locked(&rwlock->wq_write, &wbias_rwlock_wq);
+	else
+		remove_wait_queue_locked(&rwlock->wq_read, &wbias_rwlock_wq);
+	wq_active = waitqueue_active(&rwlock->wq_read)
+			|| waitqueue_active(&rwlock->wq_write);
+	if (!wq_active)
+		atomic_long_sub(WQ_MUTEX + WQ_ACTIVE, &rwlock->v);
+	else
+		atomic_long_sub(WQ_MUTEX, &rwlock->v);
+	wbias_rwlock_irq_enable();
+}
+
+/*
+ * Reader lock
+ *
+ * First try to get the uncontended lock. If it is non-zero (can be common,
+ * since we allow multiple readers), pass the returned cmpxchg v to the loop
+ * to try to get the reader lock.
+ *
+ * trylock will fail if a writer is subscribed or holds the lock, but will
+ * spin if there is concurrency to win the cmpxchg. It could happen if, for
+ * instance, other concurrent readers need to update the roffset or if a
+ * writer updated lock bits which do not contend us. Since many
+ * concurrent readers is a common case, it makes sense not to fail if it
+ * happens.
+ *
+ * The non-trylock case will spin in both situations.
+ *
+ * Busy-loop if the reader count is full.
+ */
+
+int _wbias_read_lock_slow(wbias_rwlock_t *rwlock,
+		long v, long roffset, long rmask, long wmask, int trylock)
+{
+	long newv;
+
+	for (;;) {
+		if (v & wmask || ((v & rmask) == rmask)) {
+			if (!trylock) {
+				/* Order v reads */
+				barrier();
+				v = atomic_long_read(&rwlock->v);
+				continue;
+			} else
+				return 0;
+		}
+
+		newv = atomic_long_cmpxchg(&rwlock->v, v, v + roffset);
+		if (likely(newv == v))
+			break;
+		else
+			v = newv;
+	}
+	printk_dbg("lib reader got in with v %lX, wmask %lX\n", v, wmask);
+	return 1;
+}
+EXPORT_SYMBOL(_wbias_read_lock_slow);
+
+int _wbias_read_lock_slow_preempt(wbias_rwlock_t *rwlock,
+		long v, long roffset, long rmask, long wmask, int trylock)
+{
+	int try = NR_PREEMPT_BUSY_LOOPS;
+	long newv;
+
+	might_sleep();
+	for (;;) {
+		if (v & wmask || ((v & rmask) == rmask)) {
+			if (unlikely(!(--try))) {
+				rwlock_wait(rwlock, v, wmask, rmask, 1, 0);
+				try = NR_PREEMPT_BUSY_LOOPS;
+			}
+			if (!trylock) {
+				/* Order v reads */
+				barrier();
+				v = atomic_long_read(&rwlock->v);
+				continue;
+			} else
+				return 0;
+		}
+
+		newv = atomic_long_cmpxchg(&rwlock->v, v, v + roffset);
+		if (likely(newv == v))
+			break;
+		else
+			v = newv;
+	}
+	printk_dbg("lib reader got in with v %lX, wmask %lX\n", v, wmask);
+	return 1;
+}
+EXPORT_SYMBOL(_wbias_read_lock_slow_preempt);
+
+/* Writer lock */
+
+static long _wbias_write_lock_ctx_wait_preempt(wbias_rwlock_t *rwlock,
+		long v, long rmask)
+{
+	int try = NR_PREEMPT_BUSY_LOOPS;
+
+	while (v & rmask) {
+		if (unlikely(!(--try))) {
+			rwlock_wait(rwlock, v, rmask, 0, 0, 1);
+			try = NR_PREEMPT_BUSY_LOOPS;
+		}
+		v = atomic_long_read(&rwlock->v);
+		/* Order v reads */
+		barrier();
+	}
+	printk_dbg("lib writer got in with v %lX, rmask %lX\n", v, rmask);
+	return v;
+}
+
+/*
+ * Lock out a specific execution context from the read lock. First lock the
+ * read context out of the lock, then wait for all readers to exit their
+ * critical section. This function should always be called with the
+ * WRITER_MUTEX bit held, because it expects to set the woffset bit as a one-hot.
+ */
+static void _wbias_write_lock_ctx_exclusive(wbias_rwlock_t *rwlock,
+		long rmask, long woffset)
+{
+	long v;
+
+	atomic_long_add(woffset, &rwlock->v);
+	do {
+		v = atomic_long_read(&rwlock->v);
+		/* Order v reads */
+		barrier();
+	} while (v & rmask);
+	printk_dbg("lib writer got in with v %lX, woffset %lX, rmask %lX\n",
+		v, woffset, rmask);
+}
+
+/*
+ * Subscribe to write lock.
+ * Busy-loop if the writer count reached its max.
+ */
+static long _wbias_write_subscribe(wbias_rwlock_t *rwlock, long v,
+		long wmask, long woffset)
+{
+	long newv;
+
+	for (;;) {
+		if ((v & wmask) == wmask) {
+			/* Order v reads */
+			barrier();
+			v = atomic_long_read(&rwlock->v);
+			continue;
+		}
+		newv = atomic_long_cmpxchg(&rwlock->v, v, v + woffset);
+		if (likely(newv == v))
+			break;
+		else
+			v = newv;
+	}
+	return v + woffset;
+}
+
+void _wbias_write_lock_irq_slow(wbias_rwlock_t *rwlock, long v)
+{
+	WARN_ON_ONCE(!in_softirq() || !irqs_disabled());
+
+	wbias_rwlock_irq_enable();
+	wbias_rwlock_bh_enable();
+	wbias_rwlock_preempt_enable();
+
+	/* lock out preemptable threads */
+	v = _wbias_write_subscribe(rwlock, v, PTHREAD_WMASK, PTHREAD_WOFFSET);
+
+	/*
+	 * continue when no preemptable reader threads left, but keep
+	 * subscription.
+	 */
+	v = _wbias_write_lock_ctx_wait_preempt(rwlock, v, PTHREAD_RMASK);
+
+	/*
+	 * lock out non-preemptable threads. Atomically unsubscribe from
+	 * preemptable thread lock out and subscribe for non-preemptable thread
+	 * lock out. Preemptable threads must check for NPTHREAD_WMASK and
+	 * WRITER_MUTEX too.
+	 */
+	wbias_rwlock_preempt_disable();
+	v = _wbias_write_subscribe(rwlock, v, NPTHREAD_WMASK,
+		NPTHREAD_WOFFSET - PTHREAD_WOFFSET);
+
+	/* lock out other writers when no non-preemptable reader threads left */
+	_wbias_write_lock_ctx_wait(rwlock, v, NPTHREAD_RMASK,
+		WRITER_MUTEX, WRITER_MUTEX, NPTHREAD_WOFFSET);
+
+	/* lock out softirqs */
+	wbias_rwlock_bh_disable();
+	_wbias_write_lock_ctx_exclusive(rwlock, SOFTIRQ_RMASK, SOFTIRQ_WOFFSET);
+
+	/* lock out hardirqs */
+	wbias_rwlock_irq_disable();
+	_wbias_write_lock_ctx_exclusive(rwlock, HARDIRQ_RMASK, HARDIRQ_WOFFSET);
+
+	/* atomic_long_cmpxchg orders writes */
+}
+
+/*
+ * Will take the lock or subscribe. Will fail if there is contention caused by
+ * other writers. Will also fail if there is a concurrent cmpxchg update, even
+ * if it is only a writer subscription. This is not considered a common case
+ * and therefore does not justify looping.
+ *
+ * Leaves preemption disabled even if it fails.
+ */
+int _wbias_write_trylock_irq_else_subscribe_slow(wbias_rwlock_t *rwlock,
+		long v)
+{
+	WARN_ON_ONCE(!in_softirq() || !irqs_disabled());
+
+	wbias_rwlock_irq_enable();
+	wbias_rwlock_bh_enable();
+
+	/* lock out preemptable and non-preemptable threads */
+	v = _wbias_write_subscribe(rwlock, v,
+		PTHREAD_WMASK | NPTHREAD_WMASK,
+		PTHREAD_WOFFSET | NPTHREAD_WOFFSET);
+
+	if (likely(!(v & ~(PTHREAD_WMASK | NPTHREAD_WMASK)))) {
+		/* no other reader nor writer present, try to take the lock */
+		wbias_rwlock_bh_disable();
+		wbias_rwlock_irq_disable();
+		if (likely(atomic_long_cmpxchg(&rwlock->v, v,
+			v - (PTHREAD_WOFFSET | NPTHREAD_WOFFSET)
+			+ (SOFTIRQ_WOFFSET | HARDIRQ_WOFFSET | WRITER_MUTEX))
+				== v))
+			return 1;
+		wbias_rwlock_irq_enable();
+		wbias_rwlock_bh_enable();
+	}
+	return 0;
+}
+
+int _wbias_write_trylock_irq_subscribed_slow(wbias_rwlock_t *rwlock,
+		long v)
+{
+	WARN_ON_ONCE(!in_softirq() || !irqs_disabled());
+
+	wbias_rwlock_irq_enable();
+	wbias_rwlock_bh_enable();
+
+	if (likely(!(v & ~(PTHREAD_WMASK | NPTHREAD_WMASK)))) {
+		/* no other reader nor writer present, try to take the lock */
+		wbias_rwlock_bh_disable();
+		wbias_rwlock_irq_disable();
+		if (likely(atomic_long_cmpxchg(&rwlock->v, v,
+			v - (PTHREAD_WOFFSET | NPTHREAD_WOFFSET)
+			+ (SOFTIRQ_WOFFSET | HARDIRQ_WOFFSET | WRITER_MUTEX))
+				== v))
+			return 1;
+		wbias_rwlock_irq_enable();
+		wbias_rwlock_bh_enable();
+	}
+	return 0;
+}
+
+
+void _wbias_write_lock_bh_slow(wbias_rwlock_t *rwlock, long v)
+{
+	WARN_ON_ONCE(v & (HARDIRQ_RMASK | HARDIRQ_WMASK));
+	WARN_ON_ONCE(in_interrupt() || irqs_disabled());
+
+	wbias_rwlock_bh_enable();
+	wbias_rwlock_preempt_enable();
+
+	/* lock out preemptable threads */
+	v = _wbias_write_subscribe(rwlock, v, PTHREAD_WMASK, PTHREAD_WOFFSET);
+
+	/*
+	 * continue when no preemptable reader threads left, but keep
+	 * subscription.
+	 */
+	v = _wbias_write_lock_ctx_wait_preempt(rwlock, v, PTHREAD_RMASK);
+
+	/*
+	 * lock out non-preemptable threads. Atomically unsubscribe from
+	 * preemptable thread lock out and subscribe for non-preemptable thread
+	 * lock out. Preemptable threads must check for NPTHREAD_WMASK and
+	 * WRITER_MUTEX too.
+	 */
+	wbias_rwlock_preempt_disable();
+	v = _wbias_write_subscribe(rwlock, v, NPTHREAD_WMASK,
+		NPTHREAD_WOFFSET - PTHREAD_WOFFSET);
+
+	/* lock out other writers when no non-preemptable reader threads left */
+	_wbias_write_lock_ctx_wait(rwlock, v, NPTHREAD_RMASK,
+		WRITER_MUTEX, WRITER_MUTEX, NPTHREAD_WOFFSET);
+
+	/* lock out softirqs */
+	wbias_rwlock_bh_disable();
+	_wbias_write_lock_ctx_exclusive(rwlock, SOFTIRQ_RMASK, SOFTIRQ_WOFFSET);
+
+	/* atomic_long_cmpxchg orders writes */
+}
+
+/*
+ * Will take the lock or subscribe. Will fail if there is contention caused by
+ * other writers. Will also fail if there is a concurrent cmpxchg update, even
+ * if it is only a writer subscription. This is not considered a common case
+ * and therefore does not justify looping.
+ *
+ * Leaves preemption disabled even if it fails.
+ */
+int _wbias_write_trylock_bh_else_subscribe_slow(wbias_rwlock_t *rwlock,
+		long v)
+{
+	WARN_ON_ONCE(v & (HARDIRQ_RMASK | HARDIRQ_WMASK));
+	WARN_ON_ONCE(in_interrupt() || irqs_disabled());
+
+	wbias_rwlock_bh_enable();
+
+	/* lock out preemptable and non-preemptable threads */
+	v = _wbias_write_subscribe(rwlock, v,
+		PTHREAD_WMASK | NPTHREAD_WMASK,
+		PTHREAD_WOFFSET | NPTHREAD_WOFFSET);
+
+	if (likely(!(v & ~(PTHREAD_WMASK | NPTHREAD_WMASK)))) {
+		/* no other reader nor writer present, try to take the lock */
+		wbias_rwlock_bh_disable();
+		if (likely(atomic_long_cmpxchg(&rwlock->v, v,
+			v - (PTHREAD_WOFFSET | NPTHREAD_WOFFSET)
+			+ (SOFTIRQ_WOFFSET | WRITER_MUTEX))
+				== v))
+			return 1;
+		wbias_rwlock_bh_enable();
+	}
+	return 0;
+}
+
+int _wbias_write_trylock_bh_subscribed_slow(wbias_rwlock_t *rwlock,
+		long v)
+{
+	WARN_ON_ONCE(v & (HARDIRQ_RMASK | HARDIRQ_WMASK));
+	WARN_ON_ONCE(in_interrupt() || irqs_disabled());
+
+	wbias_rwlock_bh_enable();
+
+	if (likely(!(v & ~(PTHREAD_WMASK | NPTHREAD_WMASK)))) {
+		/* no other reader nor writer present, try to take the lock */
+		wbias_rwlock_bh_disable();
+		if (likely(atomic_long_cmpxchg(&rwlock->v, v,
+			v - (PTHREAD_WOFFSET | NPTHREAD_WOFFSET)
+			+ (SOFTIRQ_WOFFSET | WRITER_MUTEX))
+				== v))
+			return 1;
+		wbias_rwlock_bh_enable();
+	}
+	return 0;
+}
+
+void _wbias_write_lock_atomic_slow(wbias_rwlock_t *rwlock, long v)
+{
+	WARN_ON_ONCE(v & (HARDIRQ_RMASK | SOFTIRQ_RMASK
+			| HARDIRQ_WMASK | SOFTIRQ_WMASK));
+	WARN_ON_ONCE(in_interrupt() || irqs_disabled());
+
+	wbias_rwlock_preempt_enable();
+
+	/* lock out preemptable threads */
+	v = _wbias_write_subscribe(rwlock, v, PTHREAD_WMASK, PTHREAD_WOFFSET);
+
+	/*
+	 * continue when no preemptable reader threads left, but keep
+	 * subscription.
+	 */
+	v = _wbias_write_lock_ctx_wait_preempt(rwlock, v, PTHREAD_RMASK);
+
+	/*
+	 * lock out non-preemptable threads. Atomically unsubscribe from
+	 * preemptable thread lock out and subscribe for non-preemptable thread
+	 * lock out. Preemptable threads must check for NPTHREAD_WMASK and
+	 * WRITER_MUTEX too.
+	 */
+	wbias_rwlock_preempt_disable();
+	v = _wbias_write_subscribe(rwlock, v, NPTHREAD_WMASK,
+		NPTHREAD_WOFFSET - PTHREAD_WOFFSET);
+
+	/* lock out other writers when no non-preemptable reader threads left */
+	_wbias_write_lock_ctx_wait(rwlock, v, NPTHREAD_RMASK,
+		WRITER_MUTEX, WRITER_MUTEX, NPTHREAD_WOFFSET);
+
+	/* atomic_long_cmpxchg orders writes */
+}
+
+/*
+ * Will take the lock or subscribe. Will fail if there is contention caused by
+ * other writers. Will also fail if there is a concurrent cmpxchg update, even
+ * if it is only a writer subscription. This is not considered a common case
+ * and therefore does not justify looping.
+ *
+ * Leaves preemption disabled even if it fails.
+ */
+int _wbias_write_trylock_atomic_else_subscribe_slow(wbias_rwlock_t *rwlock,
+		long v)
+{
+	WARN_ON_ONCE(v & (HARDIRQ_RMASK | SOFTIRQ_RMASK
+			| HARDIRQ_WMASK | SOFTIRQ_WMASK));
+	WARN_ON_ONCE(in_interrupt() || irqs_disabled());
+
+	/* lock out preemptable and non-preemptable threads */
+	v = _wbias_write_subscribe(rwlock, v,
+		PTHREAD_WMASK | NPTHREAD_WMASK,
+		PTHREAD_WOFFSET | NPTHREAD_WOFFSET);
+
+	if (likely(!(v & ~(PTHREAD_WMASK | NPTHREAD_WMASK)))) {
+		/* no other reader nor writer present, try to take the lock */
+		if (likely(atomic_long_cmpxchg(&rwlock->v, v,
+			v - (PTHREAD_WOFFSET | NPTHREAD_WOFFSET) + WRITER_MUTEX)
+				== v))
+			return 1;
+	}
+	return 0;
+}
+
+int _wbias_write_trylock_atomic_subscribed_slow(wbias_rwlock_t *rwlock,
+		long v)
+{
+	WARN_ON_ONCE(v & (HARDIRQ_RMASK | SOFTIRQ_RMASK
+			| HARDIRQ_WMASK | SOFTIRQ_WMASK));
+	WARN_ON_ONCE(in_interrupt() || irqs_disabled());
+
+	if (likely(!(v & ~(PTHREAD_WMASK | NPTHREAD_WMASK)))) {
+		/* no other reader nor writer present, try to take the lock */
+		if (likely(atomic_long_cmpxchg(&rwlock->v, v,
+			v - (PTHREAD_WOFFSET | NPTHREAD_WOFFSET) + WRITER_MUTEX)
+				== v))
+			return 1;
+	}
+	return 0;
+}
+
+void _wbias_write_lock_slow(wbias_rwlock_t *rwlock, long v)
+{
+	int try = NR_PREEMPT_BUSY_LOOPS;
+	long newv;
+
+	WARN_ON_ONCE(v & (HARDIRQ_RMASK | SOFTIRQ_RMASK | NPTHREAD_RMASK
+			| HARDIRQ_WMASK | SOFTIRQ_WMASK | NPTHREAD_WMASK));
+	WARN_ON_ONCE(in_atomic() || irqs_disabled());
+	might_sleep();
+
+	/* lock out preemptable threads */
+	v = _wbias_write_subscribe(rwlock, v, PTHREAD_WMASK, PTHREAD_WOFFSET);
+
+	/* lock out other writers when no preemptable reader threads left */
+	for (;;) {
+		if (v & (PTHREAD_RMASK | WRITER_MUTEX)) {
+			if (unlikely(!(--try))) {
+				rwlock_wait(rwlock, v,
+					PTHREAD_RMASK | WRITER_MUTEX, 0, 0, 1);
+				try = NR_PREEMPT_BUSY_LOOPS;
+			}
+			/* Order v reads */
+			barrier();
+			v = atomic_long_read(&rwlock->v);
+			continue;
+		}
+		newv = atomic_long_cmpxchg(&rwlock->v,
+			v, v - PTHREAD_WOFFSET + WRITER_MUTEX);
+		if (likely(newv == v))
+			break;
+		else
+			v = newv;
+	}
+
+	/* atomic_long_cmpxchg orders writes */
+}
+
+/*
+ * Will take the lock or subscribe. Will fail if there is contention caused by
+ * other writers. Will also fail if there is a concurrent cmpxchg update, even
+ * if it is only a writer subscription. This is not considered a common case
+ * and therefore does not justify looping.
+ */
+int _wbias_write_trylock_else_subscribe_slow(wbias_rwlock_t *rwlock,
+		long v)
+{
+	WARN_ON_ONCE(v & (HARDIRQ_RMASK | SOFTIRQ_RMASK | NPTHREAD_RMASK
+			| HARDIRQ_WMASK | SOFTIRQ_WMASK | NPTHREAD_WMASK));
+	WARN_ON_ONCE(in_atomic() || irqs_disabled());
+	might_sleep();
+
+	/* lock out preemptable threads */
+	v = _wbias_write_subscribe(rwlock, v, PTHREAD_WMASK, PTHREAD_WOFFSET);
+
+	if (likely(!(v & ~PTHREAD_WMASK))) {
+		/* no other reader nor writer present, try to take the lock */
+		if (likely(atomic_long_cmpxchg(&rwlock->v, v,
+			v - PTHREAD_WOFFSET + WRITER_MUTEX) == v))
+			return 1;
+	}
+	return 0;
+}
+
+int _wbias_write_trylock_subscribed_slow(wbias_rwlock_t *rwlock,
+		long v)
+{
+	WARN_ON_ONCE(v & (HARDIRQ_RMASK | SOFTIRQ_RMASK | NPTHREAD_RMASK
+			| HARDIRQ_WMASK | SOFTIRQ_WMASK | NPTHREAD_WMASK));
+	WARN_ON_ONCE(in_atomic() || irqs_disabled());
+	might_sleep();
+
+	if (likely(!(v & ~PTHREAD_WMASK))) {
+		/* no other reader nor writer present, try to take the lock */
+		if (likely(atomic_long_cmpxchg(&rwlock->v, v,
+			v - PTHREAD_WOFFSET + WRITER_MUTEX) == v))
+			return 1;
+	}
+	return 0;
+}
+
+/*
+ * Called from any context (irq/softirq/preempt/non-preempt). Contains a
+ * busy-loop; must therefore disable interrupts, but only for a short time.
+ */
+void _wbias_rwlock_wakeup(wbias_rwlock_t *rwlock, long v)
+{
+	unsigned long flags;
+	/*
+	 * If there is at least one non-preemptable writer subscribed or holding
+	 * higher priority write masks, let it handle the wakeup when it exits
+	 * its critical section which excludes any preemptable context anyway.
+	 * The same applies to preemptable readers, which are the only ones
+	 * which can cause a preemptable writer to sleep.
+	 */
+	if (v & (HARDIRQ_WMASK | SOFTIRQ_WMASK | WRITER_MUTEX | NPTHREAD_WMASK
+		| PTHREAD_RMASK))
+		return;
+	/*
+	 * Busy-loop waiting for the waitqueue mutex.
+	 */
+	wbias_rwlock_irq_save(flags);
+	_wbias_write_lock_ctx_wait(rwlock, v, 0, WQ_MUTEX, WQ_MUTEX, 0);
+	/*
+	 * First do an exclusive wake-up of the first writer if there is one
+	 * waiting, else wake-up the readers.
+	 */
+	if (waitqueue_active(&rwlock->wq_write))
+		wake_up_locked(&rwlock->wq_write);
+	else
+		wake_up_locked(&rwlock->wq_read);
+	atomic_long_sub(WQ_MUTEX, &rwlock->v);
+	wbias_rwlock_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(_wbias_rwlock_wakeup);
Index: linux-2.6-lttng/lib/wbias-rwlock-fastpath.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/lib/wbias-rwlock-fastpath.c	2008-08-21 16:30:59.000000000 -0400
@@ -0,0 +1,204 @@
+/*
+ * Writer-biased low-latency rwlock
+ *
+ * Allows writer bias wrt readers while minimally impacting the irq latency
+ * of the system. Architecture-agnostic fastpath which can be replaced by
+ * architecture-specific assembly code by setting the Kconfig option
+ * HAVE_WBIAS_RWLOCK_FASTPATH.
+ *
+ * See wbias-rwlock.c for details. It contains the slow paths.
+ *
+ * Copyright 2008 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+
+#include <linux/wbias-rwlock.h>
+#include <linux/module.h>
+
+void wbias_write_lock_irq(wbias_rwlock_t *rwlock)
+{
+	long v;
+
+	wbias_rwlock_preempt_disable();
+	wbias_rwlock_bh_disable();
+	wbias_rwlock_irq_disable();
+	/* no other reader nor writer present, try to take the lock */
+	v = atomic_long_cmpxchg(&rwlock->v, 0,
+			SOFTIRQ_WOFFSET | HARDIRQ_WOFFSET | WRITER_MUTEX);
+	if (likely(!v))
+		return;
+	else
+		_wbias_write_lock_irq_slow(rwlock, v);
+}
+EXPORT_SYMBOL_GPL(wbias_write_lock_irq);
+
+/*
+ * Disables preemption even if it fails.
+ */
+int wbias_write_trylock_irq_else_subscribe(wbias_rwlock_t *rwlock)
+{
+	long v;
+
+	wbias_rwlock_preempt_disable();
+	wbias_rwlock_bh_disable();
+	wbias_rwlock_irq_disable();
+	/* no other reader nor writer present, try to take the lock */
+	v = atomic_long_cmpxchg(&rwlock->v, 0,
+			SOFTIRQ_WOFFSET | HARDIRQ_WOFFSET | WRITER_MUTEX);
+	if (likely(!v))
+		return 1;
+	else
+		return _wbias_write_trylock_irq_else_subscribe_slow(rwlock, v);
+}
+EXPORT_SYMBOL_GPL(wbias_write_trylock_irq_else_subscribe);
+
+int wbias_write_trylock_irq_subscribed(wbias_rwlock_t *rwlock)
+{
+	long v;
+
+	wbias_rwlock_bh_disable();
+	wbias_rwlock_irq_disable();
+	/* no other reader nor writer present, try to take the lock */
+	v = atomic_long_cmpxchg(&rwlock->v, PTHREAD_WOFFSET | NPTHREAD_WOFFSET,
+			SOFTIRQ_WOFFSET | HARDIRQ_WOFFSET | WRITER_MUTEX);
+	if (likely(v == (PTHREAD_WOFFSET | NPTHREAD_WOFFSET)))
+		return 1;
+	else
+		return _wbias_write_trylock_irq_subscribed_slow(rwlock, v);
+}
+EXPORT_SYMBOL_GPL(wbias_write_trylock_irq_subscribed);
+
+void wbias_write_lock_bh(wbias_rwlock_t *rwlock)
+{
+	long v;
+
+	wbias_rwlock_preempt_disable();
+	wbias_rwlock_bh_disable();
+	/* no other reader nor writer present, try to take the lock */
+	v = atomic_long_cmpxchg(&rwlock->v, 0, SOFTIRQ_WOFFSET | WRITER_MUTEX);
+	if (likely(!v))
+		return;
+	else
+		_wbias_write_lock_bh_slow(rwlock, v);
+}
+EXPORT_SYMBOL_GPL(wbias_write_lock_bh);
+
+/*
+ * Disables preemption even if it fails.
+ */
+int wbias_write_trylock_bh_else_subscribe(wbias_rwlock_t *rwlock)
+{
+	long v;
+
+	wbias_rwlock_preempt_disable();
+	wbias_rwlock_bh_disable();
+	/* no other reader nor writer present, try to take the lock */
+	v = atomic_long_cmpxchg(&rwlock->v, 0, SOFTIRQ_WOFFSET | WRITER_MUTEX);
+	if (likely(!v))
+		return 1;
+	else
+		return _wbias_write_trylock_bh_else_subscribe_slow(rwlock, v);
+}
+EXPORT_SYMBOL_GPL(wbias_write_trylock_bh_else_subscribe);
+
+int wbias_write_trylock_bh_subscribed(wbias_rwlock_t *rwlock)
+{
+	long v;
+
+	/* no other reader nor writer present, try to take the lock */
+	v = atomic_long_cmpxchg(&rwlock->v, PTHREAD_WOFFSET | NPTHREAD_WOFFSET,
+			SOFTIRQ_WOFFSET | WRITER_MUTEX);
+	if (likely(v == (PTHREAD_WOFFSET | NPTHREAD_WOFFSET)))
+		return 1;
+	else
+		return _wbias_write_trylock_bh_subscribed_slow(rwlock, v);
+}
+EXPORT_SYMBOL_GPL(wbias_write_trylock_bh_subscribed);
+
+
+void wbias_write_lock_atomic(wbias_rwlock_t *rwlock)
+{
+	long v;
+
+	wbias_rwlock_preempt_disable();
+	/* no other reader nor writer present, try to take the lock */
+	v = atomic_long_cmpxchg(&rwlock->v, 0, WRITER_MUTEX);
+	if (likely(!v))
+		return;
+	else
+		_wbias_write_lock_atomic_slow(rwlock, v);
+}
+EXPORT_SYMBOL_GPL(wbias_write_lock_atomic);
+
+int wbias_write_trylock_atomic_else_subscribe(wbias_rwlock_t *rwlock)
+{
+	long v;
+
+	wbias_rwlock_preempt_disable();
+	/* no other reader nor writer present, try to take the lock */
+	v = atomic_long_cmpxchg(&rwlock->v, 0, WRITER_MUTEX);
+	if (likely(!v))
+		return 1;
+	else
+		return _wbias_write_trylock_atomic_else_subscribe_slow(rwlock,
+				v);
+}
+EXPORT_SYMBOL_GPL(wbias_write_trylock_atomic_else_subscribe);
+
+int wbias_write_trylock_atomic_subscribed(wbias_rwlock_t *rwlock)
+{
+	long v;
+
+	/* no other reader nor writer present, try to take the lock */
+	v = atomic_long_cmpxchg(&rwlock->v, PTHREAD_WOFFSET | NPTHREAD_WOFFSET,
+			WRITER_MUTEX);
+	if (likely(v == (PTHREAD_WOFFSET | NPTHREAD_WOFFSET)))
+		return 1;
+	else
+		return _wbias_write_trylock_atomic_subscribed_slow(rwlock, v);
+}
+EXPORT_SYMBOL_GPL(wbias_write_trylock_atomic_subscribed);
+
+
+void wbias_write_lock(wbias_rwlock_t *rwlock)
+{
+	long v;
+
+	/* no other reader nor writer present, try to take the lock */
+	v = atomic_long_cmpxchg(&rwlock->v, 0, WRITER_MUTEX);
+	if (likely(!v))
+		return;
+	else
+		_wbias_write_lock_slow(rwlock, v);
+}
+EXPORT_SYMBOL_GPL(wbias_write_lock);
+
+int wbias_write_trylock_else_subscribe(wbias_rwlock_t *rwlock)
+{
+	long v;
+
+	/* no other reader nor writer present, try to take the lock */
+	v = atomic_long_cmpxchg(&rwlock->v, 0, WRITER_MUTEX);
+	if (likely(!v))
+		return 1;
+	else
+		return _wbias_write_trylock_else_subscribe_slow(rwlock, v);
+}
+EXPORT_SYMBOL_GPL(wbias_write_trylock_else_subscribe);
+
+/*
+ * Should be called with preemption off, already subscribed.
+ * Supports having NPTHREAD_WOFFSET set even if unused to balance unsubscribe.
+ */
+int wbias_write_trylock_subscribed(wbias_rwlock_t *rwlock)
+{
+	long v;
+
+	/* no other reader nor writer present, try to take the lock */
+	v = atomic_long_cmpxchg(&rwlock->v, PTHREAD_WOFFSET | NPTHREAD_WOFFSET,
+			WRITER_MUTEX);
+	if (likely(v == (PTHREAD_WOFFSET | NPTHREAD_WOFFSET)))
+		return 1;
+	else
+		return _wbias_write_trylock_subscribed_slow(rwlock, v);
+}
+EXPORT_SYMBOL_GPL(wbias_write_trylock_subscribed);
Index: linux-2.6-lttng/lib/Kconfig
===================================================================
--- linux-2.6-lttng.orig/lib/Kconfig	2008-08-21 04:06:52.000000000 -0400
+++ linux-2.6-lttng/lib/Kconfig	2008-08-21 04:07:01.000000000 -0400
@@ -157,4 +157,7 @@ config CHECK_SIGNATURE
 config HAVE_LMB
 	boolean
 
+config HAVE_WBIAS_RWLOCK_FASTPATH
+	def_bool n
+
 endmenu
Index: linux-2.6-lttng/lib/Kconfig.debug
===================================================================
--- linux-2.6-lttng.orig/lib/Kconfig.debug	2008-08-21 12:26:46.000000000 -0400
+++ linux-2.6-lttng/lib/Kconfig.debug	2008-08-21 12:27:38.000000000 -0400
@@ -680,6 +680,10 @@ config FAULT_INJECTION_STACKTRACE_FILTER
 	help
 	  Provide stacktrace filter for fault-injection capabilities
 
+config DEBUG_WBIAS_RWLOCK
+	boolean "Debug writer-biased rwlock"
+	help
+
 config LATENCYTOP
 	bool "Latency measuring infrastructure"
 	select FRAME_POINTER if !MIPS

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Writer-biased low-latency rwlock v8
  2008-08-21 20:50                         ` [RFC PATCH] Writer-biased low-latency rwlock v8 Mathieu Desnoyers
@ 2008-08-21 21:00                           ` Linus Torvalds
  2008-08-21 21:15                             ` Linus Torvalds
  2008-08-21 21:26                             ` H. Peter Anvin
  0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2008-08-21 21:00 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, H. Peter Anvin, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel,
	Steven Rostedt



On Thu, 21 Aug 2008, Mathieu Desnoyers wrote:
>
> Given that I need to limit the number of readers to a smaller amount of 
> bits

I stopped reading right there.

Why?

Because that is already crap.

Go look at my code once more. Go look at how it has 128 bits of data, 
exactly so that it DOES NOT HAVE TO LIMIT THE NUMBER OF READERS.

And then go look at it again.

Look at it five times, and until you can understand that it still uses 
just a 32-bit word for the fast-path and no unnecessarily crap in it, but 
it actually has 128 bits of data for all the slow paths, don't bother 
emailing me any new versions.

Please. You -still- apparently haven't looked at it, at least not enough 
to understand the _point_ of it. You still go on about trying to fit in 
three or four different numbers in that one word. Even though the whole 
point of my rwlock is that you need exactly _one_ count (active readers), 
and _one_ bit (active writer) and _one_ extra bit ("contention, go to slow 
path, look at the other bits ONLY IN THE SLOW PATH!")

That leaves 30 bits for readers. If you still think you need to "limit the 
number of readers", then you aren't getting it.

		Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Writer-biased low-latency rwlock v8
  2008-08-21 21:00                           ` Linus Torvalds
@ 2008-08-21 21:15                             ` Linus Torvalds
  2008-08-21 22:22                               ` Linus Torvalds
  2008-08-23  5:09                               ` Mathieu Desnoyers
  2008-08-21 21:26                             ` H. Peter Anvin
  1 sibling, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2008-08-21 21:15 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, H. Peter Anvin, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel,
	Steven Rostedt



On Thu, 21 Aug 2008, Linus Torvalds wrote:
> 
> That leaves 30 bits for readers. If you still think you need to "limit the 
> number of readers", then you aren't getting it.

Side note: the actual main rwlock thing is designed for a 64-bit word and 
the waiters separately as two 32-bit words, so it doesn't really do what I 
describe, but that's actually because the whole sleeping thing is _harder_ 
than a spinning thing, and has races with wakeups etc.

A spinning thing, in contrast, is pretty trivial.

So here's what I think your code should be like:

 rdlock:
	movl $4,%eax
	lock ; xaddl %eax,(%rdi)
	testl $3,%eax
	jne __rdlock_slowpath
	ret

 rwlock:
	xorl %eax,%eax
	movl $1,%edx
	lock ; cmpxchgl %edx,(%rdi)
	jne __rwlock_slowpath
	ret

 rdunlock:
	lock ; subl $4,(%rdi)
	ret

 rwunlock:
	lock ; andl $~1,(%rdi)
	ret

and I'm pretty damn sure that that should be totally sufficient for a 
spinning rwlock. The non-spinning one is more complex just because the 
unlock paths need to guarantee that something gets woken up, that just 
isn't an issue when you do spinlocks.
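
In C, the same fast paths would look roughly like this -- just a sketch, 
using a plain "unsigned int" lock word and gcc __sync builtins instead of 
the kernel atomics, and as untested as the asm above:

	void __rdlock_slowpath(unsigned int *lock);
	void __rwlock_slowpath(unsigned int *lock);

	void rdlock(unsigned int *lock)
	{
		/* the old value tells us if a writer or the contention bit was set */
		unsigned int old = __sync_fetch_and_add(lock, 4);
		if (old & 3)
			__rdlock_slowpath(lock);
	}

	void rwlock(unsigned int *lock)
	{
		/* only succeeds if the word is completely idle */
		if (__sync_val_compare_and_swap(lock, 0, 1) != 0)
			__rwlock_slowpath(lock);
	}

	void rdunlock(unsigned int *lock)
	{
		__sync_fetch_and_sub(lock, 4);
	}

	void rwunlock(unsigned int *lock)
	{
		__sync_fetch_and_and(lock, ~1U);
	}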

Now, in the slow-path:
 - on the rwlock slowpath side, set bit#1 to make sure that readers get 
   caught in the slowpath
 - then do a *separate* count of how many pending readers and writers 
   (ie the ones that got caught into the slowpath) you have (one word 
   each is probably fine), and then the slowpaths can just do the right 
   thing depending on whether there are pending readers/writers.

See? The main lock need not worry about the number of writers AT ALL, because 
it's totally irrelevant. So don't worry about running out of bits. You 
won't. Just put those counts somewhere else! The only thing that matters 
for the main lock word is whether there are active readers (30 bits), and 
whether there is an active writer (there can only ever be one: 1 bit), and 
whether new readers should be trapped (1 bit).

If you worry about overflows, you're doing something wrong.

			Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Writer-biased low-latency rwlock v8
  2008-08-21 21:00                           ` Linus Torvalds
  2008-08-21 21:15                             ` Linus Torvalds
@ 2008-08-21 21:26                             ` H. Peter Anvin
  2008-08-21 21:41                               ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: H. Peter Anvin @ 2008-08-21 21:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Paul E. McKenney, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel,
	Steven Rostedt

Linus Torvalds wrote:
> 
> Because that is already crap.
> 
> Go look at my code once more. Go look at how it has 128 bits of data, 
> exactly so that it DOES NOT HAVE TO LIMIT THE NUMBER OF READERS.
> 
> And then go look at it again.
> 
> Look at it five times, and until you can understand that it still uses 
> just a 32-bit word for the fast-path and no unnecessarily crap in it, but 
> it actually has 128 bits of data for all the slow paths, don't bother 
> emailing me any new versions.
> 
> Please. You -still- apparently haven't looked at it, at least not enough 
> to understand the _point_ of it. You still go on about trying to fit in 
> three or four different numbers in that one word. Even though the whole 
> point of my rwlock is that you need exactly _one_ count (active writers), 
> and _one_ bit (active reader) and _one_ extra bit ("contention, go to slow 
> path, look at the other bits ONLY IN THE SLOW PATH!")
> 
> That leaves 30 bits for readers. If you still think you need to "limit the 
> number of readers", then you aren't getting it.
> 

First of all, let me say I don't pretend to understand formally how you 
deal with overflow-after-the-fact, as unlikely as it is.

However, it seems to me to be an easy way to avoid it.  Simply by 
changing the read-test mask to $0x80000003, you will kick the code down 
the slow path once the read counter reaches $0x80000004 (2^29+1 
readers), where you can do any necessary fixup -- or BUG() -- at leisure.

This fastpath ends up being identical in size and performance to the one 
you posted, although yours could be reduced by changing the test to a 
testb instruction -- at the almost certainly unacceptable expense of 
taking a partial-register stall on the CPUs that have those.

	-hpa

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Writer-biased low-latency rwlock v8
  2008-08-21 21:26                             ` H. Peter Anvin
@ 2008-08-21 21:41                               ` Linus Torvalds
  0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2008-08-21 21:41 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Mathieu Desnoyers, Paul E. McKenney, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel,
	Steven Rostedt



On Thu, 21 Aug 2008, H. Peter Anvin wrote:
>
> First of all, let me say I don't pretend to understand formally how you deal
> with overflow-after-the-fact, as unlikely as it is.

Just make sure it can't overflow. With spinlocks, you are guaranteed that 
you won't have more than NR_CPU's thing, so 20 bits is pretty safe. 30 
bits is ridiculously safe.

> However, it seems to me to be an easy way to avoid it.  Simply by changing the
> read-test mask to $0x80000003, you will kick the code down the slow path once
> the read counter reaches $0x80000004 (2^29+1 readers), where you can do any
> necessary fixup -- or BUG() -- at leisure.

Sure, you could do things like that, but that sounds like a separate 
"debugging" version, not the main one.

> This fastpath ends up being identical in size and performance to the one you
> posted, although yours could be reduced by changing the test to a testb
> instruction -- at the almost certainly unacceptable expense of taking a
> partial-register stall on the CPUs that have those.

Well, you could just change the "testl $3,%eax" into an "andl $3,%eax", 
and it will be two bytes shorter with no partial register stall.

I forgot that "testl" doesn't have the byte immediate version.

		Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Writer-biased low-latency rwlock v8
  2008-08-21 21:15                             ` Linus Torvalds
@ 2008-08-21 22:22                               ` Linus Torvalds
  2008-08-23  5:09                               ` Mathieu Desnoyers
  1 sibling, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2008-08-21 22:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, H. Peter Anvin, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel,
	Steven Rostedt



On Thu, 21 Aug 2008, Linus Torvalds wrote:
> 
>  rdlock:
> 	movl $4,%eax
> 	lock ; xaddl %eax,(%rdi)
> 	testl $3,%eax
> 	jne __rdlock_slowpath
> 	ret


Ok, so change that "testl" to "andl" to shave two bytes, and then make the 
whole lock structure look something like this:

	struct rwlock {
		u32 fastpath;
		u32 pending_readers;
		u32 pending_writers;
	};

and then the slowpath would look something like this:

	void __rdlock_slowpath(struct rwlock *lock)
	{
		/* Move the read count to the pending path and undo our xadd */
		atomic_add(1, &lock->pending_readers);
		atomic_sub(4, &lock->fastpath);

		for (;;) {
			unsigned fastpath;
			while (lock->pending_writers)
				cpu_relax();
			while ((fastpath = lock->fastpath) & 1)
				cpu_relax();
			/* This will also undo the contention bit */
			if (atomic_cmpxchg(&lock->fastpath, fastpath, (fastpath & ~2)+4) == fastpath)
				break;
		}
		atomic_sub(1, &lock->pending_readers);
	}

and yes, there are more atomic accesses in there than would be necessary, 
in that if you can cram the whole thing into a single 64-bit word you can 
do both the initial pending fixup and the final cmpxchg/pending_readers 
thing as a single locked 64-bit op, but hey, the above is fairly simple.

You could also just use a spinlock to protect all the other data, of 
course.

There are fairness issues here too - do you really want to wait for 
pending writes every time, or just the first time through the loop? It 
depends on just how much you want to prefer writers.

The slow-paths for writers is similar, and has similar issues for the 
fairness issue. For example, instead of a simple "pending_writers" count, 
maybe you want to use a ticket lock there, to make writers be fair among 
themselves? Of course, the hope would be that there would never be that 
kind of contention by pure writers, so that sounds like overdesign, but 
the thing I'm trying to point out here is that this is all a separate 
decision from the actual fast-path, and having fair writers doesn't 
necessarily mean that the fastpath has to even know/care.

The placement of the counts is also something that can be optimized. For 
example, on teh assumption that readers are much more common than writers, 
I put the "pending_readers" next to the "fastpath" lock, exactly so that 
you -can- do the update of both with a single 64-bit locked access, even 
if the fastpath just used a 32-bit one. But then you should at least add a 
"__attribute__((aligned(8)))" to the structure definition.

And maybe you want to share the slowpath between all the irq-off etc 
versions, in which case you need to make the "irq enable while waiting" 
thing be conditional. 

And if the write path wants to wait for pending readers, and you want to 
make _that_ fair too, then you want to have that punch through if the 
writer is in an interrupt, to avoid deadlocks. And again, you might want 
to make the reader thing be some kind of ticket-based system, so that 
writers can wait for the readers that came in _before_ that writer, but 
not to later ones.

But again - none of that has anything to do with the fast-path. The 
fast-path remains exactly the same for all these issues.

And finally: NONE of the code is at all tested. It's all just written in 
my MUA. Caveat emptor.

			Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Writer-biased low-latency rwlock v8
  2008-08-21 21:15                             ` Linus Torvalds
  2008-08-21 22:22                               ` Linus Torvalds
@ 2008-08-23  5:09                               ` Mathieu Desnoyers
  2008-08-23 18:02                                 ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Mathieu Desnoyers @ 2008-08-23  5:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul E. McKenney, H. Peter Anvin, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel,
	Steven Rostedt

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Thu, 21 Aug 2008, Linus Torvalds wrote:
> > 
> > That leaves 30 bits for readers. If you still think you need to "limit the 
> > number of readers", then you aren't getting it.
> 
[...]
> and I'm pretty damn sure that that should be totally sufficient for a 
> spinning rwlock. The non-spinning one is more complex just because the 
> unlock paths need to guarantee that something gets woken up, that just 
> isn't an issue when you do spinlocks.
> 
> Now, in the slow-path:
>  - on the rwlock slowpath side, set bit#1 to make sure that readers get 
>    caught in the slowpath
>  - then do a *separate* count of how many pending readers and writers 
>    (ie the ones that got caught into the slowpath) you have (one word 
>    each is probably fine), and then the slowpaths can just do the right 
>    thing depending on whether there are pending readers/writers.
> 
> See? The main lock need not worry about the number of writers AT ALL, because 
> it's totally irrelevant. So don't worry about running out of bits. You 
> won't. Just put those counts somewhere else! The only thing that matters 
> for the main lock word is whether there are active readers (30 bits), and 
> whether there is an active writer (there can only ever be one: 1 bit), and 
> whether new readers should be trapped (1 bit).
> 
> If you worry about overflows, you're doing something wrong.
> 
> 			Linus

Hi Linus,

After having looked 15 more times at your rwlock code, and again at
your answer, in loops, I have come to the conclusion that I did a terrible
job of explaining the core idea behind my rwlock design. So let's try
to make it more obvious. I am going to keep this explanation short.

First, the goal : to kill latency induced by rwlocks.

Now, let me explain why I need at least not one, but _four different_
contention bits. Simply because there are four types of contention, one
for each execution context which may take the read lock. (irq, softirq,
non-preemptable, preemptable)

So let's suppose I share the rwlock between non-preemptable writers,
non-preemptable readers and interrupt context readers. The idea is this:
If I want to take the write lock, I'll have to protect against both
non-preemptable and interrupt readers (and also against other writers).

A way to do that without inducing high interrupt latency with current
rwlocks would be for the writer to take _two_ locks, not just one. The
first excludes preemptable readers and the second excludes irqs. So we
have :

For the writer :

preempt_disable();
write_lock(&preempt_rwlock);

local_irq_disable();
write_lock(&irq_rwlock);

/* Write access */

write_unlock(&irq_rwlock);
local_irq_enable();

write_unlock(&preempt_rwlock);
preempt_enable();

And the readers :

Interrupt :
read_lock(&irq_rwlock);
...
read_unlock(&irq_rwlock);

non-preemptable :
read_lock(&preempt_rwlock);
...
read_unlock(&preempt_rwlock);


If you still wonder why I need to take two locks, thus elevating the
context priority by excluding one context at a time, we can go to this
example showing why current rwlock use in the kernel produces such high
latencies. The current use is :

For the writer, we exclude all the readers in one go :

local_irq_disable();
write_lock(&big_lock_against_all_readers);

/* Write access */

write_unlock(&big_lock_against_all_readers);
local_irq_enable();

And for the readers :

Interrupt :
read_lock(&big_lock_against_all_readers);
...
read_unlock(&big_lock_against_all_readers);

non-preemptable :
read_lock(&big_lock_against_all_readers);
...
read_unlock(&big_lock_against_all_readers);


And where is the problem ? In the following execution sequence :
- non-preemptable reader takes the read lock
- writer disables irqs
- writer takes the write lock, contended by the reader
  - hrm, the reader can take a while to complete, may be iterating on
    the tasklist, interrupted by irqs and softirqs...

This contention is pretty bad because interrupts are disabled while we
busy-loop. And yes, the new implementation you proposed *does* suffer
from the same problem. Now if you go in the multiple locks scenario I
explained above, you'll notice that you don't have this problem. Then
comes the question : how can we efficiently take many locks, one per
reader execution context we want to exclude.

This is where we need to have at _least_ one contention bit _per
context_. Because at a given moment, non-preemptable readers can be
contended, but not irq readers. But we also have to know when a
particular reader execution context is locked out so that the writer
knows it has exclusive access and can therefore either disable
interrupts and exclude interrupt readers or finally access the critical
section. Therefore, we need to keep one reader count _per context_,
which makes that 4 reader counts (irq, softirq, non-preemptable,
preemptable). And that's where we start being short on bits.

But my current maximums for the number of readers are not a problem,
especially because I detect overflow beforehand using cmpxchg and
busy-loop in the rare event these counters would be full. And I am not
limited to NR_CPUS _at all_ anymore, as these constants show :

#define PTHREAD_RMAX    16384
#define NPTHREAD_RMAX   2048
#define SOFTIRQ_RMAX    512
#define HARDIRQ_RMAX    256

So, before discussing optimization or implementation details further,
I'd like to know if I made myself clear enough about the design goal. If
not, then again I must be doing a terrible job of explaining it, so
please let me know. I did a big patch cleanup after v8, so you might
want to wait for us to come to an understanding before going through a
next patch release.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Writer-biased low-latency rwlock v8
  2008-08-23  5:09                               ` Mathieu Desnoyers
@ 2008-08-23 18:02                                 ` Linus Torvalds
  2008-08-23 20:30                                   ` Mathieu Desnoyers
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2008-08-23 18:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, H. Peter Anvin, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel,
	Steven Rostedt



On Sat, 23 Aug 2008, Mathieu Desnoyers wrote:
> 
> Now, let me explain why I need at least not one, but _four different_
> contention bits. Simply because there are four types of contention, one
> for each execution context which may take the read lock. (irq, softirq,
> non-preemptable, preemptable)

No. You need _one_ contention bit in the fast-path.

Then, as you get into the slow-path, you can decide on four different 
behaviours.

Quite frankly, I don't think this discussion is going anywhere. I don't 
think I'd take anything from you, since you seem to have a really hard 
time separating out the issue of fast-path and slow-path. So I'm simply 
not going to bother, and I'm not going to expect to merge your work. 

Sorry, but it simply isn't worth my time or effort.

			Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Writer-biased low-latency rwlock v8
  2008-08-23 18:02                                 ` Linus Torvalds
@ 2008-08-23 20:30                                   ` Mathieu Desnoyers
  2008-08-23 21:40                                     ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Mathieu Desnoyers @ 2008-08-23 20:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul E. McKenney, H. Peter Anvin, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel,
	Steven Rostedt

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Sat, 23 Aug 2008, Mathieu Desnoyers wrote:
> > 
> > Now, let me explain why I need at least not one, but _four different_
> > contention bits. Simply because there are four types of contention, one
> > for each execution context which may take the read lock. (irq, softirq,
> > non-preemptable, preemptable)
> 
> No. You need _one_ contention bit in the fast-path.
> 
> Then, as you get into the slow-path, you can decide on four different 
> behaviours.
> 
> Quite frankly, I don't think this discussion is going anywhere. I don't 
> think I'd take anything from you, since you seem to have a really hard 
> time separating out the issue of fast-path and slow-path. So I'm simply 
> not going to bother, and I'm not going to expect to merge your work. 
> 
> Sorry, but it simply isn't worth my time or effort.
> 
> 			Linus

Hi,

OK, I now see how I can apply this contention bit idea to my algo.
Actually, the point I just fixed in my head is that this bit will be a
"MAY_CONTEND" bit, which could let higher priority readers access the
lock in the slow path. Only one fast path reader count will be required,
just as you said.

Sorry about taking that much time to get my head around this. Thanks for
your helpful explanations and your time.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Writer-biased low-latency rwlock v8
  2008-08-23 20:30                                   ` Mathieu Desnoyers
@ 2008-08-23 21:40                                     ` Linus Torvalds
  0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2008-08-23 21:40 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, H. Peter Anvin, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, linux-kernel,
	Steven Rostedt



On Sat, 23 Aug 2008, Mathieu Desnoyers wrote:
>
> Actually, the point I just fixed in my head is that this bit will be a
> "MAY_CONTEND" bit, which could let higher priority readers access the
> lock in the slow path.

EXACTLY.

It's not even necessarily a "contention" bit per se - it's literally a 
"readers have to take the slow-path" bit (writers will obviously _always_ 
take the slowpath if there is any non-zero value at all, so they don't 
need it).

Then, the slow-path might actually decide that "hey, there is no _actual_ 
writer there yet - just some _waiting_ writer, but since this read lock is 
in an interrupt context, we have to let it go through _despite_ the fact 
that the lock is contended in order to avoid deadlock".
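
For example, the reader slow-path sketched earlier in this thread could 
grow an escape hatch along these lines (untested, and the in_interrupt() 
test is purely illustrative):

	void __rdlock_slowpath(struct rwlock *lock)
	{
		if (in_interrupt()) {
			/*
			 * Don't queue behind a merely *waiting* writer here, or
			 * we could deadlock against the context we interrupted.
			 * Only wait for an *active* writer (bit 0) to go away;
			 * the fast-path xadd already counted us in.
			 */
			while (lock->fastpath & 1)
				cpu_relax();
			return;
		}
		/* Ordinary readers: undo the xadd and wait as a pending reader */
		atomic_add(1, &lock->pending_readers);
		atomic_sub(4, &lock->fastpath);
		for (;;) {
			unsigned fastpath;
			while (lock->pending_writers)
				cpu_relax();
			while ((fastpath = lock->fastpath) & 1)
				cpu_relax();
			/* This also clears the contention bit */
			if (atomic_cmpxchg(&lock->fastpath, fastpath,
					   (fastpath & ~2) + 4) == fastpath)
				break;
		}
		atomic_sub(1, &lock->pending_readers);
	}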

So it allows a fast-path for the trivial cases that is literally just a 
couple of instructions long, and that is nice not just because of 
performance issues, but because it then means that you can entirely ignore 
all those things in the slow path. It also means that everybody can look 
at the fast-path and decide that "ok, the fast-path really is optimal". 

That fast-path is what a lot of people care more about than just about 
anything else.

The slow-path, in comparison, can be in C, and can do all those checks 
like "are we in an (sw-)interrupt handler?" and basically prioritize 
certain classes of people.

			Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] Fair low-latency rwlock v5
  2008-08-17 19:10               ` [RFC PATCH] Fair low-latency rwlock v5 Mathieu Desnoyers
                                   ` (2 preceding siblings ...)
  2008-08-18 23:25                 ` Paul E. McKenney
@ 2008-08-25 19:20                 ` Peter Zijlstra
  3 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2008-08-25 19:20 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, H. Peter Anvin, Jeremy Fitzhardinge,
	Andrew Morton, Ingo Molnar, Joe Perches, Paul E. McKenney,
	linux-kernel

On Sun, 2008-08-17 at 15:10 -0400, Mathieu Desnoyers wrote:

> Fair low-latency rwlock v5
> 
> Fair low-latency rwlock provides fairness for writers against reader threads,
> but lets softirq and irq readers in even when there are subscribed writers
> waiting for the lock. Writers could starve reader threads.

New lock types should come with lockdep support.

Also, if you're serious about -rt, you'll need to consider full PI.


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2008-08-25 19:21 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-08-16  7:39 [PATCH] x86_64 : support atomic ops with 64 bits integer values Mathieu Desnoyers
2008-08-16 15:04 ` H. Peter Anvin
2008-08-16 15:43   ` Mathieu Desnoyers
2008-08-16 17:30     ` Linus Torvalds
2008-08-16 21:19       ` [RFC PATCH] Fair rwlock Mathieu Desnoyers
2008-08-16 21:33         ` Linus Torvalds
2008-08-17  7:53           ` [RFC PATCH] Fair low-latency rwlock v3 Mathieu Desnoyers
2008-08-17 16:17             ` Linus Torvalds
2008-08-17 19:10               ` [RFC PATCH] Fair low-latency rwlock v5 Mathieu Desnoyers
2008-08-17 21:30                 ` [RFC PATCH] Fair low-latency rwlock v5 (updated benchmarks) Mathieu Desnoyers
2008-08-18 18:59                 ` [RFC PATCH] Fair low-latency rwlock v5 Linus Torvalds
2008-08-18 23:25                 ` Paul E. McKenney
2008-08-19  6:04                   ` Mathieu Desnoyers
2008-08-19  7:33                     ` Mathieu Desnoyers
2008-08-19  9:06                       ` Mathieu Desnoyers
2008-08-19 16:48                       ` Linus Torvalds
2008-08-21 20:50                         ` [RFC PATCH] Writer-biased low-latency rwlock v8 Mathieu Desnoyers
2008-08-21 21:00                           ` Linus Torvalds
2008-08-21 21:15                             ` Linus Torvalds
2008-08-21 22:22                               ` Linus Torvalds
2008-08-23  5:09                               ` Mathieu Desnoyers
2008-08-23 18:02                                 ` Linus Torvalds
2008-08-23 20:30                                   ` Mathieu Desnoyers
2008-08-23 21:40                                     ` Linus Torvalds
2008-08-21 21:26                             ` H. Peter Anvin
2008-08-21 21:41                               ` Linus Torvalds
2008-08-25 19:20                 ` [RFC PATCH] Fair low-latency rwlock v5 Peter Zijlstra
