* [PATCH 00/12] rwsem fast-path write lock stealing
@ 2013-03-06 23:21 Michel Lespinasse
  2013-03-06 23:21 ` [PATCH 01/12] rwsem: make the waiter type an enumeration rather than a bitmask Michel Lespinasse
                   ` (11 more replies)
  0 siblings, 12 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-06 23:21 UTC (permalink / raw)
  To: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel
  Cc: Andrew Morton, linux-kernel

These patches extend Alex Shi's work (which added write lock stealing
on the rwsem slow path) in order to provide rwsem write lock stealing
on the fast path (that is, without taking the rwsem's wait_lock).

I initially sent a shorter series shortly before v3.9, however some
patches were doing too much at once which made them confusing to
review. I have now split the series at a smaller granularity;
hope this will help :)

Patches 1-2 are for cleanups:

- Patch 1 replaces the waiter type bitmask with an enumeration
  (as we don't have any other planned uses for the bitmask)

- Patch 2 shortens critical sections in rwsem_down_failed_common() so
  they don't cover more than what is absolutely necessary

Patches 3-5 split rwsem_down_failed_common() into separate functions
for the read and write sides:

- Patch 3 simply puts two identical copies of rwsem_down_failed_common()
  into rwsem_down_{read,write}_failed (no code changes, in order to
  make the review easier)

- Patch 4 does easy simplifications in rwsem_down_read_failed():
  - We don't need to wake readers queued before us;
  - We don't need to try to steal the lock, and thus we don't need to
    acquire the wait_lock after sleeping.

- Patch 5 does easy simplifications in rwsem_down_write_failed():
  - We don't need to check for !waiter.task since __rwsem_do_wake()
    doesn't remove writers from the wait_list;
  - Since the only way to exit the wait loop is by stealing the write lock,
    the corresponding exit code can be moved after the loop;
  - There is no point releasing the wait_lock before entering the
    wait loop, as we will need to reacquire it immediately;
  - We don't need to get a reference on the task structure, since the task
    is responsible for removing itself from the wait_list;

Patches 6-9 apply additional optimizations to rwsem_down_write_failed():

- Patch 6 tries write lock stealing more aggressively in order to avoid
  extra checks;

- Patch 7 uses cmpxchg to implement the write lock stealing, instead of
  doing an additive adjustment that might need to be backed out;

- Patch 8 avoids taking the wait_lock if there are already active locks
  when we wake up;

- Patch 9 avoids the initial trylock if there were already active locks
  when we entered rwsem_down_write_failed()

Patches 10-11 wake all readers whenever the first waiter is a reader:

- Patch 10 does this in rwsem-spinlock. This is both for symmetry with
  the other rwsem implementation, and because this should result in
  increased parallelism for workloads that mix readers and writers.

- Patch 11 does this in rwsem. This is partly for increased parallelism,
  but the main reason is that it gets rid of a case where __rwsem_do_wake
  assumed that the rwsem lock can't be stolen when it holds the wait_lock
  and the rwsem count indicates there are queued waiters. This assumption
  won't be true after patch 12, so we need to fix __rwsem_do_wake and it
  turns out the easiest fix involves waking all readers.

Patch 12 finally implements rwsem fast path lock stealing for x86 arch.
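
To illustrate the overall idea (a sketch with a made-up function name,
not code from this series): the down_write() fast path is a single
atomic update, and we only enter rwsem_down_write_failed() on
contention.  Patch 12 relaxes the x86 fast-path test from "the old
count was zero" to "the active part of the old count was zero", which
is what allows stealing the lock ahead of queued-but-sleeping waiters.

static inline void down_write_sketch(struct rw_semaphore *sem)
{
        long count;

        /* fast path: a single atomic add of the write bias */
        count = rwsem_atomic_update(RWSEM_ACTIVE_WRITE_BIAS, sem);

        /* Before this series: take the slow path whenever the old count
         * was non-zero.  After patch 12: only when the old count had
         * active lockers, so queued sleepers alone no longer force the
         * slow path. */
        if (count != RWSEM_ACTIVE_WRITE_BIAS)
                rwsem_down_write_failed(sem);
}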

Michel Lespinasse (12):
  rwsem: make the waiter type an enumeration rather than a bitmask
  rwsem: shorter spinlocked section in rwsem_down_failed_common()
  rwsem: move rwsem_down_failed_common code into rwsem_down_{read,write}_failed
  rwsem: simplify rwsem_down_read_failed
  rwsem: simplify rwsem_down_write_failed
  rwsem: more agressive lock stealing in rwsem_down_write_failed
  rwsem: use cmpxchg for trying to steal write lock
  rwsem: avoid taking wait_lock in rwsem_down_write_failed
  rwsem: skip initial trylock in rwsem_down_write_failed
  rwsem-spinlock: wake all readers when first waiter is a reader
  rwsem: wake all readers when first waiter is a reader
  x86 rwsem: avoid taking slow path when stealing write lock

 arch/x86/include/asm/rwsem.h |  28 +++--
 include/linux/rwsem.h        |   2 +
 lib/rwsem-spinlock.c         |  54 ++++-----
 lib/rwsem.c                  | 272 +++++++++++++++++++------------------------
 4 files changed, 166 insertions(+), 190 deletions(-)

-- 
1.8.1.3


* [PATCH 01/12] rwsem: make the waiter type an enumeration rather than a bitmask
  2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
@ 2013-03-06 23:21 ` Michel Lespinasse
  2013-03-13 21:33   ` Rik van Riel
  2013-03-06 23:21 ` [PATCH 02/12] rwsem: shorter spinlocked section in rwsem_down_failed_common() Michel Lespinasse
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-06 23:21 UTC (permalink / raw)
  To: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel
  Cc: Andrew Morton, linux-kernel

We are not planning to add any new waiter flags, so we can convert the
waiter type into an enumeration.

Background: David Howells suggested I do this back when I tried adding
a new waiter type for unfair readers. However, I believe the cleanup
applies regardless of that use case.

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 lib/rwsem-spinlock.c | 19 +++++++++++--------
 lib/rwsem.c          | 23 +++++++++++++----------
 2 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
index 7542afbb22b3..5f117f37ac0a 100644
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -9,12 +9,15 @@
 #include <linux/sched.h>
 #include <linux/export.h>
 
+enum rwsem_waiter_type {
+	RWSEM_WAITING_FOR_WRITE,
+	RWSEM_WAITING_FOR_READ
+};
+
 struct rwsem_waiter {
 	struct list_head list;
 	struct task_struct *task;
-	unsigned int flags;
-#define RWSEM_WAITING_FOR_READ	0x00000001
-#define RWSEM_WAITING_FOR_WRITE	0x00000002
+	enum rwsem_waiter_type type;
 };
 
 int rwsem_is_locked(struct rw_semaphore *sem)
@@ -68,7 +71,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
 
 	if (!wakewrite) {
-		if (waiter->flags & RWSEM_WAITING_FOR_WRITE)
+		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
 			goto out;
 		goto dont_wake_writers;
 	}
@@ -78,7 +81,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
 	 * to -1 here to indicate we get the lock. Instead, we wake it up
 	 * to let it go get it again.
 	 */
-	if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
+	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
 		wake_up_process(waiter->task);
 		goto out;
 	}
@@ -86,7 +89,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
 	/* grant an infinite number of read locks to the front of the queue */
  dont_wake_writers:
 	woken = 0;
-	while (waiter->flags & RWSEM_WAITING_FOR_READ) {
+	while (waiter->type == RWSEM_WAITING_FOR_READ) {
 		struct list_head *next = waiter->list.next;
 
 		list_del(&waiter->list);
@@ -144,7 +147,7 @@ void __sched __down_read(struct rw_semaphore *sem)
 
 	/* set up my own style of waitqueue */
 	waiter.task = tsk;
-	waiter.flags = RWSEM_WAITING_FOR_READ;
+	waiter.type = RWSEM_WAITING_FOR_READ;
 	get_task_struct(tsk);
 
 	list_add_tail(&waiter.list, &sem->wait_list);
@@ -201,7 +204,7 @@ void __sched __down_write_nested(struct rw_semaphore *sem, int subclass)
 	/* set up my own style of waitqueue */
 	tsk = current;
 	waiter.task = tsk;
-	waiter.flags = RWSEM_WAITING_FOR_WRITE;
+	waiter.type = RWSEM_WAITING_FOR_WRITE;
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* wait for someone to release the lock */
diff --git a/lib/rwsem.c b/lib/rwsem.c
index ad5e0df16ab4..672eb33218ac 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -30,12 +30,15 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 
 EXPORT_SYMBOL(__init_rwsem);
 
+enum rwsem_waiter_type {
+	RWSEM_WAITING_FOR_WRITE,
+	RWSEM_WAITING_FOR_READ
+};
+
 struct rwsem_waiter {
 	struct list_head list;
 	struct task_struct *task;
-	unsigned int flags;
-#define RWSEM_WAITING_FOR_READ	0x00000001
-#define RWSEM_WAITING_FOR_WRITE	0x00000002
+	enum rwsem_waiter_type type;
 };
 
 /* Wake types for __rwsem_do_wake().  Note that RWSEM_WAKE_NO_ACTIVE and
@@ -65,7 +68,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	signed long woken, loop, adjustment;
 
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
-	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
+	if (waiter->type != RWSEM_WAITING_FOR_WRITE)
 		goto readers_only;
 
 	if (wake_type == RWSEM_WAKE_READ_OWNED)
@@ -112,10 +115,10 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 		waiter = list_entry(waiter->list.next,
 					struct rwsem_waiter, list);
 
-	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
+	} while (waiter->type != RWSEM_WAITING_FOR_WRITE);
 
 	adjustment = woken * RWSEM_ACTIVE_READ_BIAS;
-	if (waiter->flags & RWSEM_WAITING_FOR_READ)
+	if (waiter->type != RWSEM_WAITING_FOR_WRITE)
 		/* hit end of list above */
 		adjustment -= RWSEM_WAITING_BIAS;
 
@@ -148,7 +151,7 @@ static int try_get_writer_sem(struct rw_semaphore *sem,
 
 	/* only steal when first waiter is writing */
 	fwaiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
-	if (!(fwaiter->flags & RWSEM_WAITING_FOR_WRITE))
+	if (fwaiter->type != RWSEM_WAITING_FOR_WRITE)
 		return 0;
 
 	adjustment = RWSEM_ACTIVE_WRITE_BIAS;
@@ -179,7 +182,7 @@ try_again_write:
  */
 static struct rw_semaphore __sched *
 rwsem_down_failed_common(struct rw_semaphore *sem,
-			 unsigned int flags, signed long adjustment)
+			 enum rwsem_waiter_type type, signed long adjustment)
 {
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
@@ -190,7 +193,7 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 	/* set up my own style of waitqueue */
 	raw_spin_lock_irq(&sem->wait_lock);
 	waiter.task = tsk;
-	waiter.flags = flags;
+	waiter.type = type;
 	get_task_struct(tsk);
 
 	if (list_empty(&sem->wait_list))
@@ -221,7 +224,7 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 		raw_spin_lock_irq(&sem->wait_lock);
 		/* Try to get the writer sem, may steal from the head writer: */
-		if (flags == RWSEM_WAITING_FOR_WRITE)
+		if (type == RWSEM_WAITING_FOR_WRITE)
 			if (try_get_writer_sem(sem, &waiter)) {
 				raw_spin_unlock_irq(&sem->wait_lock);
 				return sem;
-- 
1.8.1.3


* [PATCH 02/12] rwsem: shorter spinlocked section in rwsem_down_failed_common()
  2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
  2013-03-06 23:21 ` [PATCH 01/12] rwsem: make the waiter type an enumeration rather than a bitmask Michel Lespinasse
@ 2013-03-06 23:21 ` Michel Lespinasse
  2013-03-06 23:21 ` [PATCH 03/12] rwsem: move rwsem_down_failed_common code into rwsem_down_{read,write}_failed Michel Lespinasse
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-06 23:21 UTC (permalink / raw)
  To: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel
  Cc: Andrew Morton, linux-kernel

This change reduces the size of the spinlocked and TASK_UNINTERRUPTIBLE
sections in rwsem_down_failed_common():

- We only need the sem->wait_lock to insert ourselves on the wait_list;
  the waiter node can be prepared outside of the wait_lock.

- The task state only needs to be set to TASK_UNINTERRUPTIBLE immediately
  before checking if we actually need to sleep; it doesn't need to protect
  the entire function.

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 lib/rwsem.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 672eb33218ac..40636454cf3c 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -188,14 +188,12 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 	struct task_struct *tsk = current;
 	signed long count;
 
-	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-
 	/* set up my own style of waitqueue */
-	raw_spin_lock_irq(&sem->wait_lock);
 	waiter.task = tsk;
 	waiter.type = type;
 	get_task_struct(tsk);
 
+	raw_spin_lock_irq(&sem->wait_lock);
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
 	list_add_tail(&waiter.list, &sem->wait_list);
@@ -218,7 +216,8 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 	raw_spin_unlock_irq(&sem->wait_lock);
 
 	/* wait to be given the lock */
-	for (;;) {
+	while (true) {
+		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 		if (!waiter.task)
 			break;
 
@@ -231,7 +230,6 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 			}
 		raw_spin_unlock_irq(&sem->wait_lock);
 		schedule();
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
 
 	tsk->state = TASK_RUNNING;
-- 
1.8.1.3


* [PATCH 03/12] rwsem: move rwsem_down_failed_common code into rwsem_down_{read,write}_failed
  2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
  2013-03-06 23:21 ` [PATCH 01/12] rwsem: make the waiter type an enumeration rather than a bitmask Michel Lespinasse
  2013-03-06 23:21 ` [PATCH 02/12] rwsem: shorter spinlocked section in rwsem_down_failed_common() Michel Lespinasse
@ 2013-03-06 23:21 ` Michel Lespinasse
  2013-03-06 23:21 ` [PATCH 04/12] rwsem: simplify rwsem_down_read_failed Michel Lespinasse
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-06 23:21 UTC (permalink / raw)
  To: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel
  Cc: Andrew Morton, linux-kernel

Remove the rwsem_down_failed_common function and replace it with two
identical copies of its code in rwsem_down_{read,write}_failed.

This is because we want to make different optimizations in
rwsem_down_{read,write}_failed; we are adding this pure-duplication
step as a separate commit in order to make it easier to check the
following steps.

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 lib/rwsem.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 57 insertions(+), 15 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 40636454cf3c..fb658af1c12c 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -178,12 +178,12 @@ try_again_write:
 }
 
 /*
- * wait for a lock to be granted
+ * wait for the read lock to be granted
  */
-static struct rw_semaphore __sched *
-rwsem_down_failed_common(struct rw_semaphore *sem,
-			 enum rwsem_waiter_type type, signed long adjustment)
+struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 {
+	enum rwsem_waiter_type type = RWSEM_WAITING_FOR_READ;
+	signed long adjustment = -RWSEM_ACTIVE_READ_BIAS;
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
 	signed long count;
@@ -238,21 +238,63 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 }
 
 /*
- * wait for the read lock to be granted
- */
-struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
-{
-	return rwsem_down_failed_common(sem, RWSEM_WAITING_FOR_READ,
-					-RWSEM_ACTIVE_READ_BIAS);
-}
-
-/*
  * wait for the write lock to be granted
  */
 struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 {
-	return rwsem_down_failed_common(sem, RWSEM_WAITING_FOR_WRITE,
-					-RWSEM_ACTIVE_WRITE_BIAS);
+	enum rwsem_waiter_type type = RWSEM_WAITING_FOR_WRITE;
+	signed long adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
+	struct rwsem_waiter waiter;
+	struct task_struct *tsk = current;
+	signed long count;
+
+	/* set up my own style of waitqueue */
+	waiter.task = tsk;
+	waiter.type = type;
+	get_task_struct(tsk);
+
+	raw_spin_lock_irq(&sem->wait_lock);
+	if (list_empty(&sem->wait_list))
+		adjustment += RWSEM_WAITING_BIAS;
+	list_add_tail(&waiter.list, &sem->wait_list);
+
+	/* we're now waiting on the lock, but no longer actively locking */
+	count = rwsem_atomic_update(adjustment, sem);
+
+	/* If there are no active locks, wake the front queued process(es) up.
+	 *
+	 * Alternatively, if we're called from a failed down_write(), there
+	 * were already threads queued before us and there are no active
+	 * writers, the lock must be read owned; so we try to wake any read
+	 * locks that were queued ahead of us. */
+	if (count == RWSEM_WAITING_BIAS)
+		sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);
+	else if (count > RWSEM_WAITING_BIAS &&
+		 adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
+		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
+
+	raw_spin_unlock_irq(&sem->wait_lock);
+
+	/* wait to be given the lock */
+	while (true) {
+		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		if (!waiter.task)
+			break;
+
+		raw_spin_lock_irq(&sem->wait_lock);
+		/* Try to get the writer sem, may steal from the head writer: */
+		if (type == RWSEM_WAITING_FOR_WRITE)
+			if (try_get_writer_sem(sem, &waiter)) {
+				raw_spin_unlock_irq(&sem->wait_lock);
+				return sem;
+			}
+		raw_spin_unlock_irq(&sem->wait_lock);
+		schedule();
+	}
+
+	tsk->state = TASK_RUNNING;
+
+	return sem;
 }
 
 /*
-- 
1.8.1.3


* [PATCH 04/12] rwsem: simplify rwsem_down_read_failed
  2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
                   ` (2 preceding siblings ...)
  2013-03-06 23:21 ` [PATCH 03/12] rwsem: move rwsem_down_failed_common code into rwsem_down_{read,write}_failed Michel Lespinasse
@ 2013-03-06 23:21 ` Michel Lespinasse
  2013-03-06 23:21 ` [PATCH 05/12] rwsem: simplify rwsem_down_write_failed Michel Lespinasse
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-06 23:21 UTC (permalink / raw)
  To: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel
  Cc: Andrew Morton, linux-kernel

When trying to acquire a read lock, the RWSEM_ACTIVE_READ_BIAS adjustment
doesn't cause other readers to block, so we never have to worry about waking
them back after canceling this adjustment in rwsem_down_read_failed().

We also never want to steal the lock in rwsem_down_read_failed(), so we
don't have to grab the wait_lock either.
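
For reference, this is roughly what rwsem_down_read_failed() looks like
with both simplifications applied (condensed from the diff below, so
treat it as a sketch rather than the literal end result):

struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
{
        signed long adjustment = -RWSEM_ACTIVE_READ_BIAS;
        struct rwsem_waiter waiter;
        struct task_struct *tsk = current;
        signed long count;

        /* set up my own style of waitqueue */
        waiter.task = tsk;
        waiter.type = RWSEM_WAITING_FOR_READ;
        get_task_struct(tsk);

        raw_spin_lock_irq(&sem->wait_lock);
        if (list_empty(&sem->wait_list))
                adjustment += RWSEM_WAITING_BIAS;
        list_add_tail(&waiter.list, &sem->wait_list);

        /* we're now waiting on the lock, but no longer actively locking */
        count = rwsem_atomic_update(adjustment, sem);

        /* If there are no active locks, wake the front queued process(es). */
        if (count == RWSEM_WAITING_BIAS)
                sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);

        raw_spin_unlock_irq(&sem->wait_lock);

        /* wait to be given the lock */
        while (true) {
                set_task_state(tsk, TASK_UNINTERRUPTIBLE);
                if (!waiter.task)
                        break;
                schedule();
        }

        tsk->state = TASK_RUNNING;
        return sem;
}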

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 lib/rwsem.c | 22 ++--------------------
 1 file changed, 2 insertions(+), 20 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index fb658af1c12c..66f307e90761 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -182,7 +182,6 @@ try_again_write:
  */
 struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 {
-	enum rwsem_waiter_type type = RWSEM_WAITING_FOR_READ;
 	signed long adjustment = -RWSEM_ACTIVE_READ_BIAS;
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
@@ -190,7 +189,7 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 
 	/* set up my own style of waitqueue */
 	waiter.task = tsk;
-	waiter.type = type;
+	waiter.type = RWSEM_WAITING_FOR_READ;
 	get_task_struct(tsk);
 
 	raw_spin_lock_irq(&sem->wait_lock);
@@ -201,17 +200,9 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
 
-	/* If there are no active locks, wake the front queued process(es) up.
-	 *
-	 * Alternatively, if we're called from a failed down_write(), there
-	 * were already threads queued before us and there are no active
-	 * writers, the lock must be read owned; so we try to wake any read
-	 * locks that were queued ahead of us. */
+	/* If there are no active locks, wake the front queued process(es). */
 	if (count == RWSEM_WAITING_BIAS)
 		sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);
-	else if (count > RWSEM_WAITING_BIAS &&
-		 adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
-		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
 
@@ -220,15 +211,6 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 		if (!waiter.task)
 			break;
-
-		raw_spin_lock_irq(&sem->wait_lock);
-		/* Try to get the writer sem, may steal from the head writer: */
-		if (type == RWSEM_WAITING_FOR_WRITE)
-			if (try_get_writer_sem(sem, &waiter)) {
-				raw_spin_unlock_irq(&sem->wait_lock);
-				return sem;
-			}
-		raw_spin_unlock_irq(&sem->wait_lock);
 		schedule();
 	}
 
-- 
1.8.1.3


* [PATCH 05/12] rwsem: simplify rwsem_down_write_failed
  2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
                   ` (3 preceding siblings ...)
  2013-03-06 23:21 ` [PATCH 04/12] rwsem: simplify rwsem_down_read_failed Michel Lespinasse
@ 2013-03-06 23:21 ` Michel Lespinasse
  2013-03-06 23:21 ` [PATCH 06/12] rwsem: more agressive lock stealing in rwsem_down_write_failed Michel Lespinasse
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-06 23:21 UTC (permalink / raw)
  To: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel
  Cc: Andrew Morton, linux-kernel

When waking writers, we never grant them the lock - instead, they have
to acquire it themselves when they run, and remove themselves from the
wait_list when they succeed.

As a result, we can do a few simplifications in rwsem_down_write_failed():

- We don't need to check for !waiter.task since __rwsem_do_wake() doesn't
  remove writers from the wait_list

- There is no point releasing the wait_lock before entering the wait loop,
  as we will need to reacquire it immediately. We can change the loop so
  that the lock is always held at the start of each loop iteration.

- We don't need to get a reference on the task structure, since the task
  is responsible for removing itself from the wait_list. There is no risk,
  like in the rwsem_down_read_failed() case, that a task would wake up and
  exit (thus destroying its task structure) while __rwsem_do_wake() is
  still running - wait_lock protects against that.
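
Putting the pieces together, rwsem_down_write_failed() ends up roughly
as follows after this patch (condensed from the diff below; note that
the wait_lock is now held at the top of every loop iteration):

struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
{
        signed long adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
        struct rwsem_waiter waiter;
        struct task_struct *tsk = current;
        signed long count;

        /* set up my own style of waitqueue */
        waiter.task = tsk;
        waiter.type = RWSEM_WAITING_FOR_WRITE;

        raw_spin_lock_irq(&sem->wait_lock);
        if (list_empty(&sem->wait_list))
                adjustment += RWSEM_WAITING_BIAS;
        list_add_tail(&waiter.list, &sem->wait_list);

        /* we're now waiting on the lock, but no longer actively locking */
        count = rwsem_atomic_update(adjustment, sem);

        if (count == RWSEM_WAITING_BIAS)
                sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);
        else if (count > RWSEM_WAITING_BIAS &&
                 adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
                sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);

        /* wait until we successfully acquire the lock */
        while (true) {
                set_task_state(tsk, TASK_UNINTERRUPTIBLE);

                if (try_get_writer_sem(sem, &waiter))
                        break;

                raw_spin_unlock_irq(&sem->wait_lock);
                schedule();
                raw_spin_lock_irq(&sem->wait_lock);
        }

        list_del(&waiter.list);
        raw_spin_unlock_irq(&sem->wait_lock);
        tsk->state = TASK_RUNNING;

        return sem;
}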

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 lib/rwsem.c | 33 +++++++++------------------------
 1 file changed, 9 insertions(+), 24 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 66f307e90761..c73bd96dc30c 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -161,16 +161,8 @@ static int try_get_writer_sem(struct rw_semaphore *sem,
 
 try_again_write:
 	oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
-	if (!(oldcount & RWSEM_ACTIVE_MASK)) {
-		/* No active lock: */
-		struct task_struct *tsk = waiter->task;
-
-		list_del(&waiter->list);
-		smp_mb();
-		put_task_struct(tsk);
-		tsk->state = TASK_RUNNING;
+	if (!(oldcount & RWSEM_ACTIVE_MASK))
 		return 1;
-	}
 	/* some one grabbed the sem already */
 	if (rwsem_atomic_update(-adjustment, sem) & RWSEM_ACTIVE_MASK)
 		return 0;
@@ -220,11 +212,10 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 }
 
 /*
- * wait for the write lock to be granted
+ * wait until we successfully acquire the write lock
  */
 struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 {
-	enum rwsem_waiter_type type = RWSEM_WAITING_FOR_WRITE;
 	signed long adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
@@ -232,8 +223,7 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 
 	/* set up my own style of waitqueue */
 	waiter.task = tsk;
-	waiter.type = type;
-	get_task_struct(tsk);
+	waiter.type = RWSEM_WAITING_FOR_WRITE;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 	if (list_empty(&sem->wait_list))
@@ -255,25 +245,20 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 		 adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
 		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
 
-	raw_spin_unlock_irq(&sem->wait_lock);
-
-	/* wait to be given the lock */
+	/* wait until we successfully acquire the lock */
 	while (true) {
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-		if (!waiter.task)
+
+		if (try_get_writer_sem(sem, &waiter))
 			break;
 
-		raw_spin_lock_irq(&sem->wait_lock);
-		/* Try to get the writer sem, may steal from the head writer: */
-		if (type == RWSEM_WAITING_FOR_WRITE)
-			if (try_get_writer_sem(sem, &waiter)) {
-				raw_spin_unlock_irq(&sem->wait_lock);
-				return sem;
-			}
 		raw_spin_unlock_irq(&sem->wait_lock);
 		schedule();
+		raw_spin_lock_irq(&sem->wait_lock);
 	}
 
+	list_del(&waiter.list);
+	raw_spin_unlock_irq(&sem->wait_lock);
 	tsk->state = TASK_RUNNING;
 
 	return sem;
-- 
1.8.1.3


* [PATCH 06/12] rwsem: more agressive lock stealing in rwsem_down_write_failed
  2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
                   ` (4 preceding siblings ...)
  2013-03-06 23:21 ` [PATCH 05/12] rwsem: simplify rwsem_down_write_failed Michel Lespinasse
@ 2013-03-06 23:21 ` Michel Lespinasse
  2013-03-06 23:21 ` [PATCH 07/12] rwsem: use cmpxchg for trying to steal write lock Michel Lespinasse
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-06 23:21 UTC (permalink / raw)
  To: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel
  Cc: Andrew Morton, linux-kernel

Some small code simplifications can be achieved by doing more aggressive
lock stealing:

- When rwsem_down_write_failed() notices that there are no active locks
  (and thus no thread to wake us if we decided to sleep), it used to wake
  the first queued process. However, stealing the lock is also sufficient
  to deal with this case, so we don't need this check anymore.

- In try_get_writer_sem(), we can steal the lock even when the first waiter
  is a reader. This is correct because the code path that wakes readers is
  protected by the wait_lock. As to the performance effects of this change,
  they are expected to be minimal: readers are still granted the lock
  (rather than having to acquire it themselves) when they reach the front
  of the wait queue, so we have essentially the same behavior as in
  rwsem-spinlock.

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 lib/rwsem.c | 29 ++++++++---------------------
 1 file changed, 8 insertions(+), 21 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index c73bd96dc30c..2360bf204098 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -143,20 +143,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 }
 
 /* Try to get write sem, caller holds sem->wait_lock: */
-static int try_get_writer_sem(struct rw_semaphore *sem,
-					struct rwsem_waiter *waiter)
+static int try_get_writer_sem(struct rw_semaphore *sem)
 {
-	struct rwsem_waiter *fwaiter;
 	long oldcount, adjustment;
 
-	/* only steal when first waiter is writing */
-	fwaiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
-	if (fwaiter->type != RWSEM_WAITING_FOR_WRITE)
-		return 0;
-
 	adjustment = RWSEM_ACTIVE_WRITE_BIAS;
-	/* Only one waiter in the queue: */
-	if (fwaiter == waiter && waiter->list.next == &sem->wait_list)
+	if (list_is_singular(&sem->wait_list))
 		adjustment -= RWSEM_WAITING_BIAS;
 
 try_again_write:
@@ -233,23 +225,18 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
 
-	/* If there are no active locks, wake the front queued process(es) up.
-	 *
-	 * Alternatively, if we're called from a failed down_write(), there
-	 * were already threads queued before us and there are no active
-	 * writers, the lock must be read owned; so we try to wake any read
-	 * locks that were queued ahead of us. */
-	if (count == RWSEM_WAITING_BIAS)
-		sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);
-	else if (count > RWSEM_WAITING_BIAS &&
-		 adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
+	/* If there were already threads queued before us and there are no
+	 * active writers, the lock must be read owned; so we try to wake
+	 * any read locks that were queued ahead of us. */
+	if (count > RWSEM_WAITING_BIAS &&
+	    adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
 		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
 
 	/* wait until we successfully acquire the lock */
 	while (true) {
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 
-		if (try_get_writer_sem(sem, &waiter))
+		if (try_get_writer_sem(sem))
 			break;
 
 		raw_spin_unlock_irq(&sem->wait_lock);
-- 
1.8.1.3


* [PATCH 07/12] rwsem: use cmpxchg for trying to steal write lock
  2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
                   ` (5 preceding siblings ...)
  2013-03-06 23:21 ` [PATCH 06/12] rwsem: more agressive lock stealing in rwsem_down_write_failed Michel Lespinasse
@ 2013-03-06 23:21 ` Michel Lespinasse
  2013-03-06 23:21 ` [PATCH 08/12] rwsem: avoid taking wait_lock in rwsem_down_write_failed Michel Lespinasse
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-06 23:21 UTC (permalink / raw)
  To: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel
  Cc: Andrew Morton, linux-kernel

Using rwsem_atomic_update to try stealing the write lock forced us to
undo the adjustment in the failure path. We can have simpler and faster
code by using cmpxchg instead.
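
The effect can be summarized by this hypothetical helper (the patch
actually open-codes it inside the wait loop rather than adding a
function): since the caller is queued, the count must carry at least
RWSEM_WAITING_BIAS, so a single cmpxchg against that value both checks
for active lockers and claims the write lock, with nothing to undo when
it fails.

/* caller holds sem->wait_lock */
static inline int try_steal_write_lock(struct rw_semaphore *sem)
{
        long claim = RWSEM_ACTIVE_WRITE_BIAS;

        /* keep the waiting bias if other waiters remain queued behind us */
        if (!list_is_singular(&sem->wait_list))
                claim += RWSEM_WAITING_BIAS;

        /* succeeds only when count == RWSEM_WAITING_BIAS (no active lockers) */
        return cmpxchg(&sem->count, RWSEM_WAITING_BIAS, claim) ==
               RWSEM_WAITING_BIAS;
}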

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 lib/rwsem.c | 26 ++++++--------------------
 1 file changed, 6 insertions(+), 20 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 2360bf204098..64c2dc007be2 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -142,25 +142,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	return sem;
 }
 
-/* Try to get write sem, caller holds sem->wait_lock: */
-static int try_get_writer_sem(struct rw_semaphore *sem)
-{
-	long oldcount, adjustment;
-
-	adjustment = RWSEM_ACTIVE_WRITE_BIAS;
-	if (list_is_singular(&sem->wait_list))
-		adjustment -= RWSEM_WAITING_BIAS;
-
-try_again_write:
-	oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
-	if (!(oldcount & RWSEM_ACTIVE_MASK))
-		return 1;
-	/* some one grabbed the sem already */
-	if (rwsem_atomic_update(-adjustment, sem) & RWSEM_ACTIVE_MASK)
-		return 0;
-	goto try_again_write;
-}
-
 /*
  * wait for the read lock to be granted
  */
@@ -236,7 +217,12 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 	while (true) {
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 
-		if (try_get_writer_sem(sem))
+		/* Try acquiring the write lock. */
+		count = RWSEM_ACTIVE_WRITE_BIAS;
+		if (!list_is_singular(&sem->wait_list))
+			count += RWSEM_WAITING_BIAS;
+		if (cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
+							RWSEM_WAITING_BIAS)
 			break;
 
 		raw_spin_unlock_irq(&sem->wait_lock);
-- 
1.8.1.3


* [PATCH 08/12] rwsem: avoid taking wait_lock in rwsem_down_write_failed
  2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
                   ` (6 preceding siblings ...)
  2013-03-06 23:21 ` [PATCH 07/12] rwsem: use cmpxchg for trying to steal write lock Michel Lespinasse
@ 2013-03-06 23:21 ` Michel Lespinasse
  2013-03-06 23:21 ` [PATCH 09/12] rwsem: skip initial trylock " Michel Lespinasse
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-06 23:21 UTC (permalink / raw)
  To: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel
  Cc: Andrew Morton, linux-kernel

In rwsem_down_write_failed(), if there are active locks after we wake up
(i.e. the lock got stolen from us), skip taking the wait_lock and go back
to sleep immediately.

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 lib/rwsem.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 64c2dc007be2..edf3d9ca670e 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -214,8 +214,8 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
 
 	/* wait until we successfully acquire the lock */
+	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	while (true) {
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 
 		/* Try acquiring the write lock. */
 		count = RWSEM_ACTIVE_WRITE_BIAS;
@@ -226,7 +226,13 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 			break;
 
 		raw_spin_unlock_irq(&sem->wait_lock);
-		schedule();
+
+		/* Block until there are no active lockers. */
+		do {
+			schedule();
+			set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		} while (sem->count & RWSEM_ACTIVE_MASK);
+
 		raw_spin_lock_irq(&sem->wait_lock);
 	}
 
-- 
1.8.1.3


* [PATCH 09/12] rwsem: skip initial trylock in rwsem_down_write_failed
  2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
                   ` (7 preceding siblings ...)
  2013-03-06 23:21 ` [PATCH 08/12] rwsem: avoid taking wait_lock in rwsem_down_write_failed Michel Lespinasse
@ 2013-03-06 23:21 ` Michel Lespinasse
  2013-03-06 23:21 ` [PATCH 10/12] rwsem-spinlock: wake all readers when first waiter is a reader Michel Lespinasse
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-06 23:21 UTC (permalink / raw)
  To: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel
  Cc: Andrew Morton, linux-kernel

We can skip the initial trylock in rwsem_down_write_failed() if there are
known active lockers already, thus saving one likely-to-fail cmpxchg.

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 lib/rwsem.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index edf3d9ca670e..0d50e46d5b0c 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -216,14 +216,15 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 	/* wait until we successfully acquire the lock */
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	while (true) {
-
-		/* Try acquiring the write lock. */
-		count = RWSEM_ACTIVE_WRITE_BIAS;
-		if (!list_is_singular(&sem->wait_list))
-			count += RWSEM_WAITING_BIAS;
-		if (cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
+		if (!(count & RWSEM_ACTIVE_MASK)) {
+			/* Try acquiring the write lock. */
+			count = RWSEM_ACTIVE_WRITE_BIAS;
+			if (!list_is_singular(&sem->wait_list))
+				count += RWSEM_WAITING_BIAS;
+			if (cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
 							RWSEM_WAITING_BIAS)
-			break;
+				break;
+		}
 
 		raw_spin_unlock_irq(&sem->wait_lock);
 
@@ -231,7 +232,7 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 		do {
 			schedule();
 			set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-		} while (sem->count & RWSEM_ACTIVE_MASK);
+		} while ((count = sem->count) & RWSEM_ACTIVE_MASK);
 
 		raw_spin_lock_irq(&sem->wait_lock);
 	}
-- 
1.8.1.3


* [PATCH 10/12] rwsem-spinlock: wake all readers when first waiter is a reader
  2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
                   ` (8 preceding siblings ...)
  2013-03-06 23:21 ` [PATCH 09/12] rwsem: skip initial trylock " Michel Lespinasse
@ 2013-03-06 23:21 ` Michel Lespinasse
  2013-03-06 23:21 ` [PATCH 11/12] rwsem: " Michel Lespinasse
  2013-03-06 23:21 ` [PATCH 12/12] x86 rwsem: avoid taking slow path when stealing write lock Michel Lespinasse
  11 siblings, 0 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-06 23:21 UTC (permalink / raw)
  To: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel
  Cc: Andrew Morton, linux-kernel

When the first queued waiter is a reader, wake all readers instead of
just those that are at the front of the queue. This should result in
increased parallelism for workloads that mix readers and writers.

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 lib/rwsem-spinlock.c | 39 ++++++++++++++++-----------------------
 1 file changed, 16 insertions(+), 23 deletions(-)

diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
index 5f117f37ac0a..d22948eeaeae 100644
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -70,36 +70,29 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
 
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
 
-	if (!wakewrite) {
-		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
-			goto out;
-		goto dont_wake_writers;
-	}
-
-	/*
-	 * as we support write lock stealing, we can't set sem->activity
-	 * to -1 here to indicate we get the lock. Instead, we wake it up
-	 * to let it go get it again.
-	 */
 	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
-		wake_up_process(waiter->task);
+		if (wakewrite)
+			/* Wake up a writer. Note that we do not grant it the
+			 * lock - it will have to acquire it when it runs. */
+			wake_up_process(waiter->task);
 		goto out;
 	}
 
-	/* grant an infinite number of read locks to the front of the queue */
- dont_wake_writers:
+	/* grant read locks to all queued readers. */
 	woken = 0;
-	while (waiter->type == RWSEM_WAITING_FOR_READ) {
+	while (true) {
 		struct list_head *next = waiter->list.next;
 
-		list_del(&waiter->list);
-		tsk = waiter->task;
-		smp_mb();
-		waiter->task = NULL;
-		wake_up_process(tsk);
-		put_task_struct(tsk);
-		woken++;
-		if (list_empty(&sem->wait_list))
+		if (waiter->type != RWSEM_WAITING_FOR_WRITE) {
+			list_del(&waiter->list);
+			tsk = waiter->task;
+			smp_mb();
+			waiter->task = NULL;
+			wake_up_process(tsk);
+			put_task_struct(tsk);
+			woken++;
+		}
+		if (next == &sem->wait_list)
 			break;
 		waiter = list_entry(next, struct rwsem_waiter, list);
 	}
-- 
1.8.1.3


* [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
                   ` (9 preceding siblings ...)
  2013-03-06 23:21 ` [PATCH 10/12] rwsem-spinlock: wake all readers when first waiter is a reader Michel Lespinasse
@ 2013-03-06 23:21 ` Michel Lespinasse
  2013-03-09  0:32   ` Dave Chinner
  2013-03-11 20:36   ` Peter Hurley
  2013-03-06 23:21 ` [PATCH 12/12] x86 rwsem: avoid taking slow path when stealing write lock Michel Lespinasse
  11 siblings, 2 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-06 23:21 UTC (permalink / raw)
  To: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel
  Cc: Andrew Morton, linux-kernel

When the first queued waiter is a reader, wake all readers instead of
just those that are at the front of the queue. There are really two
motivations for this change:

- This should result in increased parallelism for workloads that mix
  readers and writers.

- As we want to enable fast-path write lock stealing (without taking the
  wait_lock), we don't want to spend much time counting readers to wake
  while fast-path write lock stealing might still get the lock. By
  maintaining a count of queued readers, and waking them all at once, we
  avoid having to count readers before attempting to grant them read locks.

Regarding the implementation, a few things are worth mentioning:

- On 64-bit architectures, wait_readers fills a hole that would otherwise
  be present between wait_lock and wait_list, so it doesn't actually grow
  the struct rw_semaphore size.

- The only difference between the RWSEM_WAKE_ANY and RWSEM_WAKE_NO_ACTIVE
  wake types was whether we need to check for write locks before granting
  read locks. The updated code always starts by granting read locks, so
  the difference is now moot. We can thus replace the wake_type with a
  boolean 'wakewrite' value, just as in rwsem-spinlock.
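
To make the count manipulation concrete, here is a worked example using
the 64-bit x86 bias values (RWSEM_ACTIVE_READ_BIAS == 1,
RWSEM_WAITING_BIAS == -0x100000000, RWSEM_ACTIVE_WRITE_BIAS ==
RWSEM_WAITING_BIAS + 1); the exact constants are per-architecture, so
this is illustration only:

/*
 * Three readers queued, no active lockers:
 *   count      == RWSEM_WAITING_BIAS               == -0x100000000
 *   adjustment == 3 * RWSEM_ACTIVE_READ_BIAS
 *                   - RWSEM_WAITING_BIAS           ==  0x100000003
 *   oldcount   == -0x100000000, which is not below RWSEM_WAITING_BIAS,
 *   so the grant stands; the new count is 3 (three active readers, and
 *   no waiting bias left since the wait list is now empty).
 *
 * Same situation, but a fast-path writer stole the lock just before our
 * atomic update:
 *   count      == RWSEM_WAITING_BIAS
 *                   + RWSEM_ACTIVE_WRITE_BIAS      == -0x1ffffffff
 *   oldcount is below RWSEM_WAITING_BIAS, so we back out the reader
 *   grants and either bail out (the writer is still active) or retry
 *   once the undo shows the writer has released the lock.
 */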

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 include/linux/rwsem.h |   2 +
 lib/rwsem.c           | 128 ++++++++++++++++++++------------------------------
 2 files changed, 53 insertions(+), 77 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 8da67d625e13..be6f9649512d 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -25,6 +25,7 @@ struct rw_semaphore;
 struct rw_semaphore {
 	long			count;
 	raw_spinlock_t		wait_lock;
+	unsigned int		wait_readers;
 	struct list_head	wait_list;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map	dep_map;
@@ -58,6 +59,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #define __RWSEM_INITIALIZER(name)			\
 	{ RWSEM_UNLOCKED_VALUE,				\
 	  __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock),	\
+	  0,						\
 	  LIST_HEAD_INIT((name).wait_list)		\
 	  __RWSEM_DEP_MAP_INIT(name) }
 
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 0d50e46d5b0c..f9a57054e9f2 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -41,14 +41,6 @@ struct rwsem_waiter {
 	enum rwsem_waiter_type type;
 };
 
-/* Wake types for __rwsem_do_wake().  Note that RWSEM_WAKE_NO_ACTIVE and
- * RWSEM_WAKE_READ_OWNED imply that the spinlock must have been kept held
- * since the rwsem value was observed.
- */
-#define RWSEM_WAKE_ANY        0 /* Wake whatever's at head of wait list */
-#define RWSEM_WAKE_NO_ACTIVE  1 /* rwsem was observed with no active thread */
-#define RWSEM_WAKE_READ_OWNED 2 /* rwsem was observed to be read owned */
-
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then:
@@ -60,83 +52,64 @@ struct rwsem_waiter {
  * - writers are only woken if downgrading is false
  */
 static struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
+__rwsem_do_wake(struct rw_semaphore *sem, bool wakewrite)
 {
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
 	struct list_head *next;
-	signed long woken, loop, adjustment;
+	signed long oldcount, readers, adjustment;
 
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
-	if (waiter->type != RWSEM_WAITING_FOR_WRITE)
-		goto readers_only;
-
-	if (wake_type == RWSEM_WAKE_READ_OWNED)
-		/* Another active reader was observed, so wakeup is not
-		 * likely to succeed. Save the atomic op.
-		 */
+	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
+		if (wakewrite)
+			/* Wake writer at the front of the queue, but do not
+			 * grant it the lock yet as we want other writers
+			 * to be able to steal it.  Readers, on the other hand,
+			 * will block as they will notice the queued writer.
+			 */
+			wake_up_process(waiter->task);
 		goto out;
+	}
 
-	/* Wake up the writing waiter and let the task grab the sem: */
-	wake_up_process(waiter->task);
-	goto out;
-
- readers_only:
-	/* If we come here from up_xxxx(), another thread might have reached
-	 * rwsem_down_failed_common() before we acquired the spinlock and
-	 * woken up a waiter, making it now active.  We prefer to check for
-	 * this first in order to not spend too much time with the spinlock
-	 * held if we're not going to be able to wake up readers in the end.
-	 *
-	 * Note that we do not need to update the rwsem count: any writer
-	 * trying to acquire rwsem will run rwsem_down_write_failed() due
-	 * to the waiting threads and block trying to acquire the spinlock.
-	 *
-	 * We use a dummy atomic update in order to acquire the cache line
-	 * exclusively since we expect to succeed and run the final rwsem
-	 * count adjustment pretty soon.
+	/* Writers might steal the lock before we grant it to the next reader.
+	 * We grant the lock to readers ASAP so we can bail out early if a
+	 * writer stole the lock.
 	 */
-	if (wake_type == RWSEM_WAKE_ANY &&
-	    rwsem_atomic_update(0, sem) < RWSEM_WAITING_BIAS)
-		/* Someone grabbed the sem for write already */
-		goto out;
+	readers = sem->wait_readers;
+	adjustment = readers * RWSEM_ACTIVE_READ_BIAS - RWSEM_WAITING_BIAS;
+ retry_reader_grants:
+	oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
+	if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
+		/* A writer stole the lock.  Undo our reader grants. */
+		if (rwsem_atomic_update(-adjustment, sem) < RWSEM_WAITING_BIAS)
+			goto out;
+		/* The writer left.  Retry waking readers. */
+		goto retry_reader_grants;
+	}
 
-	/* Grant an infinite number of read locks to the readers at the front
-	 * of the queue.  Note we increment the 'active part' of the count by
-	 * the number of readers before waking any processes up.
+	/* Read locks have been granted to all queued readers; we can now
+	 * actualy wake them.
 	 */
-	woken = 0;
 	do {
-		woken++;
-
-		if (waiter->list.next == &sem->wait_list)
-			break;
-
-		waiter = list_entry(waiter->list.next,
-					struct rwsem_waiter, list);
-
-	} while (waiter->type != RWSEM_WAITING_FOR_WRITE);
-
-	adjustment = woken * RWSEM_ACTIVE_READ_BIAS;
-	if (waiter->type != RWSEM_WAITING_FOR_WRITE)
-		/* hit end of list above */
-		adjustment -= RWSEM_WAITING_BIAS;
-
-	rwsem_atomic_add(adjustment, sem);
-
-	next = sem->wait_list.next;
-	for (loop = woken; loop > 0; loop--) {
-		waiter = list_entry(next, struct rwsem_waiter, list);
 		next = waiter->list.next;
-		tsk = waiter->task;
-		smp_mb();
-		waiter->task = NULL;
-		wake_up_process(tsk);
-		put_task_struct(tsk);
-	}
-
-	sem->wait_list.next = next;
-	next->prev = &sem->wait_list;
+		if (waiter->type != RWSEM_WAITING_FOR_WRITE) {
+			list_del(&waiter->list);
+
+			/* Set RWSEM_WAITING_BIAS before waking the last reader
+			 * so we know there will be a remaining active locker.
+			 */
+			if (!(--readers) && !list_empty(&sem->wait_list))
+				rwsem_atomic_add(RWSEM_WAITING_BIAS, sem);
+
+			tsk = waiter->task;
+			smp_mb();
+			waiter->task = NULL;
+			wake_up_process(tsk);
+			put_task_struct(tsk);
+		}
+		waiter = list_entry(next, struct rwsem_waiter, list);
+	} while (readers);
+	sem->wait_readers = 0;
 
  out:
 	return sem;
@@ -158,6 +131,7 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 	get_task_struct(tsk);
 
 	raw_spin_lock_irq(&sem->wait_lock);
+	sem->wait_readers++;
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
 	list_add_tail(&waiter.list, &sem->wait_list);
@@ -166,8 +140,8 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 	count = rwsem_atomic_update(adjustment, sem);
 
 	/* If there are no active locks, wake the front queued process(es). */
-	if (count == RWSEM_WAITING_BIAS)
-		sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);
+	if (!(count & RWSEM_ACTIVE_MASK))
+		sem = __rwsem_do_wake(sem, true);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
 
@@ -211,7 +185,7 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 	 * any read locks that were queued ahead of us. */
 	if (count > RWSEM_WAITING_BIAS &&
 	    adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
-		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
+		sem = __rwsem_do_wake(sem, false);
 
 	/* wait until we successfully acquire the lock */
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
@@ -256,7 +230,7 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
+		sem = __rwsem_do_wake(sem, true);
 
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 
@@ -276,7 +250,7 @@ struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
+		sem = __rwsem_do_wake(sem, false);
 
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 
-- 
1.8.1.3


* [PATCH 12/12] x86 rwsem: avoid taking slow path when stealing write lock
  2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
                   ` (10 preceding siblings ...)
  2013-03-06 23:21 ` [PATCH 11/12] rwsem: " Michel Lespinasse
@ 2013-03-06 23:21 ` Michel Lespinasse
  11 siblings, 0 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-06 23:21 UTC (permalink / raw)
  To: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel
  Cc: Andrew Morton, linux-kernel

Modify __down_write[_nested] and __down_write_trylock to grab the write
lock whenever the active count is 0, even if there are queued waiters
(they must be writers pending wakeup, since the active count is 0).

Note that this is an optimization only; architectures without this
optimization will still work fine:

- __down_write() would take the slow path which would take the wait_lock
  and then try stealing the lock (as in the spinlocked rwsem implementation)

- __down_write_trylock() would fail, but callers must be ready to deal
  with that - since there are some writers pending wakeup, they could
  have raced with us and obtained the lock before we steal it.
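
For readers not fluent in the inline assembly below, the new
__down_write_trylock is roughly equivalent to the following C (a sketch
for illustration, not part of the patch):

static inline int __down_write_trylock_sketch(struct rw_semaphore *sem)
{
        long count = sem->count;

        while (!(count & RWSEM_ACTIVE_MASK)) {
                long old = cmpxchg(&sem->count, count,
                                   count + RWSEM_ACTIVE_WRITE_BIAS);
                if (old == count)
                        return 1;       /* write lock acquired */
                count = old;    /* count changed; re-test the active mask */
        }
        return 0;               /* active lockers present: trylock fails */
}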

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 arch/x86/include/asm/rwsem.h | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index 2dbe4a721ce5..cad82c9c2fde 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -105,8 +105,8 @@ static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
 	asm volatile("# beginning down_write\n\t"
 		     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
 		     /* adds 0xffff0001, returns the old value */
-		     "  test      %1,%1\n\t"
-		     /* was the count 0 before? */
+		     "  test " __ASM_SEL(%w1,%k1) "," __ASM_SEL(%w1,%k1) "\n\t"
+		     /* was the active mask 0 before? */
 		     "  jz        1f\n"
 		     "  call call_rwsem_down_write_failed\n"
 		     "1:\n"
@@ -126,11 +126,25 @@ static inline void __down_write(struct rw_semaphore *sem)
  */
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
-	long ret = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
-			   RWSEM_ACTIVE_WRITE_BIAS);
-	if (ret == RWSEM_UNLOCKED_VALUE)
-		return 1;
-	return 0;
+	long result, tmp;
+	asm volatile("# beginning __down_write_trylock\n\t"
+		     "  mov          %0,%1\n\t"
+		     "1:\n\t"
+		     "  test " __ASM_SEL(%w1,%k1) "," __ASM_SEL(%w1,%k1) "\n\t"
+		     /* was the active mask 0 before? */
+		     "  jnz          2f\n\t"
+		     "  mov          %1,%2\n\t"
+		     "  add          %3,%2\n\t"
+		     LOCK_PREFIX "  cmpxchg  %2,%0\n\t"
+		     "  jnz	     1b\n\t"
+		     "2:\n\t"
+		     "  sete         %b1\n\t"
+		     "  movzbl       %b1, %k1\n\t"
+		     "# ending __down_write_trylock\n\t"
+		     : "+m" (sem->count), "=&a" (result), "=&r" (tmp)
+		     : "er" (RWSEM_ACTIVE_WRITE_BIAS)
+		     : "memory", "cc");
+	return result;
 }
 
 /*
-- 
1.8.1.3


* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-06 23:21 ` [PATCH 11/12] rwsem: " Michel Lespinasse
@ 2013-03-09  0:32   ` Dave Chinner
  2013-03-09  1:20     ` Michel Lespinasse
  2013-03-11 20:36   ` Peter Hurley
  1 sibling, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2013-03-09  0:32 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel, Andrew Morton,
	linux-kernel

On Wed, Mar 06, 2013 at 03:21:50PM -0800, Michel Lespinasse wrote:
> When the first queued waiter is a reader, wake all readers instead of
> just those that are at the front of the queue. There are really two
> motivations for this change:

Isn't this a significant change of semantics for the rwsem? i.e.
that read lock requests that come after a write lock request now
jump ahead of the write lock request? i.e. the write lock request is
no longer a barrier in the queue?

XFS has long assumed that a rwsem write lock is a barrier that
stops new read locks from being taken, and this change will break
that assumption. Given that this barrier assumption is used as the
basis for serialisation of operations like IO vs truncate, there's a
bit more at stake than just improving parallelism here.  i.e. IO
issued after truncate/preallocate/hole punch could now be issued
ahead of the pending metadata operation, whereas currently the IO
issued after the pending metadata operation (which is waiting for the
write lock) will only be processed -after- the metadata modification
operation completes...

That is a recipe for weird data corruption problems because
applications are likely to have implicit dependencies on the barrier
effect of metadata operations on data IO...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-09  0:32   ` Dave Chinner
@ 2013-03-09  1:20     ` Michel Lespinasse
  2013-03-11  0:16       ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-09  1:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel, Andrew Morton,
	linux-kernel

On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Mar 06, 2013 at 03:21:50PM -0800, Michel Lespinasse wrote:
>> When the first queued waiter is a reader, wake all readers instead of
>> just those that are at the front of the queue. There are really two
>> motivations for this change:
>
> Isn't this a significant change of semantics for the rwsem? i.e.
> that read lock requests that come after a write lock request now
> jump ahead of the write lock request? i.e. the write lock request is
> no longer a barrier in the queue?

Yes, I am allowing readers to skip ahead of writers in the queue (but
only if they can run with another reader that was already ahead).

I don't see that this is a change of observable semantics for correct
programs. If a reader and a writer both block on the rwsem, how do you
know for sure which one got queued first ? rwsem API doesn't give you
any easy way to know whether a thread is currently queued on the rwsem
(it could also be descheduled before it gets onto the rwsem queue).

But yes, if you're making assumptions about queuing order the change
makes it more likely that they'll be observably wrong.

> XFS has long assumed that a rwsem write lock is a barrier that
> stops new read locks from being taken, and this change will break
> that assumption. Given that this barrier assumption is used as the
> basis for serialisation of operations like IO vs truncate, there's a
> bit more at stake than just improving parallelism here.  i.e. IO
> issued after truncate/preallocate/hole punch could now be issued
> ahead of the pending metadata operation, whereas currently the IO
> issued after the pending metadata operation (which is waiting for the
> write lock) will only be processed -after- the metadata modification
> operation completes...
>
> That is a recipe for weird data corruption problems because
> applications are likely to have implicit dependencies on the barrier
> effect of metadata operations on data IO...

I am confused as to exactly what XFS is doing, could you point me to
the code / indicate a scenario where this would go wrong ? If you
really rely on this for correctness you'd have to do something already
to guarantee that your original queueing order is as desired, and I
just don't see how it'd be done...

That said, it is doable to add support for write lock stealing in the
rwsem write path while still preserving the queueing order of readers
vs writers; I'm just not sure that I fully understand the correctness
concern at this point.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-09  1:20     ` Michel Lespinasse
@ 2013-03-11  0:16       ` Dave Chinner
  2013-03-11  5:17         ` Michel Lespinasse
  2013-03-11  7:50         ` Ingo Molnar
  0 siblings, 2 replies; 30+ messages in thread
From: Dave Chinner @ 2013-03-11  0:16 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel, Andrew Morton,
	linux-kernel

On Fri, Mar 08, 2013 at 05:20:34PM -0800, Michel Lespinasse wrote:
> On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Mar 06, 2013 at 03:21:50PM -0800, Michel Lespinasse wrote:
> >> When the first queued waiter is a reader, wake all readers instead of
> >> just those that are at the front of the queue. There are really two
> >> motivations for this change:
> >
> > Isn't this a significant change of semantics for the rwsem? i.e.
> > that read lock requests that come after a write lock request now
> > jump ahead of the write lock request? i.e. the write lock request is
> > no longer a barrier in the queue?
> 
> Yes, I am allowing readers to skip ahead of writers in the queue (but
> only if they can run with another reader that was already ahead).
> 
> I don't see that this is a change of observable semantics for correct
> programs. If a reader and a writer both block on the rwsem, how do you
> know for sure which one got queued first? rwsem API doesn't give you
> any easy way to know whether a thread is currently queued on the rwsem
> (it could also be descheduled before it gets onto the rwsem queue).

There are algorithms that rely on write locks to act as read-side
barriers to prevent write side starvation. i.e. if you keep queuing
up read locks, the writer never gets the lock, thereby starving the
writer.

> But yes, if you're making assumptions about queuing order the change
> makes it more likely that they'll be observably wrong.
> 
> > XFS has long assumed that a rwsem write lock is a barrier that
> > stops new read locks from being taken, and this change will break
> > that assumption. Given that this barrier assumption is used as the
> > basis for serialisation of operations like IO vs truncate, there's a
> > bit more at stake than just improving parallelism here.  i.e. IO
> > issued after truncate/preallocate/hole punch could now be issued
> > ahead of the pending metadata operation, whereas currently the IO
> > issued after the pending metadata operation is waiting for the write
> > lock will only be processed -after- the metadata modification
> > operation completes...
> >
> > That is a recipe for weird data corruption problems because
> > applications are likely to have implicit dependencies on the barrier
> > effect of metadata operations on data IO...
> 
> I am confused as to exactly what XFS is doing, could you point me to
> the code / indicate a scenario where this would go wrong ? If you
> really rely on this for correctness you'd have to do something already
> to guarantee that your original queueing order is as desired, and I
> just don't see how it'd be done...

My point isn't that XFS gets the serialisation wrong - my point is
that the change of queuing order will change the userspace visible
behaviour of the filesystem *significantly*.

For example: - direct IO only takes read locks, while truncate takes
a write lock.  Right now we can drop a truncate, preallocation or
hole punch operation into a stream of concurrent direct IO and have
it execute almost immediately - the barrier mechanism in the rwsem
ensures that it will be executed ahead of any future IO that is
issued by the application. With your proposed change, that operation
won't take place until all current *and all future* IOs stop and the
read lock is no longer held by anyone.

To put this in context, here's the Irix XFS iolock initialisation
code from mid 1997, where the barrier semantics are first explicitly
documented:

	mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", (long)vp->v_number);

It was only coded like this in 1997, because the Irix multi-reader
lock only grew read-side queue jumping and priority inheritance at
this time. This changed the default behaviour from write lock
barrier semantics to something similar to what you are proposing for
rwsems right now. All the locks that relied on barrier semantics of
write locks had to be configured explicitly to have that behaviour
rather than implicitly relying on the fact that mrlocks had provided
write barrier semantics.

So, that's my concern - we've got 20 years of algorithms in XFS
designed around rwsem write locks providing read-side barriers, and
it's been down this road once before (over 15 years ago).  Changing
the semantics of rwsems is going to break stuff in subtle and
unpredictable ways....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-11  0:16       ` Dave Chinner
@ 2013-03-11  5:17         ` Michel Lespinasse
  2013-03-12  2:36           ` Dave Chinner
  2013-03-11  7:50         ` Ingo Molnar
  1 sibling, 1 reply; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-11  5:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel, Andrew Morton,
	linux-kernel

On Sun, Mar 10, 2013 at 5:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Mar 08, 2013 at 05:20:34PM -0800, Michel Lespinasse wrote:
>> On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > Isn't this a significant change of semantics for the rwsem? i.e.
>> > that read lock requests that come after a write lock request now
>> > jump ahead of the write lock request? i.e. the write lock request is
>> > no longer a barrier in the queue?
>>
>> Yes, I am allowing readers to skip ahead of writers in the queue (but
>> only if they can run with another reader that was already ahead).
>>
>> I don't see that this is a change of observable semantics for correct
>> programs. If a reader and a writer both block on the rwsem, how do you
>> know for sure which one got queued first? rwsem API doesn't give you
>> any easy way to know whether a thread is currently queued on the rwsem
>> (it could also be descheduled before it gets onto the rwsem queue).
>
> There are algorithms that rely on write locks to act as read-side
> barriers to prevent write side starvation. i.e. if you keep queuing
> up read locks, the writer never gets the lock, thereby starving the
> writer.

What I propose actually isn't as bad as a reader preference rwlock
like rwlock_t. When a reader makes it to the head of the queue, all
readers get woken. At that point there are only writers left in the
queue, and new readers will get queued after them and won't be able to
skip over them (since they have no earlier reader to join up with).
So, I am not throwing reader/writer fairness out the window here.
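
To make that concrete, here is a rough sketch of the wakeup rule --
simplified types for illustration only, not the actual __rwsem_do_wake()
code, which also has to handle the count bookkeeping:

#include <linux/list.h>
#include <linux/sched.h>

struct waiter {
	struct list_head list;
	struct task_struct *task;
	bool is_reader;
};

/* Called with the rwsem's wait_lock held.  If the waiter at the head of
 * the queue is a reader, wake every queued reader; writers keep their
 * place in line, so readers that arrive later still queue behind them. */
static void wake_readers_if_first_is_reader(struct list_head *wait_list)
{
	struct waiter *first, *w, *tmp;

	if (list_empty(wait_list))
		return;
	first = list_first_entry(wait_list, struct waiter, list);
	if (!first->is_reader)
		return;			/* head is a writer: wake nothing */

	list_for_each_entry_safe(w, tmp, wait_list, list) {
		if (!w->is_reader)
			continue;	/* leave writers queued */
		list_del(&w->list);
		wake_up_process(w->task);
	}
}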

>> > XFS has long assumed that a rwsem write lock is a barrier that
>> > stops new read locks from being taken, and this change will break
>> > that assumption. Given that this barrier assumption is used as the
>> > basis for serialisation of operations like IO vs truncate, there's a
>> > bit more at stake than just improving parallelism here.  i.e. IO
>> > issued after truncate/preallocate/hole punch could now be issued
>> > ahead of the pending metadata operation, whereas currently the IO
>> > issued after the pending metadata operation is waiting for the write
>> > lock will only be processed -after- the metadata modification
>> > operation completes...
>> >
>> > That is a recipe for weird data corruption problems because
>> > applications are likely to have implicit dependencies on the barrier
>> > effect of metadata operations on data IO...
>>
>> I am confused as to exactly what XFS is doing, could you point me to
>> the code / indicate a scenario where this would go wrong ? If you
>> really rely on this for correctness you'd have to do something already
>> to guarantee that your original queueing order is as desired, and I
>> just don't see how it'd be done...
>
> My point isn't that XFS gets the serialisation wrong - my point is
> that the change of queuing order will change the userspace visible
> behaviour of the filesystem *significantly*.
>
> For example: - direct IO only takes read locks, while truncate takes
> a write lock.  Right now we can drop a truncate, preallocation or
> hole punch operation into a stream of concurrent direct IO and have
> it execute almost immediately - the barrier mechanism in the rwsem
> ensures that it will be executed ahead of any future IO that is
> issued by the application. With your proposed change, that operation
> won't take place until all current *and all future* IOs stop and the
> read lock is no longer held by anyone.

You actually wouldn't get such starvation with my proposal.

What you might see would be:

- Assuming you have up to N concurrent reads in progress, a writer
might see up to 2N-1 readers proceed in front of it. This limit is
reached if you first had N-1 readers grabbing the rwsem with an empty
queue, then another writer W1 got queued, then a reader RN (note that
it will get queued after W1 and not run concurrently with the existing
N-1 readers), then our writer W2 gets queued. As our N-1 readers
complete their IO operations, they might get queued again after W2,
and then skip ahead of W2 when RN reaches the front of the queue. So,
W2 is still guaranteed to run eventually, but with a worst case
latency that is ~2x worse than before

- since all readers are woken at once, you might see burst of direct
IO operations followed by bursts of truncate operations, instead of
having them interleaved in smaller groups as before.

I'm not sure if these characteristics are good enough for the XFS use
case. It seems tougher than many rwsem use cases, since the benefits
of increased reader parallelism are not as clear here (i.e. the IO
operations still have to hit disk so you don't get much further
benefit if you already had more IO parallelism than platters). So if
this explanation still didn't make you comfortable with the proposal,
I will rework it to avoid the queue reordering.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-11  0:16       ` Dave Chinner
  2013-03-11  5:17         ` Michel Lespinasse
@ 2013-03-11  7:50         ` Ingo Molnar
  1 sibling, 0 replies; 30+ messages in thread
From: Ingo Molnar @ 2013-03-11  7:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Michel Lespinasse, Alex Shi, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel, Andrew Morton,
	linux-kernel


* Dave Chinner <david@fromorbit.com> wrote:

> So, that's my concern - we've got 20 years of algorithms in XFS designed 
> around rwsem write locks providing read-side barriers, and it's been 
> down this road once before (over 15 years ago). [...]

We are certainly not going to break reader/writer fairness - that's a must 
for any generic, long-held reader-writer locking mechanism.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-06 23:21 ` [PATCH 11/12] rwsem: " Michel Lespinasse
  2013-03-09  0:32   ` Dave Chinner
@ 2013-03-11 20:36   ` Peter Hurley
  2013-03-14  7:03     ` Michel Lespinasse
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Hurley @ 2013-03-11 20:36 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel, Andrew Morton,
	linux-kernel


On Wed, 2013-03-06 at 15:21 -0800, Michel Lespinasse wrote:
> + retry_reader_grants:
> +	oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> +	if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> +		/* A writer stole the lock.  Undo our reader grants. */
> +		if (rwsem_atomic_update(-adjustment, sem) < RWSEM_WAITING_BIAS)
> +			goto out;
> +		/* The writer left.  Retry waking readers. */
> +		goto retry_reader_grants;
> +	}

This can be reduced to a single looping cmpxchg in the grant reversal
path; then if reversing the grant fails, the count can simply be
re-tested for grant success, rather than trying to atomically re-grant.
For example, with a helper function, rwsem_cmpxchg():

static inline int rwsem_cmpxchg(long *old, long new, struct rw_semaphore *sem)
{
	long tmp = *old;
	*old = cmpxchg(&sem->count, *old, new);
	return tmp == *old;
}

... then above becomes ...

	count = rwsem_atomic_update(adjustment, sem);
	do {
		if (count - adjustment >= RWSEM_WAITING_BIAS)
			break;
		if (rwsem_cmpxchg(&count, count - adjustment, sem))
			goto out;   /* or simply return sem */
	} while (1);

	< wake up readers >


Also, this series and the original rwsem can mistakenly sleep reader(s)
when the lock is transitioned from writer-owned to waiting readers-owned
with no waiting writers. For example,


  CPU 0                               |  CPU 1
                                      |
                                      | down_write()

... CPU 1 has the write lock for the semaphore.
    Meanwhile, 1 or more down_read(s) are attempted and fail;
    these are put on the wait list. Then ...

down_read()                           | up_write()
  local = atomic_update(+read_bias)   |
  local <= 0?                         |   local = atomic_update(-write_bias)
  if (true)                           |   local < 0?
     down_read_failed()               |   if (true)
                                      |      wake()
                                      |         grab wait_lock
        wait for wait_lock            |         wake all readers
                                      |         release wait_lock

... At this point, sem->count > 0 and the wait list is empty,
    but down_read_failed() will sleep the reader.


In this case, CPU 0 has observed the sem count with the write lock (and
the other waiters) and so is detoured to down_read_failed(). But if 
CPU 0 can't grab the wait_lock before the up_write() does (via
rwsem_wake()), then down_read_failed() will wake no one and sleep the
reader.

Unfortunately, this means readers and writers which observe the sem
count after the adjustment is committed by CPU 0 in down_read_failed()
will sleep as well, until the sem count returns to 0.

I think the best solution would be to preserve the observed count when
down_read() fails and pass it to rwsem_down_read_failed() -- of course,
this is also the most disruptive approach as it changes the per-arch
interface (the attached patch does not include the arch changes). The
other alternative is to go through the __rwsem_do_wake() path.

Regards,
Peter Hurley

--- >% ---
Subject: [PATCH] rwsem: Early-out tardy readers


Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
---
 lib/rwsem.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index f9a5705..8eb2cdf 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -118,12 +118,11 @@ __rwsem_do_wake(struct rw_semaphore *sem, bool wakewrite)
 /*
  * wait for the read lock to be granted
  */
-struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
+struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem, long count)
 {
 	signed long adjustment = -RWSEM_ACTIVE_READ_BIAS;
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
-	signed long count;
 
 	/* set up my own style of waitqueue */
 	waiter.task = tsk;
@@ -131,6 +130,20 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 	get_task_struct(tsk);
 
 	raw_spin_lock_irq(&sem->wait_lock);
+
+	/* Try to reverse the lock attempt but if the count has changed
+	 * so that reversing fails, check if there are any waiters,
+	 * and early-out if not */
+	do {
+		if (rwsem_cmpxchg(&count, count + adjustment, sem))
+			break;
+		if (count > 0) {
+			raw_spin_unlock_irq(&sem->wait_lock);
+			put_task_struct(tsk);
+			return sem;
+		}
+	} while (1);
+
 	sem->wait_readers++;
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
-- 
1.8.1.2





^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-11  5:17         ` Michel Lespinasse
@ 2013-03-12  2:36           ` Dave Chinner
  2013-03-12  6:43             ` Michel Lespinasse
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2013-03-12  2:36 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel, Andrew Morton,
	linux-kernel

On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> On Sun, Mar 10, 2013 at 5:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Mar 08, 2013 at 05:20:34PM -0800, Michel Lespinasse wrote:
> >> On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > Isn't this a significant change of semantics for the rwsem? i.e.
> >> > that read lock requests that come after a write lock request now
> >> > jump ahead of the write lock request? i.e. the write lock request is
> >> > no longer a barrier in the queue?
> >>
> >> Yes, I am allowing readers to skip ahead of writers in the queue (but
> >> only if they can run with another reader that was already ahead).
> >>
> >> I don't see that this is a change of observable semantics for correct
> >> programs. If a reader and a writer both block on the rwsem, how do you
> >> know for sure which one got queued first? rwsem API doesn't give you
> >> any easy way to know whether a thread is currently queued on the rwsem
> >> (it could also be descheduled before it gets onto the rwsem queue).
> >
> > There are algorithms that rely on write locks to act as read-side
> > barriers to prevent write side starvation. i.e. if you keep queuing
> > up read locks, the writer never gets the lock, thereby starving the
> > writer.
> 
> What I propose actually isn't as bad as a reader preference rwlock
> like rwlock_t. When a reader makes it to the head of the queue, all
> readers get woken. At that point there are only writers left in the
> queue, and new readers will get queued after them and won't be able to
> skip over them (since they have no earlier reader to join up with).
> So, I am not throwing reader/writer fairness out the window here.

OK, but....

> > My point isn't that XFS gets the serialisation wrong - my point is
> > that the change of queuing order will change the userspace visible
> > behaviour of the filesystem *significantly*.
> >
> > For example: - direct IO only takes read locks, while truncate takes
> > a write lock.  Right now we can drop a truncate, preallocation or
> > hole punch operation into a stream of concurrent direct IO and have
> > it execute almost immediately - the barrier mechanism in the rwsem
> > ensures that it will be executed ahead of any future IO that is
> > issued by the application. With your proposed change, that operation
> > won't take place until all current *and all future* IOs stop and the
> > read lock is no longer held by anyone.
> 
> You actually wouldn't get such starvation with my proposal.
> 
> What you might see would be:
> 
> - Assuming you have up to N concurrent reads in progress, a writer
> might see up to 2N-1 readers proceed in front of it. This limit is
> reached if you first had N-1 readers grabbing the rwsem with an empty
> queue, then another writer W1 got queued,

So something like

	RRRRRRRRRRRRRW1

> then a reader RN (note that
> it will get queued after W1 and not run concurrently with the existing
> N-1 readers), then our writer W2 gets queued.

So:

	RRRRRRRRRRRRRW1rrrrrrrrrrrrW2
>
> As our N-1 readers
> complete their IO operations, they might get queued again after W2,

So:
	W1rrrrrrrrrrrrW2RRRRRRRRRRRRR

> and then skip ahead of W2 when RN reaches the front of the queue. So,
> W2 is still guaranteed to run eventually, but with a worst case
> latency that is ~2x worse than before

So, when W1 is released, the queue is treated as though it was:

	rrrrrrrrrrrrRRRRRRRRRRRRRW2

yes? Ok, so I can see where your latency figure comes from, but it's
still a change of ordering in that W2 is no longer a barrier to the
reads queued after it.

And if we extend that to WN, we have an interleaved queue
similar to this:

	W1R1W2R2W3R3.....WnRm

By my reading of the algorithm you are proposing, after W1 is
released, we end up with the queue being treated like this:

	R1R2R3....RmW2W3....Wn

Right?

If so, that is definitely a change of lock ordering semantics -
writes do not have barrier properties anymore.

> - since all readers are woken at once, you might see burst of direct
> IO operations followed by bursts of truncate operations, instead of
> having them interleaved in smaller groups as before.

And this will result in the same application IO pattern giving
vastly different results. e.g. a read IO promoted ahead of a
truncate will now return data instead of -1 (beyond EOF).

> I'm not sure if these characteristics are good enough for the XFS use
> case. It seems tougher than many rwsem use cases, since the benefits
> of increased reader parallelism are not as clear here (i.e. the IO

Well defined, predictable, deterministic ordering of operations takes
far more precedence over parallelism when it comes to filesystem
behaviour. The rwsems already have enough parallelism to allow us to
drive more than 2 million 4k IOPS to a single file via
multi-threaded direct IO(*), so we aren't limited by parallelism and
concurrency in rwsems.

> So if
> this explanation still didn't make you comfortable with the proposal,
> I will rework it to avoid the queue reordering.

Not really, but I'm still not sure I fully understand the different
queuing/queue-jumping semantics....

Cheers,

Dave.
> 
> -- 
> Michel "Walken" Lespinasse
> A program is never fully debugged until the last user dies.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-12  2:36           ` Dave Chinner
@ 2013-03-12  6:43             ` Michel Lespinasse
  2013-03-13  3:23               ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-12  6:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel, Andrew Morton,
	linux-kernel

Hi Dave,

On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
>> - since all readers are woken at once, you might see burst of direct
>> IO operations followed by bursts of truncate operations, instead of
>> having them interleaved in smaller groups as before.
>
> And this will result in the same application IO pattern giving
> vastly different results. e.g. a read IO promoted ahead of a
> truncate will now return data instead of -1 (beyond EOF).

I will reply to this part first, as I think this gives a more concrete
anchor to the discussion.

The crucial point is that the truncate system call hasn't completed
yet - that thread is still queued in the rwsem.

You still have the guarantee that the truncate will complete
eventually (even if there is a never-ending stream of IOs), and that
all IO system calls that start after the truncate system call
completes will see the file as truncated.

You don't have guarantees about which system call will "appear to run
before the other" if the execution times of the two system calls
overlap each other, but you actually never had such a guarantee from a
correctness perspective (i.e. you could never guarantee which of the
two would get queued on the rwsem ahead of the other).

> So:
>
>         RRRRRRRRRRRRRW1rrrrrrrrrrrrW2
>>
>> As our N-1 readers
>> complete their IO operations, they might get queued again after W2,
>
> So:
>         W1rrrrrrrrrrrrW2RRRRRRRRRRRRR
>
>> and then skip ahead of W2 when RN reaches the front of the queue. So,
>> W2 is still guaranteed to run eventually, but with a worst case
>> latency that is ~2x worse than before
>
> So, when W1 is released, the queue is treated as though it was:
>
>         rrrrrrrrrrrrRRRRRRRRRRRRRW2
>
> yes?

Yes.

> Ok, so I can see where your latency figure comes from, but it's
> still a change of ordering in that W2 is no longer a barrier to the
> reads queued after it.

My claim is that it's not a visible change from a correctness
perspective - if the extra readers R are submitted before W2 completes
then they can't assume their execution order vs W2. From a correctness
perspective, the outcome won't look any different as if W2's thread
had been delayed right before entering the truncate system call.

Now from a performance point of view, I concede there IS a difference
as this will happen more often than before. But from a correctness
point of view, this is not a new possible outcome - it was already
possible before.

> And if we extend that to WN, we have an interleaved queue
> similar to this:
>
>         W1R1W2R2W3R3.....WnRm
>
> By my reading of the algorithm you are proposing, after W1 is
> released, we end up with the queue being treated like this:
>
>         R1R2R3....RmW2W3....Wn
>
> Right?

Yes, if what you are representing is the state of the queue at a given
point of time (which implies R1..Rm must be distinct threads, not just
the same thread doing repeated requests).

As new requests come in over time, one important point to remember is
that each writer will see at most one batch of readers wake up ahead
of it (though this batch may include readers that were originally
queued both before and after it). This guarantees that the writer
won't get starved / will get its turn running eventually.

> If so, that is definitely a change of lock ordering semantics -
> writes do not have barrier properties anymore.

I find the name 'barrier' actually confusing when used to describe
synchronous operations. To me a barrier is usually between asynchronous
operations, and it is well defined which operations are ahead or
behind of the barrier (either because they were queued by the same
thread, or they were queued by different threads which may have
synchronized together some other way). But in this case, we have two
synchronous operations with overlapping execution times - they don't
have a way to know which one is 'supposed to' be ahead of the other.
The two threads can't observe their relative ordering since they are
blocked during the operation...

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-12  6:43             ` Michel Lespinasse
@ 2013-03-13  3:23               ` Dave Chinner
  2013-03-13 11:03                 ` Michel Lespinasse
  2013-03-14  2:00                 ` Peter Hurley
  0 siblings, 2 replies; 30+ messages in thread
From: Dave Chinner @ 2013-03-13  3:23 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel, Andrew Morton,
	linux-kernel

On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote:
> Hi Dave,
> 
> On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> >> - since all readers are woken at once, you might see burst of direct
> >> IO operations followed by bursts of truncate operations, instead of
> >> having them interleaved in smaller groups as before.
> >
> > And this will result in the same application IO pattern giving
> > vastly different results. e.g. a read IO promoted ahead of a
> > truncate will now return data instead of -1 (beyond EOF).
> 
> I will reply to this part first, as I think this gives a more concrete
> anchor to the discussion.
> 
> The crucial point is that the truncate system call hasn't completed
> yet - that thread is still queued in the rwsem.
> 
> You still have the guarantee that the truncate will complete
> eventually (even if there is a never-ending stream of IOs), and that
> all IO system calls that start after the truncate system call
> completes will see the file as truncated.

Sure, but the problem is not about when the syscall completes - the
problem is that you are changing where in the pipeline of IO the
truncate is initially executed.  i.e. ordering is not defined by
when an operation completes, but by the order in which the queue is
processed after a blocking operation completes. That is not when the
syscall completes, but when the filesystem drops the exclusive lock.

From a single threaded userspace application perspective or
multithreaded apps with their own userspace locking, truncates
complete when either the syscall completes or some time after when
the application drops its lock. Either way, we're looking at
completion time serialisation, and in that case what XFS or the
kernel does simply doesn't matter.

However, if we are talking about userspace applications that use
lockless IO algorithms or kernel side applications (like knfsd) that
are exposed directly to the filesystem locking semantics,
serialisation behaviour is actually defined by filesystem's
submission side locking behaviour. There is no external
serialisation of IO completion, so serialisation and ordering is
defined (for XFS) solely by the rwsem behaviour.

> You don't have guarantees about which system call will "appear to run
> before the other" if the execution times of the two system calls
> overlap each other, but you actually never had such a guarantee from a
> correctness perspective (i.e. you could never guarantee which of the
> two would get queued on the rwsem ahead of the other).

Sure, but as long as the submission side ordering is deterministic,
it doesn't matter.

> > Ok, so I can see where your latency figure comes from, but it's
> > still a change of ordering in that W2 is no longer a barrier to the
> > reads queued after it.
> 
> My claim is that it's not a visible change from a correctness
> perspective

I am not arguing that it is incorrect. I'm arguing that the change
of ordering semantics breaks assumptions a lot of code makes about
how these locks work.

> > similar to this:
> >
> >         W1R1W2R2W3R3.....WnRm
> >
> > By my reading of the algorithm you are proposing, after W1 is
> > released, we end up with the queue being treated like this:
> >
> >         R1R2R3....RmW2W3....Wn
> >
> > Right?
> 
> Yes, if what you are representing is the state of the queue at a given
> point of time (which implies R1..Rm must be distinct threads, not just
> the same thread doing repeated requests).

Yup, that's very typical.

> As new requests come in over time, one important point to remember is
> that each writer will see at most one batch of readers wake up ahead
> of it (though this batch may include readers that were originally
> queued both before and after it).

And that's *exactly* the problem with the changes you are proposing
- rwsems will no longer provide strongly ordered write vs read
barrier semantics.

> I find the name 'barrier' actually confusing when used to describe
> synchronous operations.  To me a barrier is usually between
> asynchronous operations, and it is well defined which operations
> are ahead or behind of the barrier (either because they were
> queued by the same thread, or they were queued by different
> threads which may have synchronized together some other way).

When you have hundreds or thousands of threads doing IO to the one
file, it doesn't matter if the IO is issued synchronously or
asynchronously by the threads - you simply have a huge amount of
data IO concurrency and very, very deep pipelines.

Inserting a metadata modification (truncate, preallocation,
holepunch, etc) into that pipeline currently causes all new
submissions to queue behind the metadata modification, waits for
everything submitted before the metadata modification to complete
and then runs the metadata modification. Once it completes, it then
allows everything queued up to the next metadata modification to
run....

That matches your definition of a barrier pretty well, I think.

> But in this case, we have two
> synchronous operations with overlapping execution times - they don't
> have a way to know which one is 'supposed to' be ahead of the other.
> The two threads can't observe their relative ordering since they are
> blocked during the operation...

We don't care about the ordering between multiple concurrent
metadata modifications - what matters is whether the ongoing data IO
around them is ordered correctly.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-13  3:23               ` Dave Chinner
@ 2013-03-13 11:03                 ` Michel Lespinasse
  2013-03-14  2:00                 ` Peter Hurley
  1 sibling, 0 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-13 11:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel, Andrew Morton,
	linux-kernel

On Tue, Mar 12, 2013 at 8:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote:
>> I find the name 'barrier' actually confusing when used to describe
>> synchronous operations.  To me a barrier is usually between
>> asynchronous operations, and it is well defined which operations
>> are ahead or behind of the barrier (either because they were
>> queued by the same thread, or they were queued by different
>> threads which may have synchronized together some other way).
>
> When you have hundreds or thousands of threads doing IO to the one
> file, it doesn't matter if the IO is issued synchronously or
> asynchronously by the threads - you simply have a huge amount of
> data IO concurrency and very, very deep pipelines.
>
> Inserting a metadata modification (truncate, preallocation,
> holepunch, etc) into that pipeline currently causes all new
> submissions to queue behind the metadata modification, waits for
> everything submitted before the metadata modification to complete
> and then runs the metadata modification. Once it completes, it then
> allows everything queued up to the next metadata modification to
> run....
>
> That matches your definition of a barrier pretty well, I think.

The difference I see is that with async operations one can easily tell
when the desired operations made it into the queue:

t1()
async_submit(request X)
t2()
..
..
..
(eventually) request X completes

After t2 we know the operation is queued, and thus we can legitimately
expect that new operations submitted after t2 should logically run
after request X.


With sync operations we can't make that observation as the caller
thread is blocked:

t1()
begin system call
..
..
request X gets queued
..
..
request X completes
..
..
system call completes
t2()

Userspace can't pinpoint exactly when the request got onto the queue.

If another request Y gets submitted between t1 and t2, userspace can
never know for sure whether Y was before or after X in the queue, so the
proposed reordering should (my claim) not be observable by userspace,
in the sense that it never produces a result that wasn't already
possible before.




Anyway, I didn't intend for the proposed change to warrant a thread of
this size. I will abandon patch 10 and propose a different (slightly
longer, but without reordering) change to replace patch 11.

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 01/12] rwsem: make the waiter type an enumeration rather than a bitmask
  2013-03-06 23:21 ` [PATCH 01/12] rwsem: make the waiter type an enumeration rather than a bitmask Michel Lespinasse
@ 2013-03-13 21:33   ` Rik van Riel
  0 siblings, 0 replies; 30+ messages in thread
From: Rik van Riel @ 2013-03-13 21:33 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Andrew Morton, linux-kernel

On 03/06/2013 06:21 PM, Michel Lespinasse wrote:
> We are not planning to add some new waiter flags, so we can convert the
> waiter type into an enumeration.
>
> Background: David Howells suggested I do this back when I tried adding
> a new waiter type for unfair readers. However, I believe the cleanup
> applies regardless of that use case.

Given the discussion on patch 10/12 and 11/12, could we benefit
from the fair vs. unfair distinction?

> Signed-off-by: Michel Lespinasse <walken@google.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-13  3:23               ` Dave Chinner
  2013-03-13 11:03                 ` Michel Lespinasse
@ 2013-03-14  2:00                 ` Peter Hurley
  2013-03-19  1:17                   ` Dave Chinner
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Hurley @ 2013-03-14  2:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Michel Lespinasse, Alex Shi, Ingo Molnar, David Howells,
	Peter Zijlstra, Thomas Gleixner, Yuanhan Liu, Rik van Riel,
	Andrew Morton, linux-kernel

On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote:
> On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote:
> > Hi Dave,
> > 
> > On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> > >> - since all readers are woken at once, you might see burst of direct
> > >> IO operations followed by bursts of truncate operations, instead of
> > >> having them interleaved in smaller groups as before.
> > >
> > > And this will result in the same application IO pattern giving
> > > vastly different results. e.g. a read IO promoted ahead of a
> > > truncate will now return data instead of -1 (beyond EOF).
> > 
> > I will reply to this part first, as I think this gives a more concrete
> > anchor to the discussion.
> > 
> > The crucial point is that the truncate system call hasn't completed
> > yet - that thread is still queued in the rwsem.
> > 
> > You still have the guarantee that the truncate will complete
> > eventually (even if there is a never-ending stream of IOs), and that
> > all IO system calls that start after the truncate system call
> > completes will see the file as truncated.
> 
> Sure, but the problem is not about when the syscall completes - the
> problem is that you are changing where in the pipeline of IO the
> truncate is initially executed.  i.e. ordering is not defined by
> when an operation completes, but by the order in which the queue is
> processed after a blocking operation completes. That is not when the
> syscall completes, but when the filesystem drops the exclusive lock.
> 
> From a single threaded userspace application perspective or
> multithreaded apps with their own userspace locking, truncates
> complete when either the syscall completes or some time after when
> the application drops its lock. Either way, we're looking at
> completion time serialisation, and in that case what XFS or the
> kernel does simply doesn't matter.
> 
> However, if we are talking about userspace applications that use
> lockless IO algorithms or kernel side applications (like knfsd) that
> are exposed directly to the filesystem locking semantics,
> serialisation behaviour is actually defined by filesystem's
> submission side locking behaviour. There is no external
> serialisation of IO completion, so serialisation and ordering is
> defined (for XFS) solely by the rwsem behaviour.
> 
> > You don't have guarantees about which system call will "appear to run
> > before the other" if the execution times of the two system calls
> > overlap each other, but you actually never had such a guarantee from a
> > correctness perspective (i.e. you could never guarantee which of the
> > two would get queued on the rwsem ahead of the other).
> 
> Sure, but as long as the submission side ordering is deterministic,
> it doesn't matter.
> 
> > > Ok, so I can see where your latency figure comes from, but it's
> > > still a change of ordering in that W2 is no longer a barrier to the
> > > reads queued after it.
> > 
> > My claim is that it's not a visible change from a correctness
> > perspective
> 
> I am not arguing that it is incorrect. I'm arguing that the change
> of ordering semantics breaks assumptions a lot of code makes about
> how these locks work.
> 
> > > similar to this:
> > >
> > >         W1R1W2R2W3R3.....WnRm
> > >
> > > By my reading of the algorithm you are proposing, after W1 is
> > > released, we end up with the queue being treated like this:
> > >
> > >         R1R2R3....RmW2W3....Wn
> > >
> > > Right?
> > 
> > Yes, if what you are representing is the state of the queue at a given
> > point of time (which implies R1..Rm must be distinct threads, not just
> > the same thread doing repeated requests).
> 
> Yup, that's very typical.
> 
> > As new requests come in over time, one important point to remember is
> > that each writer will see at most one batch of readers wake up ahead
> > of it (though this batch may include readers that were originally
> > queued both before and after it).
> 
> And that's *exactly* the problem with the changes you are proposing
> - rwsems will no longer provide strongly ordered write vs read
> barrier semantics.
> 
> > I find the name 'barrier' actually confusing when used to describe
> > synchronous operations.  To me a barrier is usually between
> > asynchronous operations, and it is well defined which operations
> > are ahead or behind of the barrier (either because they were
> > queued by the same thread, or they were queued by different
> > threads which may have synchronized together some other way).
> 
> When you have hundreds or thousands of threads doing IO to the one
> file, it doesn't matter if the IO is issued synchronously or
> asynchronously by the threads - you simply have a huge amount of
> data IO concurrency and very, very deep pipelines.
> 
> Inserting a metadata modification (truncate, preallocation,
> holepunch, etc) into that pipeline currently causes all new
> submissions to queue behind the metadata modification, waits for
> everything submitted before the metadata modification to complete
> and then runs the metadata modification. Once it completes, it then
> allows everything queued up to the next metadata modification to
> run....
> 
> That matches your definition of a barrier pretty well, I think.
> 
> > But in this case, we have two
> > synchronous operations with overlapping execution times - they don't
> > have a way to know which one is 'supposed to' be ahead of the other.
> > The two threads can't observe their relative ordering since they are
> > blocked during the operation...
> 
> We don't care about the ordering between multiple concurrent
> metadata modifications - what matters is whether the ongoing data IO
> around them is ordered correctly.

Dave,

The point that Michel is making is that there never was any ordering
guarantee by rwsem. It's an illusion.

The reason is simple: to even get to the lock the cpu has to be
sleep-able. So every submission that you believe is ordered is, by
its very nature, __not ordered__, even when used by kernel code.

Why? Because any thread on its way to claim the lock (reader or writer)
could be preempted in favor of some other task, delaying the submission of
whatever i/o you believed to be ordered.

So just to reiterate: there is no 'queue' and no 'barrier'. The
guarantees that rwsem makes are:
1. Multiple readers can own the lock.
2. Only a single writer can own the lock.
3. Readers will not starve writers.


> On Tue, 2013-03-12 at 13:36 +1100, Dave Chinner wrote:
> > On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> > - since all readers are woken at once, you might see burst of direct
> > IO operations followed by bursts of truncate operations, instead of
> > having them interleaved in smaller groups as before.
> 
> And this will result in the same application IO pattern giving
> vastly different results. e.g. a read IO promoted ahead of a
> truncate will now return data instead of -1 (beyond EOF).

The 'same application IO pattern' will give you 'vastly different
results' with the __current__ implementation for any given
machine/day/time/run. The reason is that you cannot control which cpu
submits which IO, or which cpu is delayed because it's busy handling a
network interrupt, etc.

Where lock policy can have a significant impact is on performance. But
predicting that impact is difficult -- it's better just to measure.

It's not my intention to convince you (or anyone else) that there should
only be One True Rwsem, because I don't believe that. But I didn't want
the impression to persist that rwsem does anything more than implement a
fair reader/writer semaphore.

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-11 20:36   ` Peter Hurley
@ 2013-03-14  7:03     ` Michel Lespinasse
  2013-03-14 11:39       ` Peter Hurley
  0 siblings, 1 reply; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-14  7:03 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel, Andrew Morton,
	linux-kernel

On Mon, Mar 11, 2013 at 04:36:47PM -0400, Peter Hurley wrote:
> 
> On Wed, 2013-03-06 at 15:21 -0800, Michel Lespinasse wrote:
> > + retry_reader_grants:
> > +	oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> > +	if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> > +		/* A writer stole the lock.  Undo our reader grants. */
> > +		if (rwsem_atomic_update(-adjustment, sem) < RWSEM_WAITING_BIAS)
> > +			goto out;
> > +		/* The writer left.  Retry waking readers. */
> > +		goto retry_reader_grants;
> > +	}
> 
> This can be reduced to a single looping cmpxchg in the grant reversal
> path; then if reversing the grant fails, the count can simply be
> re-tested for grant success, rather than trying to atomically re-grant.
> For example, with a helper function, rwsem_cmpxchg():
> 
> static inline int rwsem_cmpxchg(long *old, long new, struct rw_semaphore *sem)
> {
> 	long tmp = *old;
> 	*old = cmpxchg(&sem->count, *old, new);
> 	return tmp == *old;
> }
> 
> ... then above becomes ...
> 
> 	count = rwsem_atomic_update(adjustment, sem);
> 	do {
> 		if (count - adjustment >= RWSEM_WAITING_BIAS)
> 			break;
> 		if (rwsem_cmpxchg(&count, count - adjustment, sem))
> 			goto out;   /* or simply return sem */
> 	} while (1);
> 
> 	< wake up readers >

This would work, but I don't see a benefit as we still end up with a
retry loop. Also, the loop would have to retry whenever the counter
value changed instead of only when writer(s) appear or are removed.

> Also, this series and the original rwsem can mistakenly sleep reader(s)
> when the lock is transitioned from writer-owned to waiting readers-owned
> with no waiting writers. For example,
> 
> 
>   CPU 0                               |  CPU 1
>                                       |
>                                       | down_write()
> 
> ... CPU 1 has the write lock for the semaphore.
>     Meanwhile, 1 or more down_read(s) are attempted and fail;
>     these are put on the wait list.

I'm not sure of the relevance of these other down_read() calls -
please note that as these extra readers are put on the wait list,
their +read_bias adjustments are canceled so that the count value
ends up at write_bias + waiting_bias (for a total active count of 1)

> Then ...
> 
> down_read()                           | up_write()
>   local = atomic_update(+read_bias)   |
>   local <= 0?                         |   local = atomic_update(-write_bias)
>   if (true)                           |   local < 0?
>      down_read_failed()               |   if (true)
>                                       |      wake()
>                                       |         grab wait_lock
>         wait for wait_lock            |         wake all readers
>                                       |         release wait_lock
> 
> ... At this point, sem->count > 0 and the wait list is empty,
>     but down_read_failed() will sleep the reader.
> 
> In this case, CPU 0 has observed the sem count with the write lock (and
> the other waiters) and so is detoured to down_read_failed(). But if 
> CPU 0 can't grab the wait_lock before the up_write() does (via
> rwsem_wake()), then down_read_failed() will wake no one and sleep the
> reader.

Actually - down_read_failed() will insert the reader on the wait_list
and do an atomic update(-read_bias). If there are no active lockers at
the time of that atomic update, it calls __rwsem_do_wake() to unqueue
the first queued waiter - which may well be itself.

In your specific example, __rwsem_do_wake() will wake the previously queued
readers as well as current.

> Unfortunately, this means readers and writers which observe the sem
> count after the adjustment is committed by CPU 0 in down_read_failed()
> will sleep as well, until the sem count returns to 0.

Note that the active count does go back to 0 in your example.

However, thinking about it made me consider the following case:

CPU 0               CPU 1                                   CPU 2

down_write()

                    down_read()
                      local = atomic_update(+read_bias)

up_write()

                                                            down_read()

                      if (local <= 0)
                        down_read_failed()

At this point CPU 0 doesn't hold the write lock anymore;
CPU 2 grabbed a read lock;
CPU 1 should grab an additional read lock but it won't - instead, it will
queue itself (and get woken by CPU 2 when that read lock is released).

This is indeed slightly suboptimal. Could be fixed in an additional patch.
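
For instance (just a sketch of the idea, not a tested patch): once the
failed reader holds the wait_lock, it could notice that the lock has
meanwhile become reader-owned and bail out while keeping the read grant
it already holds:

	/* Sketch only: at this point our +RWSEM_ACTIVE_READ_BIAS is still
	 * accounted in sem->count.  If nobody is queued and the count is
	 * positive, only readers (including us) hold the lock, so our
	 * read grant has effectively succeeded. */
	raw_spin_lock_irq(&sem->wait_lock);
	if (list_empty(&sem->wait_list) && sem->count > 0) {
		raw_spin_unlock_irq(&sem->wait_lock);
		return sem;	/* keep the read grant, don't sleep */
	}
	/* otherwise queue up and sleep as before */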

As you mentioned, this is not any new issue caused by this patch series
though - it's been there forever as far as I know.

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-14  7:03     ` Michel Lespinasse
@ 2013-03-14 11:39       ` Peter Hurley
  2013-03-14 15:20         ` Michel Lespinasse
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Hurley @ 2013-03-14 11:39 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel, Andrew Morton,
	linux-kernel

On Thu, 2013-03-14 at 00:03 -0700, Michel Lespinasse wrote:
> On Mon, Mar 11, 2013 at 04:36:47PM -0400, Peter Hurley wrote:
> > 
> > On Wed, 2013-03-06 at 15:21 -0800, Michel Lespinasse wrote:
> > > + retry_reader_grants:
> > > +	oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> > > +	if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> > > +		/* A writer stole the lock.  Undo our reader grants. */
> > > +		if (rwsem_atomic_update(-adjustment, sem) < RWSEM_WAITING_BIAS)
> > > +			goto out;
> > > +		/* The writer left.  Retry waking readers. */
> > > +		goto retry_reader_grants;
> > > +	}
> > 
> > This can be reduced to a single looping cmpxchg in the grant reversal
> > path; then if reversing the grant fails, the count can simply be
> > re-tested for grant success, rather than trying to atomically re-grant.
> > For example, with a helper function, rwsem_cmpxchg():
> > 
> > static inline int rwsem_cmpxchg(long *old, long new, struct rw_semaphore *sem)
> > {
> > 	long tmp = *old;
> > 	*old = cmpxchg(&sem->count, *old, new);
> > 	return tmp == *old;
> > }
> > 
> > ... then above becomes ...
> > 
> > 	count = rwsem_atomic_update(adjustment, sem);
> > 	do {
> > 		if (count - adjustment >= RWSEM_WAITING_BIAS)
> > 			break;
> > 		if (rwsem_cmpxchg(&count, count - adjustment, sem))
> > 			goto out;   /* or simply return sem */
> > 	} while (1);
> > 
> > 	< wake up readers >
> 
> This would work, but I don't see a benefit as we still end up with a
> retry loop. Also, the loop would have to retry whenever the counter
> value changed instead of only when writer(s) appear or are removed.

It's a minor optimization in that, while the count is unstable, this CPU
will stay here and while it does so, it's only inflicting one bus lock
rather than two. Feel free to ignore it.

> > Also, this series and the original rwsem can mistakenly sleep reader(s)
> > when the lock is transitioned from writer-owned to waiting readers-owned
> > with no waiting writers. For example,
> > 
> > 
> >   CPU 0                               |  CPU 1
> >                                       |
> >                                       | down_write()
> > 
> > ... CPU 1 has the write lock for the semaphore.
> >     Meanwhile, 1 or more down_read(s) are attempted and fail;
> >     these are put on the wait list.
> 
> I'm not sure of the relevance of these other down_read() calls -
> please note that as these extra readers are put on the wait list,
> their +read_bias adjustments are canceled so that the count value
> ends up at write_bias + waiting_bias (for a total active count of 1)

The relevance of the waiting readers is that when CPU 1 drops the writer
___and grabs the spin lock___, it then wakes up these already-waiting
readers (CPU 0 is still parked in down_read_failed() waiting to acquire
the spin lock).

When CPU 1 wakes up these readers, the sem count goes > 0 and the
waiting list is emptied. CPU 1 then drops the spin lock and leaves
up_write().

CPU 0 has been spinning on the wait_lock during this time. It now
acquires the lock, and discovers the wait list is empty and ups the
adjustment (+wait bias). The count is already (+ # of waiting readers) so
it never does the wake up.
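
In code terms, what CPU 0 ends up doing is roughly this (loosely
following the rwsem_down_read_failed() flow; simplified, not the exact
lib/rwsem.c source):

	long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;

	/* waiter is set up as in rwsem_down_read_failed() */
	raw_spin_lock_irq(&sem->wait_lock);
	if (list_empty(&sem->wait_list))	/* true: CPU 1 emptied the list */
		adjustment += RWSEM_WAITING_BIAS;
	list_add_tail(&waiter.list, &sem->wait_list);

	/* drop our read bias, add the waiting bias */
	count = rwsem_atomic_update(adjustment, sem);
	if (!(count & RWSEM_ACTIVE_MASK)) {
		/* only then would the front of the queue be woken; the
		 * branch is not taken here, because the readers CPU 1
		 * just woke still show up as active in the count */
	}
	raw_spin_unlock_irq(&sem->wait_lock);
	/* ... so this reader sleeps until the active count drops to 0 */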

> > Then ...
> > 
> > down_read()                           | up_write()
> >   local = atomic_update(+read_bias)   |
> >   local <= 0?                         |   local = atomic_update(-write_bias)
> >   if (true)                           |   local < 0?
> >      down_read_failed()               |   if (true)
> >                                       |      wake()
> >                                       |         grab wait_lock
> >         wait for wait_lock            |         wake all readers
> >                                       |         release wait_lock
> > 
> > ... At this point, sem->count > 0 and the wait list is empty,
> >     but down_read_failed() will sleep the reader.
> > 
> > In this case, CPU 0 has observed the sem count with the write lock (and
> > the other waiters) and so is detoured to down_read_failed(). But if 
> > CPU 0 can't grab the wait_lock before the up_write() does (via
> > rwsem_wake()), then down_read_failed() will wake no one and sleep the
> > reader.
> 
> Actually - down_read_failed() will insert the reader on the wait_list
> and do an atomic update(-read_bias). If there are no active lockers at
> the time of that atomic update, it calls __rwsem_do_wake() to unqueue
> the first queued waiter - which may well be itself.
> 
> In your specific example, __rwsem_do_wake() will wake the previously queued
> readers as well as current.

Hopefully my comments above make it clear that there are no queued
readers once CPU 0 can advance with the spin lock in down_read_failed().

> > Unfortunately, this means readers and writers which observe the sem
> > count after the adjustment is committed by CPU 0 in down_read_failed()
> > will sleep as well, until the sem count returns to 0.
> 
> Note that the active count does go back to 0 in your example.

No. The final semaphore value =

   wait bias + # of queued readers that were woken by CPU 1

The reader on CPU 0 is on the wait list.
(If you want I can toss together a simple simulation of what's going on
here and tack it in a reply).

> However, thinking about it made me consider the following case:
> 
> CPU 0               CPU 1                                   CPU 2
> 
> down_write()
> 
>                     down_read()
>                       local = atomic_update(+read_bias)
> 
> up_write()
> 
>                                                             down_read()
> 
>                       if (local <= 0)
>                         down_read_failed()
> 
> At this point CPU 0 doesn't hold the write lock anymore;
> CPU 2 grabbed a read lock;
> CPU 1 should grab an additional read lock but it won't - instead, it will
> queue itself (and get woken by CPU 2 when that read lock is released).
> 
> This is indeed slightly suboptimal. Could be fixed in an additional patch.

Exactly. As with my original example, a failed reader has observed the
count < 0 but by the time it can grab the spin lock the count may
actually be > 0, in which case the reader should early-out and not
sleep.
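
For what it's worth, one possible shape of that early-out in
rwsem_down_read_failed(), right after the count adjustment (untested
sketch written against the names this series uses; RWSEM_WAKE_ANY here
just means "wake whoever is appropriate"):

	/* we're now waiting on the lock, but no longer actively locking */
	count = rwsem_atomic_update(adjustment, sem);

	/*
	 * If there are no active locks, wake the front queued process(es).
	 *
	 * Alternatively, if we backed out our +read_bias yet the count now
	 * shows active readers and no writer (count > RWSEM_WAITING_BIAS),
	 * and we were the first waiter (we added the waiting bias), wake
	 * ourselves up to join the active readers instead of sleeping.
	 */
	if (count == RWSEM_WAITING_BIAS ||
	    (count > RWSEM_WAITING_BIAS &&
	     adjustment != -RWSEM_ACTIVE_READ_BIAS))
		sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);

	raw_spin_unlock_irq(&sem->wait_lock);

In the second branch we are at the front of the queue (the list was
empty when we added ourselves), so waking the front of the queue wakes
us.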

> As you mentioned, this is not any new issue caused by this patch series
> though - it's been there forever as far as I know.

Yeah, just to be clear: these issues are not caused by this series.

I brought them to your attention because I didn't see the problem until
you split the failed paths of reader and writer. Rather than fix the old
code separately, I thought it would be easier for maintainers if the fix
went in as part of this series rather than as a dependent patch
following it.

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-14 11:39       ` Peter Hurley
@ 2013-03-14 15:20         ` Michel Lespinasse
  0 siblings, 0 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-14 15:20 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Alex Shi, Ingo Molnar, David Howells, Peter Zijlstra,
	Thomas Gleixner, Yuanhan Liu, Rik van Riel, Andrew Morton, LKML

On Thu, Mar 14, 2013 at 4:39 AM, Peter Hurley <peter@hurleysoftware.com> wrote:
> On Thu, 2013-03-14 at 00:03 -0700, Michel Lespinasse wrote:
>> >   CPU 0                               |  CPU 1
>> >                                       |
>> >                                       | down_write()
>> >
>> > ... CPU 1 has the write lock for the semaphore.
>> >     Meanwhile, 1 or more down_read(s) are attempted and fail;
>> >     these are put on the wait list.
>>
>> I'm not sure of the relevance of these other down_read() calls -
>> please note that as these extra readers are put on the wait list,
>> their +read_bias adjustments are canceled so that the count value
>> ends up at write_bias + waiting_bias (for a total active count of 1)
>
> The relevance of the waiting readers is that when CPU 1 drops the writer
> ___and grabs the spin lock___, it then wakes up these already-waiting
> readers (CPU 0 is still parked in down_read_failed() waiting to acquire
> the spin lock).
>
> When CPU 1 wakes up these readers, the sem count goes > 0 and the
> waiting list is emptied. CPU 1 then drops the spin lock and leaves
> up_write().

Ah, correct. So the race you noticed is real, and the one I found is
just a more complicated way to hit it.

I'll add a fix in v2 of this patch series (probably later today).

BTW, you seem to have been paying pretty close attention; would you
mind giving appropriate Reviewed-by tags when I send v2?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-14  2:00                 ` Peter Hurley
@ 2013-03-19  1:17                   ` Dave Chinner
  2013-03-19 23:48                     ` Michel Lespinasse
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2013-03-19  1:17 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Michel Lespinasse, Alex Shi, Ingo Molnar, David Howells,
	Peter Zijlstra, Thomas Gleixner, Yuanhan Liu, Rik van Riel,
	Andrew Morton, linux-kernel

On Wed, Mar 13, 2013 at 10:00:51PM -0400, Peter Hurley wrote:
> On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote:
> > We don't care about the ordering between multiple concurrent
> > metadata modifications - what matters is whether the ongoing data IO
> > around them is ordered correctly.
> 
> Dave,
> 
> The point that Michel is making is that there never was any ordering
> guarantee by rwsem. It's an illusion.

Weasel words.

> The reason is simple: to even get to the lock the cpu has to be
> sleep-able. So every submission that you believe is ordered is, by
> its very nature, __not ordered__, even when used by kernel code.
>
> Why? Because any thread on its way to claim the lock (reader or writer)
> could be pre-empted for some other task, thus delaying the submission of
> whatever i/o you believed to be ordered.

You think I don't know this?  You're arguing fine grained, low level
behaviour between tasks is unpredictable. I get that. I understand
that. But I'm not arguing about fine-grained, low level, microsecond
semantics of the locking order....

What you (and Michel) appear to be failing to see is what happens
on a macro level when you have read locks being held for periods
measured in *seconds* (e.g. direct IO gets queued behind a few
thousand other IOs in the elevator waiting for a request slot),
and the subsequent effect of inserting an operation that requires a
write lock into that IO stream.

IOWs, it simply doesn't matter if there's a micro-level race between
the write lock and a couple of the readers. That's the level you
guys are arguing at but it simply does not matter in the cases I'm
describing. I'm talking about high level serialisation behaviours
that might take *seconds* to play out and the ordering behaviours
observed at that scale.

That is, I don't care if a couple of threads out of a few thousand
race with the write lock over a few tens to hundreds of microseconds,
but I most definitely care if a few thousand IOs issued seconds
after the write lock is queued jump over the write lock. That is a
gross behavioural change at the macro-level.....

> So just to reiterate: there is no 'queue' and no 'barrier'. The
> guarantees that rwsem makes are:
> 1. Multiple readers can own the lock.
> 2. Only a single writer can own the lock.
> 3. Readers will not starve writers.

You've conveniently ignored the fact that the current implementation
> also provides the following guarantee:

4. new readers will block behind existing writers

And that's the behaviour we currently depend on, whether you like it
or not.

> Where lock policy can have a significant impact is on performance. But
> predicting that impact is difficult -- it's better just to measure.

Predicting the impact in this case is trivial - it's obvious that the
ordering of operations will change and break high-level assumptions
that userspace currently makes about various IO operations on XFS
filesystems.

> It's not my intention to convince you (or anyone else) that there should
> only be One True Rwsem, because I don't believe that. But I didn't want
> the impression to persist that rwsem does anything more than implement a
> fair reader/writer semaphore.

I'm sorry, but redefining "fair" to suit your own needs doesn't
convince me of anything. rwsem behaviour has been unchanged for at
least 10 years and hence the current implementation defines what is
"fair", not what you say is fair....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
  2013-03-19  1:17                   ` Dave Chinner
@ 2013-03-19 23:48                     ` Michel Lespinasse
  0 siblings, 0 replies; 30+ messages in thread
From: Michel Lespinasse @ 2013-03-19 23:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Peter Hurley, Alex Shi, Ingo Molnar, David Howells,
	Peter Zijlstra, Thomas Gleixner, Yuanhan Liu, Rik van Riel,
	Andrew Morton, linux-kernel

On Mon, Mar 18, 2013 at 6:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Mar 13, 2013 at 10:00:51PM -0400, Peter Hurley wrote:
>> On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote:
>> > We don't care about the ordering between multiple concurrent
>> > metadata modifications - what matters is whether the ongoing data IO
>> > around them is ordered correctly.
>>
>> Dave,
>>
>> The point that Michel is making is that there never was any ordering
>> guarantee by rwsem. It's an illusion.
>
> Weasel words.

Whoaaa, calm down.

You initially made one false statement (that the change meant a stream
of readers would starve a writer forever) and one imprecise statement
(that rwsem used to guarantee that readers don't skip ahead of writers
- this may be true in practice for your use case because the latencies
involved are very large compared to scheduling latencies, but that's a
very important qualification that needs to be added here). That
confused me enough that I initially couldn't tell what your actual
concern was, so I pointed out the source of my confusion and asked you
to clarify. It seems unfair to characterize that as "weasel words" -
I'm not trying to be a smartass here, but only to actually understand
your concern.

>> The reason is simple: to even get to the lock the cpu has to be
>> sleep-able. So every submission that you believe is ordered is, by
>> its very nature, __not ordered__, even when used by kernel code.
>>
>> Why? Because any thread on its way to claim the lock (reader or writer)
>> could be pre-empted for some other task, thus delaying the submission of
>> whatever i/o you believed to be ordered.
>
> You think I don't know this?  You're arguing fine grained, low level
> behaviour between tasks is unpredictable. I get that. I understand
> that. But I'm not arguing about fine-grained, low level, microsecond
> semantics of the locking order....
>
> What you (and Michel) appear to be failing to see is what happens
> on a macro level when you have read locks being held for periods
> measured in *seconds* (e.g. direct IO gets queued behind a few
> thousand other IOs in the elevator waiting for a request slot),
> and the subsequent effect of inserting an operation that requires a
> write lock into that IO stream.
>
> IOWs, it simply doesn't matter if there's a micro-level race between
> the write lock and a couple of the readers. That's the level you
> guys are arguing at but it simply does not matter in the cases I'm
> describing. I'm talking about high level serialisation behaviours
> that might take *seconds* to play out and the ordering behaviours
> observed at that scale.
>
> That is, I don't care if a couple of threads out of a few thousand
> race with the write lock over a few tens to hundreds of microseconds,
> but I most definitely care if a few thousand IOs issued seconds
> after the write lock is queued jump over the write lock. That is a
> gross behavioural change at the macro-level.....

Understood. I accepted your concern and made sure my v2 proposal
doesn't do such macro level reordering.

>> So just to reiterate: there is no 'queue' and no 'barrier'. The
>> guarantees that rwsem makes are:
>> 1. Multiple readers can own the lock.
>> 2. Only a single writer can own the lock.
>> 3. Readers will not starve writers.
>
> You've conveniently ignored the fact that the current implementation
> also provides the following guarantee:
>
> 4. new readers will block behind existing writers

In your use case, with large enough queue latencies, yes.

Please don't make it sound like this applies in every use case - it
has never applied for short (<ms) queue latencies, and you might
confuse people by making such unqualified statements.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2013-03-19 23:48 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
2013-03-06 23:21 ` [PATCH 01/12] rwsem: make the waiter type an enumeration rather than a bitmask Michel Lespinasse
2013-03-13 21:33   ` Rik van Riel
2013-03-06 23:21 ` [PATCH 02/12] rwsem: shorter spinlocked section in rwsem_down_failed_common() Michel Lespinasse
2013-03-06 23:21 ` [PATCH 03/12] rwsem: move rwsem_down_failed_common code into rwsem_down_{read,write}_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 04/12] rwsem: simplify rwsem_down_read_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 05/12] rwsem: simplify rwsem_down_write_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 06/12] rwsem: more agressive lock stealing in rwsem_down_write_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 07/12] rwsem: use cmpxchg for trying to steal write lock Michel Lespinasse
2013-03-06 23:21 ` [PATCH 08/12] rwsem: avoid taking wait_lock in rwsem_down_write_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 09/12] rwsem: skip initial trylock " Michel Lespinasse
2013-03-06 23:21 ` [PATCH 10/12] rwsem-spinlock: wake all readers when first waiter is a reader Michel Lespinasse
2013-03-06 23:21 ` [PATCH 11/12] rwsem: " Michel Lespinasse
2013-03-09  0:32   ` Dave Chinner
2013-03-09  1:20     ` Michel Lespinasse
2013-03-11  0:16       ` Dave Chinner
2013-03-11  5:17         ` Michel Lespinasse
2013-03-12  2:36           ` Dave Chinner
2013-03-12  6:43             ` Michel Lespinasse
2013-03-13  3:23               ` Dave Chinner
2013-03-13 11:03                 ` Michel Lespinasse
2013-03-14  2:00                 ` Peter Hurley
2013-03-19  1:17                   ` Dave Chinner
2013-03-19 23:48                     ` Michel Lespinasse
2013-03-11  7:50         ` Ingo Molnar
2013-03-11 20:36   ` Peter Hurley
2013-03-14  7:03     ` Michel Lespinasse
2013-03-14 11:39       ` Peter Hurley
2013-03-14 15:20         ` Michel Lespinasse
2013-03-06 23:21 ` [PATCH 12/12] x86 rwsem: avoid taking slow path when stealing write lock Michel Lespinasse
