* [PATCH 00/10] V2: rwsem changes + down_read_unfair() proposal
From: Michel Lespinasse @ 2010-05-14 12:39 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

I would like to solicit comments regarding the following changes against
2.6.34-rc7, with commit 91af708 (from the V1 proposal) already applied.

The motivation for this change was some cluster monitoring software we
use at Google, which reads the /proc/<pid>/maps files of all running
processes. When the machines are under load, the mmap_sem is often
acquired for reads for long periods of time, since do_page_fault() holds
it while doing disk accesses; the fair queueing behavior then often leaves
the monitoring software making little progress. By introducing unfair
behavior in a few selected places, we are able to let the monitoring
software make progress without impacting performance for the rest of the
system.

In general, I've made sure to implement this proposal without touching
the rwsem fast paths. Also, the first 8 patches of this series should be
generally applicable even without the down_read_unfair() changes, as they
address minor issues such as reader threads getting blocked at the head
of the waiting list even though the rwsem is currently owned for reads.

Changes since v1:
- Keep the active count check when trying to wake readers in the up_xxxx()
  slow path (I had removed it in v1). However, I did try to lighten the
  check (this is patch 3 of the series).
- Added a privilege check before making use of the unfair behavior for the
  /proc/<pid>/exe and /proc/<pid>/maps files.
- Applied David Howells' many small suggestions (I hope I did not miss any).

Michel Lespinasse (10):
  x86 rwsem: minor cleanups
  rwsem: fully separate code paths to wake writers vs readers
  rwsem: lighter active count checks when waking up readers
  rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads
  rwsem: wake queued readers when writer blocks on active read lock
  rwsem: smaller wrappers around rwsem_down_failed_common
  generic rwsem: implement down_read_unfair
  rwsem: down_read_unfair infrastructure support
  x86 rwsem: down_read_unfair implementation
  Use down_read_unfair() for /proc/<pid>/exe and /proc/<pid>/maps files

 arch/x86/include/asm/rwsem.h   |   70 ++++++++++++-----
 arch/x86/lib/rwsem_64.S        |   14 +++-
 arch/x86/lib/semaphore_32.S    |   21 +++++-
 fs/proc/base.c                 |    2 +-
 fs/proc/task_mmu.c             |    2 +-
 fs/proc/task_nommu.c           |    2 +-
 include/linux/capability.h     |    1 +
 include/linux/rwsem-spinlock.h |   10 ++-
 include/linux/rwsem.h          |   13 +++
 kernel/rwsem.c                 |   31 ++++++++
 lib/rwsem-spinlock.c           |   10 ++-
 lib/rwsem.c                    |  160 ++++++++++++++++++++++++++--------------
 12 files changed, 247 insertions(+), 89 deletions(-)



* [PATCH 01/10] x86 rwsem: minor cleanups
From: Michel Lespinasse @ 2010-05-14 12:39 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

The only functional change here is that __up_write won't call
call_rwsem_wake anymore if the new rwsem value is >0. This makes
no real difference since call_rwsem_wake would have noticed the
active count being nonzero and done nothing anyway.

Besides that, I clarified a few comments.

Feel free to drop this patch from the series if you don't like it;
nothing else depends on it.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/include/asm/rwsem.h |   17 ++++++++---------
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index 606ede1..86e0473 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -118,7 +118,7 @@ static inline void __down_read(struct rw_semaphore *sem)
 {
 	asm volatile("# beginning down_read\n\t"
 		     LOCK_PREFIX _ASM_INC "(%1)\n\t"
-		     /* adds 0x00000001, returns the old value */
+		     /* adds 0x00000001 */
 		     "  jns        1f\n"
 		     "  call call_rwsem_down_read_failed\n"
 		     "1:\n\t"
@@ -160,7 +160,7 @@ static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
 	tmp = RWSEM_ACTIVE_WRITE_BIAS;
 	asm volatile("# beginning down_write\n\t"
 		     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
-		     /* subtract 0x0000ffff, returns the old value */
+		     /* adds 0xffff0001, returns the old value */
 		     "  test      %1,%1\n\t"
 		     /* was the count 0 before? */
 		     "  jz        1f\n"
@@ -200,7 +200,7 @@ static inline void __up_read(struct rw_semaphore *sem)
 		     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
 		     /* subtracts 1, returns the old value */
 		     "  jns        1f\n\t"
-		     "  call call_rwsem_wake\n"
+		     "  call call_rwsem_wake\n" /* expects old value in %edx */
 		     "1:\n"
 		     "# ending __up_read\n"
 		     : "+m" (sem->count), "=d" (tmp)
@@ -213,17 +213,16 @@ static inline void __up_read(struct rw_semaphore *sem)
  */
 static inline void __up_write(struct rw_semaphore *sem)
 {
-	rwsem_count_t tmp;
+	rwsem_count_t tmp = -RWSEM_ACTIVE_WRITE_BIAS;
 	asm volatile("# beginning __up_write\n\t"
 		     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
-		     /* tries to transition
-			0xffff0001 -> 0x00000000 */
-		     "  jz       1f\n"
-		     "  call call_rwsem_wake\n"
+		     /* subtracts 0xffff0001, returns the old value */
+		     "  jns        1f\n\t"
+		     "  call call_rwsem_wake\n" /* expects old value in %edx */
 		     "1:\n\t"
 		     "# ending __up_write\n"
 		     : "+m" (sem->count), "=d" (tmp)
-		     : "a" (sem), "1" (-RWSEM_ACTIVE_WRITE_BIAS)
+		     : "a" (sem), "1" (tmp)
 		     : "memory", "cc");
 }
 
-- 
1.7.0.1



* [PATCH 02/10] rwsem: fully separate code paths to wake writers vs readers
From: Michel Lespinasse @ 2010-05-14 12:39 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

This is in preparation for later changes in the series.

In __rwsem_do_wake(), the first queued waiter is checked to determine
whether it is a writer or a reader, and the code paths diverge at this
point. The code that checks and increments the rwsem active count is
duplicated on both sides; the point is that later changes in the series
will be able to modify each side independently.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   54 +++++++++++++++++++++++++++++++-----------------------
 1 files changed, 31 insertions(+), 23 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 8d6a13e..e3e133f 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -41,7 +41,7 @@ struct rwsem_waiter {
  * - if we come here from up_xxxx(), then:
  *   - the 'active part' of count (&0x0000ffff) reached 0 (but may have changed)
  *   - the 'waiting part' of count (&0xffff0000) is -ve (and will still be so)
- *   - there must be someone on the queue
+ * - there must be someone on the queue
  * - the spinlock must be held by the caller
  * - woken process blocks are discarded from the list after having task zeroed
  * - writers are only woken if downgrading is false
@@ -54,27 +54,24 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	struct list_head *next;
 	signed long oldcount, woken, loop;
 
+	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
+	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
+		goto readers_only;
+
 	if (downgrading)
-		goto dont_wake_writers;
+		goto out;
 
-	/* if we came through an up_xxxx() call, we only only wake someone up
-	 * if we can transition the active part of the count from 0 -> 1
+	/* There's a writer at the front of the queue - try to grant it the
+	 * write lock.  However, we only wake this writer if we can transition
+	 * the active part of the count from 0 -> 1
 	 */
  try_again:
 	oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
 						- RWSEM_ACTIVE_BIAS;
 	if (oldcount & RWSEM_ACTIVE_MASK)
+		/* Someone grabbed the sem already */
 		goto undo;
 
-	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
-
-	/* try to grant a single write lock if there's a writer at the front
-	 * of the queue - note we leave the 'active part' of the count
-	 * incremented by 1 and the waiting part incremented by 0x00010000
-	 */
-	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
-		goto readers_only;
-
 	/* We must be careful not to touch 'waiter' after we set ->task = NULL.
 	 * It is an allocated on the waiter's stack and may become invalid at
 	 * any time after that point (due to a wakeup from another source).
@@ -87,18 +84,25 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	put_task_struct(tsk);
 	goto out;
 
-	/* don't want to wake any writers */
- dont_wake_writers:
-	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
-	if (waiter->flags & RWSEM_WAITING_FOR_WRITE)
-		goto out;
+ readers_only:
+	if (!downgrading) {
+
+		/* if we came through an up_xxxx() call, we only only wake
+		 * someone up if we can transition the active part of the
+		 * count from 0 -> 1
+		 */
+ try_again_read:
+		oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
+							- RWSEM_ACTIVE_BIAS;
+		if (oldcount & RWSEM_ACTIVE_MASK)
+			/* Someone grabbed the sem already */
+			goto undo_read;
+	}
 
-	/* grant an infinite number of read locks to the readers at the front
-	 * of the queue
-	 * - note we increment the 'active part' of the count by the number of
-	 *   readers before waking any processes up
+	/* Grant an infinite number of read locks to the readers at the front
+	 * of the queue.  Note we increment the 'active part' of the count by
+	 * the number of readers before waking any processes up.
 	 */
- readers_only:
 	woken = 0;
 	do {
 		woken++;
@@ -141,6 +145,10 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
 		goto out;
 	goto try_again;
+ undo_read:
+	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
+		goto out;
+	goto try_again_read;
 }
 
 /*
-- 
1.7.0.1



* [PATCH 03/10] rwsem: lighter active count checks when waking up readers
From: Michel Lespinasse @ 2010-05-14 12:39 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

In __rwsem_do_wake(), we can skip the active count check unless we come
there from up_xxxx(). Also when checking the active count, it is not
actually necessary to increment it; this allows us to get rid of the
read side undo code and simplify the calculation of the final rwsem count
adjustment once we've counted the reader threads to wake.

The basic observation is the following. When there are waiter threads
on a rwsem and the spinlock is held, other threads can only increment the
active count by trying to grab the rwsem in down_xxxx(). However, down_xxxx()
will notice there are waiter threads and take the down_failed path,
blocking to acquire the spinlock on the way there. Therefore, a thread
observing an active count of zero with waiters queued and the spinlock held
is protected against other threads acquiring the rwsem until it wakes the
last waiter or releases the spinlock.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   60 ++++++++++++++++++++++++++++++++--------------------------
 1 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index e3e133f..93206a3 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -36,6 +36,14 @@ struct rwsem_waiter {
 #define RWSEM_WAITING_FOR_WRITE	0x00000002
 };
 
+/* Wake types for __rwsem_do_wake().  Note that RWSEM_WAKE_NO_ACTIVE and
+ * RWSEM_WAKE_READ_OWNED imply that the spinlock must have been kept held
+ * since the rwsem value was observed.
+ */
+#define RWSEM_WAKE_ANY        0 /* Wake whatever's at head of wait list */
+#define RWSEM_WAKE_NO_ACTIVE  1 /* rwsem was observed with no active thread */
+#define RWSEM_WAKE_READ_OWNED 2 /* rwsem was observed to be read owned */
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then:
@@ -46,8 +54,8 @@ struct rwsem_waiter {
  * - woken process blocks are discarded from the list after having task zeroed
  * - writers are only woken if downgrading is false
  */
-static inline struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
+static struct rw_semaphore *
+__rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 {
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
@@ -58,7 +66,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
 		goto readers_only;
 
-	if (downgrading)
+	if (wake_type == RWSEM_WAKE_READ_OWNED)
 		goto out;
 
 	/* There's a writer at the front of the queue - try to grant it the
@@ -85,19 +93,24 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	goto out;
 
  readers_only:
-	if (!downgrading) {
-
-		/* if we came through an up_xxxx() call, we only only wake
-		 * someone up if we can transition the active part of the
-		 * count from 0 -> 1
-		 */
- try_again_read:
-		oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
-							- RWSEM_ACTIVE_BIAS;
-		if (oldcount & RWSEM_ACTIVE_MASK)
-			/* Someone grabbed the sem already */
-			goto undo_read;
-	}
+	/* If we come here from up_xxxx(), another thread might have reached
+	 * rwsem_down_failed_common() before we acquired the spinlock and
+	 * woken up an active locker.  We prefer to check for this first in
+	 * order to not spend too much time with the spinlock held if we're
+	 * not going to be able to wake up readers in the end.
+	 *
+	 * Note that we do not need to update the rwsem count: any writer
+	 * trying to acquire rwsem will run rwsem_down_write_failed() due
+	 * to the waiting threads, and block trying to acquire the spinlock.
+	 *
+	 * We use a dummy atomic update in order to acquire the cache line
+	 * exclusively since we expect to succeed and run the final rwsem
+	 * count adjustment pretty soon.
+	 */
+	if (wake_type == RWSEM_WAKE_ANY &&
+	    rwsem_atomic_update(0, sem) & RWSEM_ACTIVE_MASK)
+		/* Someone grabbed the sem already */
+		goto out;
 
 	/* Grant an infinite number of read locks to the readers at the front
 	 * of the queue.  Note we increment the 'active part' of the count by
@@ -117,9 +130,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 
 	loop = woken;
 	woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
-	if (!downgrading)
-		/* we'd already done one increment earlier */
-		woken -= RWSEM_ACTIVE_BIAS;
 
 	rwsem_atomic_add(woken, sem);
 
@@ -145,10 +155,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
 		goto out;
 	goto try_again;
- undo_read:
-	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
-		goto out;
-	goto try_again_read;
 }
 
 /*
@@ -170,12 +176,12 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	list_add_tail(&waiter->list, &sem->wait_list);
 
-	/* we're now waiting on the lock, but no longer actively read-locking */
+	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
 
 	/* if there are no active locks, wake the front queued process(es) up */
 	if (!(count & RWSEM_ACTIVE_MASK))
-		sem = __rwsem_do_wake(sem, 0);
+		sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);
 
 	spin_unlock_irq(&sem->wait_lock);
 
@@ -232,7 +238,7 @@ asmregparm struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, 0);
+		sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
 
 	spin_unlock_irqrestore(&sem->wait_lock, flags);
 
@@ -252,7 +258,7 @@ asmregparm struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, 1);
+		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
 
 	spin_unlock_irqrestore(&sem->wait_lock, flags);
 
-- 
1.7.0.1



* [PATCH 04/10] rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads
From: Michel Lespinasse @ 2010-05-14 12:39 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

Previously each waiting thread added a bias of RWSEM_WAITING_BIAS. With this
change, the bias is added only once when the wait list is non-empty.

This has a few nice properties which will be used in following changes:
- when the spinlock is held and the waiter list is known to be non-empty,
  count < RWSEM_WAITING_BIAS  <=>  there is an active writer on that sem
- count == RWSEM_WAITING_BIAS  <=>  there are waiting threads and no
                                     active readers/writers on that sem
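
For illustration, with the 32-bit bias values used at this point in the
series (RWSEM_ACTIVE_BIAS = 1, RWSEM_WAITING_BIAS = -65536), some sample
count values under the new scheme are:

	 0        unlocked, wait list empty
	 3        three active readers, wait list empty
	-65535    one active writer, wait list empty      (RWSEM_ACTIVE_WRITE_BIAS)
	-65534    two active readers, waiters queued      (2 + RWSEM_WAITING_BIAS)
	-65536    waiters queued, no active readers/writers (== RWSEM_WAITING_BIAS)
	-131071   one active writer, waiters queued       (< RWSEM_WAITING_BIAS)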

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   28 +++++++++++++++++-----------
 1 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 93206a3..92a133f 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -60,7 +60,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
 	struct list_head *next;
-	signed long oldcount, woken, loop;
+	signed long oldcount, woken, loop, adjustment;
 
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
 	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
@@ -73,9 +73,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	 * write lock.  However, we only wake this writer if we can transition
 	 * the active part of the count from 0 -> 1
 	 */
+	adjustment = RWSEM_ACTIVE_WRITE_BIAS;
+	if (waiter->list.next == &sem->wait_list)
+		adjustment -= RWSEM_WAITING_BIAS;
+
  try_again:
-	oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
-						- RWSEM_ACTIVE_BIAS;
+	oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
 	if (oldcount & RWSEM_ACTIVE_MASK)
 		/* Someone grabbed the sem already */
 		goto undo;
@@ -128,13 +131,15 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 
 	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
 
-	loop = woken;
-	woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+	adjustment = woken * RWSEM_ACTIVE_READ_BIAS;
+	if (waiter->flags & RWSEM_WAITING_FOR_READ)
+		/* hit end of list above */
+		adjustment -= RWSEM_WAITING_BIAS;
 
-	rwsem_atomic_add(woken, sem);
+	rwsem_atomic_add(adjustment, sem);
 
 	next = sem->wait_list.next;
-	for (; loop > 0; loop--) {
+	for (loop = woken; loop > 0; loop--) {
 		waiter = list_entry(next, struct rwsem_waiter, list);
 		next = waiter->list.next;
 		tsk = waiter->task;
@@ -152,7 +157,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 
 	/* undo the change to count, but check for a transition 1->0 */
  undo:
-	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
+	if (rwsem_atomic_update(-adjustment, sem) & RWSEM_ACTIVE_MASK)
 		goto out;
 	goto try_again;
 }
@@ -174,6 +179,8 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 	waiter->task = tsk;
 	get_task_struct(tsk);
 
+	if (list_empty(&sem->wait_list))
+		adjustment += RWSEM_WAITING_BIAS;
 	list_add_tail(&waiter->list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
@@ -207,8 +214,7 @@ rwsem_down_read_failed(struct rw_semaphore *sem)
 	struct rwsem_waiter waiter;
 
 	waiter.flags = RWSEM_WAITING_FOR_READ;
-	rwsem_down_failed_common(sem, &waiter,
-				RWSEM_WAITING_BIAS - RWSEM_ACTIVE_BIAS);
+	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_READ_BIAS);
 	return sem;
 }
 
@@ -221,7 +227,7 @@ rwsem_down_write_failed(struct rw_semaphore *sem)
 	struct rwsem_waiter waiter;
 
 	waiter.flags = RWSEM_WAITING_FOR_WRITE;
-	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_BIAS);
+	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_WRITE_BIAS);
 
 	return sem;
 }
-- 
1.7.0.1



* [PATCH 05/10] rwsem: wake queued readers when writer blocks on active read lock
From: Michel Lespinasse @ 2010-05-14 12:39 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

This change addresses the following situation:

- Thread A acquires the rwsem for read
- Thread B tries to acquire the rwsem for write, notices there is already
  an active owner for the rwsem.
- Thread C tries to acquire the rwsem for read, notices that thread B already
  tried to acquire it.
- Thread C grabs the spinlock and queues itself on the wait queue.
- Thread B grabs the spinlock and queues itself behind C. At this point A is
  the only remaining active owner on the rwsem.

In this situation thread B could notice that the rwsem is still read owned
(there is no active writer), and wake C to let it proceed in parallel with A
since they both only want the rwsem for read.
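
To make this concrete (using the 32-bit bias values in effect at this point
in the series, RWSEM_ACTIVE_BIAS = 1 and RWSEM_WAITING_BIAS = -65536): at the
end of the above sequence the rwsem count is 1 + RWSEM_WAITING_BIAS = -65535,
i.e. A's active read plus the single waiting bias added when C queued. Since
this is greater than RWSEM_WAITING_BIAS and B's adjustment was
-RWSEM_ACTIVE_WRITE_BIAS, the new check in rwsem_down_failed_common() calls
__rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED) and C gets woken.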

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   21 ++++++++++++++++-----
 1 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 92a133f..6e1200e 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -67,6 +67,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 		goto readers_only;
 
 	if (wake_type == RWSEM_WAKE_READ_OWNED)
+		/* Another active reader was observed, so wakeup is not
+		 * likely to succeed. Save the atomic op.
+		 */
 		goto out;
 
 	/* There's a writer at the front of the queue - try to grant it the
@@ -98,7 +101,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
  readers_only:
 	/* If we come here from up_xxxx(), another thread might have reached
 	 * rwsem_down_failed_common() before we acquired the spinlock and
-	 * woken up an active locker.  We prefer to check for this first in
+	 * woken up an active writer.  We prefer to check for this first in
 	 * order to not spend too much time with the spinlock held if we're
 	 * not going to be able to wake up readers in the end.
 	 *
@@ -111,8 +114,8 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	 * count adjustment pretty soon.
 	 */
 	if (wake_type == RWSEM_WAKE_ANY &&
-	    rwsem_atomic_update(0, sem) & RWSEM_ACTIVE_MASK)
-		/* Someone grabbed the sem already */
+	    rwsem_atomic_update(0, sem) < RWSEM_WAITING_BIAS)
+		/* Someone grabbed the sem for write already */
 		goto out;
 
 	/* Grant an infinite number of read locks to the readers at the front
@@ -186,9 +189,17 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
 
-	/* if there are no active locks, wake the front queued process(es) up */
-	if (!(count & RWSEM_ACTIVE_MASK))
+	/* If there are no active locks, wake the front queued process(es) up.
+	 *
+	 * Alternatively, if we're called from a failed down_write(), there
+	 * were already threads queued before us, and there are no active
+	 * writers, the lock must be read owned; so we try to wake any read
+	 * locks that were queued ahead of us. */
+	if (count == RWSEM_WAITING_BIAS)
 		sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);
+	else if (count > RWSEM_WAITING_BIAS &&
+		 adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
+		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
 
 	spin_unlock_irq(&sem->wait_lock);
 
-- 
1.7.0.1



* [PATCH 06/10] rwsem: smaller wrappers around rwsem_down_failed_common
From: Michel Lespinasse @ 2010-05-14 12:39 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

More code can be pushed from rwsem_down_read_failed() and
rwsem_down_write_failed() into rwsem_down_failed_common().

The following change, which adds down_read_unfair infrastructure support,
also benefits from having the flags available in a register rather than
having to fish them out of the struct rwsem_waiter...

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   25 ++++++++++---------------
 1 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 6e1200e..2dd1a1b 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -170,8 +170,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
  */
 static struct rw_semaphore __sched *
 rwsem_down_failed_common(struct rw_semaphore *sem,
-			struct rwsem_waiter *waiter, signed long adjustment)
+			 unsigned int flags, signed long adjustment)
 {
+	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
 	signed long count;
 
@@ -179,12 +180,13 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	/* set up my own style of waitqueue */
 	spin_lock_irq(&sem->wait_lock);
-	waiter->task = tsk;
+	waiter.task = tsk;
+	waiter.flags = flags;
 	get_task_struct(tsk);
 
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
-	list_add_tail(&waiter->list, &sem->wait_list);
+	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
@@ -205,7 +207,7 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	/* wait to be given the lock */
 	for (;;) {
-		if (!waiter->task)
+		if (!waiter.task)
 			break;
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
@@ -222,11 +224,8 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 asmregparm struct rw_semaphore __sched *
 rwsem_down_read_failed(struct rw_semaphore *sem)
 {
-	struct rwsem_waiter waiter;
-
-	waiter.flags = RWSEM_WAITING_FOR_READ;
-	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_READ_BIAS);
-	return sem;
+	return rwsem_down_failed_common(sem, RWSEM_WAITING_FOR_READ,
+					-RWSEM_ACTIVE_READ_BIAS);
 }
 
 /*
@@ -235,12 +234,8 @@ rwsem_down_read_failed(struct rw_semaphore *sem)
 asmregparm struct rw_semaphore __sched *
 rwsem_down_write_failed(struct rw_semaphore *sem)
 {
-	struct rwsem_waiter waiter;
-
-	waiter.flags = RWSEM_WAITING_FOR_WRITE;
-	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_WRITE_BIAS);
-
-	return sem;
+	return rwsem_down_failed_common(sem, RWSEM_WAITING_FOR_WRITE,
+					-RWSEM_ACTIVE_WRITE_BIAS);
 }
 
 /*
-- 
1.7.0.1



* [PATCH 07/10] generic rwsem: implement down_read_unfair
From: Michel Lespinasse @ 2010-05-14 12:39 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

Add down_read_unfair() API.

This is similar to down_read() with the following changes:
- when the rwsem is read owned with queued writers, down_read_unfair() callers
  are allowed to acquire the rwsem for read without queueing;
- when the rwsem is write owned, down_read_unfair() callers get queued in
  front of threads trying to acquire the rwsem by other means.
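
A minimal usage sketch (mm, vma and nr_vmas are just illustrative locals
here; the release side is unchanged and still uses the regular up_read()):

	down_read_unfair(&mm->mmap_sem);
	/* short read-side section, e.g. walking the vma list */
	for (vma = mm->mmap; vma; vma = vma->vm_next)
		nr_vmas++;
	up_read(&mm->mmap_sem);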

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/rwsem-spinlock.h |   10 +++++++++-
 include/linux/rwsem.h          |   10 ++++++++++
 kernel/rwsem.c                 |   17 +++++++++++++++++
 lib/rwsem-spinlock.c           |   10 +++++++---
 4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/include/linux/rwsem-spinlock.h b/include/linux/rwsem-spinlock.h
index bdfcc25..dc849d9 100644
--- a/include/linux/rwsem-spinlock.h
+++ b/include/linux/rwsem-spinlock.h
@@ -60,7 +60,15 @@ do {								\
 	__init_rwsem((sem), #sem, &__key);			\
 } while (0)
 
-extern void __down_read(struct rw_semaphore *sem);
+#define __HAVE_DOWN_READ_UNFAIR
+
+static inline void __down_read(struct rw_semaphore *sem)
+	{ __down_read_internal(sem, 0); }
+
+static inline void __sched __down_read_unfair(struct rw_semaphore *sem)
+	{ __down_read_internal(sem, 1);	}
+
+extern void __down_read_internal(struct rw_semaphore *sem, int unfair);
 extern int __down_read_trylock(struct rw_semaphore *sem);
 extern void __down_write(struct rw_semaphore *sem);
 extern void __down_write_nested(struct rw_semaphore *sem, int subclass);
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index efd348f..0d3310b 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -28,6 +28,16 @@ struct rw_semaphore;
 extern void down_read(struct rw_semaphore *sem);
 
 /*
+ * lock for reading - skip waiting threads
+ */
+#ifdef __HAVE_DOWN_READ_UNFAIR
+extern void down_read_unfair(struct rw_semaphore *sem);
+#else
+static inline void down_read_unfair(struct rw_semaphore *sem)
+	{ down_read(sem); }
+#endif
+
+/*
  * trylock for reading -- returns 1 if successful, 0 if contention
  */
 extern int down_read_trylock(struct rw_semaphore *sem);
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cae050b..d7b424b 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -26,6 +26,23 @@ void __sched down_read(struct rw_semaphore *sem)
 
 EXPORT_SYMBOL(down_read);
 
+#ifdef __HAVE_DOWN_READ_UNFAIR
+
+/*
+ * lock for reading - skip waiting threads
+ */
+void __sched down_read_unfair(struct rw_semaphore *sem)
+{
+	might_sleep();
+	rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
+
+	LOCK_CONTENDED(sem, __down_read_trylock, __down_read_unfair);
+}
+
+EXPORT_SYMBOL(down_read_unfair);
+
+#endif
+
 /*
  * trylock for reading -- returns 1 if successful, 0 if contention
  */
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
index ffc9fc7..b2fd5fb 100644
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -139,7 +139,7 @@ __rwsem_wake_one_writer(struct rw_semaphore *sem)
 /*
  * get a read lock on the semaphore
  */
-void __sched __down_read(struct rw_semaphore *sem)
+void __sched __down_read_internal(struct rw_semaphore *sem, int unfair)
 {
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk;
@@ -147,7 +147,7 @@ void __sched __down_read(struct rw_semaphore *sem)
 
 	spin_lock_irqsave(&sem->wait_lock, flags);
 
-	if (sem->activity >= 0 && list_empty(&sem->wait_list)) {
+	if (sem->activity >= 0 && (unfair || list_empty(&sem->wait_list))) {
 		/* granted */
 		sem->activity++;
 		spin_unlock_irqrestore(&sem->wait_lock, flags);
@@ -162,7 +162,11 @@ void __sched __down_read(struct rw_semaphore *sem)
 	waiter.flags = RWSEM_WAITING_FOR_READ;
 	get_task_struct(tsk);
 
-	list_add_tail(&waiter.list, &sem->wait_list);
+	if (unfair) {
+		list_add(&waiter.list, &sem->wait_list);
+	} else {
+		list_add_tail(&waiter.list, &sem->wait_list);
+	}
 
 	/* we don't need to touch the semaphore struct anymore */
 	spin_unlock_irqrestore(&sem->wait_lock, flags);
-- 
1.7.0.1



* [PATCH 08/10] rwsem: down_read_unfair infrastructure support
From: Michel Lespinasse @ 2010-05-14 12:39 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

Add a rwsem_down_read_unfair_failed() function to the non-generic rwsem
library code, similar to rwsem_down_read_failed() except that blocked
threads are placed at the head of the wait queue.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   22 +++++++++++++++++++++-
 1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 2dd1a1b..f0589b2 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -34,6 +34,7 @@ struct rwsem_waiter {
 	unsigned int flags;
 #define RWSEM_WAITING_FOR_READ	0x00000001
 #define RWSEM_WAITING_FOR_WRITE	0x00000002
+#define RWSEM_UNFAIR            0x00000004
 };
 
 /* Wake types for __rwsem_do_wake().  Note that RWSEM_WAKE_NO_ACTIVE and
@@ -186,7 +187,11 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
-	list_add_tail(&waiter.list, &sem->wait_list);
+
+	if (flags & RWSEM_UNFAIR)
+		list_add(&waiter.list, &sem->wait_list);
+	else
+		list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
@@ -228,6 +233,21 @@ rwsem_down_read_failed(struct rw_semaphore *sem)
 					-RWSEM_ACTIVE_READ_BIAS);
 }
 
+#ifdef __HAVE_DOWN_READ_UNFAIR
+
+/*
+ * wait for the read lock to be granted - skip waiting threads
+ */
+asmregparm struct rw_semaphore __sched *
+rwsem_down_read_unfair_failed(struct rw_semaphore *sem)
+{
+	return rwsem_down_failed_common(sem,
+					RWSEM_WAITING_FOR_READ | RWSEM_UNFAIR,
+					-RWSEM_ACTIVE_READ_BIAS);
+}
+
+#endif
+
 /*
  * wait for the write lock to be granted
  */
-- 
1.7.0.1



* [PATCH 09/10] x86 rwsem: down_read_unfair implementation
From: Michel Lespinasse @ 2010-05-14 12:39 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

The basic properties that have been true so far still hold:
- RWSEM_ACTIVE_READ_BIAS  & RWSEM_ACTIVE_MASK == 1
- RWSEM_ACTIVE_WRITE_BIAS & RWSEM_ACTIVE_MASK == 1
- RWSEM_WAITING_BIAS      & RWSEM_ACTIVE_MASK == 0
- RWSEM_ACTIVE_WRITE_BIAS < 0 and RWSEM_WAITING_BIAS < 0

A new property is introduced: RWSEM_ACTIVE_WRITE_BIAS is set to be
'more negative' than RWSEM_WAITING_BIAS. This way, even the first writer
brings the rwsem count below RWSEM_WAITING_BIAS, and down_read_unfair()
can use this to determine whether there are any active writers.
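
For example, with the new 32-bit values (RWSEM_WAITING_BIAS = 0xffff8000 =
-32768, RWSEM_ACTIVE_WRITE_BIAS = 0xffff0001 = -65535), a few sample counts:

	 0        unlocked
	-32767    read owned (one reader), writer queued  (1 + RWSEM_WAITING_BIAS)
	-65535    write owned, nothing queued             (RWSEM_ACTIVE_WRITE_BIAS)

In the first two cases the count observed by the xadd in __down_read_unfair()
is >= RWSEM_WAITING_BIAS, so the unfair reader takes the fast path; in the
last case it is < RWSEM_WAITING_BIAS, so it takes the
rwsem_down_read_unfair_failed() slow path and queues at the head of the list.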

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/include/asm/rwsem.h |   53 ++++++++++++++++++++++++++++++++---------
 arch/x86/lib/rwsem_64.S      |   14 +++++++++-
 arch/x86/lib/semaphore_32.S  |   21 +++++++++++++++-
 3 files changed, 72 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index 86e0473..fbb068c 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -16,11 +16,10 @@
  * if there are writers (and maybe) readers waiting (in which case it goes to
  * sleep).
  *
- * The value of WAITING_BIAS supports up to 32766 waiting processes. This can
- * be extended to 65534 by manually checking the whole MSW rather than relying
- * on the S flag.
+ * The WRITE_BIAS value supports up to 32767 processes simultaneously
+ * trying to acquire a write lock.
  *
- * The value of ACTIVE_BIAS supports up to 65535 active processes.
+ * The value of ACTIVE_BIAS supports up to 32767 active processes.
  *
  * This should be totally fair - if anything is waiting, a process that wants a
  * lock will go to the back of the queue. When the currently active lock is
@@ -48,6 +47,8 @@ struct rwsem_waiter;
 extern asmregparm struct rw_semaphore *
  rwsem_down_read_failed(struct rw_semaphore *sem);
 extern asmregparm struct rw_semaphore *
+ rwsem_down_read_unfair_failed(struct rw_semaphore *sem);
+extern asmregparm struct rw_semaphore *
  rwsem_down_write_failed(struct rw_semaphore *sem);
 extern asmregparm struct rw_semaphore *
  rwsem_wake(struct rw_semaphore *);
@@ -62,17 +63,22 @@ extern asmregparm struct rw_semaphore *
  * for 64 bits.
  */
 
+
 #ifdef CONFIG_X86_64
-# define RWSEM_ACTIVE_MASK		0xffffffffL
+# define RWSEM_UNLOCKED_VALUE		0x0000000000000000L
+# define RWSEM_ACTIVE_MASK		0x000000007fffffffL
+# define RWSEM_ACTIVE_READ_BIAS		0x0000000000000001L
+# define RWSEM_ACTIVE_WRITE_BIAS	0xffffffff00000001L
+# define RWSEM_WAITING_BIAS		0xffffffff80000000L
 #else
-# define RWSEM_ACTIVE_MASK		0x0000ffffL
+# define RWSEM_UNLOCKED_VALUE		0x00000000L
+# define RWSEM_ACTIVE_MASK		0x00007fffL
+# define RWSEM_ACTIVE_READ_BIAS		0x00000001L
+# define RWSEM_ACTIVE_WRITE_BIAS	0xffff0001L
+# define RWSEM_WAITING_BIAS		0xffff8000L
 #endif
 
-#define RWSEM_UNLOCKED_VALUE		0x00000000L
-#define RWSEM_ACTIVE_BIAS		0x00000001L
-#define RWSEM_WAITING_BIAS		(-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define __HAVE_DOWN_READ_UNFAIR
 
 typedef signed long rwsem_count_t;
 
@@ -129,6 +135,28 @@ static inline void __down_read(struct rw_semaphore *sem)
 }
 
 /*
+ * lock for reading - skip waiting threads
+ */
+static inline void __down_read_unfair(struct rw_semaphore *sem)
+{
+	rwsem_count_t tmp;
+
+	tmp = RWSEM_ACTIVE_READ_BIAS;
+	asm volatile("# beginning down_read_unfair\n\t"
+		     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
+		     /* adds 0x00000001, returns the old value */
+		     "  cmp       %4,%1\n\t"
+		     /* was the count >= RWSEM_WAITING_BIAS before? */
+		     "  jge       1f\n"
+		     "  call call_rwsem_down_read_unfair_failed\n"
+		     "1:\n"
+		     "# ending down_read_unfair"
+		     : "+m" (sem->count), "=r" (tmp)
+		     : "a" (sem), "1" (tmp), "re" (RWSEM_WAITING_BIAS)
+		     : "memory", "cc");
+}
+
+/*
  * trylock for reading -- returns 1 if successful, 0 if contention
  */
 static inline int __down_read_trylock(struct rw_semaphore *sem)
@@ -242,7 +270,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 		     "1:\n\t"
 		     "# ending __downgrade_write\n"
 		     : "+m" (sem->count)
-		     : "a" (sem), "er" (-RWSEM_WAITING_BIAS)
+		     : "a" (sem),
+		       "er" (RWSEM_ACTIVE_READ_BIAS - RWSEM_ACTIVE_WRITE_BIAS)
 		     : "memory", "cc");
 }
 
diff --git a/arch/x86/lib/rwsem_64.S b/arch/x86/lib/rwsem_64.S
index 41fcf00..e99d187 100644
--- a/arch/x86/lib/rwsem_64.S
+++ b/arch/x86/lib/rwsem_64.S
@@ -51,6 +51,16 @@ ENTRY(call_rwsem_down_read_failed)
 	ret
 	ENDPROC(call_rwsem_down_read_failed)
 
+ENTRY(call_rwsem_down_read_unfair_failed)
+	save_common_regs
+	pushq %rdx
+	movq %rax,%rdi
+	call rwsem_down_read_unfair_failed
+	popq %rdx
+	restore_common_regs
+	ret
+	ENDPROC(call_rwsem_down_read_unfair_failed)
+
 ENTRY(call_rwsem_down_write_failed)
 	save_common_regs
 	movq %rax,%rdi
@@ -60,8 +70,8 @@ ENTRY(call_rwsem_down_write_failed)
 	ENDPROC(call_rwsem_down_write_failed)
 
 ENTRY(call_rwsem_wake)
-	decl %edx	/* do nothing if still outstanding active readers */
-	jnz 1f
+	cmpl $0x80000001, %edx
+	jne 1f	/* do nothing unless there are waiters and no active threads */
 	save_common_regs
 	movq %rax,%rdi
 	call rwsem_wake
diff --git a/arch/x86/lib/semaphore_32.S b/arch/x86/lib/semaphore_32.S
index 648fe47..fea9a53 100644
--- a/arch/x86/lib/semaphore_32.S
+++ b/arch/x86/lib/semaphore_32.S
@@ -89,6 +89,23 @@ ENTRY(call_rwsem_down_read_failed)
 	CFI_ENDPROC
 	ENDPROC(call_rwsem_down_read_failed)
 
+ENTRY(call_rwsem_down_read_unfair_failed)
+	CFI_STARTPROC
+	push %ecx
+	CFI_ADJUST_CFA_OFFSET 4
+	CFI_REL_OFFSET ecx,0
+	push %edx
+	CFI_ADJUST_CFA_OFFSET 4
+	CFI_REL_OFFSET edx,0
+	call rwsem_down_read_unfair_failed
+	pop %edx
+	CFI_ADJUST_CFA_OFFSET -4
+	pop %ecx
+	CFI_ADJUST_CFA_OFFSET -4
+	ret
+	CFI_ENDPROC
+	ENDPROC(call_rwsem_down_read_unfair_failed)
+
 ENTRY(call_rwsem_down_write_failed)
 	CFI_STARTPROC
 	push %ecx
@@ -103,8 +120,8 @@ ENTRY(call_rwsem_down_write_failed)
 
 ENTRY(call_rwsem_wake)
 	CFI_STARTPROC
-	decw %dx    /* do nothing if still outstanding active readers */
-	jnz 1f
+	cmpw $0x8001, %dx
+	jne 1f	/* do nothing unless there are waiters and no active threads */
 	push %ecx
 	CFI_ADJUST_CFA_OFFSET 4
 	CFI_REL_OFFSET ecx,0
-- 
1.7.0.1



* [PATCH 10/10] Use down_read_unfair() for /proc/<pid>/exe and /proc/<pid>/maps files
From: Michel Lespinasse @ 2010-05-14 12:39 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

This helps in the following situation:
- Thread A takes a page fault while reading or writing memory.
  do_page_fault() acquires the mmap_sem for read and blocks on disk
  (either reading the page from file, or hitting swap) for a long time.
- Thread B does an mmap call and blocks trying to acquire the mmap_sem
  for write
- Thread C is a monitoring process trying to read every /proc/pid/maps
  in the system. This requires acquiring the mmap_sem for read. Thread C
  blocks behind B, waiting for A to release the rwsem.  If thread C
  could be allowed to run in parallel with A, it would probably get done
  long before thread A's disk access completes, thus not actually slowing
  down thread B.

The unfair behavior is restricted to processes with the CAP_SYS_NICE
capability in order to avoid possible DoS attacks.

Test results with down_read_unfair_test (10 seconds):

2.6.33.3:
threadA completes ~600 faults
threadB completes ~300 mmap/munmap cycles
threadC completes ~600 /proc/pid/maps reads

2.6.33.3 + down_read_unfair:
threadA completes ~600 faults
threadB completes ~300 mmap/munmap cycles
threadC completes ~160000 /proc/pid/maps reads

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 fs/proc/base.c             |    2 +-
 fs/proc/task_mmu.c         |    2 +-
 fs/proc/task_nommu.c       |    2 +-
 include/linux/capability.h |    1 +
 include/linux/rwsem.h      |    3 +++
 kernel/rwsem.c             |   14 ++++++++++++++
 6 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 8418fcc..9941802 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1367,7 +1367,7 @@ struct file *get_mm_exe_file(struct mm_struct *mm)
 
 	/* We need mmap_sem to protect against races with removal of
 	 * VM_EXECUTABLE vmas */
-	down_read(&mm->mmap_sem);
+	down_read_unfair_if_nice_capable(&mm->mmap_sem);
 	exe_file = mm->exe_file;
 	if (exe_file)
 		get_file(exe_file);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 0705534..47127e7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -123,7 +123,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 	mm = mm_for_maps(priv->task);
 	if (!mm)
 		return NULL;
-	down_read(&mm->mmap_sem);
+	down_read_unfair_if_nice_capable(&mm->mmap_sem);
 
 	tail_vma = get_gate_vma(priv->task);
 	priv->tail_vma = tail_vma;
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 46d4b5d..af87191 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -194,7 +194,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 		priv->task = NULL;
 		return NULL;
 	}
-	down_read(&mm->mmap_sem);
+	down_read_unfair_if_nice_capable(&mm->mmap_sem);
 
 	/* start from the Nth VMA */
 	for (p = rb_first(&mm->mm_rb); p; p = rb_next(p))
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 39e5ff5..de003dc 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -296,6 +296,7 @@ struct cpu_vfs_cap_data {
    processes and setting the scheduling algorithm used by another
    process. */
 /* Allow setting cpu affinity on other processes */
+/* Allow unfair rwsem read acquire with down_read_unfair_if_nice_capable() */
 
 #define CAP_SYS_NICE         23
 
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0d3310b..1322ee5 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -32,9 +32,12 @@ extern void down_read(struct rw_semaphore *sem);
  */
 #ifdef __HAVE_DOWN_READ_UNFAIR
 extern void down_read_unfair(struct rw_semaphore *sem);
+extern void down_read_unfair_if_nice_capable(struct rw_semaphore *sem);
 #else
 static inline void down_read_unfair(struct rw_semaphore *sem)
 	{ down_read(sem); }
+static inline void down_read_unfair_if_nice_capable(struct rw_semaphore *sem)
+	{ down_read(sem); }
 #endif
 
 /*
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index d7b424b..2c51880 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -8,6 +8,7 @@
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <linux/capability.h>
 #include <linux/rwsem.h>
 
 #include <asm/system.h>
@@ -41,6 +42,19 @@ void __sched down_read_unfair(struct rw_semaphore *sem)
 
 EXPORT_SYMBOL(down_read_unfair);
 
+void __sched down_read_unfair_if_nice_capable(struct rw_semaphore *sem)
+{
+	int unfair = capable(CAP_SYS_NICE);
+
+	might_sleep();
+	rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
+
+	LOCK_CONTENDED(sem, __down_read_trylock,
+		       (unfair ? __down_read_unfair : __down_read));
+}
+
+EXPORT_SYMBOL(down_read_unfair_if_nice_capable);
+
 #endif
 
 /*
-- 
1.7.0.1



* Re: [PATCH 00/10] V2: rwsem changes + down_read_unfair() proposal
From: Linus Torvalds @ 2010-05-14 15:13 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: David Howells, Ingo Molnar, Thomas Gleixner, LKML, Andrew Morton,
	Mike Waychison, Suleiman Souhlal, Ying Han



On Fri, 14 May 2010, Michel Lespinasse wrote:
>
> I would like to sollicit comments regarding the following changes
> against 2.6.34-rc7 + 91af708 (from V1 proposal) already applied.
> 
> The motivation for this change was some cluster monitoring software we
> use at google

Quite frankly, I hate the way it reads now.

I think "down_read_unfair()" is a really dangerous model, and the reason I 
say that is we used to have _all_ mutexes work that way, and it was a 
disaster from a unfairness perspective.

HOWEVER.

I do see where you are coming from, and I do think that unfair readers are 
likely to be ok AS LONG AS THEY CANNOT BLOCK.

And I think that _that_ is likely the much more important issue than the 
unfairness. IOW, I suspect that I would personally at least be perfectly 
ok with something like this, with the following fairly trivial changes:

 - Make it actually do a "preempt_disable()" _after_ getting the rwsem, so 
   that we get a warning if something tries to sleep inside the region 
   (see the whole "__might_sleep()" thing).

   This also implies that you need a separate unlock routine to pair with 
   it, that undoes that. So you can't unlock it with a regular "up_read()"

 - rename the thing to be about the fact that you promise that the code 
   that runs under the thing is nonblocking. IOW, rather than talk about 
   "unfair", you talk about "nonpreemptible" or "critical" or something.

   So you'd have something like

	down_read_critical();
	.. atomic region with no allocation, no preemption ..
	up_read_critical();

   rather than talk about "unfairness".

So it would have "spinlock" semantics when held (the taking of the lock 
itself can obviously block - but you couldn't block while _holding_ the 
lock).

In fact, for the generic lib/rwsem-spinlock.c version, it's quite possible 
you should just _hold_ the spinlock over the critical region. That would 
potentially speed up the locking quite a lot.

The reason I think the above would be acceptable is exactly because it 
consciously _limits_ that unfair spinlock to only ever work in cases where 
a certain amount of unfairness would be ok.

IOW, you can't just use the unfair version in random places that you think 
are "more important" and are worthy of unfairness. They have to be places 
where you can guarantee that you release the lock with no delay. And we'd 
disable preemption not just to get the warning, but also to make sure that 
"timely release" really happens.

			Linus


* Re: [PATCH 00/10] V2: rwsem changes + down_read_unfair() proposal
From: Michel Lespinasse @ 2010-05-17 21:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Ingo Molnar, Thomas Gleixner, LKML, Andrew Morton,
	Mike Waychison, Suleiman Souhlal, Ying Han

On Fri, May 14, 2010 at 8:13 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> I think "down_read_unfair()" is a really dangerous model, and the reason I
> say that is we used to have _all_ mutexes work that way, and it was a
> disaster from a unfairness perspective.
>
> HOWEVER.
>
> I do see where you are coming from, and I do think that unfair readers are
> likely to be ok AS LONG AS THEY CANNOT BLOCK.

Thanks for your comments. Seems reasonable to me - and as you pointed
out, not very difficult to implement either.

> And I think that _that_ is likely the much more important issue than the
> unfairness. IOW, I suspect that I would personally at least be perfectly
> ok with something like this, with the following fairly trivial changes:
>
>  - Make it actually do a "preempt_disable()" _after_ getting the rwsem, so
>   that we get a warning if something tries to sleep inside the region
>   (see the whole "__might_sleep()" thing).
>
>   This also implies that you need a separate unlock routine to pair with
>   it, that undoes that. So you can't unlock it with a regular "up_read()"

Done in V3 (will send out shortly).
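
For reference, a minimal sketch of the shape these wrappers could take,
combining the unfair queueing with the preemption rules discussed above
(illustrative only, not the actual V3 code):

	void __sched down_read_critical(struct rw_semaphore *sem)
	{
		down_read_unfair(sem);	/* may sleep while acquiring */
		preempt_disable();	/* no sleeping while the lock is held */
	}

	void up_read_critical(struct rw_semaphore *sem)
	{
		preempt_enable();
		up_read(sem);
	}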

>  - rename the thing to be about the fact that you promise that the code
>   that runs under the thing is nonblocking. IOW, rather than talk about
>   "unfair", you talk about "nonpreemptible" or "critical" or something.
>
>   So you'd have something like
>
>        down_read_critical();
>        .. atomic region with no allocation, no preemption ..
>        up_read_critical();
>
>   rather than talk about "unfairness".

Changed the high level API to work that way. I did not change
__down_read_unfair though, because at that low level it's still about
unfairness - the tie-in with non-preemption is done higher up. (I had
actually started changing it to __down_read_critical, but then I
realized that there was no __up_read_critical to pair it with.)

> So it would have "spinlock" semantics when held (the taking of the lock
> itself can obviously block - but you couldn't block while _holding_ the
> lock).
>
> In fact, for the generic lib/rwsem-spinlock.c version, it's quite possible
> you should just _hold_ the spinlock over the critical region. That would
> potentially speed up the locking quite a lot.

I did not pursue that one - one issue would have been that I would
have to expose the spin_lock_irqsave flags to the down_read_critical
caller. It can be done but I'm not sure it makes sense given that the
non-generic implementation wouldn't use them.

> The reason I think the above would be acceptable is exactly because it
> consciously _limits_ that unfair spinlock to only ever work in cases where
> a certain amount of unfairness would be ok.
>
> IOW, you can't just use the unfair version in random places that you think
> are "more important" and are worthy of unfairness. They have to be places
> where you can guarantee that you release the lock with no delay. And we'd
> disable preemption not just to get the warning, but also to make sure that
> "timely release" really happens.

I like the strategic thinking here :)

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

