* [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal
@ 2010-05-17 22:25 Michel Lespinasse
  2010-05-17 22:25 ` [PATCH 01/10] x86 rwsem: minor cleanups Michel Lespinasse
                   ` (18 more replies)
  0 siblings, 19 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-17 22:25 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

This is version 3 of my rwsem changes. Patches 7 and 10 were modified to
address Linus's comments about the API. Please consider for merging.

Changes since V2:

- Rebased to 2.6.34

- Changed patch 07 to address Linus's comments about the API.
  down_read_critical() and up_read_critical() now work as a pair; threads
  using this API are allowed to skip over blocked threads when acquiring
  the read lock; however, they must release the lock quickly and, in
  particular, are forbidden to block while holding it.

- Changed patch 10 to make use of the down_read_critical()/up_read_critical()
  API when accessing /proc/<pid>/exe and /proc/<pid>/maps files.
  I excluded the smaps and numa_maps files, which can actually block while
  being generated: smaps blocks in smaps_pte_range() doing a cond_resched()
  (which seems legitimate, as it's a potentially heavy operation), and
  numa_maps blocks in show_numa_map() doing a kzalloc of struct numa_maps
  (which should probably be done in do_maps_open() instead).

The motivation for this change was some cluster monitoring software we
use at Google, which reads the /proc/<pid>/maps files for all running
processes. When the machines are under load, the mmap_sem is often
acquired for read for long periods of time, since do_page_fault() holds
it while doing disk accesses; the fair queueing behavior then often
leaves the monitoring software making little progress. By introducing
unfair behavior in a few selected places, we are able to let the
monitoring software make progress without impacting performance for
the rest of the system. I've made sure not to change the rwsem fast
paths in implementing this proposal.
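
As a point of reference, here is how the new pairing is meant to be used,
taken from the get_mm_exe_file() conversion in patch 10 (the work done
between the two calls must be quick and must not block, since preemption
is disabled for that window):

	struct file *get_mm_exe_file(struct mm_struct *mm)
	{
		struct file *exe_file;

		/* We need mmap_sem to protect against races with removal of
		 * VM_EXECUTABLE vmas */
		down_read_critical(&mm->mmap_sem);
		exe_file = mm->exe_file;
		if (exe_file)
			get_file(exe_file);	/* quick, non-blocking */
		up_read_critical(&mm->mmap_sem);
		return exe_file;
	}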

Michel Lespinasse (10):
  x86 rwsem: minor cleanups
  rwsem: fully separate code paths to wake writers vs readers
  rwsem: lighter active count checks when waking up readers
  rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads
  rwsem: wake queued readers when writer blocks on active read lock
  rwsem: smaller wrappers around rwsem_down_failed_common
  generic rwsem: implement down_read_critical() / up_read_critical()
  rwsem: down_read_critical infrastructure support
  x86 rwsem: down_read_critical implementation
  Use down_read_critical() for /proc/<pid>/exe and /proc/<pid>/maps files

 arch/x86/include/asm/rwsem.h   |   70 ++++++++++++-----
 arch/x86/lib/rwsem_64.S        |   14 +++-
 arch/x86/lib/semaphore_32.S    |   21 +++++-
 fs/proc/base.c                 |    4 +-
 fs/proc/task_mmu.c             |   24 ++++--
 include/linux/proc_fs.h        |    1 +
 include/linux/rwsem-spinlock.h |   10 ++-
 include/linux/rwsem.h          |   12 +++
 kernel/rwsem.c                 |   35 +++++++++
 lib/rwsem-spinlock.c           |   10 ++-
 lib/rwsem.c                    |  160 ++++++++++++++++++++++++++--------------
 11 files changed, 266 insertions(+), 95 deletions(-)


* [PATCH 01/10] x86 rwsem: minor cleanups
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
@ 2010-05-17 22:25 ` Michel Lespinasse
  2010-05-17 22:25 ` [PATCH 02/10] rwsem: fully separate code paths to wake writers vs readers Michel Lespinasse
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-17 22:25 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

The only functional change here is that __up_write won't call
call_rwsem_wake anymore if the new rwsem value is >0. This makes
no real difference since call_rwsem_wake would have noticed the
active count being nonzero and done nothing anyway.
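
To make that concrete, here is a hypothetical interleaving using the 32-bit
bias values (the numbers are made up for the example):

	count = 0xffff0001   /* writer holds the lock (ACTIVE_WRITE_BIAS)  */
	count = 0xffff0002   /* a reader runs the down_read fast-path inc,
	                        then heads for the slow path                */
	count = 0x00000001   /* writer runs __up_write: xadd -0xffff0001    */

	Old code: the new value is nonzero, so call_rwsem_wake gets called;
	it notices the active count is still nonzero and returns without
	waking anyone.
	New code: the new value is positive, so the jns branch skips the
	call entirely - same end result, one call saved.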

Besides that, I clarified a few comments.

Feel free to drop this patch from the series if you don't like it;
nothing else depends on it.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/include/asm/rwsem.h |   17 ++++++++---------
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index 606ede1..86e0473 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -118,7 +118,7 @@ static inline void __down_read(struct rw_semaphore *sem)
 {
 	asm volatile("# beginning down_read\n\t"
 		     LOCK_PREFIX _ASM_INC "(%1)\n\t"
-		     /* adds 0x00000001, returns the old value */
+		     /* adds 0x00000001 */
 		     "  jns        1f\n"
 		     "  call call_rwsem_down_read_failed\n"
 		     "1:\n\t"
@@ -160,7 +160,7 @@ static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
 	tmp = RWSEM_ACTIVE_WRITE_BIAS;
 	asm volatile("# beginning down_write\n\t"
 		     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
-		     /* subtract 0x0000ffff, returns the old value */
+		     /* adds 0xffff0001, returns the old value */
 		     "  test      %1,%1\n\t"
 		     /* was the count 0 before? */
 		     "  jz        1f\n"
@@ -200,7 +200,7 @@ static inline void __up_read(struct rw_semaphore *sem)
 		     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
 		     /* subtracts 1, returns the old value */
 		     "  jns        1f\n\t"
-		     "  call call_rwsem_wake\n"
+		     "  call call_rwsem_wake\n" /* expects old value in %edx */
 		     "1:\n"
 		     "# ending __up_read\n"
 		     : "+m" (sem->count), "=d" (tmp)
@@ -213,17 +213,16 @@ static inline void __up_read(struct rw_semaphore *sem)
  */
 static inline void __up_write(struct rw_semaphore *sem)
 {
-	rwsem_count_t tmp;
+	rwsem_count_t tmp = -RWSEM_ACTIVE_WRITE_BIAS;
 	asm volatile("# beginning __up_write\n\t"
 		     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
-		     /* tries to transition
-			0xffff0001 -> 0x00000000 */
-		     "  jz       1f\n"
-		     "  call call_rwsem_wake\n"
+		     /* subtracts 0xffff0001, returns the old value */
+		     "  jns        1f\n\t"
+		     "  call call_rwsem_wake\n" /* expects old value in %edx */
 		     "1:\n\t"
 		     "# ending __up_write\n"
 		     : "+m" (sem->count), "=d" (tmp)
-		     : "a" (sem), "1" (-RWSEM_ACTIVE_WRITE_BIAS)
+		     : "a" (sem), "1" (tmp)
 		     : "memory", "cc");
 }
 
-- 
1.7.0.1



* [PATCH 02/10] rwsem: fully separate code paths to wake writers vs readers
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
  2010-05-17 22:25 ` [PATCH 01/10] x86 rwsem: minor cleanups Michel Lespinasse
@ 2010-05-17 22:25 ` Michel Lespinasse
  2010-05-17 22:25 ` [PATCH 03/10] rwsem: lighter active count checks when waking up readers Michel Lespinasse
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-17 22:25 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

This is in preparation for later changes in the series.

In __rwsem_do_wake(), the first queued waiter is now examined up front in
order to determine whether it's a writer or a reader; the code paths
diverge at that point. The code that checks and increments the rwsem
active count is duplicated on both sides - the point is that later changes
in the series will be able to modify each side independently.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   54 +++++++++++++++++++++++++++++++-----------------------
 1 files changed, 31 insertions(+), 23 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index ceba8e2..874c0c1 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -41,7 +41,7 @@ struct rwsem_waiter {
  * - if we come here from up_xxxx(), then:
  *   - the 'active part' of count (&0x0000ffff) reached 0 (but may have changed)
  *   - the 'waiting part' of count (&0xffff0000) is -ve (and will still be so)
- *   - there must be someone on the queue
+ * - there must be someone on the queue
  * - the spinlock must be held by the caller
  * - woken process blocks are discarded from the list after having task zeroed
  * - writers are only woken if downgrading is false
@@ -54,27 +54,24 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	struct list_head *next;
 	signed long oldcount, woken, loop;
 
+	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
+	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
+		goto readers_only;
+
 	if (downgrading)
-		goto dont_wake_writers;
+		goto out;
 
-	/* if we came through an up_xxxx() call, we only only wake someone up
-	 * if we can transition the active part of the count from 0 -> 1
+	/* There's a writer at the front of the queue - try to grant it the
+	 * write lock.  However, we only wake this writer if we can transition
+	 * the active part of the count from 0 -> 1
 	 */
  try_again:
 	oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
 						- RWSEM_ACTIVE_BIAS;
 	if (oldcount & RWSEM_ACTIVE_MASK)
+		/* Someone grabbed the sem already */
 		goto undo;
 
-	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
-
-	/* try to grant a single write lock if there's a writer at the front
-	 * of the queue - note we leave the 'active part' of the count
-	 * incremented by 1 and the waiting part incremented by 0x00010000
-	 */
-	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
-		goto readers_only;
-
 	/* We must be careful not to touch 'waiter' after we set ->task = NULL.
 	 * It is an allocated on the waiter's stack and may become invalid at
 	 * any time after that point (due to a wakeup from another source).
@@ -87,18 +84,25 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	put_task_struct(tsk);
 	goto out;
 
-	/* don't want to wake any writers */
- dont_wake_writers:
-	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
-	if (waiter->flags & RWSEM_WAITING_FOR_WRITE)
-		goto out;
+ readers_only:
+	if (!downgrading) {
+
+		/* if we came through an up_xxxx() call, we only only wake
+		 * someone up if we can transition the active part of the
+		 * count from 0 -> 1
+		 */
+ try_again_read:
+		oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
+							- RWSEM_ACTIVE_BIAS;
+		if (oldcount & RWSEM_ACTIVE_MASK)
+			/* Someone grabbed the sem already */
+			goto undo_read;
+	}
 
-	/* grant an infinite number of read locks to the readers at the front
-	 * of the queue
-	 * - note we increment the 'active part' of the count by the number of
-	 *   readers before waking any processes up
+	/* Grant an infinite number of read locks to the readers at the front
+	 * of the queue.  Note we increment the 'active part' of the count by
+	 * the number of readers before waking any processes up.
 	 */
- readers_only:
 	woken = 0;
 	do {
 		woken++;
@@ -142,6 +146,10 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
 		goto out;
 	goto try_again;
+ undo_read:
+	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
+		goto out;
+	goto try_again_read;
 }
 
 /*
-- 
1.7.0.1



* [PATCH 03/10] rwsem: lighter active count checks when waking up readers
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
  2010-05-17 22:25 ` [PATCH 01/10] x86 rwsem: minor cleanups Michel Lespinasse
  2010-05-17 22:25 ` [PATCH 02/10] rwsem: fully separate code paths to wake writers vs readers Michel Lespinasse
@ 2010-05-17 22:25 ` Michel Lespinasse
  2010-05-17 22:25 ` [PATCH 04/10] rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads Michel Lespinasse
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-17 22:25 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

In __rwsem_do_wake(), we can skip the active count check unless we come
there from up_xxxx(). Also, when checking the active count, it is not
actually necessary to increment it; this allows us to get rid of the
read side undo code and simplify the calculation of the final rwsem count
adjustment once we've counted the reader threads to wake.

The basic observation is the following. When there are waiter threads
on a rwsem and the spinlock is held, other threads can only increment the
active count by trying to grab the rwsem in down_xxxx(). However, down_xxxx()
will notice there are waiter threads and take the down_failed path,
blocking to acquire the spinlock on the way there. Therefore, a thread
observing an active count of zero with waiters queued and the spinlock held
is protected against other threads acquiring the rwsem until it wakes the
last waiter or releases the spinlock.
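
A sketch of the interleaving this reasoning covers (illustrative only; the
call chain below is the existing slow path):

	/* __rwsem_do_wake(): wait_lock held, wait list non-empty,
	 * active count observed to be 0 */

	other thread:  down_write(sem)
	                 -> fast path fails (count is not 0: waiters queued)
	                 -> rwsem_down_write_failed() /
	                    rwsem_down_failed_common()
	                      -> spin_lock_irq(&sem->wait_lock)  <- blocks here

	The would-be locker queues up behind the spinlock we hold, so it
	cannot complete its acquisition before we have woken the last waiter
	or dropped the lock, which is what makes the 0 -> 1 transition check
	unnecessary on this path.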

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   60 ++++++++++++++++++++++++++++++++--------------------------
 1 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 874c0c1..ab0d306 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -36,6 +36,14 @@ struct rwsem_waiter {
 #define RWSEM_WAITING_FOR_WRITE	0x00000002
 };
 
+/* Wake types for __rwsem_do_wake().  Note that RWSEM_WAKE_NO_ACTIVE and
+ * RWSEM_WAKE_READ_OWNED imply that the spinlock must have been kept held
+ * since the rwsem value was observed.
+ */
+#define RWSEM_WAKE_ANY        0 /* Wake whatever's at head of wait list */
+#define RWSEM_WAKE_NO_ACTIVE  1 /* rwsem was observed with no active thread */
+#define RWSEM_WAKE_READ_OWNED 2 /* rwsem was observed to be read owned */
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then:
@@ -46,8 +54,8 @@ struct rwsem_waiter {
  * - woken process blocks are discarded from the list after having task zeroed
  * - writers are only woken if downgrading is false
  */
-static inline struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
+static struct rw_semaphore *
+__rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 {
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
@@ -58,7 +66,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
 		goto readers_only;
 
-	if (downgrading)
+	if (wake_type == RWSEM_WAKE_READ_OWNED)
 		goto out;
 
 	/* There's a writer at the front of the queue - try to grant it the
@@ -85,19 +93,24 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	goto out;
 
  readers_only:
-	if (!downgrading) {
-
-		/* if we came through an up_xxxx() call, we only only wake
-		 * someone up if we can transition the active part of the
-		 * count from 0 -> 1
-		 */
- try_again_read:
-		oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
-							- RWSEM_ACTIVE_BIAS;
-		if (oldcount & RWSEM_ACTIVE_MASK)
-			/* Someone grabbed the sem already */
-			goto undo_read;
-	}
+	/* If we come here from up_xxxx(), another thread might have reached
+	 * rwsem_down_failed_common() before we acquired the spinlock and
+	 * woken up an active locker.  We prefer to check for this first in
+	 * order to not spend too much time with the spinlock held if we're
+	 * not going to be able to wake up readers in the end.
+	 *
+	 * Note that we do not need to update the rwsem count: any writer
+	 * trying to acquire rwsem will run rwsem_down_write_failed() due
+	 * to the waiting threads, and block trying to acquire the spinlock.
+	 *
+	 * We use a dummy atomic update in order to acquire the cache line
+	 * exclusively since we expect to succeed and run the final rwsem
+	 * count adjustment pretty soon.
+	 */
+	if (wake_type == RWSEM_WAKE_ANY &&
+	    rwsem_atomic_update(0, sem) & RWSEM_ACTIVE_MASK)
+		/* Someone grabbed the sem already */
+		goto out;
 
 	/* Grant an infinite number of read locks to the readers at the front
 	 * of the queue.  Note we increment the 'active part' of the count by
@@ -117,9 +130,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 
 	loop = woken;
 	woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
-	if (!downgrading)
-		/* we'd already done one increment earlier */
-		woken -= RWSEM_ACTIVE_BIAS;
 
 	rwsem_atomic_add(woken, sem);
 
@@ -146,10 +156,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
 		goto out;
 	goto try_again;
- undo_read:
-	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
-		goto out;
-	goto try_again_read;
 }
 
 /*
@@ -171,12 +177,12 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	list_add_tail(&waiter->list, &sem->wait_list);
 
-	/* we're now waiting on the lock, but no longer actively read-locking */
+	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
 
 	/* if there are no active locks, wake the front queued process(es) up */
 	if (!(count & RWSEM_ACTIVE_MASK))
-		sem = __rwsem_do_wake(sem, 0);
+		sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);
 
 	spin_unlock_irq(&sem->wait_lock);
 
@@ -233,7 +239,7 @@ asmregparm struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, 0);
+		sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
 
 	spin_unlock_irqrestore(&sem->wait_lock, flags);
 
@@ -253,7 +259,7 @@ asmregparm struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, 1);
+		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
 
 	spin_unlock_irqrestore(&sem->wait_lock, flags);
 
-- 
1.7.0.1



* [PATCH 04/10] rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (2 preceding siblings ...)
  2010-05-17 22:25 ` [PATCH 03/10] rwsem: lighter active count checks when waking up readers Michel Lespinasse
@ 2010-05-17 22:25 ` Michel Lespinasse
  2010-05-17 22:25 ` [PATCH 05/10] rwsem: wake queued readers when writer blocks on active read lock Michel Lespinasse
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-17 22:25 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

Previously each waiting thread added a bias of RWSEM_WAITING_BIAS. With this
change, the bias is added only once when the wait list is non-empty.

This has a few nice properties which will be used in following changes:
- when the spinlock is held and the waiter list is known to be non-empty,
  count < RWSEM_WAITING_BIAS  <=>  there is an active writer on that sem
- count == RWSEM_WAITING_BIAS  <=>  there are waiting threads and no
                                     active readers/writers on that sem
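
For illustration, using the 32-bit x86 values in place at this point in the
series (RWSEM_WAITING_BIAS = -0x10000, RWSEM_ACTIVE_READ_BIAS = 1,
RWSEM_ACTIVE_WRITE_BIAS = -0xffff; the thread counts are made up):

	waiters queued, one active writer:
	    count = RWSEM_WAITING_BIAS + RWSEM_ACTIVE_WRITE_BIAS
	          = -0x10000 - 0xffff = -0x1ffff   <  RWSEM_WAITING_BIAS
	waiters queued, two active readers:
	    count = RWSEM_WAITING_BIAS + 2 * RWSEM_ACTIVE_READ_BIAS
	          = -0x10000 + 2      = -0xfffe    >  RWSEM_WAITING_BIAS
	waiters queued, nothing active:
	    count = RWSEM_WAITING_BIAS             == RWSEM_WAITING_BIAS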

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   28 +++++++++++++++++-----------
 1 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index ab0d306..b2dde5a 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -60,7 +60,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
 	struct list_head *next;
-	signed long oldcount, woken, loop;
+	signed long oldcount, woken, loop, adjustment;
 
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
 	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
@@ -73,9 +73,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	 * write lock.  However, we only wake this writer if we can transition
 	 * the active part of the count from 0 -> 1
 	 */
+	adjustment = RWSEM_ACTIVE_WRITE_BIAS;
+	if (waiter->list.next == &sem->wait_list)
+		adjustment -= RWSEM_WAITING_BIAS;
+
  try_again:
-	oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
-						- RWSEM_ACTIVE_BIAS;
+	oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
 	if (oldcount & RWSEM_ACTIVE_MASK)
 		/* Someone grabbed the sem already */
 		goto undo;
@@ -128,13 +131,15 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 
 	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
 
-	loop = woken;
-	woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+	adjustment = woken * RWSEM_ACTIVE_READ_BIAS;
+	if (waiter->flags & RWSEM_WAITING_FOR_READ)
+		/* hit end of list above */
+		adjustment -= RWSEM_WAITING_BIAS;
 
-	rwsem_atomic_add(woken, sem);
+	rwsem_atomic_add(adjustment, sem);
 
 	next = sem->wait_list.next;
-	for (; loop > 0; loop--) {
+	for (loop = woken; loop > 0; loop--) {
 		waiter = list_entry(next, struct rwsem_waiter, list);
 		next = waiter->list.next;
 		tsk = waiter->task;
@@ -153,7 +158,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	/* undo the change to the active count, but check for a transition
 	 * 1->0 */
  undo:
-	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
+	if (rwsem_atomic_update(-adjustment, sem) & RWSEM_ACTIVE_MASK)
 		goto out;
 	goto try_again;
 }
@@ -175,6 +180,8 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 	waiter->task = tsk;
 	get_task_struct(tsk);
 
+	if (list_empty(&sem->wait_list))
+		adjustment += RWSEM_WAITING_BIAS;
 	list_add_tail(&waiter->list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
@@ -208,8 +215,7 @@ rwsem_down_read_failed(struct rw_semaphore *sem)
 	struct rwsem_waiter waiter;
 
 	waiter.flags = RWSEM_WAITING_FOR_READ;
-	rwsem_down_failed_common(sem, &waiter,
-				RWSEM_WAITING_BIAS - RWSEM_ACTIVE_BIAS);
+	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_READ_BIAS);
 	return sem;
 }
 
@@ -222,7 +228,7 @@ rwsem_down_write_failed(struct rw_semaphore *sem)
 	struct rwsem_waiter waiter;
 
 	waiter.flags = RWSEM_WAITING_FOR_WRITE;
-	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_BIAS);
+	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_WRITE_BIAS);
 
 	return sem;
 }
-- 
1.7.0.1



* [PATCH 05/10] rwsem: wake queued readers when writer blocks on active read lock
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (3 preceding siblings ...)
  2010-05-17 22:25 ` [PATCH 04/10] rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads Michel Lespinasse
@ 2010-05-17 22:25 ` Michel Lespinasse
  2010-05-17 22:25 ` [PATCH 06/10] rwsem: smaller wrappers around rwsem_down_failed_common Michel Lespinasse
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-17 22:25 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

This change addresses the following situation:

- Thread A acquires the rwsem for read
- Thread B tries to acquire the rwsem for write, notices there is already
  an active owner for the rwsem.
- Thread C tries to acquire the rwsem for read, notices that thread B already
  tried to acquire it.
- Thread C grabs the spinlock and queues itself on the wait queue.
- Thread B grabs the spinlock and queues itself behind C. At this point A is
  the only remaining active owner on the rwsem.

In this situation thread B could notice that the rwsem is only read owned
(thread A being the sole remaining active locker), and decide to wake C to
let it proceed in parallel with A, since they both only want the rwsem for
read.
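
Tracing the count through this scenario with the 32-bit values (an
illustrative walk-through, not part of the patch):

	A down_read  (fast path):        count  = 1
	B down_write (fast path fails):  count += -0xffff       -> -0xfffe
	C down_read  (fast path fails):  count += 1             -> -0xfffd
	C queues     (list was empty):   count += -1 - 0x10000  -> -0x1fffe
	B queues     (behind C):         count += 0xffff        -> -0xffff
	               = RWSEM_WAITING_BIAS + RWSEM_ACTIVE_READ_BIAS

	At this point count > RWSEM_WAITING_BIAS and B arrived through a
	failed down_write (adjustment == -RWSEM_ACTIVE_WRITE_BIAS), so the
	new check in rwsem_down_failed_common() fires and B wakes C, which
	then reads in parallel with A.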

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   21 ++++++++++++++++-----
 1 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index b2dde5a..89cd8b3 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -67,6 +67,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 		goto readers_only;
 
 	if (wake_type == RWSEM_WAKE_READ_OWNED)
+		/* Another active reader was observed, so wakeup is not
+		 * likely to succeed. Save the atomic op.
+		 */
 		goto out;
 
 	/* There's a writer at the front of the queue - try to grant it the
@@ -98,7 +101,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
  readers_only:
 	/* If we come here from up_xxxx(), another thread might have reached
 	 * rwsem_down_failed_common() before we acquired the spinlock and
-	 * woken up an active locker.  We prefer to check for this first in
+	 * woken up an active writer.  We prefer to check for this first in
 	 * order to not spend too much time with the spinlock held if we're
 	 * not going to be able to wake up readers in the end.
 	 *
@@ -111,8 +114,8 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	 * count adjustment pretty soon.
 	 */
 	if (wake_type == RWSEM_WAKE_ANY &&
-	    rwsem_atomic_update(0, sem) & RWSEM_ACTIVE_MASK)
-		/* Someone grabbed the sem already */
+	    rwsem_atomic_update(0, sem) < RWSEM_WAITING_BIAS)
+		/* Someone grabbed the sem for write already */
 		goto out;
 
 	/* Grant an infinite number of read locks to the readers at the front
@@ -187,9 +190,17 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
 
-	/* if there are no active locks, wake the front queued process(es) up */
-	if (!(count & RWSEM_ACTIVE_MASK))
+	/* If there are no active locks, wake the front queued process(es) up.
+	 *
+	 * Alternatively, if we're called from a failed down_write(), there
+	 * were already threads queued before us, and there are no active
+	 * writers, the lock must be read owned; so we try to wake any read
+	 * locks that were queued ahead of us. */
+	if (count == RWSEM_WAITING_BIAS)
 		sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);
+	else if (count > RWSEM_WAITING_BIAS &&
+		 adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
+		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
 
 	spin_unlock_irq(&sem->wait_lock);
 
-- 
1.7.0.1



* [PATCH 06/10] rwsem: smaller wrappers around rwsem_down_failed_common
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (4 preceding siblings ...)
  2010-05-17 22:25 ` [PATCH 05/10] rwsem: wake queued readers when writer blocks on active read lock Michel Lespinasse
@ 2010-05-17 22:25 ` Michel Lespinasse
  2010-05-17 22:25 ` [PATCH 07/10] generic rwsem: implement down_read_critical() / up_read_critical() Michel Lespinasse
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-17 22:25 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

More code can be pushed from rwsem_down_read_failed and
rwsem_down_write_failed into rwsem_down_failed_common.

The following change, which adds the down_read_unfair infrastructure
support, also benefits from having the flags available in a register
rather than having to fish them out of the struct rwsem_waiter...

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   25 ++++++++++---------------
 1 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 89cd8b3..d8764e0 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -171,8 +171,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
  */
 static struct rw_semaphore __sched *
 rwsem_down_failed_common(struct rw_semaphore *sem,
-			struct rwsem_waiter *waiter, signed long adjustment)
+			 unsigned int flags, signed long adjustment)
 {
+	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
 	signed long count;
 
@@ -180,12 +181,13 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	/* set up my own style of waitqueue */
 	spin_lock_irq(&sem->wait_lock);
-	waiter->task = tsk;
+	waiter.task = tsk;
+	waiter.flags = flags;
 	get_task_struct(tsk);
 
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
-	list_add_tail(&waiter->list, &sem->wait_list);
+	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
@@ -206,7 +208,7 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	/* wait to be given the lock */
 	for (;;) {
-		if (!waiter->task)
+		if (!waiter.task)
 			break;
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
@@ -223,11 +225,8 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 asmregparm struct rw_semaphore __sched *
 rwsem_down_read_failed(struct rw_semaphore *sem)
 {
-	struct rwsem_waiter waiter;
-
-	waiter.flags = RWSEM_WAITING_FOR_READ;
-	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_READ_BIAS);
-	return sem;
+	return rwsem_down_failed_common(sem, RWSEM_WAITING_FOR_READ,
+					-RWSEM_ACTIVE_READ_BIAS);
 }
 
 /*
@@ -236,12 +235,8 @@ rwsem_down_read_failed(struct rw_semaphore *sem)
 asmregparm struct rw_semaphore __sched *
 rwsem_down_write_failed(struct rw_semaphore *sem)
 {
-	struct rwsem_waiter waiter;
-
-	waiter.flags = RWSEM_WAITING_FOR_WRITE;
-	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_WRITE_BIAS);
-
-	return sem;
+	return rwsem_down_failed_common(sem, RWSEM_WAITING_FOR_WRITE,
+					-RWSEM_ACTIVE_WRITE_BIAS);
 }
 
 /*
-- 
1.7.0.1



* [PATCH 07/10] generic rwsem: implement down_read_critical() / up_read_critical()
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (5 preceding siblings ...)
  2010-05-17 22:25 ` [PATCH 06/10] rwsem: smaller wrappers around rwsem_down_failed_common Michel Lespinasse
@ 2010-05-17 22:25 ` Michel Lespinasse
  2010-05-17 22:44   ` Linus Torvalds
  2010-05-17 22:25 ` [PATCH 08/10] rwsem: down_read_critical infrastructure support Michel Lespinasse
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-17 22:25 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

Add down_read_critical() / up_read_critical() API.

down_read_critical() is similar to down_read() with the following changes:
- when the rwsem is read owned with queued writers, down_read_unfair() callers
  are allowed to acquire the rwsem for read without queueing;
- when the rwsem is write owned, down_read_unfair() callers get queued in
  front of threads trying to acquire the rwsem by other means.
- caller can't sleep until releasing lock with up_read_critical(),
  and preemption is disabled for that time.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/rwsem-spinlock.h |   10 +++++++++-
 include/linux/rwsem.h          |   12 ++++++++++++
 kernel/rwsem.c                 |   35 +++++++++++++++++++++++++++++++++++
 lib/rwsem-spinlock.c           |   10 +++++++---
 4 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/include/linux/rwsem-spinlock.h b/include/linux/rwsem-spinlock.h
index bdfcc25..67631c3 100644
--- a/include/linux/rwsem-spinlock.h
+++ b/include/linux/rwsem-spinlock.h
@@ -60,7 +60,9 @@ do {								\
 	__init_rwsem((sem), #sem, &__key);			\
 } while (0)
 
-extern void __down_read(struct rw_semaphore *sem);
+#define __HAVE_DOWN_READ_UNFAIR
+
+extern void __down_read_internal(struct rw_semaphore *sem, int unfair);
 extern int __down_read_trylock(struct rw_semaphore *sem);
 extern void __down_write(struct rw_semaphore *sem);
 extern void __down_write_nested(struct rw_semaphore *sem, int subclass);
@@ -70,5 +72,11 @@ extern void __up_write(struct rw_semaphore *sem);
 extern void __downgrade_write(struct rw_semaphore *sem);
 extern int rwsem_is_locked(struct rw_semaphore *sem);
 
+static inline void __down_read(struct rw_semaphore *sem)
+	{ __down_read_internal(sem, 0); }
+
+static inline void __down_read_unfair(struct rw_semaphore *sem)
+	{ __down_read_internal(sem, 1); }
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_RWSEM_SPINLOCK_H */
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index efd348f..76fd8f4 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -28,6 +28,13 @@ struct rw_semaphore;
 extern void down_read(struct rw_semaphore *sem);
 
 /*
+ * lock for reading in critical section
+ *
+ * locker skips waiting threads, but can't block until up_read_critical()
+ */
+extern void down_read_critical(struct rw_semaphore *sem);
+
+/*
  * trylock for reading -- returns 1 if successful, 0 if contention
  */
 extern int down_read_trylock(struct rw_semaphore *sem);
@@ -48,6 +55,11 @@ extern int down_write_trylock(struct rw_semaphore *sem);
 extern void up_read(struct rw_semaphore *sem);
 
 /*
+ * release a read lock acquired with down_read_critical()
+ */
+extern void up_read_critical(struct rw_semaphore *sem);
+
+/*
  * release a write lock
  */
 extern void up_write(struct rw_semaphore *sem);
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cae050b..7c34174 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -13,6 +13,10 @@
 #include <asm/system.h>
 #include <asm/atomic.h>
 
+#ifndef __HAVE_DOWN_READ_UNFAIR
+# define __down_read_unfair(sem) __down_read(sem)
+#endif
+
 /*
  * lock for reading
  */
@@ -27,6 +31,23 @@ void __sched down_read(struct rw_semaphore *sem)
 EXPORT_SYMBOL(down_read);
 
 /*
+ * lock for reading in critical section
+ *
+ * locker skips waiting threads, but can't block until up_read_critical()
+ */
+void __sched down_read_critical(struct rw_semaphore *sem)
+{
+	might_sleep();
+	rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
+
+	LOCK_CONTENDED(sem, __down_read_trylock, __down_read_unfair);
+
+	preempt_disable();
+}
+
+EXPORT_SYMBOL(down_read_critical);
+
+/*
  * trylock for reading -- returns 1 if successful, 0 if contention
  */
 int down_read_trylock(struct rw_semaphore *sem)
@@ -80,6 +101,20 @@ void up_read(struct rw_semaphore *sem)
 EXPORT_SYMBOL(up_read);
 
 /*
+ * release a read lock acquired with down_read_critical()
+ */
+void up_read_critical(struct rw_semaphore *sem)
+{
+	rwsem_release(&sem->dep_map, 1, _RET_IP_);
+
+	__up_read(sem);
+
+	preempt_enable();
+}
+
+EXPORT_SYMBOL(up_read_critical);
+
+/*
  * release a write lock
  */
 void up_write(struct rw_semaphore *sem)
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
index ffc9fc7..b2fd5fb 100644
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -139,7 +139,7 @@ __rwsem_wake_one_writer(struct rw_semaphore *sem)
 /*
  * get a read lock on the semaphore
  */
-void __sched __down_read(struct rw_semaphore *sem)
+void __sched __down_read_internal(struct rw_semaphore *sem, int unfair)
 {
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk;
@@ -147,7 +147,7 @@ void __sched __down_read(struct rw_semaphore *sem)
 
 	spin_lock_irqsave(&sem->wait_lock, flags);
 
-	if (sem->activity >= 0 && list_empty(&sem->wait_list)) {
+	if (sem->activity >= 0 && (unfair || list_empty(&sem->wait_list))) {
 		/* granted */
 		sem->activity++;
 		spin_unlock_irqrestore(&sem->wait_lock, flags);
@@ -162,7 +162,11 @@ void __sched __down_read(struct rw_semaphore *sem)
 	waiter.flags = RWSEM_WAITING_FOR_READ;
 	get_task_struct(tsk);
 
-	list_add_tail(&waiter.list, &sem->wait_list);
+	if (unfair) {
+		list_add(&waiter.list, &sem->wait_list);
+	} else {
+		list_add_tail(&waiter.list, &sem->wait_list);
+	}
 
 	/* we don't need to touch the semaphore struct anymore */
 	spin_unlock_irqrestore(&sem->wait_lock, flags);
-- 
1.7.0.1



* [PATCH 08/10] rwsem: down_read_critical infrastructure support
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (6 preceding siblings ...)
  2010-05-17 22:25 ` [PATCH 07/10] generic rwsem: implement down_read_critical() / up_read_critical() Michel Lespinasse
@ 2010-05-17 22:25 ` Michel Lespinasse
  2010-05-17 22:25 ` [PATCH 09/10] x86 rwsem: down_read_critical implementation Michel Lespinasse
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-17 22:25 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

Add a rwsem_down_read_unfair_failed() function in the non-generic rwsem
library code, similar to rwsem_down_read_failed() except that blocked
threads are placed at the head of the wait queue.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   22 +++++++++++++++++++++-
 1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index d8764e0..b3fe179 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -34,6 +34,7 @@ struct rwsem_waiter {
 	unsigned int flags;
 #define RWSEM_WAITING_FOR_READ	0x00000001
 #define RWSEM_WAITING_FOR_WRITE	0x00000002
+#define RWSEM_UNFAIR            0x00000004
 };
 
 /* Wake types for __rwsem_do_wake().  Note that RWSEM_WAKE_NO_ACTIVE and
@@ -187,7 +188,11 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
-	list_add_tail(&waiter.list, &sem->wait_list);
+
+	if (flags & RWSEM_UNFAIR)
+		list_add(&waiter.list, &sem->wait_list);
+	else
+		list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
@@ -229,6 +234,21 @@ rwsem_down_read_failed(struct rw_semaphore *sem)
 					-RWSEM_ACTIVE_READ_BIAS);
 }
 
+#ifdef __HAVE_DOWN_READ_UNFAIR
+
+/*
+ * wait for the read lock to be granted - skip waiting threads
+ */
+asmregparm struct rw_semaphore __sched *
+rwsem_down_read_unfair_failed(struct rw_semaphore *sem)
+{
+	return rwsem_down_failed_common(sem,
+					RWSEM_WAITING_FOR_READ | RWSEM_UNFAIR,
+					-RWSEM_ACTIVE_READ_BIAS);
+}
+
+#endif
+
 /*
  * wait for the write lock to be granted
  */
-- 
1.7.0.1



* [PATCH 09/10] x86 rwsem: down_read_critical implementation
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (7 preceding siblings ...)
  2010-05-17 22:25 ` [PATCH 08/10] rwsem: down_read_critical infrastructure support Michel Lespinasse
@ 2010-05-17 22:25 ` Michel Lespinasse
  2010-05-17 22:25 ` [PATCH 10/10] Use down_read_critical() for /proc/<pid>/exe and /proc/<pid>/maps files Michel Lespinasse
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-17 22:25 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

The basic properties that have been true so far still hold:
- RWSEM_ACTIVE_READ_BIAS  & RWSEM_ACTIVE_MASK == 1
- RWSEM_ACTIVE_WRITE_BIAS & RWSEM_ACTIVE_MASK == 1
- RWSEM_WAITING_BIAS      & RWSEM_ACTIVE_MASK == 0
- RWSEM_ACTIVE_WRITE_BIAS < 0 and RWSEM_WAITING_BIAS < 0

A new property is introduced: RWSEM_ACTIVE_WRITE_BIAS is set to be
'more negative' than RWSEM_WAITING_BIAS. This way, even the first writer
will set the rwsem count to be < RWSEM_WAITING_BIAS and down_read_critical()
can make use of this to determine if there are any active writers.
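
Concretely, with the new 32-bit values from this patch (a worked example,
not additional code):

	RWSEM_WAITING_BIAS      = 0xffff8000  (-0x8000)
	RWSEM_ACTIVE_WRITE_BIAS = 0xffff0001  (-0xffff)

	first writer, no waiters:    count = -0xffff      <  RWSEM_WAITING_BIAS
	readers only, with waiters:  count = -0x8000 + n  >  RWSEM_WAITING_BIAS

	__down_read_unfair() therefore only needs to compare the pre-increment
	count against RWSEM_WAITING_BIAS: if it was >= RWSEM_WAITING_BIAS there
	is no active writer and the unfair fast path succeeds; otherwise it
	falls back to rwsem_down_read_unfair_failed().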

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/include/asm/rwsem.h |   53 ++++++++++++++++++++++++++++++++---------
 arch/x86/lib/rwsem_64.S      |   14 +++++++++-
 arch/x86/lib/semaphore_32.S  |   21 +++++++++++++++-
 3 files changed, 72 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index 86e0473..fbb068c 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -16,11 +16,10 @@
  * if there are writers (and maybe) readers waiting (in which case it goes to
  * sleep).
  *
- * The value of WAITING_BIAS supports up to 32766 waiting processes. This can
- * be extended to 65534 by manually checking the whole MSW rather than relying
- * on the S flag.
+ * The WRITE_BIAS value supports up to 32767 processes simultaneously
+ * trying to acquire a write lock.
  *
- * The value of ACTIVE_BIAS supports up to 65535 active processes.
+ * The value of ACTIVE_BIAS supports up to 32767 active processes.
  *
  * This should be totally fair - if anything is waiting, a process that wants a
  * lock will go to the back of the queue. When the currently active lock is
@@ -48,6 +47,8 @@ struct rwsem_waiter;
 extern asmregparm struct rw_semaphore *
  rwsem_down_read_failed(struct rw_semaphore *sem);
 extern asmregparm struct rw_semaphore *
+ rwsem_down_read_unfair_failed(struct rw_semaphore *sem);
+extern asmregparm struct rw_semaphore *
  rwsem_down_write_failed(struct rw_semaphore *sem);
 extern asmregparm struct rw_semaphore *
  rwsem_wake(struct rw_semaphore *);
@@ -62,17 +63,22 @@ extern asmregparm struct rw_semaphore *
  * for 64 bits.
  */
 
+
 #ifdef CONFIG_X86_64
-# define RWSEM_ACTIVE_MASK		0xffffffffL
+# define RWSEM_UNLOCKED_VALUE		0x0000000000000000L
+# define RWSEM_ACTIVE_MASK		0x000000007fffffffL
+# define RWSEM_ACTIVE_READ_BIAS		0x0000000000000001L
+# define RWSEM_ACTIVE_WRITE_BIAS	0xffffffff00000001L
+# define RWSEM_WAITING_BIAS		0xffffffff80000000L
 #else
-# define RWSEM_ACTIVE_MASK		0x0000ffffL
+# define RWSEM_UNLOCKED_VALUE		0x00000000L
+# define RWSEM_ACTIVE_MASK		0x00007fffL
+# define RWSEM_ACTIVE_READ_BIAS		0x00000001L
+# define RWSEM_ACTIVE_WRITE_BIAS	0xffff0001L
+# define RWSEM_WAITING_BIAS		0xffff8000L
 #endif
 
-#define RWSEM_UNLOCKED_VALUE		0x00000000L
-#define RWSEM_ACTIVE_BIAS		0x00000001L
-#define RWSEM_WAITING_BIAS		(-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define __HAVE_DOWN_READ_UNFAIR
 
 typedef signed long rwsem_count_t;
 
@@ -129,6 +135,28 @@ static inline void __down_read(struct rw_semaphore *sem)
 }
 
 /*
+ * lock for reading - skip waiting threads
+ */
+static inline void __down_read_unfair(struct rw_semaphore *sem)
+{
+	rwsem_count_t tmp;
+
+	tmp = RWSEM_ACTIVE_READ_BIAS;
+	asm volatile("# beginning down_read_unfair\n\t"
+		     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
+		     /* adds 0x00000001, returns the old value */
+		     "  cmp       %4,%1\n\t"
+		     /* was the count >= RWSEM_WAITING_BIAS before? */
+		     "  jge       1f\n"
+		     "  call call_rwsem_down_read_unfair_failed\n"
+		     "1:\n"
+		     "# ending down_read_unfair"
+		     : "+m" (sem->count), "=r" (tmp)
+		     : "a" (sem), "1" (tmp), "re" (RWSEM_WAITING_BIAS)
+		     : "memory", "cc");
+}
+
+/*
  * trylock for reading -- returns 1 if successful, 0 if contention
  */
 static inline int __down_read_trylock(struct rw_semaphore *sem)
@@ -242,7 +270,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 		     "1:\n\t"
 		     "# ending __downgrade_write\n"
 		     : "+m" (sem->count)
-		     : "a" (sem), "er" (-RWSEM_WAITING_BIAS)
+		     : "a" (sem),
+		       "er" (RWSEM_ACTIVE_READ_BIAS - RWSEM_ACTIVE_WRITE_BIAS)
 		     : "memory", "cc");
 }
 
diff --git a/arch/x86/lib/rwsem_64.S b/arch/x86/lib/rwsem_64.S
index 41fcf00..e99d187 100644
--- a/arch/x86/lib/rwsem_64.S
+++ b/arch/x86/lib/rwsem_64.S
@@ -51,6 +51,16 @@ ENTRY(call_rwsem_down_read_failed)
 	ret
 	ENDPROC(call_rwsem_down_read_failed)
 
+ENTRY(call_rwsem_down_read_unfair_failed)
+	save_common_regs
+	pushq %rdx
+	movq %rax,%rdi
+	call rwsem_down_read_unfair_failed
+	popq %rdx
+	restore_common_regs
+	ret
+	ENDPROC(call_rwsem_down_read_unfair_failed)
+
 ENTRY(call_rwsem_down_write_failed)
 	save_common_regs
 	movq %rax,%rdi
@@ -60,8 +70,8 @@ ENTRY(call_rwsem_down_write_failed)
 	ENDPROC(call_rwsem_down_write_failed)
 
 ENTRY(call_rwsem_wake)
-	decl %edx	/* do nothing if still outstanding active readers */
-	jnz 1f
+	cmpl $0x80000001, %edx
+	jne 1f	/* do nothing unless there are waiters and no active threads */
 	save_common_regs
 	movq %rax,%rdi
 	call rwsem_wake
diff --git a/arch/x86/lib/semaphore_32.S b/arch/x86/lib/semaphore_32.S
index 648fe47..fea9a53 100644
--- a/arch/x86/lib/semaphore_32.S
+++ b/arch/x86/lib/semaphore_32.S
@@ -89,6 +89,23 @@ ENTRY(call_rwsem_down_read_failed)
 	CFI_ENDPROC
 	ENDPROC(call_rwsem_down_read_failed)
 
+ENTRY(call_rwsem_down_read_unfair_failed)
+	CFI_STARTPROC
+	push %ecx
+	CFI_ADJUST_CFA_OFFSET 4
+	CFI_REL_OFFSET ecx,0
+	push %edx
+	CFI_ADJUST_CFA_OFFSET 4
+	CFI_REL_OFFSET edx,0
+	call rwsem_down_read_unfair_failed
+	pop %edx
+	CFI_ADJUST_CFA_OFFSET -4
+	pop %ecx
+	CFI_ADJUST_CFA_OFFSET -4
+	ret
+	CFI_ENDPROC
+	ENDPROC(call_rwsem_down_read_unfair_failed)
+
 ENTRY(call_rwsem_down_write_failed)
 	CFI_STARTPROC
 	push %ecx
@@ -103,8 +120,8 @@ ENTRY(call_rwsem_down_write_failed)
 
 ENTRY(call_rwsem_wake)
 	CFI_STARTPROC
-	decw %dx    /* do nothing if still outstanding active readers */
-	jnz 1f
+	cmpw $0x8001, %dx
+	jne 1f	/* do nothing unless there are waiters and no active threads */
 	push %ecx
 	CFI_ADJUST_CFA_OFFSET 4
 	CFI_REL_OFFSET ecx,0
-- 
1.7.0.1



* [PATCH 10/10] Use down_read_critical() for /proc/<pid>/exe and /proc/<pid>/maps files
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (8 preceding siblings ...)
  2010-05-17 22:25 ` [PATCH 09/10] x86 rwsem: down_read_critical implementation Michel Lespinasse
@ 2010-05-17 22:25 ` Michel Lespinasse
  2010-05-19 11:47 ` [PATCH 01/10] x86 rwsem: minor cleanups David Howells
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-17 22:25 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	Michel Lespinasse

This helps in the following situation:
- Thread A takes a page fault while reading or writing memory.
  do_page_fault() acquires the mmap_sem for read and blocks on disk
  (either reading the page from file, or hitting swap) for a long time.
- Thread B does an mmap call and blocks trying to acquire the mmap_sem
  for write
- Thread C is a monitoring process trying to read every /proc/pid/maps
  in the system. This requires acquiring the mmap_sem for read. Thread C
  blocks behind B, waiting for A to release the rwsem.  If thread C
  could be allowed to run in parallel with A, it would probably get done
  long before thread A's disk access completes, thus not actually slowing
  down thread B.

Test results with down_read_critical_test (10 seconds):

2.6.33.3:
threadA completes ~600 faults
threadB completes ~300 mmap/munmap cycles
threadC completes ~600 /proc/pid/maps reads

2.6.33.3 + down_read_critical:
threadA completes ~600 faults
threadB completes ~300 mmap/munmap cycles
threadC completes ~160000 /proc/pid/maps reads

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 fs/proc/base.c          |    4 ++--
 fs/proc/task_mmu.c      |   24 +++++++++++++++++-------
 include/linux/proc_fs.h |    1 +
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 8418fcc..9201762 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1367,11 +1367,11 @@ struct file *get_mm_exe_file(struct mm_struct *mm)
 
 	/* We need mmap_sem to protect against races with removal of
 	 * VM_EXECUTABLE vmas */
-	down_read(&mm->mmap_sem);
+	down_read_critical(&mm->mmap_sem);
 	exe_file = mm->exe_file;
 	if (exe_file)
 		get_file(exe_file);
-	up_read(&mm->mmap_sem);
+	up_read_critical(&mm->mmap_sem);
 	return exe_file;
 }
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 47f5b14..83f9353 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -89,7 +89,10 @@ static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
 {
 	if (vma && vma != priv->tail_vma) {
 		struct mm_struct *mm = vma->vm_mm;
-		up_read(&mm->mmap_sem);
+		if (priv->critical)
+			up_read_critical(&mm->mmap_sem);
+		else
+			up_read(&mm->mmap_sem);
 		mmput(mm);
 	}
 }
@@ -123,7 +126,10 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 	mm = mm_for_maps(priv->task);
 	if (!mm)
 		return NULL;
-	down_read(&mm->mmap_sem);
+	if (priv->critical)
+		down_read_critical(&mm->mmap_sem);
+	else
+		down_read(&mm->mmap_sem);
 
 	tail_vma = get_gate_vma(priv->task);
 	priv->tail_vma = tail_vma;
@@ -156,7 +162,10 @@ out:
 
 	/* End of vmas has been reached */
 	m->version = (tail_vma != NULL)? 0: -1UL;
-	up_read(&mm->mmap_sem);
+	if (priv->critical)
+		up_read_critical(&mm->mmap_sem);
+	else
+		up_read(&mm->mmap_sem);
 	mmput(mm);
 	return tail_vma;
 }
@@ -185,13 +194,14 @@ static void m_stop(struct seq_file *m, void *v)
 }
 
 static int do_maps_open(struct inode *inode, struct file *file,
-			const struct seq_operations *ops)
+			const struct seq_operations *ops, int critical)
 {
 	struct proc_maps_private *priv;
 	int ret = -ENOMEM;
 	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
 	if (priv) {
 		priv->pid = proc_pid(inode);
+		priv->critical = critical;
 		ret = seq_open(file, ops);
 		if (!ret) {
 			struct seq_file *m = file->private_data;
@@ -282,7 +292,7 @@ static const struct seq_operations proc_pid_maps_op = {
 
 static int maps_open(struct inode *inode, struct file *file)
 {
-	return do_maps_open(inode, file, &proc_pid_maps_op);
+	return do_maps_open(inode, file, &proc_pid_maps_op, 1);
 }
 
 const struct file_operations proc_maps_operations = {
@@ -432,7 +442,7 @@ static const struct seq_operations proc_pid_smaps_op = {
 
 static int smaps_open(struct inode *inode, struct file *file)
 {
-	return do_maps_open(inode, file, &proc_pid_smaps_op);
+	return do_maps_open(inode, file, &proc_pid_smaps_op, 0);
 }
 
 const struct file_operations proc_smaps_operations = {
@@ -808,7 +818,7 @@ static const struct seq_operations proc_pid_numa_maps_op = {
 
 static int numa_maps_open(struct inode *inode, struct file *file)
 {
-	return do_maps_open(inode, file, &proc_pid_numa_maps_op);
+	return do_maps_open(inode, file, &proc_pid_numa_maps_op, 0);
 }
 
 const struct file_operations proc_numa_maps_operations = {
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 379eaed..66785e7 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -291,6 +291,7 @@ struct proc_maps_private {
 	struct task_struct *task;
 #ifdef CONFIG_MMU
 	struct vm_area_struct *tail_vma;
+	int critical;
 #endif
 };
 
-- 
1.7.0.1



* Re: [PATCH 07/10] generic rwsem: implement down_read_critical() / up_read_critical()
  2010-05-17 22:25 ` [PATCH 07/10] generic rwsem: implement down_read_critical() / up_read_critical() Michel Lespinasse
@ 2010-05-17 22:44   ` Linus Torvalds
  2010-05-17 23:13     ` Michel Lespinasse
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2010-05-17 22:44 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: David Howells, Ingo Molnar, Thomas Gleixner, LKML, Andrew Morton,
	Mike Waychison, Suleiman Souhlal, Ying Han



On Mon, 17 May 2010, Michel Lespinasse wrote:
>
> Add down_read_critical() / up_read_critical() API.
> 
> down_read_critical() is similar to down_read() with the following changes:
> - when the rwsem is read owned with queued writers, down_read_unfair() callers
>   are allowed to acquire the rwsem for read without queueing;

You didn't update the comment for the new name here...

> - when the rwsem is write owned, down_read_unfair() callers get queued in
>   front of threads trying to acquire the rwsem by other means.

.. or here. In this case, it really is more about "unfairness", but I'm 
not convinced it should be so in the naming anyway, even if internally it 
might be __down_read_unfair. "critical" I think covers both.

> - caller can't sleep until releasing lock with up_read_critical(),
>   and preemption is disabled for that time.

But you did here (because it's a new thing).

Anyway, the series looks mostly acceptable to me in this form. I think it 
conceptually works out, and I think that the non-preemption guarantee 
should mean that starvation of writers is not likely an issue. However, 
I'd definitely like some second opinions on it. I'm not going to apply 
this series without acks from people. So you should try to convince DavidH 
too that this actually really does matter and makes sense.

		Linus


* Re: [PATCH 07/10] generic rwsem: implement down_read_critical() / up_read_critical()
  2010-05-17 22:44   ` Linus Torvalds
@ 2010-05-17 23:13     ` Michel Lespinasse
  2010-05-17 23:20       ` Michel Lespinasse
  2010-05-19 13:21       ` David Howells
  0 siblings, 2 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-17 23:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Ingo Molnar, Thomas Gleixner, LKML, Andrew Morton,
	Mike Waychison, Suleiman Souhlal, Ying Han

On Mon, May 17, 2010 at 03:44:24PM -0700, Linus Torvalds wrote:
> You didn't update the comment for the new name here...
> 
> > - when the rwsem is write owned, down_read_unfair() callers get queued in
> >   front of threads trying to acquire the rwsem by other means.
> 
> .. or here. In this case, it really is more about "unfairness", but I'm 
> not convinced it should be so in the naming anyway, even if internally it 
> might be __down_read_unfair. "critical" I think covers both.

Gah! Sorry for missing the comment updates. I agree with you on the naming,
I just didn't remember about the comment.

Will send identical patch with the correct comment as reply to this.

> Anyway, the series looks mostly acceptable to me in this form. I think it 
> conceptually works out, and I think that the non-preemption guarantee 
> should mean that starvation of writers is not likely an issue. However, 
> I'd definitely like some second opinions on it. I'm not going to apply 
> this series without acks from people. So you should try to convince DavidH 
> too that this actually really does matter and makes sense.

I'll see what I can do here. Thanks !

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: [PATCH 07/10] generic rwsem: implement down_read_critical() / up_read_critical()
  2010-05-17 23:13     ` Michel Lespinasse
@ 2010-05-17 23:20       ` Michel Lespinasse
  2010-05-19 13:21       ` David Howells
  1 sibling, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-17 23:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Ingo Molnar, Thomas Gleixner, LKML, Andrew Morton,
	Mike Waychison, Suleiman Souhlal, Ying Han

Add down_read_critical() / up_read_critical() API.

down_read_critical() is similar to down_read() with the following changes:
- when the rwsem is read owned with queued writers, down_read_critical()
  callers are allowed to acquire the rwsem for read without queueing;
- when the rwsem is write owned, down_read_critical() callers get queued in
  front of threads trying to acquire the rwsem by other means.
- caller can't sleep until releasing lock with up_read_critical(),
  and preemption is disabled for that time.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/rwsem-spinlock.h |   10 +++++++++-
 include/linux/rwsem.h          |   12 ++++++++++++
 kernel/rwsem.c                 |   35 +++++++++++++++++++++++++++++++++++
 lib/rwsem-spinlock.c           |   10 +++++++---
 4 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/include/linux/rwsem-spinlock.h b/include/linux/rwsem-spinlock.h
index bdfcc25..67631c3 100644
--- a/include/linux/rwsem-spinlock.h
+++ b/include/linux/rwsem-spinlock.h
@@ -60,7 +60,9 @@ do {								\
 	__init_rwsem((sem), #sem, &__key);			\
 } while (0)
 
-extern void __down_read(struct rw_semaphore *sem);
+#define __HAVE_DOWN_READ_UNFAIR
+
+extern void __down_read_internal(struct rw_semaphore *sem, int unfair);
 extern int __down_read_trylock(struct rw_semaphore *sem);
 extern void __down_write(struct rw_semaphore *sem);
 extern void __down_write_nested(struct rw_semaphore *sem, int subclass);
@@ -70,5 +72,11 @@ extern void __up_write(struct rw_semaphore *sem);
 extern void __downgrade_write(struct rw_semaphore *sem);
 extern int rwsem_is_locked(struct rw_semaphore *sem);
 
+static inline void __down_read(struct rw_semaphore *sem)
+	{ __down_read_internal(sem, 0); }
+
+static inline void __down_read_unfair(struct rw_semaphore *sem)
+	{ __down_read_internal(sem, 1); }
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_RWSEM_SPINLOCK_H */
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index efd348f..76fd8f4 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -28,6 +28,13 @@ struct rw_semaphore;
 extern void down_read(struct rw_semaphore *sem);
 
 /*
+ * lock for reading in critical section
+ *
+ * locker skips waiting threads, but can't block until up_read_critical()
+ */
+extern void down_read_critical(struct rw_semaphore *sem);
+
+/*
  * trylock for reading -- returns 1 if successful, 0 if contention
  */
 extern int down_read_trylock(struct rw_semaphore *sem);
@@ -48,6 +55,11 @@ extern int down_write_trylock(struct rw_semaphore *sem);
 extern void up_read(struct rw_semaphore *sem);
 
 /*
+ * release a read lock acquired with down_read_critical()
+ */
+extern void up_read_critical(struct rw_semaphore *sem);
+
+/*
  * release a write lock
  */
 extern void up_write(struct rw_semaphore *sem);
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cae050b..7c34174 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -13,6 +13,10 @@
 #include <asm/system.h>
 #include <asm/atomic.h>
 
+#ifndef __HAVE_DOWN_READ_UNFAIR
+# define __down_read_unfair(sem) __down_read(sem)
+#endif
+
 /*
  * lock for reading
  */
@@ -27,6 +31,23 @@ void __sched down_read(struct rw_semaphore *sem)
 EXPORT_SYMBOL(down_read);
 
 /*
+ * lock for reading in critical section
+ *
+ * locker skips waiting threads, but can't block until up_read_critical()
+ */
+void __sched down_read_critical(struct rw_semaphore *sem)
+{
+	might_sleep();
+	rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
+
+	LOCK_CONTENDED(sem, __down_read_trylock, __down_read_unfair);
+
+	preempt_disable();
+}
+
+EXPORT_SYMBOL(down_read_critical);
+
+/*
  * trylock for reading -- returns 1 if successful, 0 if contention
  */
 int down_read_trylock(struct rw_semaphore *sem)
@@ -80,6 +101,20 @@ void up_read(struct rw_semaphore *sem)
 EXPORT_SYMBOL(up_read);
 
 /*
+ * release a read lock acquired with down_read_critical()
+ */
+void up_read_critical(struct rw_semaphore *sem)
+{
+	rwsem_release(&sem->dep_map, 1, _RET_IP_);
+
+	__up_read(sem);
+
+	preempt_enable();
+}
+
+EXPORT_SYMBOL(up_read_critical);
+
+/*
  * release a write lock
  */
 void up_write(struct rw_semaphore *sem)
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
index ffc9fc7..b2fd5fb 100644
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -139,7 +139,7 @@ __rwsem_wake_one_writer(struct rw_semaphore *sem)
 /*
  * get a read lock on the semaphore
  */
-void __sched __down_read(struct rw_semaphore *sem)
+void __sched __down_read_internal(struct rw_semaphore *sem, int unfair)
 {
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk;
@@ -147,7 +147,7 @@ void __sched __down_read(struct rw_semaphore *sem)
 
 	spin_lock_irqsave(&sem->wait_lock, flags);
 
-	if (sem->activity >= 0 && list_empty(&sem->wait_list)) {
+	if (sem->activity >= 0 && (unfair || list_empty(&sem->wait_list))) {
 		/* granted */
 		sem->activity++;
 		spin_unlock_irqrestore(&sem->wait_lock, flags);
@@ -162,7 +162,11 @@ void __sched __down_read(struct rw_semaphore *sem)
 	waiter.flags = RWSEM_WAITING_FOR_READ;
 	get_task_struct(tsk);
 
-	list_add_tail(&waiter.list, &sem->wait_list);
+	if (unfair) {
+		list_add(&waiter.list, &sem->wait_list);
+	} else {
+		list_add_tail(&waiter.list, &sem->wait_list);
+	}
 
 	/* we don't need to touch the semaphore struct anymore */
 	spin_unlock_irqrestore(&sem->wait_lock, flags);
-- 
1.7.0.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 01/10] x86 rwsem: minor cleanups
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (9 preceding siblings ...)
  2010-05-17 22:25 ` [PATCH 10/10] Use down_read_critical() for /sys/<pid>/exe and /sys/<pid>/maps files Michel Lespinasse
@ 2010-05-19 11:47 ` David Howells
  2010-05-20 21:37   ` Michel Lespinasse
  2010-05-19 12:04 ` [PATCH 02/10] rwsem: fully separate code pathes to wake writers vs readers David Howells
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 35+ messages in thread
From: David Howells @ 2010-05-19 11:47 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: dhowells, Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Michel Lespinasse <walken@google.com> wrote:

> The only functional change here is that __up_write won't call
> call_rwsem_wake anymore if the new rwsem value is >0. This makes
> no real difference since call_rwsem_wake would have noticed the
> active count being nonzero and done nothing anyway.

Given that you describe this first, it would suggest that this should be the
subject of the patch.  I'm not sure I'd count this as a minor cleanup.  I think
I'd split it into its own patch.

> Besides that, I clarified a few comments.

Mostly okay, except where you said "expects old value in %edx" - that's only
true on i386, not x86_64.  On the latter it would be %rdi.  However, I can live
with that: it's true enough.

> -	rwsem_count_t tmp;
> +	rwsem_count_t tmp = -RWSEM_ACTIVE_WRITE_BIAS;
> ...
>  		     : "+m" (sem->count), "=d" (tmp)
> -		     : "a" (sem), "1" (-RWSEM_ACTIVE_WRITE_BIAS)
> +		     : "a" (sem), "1" (tmp)

If you're going to put the initialisation of EDX/RDI on tmp (which isn't really
necessary), rather than directly on the asm statement, you could change the
'"=d" (tmp)' output constraint to be '"+d" (tmp)' and drop the '"1" (tmp)'
constraint entirely.
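
For what it's worth, here is a standalone sketch of the read-write constraint
form being suggested (illustrative only - the function and operands below are
made up for the example and are not the patch's code):

	/* xadd with a single "+d" read-write operand instead of a separate
	 * "=d" output plus a matching "1" input constraint */
	static inline long xadd_sketch(long *counter, long delta)
	{
		long tmp = delta;	/* in: delta; out: old *counter */

		asm volatile("lock; xadd %0, %1"
			     : "+d" (tmp), "+m" (*counter)
			     : /* no separate "1" (tmp) input needed */
			     : "cc");
		return tmp;
	}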

However, apart from that, feel free to add my Acked-by to this patch or its
split resultant patches.

David

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 02/10] rwsem: fully separate code pathes to wake writers vs readers
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (10 preceding siblings ...)
  2010-05-19 11:47 ` [PATCH 01/10] x86 rwsem: minor cleanups David Howells
@ 2010-05-19 12:04 ` David Howells
  2010-05-20 21:48   ` Michel Lespinasse
  2010-05-19 12:25 ` [PATCH 03/10] rwsem: lighter active count checks when waking up readers David Howells
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 35+ messages in thread
From: David Howells @ 2010-05-19 12:04 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: dhowells, Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Michel Lespinasse <walken@google.com> wrote:

> + readers_only:
> +	if (!downgrading) {
> +

There's an unnecessary blank line here.

> +		/* if we came through an up_xxxx() call, we only wake
> +		 * someone up if we can transition the active part of the
> +		 * count from 0 -> 1
> +		 */
> + try_again_read:

I hate code that jumps into braced blocks with goto.  Would it be possible for
you to do:

	readers_only:
		if (downgrading)
			goto wake_readers;
	...
	wake_readers:
		/* Grant an infinite number of read locks to the readers at the front
	...

instead, please?

Also, since the labels 'undo' and 'try_again' are now specific to the writer
path, can you label them 'undo_write' and 'try_again_write' just to make it
obvious?

Other than that, no particular objections to this patch.

David

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 03/10] rwsem: lighter active count checks when waking up readers
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (11 preceding siblings ...)
  2010-05-19 12:04 ` [PATCH 02/10] rwsem: fully separate code pathes to wake writers vs readers David Howells
@ 2010-05-19 12:25 ` David Howells
  2010-05-20 22:33   ` Michel Lespinasse
  2010-05-21  8:06   ` David Howells
  2010-05-19 12:33 ` [PATCH 04/10] rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads David Howells
                   ` (5 subsequent siblings)
  18 siblings, 2 replies; 35+ messages in thread
From: David Howells @ 2010-05-19 12:25 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: dhowells, Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Michel Lespinasse <walken@google.com> wrote:

> ... When there are waiter threads on a rwsem and the spinlock is held, other
> threads can only increment the active count by trying to grab the rwsem in
> up_xxxx().

That's not true.  A thread attempting to get an rwsem by issuing a down_read()
or down_write() will also unconditionally increment the active count before it
considers calling out to the slow path.

Maybe what you mean is that other threads wanting to do a wake up can only
increase the active count for the processes being woken up whilst holding the
rwsem's spinlock.

> +	/* If we come here from up_xxxx(), another thread might have reached
> +	 * rwsem_down_failed_common() before we acquired the spinlock and
> +	 * woken up an active locker.

Do you mean a waiter rather than an active locker?  If a process is still
registering activity on the rwsem, then it can't be woken yet.
Michel Lespinasse <walken@google.com> wrote:

> +	 * Note that we do not need to update the rwsem count: any writer
> +	 * trying to acquire rwsem will run rwsem_down_write_failed() due
> +	 * to the waiting threads, and block trying to acquire the spinlock.

That comma shouldn't be there.

>  	/* Grant an infinite number of read locks to the readers at the front
>  	 * of the queue.  Note we increment the 'active part' of the count by

I wonder if I should've called it the 'activity part' of the count rather than
the 'active part'.

Apart from that, the patch looks fine.  That's all comment/description fixes.

David

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 04/10] rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (12 preceding siblings ...)
  2010-05-19 12:25 ` [PATCH 03/10] rwsem: lighter active count checks when waking up readers David Howells
@ 2010-05-19 12:33 ` David Howells
  2010-05-19 12:44 ` [PATCH 05/10] rwsem: wake queued readers when writer blocks on active read lock David Howells
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2010-05-19 12:33 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: dhowells, Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Michel Lespinasse <walken@google.com> wrote:

> Previously each waiting thread added a bias of RWSEM_WAITING_BIAS. With this
> change, the bias is added only once when the wait list is non-empty.

You might want to change "when the wait list" to "to indicate that the wait
list".  But apart from that:

Acked-by: David Howells <dhowells@redhat.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/10] rwsem: wake queued readers when writer blocks on active read lock
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (13 preceding siblings ...)
  2010-05-19 12:33 ` [PATCH 04/10] rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads David Howells
@ 2010-05-19 12:44 ` David Howells
  2010-05-19 12:51 ` [PATCH 06/10] rwsem: smaller wrappers around rwsem_down_failed_common David Howells
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2010-05-19 12:44 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: dhowells, Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Michel Lespinasse <walken@google.com> wrote:

> +	 * Alternatively, if we're called from a failed down_write(), there
> +	 * were already threads queued before us, and there are no active
> +	 * writers, the lock must be read owned; so we try to wake any read
> +	 * locks that were queued ahead of us. */

The comma you've added after 'us' is wrong.  That suggests that the implicit
'then' comes there.  I take it you're a proponent of the Oxford/Harvard/serial
comma?

Apart from that miscellaneous grammatical difference, the patch is fine:-)

David

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 06/10] rwsem: smaller wrappers around rwsem_down_failed_common
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (14 preceding siblings ...)
  2010-05-19 12:44 ` [PATCH 05/10] rwsem: wake queued readers when writer blocks on active read lock David Howells
@ 2010-05-19 12:51 ` David Howells
  2010-05-19 13:34 ` [PATCH 08/10] rwsem: down_read_critical infrastructure support David Howells
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2010-05-19 12:51 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: dhowells, Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han


Michel Lespinasse <walken@google.com> wrote:

> More code can be pushed from rwsem_down_read_failed and
> rwsem_down_write_failed into rwsem_down_failed_common.
> 
> The following change, which adds down_read_unfair infrastructure support,
> also benefits from having the flags available in a register rather than
> having to fish them out of the struct rwsem_waiter...
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>

Acked-by: David Howells <dhowells@redhat.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 07/10] generic rwsem: implement down_read_critical() / up_read_critical()
  2010-05-17 23:13     ` Michel Lespinasse
  2010-05-17 23:20       ` Michel Lespinasse
@ 2010-05-19 13:21       ` David Howells
  2010-05-19 23:47         ` Michel Lespinasse
  2010-05-21  3:35         ` Michel Lespinasse
  1 sibling, 2 replies; 35+ messages in thread
From: David Howells @ 2010-05-19 13:21 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: dhowells, Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Michel Lespinasse <walken@google.com> wrote:

> +void __sched down_read_critical(struct rw_semaphore *sem)
> +{
> +	might_sleep();
> +	rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
> +
> +	LOCK_CONTENDED(sem, __down_read_trylock, __down_read_unfair);
> +
> +	preempt_disable();

Shouldn't preemption really be disabled before __down_read_unfair() is called?
Otherwise you can get an unfair read on a sem and immediately get taken off
the CPU.  Of course, this means __down_read_unfair() would have to deal with
that in the slow path:-/

Oh, and something else that occurs to me:  Do unfair readers have to go at the
front of the wakeup queue?  Can they be slightly less unfair and go either
before the first reader in the queue or at the back of the queue instead?

David

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 08/10] rwsem: down_read_critical infrastructure support
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (15 preceding siblings ...)
  2010-05-19 12:51 ` [PATCH 06/10] rwsem: smaller wrappers around rwsem_down_failed_common David Howells
@ 2010-05-19 13:34 ` David Howells
  2010-05-20 23:30   ` Michel Lespinasse
  2010-05-21  8:03   ` David Howells
  2010-05-19 14:36 ` [PATCH 09/10] x86 rwsem: down_read_critical implementation David Howells
  2010-05-19 15:21 ` [PATCH 10/10] Use down_read_critical() for /sys/<pid>/exe and /sys/<pid>/maps files David Howells
  18 siblings, 2 replies; 35+ messages in thread
From: David Howells @ 2010-05-19 13:34 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: dhowells, Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Michel Lespinasse <walken@google.com> wrote:

>  #define RWSEM_WAITING_FOR_READ	0x00000001
>  #define RWSEM_WAITING_FOR_WRITE	0x00000002
> +#define RWSEM_UNFAIR            0x00000004

Can I suggest you change this to:

	enum rwsem_waiter_type {
		RWSEM_WAITING_FOR_WRITE,
		RWSEM_WAITING_FOR_READ,
		RWSEM_WAITING_FOR_UNFAIR_READ
	};

and then change:

>  	unsigned int flags;

to:

	enum rwsem_waiter_type type;

and use this throughout.  It simplifies some of the code too.  See attached
patch.

David
---
rwsem: Make waiter type an enum in the slow path of the asm-optimised version

Make the waiter type in the asm-optimised version of the slow path an
enumeration rather than a bitmask.

Signed-off-by: David Howells <dhowells@redhat.com>
---
diff --git a/lib/rwsem.c b/lib/rwsem.c
index b3fe179..8aa2238 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -28,13 +28,16 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 
 EXPORT_SYMBOL(__init_rwsem);
 
+enum rwsem_waiter_type {
+	RWSEM_WAITING_FOR_WRITE,
+	RWSEM_WAITING_FOR_READ,
+	RWSEM_WAITING_FOR_UNFAIR_READ
+};
+
 struct rwsem_waiter {
 	struct list_head list;
 	struct task_struct *task;
-	unsigned int flags;
-#define RWSEM_WAITING_FOR_READ	0x00000001
-#define RWSEM_WAITING_FOR_WRITE	0x00000002
-#define RWSEM_UNFAIR            0x00000004
+	enum rwsem_waiter_type type;
 };
 
 /* Wake types for __rwsem_do_wake().  Note that RWSEM_WAKE_NO_ACTIVE and
@@ -64,7 +67,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	signed long oldcount, woken, loop, adjustment;
 
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
-	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
+	if (waiter->type != RWSEM_WAITING_FOR_WRITE)
 		goto readers_only;
 
 	if (wake_type == RWSEM_WAKE_READ_OWNED)
@@ -133,10 +136,10 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 		waiter = list_entry(waiter->list.next,
 					struct rwsem_waiter, list);
 
-	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
+	} while (waiter->type != RWSEM_WAITING_FOR_WRITE);
 
 	adjustment = woken * RWSEM_ACTIVE_READ_BIAS;
-	if (waiter->flags & RWSEM_WAITING_FOR_READ)
+	if (waiter->type != RWSEM_WAITING_FOR_WRITE)
 		/* hit end of list above */
 		adjustment -= RWSEM_WAITING_BIAS;
 
@@ -172,7 +175,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
  */
 static struct rw_semaphore __sched *
 rwsem_down_failed_common(struct rw_semaphore *sem,
-			 unsigned int flags, signed long adjustment)
+			 enum rwsem_waiter_type type, signed long adjustment)
 {
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
@@ -183,13 +186,13 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 	/* set up my own style of waitqueue */
 	spin_lock_irq(&sem->wait_lock);
 	waiter.task = tsk;
-	waiter.flags = flags;
+	waiter.type = type;
 	get_task_struct(tsk);
 
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
 
-	if (flags & RWSEM_UNFAIR)
+	if (type & RWSEM_WAITING_FOR_UNFAIR_READ)
 		list_add(&waiter.list, &sem->wait_list);
 	else
 		list_add_tail(&waiter.list, &sem->wait_list);
@@ -242,8 +245,7 @@ rwsem_down_read_failed(struct rw_semaphore *sem)
 asmregparm struct rw_semaphore __sched *
 rwsem_down_read_unfair_failed(struct rw_semaphore *sem)
 {
-	return rwsem_down_failed_common(sem,
-					RWSEM_WAITING_FOR_READ | RWSEM_UNFAIR,
+	return rwsem_down_failed_common(sem, RWSEM_WAITING_FOR_UNFAIR_READ,
 					-RWSEM_ACTIVE_READ_BIAS);
 }
 

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 09/10] x86 rwsem: down_read_critical implementation
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (16 preceding siblings ...)
  2010-05-19 13:34 ` [PATCH 08/10] rwsem: down_read_critical infrastructure support David Howells
@ 2010-05-19 14:36 ` David Howells
  2010-05-19 15:21 ` [PATCH 10/10] Use down_read_critical() for /sys/<pid>/exe and /sys/<pid>/maps files David Howells
  18 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2010-05-19 14:36 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: dhowells, Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Michel Lespinasse <walken@google.com> wrote:

> The basic properties that have been true so far still hold:
> - RWSEM_ACTIVE_READ_BIAS  & RWSEM_ACTIVE_MASK == 1
> - RWSEM_ACTIVE_WRITE_BIAS & RWSEM_ACTIVE_MASK == 1
> - RWSEM_WAITING_BIAS      & RWSEM_ACTIVE_MASK == 0
> - RWSEM_ACTIVE_WRITE_BIAS < 0 and RWSEM_WAITING_BIAS < 0
> 
> A new property is introduced: RWSEM_ACTIVE_WRITE_BIAS is set to be
> 'more negative' than RWSEM_WAITING_BIAS. This way, even the first writer
> will set the rwsem count to be < RWSEM_WAITING_BIAS and down_read_critical()
> can make use of this to determine if there are any active writers.
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>

Acked-by: David Howells <dhowells@redhat.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] Use down_read_critical() for /sys/<pid>/exe and /sys/<pid>/maps files
  2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (17 preceding siblings ...)
  2010-05-19 14:36 ` [PATCH 09/10] x86 rwsem: down_read_critical implementation David Howells
@ 2010-05-19 15:21 ` David Howells
  2010-05-21  2:44   ` Michel Lespinasse
                     ` (2 more replies)
  18 siblings, 3 replies; 35+ messages in thread
From: David Howells @ 2010-05-19 15:21 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: dhowells, Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Michel Lespinasse <walken@google.com> wrote:

> This helps in the following situation:
> - Thread A takes a page fault while reading or writing memory.
>   do_page_fault() acquires the mmap_sem for read and blocks on disk
>   (either reading the page from file, or hitting swap) for a long time.
> - Thread B does an mmap call and blocks trying to acquire the mmap_sem
>   for write
> - Thread C is a monitoring process trying to read every /proc/pid/maps
>   in the system. This requires acquiring the mmap_sem for read. Thread C
>   blocks behind B, waiting for A to release the rwsem.  If thread C
>   could be allowed to run in parallel with A, it would probably get done
>   long before thread A's disk access completes, thus not actually slowing
>   down thread B.
> 
> Test results with down_read_critical_test (10 seconds):

That seems to work.  I have some rwsem benchmarks for you using my
synchronisation primitive test (see attached).  I run it like so:

	insmod /tmp/synchro-test.ko rd=10 wr=2 dg=1 do_sched=1

with 10 readers, 2 writers and a downgrader on the same rwsem, and permitting
scheduling between attempts to grab and release the semaphore.  The test runs
for 5 seconds on a box that isn't doing anything else, and has had almost
everything that might normally be running (including syslog) disabled or
removed.

Without your patches:

      THREADS       S INTVL             LOCKS TAKEN AND RELEASED
-------------------   /LOAD -------------------------------------------------
MTX SEM RD  WR  DG          MTX TAKEN SEM TAKEN RD TAKEN  WR TAKEN  DG TAKEN
=== === === === === = ===== ========= ========= ========= ========= =========
  0   0  10   2   1 s   2/2         0         0   2087911    590464    194394
  0   0  10   2   1 s   2/2         0         0   2030497    591511    196851
  0   0  10   2   1 s   2/2         0         0   2078523    605317    199645

With your patches

  0   0  10   2   1 s   2/2         0         0   2054720    605691    198472
  0   0  10   2   1 s   2/2         0         0   2051376    597976    196962
  0   0  10   2   1 s   2/2         0         0   2011909    595706    197482

So there doesn't seem to be much in the way of degradation.  In fact, there
seems to have been more degradation somewhere in the general kernel since I
played with version 1 of your patches a week ago.  At that time, the vanilla
kernel gave this without your patches applied:

  0   0  10   2   1 s   2/2         0         0   2217200    636227    210255
  0   0  10   2   1 s   2/2         0         0   2217903    629175    212926
  0   0  10   2   1 s   2/2         0         0   2224136    647912    212506

Feel free to have a play.

BTW, the code also works on FRV in NOMMU mode (which uses the spinlock-based
rwsem version).


However...  I'm not certain that giving a process that's accessing
/proc/pid/maps priority over a second process that may be trying to do mmap or
page fault or something internally is a good idea.

It would be useful if you made clearer what you're actually measuring from
/proc/pid/maps...

David
---
/* synchro-test.c: run some threads to test the synchronisation primitives
 *
 * Copyright (C) 2005, 2006 Red Hat, Inc. All Rights Reserved.
 * Written by David Howells (dhowells@redhat.com)
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 *
 * The module should be run as something like:
 *
 *	insmod synchro-test.ko rd=2 wr=2
 *	insmod synchro-test.ko mx=1
 *	insmod synchro-test.ko sm=2 ism=1
 *	insmod synchro-test.ko sm=2 ism=2
 *
 * See Documentation/synchro-test.txt for more information.
 */

#include <linux/module.h>
#include <linux/poll.h>
#include <linux/moduleparam.h>
#include <linux/sched.h>
#include <linux/stat.h>
#include <linux/init.h>
#include <asm/atomic.h>
#include <linux/personality.h>
#include <linux/smp_lock.h>
#include <linux/delay.h>
#include <linux/timer.h>
#include <linux/completion.h>
#include <linux/mutex.h>

#define MAX_THREADS 64

/*
 * Turn on self-validation if we do a one-shot boot-time test:
 */
#ifndef MODULE
# define VALIDATE_OPERATORS
#endif

static int nummx;
static int numsm, seminit = 4;
static int numrd, numwr, numdg;
static int elapse = 5, load = 2, do_sched, interval = 2;
static int verbose = 0;

MODULE_AUTHOR("David Howells");
MODULE_DESCRIPTION("Synchronisation primitive test demo");
MODULE_LICENSE("GPL");

module_param_named(v, verbose, int, 0);
MODULE_PARM_DESC(verbose, "Verbosity");

module_param_named(mx, nummx, int, 0);
MODULE_PARM_DESC(nummx, "Number of mutex threads");

module_param_named(sm, numsm, int, 0);
MODULE_PARM_DESC(numsm, "Number of semaphore threads");

module_param_named(ism, seminit, int, 0);
MODULE_PARM_DESC(seminit, "Initial semaphore value");

module_param_named(rd, numrd, int, 0);
MODULE_PARM_DESC(numrd, "Number of reader threads");

module_param_named(wr, numwr, int, 0);
MODULE_PARM_DESC(numwr, "Number of writer threads");

module_param_named(dg, numdg, int, 0);
MODULE_PARM_DESC(numdg, "Number of downgrader threads");

module_param(elapse, int, 0);
MODULE_PARM_DESC(elapse, "Number of seconds to run for");

module_param(load, int, 0);
MODULE_PARM_DESC(load, "Length of load in uS");

module_param(interval, int, 0);
MODULE_PARM_DESC(interval, "Length of interval in uS before re-getting lock");

module_param(do_sched, int, 0);
MODULE_PARM_DESC(do_sched, "True if each thread should schedule regularly");

/* the semaphores under test */
static struct mutex ____cacheline_aligned mutex;
static struct semaphore ____cacheline_aligned sem;
static struct rw_semaphore ____cacheline_aligned rwsem;

static atomic_t ____cacheline_aligned do_stuff		= ATOMIC_INIT(0);

#ifdef VALIDATE_OPERATORS
static atomic_t ____cacheline_aligned mutexes		= ATOMIC_INIT(0);
static atomic_t ____cacheline_aligned semaphores	= ATOMIC_INIT(0);
static atomic_t ____cacheline_aligned readers		= ATOMIC_INIT(0);
static atomic_t ____cacheline_aligned writers		= ATOMIC_INIT(0);
#endif

static unsigned int ____cacheline_aligned mutexes_taken		[MAX_THREADS];
static unsigned int ____cacheline_aligned semaphores_taken	[MAX_THREADS];
static unsigned int ____cacheline_aligned reads_taken		[MAX_THREADS];
static unsigned int ____cacheline_aligned writes_taken		[MAX_THREADS];
static unsigned int ____cacheline_aligned downgrades_taken	[MAX_THREADS];

static struct completion ____cacheline_aligned mx_comp[MAX_THREADS];
static struct completion ____cacheline_aligned sm_comp[MAX_THREADS];
static struct completion ____cacheline_aligned rd_comp[MAX_THREADS];
static struct completion ____cacheline_aligned wr_comp[MAX_THREADS];
static struct completion ____cacheline_aligned dg_comp[MAX_THREADS];

static struct timer_list ____cacheline_aligned timer;

#define ACCOUNT(var, N) var##_taken[N]++;

#ifdef VALIDATE_OPERATORS
#define TRACK(var, dir) atomic_##dir(&(var))

#define CHECK(var, cond, val)						\
do {									\
	int x = atomic_read(&(var));					\
	if (unlikely(!(x cond (val))))					\
		printk("check [%s %s %d, == %d] failed in %s\n",	\
		       #var, #cond, (val), x, __func__);		\
} while (0)

#else
#define TRACK(var, dir)		do {} while(0)
#define CHECK(var, cond, val)	do {} while(0)
#endif

static inline void do_mutex_lock(unsigned int N)
{
	mutex_lock(&mutex);

	ACCOUNT(mutexes, N);
	TRACK(mutexes, inc);
	CHECK(mutexes, ==, 1);
}

static inline void do_mutex_unlock(unsigned int N)
{
	CHECK(mutexes, ==, 1);
	TRACK(mutexes, dec);

	mutex_unlock(&mutex);
}

static inline void do_down(unsigned int N)
{
	CHECK(mutexes, <, seminit);

	down(&sem);

	ACCOUNT(semaphores, N);
	TRACK(semaphores, inc);
}

static inline void do_up(unsigned int N)
{
	CHECK(semaphores, >, 0);
	TRACK(semaphores, dec);

	up(&sem);
}

static inline void do_down_read(unsigned int N)
{
	down_read(&rwsem);

	ACCOUNT(reads, N);
	TRACK(readers, inc);
	CHECK(readers, >, 0);
	CHECK(writers, ==, 0);
}

static inline void do_up_read(unsigned int N)
{
	CHECK(readers, >, 0);
	CHECK(writers, ==, 0);
	TRACK(readers, dec);

	up_read(&rwsem);
}

static inline void do_down_write(unsigned int N)
{
	down_write(&rwsem);

	ACCOUNT(writes, N);
	TRACK(writers, inc);
	CHECK(writers, ==, 1);
	CHECK(readers, ==, 0);
}

static inline void do_up_write(unsigned int N)
{
	CHECK(writers, ==, 1);
	CHECK(readers, ==, 0);
	TRACK(writers, dec);

	up_write(&rwsem);
}

static inline void do_downgrade_write(unsigned int N)
{
	CHECK(writers, ==, 1);
	CHECK(readers, ==, 0);
	TRACK(writers, dec);
	TRACK(readers, inc);

	downgrade_write(&rwsem);

	ACCOUNT(downgrades, N);
}

static inline void sched(void)
{
	if (do_sched)
		schedule();
}

static int mutexer(void *arg)
{
	unsigned int N = (unsigned long) arg;

	daemonize("Mutex%u", N);
	set_user_nice(current, 19);

	while (atomic_read(&do_stuff)) {
		do_mutex_lock(N);
		if (load)
			udelay(load);
		do_mutex_unlock(N);
		sched();
		if (interval)
			udelay(interval);
	}

	if (verbose >= 2)
		printk("%s: done\n", current->comm);
	complete_and_exit(&mx_comp[N], 0);
}

static int semaphorer(void *arg)
{
	unsigned int N = (unsigned long) arg;

	daemonize("Sem%u", N);
	set_user_nice(current, 19);

	while (atomic_read(&do_stuff)) {
		do_down(N);
		if (load)
			udelay(load);
		do_up(N);
		sched();
		if (interval)
			udelay(interval);
	}

	if (verbose >= 2)
		printk("%s: done\n", current->comm);
	complete_and_exit(&sm_comp[N], 0);
}

static int reader(void *arg)
{
	unsigned int N = (unsigned long) arg;

	daemonize("Read%u", N);
	set_user_nice(current, 19);

	while (atomic_read(&do_stuff)) {
		do_down_read(N);
#ifdef LOAD_TEST
		if (load)
			udelay(load);
#endif
		do_up_read(N);
		sched();
		if (interval)
			udelay(interval);
	}

	if (verbose >= 2)
		printk("%s: done\n", current->comm);
	complete_and_exit(&rd_comp[N], 0);
}

static int writer(void *arg)
{
	unsigned int N = (unsigned long) arg;

	daemonize("Write%u", N);
	set_user_nice(current, 19);

	while (atomic_read(&do_stuff)) {
		do_down_write(N);
#ifdef LOAD_TEST
		if (load)
			udelay(load);
#endif
		do_up_write(N);
		sched();
		if (interval)
			udelay(interval);
	}

	if (verbose >= 2)
		printk("%s: done\n", current->comm);
	complete_and_exit(&wr_comp[N], 0);
}

static int downgrader(void *arg)
{
	unsigned int N = (unsigned long) arg;

	daemonize("Down%u", N);
	set_user_nice(current, 19);

	while (atomic_read(&do_stuff)) {
		do_down_write(N);
#ifdef LOAD_TEST
		if (load)
			udelay(load);
#endif
		do_downgrade_write(N);
#ifdef LOAD_TEST
		if (load)
			udelay(load);
#endif
		do_up_read(N);
		sched();
		if (interval)
			udelay(interval);
	}

	if (verbose >= 2)
		printk("%s: done\n", current->comm);
	complete_and_exit(&dg_comp[N], 0);
}

static void stop_test(unsigned long dummy)
{
	atomic_set(&do_stuff, 0);
}

static unsigned int total(const char *what, unsigned int counts[], int num)
{
	unsigned int tot = 0, max = 0, min = UINT_MAX, zeros = 0, cnt;
	int loop;

	for (loop = 0; loop < num; loop++) {
		cnt = counts[loop];

		if (cnt == 0) {
			zeros++;
			min = 0;
			continue;
		}

		tot += cnt;
		if (tot > max)
			max = tot;
		if (tot < min)
			min = tot;
	}

	if (verbose && tot > 0) {
		printk("%s:", what);

		for (loop = 0; loop < num; loop++) {
			cnt = counts[loop];

			if (cnt == 0)
				printk(" zzz");
			else
				printk(" %d%%", cnt * 100 / tot);
		}

		printk("\n");
	}

	return tot;
}

/*****************************************************************************/
/*
 *
 */
static int __init do_tests(void)
{
	unsigned long loop;
	unsigned int mutex_total, sem_total, rd_total, wr_total, dg_total;

	if (nummx < 0 || nummx > MAX_THREADS ||
	    numsm < 0 || numsm > MAX_THREADS ||
	    numrd < 0 || numrd > MAX_THREADS ||
	    numwr < 0 || numwr > MAX_THREADS ||
	    numdg < 0 || numdg > MAX_THREADS ||
	    seminit < 1 ||
	    elapse < 1 ||
	    load < 0 || load > 999 ||
	    interval < 0 || interval > 999
	    ) {
		printk("Parameter out of range\n");
		return -ERANGE;
	}

	if ((nummx | numsm | numrd | numwr | numdg) == 0) {
		int num = num_online_cpus();

		if (num > MAX_THREADS)
			num = MAX_THREADS;
		nummx = numsm = numrd = numwr = numdg = num;

		load = 1;
		interval = 1;
		do_sched = 1;
		printk("No parameters - using defaults.\n");
	}

	if (verbose)
		printk("\nStarting synchronisation primitive tests...\n");

	mutex_init(&mutex);
	sema_init(&sem, seminit);
	init_rwsem(&rwsem);
	atomic_set(&do_stuff, 1);

	/* kick off all the children */
	for (loop = 0; loop < MAX_THREADS; loop++) {
		if (loop < nummx) {
			init_completion(&mx_comp[loop]);
			kernel_thread(mutexer, (void *) loop, 0);
		}

		if (loop < numsm) {
			init_completion(&sm_comp[loop]);
			kernel_thread(semaphorer, (void *) loop, 0);
		}

		if (loop < numrd) {
			init_completion(&rd_comp[loop]);
			kernel_thread(reader, (void *) loop, 0);
		}

		if (loop < numwr) {
			init_completion(&wr_comp[loop]);
			kernel_thread(writer, (void *) loop, 0);
		}

		if (loop < numdg) {
			init_completion(&dg_comp[loop]);
			kernel_thread(downgrader, (void *) loop, 0);
		}
	}

	/* set a stop timer */
	init_timer(&timer);
	timer.function = stop_test;
	timer.expires = jiffies + elapse * HZ;
	add_timer(&timer);

	/* now wait until it's all done */
	for (loop = 0; loop < nummx; loop++)
		wait_for_completion(&mx_comp[loop]);

	for (loop = 0; loop < numsm; loop++)
		wait_for_completion(&sm_comp[loop]);

	for (loop = 0; loop < numrd; loop++)
		wait_for_completion(&rd_comp[loop]);

	for (loop = 0; loop < numwr; loop++)
		wait_for_completion(&wr_comp[loop]);

	for (loop = 0; loop < numdg; loop++)
		wait_for_completion(&dg_comp[loop]);

	atomic_set(&do_stuff, 0);
	del_timer(&timer);

	if (mutex_is_locked(&mutex))
		printk(KERN_ERR "Mutex is still locked!\n");

	/* count up */
	mutex_total	= total("MTX", mutexes_taken, nummx);
	sem_total	= total("SEM", semaphores_taken, numsm);
	rd_total	= total("RD ", reads_taken, numrd);
	wr_total	= total("WR ", writes_taken, numwr);
	dg_total	= total("DG ", downgrades_taken, numdg);

	/* print the results */
	if (verbose) {
		printk("mutexes taken: %u\n", mutex_total);
		printk("semaphores taken: %u\n", sem_total);
		printk("reads taken: %u\n", rd_total);
		printk("writes taken: %u\n", wr_total);
		printk("downgrades taken: %u\n", dg_total);
	}
	else {
		char buf[30];

		sprintf(buf, "%d/%d", interval, load);

		printk("%3d %3d %3d %3d %3d %c %5s %9u %9u %9u %9u %9u\n",
		       nummx, numsm, numrd, numwr, numdg,
		       do_sched ? 's' : '-',
		       buf,
		       mutex_total,
		       sem_total,
		       rd_total,
		       wr_total,
		       dg_total);
	}

	/* tell insmod to discard the module */
	if (verbose)
		printk("Tests complete\n");
	return -ENOANO;

} /* end do_tests() */

module_init(do_tests);


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 07/10] generic rwsem: implement down_read_critical() /  up_read_critical()
  2010-05-19 13:21       ` David Howells
@ 2010-05-19 23:47         ` Michel Lespinasse
  2010-05-21  3:35         ` Michel Lespinasse
  1 sibling, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-19 23:47 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Wed, May 19, 2010 at 6:21 AM, David Howells <dhowells@redhat.com> wrote:
> Michel Lespinasse <walken@google.com> wrote:
>
>> +void __sched down_read_critical(struct rw_semaphore *sem)
>> +{
>> +     might_sleep();
>> +     rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
>> +
>> +     LOCK_CONTENDED(sem, __down_read_trylock, __down_read_unfair);
>> +
>> +     preempt_disable();
>
> Shouldn't preemption really be disabled before __down_read_unfair() is called?
> Otherwise you can get an unfair read on a sem and immediately get taken off
> the CPU.  Of course, this means __down_read_unfair() would have to deal with
> that in the slow path:-/

I've asked myself the same question; it is true that we don't fully
prevent ourselves from getting preempted while holding the rwsem here.

My understanding is that Linus wants the preempt_disable() mainly to
prevent threads from doing voluntary preemption (anything that might_sleep)
while holding the unfairly acquired rwsem; and also to have people
think twice before adding more down_read_critical() calls.

It wouldn't be difficult to move the preempt_disable() ahead of the
lock acquire fast path. However, I don't think I can do it for the
blocking path, where thread A tries to acquire the lock on behalf of
thread B and then wakes B if it succeeded - we don't have an API for
A to say 'I want to disable preemption in thread B', do we?

> Oh, and something else that occurs to me:  Do unfair readers have to go at the
> front of the wakeup queue?  Can they be slightly less unfair and go either
> before the first reader in the queue or at the back of the queue instead?

Going before the first reader would be fine for our use, as we're
really only using this for mmap_sem and the write holders there don't
keep it very long. I'm not sure what this buys us though.

Going at the back of the queue would mean the critical readers would
still get occasionally blocked behind other readers doing disk
accesses - we'd like to avoid that.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 01/10] x86 rwsem: minor cleanups
  2010-05-19 11:47 ` [PATCH 01/10] x86 rwsem: minor cleanups David Howells
@ 2010-05-20 21:37   ` Michel Lespinasse
  0 siblings, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-20 21:37 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Wed, May 19, 2010 at 4:47 AM, David Howells <dhowells@redhat.com> wrote:
> Given that you describe this first, this would suggest that the subject of the
> patch should be this.  I'm not sure I'd count this as a minor cleanup.  I think
> I'd split it into its own patch.

OK, split it in two.

> Mostly okay, except where you said "expects old value in %edx" - that's only
> true on i386, not x86_64.  On the latter it would be %rdi.  However, I can live
> with that: it's true enough.

It's actually still %edx on x86_64 - we're calling into
arch/x86/lib/rwsem_64.S, which has its own unusual conventions.

>> -     rwsem_count_t tmp;
>> +     rwsem_count_t tmp = -RWSEM_ACTIVE_WRITE_BIAS;
>> ...
>>                    : "+m" (sem->count), "=d" (tmp)
>> -                  : "a" (sem), "1" (-RWSEM_ACTIVE_WRITE_BIAS)
>> +                  : "a" (sem), "1" (tmp)
>
> If you're going to put the initialisation of EDX/RDI on tmp (which isn't really
> necessary), rather than directly on the asm statement, you could change the
> '"=d" (tmp)' output constraint to be '"+d" (tmp)' and drop the '"1" (tmp)'
> constraint entirely.

I agree it'd be nicer, but I wondered whether all gcc versions would handle
the constraint change fine, and then I chickened out. Instead I moved
the initialization into the constraint list, as was already done in
__up_write(). All I was really shooting for here was consistency
across the __down_write_nested, __up_read and __up_write functions.

> However, apart from that, feel free to add my Acked-by to this patch or its
> split resultant patches.

Thanks. I'll send a V4 series soon integrating your feedback & mark
the two splitted patches resulting from this one as Acked-by.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 02/10] rwsem: fully separate code pathes to wake writers  vs readers
  2010-05-19 12:04 ` [PATCH 02/10] rwsem: fully separate code pathes to wake writers vs readers David Howells
@ 2010-05-20 21:48   ` Michel Lespinasse
  0 siblings, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-20 21:48 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Wed, May 19, 2010 at 5:04 AM, David Howells <dhowells@redhat.com> wrote:
> I hate code that jumps into braced blocks with goto.  Would it be possible for
> you to do:
>
>        readers_only:
>                if (downgrading)
>                        goto wake_readers;
>        ...
>        wake_readers:
>                /* Grant an infinite number of read locks to the readers at the front
>        ...
>
> instead, please?

Applied. Note that the next patch in the series removes that code anyway :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 03/10] rwsem: lighter active count checks when waking up  readers
  2010-05-19 12:25 ` [PATCH 03/10] rwsem: lighter active count checks when waking up readers David Howells
@ 2010-05-20 22:33   ` Michel Lespinasse
  2010-05-21  8:06   ` David Howells
  1 sibling, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-20 22:33 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Wed, May 19, 2010 at 5:25 AM, David Howells <dhowells@redhat.com> wrote:
> Michel Lespinasse <walken@google.com> wrote:
>
>> ... When there are waiter threads on a rwsem and the spinlock is held, other
>> threads can only increment the active count by trying to grab the rwsem in
>> up_xxxx().
>
> That's not true.  A thread attempting to get an rwsem by issuing a down_read()
> or down_write() will also unconditionally increment the active count before it
> considers calling out to the slow path.
>
> Maybe what you mean is that other threads wanting to do a wake up can only
> increase the active count for the processes being woken up whilst holding the
> rwsem's spinlock.

Damn, this was a pretty bad commit comment - I wrote up_xxxx instead
of down_xxxx here.

down_xxxx can increase the active count but it will then take the
down_failed path and wait to acquire the spinlock - so we don't need
to worry about it. Other code paths can't even increase the active
count, since they don't hold the spinlock.
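
To illustrate the point with a toy userspace model (hypothetical names,
not the kernel code - only the fast/slow path shape matters here):

	#include <stdatomic.h>
	#include <pthread.h>

	struct rwsem_model {
		atomic_long        count;     /* active/waiting count word */
		pthread_spinlock_t wait_lock; /* protects the wait list */
	};

	static void down_read_model(struct rwsem_model *sem)
	{
		/* fast path: bump the active count unconditionally */
		if (atomic_fetch_add(&sem->count, 1) + 1 >= 0)
			return;		/* no writer and no waiters */

		/* slow path: must take wait_lock before touching the wait
		 * list, so a waker holding wait_lock only ever sees the
		 * count bump, never a concurrent queue manipulation */
		pthread_spin_lock(&sem->wait_lock);
		/* queue ourselves and sleep - elided */
		pthread_spin_unlock(&sem->wait_lock);
	}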

(I got mixed up due to a legitimate mention of the up_xxxx path higher
up. Semaphore operation names confuse me with the active count going
up in down() and down in up() - I suppose it made more sense in the
era of mechanical train semaphores :)

>> +     /* If we come here from up_xxxx(), another thread might have reached
>> +      * rwsem_down_failed_common() before we acquired the spinlock and
>> +      * woken up an active locker.
>
> Do you mean a waiter rather than an active locker?  If a process is still
> registering activity on the rwsem, then it can't be woken yet.

Yes, I meant a waiter (but since it got woken it's now an active locker).

> Apart from that, the patch looks fine.  That's all comment/description fixes.

Thanks. Updated the comments. And I thought I could speak English... :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 08/10] rwsem: down_read_critical infrastructure support
  2010-05-19 13:34 ` [PATCH 08/10] rwsem: down_read_critical infrastructure support David Howells
@ 2010-05-20 23:30   ` Michel Lespinasse
  2010-05-21  8:03   ` David Howells
  1 sibling, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-20 23:30 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Wed, May 19, 2010 at 6:34 AM, David Howells <dhowells@redhat.com> wrote:
> Can I suggest you change this to:
>
>        enum rwsem_waiter_type {
>                RWSEM_WAITING_FOR_WRITE,
>                RWSEM_WAITING_FOR_READ,
>                RWSEM_WAITING_FOR_UNFAIR_READ
>        };
>
> and then change:
>
>>       unsigned int flags;
>
> to:
>
>        enum rwsem_waiter_type type;
>
> and use this throughout.  It simplifies some of the code too.  See attached
> patch.

Makes perfect sense, assuming we don't plan to add more flags in the future.
I'll use this in the V4 series.

> -       if (flags & RWSEM_UNFAIR)
> +       if (type & RWSEM_WAITING_FOR_UNFAIR_READ)

if (type == RWSEM_WAITING_FOR_UNFAIR_READ)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] Use down_read_critical() for /sys/<pid>/exe and  /sys/<pid>/maps files
  2010-05-19 15:21 ` [PATCH 10/10] Use down_read_critical() for /sys/<pid>/exe and /sys/<pid>/maps files David Howells
@ 2010-05-21  2:44   ` Michel Lespinasse
  2010-05-22  1:49   ` Michel Lespinasse
  2010-05-25  9:42   ` David Howells
  2 siblings, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-21  2:44 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Wed, May 19, 2010 at 8:21 AM, David Howells <dhowells@redhat.com> wrote:
>  0   0  10   2   1 s   2/2         0         0   2217200    636227    210255
>  0   0  10   2   1 s   2/2         0         0   2217903    629175    212926
>  0   0  10   2   1 s   2/2         0         0   2224136    647912    212506
>
> Feel free to have a play.

Thanks for sharing your testing code.

> BTW, the code also works on FRV in NOMMU mode (which uses the spinlock-based
> rwsem version).

Yes, I made sure to keep the generic (and non-x86 asm) versions
working by always separating x86-specific changes from the rest.

> However...  I'm not certain that giving a process that's accessing
> /proc/pid/maps priority over a second process that may be trying to do mmap or
> page fault or something internally is a good idea.

One thing to keep in mind if you're concerned about the higher
priority is that these /proc files are still not open to everyone - a
user can't read the /proc/<pid>/exe and /proc/<pid>/maps files for
another user's processes unless he's got CAP_SYS_PTRACE privileges.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 07/10] generic rwsem: implement down_read_critical() /  up_read_critical()
  2010-05-19 13:21       ` David Howells
  2010-05-19 23:47         ` Michel Lespinasse
@ 2010-05-21  3:35         ` Michel Lespinasse
  1 sibling, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-21  3:35 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Wed, May 19, 2010 at 6:21 AM, David Howells <dhowells@redhat.com> wrote:
> Michel Lespinasse <walken@google.com> wrote:
>
>> +void __sched down_read_critical(struct rw_semaphore *sem)
>> +{
>> +     might_sleep();
>> +     rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
>> +
>> +     LOCK_CONTENDED(sem, __down_read_trylock, __down_read_unfair);
>> +
>> +     preempt_disable();
>
> Shouldn't preemption really be disabled before __down_read_unfair() is called?
> Otherwise you can get an unfair read on a sem and immediately get taken off
> the CPU.  Of course, this means __down_read_unfair() would have to deal with
> that in the slow path:-/

I think it's not that bad - I mean, the unfairness is not a factor
here. If you tried to do a regular down_read(), you could also
get preempted on your way to the blocking path. Being preempted on the
way to your critical section after a successful (if unfair) acquire
really is no worse.

The critical section prevents you from blocking on long-latency events
such as disk accesses; being preempted but still runnable is not
nearly as bad.
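
To make the intended usage pattern concrete, here is a minimal sketch
(the function and the vma walk are made up for the example; only the
down_read_critical()/up_read_critical() pair is from this series):

	static unsigned long count_mapped_bytes(struct mm_struct *mm)
	{
		struct vm_area_struct *vma;
		unsigned long bytes = 0;

		/* nothing between these two calls may block: no allocation,
		 * no disk access, no cond_resched() */
		down_read_critical(&mm->mmap_sem);
		for (vma = mm->mmap; vma; vma = vma->vm_next)
			bytes += vma->vm_end - vma->vm_start;
		up_read_critical(&mm->mmap_sem);

		return bytes;
	}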

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 08/10] rwsem: down_read_critical infrastructure support
  2010-05-19 13:34 ` [PATCH 08/10] rwsem: down_read_critical infrastructure support David Howells
  2010-05-20 23:30   ` Michel Lespinasse
@ 2010-05-21  8:03   ` David Howells
  1 sibling, 0 replies; 35+ messages in thread
From: David Howells @ 2010-05-21  8:03 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: dhowells, Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Michel Lespinasse <walken@google.com> wrote:

> > -       if (flags & RWSEM_UNFAIR)
> > +       if (type & RWSEM_WAITING_FOR_UNFAIR_READ)
> 
> if (type == RWSEM_WAITING_FOR_UNFAIR_READ)

Sorry, yes.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 03/10] rwsem: lighter active count checks when waking up readers
  2010-05-19 12:25 ` [PATCH 03/10] rwsem: lighter active count checks when waking up readers David Howells
  2010-05-20 22:33   ` Michel Lespinasse
@ 2010-05-21  8:06   ` David Howells
  1 sibling, 0 replies; 35+ messages in thread
From: David Howells @ 2010-05-21  8:06 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: dhowells, Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Michel Lespinasse <walken@google.com> wrote:

> down_xxxx can increase the active count ...

Not so much can, as does...  Perhaps you need to qualify that.

David

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] Use down_read_critical() for /sys/<pid>/exe and  /sys/<pid>/maps files
  2010-05-19 15:21 ` [PATCH 10/10] Use down_read_critical() for /sys/<pid>/exe and /sys/<pid>/maps files David Howells
  2010-05-21  2:44   ` Michel Lespinasse
@ 2010-05-22  1:49   ` Michel Lespinasse
  2010-05-25  9:42   ` David Howells
  2 siblings, 0 replies; 35+ messages in thread
From: Michel Lespinasse @ 2010-05-22  1:49 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Wed, May 19, 2010 at 8:21 AM, David Howells <dhowells@redhat.com> wrote:
> However...  I'm not certain that giving a process that's accessing
> /proc/pid/maps priority over a second process that may be trying to do mmap or
> page fault or something internally is a good idea.
>
> It would be useful if you made clearer what you're actually measuring from
> /proc/pid/maps...

This is a relatively large piece of software; I don't actually have a
complete view of it myself.
However, some of the things it does include the following:

- The cluster management software tracks ownership of sysv shm
segments, potentially deleting them if they have been abandoned.
/proc/<pid>/maps files are used as an input to these decisions.

- There is a separate module making NUMA-aware decisions based on
/proc/<pid>/numa_maps information. This drives decisions such as
moving data between nodes (we have not been pushing the corresponding
kernel patches yet). Using down_read_critical() here would be
desirable too, but not as critical as for other parts of the system
(NUMA awareness is not the biggest worry when machines come under
enough memory pressure that reading numa_maps starts blocking).

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] Use down_read_critical() for /sys/<pid>/exe and /sys/<pid>/maps files
  2010-05-19 15:21 ` [PATCH 10/10] Use down_read_critical() for /sys/<pid>/exe and /sys/<pid>/maps files David Howells
  2010-05-21  2:44   ` Michel Lespinasse
  2010-05-22  1:49   ` Michel Lespinasse
@ 2010-05-25  9:42   ` David Howells
  2 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2010-05-25  9:42 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: dhowells, Linus Torvalds, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	peterz

Michel Lespinasse <walken@google.com> wrote:

> - The cluster management software tracks ownership of sysv shm
> segments, potentially deleting them if they have been abandoned.
> /proc/<pid>/maps files are used as an input to these decisions.

ipcs manages to give you that information in its nattch column, I believe,
without touching /proc/pid/maps.

David

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2010-05-25  9:43 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-17 22:25 [PATCH 00/10] V3: rwsem changes + down_read_critical() proposal Michel Lespinasse
2010-05-17 22:25 ` [PATCH 01/10] x86 rwsem: minor cleanups Michel Lespinasse
2010-05-17 22:25 ` [PATCH 02/10] rwsem: fully separate code pathes to wake writers vs readers Michel Lespinasse
2010-05-17 22:25 ` [PATCH 03/10] rwsem: lighter active count checks when waking up readers Michel Lespinasse
2010-05-17 22:25 ` [PATCH 04/10] rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads Michel Lespinasse
2010-05-17 22:25 ` [PATCH 05/10] rwsem: wake queued readers when writer blocks on active read lock Michel Lespinasse
2010-05-17 22:25 ` [PATCH 06/10] rwsem: smaller wrappers around rwsem_down_failed_common Michel Lespinasse
2010-05-17 22:25 ` [PATCH 07/10] generic rwsem: implement down_read_critical() / up_read_critical() Michel Lespinasse
2010-05-17 22:44   ` Linus Torvalds
2010-05-17 23:13     ` Michel Lespinasse
2010-05-17 23:20       ` Michel Lespinasse
2010-05-19 13:21       ` David Howells
2010-05-19 23:47         ` Michel Lespinasse
2010-05-21  3:35         ` Michel Lespinasse
2010-05-17 22:25 ` [PATCH 08/10] rwsem: down_read_critical infrastructure support Michel Lespinasse
2010-05-17 22:25 ` [PATCH 09/10] x86 rwsem: down_read_critical implementation Michel Lespinasse
2010-05-17 22:25 ` [PATCH 10/10] Use down_read_critical() for /sys/<pid>/exe and /sys/<pid>/maps files Michel Lespinasse
2010-05-19 11:47 ` [PATCH 01/10] x86 rwsem: minor cleanups David Howells
2010-05-20 21:37   ` Michel Lespinasse
2010-05-19 12:04 ` [PATCH 02/10] rwsem: fully separate code pathes to wake writers vs readers David Howells
2010-05-20 21:48   ` Michel Lespinasse
2010-05-19 12:25 ` [PATCH 03/10] rwsem: lighter active count checks when waking up readers David Howells
2010-05-20 22:33   ` Michel Lespinasse
2010-05-21  8:06   ` David Howells
2010-05-19 12:33 ` [PATCH 04/10] rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads David Howells
2010-05-19 12:44 ` [PATCH 05/10] rwsem: wake queued readers when writer blocks on active read lock David Howells
2010-05-19 12:51 ` [PATCH 06/10] rwsem: smaller wrappers around rwsem_down_failed_common David Howells
2010-05-19 13:34 ` [PATCH 08/10] rwsem: down_read_critical infrastructure support David Howells
2010-05-20 23:30   ` Michel Lespinasse
2010-05-21  8:03   ` David Howells
2010-05-19 14:36 ` [PATCH 09/10] x86 rwsem: down_read_critical implementation David Howells
2010-05-19 15:21 ` [PATCH 10/10] Use down_read_critical() for /sys/<pid>/exe and /sys/<pid>/maps files David Howells
2010-05-21  2:44   ` Michel Lespinasse
2010-05-22  1:49   ` Michel Lespinasse
2010-05-25  9:42   ` David Howells
