* [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal
@ 2010-05-24 20:31 Michel Lespinasse
  2010-05-24 20:31 ` [PATCH 01/11] x86 rwsem: stay on fast path when count>0 in __up_write() Michel Lespinasse
                   ` (11 more replies)
  0 siblings, 12 replies; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-24 20:31 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

This is version 4 of my rwsem changes.

Changes since V3:

- Split first x86 rwsem change in two smaller parts
- Added David's rwsem_waiter_type enum suggestion into
  'rwsem: down_read_critical infrastructure support'
- Lots of minor style fixes and comments clarified

I would hope the entire series can be considered for inclusion;
however, if we cannot agree on the down_read_critical() bits I would
still like patches 1-7 to be independently considered, as I think they
still have merit on their own.

Michel Lespinasse (11):
  x86 rwsem: stay on fast path when count>0 in __up_write()
  x86 rwsem: minor cleanups
  rwsem: fully separate code paths to wake writers vs readers
  rwsem: lighter active count checks when waking up readers
  rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads
  rwsem: wake queued readers when writer blocks on active read lock
  rwsem: smaller wrappers around rwsem_down_failed_common
  generic rwsem: implement down_read_critical() / up_read_critical()
  rwsem: down_read_critical infrastructure support
  x86 rwsem: down_read_critical implementation
  Use down_read_critical() for /proc/<pid>/exe and /proc/<pid>/maps files

 arch/x86/include/asm/rwsem.h   |   73 +++++++++++------
 arch/x86/lib/rwsem_64.S        |   14 +++-
 arch/x86/lib/semaphore_32.S    |   21 ++++-
 fs/proc/base.c                 |    4 +-
 fs/proc/task_mmu.c             |   24 ++++--
 include/linux/proc_fs.h        |    1 +
 include/linux/rwsem-spinlock.h |   10 ++-
 include/linux/rwsem.h          |   12 +++
 kernel/rwsem.c                 |   35 ++++++++
 lib/rwsem-spinlock.c           |   10 ++-
 lib/rwsem.c                    |  178 +++++++++++++++++++++++++---------------
 11 files changed, 276 insertions(+), 106 deletions(-)


* [PATCH 01/11] x86 rwsem: stay on fast path when count>0 in __up_write()
  2010-05-24 20:31 [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Michel Lespinasse
@ 2010-05-24 20:31 ` Michel Lespinasse
  2010-05-24 20:31 ` [PATCH 02/11] x86 rwsem: minor cleanups Michel Lespinasse
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-24 20:31 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

When count>0 there is no need to take the call_rwsem_wake path.
If we did take that path, it would just return without doing anything
due to the active count not being zero.
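
For reference, a rough C equivalent of the fast path after this change
(a sketch only -- the real code is the inline asm below; xadd leaves
the old count in %edx and sets the sign flag from the new count):

	rwsem_count_t newcount;

	newcount = rwsem_atomic_update(-RWSEM_ACTIVE_WRITE_BIAS, sem);
	if (newcount < 0)		/* waiters are queued */
		rwsem_wake(sem);	/* reached via the call_rwsem_wake
					 * thunk, which still returns early
					 * when the active count is nonzero */

'jns' thus also stays on the fast path in the count > 0 case, which the
old 'jz' needlessly sent through call_rwsem_wake.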

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: David Howells <dhowells@redhat.com>
---
 arch/x86/include/asm/rwsem.h |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index 606ede1..3fd1ed3 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -216,9 +216,8 @@ static inline void __up_write(struct rw_semaphore *sem)
 	rwsem_count_t tmp;
 	asm volatile("# beginning __up_write\n\t"
 		     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
-		     /* tries to transition
-			0xffff0001 -> 0x00000000 */
-		     "  jz       1f\n"
+		     /* subtracts 0xffff0001, returns the old value */
+		     "  jns        1f\n\t"
 		     "  call call_rwsem_wake\n"
 		     "1:\n\t"
 		     "# ending __up_write\n"
-- 
1.7.0.1



* [PATCH 02/11] x86 rwsem: minor cleanups
  2010-05-24 20:31 [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Michel Lespinasse
  2010-05-24 20:31 ` [PATCH 01/11] x86 rwsem: stay on fast path when count>0 in __up_write() Michel Lespinasse
@ 2010-05-24 20:31 ` Michel Lespinasse
  2010-05-24 20:31 ` [PATCH 03/11] rwsem: fully separate code paths to wake writers vs readers Michel Lespinasse
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-24 20:31 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Clarified a few comments and made the initialization of %edx/%rdx more
uniform across the __down_write_nested, __up_read and __up_write
functions.

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: David Howells <dhowells@redhat.com>
---
 arch/x86/include/asm/rwsem.h |   16 +++++++---------
 1 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index 3fd1ed3..0892acc 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -118,7 +118,7 @@ static inline void __down_read(struct rw_semaphore *sem)
 {
 	asm volatile("# beginning down_read\n\t"
 		     LOCK_PREFIX _ASM_INC "(%1)\n\t"
-		     /* adds 0x00000001, returns the old value */
+		     /* adds 0x00000001 */
 		     "  jns        1f\n"
 		     "  call call_rwsem_down_read_failed\n"
 		     "1:\n\t"
@@ -156,11 +156,9 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
 static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
 {
 	rwsem_count_t tmp;
-
-	tmp = RWSEM_ACTIVE_WRITE_BIAS;
 	asm volatile("# beginning down_write\n\t"
 		     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
-		     /* subtract 0x0000ffff, returns the old value */
+		     /* adds 0xffff0001, returns the old value */
 		     "  test      %1,%1\n\t"
 		     /* was the count 0 before? */
 		     "  jz        1f\n"
@@ -168,7 +166,7 @@ static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
 		     "1:\n"
 		     "# ending down_write"
 		     : "+m" (sem->count), "=d" (tmp)
-		     : "a" (sem), "1" (tmp)
+		     : "a" (sem), "1" (RWSEM_ACTIVE_WRITE_BIAS)
 		     : "memory", "cc");
 }
 
@@ -195,16 +193,16 @@ static inline int __down_write_trylock(struct rw_semaphore *sem)
  */
 static inline void __up_read(struct rw_semaphore *sem)
 {
-	rwsem_count_t tmp = -RWSEM_ACTIVE_READ_BIAS;
+	rwsem_count_t tmp;
 	asm volatile("# beginning __up_read\n\t"
 		     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
 		     /* subtracts 1, returns the old value */
 		     "  jns        1f\n\t"
-		     "  call call_rwsem_wake\n"
+		     "  call call_rwsem_wake\n" /* expects old value in %edx */
 		     "1:\n"
 		     "# ending __up_read\n"
 		     : "+m" (sem->count), "=d" (tmp)
-		     : "a" (sem), "1" (tmp)
+		     : "a" (sem), "1" (-RWSEM_ACTIVE_READ_BIAS)
 		     : "memory", "cc");
 }
 
@@ -218,7 +216,7 @@ static inline void __up_write(struct rw_semaphore *sem)
 		     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
 		     /* subtracts 0xffff0001, returns the old value */
 		     "  jns        1f\n\t"
-		     "  call call_rwsem_wake\n"
+		     "  call call_rwsem_wake\n" /* expects old value in %edx */
 		     "1:\n\t"
 		     "# ending __up_write\n"
 		     : "+m" (sem->count), "=d" (tmp)
-- 
1.7.0.1



* [PATCH 03/11] rwsem: fully separate code paths to wake writers vs readers
  2010-05-24 20:31 [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Michel Lespinasse
  2010-05-24 20:31 ` [PATCH 01/11] x86 rwsem: stay on fast path when count>0 in __up_write() Michel Lespinasse
  2010-05-24 20:31 ` [PATCH 02/11] x86 rwsem: minor cleanups Michel Lespinasse
@ 2010-05-24 20:31 ` Michel Lespinasse
  2010-05-24 20:31 ` [PATCH 04/11] rwsem: lighter active count checks when waking up readers Michel Lespinasse
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-24 20:31 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

This is in preparation for later changes in the series.

In __rwsem_do_wake(), the first queued waiter is now checked up front
to determine whether it's a writer or a reader. The code paths diverge
at this point. The code that checks and increments the rwsem active count
is duplicated on both sides - the point is that later changes in the
series will be able to independently modify both sides.
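
After this change, __rwsem_do_wake() is structured roughly as follows
(sketch):

	/*
	 *	waiter = first queued waiter
	 *	writer at the front?  ->  try_again_write / undo_write:
	 *		check + increment the active count, then grant
	 *		the write lock
	 *	reader at the front?  ->  try_again_read / undo_read:
	 *		check + increment the active count, then grant
	 *		read locks to all leading readers
	 */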

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   61 ++++++++++++++++++++++++++++++++--------------------------
 1 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index ceba8e2..917fd94 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -41,7 +41,7 @@ struct rwsem_waiter {
  * - if we come here from up_xxxx(), then:
  *   - the 'active part' of count (&0x0000ffff) reached 0 (but may have changed)
  *   - the 'waiting part' of count (&0xffff0000) is -ve (and will still be so)
- *   - there must be someone on the queue
+ * - there must be someone on the queue
  * - the spinlock must be held by the caller
  * - woken process blocks are discarded from the list after having task zeroed
  * - writers are only woken if downgrading is false
@@ -54,26 +54,23 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	struct list_head *next;
 	signed long oldcount, woken, loop;
 
+	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
+	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
+		goto readers_only;
+
 	if (downgrading)
-		goto dont_wake_writers;
+		goto out;
 
-	/* if we came through an up_xxxx() call, we only only wake someone up
-	 * if we can transition the active part of the count from 0 -> 1
+	/* There's a writer at the front of the queue - try to grant it the
+	 * write lock.  However, we only wake this writer if we can transition
+	 * the active part of the count from 0 -> 1
 	 */
- try_again:
+ try_again_write:
 	oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
 						- RWSEM_ACTIVE_BIAS;
 	if (oldcount & RWSEM_ACTIVE_MASK)
-		goto undo;
-
-	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
-
-	/* try to grant a single write lock if there's a writer at the front
-	 * of the queue - note we leave the 'active part' of the count
-	 * incremented by 1 and the waiting part incremented by 0x00010000
-	 */
-	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
-		goto readers_only;
+		/* Someone grabbed the sem already */
+		goto undo_write;
 
 	/* We must be careful not to touch 'waiter' after we set ->task = NULL.
 	 * It is allocated on the waiter's stack and may become invalid at
@@ -87,18 +84,24 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	put_task_struct(tsk);
 	goto out;
 
-	/* don't want to wake any writers */
- dont_wake_writers:
-	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
-	if (waiter->flags & RWSEM_WAITING_FOR_WRITE)
-		goto out;
+ readers_only:
+	if (downgrading)
+		goto wake_readers;
+
+	/* if we came through an up_xxxx() call, we only wake someone up
+	 * if we can transition the active part of the count from 0 -> 1 */
+ try_again_read:
+	oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
+						- RWSEM_ACTIVE_BIAS;
+	if (oldcount & RWSEM_ACTIVE_MASK)
+		/* Someone grabbed the sem already */
+		goto undo_read;
 
-	/* grant an infinite number of read locks to the readers at the front
-	 * of the queue
-	 * - note we increment the 'active part' of the count by the number of
-	 *   readers before waking any processes up
+ wake_readers:
+	/* Grant an infinite number of read locks to the readers at the front
+	 * of the queue.  Note we increment the 'active part' of the count by
+	 * the number of readers before waking any processes up.
 	 */
- readers_only:
 	woken = 0;
 	do {
 		woken++;
@@ -138,10 +141,14 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 
 	/* undo the change to the active count, but check for a transition
 	 * 1->0 */
- undo:
+ undo_write:
+	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
+		goto out;
+	goto try_again_write;
+ undo_read:
 	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
 		goto out;
-	goto try_again;
+	goto try_again_read;
 }
 
 /*
-- 
1.7.0.1



* [PATCH 04/11] rwsem: lighter active count checks when waking up readers
  2010-05-24 20:31 [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (2 preceding siblings ...)
  2010-05-24 20:31 ` [PATCH 03/11] rwsem: fully separate code paths to wake writers vs readers Michel Lespinasse
@ 2010-05-24 20:31 ` Michel Lespinasse
  2010-05-24 20:49   ` Daniel Walker
  2010-05-24 20:31 ` [PATCH 05/11] rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads Michel Lespinasse
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-24 20:31 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

In __rwsem_do_wake(), we can skip the active count check unless we reach
it from up_xxxx(). Also, when checking the active count, it is not
actually necessary to increment it; this allows us to get rid of the
read-side undo code and simplify the calculation of the final rwsem
count adjustment once we've counted the reader threads to wake.

The basic observation is the following. When there are waiter threads
on a rwsem and the spinlock is held, other threads can only increment the
active count by trying to grab the rwsem in down_xxxx(). However down_xxxx()
will notice there are waiter threads and take the down_failed path,
blocking to acquire the spinlock on the way there. Therefore, a thread
observing an active count of zero with waiters queued and the spinlock
held is protected against other threads acquiring the rwsem until it
wakes the last waiter or releases the spinlock.
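
In code terms, the resulting re-check (taken from the hunk below) is
just a dummy atomic update, with no increment and no undo path:

	/* With sem->wait_lock held and waiters queued, nobody can
	 * complete an acquisition behind our back, so an observed zero
	 * active part holds until we wake someone or drop the lock. */
	if (wake_type == RWSEM_WAKE_ANY &&
	    (rwsem_atomic_update(0, sem) & RWSEM_ACTIVE_MASK))
		/* Someone grabbed the sem already */
		goto out;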

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   57 ++++++++++++++++++++++++++++++++-------------------------
 1 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 917fd94..94f2d7a 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -36,6 +36,14 @@ struct rwsem_waiter {
 #define RWSEM_WAITING_FOR_WRITE	0x00000002
 };
 
+/* Wake types for __rwsem_do_wake().  Note that RWSEM_WAKE_NO_ACTIVE and
+ * RWSEM_WAKE_READ_OWNED imply that the spinlock must have been kept held
+ * since the rwsem value was observed.
+ */
+#define RWSEM_WAKE_ANY        0 /* Wake whatever's at head of wait list */
+#define RWSEM_WAKE_NO_ACTIVE  1 /* rwsem was observed with no active thread */
+#define RWSEM_WAKE_READ_OWNED 2 /* rwsem was observed to be read owned */
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then:
@@ -46,8 +54,8 @@ struct rwsem_waiter {
  * - woken process blocks are discarded from the list after having task zeroed
  * - writers are only woken if downgrading is false
  */
-static inline struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
+static struct rw_semaphore *
+__rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 {
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
@@ -58,7 +66,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
 		goto readers_only;
 
-	if (downgrading)
+	if (wake_type == RWSEM_WAKE_READ_OWNED)
 		goto out;
 
 	/* There's a writer at the front of the queue - try to grant it the
@@ -85,19 +93,25 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	goto out;
 
  readers_only:
-	if (downgrading)
-		goto wake_readers;
-
-	/* if we came through an up_xxxx() call, we only wake someone up
-	 * if we can transition the active part of the count from 0 -> 1 */
- try_again_read:
-	oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
-						- RWSEM_ACTIVE_BIAS;
-	if (oldcount & RWSEM_ACTIVE_MASK)
+	/* If we come here from up_xxxx(), another thread might have reached
+	 * rwsem_down_failed_common() before we acquired the spinlock and
+	 * woken up a waiter, making it now active.  We prefer to check for
+	 * this first in order to not spend too much time with the spinlock
+	 * held if we're not going to be able to wake up readers in the end.
+	 *
+	 * Note that we do not need to update the rwsem count: any writer
+	 * trying to acquire rwsem will run rwsem_down_write_failed() due
+	 * to the waiting threads and block trying to acquire the spinlock.
+	 *
+	 * We use a dummy atomic update in order to acquire the cache line
+	 * exclusively since we expect to succeed and run the final rwsem
+	 * count adjustment pretty soon.
+	 */
+	if (wake_type == RWSEM_WAKE_ANY &&
+	    (rwsem_atomic_update(0, sem) & RWSEM_ACTIVE_MASK))
 		/* Someone grabbed the sem already */
-		goto undo_read;
+		goto out;
 
- wake_readers:
 	/* Grant an infinite number of read locks to the readers at the front
 	 * of the queue.  Note we increment the 'active part' of the count by
 	 * the number of readers before waking any processes up.
@@ -116,9 +130,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 
 	loop = woken;
 	woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
-	if (!downgrading)
-		/* we'd already done one increment earlier */
-		woken -= RWSEM_ACTIVE_BIAS;
 
 	rwsem_atomic_add(woken, sem);
 
@@ -145,10 +156,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
 		goto out;
 	goto try_again_write;
- undo_read:
-	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
-		goto out;
-	goto try_again_read;
 }
 
 /*
@@ -170,12 +177,12 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	list_add_tail(&waiter->list, &sem->wait_list);
 
-	/* we're now waiting on the lock, but no longer actively read-locking */
+	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
 
 	/* if there are no active locks, wake the front queued process(es) up */
 	if (!(count & RWSEM_ACTIVE_MASK))
-		sem = __rwsem_do_wake(sem, 0);
+		sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);
 
 	spin_unlock_irq(&sem->wait_lock);
 
@@ -232,7 +239,7 @@ asmregparm struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, 0);
+		sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
 
 	spin_unlock_irqrestore(&sem->wait_lock, flags);
 
@@ -252,7 +259,7 @@ asmregparm struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, 1);
+		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
 
 	spin_unlock_irqrestore(&sem->wait_lock, flags);
 
-- 
1.7.0.1



* [PATCH 05/11] rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads
  2010-05-24 20:31 [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (3 preceding siblings ...)
  2010-05-24 20:31 ` [PATCH 04/11] rwsem: lighter active count checks when waking up readers Michel Lespinasse
@ 2010-05-24 20:31 ` Michel Lespinasse
  2010-05-24 20:31 ` [PATCH 06/11] rwsem: wake queued readers when writer blocks on active read lock Michel Lespinasse
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-24 20:31 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Previously each waiting thread added a bias of RWSEM_WAITING_BIAS.
With this change, the bias is added only once to indicate that
the wait list is non-empty.

This has a few nice properties which will be used in following changes:
- when the spinlock is held and the waiter list is known to be non-empty,
  count < RWSEM_WAITING_BIAS  <=>  there is an active writer on that sem
- count == RWSEM_WAITING_BIAS  <=>  there are waiting threads and no
                                     active readers/writers on that sem
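
A worked example of these properties, using the 32-bit count values as
they stand at this point in the series (sketch):

	/*
	 *   RWSEM_ACTIVE_READ_BIAS  = 0x00000001
	 *   RWSEM_WAITING_BIAS      = 0xffff0000   (-0x00010000)
	 *   RWSEM_ACTIVE_WRITE_BIAS = 0xffff0001
	 *
	 * One active writer, three queued waiters:
	 *   before: 0xffff0001 + 3 * 0xffff0000   (one bias per waiter)
	 *   after:  0xffff0001 + 0xffff0000       (a single flag)
	 *         = 0xfffe0001  <  RWSEM_WAITING_BIAS
	 *
	 * No active threads, any number of queued waiters:
	 *   count == 0xffff0000 == RWSEM_WAITING_BIAS
	 */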

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: David Howells <dhowells@redhat.com>
---
 lib/rwsem.c |   28 +++++++++++++++++-----------
 1 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 94f2d7a..a3e68bf 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -60,7 +60,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
 	struct list_head *next;
-	signed long oldcount, woken, loop;
+	signed long oldcount, woken, loop, adjustment;
 
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
 	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
@@ -73,9 +73,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	 * write lock.  However, we only wake this writer if we can transition
 	 * the active part of the count from 0 -> 1
 	 */
+	adjustment = RWSEM_ACTIVE_WRITE_BIAS;
+	if (waiter->list.next == &sem->wait_list)
+		adjustment -= RWSEM_WAITING_BIAS;
+
  try_again_write:
-	oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
-						- RWSEM_ACTIVE_BIAS;
+	oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
 	if (oldcount & RWSEM_ACTIVE_MASK)
 		/* Someone grabbed the sem already */
 		goto undo_write;
@@ -128,13 +131,15 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 
 	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
 
-	loop = woken;
-	woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+	adjustment = woken * RWSEM_ACTIVE_READ_BIAS;
+	if (waiter->flags & RWSEM_WAITING_FOR_READ)
+		/* hit end of list above */
+		adjustment -= RWSEM_WAITING_BIAS;
 
-	rwsem_atomic_add(woken, sem);
+	rwsem_atomic_add(adjustment, sem);
 
 	next = sem->wait_list.next;
-	for (; loop > 0; loop--) {
+	for (loop = woken; loop > 0; loop--) {
 		waiter = list_entry(next, struct rwsem_waiter, list);
 		next = waiter->list.next;
 		tsk = waiter->task;
@@ -153,7 +158,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	/* undo the change to the active count, but check for a transition
 	 * 1->0 */
  undo_write:
-	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) & RWSEM_ACTIVE_MASK)
+	if (rwsem_atomic_update(-adjustment, sem) & RWSEM_ACTIVE_MASK)
 		goto out;
 	goto try_again_write;
 }
@@ -175,6 +180,8 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 	waiter->task = tsk;
 	get_task_struct(tsk);
 
+	if (list_empty(&sem->wait_list))
+		adjustment += RWSEM_WAITING_BIAS;
 	list_add_tail(&waiter->list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
@@ -208,8 +215,7 @@ rwsem_down_read_failed(struct rw_semaphore *sem)
 	struct rwsem_waiter waiter;
 
 	waiter.flags = RWSEM_WAITING_FOR_READ;
-	rwsem_down_failed_common(sem, &waiter,
-				RWSEM_WAITING_BIAS - RWSEM_ACTIVE_BIAS);
+	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_READ_BIAS);
 	return sem;
 }
 
@@ -222,7 +228,7 @@ rwsem_down_write_failed(struct rw_semaphore *sem)
 	struct rwsem_waiter waiter;
 
 	waiter.flags = RWSEM_WAITING_FOR_WRITE;
-	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_BIAS);
+	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_WRITE_BIAS);
 
 	return sem;
 }
-- 
1.7.0.1



* [PATCH 06/11] rwsem: wake queued readers when writer blocks on active read lock
  2010-05-24 20:31 [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (4 preceding siblings ...)
  2010-05-24 20:31 ` [PATCH 05/11] rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads Michel Lespinasse
@ 2010-05-24 20:31 ` Michel Lespinasse
  2010-08-12  1:24   ` Tony Luck
  2010-05-24 20:31 ` [PATCH 07/11] rwsem: smaller wrappers around rwsem_down_failed_common Michel Lespinasse
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-24 20:31 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

This change addresses the following situation:

- Thread A acquires the rwsem for read
- Thread B tries to acquire the rwsem for write, notices there is already
  an active owner for the rwsem.
- Thread C tries to acquire the rwsem for read, notices that thread B already
  tried to acquire it.
- Thread C grabs the spinlock and queues itself on the wait queue.
- Thread B grabs the spinlock and queues itself behind C. At this point A is
  the only remaining active owner on the rwsem.

In this situation thread B could notice that it was the last active writer
on the rwsem, and decide to wake C to let it proceed in parallel with A
since they both only want the rwsem for read.
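
With the count layout from the previous patch, the count trace for this
scenario looks roughly like this (32-bit, sketch):

	/*
	 *  A: down_read()                     count = 0x00000001
	 *  C: queues as the first waiter      count += RWSEM_WAITING_BIAS
	 *  B: fails down_write(), queues and
	 *     undoes its own write bias       count = 1 + RWSEM_WAITING_BIAS
	 *
	 * B now sees count > RWSEM_WAITING_BIAS while its adjustment was
	 * -RWSEM_ACTIVE_WRITE_BIAS: there is no active writer, the sem
	 * must be read owned, and __rwsem_do_wake(sem,
	 * RWSEM_WAKE_READ_OWNED) lets C run in parallel with A.
	 */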

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 lib/rwsem.c |   19 +++++++++++++++----
 1 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index a3e68bf..318d435 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -67,6 +67,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 		goto readers_only;
 
 	if (wake_type == RWSEM_WAKE_READ_OWNED)
+		/* Another active reader was observed, so wakeup is not
+		 * likely to succeed. Save the atomic op.
+		 */
 		goto out;
 
 	/* There's a writer at the front of the queue - try to grant it the
@@ -111,8 +114,8 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	 * count adjustment pretty soon.
 	 */
 	if (wake_type == RWSEM_WAKE_ANY &&
-	    (rwsem_atomic_update(0, sem) & RWSEM_ACTIVE_MASK))
-		/* Someone grabbed the sem already */
+	    rwsem_atomic_update(0, sem) < RWSEM_WAITING_BIAS)
+		/* Someone grabbed the sem for write already */
 		goto out;
 
 	/* Grant an infinite number of read locks to the readers at the front
@@ -187,9 +190,17 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
 
-	/* if there are no active locks, wake the front queued process(es) up */
-	if (!(count & RWSEM_ACTIVE_MASK))
+	/* If there are no active locks, wake the front queued process(es) up.
+	 *
+	 * Alternatively, if we're called from a failed down_write(), there
+	 * were already threads queued before us and there are no active
+	 * writers, the lock must be read owned; so we try to wake any read
+	 * locks that were queued ahead of us. */
+	if (count == RWSEM_WAITING_BIAS)
 		sem = __rwsem_do_wake(sem, RWSEM_WAKE_NO_ACTIVE);
+	else if (count > RWSEM_WAITING_BIAS &&
+		 adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
+		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
 
 	spin_unlock_irq(&sem->wait_lock);
 
-- 
1.7.0.1



* [PATCH 07/11] rwsem: smaller wrappers around rwsem_down_failed_common
  2010-05-24 20:31 [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (5 preceding siblings ...)
  2010-05-24 20:31 ` [PATCH 06/11] rwsem: wake queued readers when writer blocks on active read lock Michel Lespinasse
@ 2010-05-24 20:31 ` Michel Lespinasse
  2010-05-24 20:31 ` [PATCH 08/11] generic rwsem: implement down_read_critical() / up_read_critical() Michel Lespinasse
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-24 20:31 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

More code can be pushed from rwsem_down_read_failed and
rwsem_down_write_failed into rwsem_down_failed_common.

The following change, which adds down_read_critical infrastructure
support, also benefits from having the flags available in a register
rather than having to fish them out of the struct rwsem_waiter...

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: David Howells <dhowells@redhat.com>
---
 lib/rwsem.c |   25 ++++++++++---------------
 1 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 318d435..f236d7c 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -171,8 +171,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
  */
 static struct rw_semaphore __sched *
 rwsem_down_failed_common(struct rw_semaphore *sem,
-			struct rwsem_waiter *waiter, signed long adjustment)
+			 unsigned int flags, signed long adjustment)
 {
+	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
 	signed long count;
 
@@ -180,12 +181,13 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	/* set up my own style of waitqueue */
 	spin_lock_irq(&sem->wait_lock);
-	waiter->task = tsk;
+	waiter.task = tsk;
+	waiter.flags = flags;
 	get_task_struct(tsk);
 
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
-	list_add_tail(&waiter->list, &sem->wait_list);
+	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
@@ -206,7 +208,7 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	/* wait to be given the lock */
 	for (;;) {
-		if (!waiter->task)
+		if (!waiter.task)
 			break;
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
@@ -223,11 +225,8 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 asmregparm struct rw_semaphore __sched *
 rwsem_down_read_failed(struct rw_semaphore *sem)
 {
-	struct rwsem_waiter waiter;
-
-	waiter.flags = RWSEM_WAITING_FOR_READ;
-	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_READ_BIAS);
-	return sem;
+	return rwsem_down_failed_common(sem, RWSEM_WAITING_FOR_READ,
+					-RWSEM_ACTIVE_READ_BIAS);
 }
 
 /*
@@ -236,12 +235,8 @@ rwsem_down_read_failed(struct rw_semaphore *sem)
 asmregparm struct rw_semaphore __sched *
 rwsem_down_write_failed(struct rw_semaphore *sem)
 {
-	struct rwsem_waiter waiter;
-
-	waiter.flags = RWSEM_WAITING_FOR_WRITE;
-	rwsem_down_failed_common(sem, &waiter, -RWSEM_ACTIVE_WRITE_BIAS);
-
-	return sem;
+	return rwsem_down_failed_common(sem, RWSEM_WAITING_FOR_WRITE,
+					-RWSEM_ACTIVE_WRITE_BIAS);
 }
 
 /*
-- 
1.7.0.1



* [PATCH 08/11] generic rwsem: implement down_read_critical() / up_read_critical()
  2010-05-24 20:31 [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (6 preceding siblings ...)
  2010-05-24 20:31 ` [PATCH 07/11] rwsem: smaller wrappers around rwsem_down_failed_common Michel Lespinasse
@ 2010-05-24 20:31 ` Michel Lespinasse
  2010-05-24 20:31 ` [PATCH 09/11] rwsem: down_read_critical infrastructure support Michel Lespinasse
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-24 20:31 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Add down_read_critical() / up_read_critical() API.

down_read_critical() is similar to down_read() with the following changes:
- when the rwsem is read owned with queued writers, down_read_critical()
  callers are allowed to acquire the rwsem for read without queueing;
- when the rwsem is write owned, down_read_critical() callers get queued in
  front of threads trying to acquire the rwsem by other means;
- callers can't sleep until releasing the lock with up_read_critical(),
  and preemption is disabled for that time.
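
Typical usage then looks like this (sketch, mirroring the
get_mm_exe_file() conversion in the last patch of the series):

	down_read_critical(&mm->mmap_sem);
	/* short read-side section: no sleeping allowed, and preemption
	 * stays disabled until the matching up_read_critical() */
	exe_file = mm->exe_file;
	if (exe_file)
		get_file(exe_file);
	up_read_critical(&mm->mmap_sem);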

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/rwsem-spinlock.h |   10 +++++++++-
 include/linux/rwsem.h          |   12 ++++++++++++
 kernel/rwsem.c                 |   35 +++++++++++++++++++++++++++++++++++
 lib/rwsem-spinlock.c           |   10 +++++++---
 4 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/include/linux/rwsem-spinlock.h b/include/linux/rwsem-spinlock.h
index bdfcc25..67631c3 100644
--- a/include/linux/rwsem-spinlock.h
+++ b/include/linux/rwsem-spinlock.h
@@ -60,7 +60,9 @@ do {								\
 	__init_rwsem((sem), #sem, &__key);			\
 } while (0)
 
-extern void __down_read(struct rw_semaphore *sem);
+#define __HAVE_DOWN_READ_UNFAIR
+
+extern void __down_read_internal(struct rw_semaphore *sem, int unfair);
 extern int __down_read_trylock(struct rw_semaphore *sem);
 extern void __down_write(struct rw_semaphore *sem);
 extern void __down_write_nested(struct rw_semaphore *sem, int subclass);
@@ -70,5 +72,11 @@ extern void __up_write(struct rw_semaphore *sem);
 extern void __downgrade_write(struct rw_semaphore *sem);
 extern int rwsem_is_locked(struct rw_semaphore *sem);
 
+static inline void __down_read(struct rw_semaphore *sem)
+	{ __down_read_internal(sem, 0); }
+
+static inline void __down_read_unfair(struct rw_semaphore *sem)
+	{ __down_read_internal(sem, 1); }
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_RWSEM_SPINLOCK_H */
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index efd348f..76fd8f4 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -28,6 +28,13 @@ struct rw_semaphore;
 extern void down_read(struct rw_semaphore *sem);
 
 /*
+ * lock for reading in critical section
+ *
+ * locker skips waiting threads, but can't block until up_read_critical()
+ */
+extern void down_read_critical(struct rw_semaphore *sem);
+
+/*
  * trylock for reading -- returns 1 if successful, 0 if contention
  */
 extern int down_read_trylock(struct rw_semaphore *sem);
@@ -48,6 +55,11 @@ extern int down_write_trylock(struct rw_semaphore *sem);
 extern void up_read(struct rw_semaphore *sem);
 
 /*
+ * release a read lock acquired with down_read_critical()
+ */
+extern void up_read_critical(struct rw_semaphore *sem);
+
+/*
  * release a write lock
  */
 extern void up_write(struct rw_semaphore *sem);
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cae050b..7c34174 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -13,6 +13,10 @@
 #include <asm/system.h>
 #include <asm/atomic.h>
 
+#ifndef __HAVE_DOWN_READ_UNFAIR
+# define __down_read_unfair(sem) __down_read(sem)
+#endif
+
 /*
  * lock for reading
  */
@@ -27,6 +31,23 @@ void __sched down_read(struct rw_semaphore *sem)
 EXPORT_SYMBOL(down_read);
 
 /*
+ * lock for reading in critical section
+ *
+ * locker skips waiting threads, but can't block until up_read_critical()
+ */
+void __sched down_read_critical(struct rw_semaphore *sem)
+{
+	might_sleep();
+	rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
+
+	LOCK_CONTENDED(sem, __down_read_trylock, __down_read_unfair);
+
+	preempt_disable();
+}
+
+EXPORT_SYMBOL(down_read_critical);
+
+/*
  * trylock for reading -- returns 1 if successful, 0 if contention
  */
 int down_read_trylock(struct rw_semaphore *sem)
@@ -80,6 +101,20 @@ void up_read(struct rw_semaphore *sem)
 EXPORT_SYMBOL(up_read);
 
 /*
+ * release a read lock acquired with down_read_critical()
+ */
+void up_read_critical(struct rw_semaphore *sem)
+{
+	rwsem_release(&sem->dep_map, 1, _RET_IP_);
+
+	__up_read(sem);
+
+	preempt_enable();
+}
+
+EXPORT_SYMBOL(up_read_critical);
+
+/*
  * release a write lock
  */
 void up_write(struct rw_semaphore *sem)
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
index ffc9fc7..b2fd5fb 100644
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -139,7 +139,7 @@ __rwsem_wake_one_writer(struct rw_semaphore *sem)
 /*
  * get a read lock on the semaphore
  */
-void __sched __down_read(struct rw_semaphore *sem)
+void __sched __down_read_internal(struct rw_semaphore *sem, int unfair)
 {
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk;
@@ -147,7 +147,7 @@ void __sched __down_read(struct rw_semaphore *sem)
 
 	spin_lock_irqsave(&sem->wait_lock, flags);
 
-	if (sem->activity >= 0 && list_empty(&sem->wait_list)) {
+	if (sem->activity >= 0 && (unfair || list_empty(&sem->wait_list))) {
 		/* granted */
 		sem->activity++;
 		spin_unlock_irqrestore(&sem->wait_lock, flags);
@@ -162,7 +162,11 @@ void __sched __down_read(struct rw_semaphore *sem)
 	waiter.flags = RWSEM_WAITING_FOR_READ;
 	get_task_struct(tsk);
 
-	list_add_tail(&waiter.list, &sem->wait_list);
+	if (unfair) {
+		list_add(&waiter.list, &sem->wait_list);
+	} else {
+		list_add_tail(&waiter.list, &sem->wait_list);
+	}
 
 	/* we don't need to touch the semaphore struct anymore */
 	spin_unlock_irqrestore(&sem->wait_lock, flags);
-- 
1.7.0.1



* [PATCH 09/11] rwsem: down_read_critical infrastructure support
  2010-05-24 20:31 [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (7 preceding siblings ...)
  2010-05-24 20:31 ` [PATCH 08/11] generic rwsem: implement down_read_critical() / up_read_critical() Michel Lespinasse
@ 2010-05-24 20:31 ` Michel Lespinasse
  2010-05-24 20:31 ` [PATCH 10/11] x86 rwsem: down_read_critical implementation Michel Lespinasse
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-24 20:31 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han,
	David Howells

Add a rwsem_down_read_unfair_failed() function to the non-generic rwsem
library code, similar to rwsem_down_read_failed() except that blocked
threads are placed at the head of the queue.

Make the waiter type an enumeration rather than a bitmask.

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
---
 lib/rwsem.c |   40 +++++++++++++++++++++++++++++++---------
 1 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index f236d7c..72454f4 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -28,12 +28,16 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 
 EXPORT_SYMBOL(__init_rwsem);
 
+enum rwsem_waiter_type {
+	RWSEM_WAITING_FOR_WRITE,
+	RWSEM_WAITING_FOR_READ,
+	RWSEM_WAITING_FOR_UNFAIR_READ
+};
+
 struct rwsem_waiter {
 	struct list_head list;
 	struct task_struct *task;
-	unsigned int flags;
-#define RWSEM_WAITING_FOR_READ	0x00000001
-#define RWSEM_WAITING_FOR_WRITE	0x00000002
+	enum rwsem_waiter_type type;
 };
 
 /* Wake types for __rwsem_do_wake().  Note that RWSEM_WAKE_NO_ACTIVE and
@@ -63,7 +67,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	signed long oldcount, woken, loop, adjustment;
 
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
-	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
+	if (waiter->type != RWSEM_WAITING_FOR_WRITE)
 		goto readers_only;
 
 	if (wake_type == RWSEM_WAKE_READ_OWNED)
@@ -132,10 +136,10 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 		waiter = list_entry(waiter->list.next,
 					struct rwsem_waiter, list);
 
-	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
+	} while (waiter->type != RWSEM_WAITING_FOR_WRITE);
 
 	adjustment = woken * RWSEM_ACTIVE_READ_BIAS;
-	if (waiter->flags & RWSEM_WAITING_FOR_READ)
+	if (waiter->type != RWSEM_WAITING_FOR_WRITE)
 		/* hit end of list above */
 		adjustment -= RWSEM_WAITING_BIAS;
 
@@ -171,7 +175,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
  */
 static struct rw_semaphore __sched *
 rwsem_down_failed_common(struct rw_semaphore *sem,
-			 unsigned int flags, signed long adjustment)
+			 enum rwsem_waiter_type type, signed long adjustment)
 {
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
@@ -182,12 +186,16 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 	/* set up my own style of waitqueue */
 	spin_lock_irq(&sem->wait_lock);
 	waiter.task = tsk;
-	waiter.flags = flags;
+	waiter.type = type;
 	get_task_struct(tsk);
 
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
-	list_add_tail(&waiter.list, &sem->wait_list);
+
+	if (type == RWSEM_WAITING_FOR_UNFAIR_READ)
+		list_add(&waiter.list, &sem->wait_list);
+	else
+		list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
 	count = rwsem_atomic_update(adjustment, sem);
@@ -229,6 +237,20 @@ rwsem_down_read_failed(struct rw_semaphore *sem)
 					-RWSEM_ACTIVE_READ_BIAS);
 }
 
+#ifdef __HAVE_DOWN_READ_UNFAIR
+
+/*
+ * wait for the read lock to be granted - skip waiting threads
+ */
+asmregparm struct rw_semaphore __sched *
+rwsem_down_read_unfair_failed(struct rw_semaphore *sem)
+{
+	return rwsem_down_failed_common(sem, RWSEM_WAITING_FOR_UNFAIR_READ,
+					-RWSEM_ACTIVE_READ_BIAS);
+}
+
+#endif
+
 /*
  * wait for the write lock to be granted
  */
-- 
1.7.0.1



* [PATCH 10/11] x86 rwsem: down_read_critical implementation
  2010-05-24 20:31 [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (8 preceding siblings ...)
  2010-05-24 20:31 ` [PATCH 09/11] rwsem: down_read_critical infrastructure support Michel Lespinasse
@ 2010-05-24 20:31 ` Michel Lespinasse
  2010-05-24 20:31 ` [PATCH 11/11] Use down_read_critical() for /proc/<pid>/exe and /proc/<pid>/maps files Michel Lespinasse
  2010-05-25  8:47 ` [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Peter Zijlstra
  11 siblings, 0 replies; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-24 20:31 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

The basic properties that have been true so far still hold:
- RWSEM_ACTIVE_READ_BIAS  & RWSEM_ACTIVE_MASK == 1
- RWSEM_ACTIVE_WRITE_BIAS & RWSEM_ACTIVE_MASK == 1
- RWSEM_WAITING_BIAS      & RWSEM_ACTIVE_MASK == 0
- RWSEM_ACTIVE_WRITE_BIAS < 0 and RWSEM_WAITING_BIAS < 0

A new property is introduced: RWSEM_ACTIVE_WRITE_BIAS is set to be
'more negative' than RWSEM_WAITING_BIAS. This way, even the first writer
will set the rwsem count to be < RWSEM_WAITING_BIAS and down_read_critical()
can make use of this to determine if there are any active writers.
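
Checking the new property against the 32-bit values introduced below
(sketch):

	/*
	 *   RWSEM_WAITING_BIAS      = 0xffff8000   (-0x00008000)
	 *   RWSEM_ACTIVE_WRITE_BIAS = 0xffff0001   (-0x0000ffff)
	 *
	 * First writer on a previously unlocked sem:
	 *   count = 0 + RWSEM_ACTIVE_WRITE_BIAS  <  RWSEM_WAITING_BIAS
	 *
	 * N active readers plus queued waiters:
	 *   count = N + RWSEM_WAITING_BIAS  >=  RWSEM_WAITING_BIAS
	 *
	 * so __down_read_unfair() can rule out active writers with a
	 * single compare against RWSEM_WAITING_BIAS (the cmp/jge pair).
	 */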

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: David Howells <dhowells@redhat.com>
---
 arch/x86/include/asm/rwsem.h |   52 ++++++++++++++++++++++++++++++++---------
 arch/x86/lib/rwsem_64.S      |   14 +++++++++-
 arch/x86/lib/semaphore_32.S  |   21 +++++++++++++++-
 3 files changed, 71 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index 0892acc..54905f1 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -16,11 +16,10 @@
  * if there are writers (and maybe) readers waiting (in which case it goes to
  * sleep).
  *
- * The value of WAITING_BIAS supports up to 32766 waiting processes. This can
- * be extended to 65534 by manually checking the whole MSW rather than relying
- * on the S flag.
+ * The WRITE_BIAS value supports up to 32767 processes simultaneously
+ * trying to acquire a write lock.
  *
- * The value of ACTIVE_BIAS supports up to 65535 active processes.
+ * The value of ACTIVE_MASK supports up to 32767 active processes.
  *
  * This should be totally fair - if anything is waiting, a process that wants a
  * lock will go to the back of the queue. When the currently active lock is
@@ -48,6 +47,8 @@ struct rwsem_waiter;
 extern asmregparm struct rw_semaphore *
  rwsem_down_read_failed(struct rw_semaphore *sem);
 extern asmregparm struct rw_semaphore *
+ rwsem_down_read_unfair_failed(struct rw_semaphore *sem);
+extern asmregparm struct rw_semaphore *
  rwsem_down_write_failed(struct rw_semaphore *sem);
 extern asmregparm struct rw_semaphore *
  rwsem_wake(struct rw_semaphore *);
@@ -62,17 +63,22 @@ extern asmregparm struct rw_semaphore *
  * for 64 bits.
  */
 
+
 #ifdef CONFIG_X86_64
-# define RWSEM_ACTIVE_MASK		0xffffffffL
+# define RWSEM_UNLOCKED_VALUE		0x0000000000000000L
+# define RWSEM_ACTIVE_MASK		0x000000007fffffffL
+# define RWSEM_ACTIVE_READ_BIAS		0x0000000000000001L
+# define RWSEM_ACTIVE_WRITE_BIAS	0xffffffff00000001L
+# define RWSEM_WAITING_BIAS		0xffffffff80000000L
 #else
-# define RWSEM_ACTIVE_MASK		0x0000ffffL
+# define RWSEM_UNLOCKED_VALUE		0x00000000L
+# define RWSEM_ACTIVE_MASK		0x00007fffL
+# define RWSEM_ACTIVE_READ_BIAS		0x00000001L
+# define RWSEM_ACTIVE_WRITE_BIAS	0xffff0001L
+# define RWSEM_WAITING_BIAS		0xffff8000L
 #endif
 
-#define RWSEM_UNLOCKED_VALUE		0x00000000L
-#define RWSEM_ACTIVE_BIAS		0x00000001L
-#define RWSEM_WAITING_BIAS		(-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define __HAVE_DOWN_READ_UNFAIR
 
 typedef signed long rwsem_count_t;
 
@@ -129,6 +135,27 @@ static inline void __down_read(struct rw_semaphore *sem)
 }
 
 /*
+ * lock for reading - skip waiting threads
+ */
+static inline void __down_read_unfair(struct rw_semaphore *sem)
+{
+	rwsem_count_t tmp;
+	asm volatile("# beginning down_read_unfair\n\t"
+		     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
+		     /* adds 0x00000001, returns the old value */
+		     "  cmp       %4,%1\n\t"
+		     /* was the count >= RWSEM_WAITING_BIAS before? */
+		     "  jge       1f\n"
+		     "  call call_rwsem_down_read_unfair_failed\n"
+		     "1:\n"
+		     "# ending down_read_unfair"
+		     : "+m" (sem->count), "=r" (tmp)
+		     : "a" (sem), "1" (RWSEM_ACTIVE_READ_BIAS),
+		       "er" (RWSEM_WAITING_BIAS)
+		     : "memory", "cc");
+}
+
+/*
  * trylock for reading -- returns 1 if successful, 0 if contention
  */
 static inline int __down_read_trylock(struct rw_semaphore *sem)
@@ -240,7 +267,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 		     "1:\n\t"
 		     "# ending __downgrade_write\n"
 		     : "+m" (sem->count)
-		     : "a" (sem), "er" (-RWSEM_WAITING_BIAS)
+		     : "a" (sem),
+		       "er" (RWSEM_ACTIVE_READ_BIAS - RWSEM_ACTIVE_WRITE_BIAS)
 		     : "memory", "cc");
 }
 
diff --git a/arch/x86/lib/rwsem_64.S b/arch/x86/lib/rwsem_64.S
index 41fcf00..e99d187 100644
--- a/arch/x86/lib/rwsem_64.S
+++ b/arch/x86/lib/rwsem_64.S
@@ -51,6 +51,16 @@ ENTRY(call_rwsem_down_read_failed)
 	ret
 	ENDPROC(call_rwsem_down_read_failed)
 
+ENTRY(call_rwsem_down_read_unfair_failed)
+	save_common_regs
+	pushq %rdx
+	movq %rax,%rdi
+	call rwsem_down_read_unfair_failed
+	popq %rdx
+	restore_common_regs
+	ret
+	ENDPROC(call_rwsem_down_read_unfair_failed)
+
 ENTRY(call_rwsem_down_write_failed)
 	save_common_regs
 	movq %rax,%rdi
@@ -60,8 +70,8 @@ ENTRY(call_rwsem_down_write_failed)
 	ENDPROC(call_rwsem_down_write_failed)
 
 ENTRY(call_rwsem_wake)
-	decl %edx	/* do nothing if still outstanding active readers */
-	jnz 1f
+	cmpl $0x80000001, %edx
+	jne 1f	/* do nothing unless there are waiters and no active threads */
 	save_common_regs
 	movq %rax,%rdi
 	call rwsem_wake
diff --git a/arch/x86/lib/semaphore_32.S b/arch/x86/lib/semaphore_32.S
index 648fe47..fea9a53 100644
--- a/arch/x86/lib/semaphore_32.S
+++ b/arch/x86/lib/semaphore_32.S
@@ -89,6 +89,23 @@ ENTRY(call_rwsem_down_read_failed)
 	CFI_ENDPROC
 	ENDPROC(call_rwsem_down_read_failed)
 
+ENTRY(call_rwsem_down_read_unfair_failed)
+	CFI_STARTPROC
+	push %ecx
+	CFI_ADJUST_CFA_OFFSET 4
+	CFI_REL_OFFSET ecx,0
+	push %edx
+	CFI_ADJUST_CFA_OFFSET 4
+	CFI_REL_OFFSET edx,0
+	call rwsem_down_read_unfair_failed
+	pop %edx
+	CFI_ADJUST_CFA_OFFSET -4
+	pop %ecx
+	CFI_ADJUST_CFA_OFFSET -4
+	ret
+	CFI_ENDPROC
+	ENDPROC(call_rwsem_down_read_unfair_failed)
+
 ENTRY(call_rwsem_down_write_failed)
 	CFI_STARTPROC
 	push %ecx
@@ -103,8 +120,8 @@ ENTRY(call_rwsem_down_write_failed)
 
 ENTRY(call_rwsem_wake)
 	CFI_STARTPROC
-	decw %dx    /* do nothing if still outstanding active readers */
-	jnz 1f
+	cmpw $0x8001, %dx
+	jne 1f	/* do nothing unless there are waiters and no active threads */
 	push %ecx
 	CFI_ADJUST_CFA_OFFSET 4
 	CFI_REL_OFFSET ecx,0
-- 
1.7.0.1



* [PATCH 11/11] Use down_read_critical() for /proc/<pid>/exe and /proc/<pid>/maps files
  2010-05-24 20:31 [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (9 preceding siblings ...)
  2010-05-24 20:31 ` [PATCH 10/11] x86 rwsem: down_read_critical implementation Michel Lespinasse
@ 2010-05-24 20:31 ` Michel Lespinasse
  2010-05-24 21:12   ` Daniel Walker
  2010-05-27 14:12   ` Arjan van de Ven
  2010-05-25  8:47 ` [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Peter Zijlstra
  11 siblings, 2 replies; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-24 20:31 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

This helps in the following situation:
- Thread A takes a page fault while reading or writing memory.
  do_page_fault() acquires the mmap_sem for read and blocks on disk
  (either reading the page from file, or hitting swap) for a long time.
- Thread B does an mmap call and blocks trying to acquire the mmap_sem
  for write
- Thread C is a monitoring process trying to read every /proc/pid/maps
  in the system. This requires acquiring the mmap_sem for read. Thread C
  blocks behind B, waiting for A to release the rwsem.  If thread C
  could be allowed to run in parallel with A, it would probably get done
  long before thread A's disk access completes, thus not actually slowing
  down thread B.

Test results with down_read_critical_test (10 seconds):

2.6.33.3:
threadA completes ~600 faults
threadB completes ~300 mmap/munmap cycles
threadC completes ~600 /proc/pid/maps reads

2.6.33.3 + down_read_critical:
threadA completes ~600 faults
threadB completes ~300 mmap/munmap cycles
threadC completes ~160000 /proc/pid/maps reads
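
(down_read_critical_test is not part of this series; for reference, a
minimal sketch of what the thread C side of such a test might look
like, with the maps path and stop flag as illustrative placeholders:)

	#include <fcntl.h>
	#include <unistd.h>

	static long read_maps_loop(const char *path, volatile int *stop)
	{
		char buf[4096];
		long reads = 0;

		while (!*stop) {
			int fd = open(path, O_RDONLY);
			if (fd < 0)
				break;
			/* each full read walks the vma list under mmap_sem */
			while (read(fd, buf, sizeof(buf)) > 0)
				;
			close(fd);
			reads++;
		}
		return reads;
	}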

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 fs/proc/base.c          |    4 ++--
 fs/proc/task_mmu.c      |   24 +++++++++++++++++-------
 include/linux/proc_fs.h |    1 +
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 8418fcc..9201762 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1367,11 +1367,11 @@ struct file *get_mm_exe_file(struct mm_struct *mm)
 
 	/* We need mmap_sem to protect against races with removal of
 	 * VM_EXECUTABLE vmas */
-	down_read(&mm->mmap_sem);
+	down_read_critical(&mm->mmap_sem);
 	exe_file = mm->exe_file;
 	if (exe_file)
 		get_file(exe_file);
-	up_read(&mm->mmap_sem);
+	up_read_critical(&mm->mmap_sem);
 	return exe_file;
 }
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 47f5b14..83f9353 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -89,7 +89,10 @@ static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
 {
 	if (vma && vma != priv->tail_vma) {
 		struct mm_struct *mm = vma->vm_mm;
-		up_read(&mm->mmap_sem);
+		if (priv->critical)
+			up_read_critical(&mm->mmap_sem);
+		else
+			up_read(&mm->mmap_sem);
 		mmput(mm);
 	}
 }
@@ -123,7 +126,10 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 	mm = mm_for_maps(priv->task);
 	if (!mm)
 		return NULL;
-	down_read(&mm->mmap_sem);
+	if (priv->critical)
+		down_read_critical(&mm->mmap_sem);
+	else
+		down_read(&mm->mmap_sem);
 
 	tail_vma = get_gate_vma(priv->task);
 	priv->tail_vma = tail_vma;
@@ -156,7 +162,10 @@ out:
 
 	/* End of vmas has been reached */
 	m->version = (tail_vma != NULL)? 0: -1UL;
-	up_read(&mm->mmap_sem);
+	if (priv->critical)
+		up_read_critical(&mm->mmap_sem);
+	else
+		up_read(&mm->mmap_sem);
 	mmput(mm);
 	return tail_vma;
 }
@@ -185,13 +194,14 @@ static void m_stop(struct seq_file *m, void *v)
 }
 
 static int do_maps_open(struct inode *inode, struct file *file,
-			const struct seq_operations *ops)
+			const struct seq_operations *ops, int critical)
 {
 	struct proc_maps_private *priv;
 	int ret = -ENOMEM;
 	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
 	if (priv) {
 		priv->pid = proc_pid(inode);
+		priv->critical = critical;
 		ret = seq_open(file, ops);
 		if (!ret) {
 			struct seq_file *m = file->private_data;
@@ -282,7 +292,7 @@ static const struct seq_operations proc_pid_maps_op = {
 
 static int maps_open(struct inode *inode, struct file *file)
 {
-	return do_maps_open(inode, file, &proc_pid_maps_op);
+	return do_maps_open(inode, file, &proc_pid_maps_op, 1);
 }
 
 const struct file_operations proc_maps_operations = {
@@ -432,7 +442,7 @@ static const struct seq_operations proc_pid_smaps_op = {
 
 static int smaps_open(struct inode *inode, struct file *file)
 {
-	return do_maps_open(inode, file, &proc_pid_smaps_op);
+	return do_maps_open(inode, file, &proc_pid_smaps_op, 0);
 }
 
 const struct file_operations proc_smaps_operations = {
@@ -808,7 +818,7 @@ static const struct seq_operations proc_pid_numa_maps_op = {
 
 static int numa_maps_open(struct inode *inode, struct file *file)
 {
-	return do_maps_open(inode, file, &proc_pid_numa_maps_op);
+	return do_maps_open(inode, file, &proc_pid_numa_maps_op, 0);
 }
 
 const struct file_operations proc_numa_maps_operations = {
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 379eaed..66785e7 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -291,6 +291,7 @@ struct proc_maps_private {
 	struct task_struct *task;
 #ifdef CONFIG_MMU
 	struct vm_area_struct *tail_vma;
+	int critical;
 #endif
 };
 
-- 
1.7.0.1



* Re: [PATCH 04/11] rwsem: lighter active count checks when waking up readers
  2010-05-24 20:31 ` [PATCH 04/11] rwsem: lighter active count checks when waking up readers Michel Lespinasse
@ 2010-05-24 20:49   ` Daniel Walker
  0 siblings, 0 replies; 36+ messages in thread
From: Daniel Walker @ 2010-05-24 20:49 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Mon, 2010-05-24 at 13:31 -0700, Michel Lespinasse wrote:
> In __rwsem_do_wake(), we can skip the active count check unless we reach
> it from up_xxxx(). Also, when checking the active count, it is not
> actually necessary to increment it; this allows us to get rid of the
> read-side undo code and simplify the calculation of the final rwsem
> count adjustment once we've counted the reader threads to wake.
> 
> The basic observation is the following. When there are waiter threads
> on a rwsem and the spinlock is held, other threads can only increment the
> active count by trying to grab the rwsem in down_xxxx(). However down_xxxx()
> will notice there are waiter threads and take the down_failed path,
> blocking to acquire the spinlock on the way there. Therefore, a thread
> observing an active count of zero with waiters queued and the spinlock
> held is protected against other threads acquiring the rwsem until it
> wakes the last waiter or releases the spinlock.
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>
> ---
>  lib/rwsem.c |   57 ++++++++++++++++++++++++++++++++-------------------------
>  1 files changed, 32 insertions(+), 25 deletions(-)
> 
> diff --git a/lib/rwsem.c b/lib/rwsem.c
> index 917fd94..94f2d7a 100644
> --- a/lib/rwsem.c
> +++ b/lib/rwsem.c
> @@ -36,6 +36,14 @@ struct rwsem_waiter {
>  #define RWSEM_WAITING_FOR_WRITE	0x00000002
>  };
>  
> +/* Wake types for __rwsem_do_wake().  Note that RWSEM_WAKE_NO_ACTIVE and
> + * RWSEM_WAKE_READ_OWNED imply that the spinlock must have been kept held
> + * since the rwsem value was observed.
> + */
> +#define RWSEM_WAKE_ANY        0 /* Wake whatever's at head of wait list */
> +#define RWSEM_WAKE_NO_ACTIVE  1 /* rwsem was observed with no active thread */
> +#define RWSEM_WAKE_READ_OWNED 2 /* rwsem was observed to be read owned */
> +
>  /*
>   * handle the lock release when processes blocked on it that can now run
>   * - if we come here from up_xxxx(), then:
> @@ -46,8 +54,8 @@ struct rwsem_waiter {
>   * - woken process blocks are discarded from the list after having task zeroed
>   * - writers are only woken if downgrading is false
>   */
> -static inline struct rw_semaphore *
> -__rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
> +static struct rw_semaphore *
> +__rwsem_do_wake(struct rw_semaphore *sem, int wake_type)

You could convert the "wake_type" into an enum along with the defines
above.
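
Something like this, I mean (just a sketch of the conversion; only the
type is new, the enumerators are your existing defines):

	enum rwsem_wake_type {
		RWSEM_WAKE_ANY,		/* Wake whatever's at head of wait list */
		RWSEM_WAKE_NO_ACTIVE,	/* rwsem was observed with no active thread */
		RWSEM_WAKE_READ_OWNED	/* rwsem was observed to be read owned */
	};

	static struct rw_semaphore *
	__rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)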

Daniel


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 11/11] Use down_read_critical() for /proc/<pid>/exe and /proc/<pid>/maps files
  2010-05-24 20:31 ` [PATCH 11/11] Use down_read_critical() for /proc/<pid>/exe and /proc/<pid>/maps files Michel Lespinasse
@ 2010-05-24 21:12   ` Daniel Walker
  2010-05-27 14:12   ` Arjan van de Ven
  1 sibling, 0 replies; 36+ messages in thread
From: Daniel Walker @ 2010-05-24 21:12 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Mon, 2010-05-24 at 13:31 -0700, Michel Lespinasse wrote:
> This helps in the following situation:
> - Thread A takes a page fault while reading or writing memory.
>   do_page_fault() acquires the mmap_sem for read and blocks on disk
>   (either reading the page from file, or hitting swap) for a long time.
> - Thread B does an mmap call and blocks trying to acquire the mmap_sem
>   for write
> - Thread C is a monitoring process trying to read every /proc/pid/maps
>   in the system. This requires acquiring the mmap_sem for read. Thread C
>   blocks behind B, waiting for A to release the rwsem.  If thread C
>   could be allowed to run in parallel with A, it would probably get done
>   long before thread A's disk access completes, thus not actually slowing
>   down thread B.
> 
> Test results with down_read_critical_test (10 seconds):

Did you try queueing based on scheduling priorities in some way to do
this? It seems like you're setting it up so Thread C is more important
than Thread B, but you're using code to dictate that instead of scheduling
priorities. It would make more sense to me if the threads had priorities
to dictate what's "critical" and what's not.

Daniel


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal
  2010-05-24 20:31 [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Michel Lespinasse
                   ` (10 preceding siblings ...)
  2010-05-24 20:31 ` [PATCH 11/11] Use down_read_critical() for /proc/<pid>/exe and /proc/<pid>/maps files Michel Lespinasse
@ 2010-05-25  8:47 ` Peter Zijlstra
  2010-05-25  9:12   ` Michel Lespinasse
  2010-05-25 15:16   ` Linus Torvalds
  11 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2010-05-25  8:47 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Mon, 2010-05-24 at 13:31 -0700, Michel Lespinasse wrote:
> This is version 4 of my rwsem changes.
> 
> Changes since V3:
> 
> - Split first x86 rwsem change in two smaller parts
> - Added David's rwsem_waiter_type enum suggestion into
>   'rwsem: down_read_critical infrastructure support'
> - Lots of minor style fixes and comments clarified
> 
> I would hope the entire series can be considered for inclusion;
> however if we can not agree on the down_read_critical() bits I would
> still like patches 1-7 to be independently considered as I think they
> still have merit on their own.
> 
> Michel Lespinasse (11):
>   x86 rwsem: stay on fast path when count>0 in __up_write()
>   x86 rwsem: minor cleanups
>   rwsem: fully separate code paths to wake writers vs readers
>   rwsem: lighter active count checks when waking up readers
>   rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads
>   rwsem: wake queued readers when writer blocks on active read lock
>   rwsem: smaller wrappers around rwsem_down_failed_common
>   generic rwsem: implement down_read_critical() / up_read_critical()
>   rwsem: down_read_critical infrastructure support
>   x86 rwsem: down_read_critical implementation
>   Use down_read_critical() for /proc/<pid>/exe and /proc/<pid>/maps files


So what happened to those patches that dropped mmap_sem during I/O?

I really don't like people tinkering with the lock implementations like
this. Nor do I like the naming, stats are in no way _critical_.

I really think adding something like this utterly defeats the purpose of
having a fair lock.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal
  2010-05-25  8:47 ` [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Peter Zijlstra
@ 2010-05-25  9:12   ` Michel Lespinasse
  2010-05-25  9:27     ` Peter Zijlstra
  2010-05-25 15:16   ` Linus Torvalds
  1 sibling, 1 reply; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-25  9:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Tue, May 25, 2010 at 1:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> So what happened to those patches that dropped mmap_sem during I/O?

Yes, we do have patches trying to release the mmap_sem when a page
fault for a file backed VMA blocks on accessing the corresponding
file. We have not given up on these, and we intend to try submitting
them again. However, these patches do *not* address the case of a page
fault blocking while trying to get a free page (i.e. when you get
under high memory pressure).

> I really don't like people tinkering with the lock implementations like
> this. Nor do I like the naming, stats are in no way _critical_.

Critical here refers to the fact that you're not allowed to block
while holding the unfairly acquired rwsem.
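
A usage sketch, with the API from patches 8-10 (as implemented, preemption
is disabled between the two calls, so blocking there would be a bug):

	down_read_critical(&mm->mmap_sem);
	/* read-only, non-blocking work only, e.g. walking the vma list */
	up_read_critical(&mm->mmap_sem);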

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal
  2010-05-25  9:12   ` Michel Lespinasse
@ 2010-05-25  9:27     ` Peter Zijlstra
  2010-05-26  1:23       ` KAMEZAWA Hiroyuki
  2010-05-27 10:59       ` Michel Lespinasse
  0 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2010-05-25  9:27 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Tue, 2010-05-25 at 02:12 -0700, Michel Lespinasse wrote:
> On Tue, May 25, 2010 at 1:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > So what happened to those patches that dropped mmap_sem during I/O?
> 
> Yes, we do have patches trying to release the mmap_sem when a page
> fault for a file backed VMA blocks on accessing the corresponding
> file. We have not given up on these, and we intend to try submitting
> them again. However, these patches do *not* address the case of a page
> fault blocking while trying to get a free page (i.e. when you get
> under high memory pressure).

But I guess they could, right? Simply make the allocation under mmap_sem
be __GFP_HARDWALL|__GFP_HIGHMEM|__GFP_MOVABLE|__GFP_NOWARN or
(GFP_HIGHUSER_MOVABLE & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))|__GFP_NOWARN

and drop the mmap_sem when that fails.
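
Roughly like this, I mean (only a sketch, not a tested patch; the retry
and revalidation details would live in the fault path):

	struct page *page;

	/* opportunistic non-blocking attempt while holding mmap_sem */
	page = alloc_page((GFP_HIGHUSER_MOVABLE &
			   ~(__GFP_WAIT|__GFP_IO|__GFP_FS)) | __GFP_NOWARN);
	if (!page) {
		up_read(&mm->mmap_sem);
		page = alloc_page(GFP_HIGHUSER_MOVABLE);	/* may block */
		down_read(&mm->mmap_sem);
		/* the vma must be revalidated after reacquiring mmap_sem */
	}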

> > I really don't like people tinkering with the lock implementations like
> > this. Nor do I like the naming, stats are in no way _critical_.
> 
> Critical here refers to the fact that you're not allowed to block
> while holding the unfairly acquired rwsem.

We usually call that atomic, your 0/n patch didn't explain any of that.

Also, do you really think doing something like:

        /*
         * Check the vma index is within the range and do
         * sequential scan until m_index.
         */
        vma = NULL;
        if ((unsigned long)l < mm->map_count) {
                vma = mm->mmap;
                while (l-- && vma)
                        vma = vma->vm_next;
                goto out;
        }

with preemption disabled is a _good_ thing?

People were talking about raising our vma limit of 64k...

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal
  2010-05-25  8:47 ` [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Peter Zijlstra
  2010-05-25  9:12   ` Michel Lespinasse
@ 2010-05-25 15:16   ` Linus Torvalds
  1 sibling, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2010-05-25 15:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michel Lespinasse, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han



On Tue, 25 May 2010, Peter Zijlstra wrote:
> 
> So what happened to those patches that dropped mmap_sem during I/O?

They were horrible. The amount of retrying was just nasty. I don't think 
it's going to ever be a viable approach.

Feel free to prove me wrong with clean patches, but I doubt you can.

		Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal
  2010-05-25  9:27     ` Peter Zijlstra
@ 2010-05-26  1:23       ` KAMEZAWA Hiroyuki
  2010-05-27 11:02         ` Michel Lespinasse
  2010-05-27 10:59       ` Michel Lespinasse
  1 sibling, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-05-26  1:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michel Lespinasse, Linus Torvalds, David Howells, Ingo Molnar,
	Thomas Gleixner, LKML, Andrew Morton, Mike Waychison,
	Suleiman Souhlal, Ying Han

On Tue, 25 May 2010 11:27:55 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, 2010-05-25 at 02:12 -0700, Michel Lespinasse wrote:
> > On Tue, May 25, 2010 at 1:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > So what happened to those patches that dropped mmap_sem during I/O?
> > 
> > Yes, we do have patches trying to release the mmap_sem when a page
> > fault for a file backed VMA blocks on accessing the corresponding
> > file. We have not given up on these, and we intend to try submitting
> > them again. However, these patches do *not* address the case of a page
> > fault blocking while trying to get a free page (i.e. when you get
> > under high memory pressure).
> 
> But I guess they could, right? Simply make the allocation under mmap_sem
> be __GFP_HARDWALL|__GFP_HIGHMEM|__GFP_MOVABLE|__GFP_NOWARN or
> (GFP_HIGHUSER_MOVABLE & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))|__GFP_NOWARN
> 
> and drop the mmap_sem when that fails.
> 
> > > I really don't like people tinkering with the lock implementations like
> > > this. Nor do I like the naming, stats are in no way _critical_.
> > 
> > Critical here refers to the fact that you're not allowed to block
> > while holding the unfairly acquired rwsem.
> 
> We usually call that atomic, your 0/n patch didn't explain any of that.
> 
> Also, do you really think doing something like:
> 
>         /*
>          * Check the vma index is within the range and do
>          * sequential scan until m_index.
>          */
>         vma = NULL;
>         if ((unsigned long)l < mm->map_count) {
>                 vma = mm->mmap;
>                 while (l-- && vma)
>                         vma = vma->vm_next;
>                 goto out;
>         }
> 
> with preemption disabled is a _good_ thing?
> 
> People were talking about raising our vma limit of 64k...

Yes, at least, my customers want over 64k vmas ;)

Hmm.. can't we use some kind of speculative lookup for reading maps?
(as we played with several months ago...)

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal
  2010-05-25  9:27     ` Peter Zijlstra
  2010-05-26  1:23       ` KAMEZAWA Hiroyuki
@ 2010-05-27 10:59       ` Michel Lespinasse
  2010-05-27 11:18         ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-27 10:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Tue, May 25, 2010 at 11:27:55AM +0200, Peter Zijlstra wrote:
> On Tue, 2010-05-25 at 02:12 -0700, Michel Lespinasse wrote:
> > Yes, we do have patches trying to release the mmap_sem when a page
> > fault for a file backed VMA blocks on accessing the corresponding
> > file. We have not given up on these, and we intend to try submitting
> > them again. However, these patches do *not* address the case of a page
> > fault blocking while trying to get a free page (i.e. when you get
> > under high memory pressure).
> 
> But I guess they could, right? Simply make the allocation under mmap_sem
> be __GFP_HARDWALL|__GFP_HIGHMEM|__GFP_MOVABLE|__GFP_NOWARN or
> (GFP_HIGHUSER_MOVABLE & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))|__GFP_NOWARN
> 
> and drop the mmap_sem when that fails.

It's not clear to me whether this can lead to a clean, uncontroversial
solution. Doing this for file backed VMAs does not sound any harder in
principle, but we could not get it past Linus's NACK last time. I think
it's worth exploring again, but I don't expect it to be so easy :)

> > > I really don't like people tinkering with the lock implementations like
> > > this. Nor do I like the naming, stats are in no way _critical_.
> > 
> > Critical here refers to the fact that you're not allowed to block
> > while holding the unfairly acquired rwsem.
> 
> We usually call that atomic, your 0/n patch didn't explain any of that.

Would replacing the 'critical' name with 'atomic' address your concern
though, or would you remain fundamentally opposed to anything that involves
an unfair acquire path?

What about patches 1-7, which don't deal with the critical/atomic API;
can we agree to get these in before we figure out what to do with
the last 4?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal
  2010-05-26  1:23       ` KAMEZAWA Hiroyuki
@ 2010-05-27 11:02         ` Michel Lespinasse
  2010-05-27 11:45           ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-27 11:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Peter Zijlstra, Linus Torvalds, David Howells, Ingo Molnar,
	Thomas Gleixner, LKML, Andrew Morton, Mike Waychison,
	Suleiman Souhlal, Ying Han

On Wed, May 26, 2010 at 10:23:59AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 25 May 2010 11:27:55 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> > Also, do you really think doing something like:
> > 
> >         /*
> >          * Check the vma index is within the range and do
> >          * sequential scan until m_index.
> >          */
> >         vma = NULL;
> >         if ((unsigned long)l < mm->map_count) {
> >                 vma = mm->mmap;
> >                 while (l-- && vma)
> >                         vma = vma->vm_next;
> >                 goto out;
> >         }
> > 
> > with preemption disabled is a _good_ thing?
> > 
> > People were talking about raising our vma limit of 64k...

All right, this is clearly a problem then.

> Hmm.. can't we use some kind of speculative lookup for reading maps?
> (as we played with several months ago...)

Would you happen to have a link to that conversation / remember the
subject line of the thread?

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal
  2010-05-27 10:59       ` Michel Lespinasse
@ 2010-05-27 11:18         ` Peter Zijlstra
  2010-05-27 11:47           ` Michel Lespinasse
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2010-05-27 11:18 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Thu, 2010-05-27 at 03:59 -0700, Michel Lespinasse wrote:
> On Tue, May 25, 2010 at 11:27:55AM +0200, Peter Zijlstra wrote:
> > On Tue, 2010-05-25 at 02:12 -0700, Michel Lespinasse wrote:
> > > Yes, we do have patches trying to release the mmap_sem when a page
> > > fault for a file backed VMA blocks on accessing the corresponding
> > > file. We have not given up on these, and we intend to try submitting
> > > them again. However, these patches do *not* address the case of a page
> > > fault blocking while trying to get a free page (i.e. when you get
> > > under high memory pressure).
> > 
> > But I guess they could, right? Simply make the allocation under mmap_sem
> > be __GFP_HARDWALL|__GFP_HIGHMEM|__GFP_MOVABLE|__GFP_NOWARN or
> > (GFP_HIGHUSER_MOVABLE & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))|__GFP_NOWARN
> > 
> > and drop the mmap_sem when that fails.
> 
> It's not clear to me whether this can lead to a clean, uncontroversial
> solution. Doing this for file backed VMAs does not sound any harder in
> principle, but we could not get it past Linus's NACK last time. I think
> it's worth exploring again, but I don't expect it to be so easy :)

Right, I was planning to have another look, but my todo list is getting
out of hand again.

> > > > I really don't like people tinkering with the lock implementations like
> > > > this. Nor do I like the naming, stats are in no way _critical_.
> > > 
> > > Critical here refers to the fact that you're not allowed to block
> > > while holding the unfairly acquired rwsem.
> > 
> > We usually call that atomic, your 0/n patch didn't explain any of that.
> 
> Would replacing the 'critical' name with 'atomic' address your concern
> though, or would you remain fundamentally opposed to anything that involves
> an unfair acquire path?

Yeah, I don't think it's a good idea to add unfairness to the lock; such
things tend to backfire and generate starvation cases. For instance, a
tight loop around your /proc/$pid/maps modification will starve $pid from
doing anything requiring mmap_sem for writing.
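
Concretely, I mean something as simple as this (hypothetical user-space
sketch; the pid is hardcoded purely for illustration):

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];

		for (;;) {
			/* each pass takes the target's mmap_sem for read
			 * (unfairly, via down_read_critical) while the
			 * vma list is walked */
			int fd = open("/proc/1234/maps", O_RDONLY);
			if (fd < 0)
				return 1;
			while (read(fd, buf, sizeof(buf)) > 0)
				;
			close(fd);
		}
	}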

> What about patches 1-7 which don't deal with the critical/atomic API;
> can we agree to get these in before we we figure out what to do with
> the the last 4 ?

Patch 6 looks like it will break fairness; either that, or the changelog
got me totally confused (I already had to read it twice).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal
  2010-05-27 11:02         ` Michel Lespinasse
@ 2010-05-27 11:45           ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2010-05-27 11:45 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: KAMEZAWA Hiroyuki, Linus Torvalds, David Howells, Ingo Molnar,
	Thomas Gleixner, LKML, Andrew Morton, Mike Waychison,
	Suleiman Souhlal, Ying Han

On Thu, 2010-05-27 at 04:02 -0700, Michel Lespinasse wrote:
> > Hmm.. can't we use some kind of speculative lookup for reading maps?
> > (as we played with several months ago...)
> 
> Would you happen to have a link to that conversation / remember the
> subject line of the thread?

http://lkml.org/lkml/2009/12/24/153

My last series of patches lives here:

http://programming.kicks-ass.net/kernel-patches/speculative-fault/



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal
  2010-05-27 11:18         ` Peter Zijlstra
@ 2010-05-27 11:47           ` Michel Lespinasse
  0 siblings, 0 replies; 36+ messages in thread
From: Michel Lespinasse @ 2010-05-27 11:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Thu, May 27, 2010 at 4:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-05-27 at 03:59 -0700, Michel Lespinasse wrote:
>> What about patches 1-7, which don't deal with the critical/atomic API;
>> can we agree to get these in before we figure out what to do with
>> the last 4?
>
> Patch 6 looks like it will break fairness; either that, or the changelog
> got me totally confused (I already had to read it twice).

Hmmm. I must be bad at explaining this :)

It avoids a situation where threads get blocked waiting for all active
readers to release the lock, but (due to a race condition when blocking)
the first queued thread is also a reader. Letting that reader proceed in
parallel with the currently active ones will not delay queued writers;
they would have had to wait for the first queued reader to proceed
anyway, since they were queued behind it.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 11/11] Use down_read_critical() for /proc/<pid>/exe and /proc/<pid>/maps files
  2010-05-24 20:31 ` [PATCH 11/11] Use down_read_critical() for /proc/<pid>/exe and /proc/<pid>/maps files Michel Lespinasse
  2010-05-24 21:12   ` Daniel Walker
@ 2010-05-27 14:12   ` Arjan van de Ven
  1 sibling, 0 replies; 36+ messages in thread
From: Arjan van de Ven @ 2010-05-27 14:12 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Mon, 24 May 2010 13:31:21 -0700
Michel Lespinasse <walken@google.com> wrote:

> This helps in the following situation:
> - Thread A takes a page fault while reading or writing memory.
>   do_page_fault() acquires the mmap_sem for read and blocks on disk
>   (either reading the page from file, or hitting swap) for a long
> time.
> - Thread B does an mmap call and blocks trying to acquire the mmap_sem
>   for write
> - Thread C is a monitoring process trying to read every /proc/pid/maps
>   in the system. This requires acquiring the mmap_sem for read.
> Thread C blocks behind B, waiting for A to release the rwsem.  If
> thread C could be allowed to run in parallel with A, it would
> probably get done long before thread A's disk access completes, thus
> not actually slowing down thread B.
> 
> Test results with down_read_critical_test (10 seconds):


this is a really bad idea btw

we've had many issues in the past, when this was an unfair lock, where
"top" or other similar things caused basically a DoS...
now any process that can get to /proc/<pid>/exe or maps can do this in
a tight enough loop that the actual process will never get the lock
for write. BAD IDEA ;-)


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/11] rwsem: wake queued readers when writer blocks on  active read lock
  2010-05-24 20:31 ` [PATCH 06/11] rwsem: wake queued readers when writer blocks on active read lock Michel Lespinasse
@ 2010-08-12  1:24   ` Tony Luck
  2010-08-12  5:02     ` Michel Lespinasse
  0 siblings, 1 reply; 36+ messages in thread
From: Tony Luck @ 2010-08-12  1:24 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Mon, May 24, 2010 at 1:31 PM, Michel Lespinasse <walken@google.com> wrote:
> This change addresses the following situation:
>
> - Thread A acquires the rwsem for read
> - Thread B tries to acquire the rwsem for write, notices there is already
>  an active owner for the rwsem.
> - Thread C tries to acquire the rwsem for read, notices that thread B already
>  tried to acquire it.
> - Thread C grabs the spinlock and queues itself on the wait queue.
> - Thread B grabs the spinlock and queues itself behind C. At this point A is
>  the only remaining active owner on the rwsem.
>
> In this situation thread B could notice that it was the last active writer
> on the rwsem, and decide to wake C to let it proceed in parallel with A
> since they both only want the rwsem for read.
>
> Signed-off-by: Michel Lespinasse <walken@google.com>

Linus' tree this morning[1] was behaving badly on ia64 ... processes would wander
off into some unkillable state ... and since this happened to processes starting
from rc*.d I couldn't get the system up to a login prompt. System is a 32-way
(4 sockets * quad-core * hyperthread).

git bisect pins the blame on this change (commit 424acaaeb...).
Reverting it (and its successor a8618a0e - because I assumed that it
depended on 424...) gives me a kernel that works fine.

Not sure what is wrong with this change. Maybe ia64 needs some more memory
ordering bits than the changed code provides? I can dig into it a bit harder
tomorrow, but I thought you'd like an early heads-up in case anyone else is
seeing similar problems.

-Tony

[1] N.B. I've been on vacation - so this morning's tree was the first
I'd tried since 2.6.35

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/11] rwsem: wake queued readers when writer blocks on  active read lock
  2010-08-12  1:24   ` Tony Luck
@ 2010-08-12  5:02     ` Michel Lespinasse
  2010-08-12  5:09       ` Michel Lespinasse
  2010-08-12 16:22       ` Tony Luck
  0 siblings, 2 replies; 36+ messages in thread
From: Michel Lespinasse @ 2010-08-12  5:02 UTC (permalink / raw)
  To: Tony Luck
  Cc: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Wed, Aug 11, 2010 at 6:24 PM, Tony Luck <tony.luck@intel.com> wrote:
> Linus' tree this morning[1] was behaving badly on ia64 ... processes would wander
> off into some unkillable state ... and since this happened to processes starting
> from rc*.d I couldn't get the system up to a login prompt. System is a 32-way
> (4 sockets * quad-core * hyperthread).
>
> git bisect pins the blame on this change (commit 424acaaeb...).
> Reverting it (and its successor a8618a0e - because I assumed that it
> depended on 424...) gives me a kernel that works fine.

Thanks for the report. FYI, a8618a0e does not depend on 424acaae, so it
should be fine if you only revert 424acaae.

> Not sure what is wrong with this change. Maybe ia64 needs some more memory
> ordering bits than the changed code provides? I can dig into it a bit harder
> tomorrow, but I thought you'd like an early heads-up in case anyone else is
> seeing similar problems.

In arch/ia64/include/asm/rwsem.h I see RWSEM_WAITING_BIAS defined as
-__IA64_UL_CONST(0x0000000100000000)

This makes it a large, positive unsigned value. This is probably
throwing off the rwsem_atomic_update(0, sem) < RWSEM_WAITING_BIAS
comparison in my patch (supposed to be long versus long, but actually
is long versus unsigned long on ia64).

Also, it looks like ia64 uses intrinsics for the atomic accesses, not
asm (that I could see in a 5 minute look around), so maybe one could
just get rid of the __IA64_UL_CONST macros ???

I cannot compile or test on ia64, but could you report what happens
if you replace the #defines in arch/ia64/include/asm/rwsem.h with:
#define RWSEM_UNLOCKED_VALUE __IA64_UL_CONST(0x0000000000000000)
#define RWSEM_ACTIVE_BIAS (1L)
#define RWSEM_ACTIVE_MASK (0xffffffffL)
#define RWSEM_WAITING_BIAS (-0x100000000L)
#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
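
For reference, the pitfall in miniature (hypothetical demo code for a
64-bit machine, not from the kernel):

	#include <assert.h>

	int main(void)
	{
		long count = 0;
		unsigned long bias = -0x100000000UL;	/* 0xffffffff00000000 */

		/* count is converted to unsigned long, so the test compares
		 * 0 < 0xffffffff00000000 and is true, although the intended
		 * signed comparison 0 < -0x100000000L is false */
		assert(count < bias);
		assert(!(count < (long)bias));
		return 0;
	}

This is exactly how the rwsem_atomic_update(0, sem) < RWSEM_WAITING_BIAS
test misfires with the unsigned definition.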


Cheers,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/11] rwsem: wake queued readers when writer blocks on  active read lock
  2010-08-12  5:02     ` Michel Lespinasse
@ 2010-08-12  5:09       ` Michel Lespinasse
  2010-08-12 15:57         ` Linus Torvalds
  2010-08-12 16:22       ` Tony Luck
  1 sibling, 1 reply; 36+ messages in thread
From: Michel Lespinasse @ 2010-08-12  5:09 UTC (permalink / raw)
  To: Tony Luck
  Cc: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Wed, Aug 11, 2010 at 10:02 PM, Michel Lespinasse <walken@google.com> wrote:
> In arch/ia64/include/asm/rwsem.h I see RWSEM_WAITING_BIAS defined as
> -__IA64_UL_CONST(0x0000000100000000)
>
> This makes it a large, positive unsigned value. This is probably
> throwing off the rwsem_atomic_update(0, sem) < RWSEM_WAITING_BIAS
> comparison in my patch (supposed to be long versus long, but actually
> is long versus unsigned long on ia64).

FYI, I just verified that RWSEM_WAITING_BIAS is defined as signed on
all architectures except for ia64. So, this would be consistent with
the issue being observed only on ia64.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/11] rwsem: wake queued readers when writer blocks on active read lock
  2010-08-12  5:09       ` Michel Lespinasse
@ 2010-08-12 15:57         ` Linus Torvalds
  2010-08-12 16:24           ` Tony Luck
  0 siblings, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2010-08-12 15:57 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Tony Luck, David Howells, Ingo Molnar, Thomas Gleixner, LKML,
	Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Wed, Aug 11, 2010 at 10:09 PM, Michel Lespinasse <walken@google.com> wrote:
> On Wed, Aug 11, 2010 at 10:02 PM, Michel Lespinasse <walken@google.com> wrote:
>> In arch/ia64/include/asm/rwsem.h I see RWSEM_WAITING_BIAS defined as
>> -__IA64_UL_CONST(0x0000000100000000)
>>
>> This makes it a large, positive unsigned value. This is probably
>> throwing off the rwsem_atomic_update(0, sem) < RWSEM_WAITING_BIAS
>> comparison in my patch (supposed to be long versus long, but actually
>> is long versus unsigned long on ia64).
>
> FYI, I just verified that RWSEM_WAITING_BIAS is defined as signed on
> all architectures except for ia64. So, this would be consistent with
> the issue being observed only on ia64.

Good.

Tony, can you verify that just dropping the "__IA64_UL_CONST" makes
things work for you? Having a non-working rwsem makes me really
nervous, but considering the lack of reports on other architectures, I
really do hope this is just a trivial ia64 signedness difference.

                              Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/11] rwsem: wake queued readers when writer blocks on active read lock
  2010-08-12  5:02     ` Michel Lespinasse
  2010-08-12  5:09       ` Michel Lespinasse
@ 2010-08-12 16:22       ` Tony Luck
  1 sibling, 0 replies; 36+ messages in thread
From: Tony Luck @ 2010-08-12 16:22 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Wed, Aug 11, 2010 at 10:02 PM, Michel Lespinasse <walken@google.com> wrote:
> #define RWSEM_UNLOCKED_VALUE __IA64_UL_CONST(0x0000000000000000)
> #define RWSEM_ACTIVE_BIAS (1L)
> #define RWSEM_ACTIVE_MASK (0xffffffffL)
> #define RWSEM_WAITING_BIAS (-0x100000000L)
> #define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
> #define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)

Thanks - that fixes it for me.

-Tony

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/11] rwsem: wake queued readers when writer blocks on active read lock
  2010-08-12 15:57         ` Linus Torvalds
@ 2010-08-12 16:24           ` Tony Luck
  2010-08-12 22:23             ` Tony Luck
  0 siblings, 1 reply; 36+ messages in thread
From: Tony Luck @ 2010-08-12 16:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michel Lespinasse, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Thu, Aug 12, 2010 at 8:57 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Tony, can you verify that just dropping the "__IA64_UL_CONST" makes
> things work for you? Having a non-working rwsem makes me really
> nervous, but considering the lack of reports on other architectures, I
> really do hope this is just a trivial ia64 signedness difference.

Yes it does. Please pull from

 git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6.git release

to get this fix.

-Tony

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/11] rwsem: wake queued readers when writer blocks on active read lock
  2010-08-12 16:24           ` Tony Luck
@ 2010-08-12 22:23             ` Tony Luck
  2010-08-13 16:09               ` Tony Luck
  0 siblings, 1 reply; 36+ messages in thread
From: Tony Luck @ 2010-08-12 22:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michel Lespinasse, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Thu, Aug 12, 2010 at 9:24 AM, Tony Luck <tony.luck@intel.com> wrote:
> Yes it does. Please pull from
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6.git release

I spoke too soon. Life is much better (boots with no issues). But when
using this kernel to rebuild all my different configs, it locked up on
the third "make -j32". That is to say, the build stopped making progress.
The system is still alive and can run other things.

top(1) shows few processes consuming cpu time. There's a pile of defunct
processes, and one of the "make" processes is stuck in "D" wait state.

I'm going to try the current kernel with 424acaae reverted (and with
the ia64 rwsem.h change reverted too) to check whether this is still the
same root cause - or whether I need to look for some other problem.

-Tony

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/11] rwsem: wake queued readers when writer blocks on active read lock
  2010-08-12 22:23             ` Tony Luck
@ 2010-08-13 16:09               ` Tony Luck
  2010-08-13 21:19                 ` Tony Luck
  0 siblings, 1 reply; 36+ messages in thread
From: Tony Luck @ 2010-08-13 16:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michel Lespinasse, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Thu, Aug 12, 2010 at 3:23 PM, Tony Luck <tony.luck@gmail.com> wrote:
> I'm going to try the current kernel with 424acaae reverted (and with
> the ia64 rwsem.h change reverted too) to check whether this is still the
> same root cause

This kernel ran "make clean ; make -j32" all night long. It's completed 415
cycles with no apparent problems. Average cycle time is 147.2 seconds.
Minimum is 144, max is 158 ... which all looks very normal.

It's probably a good thing that the unsigned RWSEM_WAITING_BIAS problem
led me to this commit via a failure that showed up during boot. I'd hate
to be bisecting this based on a problem that only shows up after an
unknown number of kernel builds :-)

Still no reports from other architectures? So this still seems to be an
ia64-specific problem. Time to start reading rwsem.c

-Tony

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/11] rwsem: wake queued readers when writer blocks on active read lock
  2010-08-13 16:09               ` Tony Luck
@ 2010-08-13 21:19                 ` Tony Luck
  2010-08-13 23:38                   ` Tony Luck
  0 siblings, 1 reply; 36+ messages in thread
From: Tony Luck @ 2010-08-13 21:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michel Lespinasse, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Fri, Aug 13, 2010 at 9:09 AM, Tony Luck <tony.luck@intel.com> wrote:
> Still no reports from other architectures? So this still seems to be an
> ia64-specific problem. Time to start reading rwsem.c

My hung process is "make" (GNU Make 3.81). Stuck forever here:

SYSCALL_DEFINE1(brk, unsigned long, brk)
{
        unsigned long rlim, retval;
        unsigned long newbrk, oldbrk;
        struct mm_struct *mm = current->mm;
        unsigned long min_brk;

        down_write(&mm->mmap_sem);  <<<<<<<<<<<<<<<

Is make multi-threaded these days? If not, I'm confused about how anything
complicated could have happened on mmap_sem.

-Tony

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/11] rwsem: wake queued readers when writer blocks on active read lock
  2010-08-13 21:19                 ` Tony Luck
@ 2010-08-13 23:38                   ` Tony Luck
  2010-08-13 23:48                     ` Michel Lespinasse
  0 siblings, 1 reply; 36+ messages in thread
From: Tony Luck @ 2010-08-13 23:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michel Lespinasse, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

Aha! I think I found it.

This embarrassing declaration in ia64's asm/atomic.h:

static __inline__ int
ia64_atomic64_add (__s64 i, atomic64_t *v)

looks to be the key.  Obviously it would be better to return all
64 bits of the answer using "long" rather than just 32 bits with "int".
The critical change in the rwsem code that exposed this silliness
is:

-           (rwsem_atomic_update(0, sem) & RWSEM_ACTIVE_MASK))
-               /* Someone grabbed the sem already */
+           rwsem_atomic_update(0, sem) < RWSEM_WAITING_BIAS)
+               /* Someone grabbed the sem for write already */

i.e. the old code only looked at the low 32-bits of the return value
(RWSEM_ACTIVE_MASK is 0xffffffff) - so the truncation didn't
matter.

ia64_atomic64_add() has been broken like this since before the
dawn of git time in 2.6.12. Obviously we haven't been using
atomic64_t much.
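
The obvious fix is simply to widen the return type, i.e. (sketch of the
one-line change):

	static __inline__ long
	ia64_atomic64_add (__s64 i, atomic64_t *v)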

Running a new test now. 12 iterations so far (which is slightly
further than this test usually gets ... but I'll let it run for a few
more hours before declaring victory).

-Tony

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/11] rwsem: wake queued readers when writer blocks on active read lock
  2010-08-13 23:38                   ` Tony Luck
@ 2010-08-13 23:48                     ` Michel Lespinasse
  0 siblings, 0 replies; 36+ messages in thread
From: Michel Lespinasse @ 2010-08-13 23:48 UTC (permalink / raw)
  To: Tony Luck
  Cc: Linus Torvalds, David Howells, Ingo Molnar, Thomas Gleixner,
	LKML, Andrew Morton, Mike Waychison, Suleiman Souhlal, Ying Han

On Fri, Aug 13, 2010 at 4:38 PM, Tony Luck <tony.luck@gmail.com> wrote:
> Aha! I think I found it.
>
> This embarrassing declaration in ia64's asm/atomic.h:
>
> static __inline__ int
> ia64_atomic64_add (__s64 i, atomic64_t *v)
>
> looks to be the key.  Obviously it would be better to return all
> 64 bits of the answer using "long" rather than just 32 bits with "int".

Aha, good catch. Thanks for getting to the bottom of this; it would have
taken me forever to figure it out (and I could not test this).

BTW there seems to be the same issue in ia64_atomic64_sub() too
(though I wonder if that ever gets used :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2010-08-13 23:48 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-24 20:31 [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Michel Lespinasse
2010-05-24 20:31 ` [PATCH 01/11] x86 rwsem: stay on fast path when count>0 in __up_write() Michel Lespinasse
2010-05-24 20:31 ` [PATCH 02/11] x86 rwsem: minor cleanups Michel Lespinasse
2010-05-24 20:31 ` [PATCH 03/11] rwsem: fully separate code paths to wake writers vs readers Michel Lespinasse
2010-05-24 20:31 ` [PATCH 04/11] rwsem: lighter active count checks when waking up readers Michel Lespinasse
2010-05-24 20:49   ` Daniel Walker
2010-05-24 20:31 ` [PATCH 05/11] rwsem: let RWSEM_WAITING_BIAS represent any number of waiting threads Michel Lespinasse
2010-05-24 20:31 ` [PATCH 06/11] rwsem: wake queued readers when writer blocks on active read lock Michel Lespinasse
2010-08-12  1:24   ` Tony Luck
2010-08-12  5:02     ` Michel Lespinasse
2010-08-12  5:09       ` Michel Lespinasse
2010-08-12 15:57         ` Linus Torvalds
2010-08-12 16:24           ` Tony Luck
2010-08-12 22:23             ` Tony Luck
2010-08-13 16:09               ` Tony Luck
2010-08-13 21:19                 ` Tony Luck
2010-08-13 23:38                   ` Tony Luck
2010-08-13 23:48                     ` Michel Lespinasse
2010-08-12 16:22       ` Tony Luck
2010-05-24 20:31 ` [PATCH 07/11] rwsem: smaller wrappers around rwsem_down_failed_common Michel Lespinasse
2010-05-24 20:31 ` [PATCH 08/11] generic rwsem: implement down_read_critical() / up_read_critical() Michel Lespinasse
2010-05-24 20:31 ` [PATCH 09/11] rwsem: down_read_critical infrastructure support Michel Lespinasse
2010-05-24 20:31 ` [PATCH 10/11] x86 rwsem: down_read_critical implementation Michel Lespinasse
2010-05-24 20:31 ` [PATCH 11/11] Use down_read_critical() for /proc/<pid>/exe and /proc/<pid>/maps files Michel Lespinasse
2010-05-24 21:12   ` Daniel Walker
2010-05-27 14:12   ` Arjan van de Ven
2010-05-25  8:47 ` [PATCH 00/11] V4: rwsem changes + down_read_critical() proposal Peter Zijlstra
2010-05-25  9:12   ` Michel Lespinasse
2010-05-25  9:27     ` Peter Zijlstra
2010-05-26  1:23       ` KAMEZAWA Hiroyuki
2010-05-27 11:02         ` Michel Lespinasse
2010-05-27 11:45           ` Peter Zijlstra
2010-05-27 10:59       ` Michel Lespinasse
2010-05-27 11:18         ` Peter Zijlstra
2010-05-27 11:47           ` Michel Lespinasse
2010-05-25 15:16   ` Linus Torvalds
