* [PATCH -tip 0/4] locking/rwsem (xadd): Reader waiter optimizations
@ 2016-05-09  4:56 Davidlohr Bueso
  2016-05-09  4:56 ` [PATCH 1/4] locking/rwsem: Avoid stale ->count for rwsem_down_write_failed() Davidlohr Bueso
                   ` (4 more replies)
From: Davidlohr Bueso @ 2016-05-09  4:56 UTC (permalink / raw)
  To: mingo, peterz, tglx; +Cc: Waiman.Long, jason.low2, dave, linux-kernel

Hi,

This is a follow-up series from reviewing Waiman's reader-owned
state work[1]. While I have based it on -tip instead of that change,
I can certainly rebase the series in a future iteration.

Changes are mainly around reader-waiter optimizations, in no particular
order. The series has passed numerous DB benchmarks without things falling
apart; aim9 page allocation (page_test) results on an 8-core Westmere are
below:

aim9
         4.6-rc6				4.6-rc6
						rwsemv2
Min      page_test   378167.89 (  0.00%)   382613.33 (  1.18%)
Min      exec_test      499.00 (  0.00%)      502.67 (  0.74%)
Min      fork_test     3395.47 (  0.00%)     3537.64 (  4.19%)
Hmean    page_test   395433.06 (  0.00%)   414693.68 (  4.87%)
Hmean    exec_test      499.67 (  0.00%)      505.30 (  1.13%)
Hmean    fork_test     3504.22 (  0.00%)     3594.95 (  2.59%)
Stddev   page_test    17426.57 (  0.00%)    26649.92 (-52.93%)
Stddev   exec_test        0.47 (  0.00%)        1.41 (-199.05%)
Stddev   fork_test       63.74 (  0.00%)       32.59 ( 48.86%)
Max      page_test   429873.33 (  0.00%)   456960.00 (  6.30%)
Max      exec_test      500.33 (  0.00%)      507.66 (  1.47%)
Max      fork_test     3653.33 (  0.00%)     3650.90 ( -0.07%)

             4.6-rc6     4.6-rc6
                         rwsemv2
User            1.12        0.04
System          0.23        0.04
Elapsed       727.27      721.98

[1] http://permalink.gmane.org/gmane.linux.kernel/2216743

Thanks!

Davidlohr Bueso (4):
  locking/rwsem: Avoid stale ->count for rwsem_down_write_failed()
  locking/rwsem: Drop superfluous waiter refcount
  locking/rwsem: Enable lockless waiter wakeup(s)
  locking/rwsem: Rework zeroing reader waiter->task

 kernel/locking/rwsem-xadd.c | 74 ++++++++++++++++++++++++++-------------------
 1 file changed, 43 insertions(+), 31 deletions(-)

-- 
2.8.1


* [PATCH 1/4] locking/rwsem: Avoid stale ->count for rwsem_down_write_failed()
  2016-05-09  4:56 [PATCH -tip 0/4] locking/rwsem (xadd): Reader waiter optimizations Davidlohr Bueso
@ 2016-05-09  4:56 ` Davidlohr Bueso
  2016-05-09  5:36   ` Peter Hurley
  2016-05-09  4:56 ` [PATCH 2/4] locking/rwsem: Drop superfluous waiter refcount Davidlohr Bueso
                   ` (3 subsequent siblings)
From: Davidlohr Bueso @ 2016-05-09  4:56 UTC (permalink / raw)
  To: mingo, peterz, tglx
  Cc: Waiman.Long, jason.low2, dave, linux-kernel, Davidlohr Bueso

The field is obviously updated without the lock and needs a READ_ONCE()
while waiting for lock holder(s) to go away, just like we do with
all other ->count accesses.
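
To illustrate with a minimal sketch (not the patched code, which also
has barriers in the loop): absent any intervening barrier, a plain
load may be hoisted out of a polling loop or torn by the compiler,
whereas READ_ONCE() forces a single, untorn reload per iteration:

	while (sem->count & RWSEM_ACTIVE_MASK)		  /* may never reload */
		cpu_relax();

	while (READ_ONCE(sem->count) & RWSEM_ACTIVE_MASK) /* forced reload */
		cpu_relax();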

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/locking/rwsem-xadd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index df4dcb883b50..7d62772600cf 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -494,7 +494,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 			}
 			schedule();
 			set_current_state(state);
-		} while ((count = sem->count) & RWSEM_ACTIVE_MASK);
+		} while ((count = READ_ONCE(sem->count)) & RWSEM_ACTIVE_MASK);
 
 		raw_spin_lock_irq(&sem->wait_lock);
 	}
-- 
2.8.1


* [PATCH 2/4] locking/rwsem: Drop superfluous waiter refcount
  2016-05-09  4:56 [PATCH -tip 0/4] locking/rwsem (xadd): Reader waiter optimizations Davidlohr Bueso
  2016-05-09  4:56 ` [PATCH 1/4] locking/rwsem: Avoid stale ->count for rwsem_down_write_failed() Davidlohr Bueso
@ 2016-05-09  4:56 ` Davidlohr Bueso
  2016-05-09  7:39   ` Peter Zijlstra
  2016-05-09  4:56 ` [PATCH 3/4] locking/rwsem: Enable lockless waiter wakeup(s) Davidlohr Bueso
                   ` (2 subsequent siblings)
From: Davidlohr Bueso @ 2016-05-09  4:56 UTC (permalink / raw)
  To: mingo, peterz, tglx; +Cc: Waiman.Long, jason.low2, dave, linux-kernel

Read waiters are currently reference counted from the time they enter
the slowpath until the lock is released and the waiter is awoken. This
is fragile and superfluous considering everything occurs within down_read()
without returning to the caller, and the very nature of the primitive does
not suggest that the task can disappear from underneath us. In addition,
spurious wakeups can make the whole refcount useless as get_task_struct()
is only called when setting up the waiter.
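
For reference, the pairing being dropped (abridged from the current
code):

	/* rwsem_down_read_failed(): */
	waiter.task = tsk;
	get_task_struct(tsk);		/* pin the sleeping task */
	...
	/* __rwsem_do_wake(): */
	waiter->task = NULL;
	wake_up_process(tsk);
	put_task_struct(tsk);		/* unpin once the wakeup is issued */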

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
 kernel/locking/rwsem-xadd.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 7d62772600cf..b592bb48d880 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -197,7 +197,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 		smp_mb();
 		waiter->task = NULL;
 		wake_up_process(tsk);
-		put_task_struct(tsk);
 	} while (--loop);
 
 	sem->wait_list.next = next;
@@ -220,7 +219,6 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 	/* set up my own style of waitqueue */
 	waiter.task = tsk;
 	waiter.type = RWSEM_WAITING_FOR_READ;
-	get_task_struct(tsk);
 
 	raw_spin_lock_irq(&sem->wait_lock);
 	if (list_empty(&sem->wait_list))
-- 
2.8.1


* [PATCH 3/4] locking/rwsem: Enable lockless waiter wakeup(s)
  2016-05-09  4:56 [PATCH -tip 0/4] locking/rwsem (xadd): Reader waiter optimizations Davidlohr Bueso
  2016-05-09  4:56 ` [PATCH 1/4] locking/rwsem: Avoid stale ->count for rwsem_down_write_failed() Davidlohr Bueso
  2016-05-09  4:56 ` [PATCH 2/4] locking/rwsem: Drop superfluous waiter refcount Davidlohr Bueso
@ 2016-05-09  4:56 ` Davidlohr Bueso
  2016-05-09  7:48   ` Peter Zijlstra
  2016-05-09  4:56 ` [PATCH 4/4] locking/rwsem: Rework zeroing reader waiter->task Davidlohr Bueso
  2016-05-10  1:31 ` [PATCH -tip 0/4] locking/rwsem (xadd): Reader waiter optimizations Waiman Long
From: Davidlohr Bueso @ 2016-05-09  4:56 UTC (permalink / raw)
  To: mingo, peterz, tglx
  Cc: Waiman.Long, jason.low2, dave, linux-kernel, Davidlohr Bueso

As wake_qs gain users, we can teach rwsems about them such that
waiters can be awoken without holding the wait_lock. This applies
to both readers and writers, the former being the ideal candidate
as we can batch the wakeups, shortening the critical region that
much more -- e.g. a writer task blocking a bunch of tasks waiting
to service page-faults (mmap_sem readers).

In general, applying wake_qs to rwsem (xadd) is not difficult as
the wait_lock is intended to be released soon _anyway_, with the
exception of the writer slowpath proactively waking up any queued
readers when it sees that the lock is owned by a reader, in which
case we simply do the wakeups with the lock held (see the comment
in __rwsem_down_write_failed_common()).

As with other locking primitives, delaying the wakeup does allow,
at least in theory, the lock to be stolen in the case of writers;
however, no harm was observed from this (in fact lock stealing
tends to be a _good_ thing in most workloads), and the window is
tiny anyway.

Some page-fault (pft) and mmap_sem intensive benchmarks show a
fairly consistent reduction in system time (by up to ~8% and ~10%,
respectively) on a 2-socket, 12-core AMD box.
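
In short, the pattern (as in rwsem_wake() below) becomes:

	WAKE_Q(wake_q);

	raw_spin_lock_irqsave(&sem->wait_lock, flags);
	if (!list_empty(&sem->wait_list))
		sem = __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
	wake_up_q(&wake_q);	/* actual wakeups, after dropping wait_lock */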

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/locking/rwsem-xadd.c | 53 +++++++++++++++++++++++++++++++--------------
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index b592bb48d880..1b8c1285a2aa 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -114,12 +114,16 @@ enum rwsem_wake_type {
  *   - the 'active part' of count (&0x0000ffff) reached 0 (but may have changed)
  *   - the 'waiting part' of count (&0xffff0000) is -ve (and will still be so)
  * - there must be someone on the queue
- * - the spinlock must be held by the caller
+ * - the wait_lock must be held by the caller
+ * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
+ *   to actually wake up the blocked task(s), preferably when the wait_lock
+ *   is released
  * - woken process blocks are discarded from the list after having task zeroed
- * - writers are only woken if downgrading is false
+ * - writers are only marked woken if downgrading is false
  */
 static struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
+__rwsem_mark_wake(struct rw_semaphore *sem,
+		  enum rwsem_wake_type wake_type, struct wake_q_head *wake_q)
 {
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
@@ -129,12 +133,14 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
 	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
 		if (wake_type == RWSEM_WAKE_ANY)
-			/* Wake writer at the front of the queue, but do not
-			 * grant it the lock yet as we want other writers
-			 * to be able to steal it.  Readers, on the other hand,
-			 * will block as they will notice the queued writer.
+			/*
+			 * Mark writer at the front of the queue for wakeup.
+			 * Until the task is actually awoken later by the
+			 * caller, other writers are able to steal the lock.
+			 * Readers, on the other hand, will block as they
+			 * will notice the queued writer.
 			 */
-			wake_up_process(waiter->task);
+			wake_q_add(wake_q, waiter->task);
 		goto out;
 	}
 
@@ -196,12 +202,11 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 		 */
 		smp_mb();
 		waiter->task = NULL;
-		wake_up_process(tsk);
+		wake_q_add(wake_q, tsk);
 	} while (--loop);
 
 	sem->wait_list.next = next;
 	next->prev = &sem->wait_list;
-
  out:
 	return sem;
 }
@@ -215,6 +220,7 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 	long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
+	WAKE_Q(wake_q);
 
 	/* set up my own style of waitqueue */
 	waiter.task = tsk;
@@ -236,9 +242,10 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 	if (count == RWSEM_WAITING_BIAS ||
 	    (count > RWSEM_WAITING_BIAS &&
 	     adjustment != -RWSEM_ACTIVE_READ_BIAS))
-		sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
+		sem = __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
+	wake_up_q(&wake_q);
 
 	/* wait to be given the lock */
 	while (true) {
@@ -470,9 +477,19 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 		 * no active writers, the lock must be read owned; so we try to
 		 * wake any read locks that were queued ahead of us.
 		 */
-		if (count > RWSEM_WAITING_BIAS)
-			sem = __rwsem_do_wake(sem, RWSEM_WAKE_READERS);
-
+		if (count > RWSEM_WAITING_BIAS) {
+			WAKE_Q(wake_q);
+
+			sem = __rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
+			/*
+			 * The wakeup is normally called _after_ the wait_lock
+			 * is released, but given that we are proactively waking
+			 * readers we can deal with the wake_q overhead as it is
+			 * similar to releasing and taking the wait_lock again
+			 * for attempting rwsem_try_write_lock().
+			 */
+			wake_up_q(&wake_q);
+		}
 	} else
 		count = rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
 
@@ -526,6 +543,7 @@ __visible
 struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 {
 	unsigned long flags;
+	WAKE_Q(wake_q);
 
 	/*
 	 * If a spinner is present, it is not necessary to do the wakeup.
@@ -562,9 +580,10 @@ locked:
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
+		sem = __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+	wake_up_q(&wake_q);
 
 	return sem;
 }
@@ -579,14 +598,16 @@ __visible
 struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 {
 	unsigned long flags;
+	WAKE_Q(wake_q);
 
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
+		sem = __rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
 
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+	wake_up_q(&wake_q);
 
 	return sem;
 }
-- 
2.8.1


* [PATCH 4/4] locking/rwsem: Rework zeroing reader waiter->task
  2016-05-09  4:56 [PATCH -tip 0/4] locking/rwsem (xadd): Reader waiter optimizations Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2016-05-09  4:56 ` [PATCH 3/4] locking/rwsem: Enable lockless waiter wakeup(s) Davidlohr Bueso
@ 2016-05-09  4:56 ` Davidlohr Bueso
  2016-05-09  7:50   ` Peter Zijlstra
  2016-05-10  1:31 ` [PATCH -tip 0/4] locking/rwsem (xadd): Reader waiter optimizations Waiman Long
From: Davidlohr Bueso @ 2016-05-09  4:56 UTC (permalink / raw)
  To: mingo, peterz, tglx
  Cc: Waiman.Long, jason.low2, dave, linux-kernel, Davidlohr Bueso

Readers that are awoken will expect a nil ->task indicating
that a wakeup has occurred. There is a mismatch between the
smp_mb() and its documentation, in that the serialization is
done between reading the task and the nil store. Furthermore,
beyond overlapping loads and stores to waiter->task being
guaranteed to be ordered within that CPU, both wake_up_process()
originally and now wake_q_add() already imply barriers upon
successful calls, which serves the purpose of the comment.

Just do the store atomically with an xchg() and simplify the
whole thing. As mentioned, we can use relaxed semantics given
the barrier provided by wake_q_add(); there is no risk of the
store being reordered past the actual wakeup.
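
That is, relying on the barrier in wake_q_add() per the above:

	tsk = xchg_relaxed(&waiter->task, NULL); /* atomic but unordered */
	wake_q_add(wake_q, tsk);	/* successful add implies a barrier */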

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/locking/rwsem-xadd.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 1b8c1285a2aa..96e53cb4a4db 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -126,7 +126,6 @@ __rwsem_mark_wake(struct rw_semaphore *sem,
 		  enum rwsem_wake_type wake_type, struct wake_q_head *wake_q)
 {
 	struct rwsem_waiter *waiter;
-	struct task_struct *tsk;
 	struct list_head *next;
 	long oldcount, woken, loop, adjustment;
 
@@ -190,24 +189,18 @@ __rwsem_mark_wake(struct rw_semaphore *sem,
 	next = sem->wait_list.next;
 	loop = woken;
 	do {
+		struct task_struct *tsk;
+
 		waiter = list_entry(next, struct rwsem_waiter, list);
 		next = waiter->list.next;
-		tsk = waiter->task;
-		/*
-		 * Make sure we do not wakeup the next reader before
-		 * setting the nil condition to grant the next reader;
-		 * otherwise we could miss the wakeup on the other
-		 * side and end up sleeping again. See the pairing
-		 * in rwsem_down_read_failed().
-		 */
-		smp_mb();
-		waiter->task = NULL;
+
+		tsk = xchg_relaxed(&waiter->task, NULL);
 		wake_q_add(wake_q, tsk);
 	} while (--loop);
 
 	sem->wait_list.next = next;
 	next->prev = &sem->wait_list;
- out:
+out:
 	return sem;
 }
 
-- 
2.8.1


* Re: [PATCH 1/4] locking/rwsem: Avoid stale ->count for rwsem_down_write_failed()
  2016-05-09  4:56 ` [PATCH 1/4] locking/rwsem: Avoid stale ->count for rwsem_down_write_failed() Davidlohr Bueso
@ 2016-05-09  5:36   ` Peter Hurley
  2016-05-09  7:12     ` Peter Zijlstra
  2016-05-09 14:29     ` Davidlohr Bueso
From: Peter Hurley @ 2016-05-09  5:36 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, peterz, tglx, Waiman.Long, jason.low2, linux-kernel,
	Davidlohr Bueso

On 05/08/2016 09:56 PM, Davidlohr Bueso wrote:
> The field is obviously updated without the lock and needs a READ_ONCE()
> while waiting for lock holder(s) to go away, just like we do with
> all other ->count accesses.

This isn't actually fixing a bug because it's passed through
several full barriers which will force reloading from sem->count.

I think the patch is ok if you want it just for consistency anyway,
but please change $subject and changelog.

Regards,
Peter Hurley


> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>  kernel/locking/rwsem-xadd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index df4dcb883b50..7d62772600cf 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -494,7 +494,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
>  			}
>  			schedule();
>  			set_current_state(state);
> -		} while ((count = sem->count) & RWSEM_ACTIVE_MASK);
> +		} while ((count = READ_ONCE(sem->count)) & RWSEM_ACTIVE_MASK);
>  
>  		raw_spin_lock_irq(&sem->wait_lock);
>  	}
> 


* Re: [PATCH 1/4] locking/rwsem: Avoid stale ->count for rwsem_down_write_failed()
  2016-05-09  5:36   ` Peter Hurley
@ 2016-05-09  7:12     ` Peter Zijlstra
  2016-05-09 14:29     ` Davidlohr Bueso
From: Peter Zijlstra @ 2016-05-09  7:12 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Davidlohr Bueso, mingo, tglx, Waiman.Long, jason.low2,
	linux-kernel, Davidlohr Bueso

On Sun, May 08, 2016 at 10:36:21PM -0700, Peter Hurley wrote:
> On 05/08/2016 09:56 PM, Davidlohr Bueso wrote:
> > The field is obviously updated without the lock and needs a READ_ONCE()
> > while waiting for lock holder(s) to go away, just like we do with
> > all other ->count accesses.
> 
> This isn't actually fixing a bug because it's passed through
> several full barriers which will force reloading from sem->count.
> 
> I think the patch is ok if you want it just for consistency anyway,
> but please change $subject and changelog.

Agreed, and note that the READ_ONCE does prohibit load-tearing, while
the current code does not.

So in that respect the patched code is more strict. But yes, the current
code does not allow using a stale value of sem->count because, as
PeterH notes, we've just passed through at least the full memory barrier
implied by set_current_state().
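
To sketch where that barrier sits (from the loop being patched):

	schedule();
	set_current_state(state);	/* implies smp_mb(): the compiler and
					 * CPU must re-read sem->count below */
	} while ((count = sem->count) & RWSEM_ACTIVE_MASK);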


* Re: [PATCH 2/4] locking/rwsem: Drop superfluous waiter refcount
  2016-05-09  4:56 ` [PATCH 2/4] locking/rwsem: Drop superfluous waiter refcount Davidlohr Bueso
@ 2016-05-09  7:39   ` Peter Zijlstra
  2016-05-09 15:56     ` Davidlohr Bueso
From: Peter Zijlstra @ 2016-05-09  7:39 UTC (permalink / raw)
  To: Davidlohr Bueso; +Cc: mingo, tglx, Waiman.Long, jason.low2, linux-kernel

On Sun, May 08, 2016 at 09:56:08PM -0700, Davidlohr Bueso wrote:
> Read waiters are currently reference counted from the time they enter
> the slowpath until the lock is released and the waiter is awoken. This
> is fragile and superfluous considering everything occurs within down_read()
> without returning to the caller, and the very nature of the primitive does
> not suggest that the task can disappear from underneath us. In addition,
> spurious wakeups can make the whole refcount useless as get_task_struct()
> is only called when setting up the waiter.

So I think you're wrong here; imagine this:


	rwsem_down_read_failed()			rwsem_wake()
	  get_task_struct();
	  raw_spin_lock_irq(&wait_lock);
	  list_add_tail(&waiter.list, &wait_list);
	  raw_spin_unlock_irq(&wait_lock);
							  raw_spin_lock_irqsave(&wait_lock)
							  __rwsem_do_wake()
	  while (true) {
	    set_task_state(tsk, TASK_UNINTERRUPTIBLE);
							    waiter->task = NULL
	    if (!waiter.task) // true
	      break;

	  __set_task_state(tsk, TASK_RUNNING);

	do_exit();
							    wake_up_process(tsk); /* BOOM */


* Re: [PATCH 3/4] locking/rwsem: Enable lockless waiter wakeup(s)
  2016-05-09  4:56 ` [PATCH 3/4] locking/rwsem: Enable lockless waiter wakeup(s) Davidlohr Bueso
@ 2016-05-09  7:48   ` Peter Zijlstra
From: Peter Zijlstra @ 2016-05-09  7:48 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, tglx, Waiman.Long, jason.low2, linux-kernel, Davidlohr Bueso

On Sun, May 08, 2016 at 09:56:09PM -0700, Davidlohr Bueso wrote:
> @@ -129,12 +133,14 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
>  	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
>  	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
>  		if (wake_type == RWSEM_WAKE_ANY)
> -			/* Wake writer at the front of the queue, but do not
> -			 * grant it the lock yet as we want other writers
> -			 * to be able to steal it.  Readers, on the other hand,
> -			 * will block as they will notice the queued writer.
> +			/*
> +			 * Mark writer at the front of the queue for wakeup.
> +			 * Until the task is actually awoken later by the
> +			 * caller, other writers are able to steal the lock.
> +			 * Readers, on the other hand, will block as they
> +			 * will notice the queued writer.
>  			 */
> -			wake_up_process(waiter->task);
> +			wake_q_add(wake_q, waiter->task);

Thanks for fixing that comment; that bugged the hell out of me ;-)

>  		goto out;
>  	}
>  
> @@ -196,12 +202,11 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
>  		 */
>  		smp_mb();
>  		waiter->task = NULL;
> -		wake_up_process(tsk);
> +		wake_q_add(wake_q, tsk);


However, note that per the race in the previous email, this is too late
to acquire the tsk refcount. I think it'll work if you do wake_q_add()
_before_ the waiter->task = NULL.

On that same note; I think that you can replace:

	smp_mb();
	waiter->task = NULL;

with:

	smp_store_release(&waiter->task, NULL);

>  	} while (--loop);
>  
>  	sem->wait_list.next = next;
>  	next->prev = &sem->wait_list;


* Re: [PATCH 4/4] locking/rwsem: Rework zeroing reader waiter->task
  2016-05-09  4:56 ` [PATCH 4/4] locking/rwsem: Rework zeroing reader waiter->task Davidlohr Bueso
@ 2016-05-09  7:50   ` Peter Zijlstra
From: Peter Zijlstra @ 2016-05-09  7:50 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, tglx, Waiman.Long, jason.low2, linux-kernel, Davidlohr Bueso

On Sun, May 08, 2016 at 09:56:10PM -0700, Davidlohr Bueso wrote:
> Readers that are awoken will expect a nil ->task indicating
> that a wakeup has occurred. There is a mismatch between the
> smp_mb() and its documentation, in that the serialization is
> done between reading the task and the nil store. Furthermore,
> beyond overlapping loads and stores to waiter->task being
> guaranteed to be ordered within that CPU, both wake_up_process()
> originally and now wake_q_add() already imply barriers upon
> successful calls, which serves the purpose of the comment.
> 
> Just do the store atomically with an xchg() and simplify the
> whole thing. As mentioned, we can use relaxed semantics given
> the barrier provided by wake_q_add(); there is no risk of the
> store being reordered past the actual wakeup.

> @@ -190,24 +189,18 @@ __rwsem_mark_wake(struct rw_semaphore *sem,
>  	next = sem->wait_list.next;
>  	loop = woken;
>  	do {
> +		struct task_struct *tsk;
> +
>  		waiter = list_entry(next, struct rwsem_waiter, list);
>  		next = waiter->list.next;
> -		tsk = waiter->task;
> -		/*
> -		 * Make sure we do not wakeup the next reader before
> -		 * setting the nil condition to grant the next reader;
> -		 * otherwise we could miss the wakeup on the other
> -		 * side and end up sleeping again. See the pairing
> -		 * in rwsem_down_read_failed().
> -		 */
> -		smp_mb();
> -		waiter->task = NULL;
> +
> +		tsk = xchg_relaxed(&waiter->task, NULL);
>  		wake_q_add(wake_q, tsk);

Not a great fan of this patch; it again doesn't fix the race, and
smp_store_release() is a cheaper option on x86.


* Re: [PATCH 1/4] locking/rwsem: Avoid stale ->count for rwsem_down_write_failed()
  2016-05-09  5:36   ` Peter Hurley
  2016-05-09  7:12     ` Peter Zijlstra
@ 2016-05-09 14:29     ` Davidlohr Bueso
From: Davidlohr Bueso @ 2016-05-09 14:29 UTC (permalink / raw)
  To: Peter Hurley
  Cc: mingo, peterz, tglx, Waiman.Long, jason.low2, linux-kernel,
	Davidlohr Bueso

On Sun, 08 May 2016, Peter Hurley wrote:

>On 05/08/2016 09:56 PM, Davidlohr Bueso wrote:
>> The field is obviously updated without the lock and needs a READ_ONCE()
>> while waiting for lock holder(s) to go away, just like we do with
>> all other ->count accesses.
>
>This isn't actually fixing a bug because it's passed through
>several full barriers which will force reloading from sem->count.

Yes.

>
>I think the patch is ok if you want it just for consistency anyway,
>but please change $subject and changelog.

Yeah, I wasn't actually concerned about a specific bug; it was more
just for documentation and consistency. This code has been like this
forever, but it would still be good to have the READ_ONCE(). It is
slightly suboptimal to use, but I do not see any real impact either way.

Thanks,
Davidlohr


* Re: [PATCH 2/4] locking/rwsem: Drop superfluous waiter refcount
  2016-05-09  7:39   ` Peter Zijlstra
@ 2016-05-09 15:56     ` Davidlohr Bueso
  2016-05-09 16:11       ` Peter Zijlstra
From: Davidlohr Bueso @ 2016-05-09 15:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, tglx, Waiman.Long, jason.low2, linux-kernel

On Mon, 09 May 2016, Peter Zijlstra wrote:

>On Sun, May 08, 2016 at 09:56:08PM -0700, Davidlohr Bueso wrote:
>> Read waiters are currently reference counted from the time they enter
>> the slowpath until the lock is released and the waiter is awoken. This
>> is fragile and superfluous considering everything occurs within down_read()
>> without returning to the caller, and the very nature of the primitive does
>> not suggest that the task can disappear from underneath us. In addition,
>> spurious wakeups can make the whole refcount useless as get_task_struct()
>> is only called when setting up the waiter.
>
>So I think you're wrong here; imagine this:
>
>
>	rwsem_down_read_failed()			rwsem_wake()
>	  get_task_struct();
>	  raw_spin_lock_irq(&wait_lock);
>	  list_add_tail(&waiter.list, &wait_list);
>	  raw_spin_unlock_irq(&wait_lock);
>							  raw_spin_lock_irqsave(&wait_lock)
>							  __rwsem_do_wake()
>	  while (true) {
>	    set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>							    waiter->task = NULL
>	    if (!waiter.task) // true
>	      break;
>
>	  __set_task_state(tsk, TASK_RUNNING);
>
>	do_exit();
>							    wake_up_process(tsk); /* BOOM */

I may be missing something, but rwsem_down_read_failed() will not return until
after the wakeup is done by the rwsem_wake() thread. So racing with do_exit() isn't
going to occur because the task is still blocked at that point. This is even more
so with delaying the wakeup. Similarly, we don't do this for writers either, which
could also suffer from similar scenarios.

Thanks,
Davidlohr


* Re: [PATCH 2/4] locking/rwsem: Drop superfluous waiter refcount
  2016-05-09 15:56     ` Davidlohr Bueso
@ 2016-05-09 16:11       ` Peter Zijlstra
  2016-05-09 18:51         ` Davidlohr Bueso
From: Peter Zijlstra @ 2016-05-09 16:11 UTC (permalink / raw)
  To: Davidlohr Bueso; +Cc: mingo, tglx, Waiman.Long, jason.low2, linux-kernel

On Mon, May 09, 2016 at 08:56:07AM -0700, Davidlohr Bueso wrote:
> On Mon, 09 May 2016, Peter Zijlstra wrote:
> 
> >On Sun, May 08, 2016 at 09:56:08PM -0700, Davidlohr Bueso wrote:
> >>Read waiters are currently reference counted from the time they enter
> >>the slowpath until the lock is released and the waiter is awoken. This
> >>is fragile and superfluous considering everything occurs within down_read()
> >>without returning to the caller, and the very nature of the primitive does
> >>not suggest that the task can disappear from underneath us. In addition,
> >>spurious wakeups can make the whole refcount useless as get_task_struct()
> >>is only called when setting up the waiter.
> >
> >So I think you're wrong here; imagine this:
> >
> >
> >	rwsem_down_read_failed()			rwsem_wake()
> >	  get_task_struct();
> >	  raw_spin_lock_irq(&wait_lock);
> >	  list_add_tail(&waiter.list, &wait_list);
> >	  raw_spin_unlock_irq(&wait_lock);
> >							  raw_spin_lock_irqsave(&wait_lock)
> >							  __rwsem_do_wake()
> >	  while (true) {
> >	    set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> >							    waiter->task = NULL
> >	    if (!waiter.task) // true
> >	      break;
> >
> >	  __set_task_state(tsk, TASK_RUNNING);
> >
> >	do_exit();
> >							    wake_up_process(tsk); /* BOOM */
> 
> I may be missing something, but rwsem_down_read_failed() will not return until
> after the wakeup is done by the rwsem_wake() thread. 

The above never gets to schedule(), and even if it did, a spurious
wakeup could've happened, no?

> So racing with do_exit() isn't
> going to occur because the task is still blocked at that point. This is even more
> so with delaying the wakeup. Similarly, we don't do this for writers either, which
> could also suffer from similar scenarios.

The write side is different; it serializes on wait_lock. See how it
takes wait_lock again, after blocking, and removes itself from the
wait_list.

Readers do not do this, they rely on the waker to remove them, and
therefore suffer this problem.
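
An abridged sketch of the asymmetry (from the current slowpaths):

	/* writer slowpath: dequeues itself, under wait_lock */
	raw_spin_lock_irq(&sem->wait_lock);
	list_del(&waiter.list);
	raw_spin_unlock_irq(&sem->wait_lock);

	/* reader slowpath: the waker dequeues the waiter and does */
	waiter->task = NULL;
	wake_up_process(tsk);	/* tsk may be exiting without the task ref */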


* Re: [PATCH 2/4] locking/rwsem: Drop superfluous waiter refcount
  2016-05-09 16:11       ` Peter Zijlstra
@ 2016-05-09 18:51         ` Davidlohr Bueso
  2016-05-09 18:59           ` Peter Zijlstra
From: Davidlohr Bueso @ 2016-05-09 18:51 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, tglx, Waiman.Long, jason.low2, linux-kernel

On Mon, 09 May 2016, Peter Zijlstra wrote:

>> >So I think you're wrong here; imagine this:
>> >
>> >
>> >	rwsem_down_read_failed()			rwsem_wake()
>> >	  get_task_struct();
>> >	  raw_spin_lock_irq(&wait_lock);
>> >	  list_add_tail(&waiter.list, &wait_list);
>> >	  raw_spin_unlock_irq(&wait_lock);
>> >							  raw_spin_lock_irqsave(&wait_lock)
>> >							  __rwsem_do_wake()
>> >	  while (true) {
>> >	    set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>> >							    waiter->task = NULL
>> >	    if (!waiter.task) // true
>> >	      break;
>> >
>> >	  __set_task_state(tsk, TASK_RUNNING);
>> >
>> >	do_exit();
>> >							    wake_up_process(tsk); /* BOOM */
>>
>> I may be missing something, but rwsem_down_read_failed() will not return until
>> after the wakeup is done by the rwsem_wake() thread.
>
>The above never gets to schedule(), and even if it did, a spurious
>wakeup could've happened, no?

Ah indeed, you are most certainly correct. For some reason I was always
considering schedule() in the picture. Hmm, I'll have to think about this
some more, but given the small chance of a waiter actually seeing the nil
task at the first iteration, I'm wondering if we could just invert the code
and call schedule() before the task check. Saving the refcounts, otoh, would
serve _all_ reader waiters, but this would obviously need numbers...

Thanks,
Davidlohr


* Re: [PATCH 2/4] locking/rwsem: Drop superfluous waiter refcount
  2016-05-09 18:51         ` Davidlohr Bueso
@ 2016-05-09 18:59           ` Peter Zijlstra
From: Peter Zijlstra @ 2016-05-09 18:59 UTC (permalink / raw)
  To: Davidlohr Bueso; +Cc: mingo, tglx, Waiman.Long, jason.low2, linux-kernel

On Mon, May 09, 2016 at 11:51:18AM -0700, Davidlohr Bueso wrote:
> On Mon, 09 May 2016, Peter Zijlstra wrote:
> 
> >>>So I think you're wrong here; imagine this:
> >>>
> >>>
> >>>	rwsem_down_read_failed()			rwsem_wake()
> >>>	  get_task_struct();
> >>>	  raw_spin_lock_irq(&wait_lock);
> >>>	  list_add_tail(&waiter.list, &wait_list);
> >>>	  raw_spin_unlock_irq(&wait_lock);
> >>>							  raw_spin_lock_irqsave(&wait_lock)
> >>>							  __rwsem_do_wake()
> >>>	  while (true) {
> >>>	    set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> >>>							    waiter->task = NULL
> >>>	    if (!waiter.task) // true
> >>>	      break;
> >>>
> >>>	  __set_task_state(tsk, TASK_RUNNING);
> >>>
> >>>	do_exit();
> >>>							    wake_up_process(tsk); /* BOOM */
> >>
> >>I may be missing something, but rwsem_down_read_failed() will not return until
> >>after the wakeup is done by the rwsem_wake() thread.
> >
> >The above never gets to schedule(), and even if it did, a spurious
> >wakeup could've happened, no?
> 
> Ah indeed, you are most certainly correct. For some reason I was always
> considering schedule() in the picture. Hmm, I'll have to think about this
> some more, but given the small chance of a waiter actually seeing the nil
> task at the first iteration, I'm wondering if we could just invert the code
> and call schedule() before the task check. Saving the refcounts, otoh, would
> serve _all_ reader waiters, but this would obviously need numbers...

So with where you're going -- using wake_q, it naturally goes away if
you do:

	wake_q_add(&wake_q, tsk);
	smp_store_release(&waiter->task, NULL);

Because the wake_q already takes a task ref, and we'll not actually
issue the wakeup until after the waiter->task store.
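
Conceptually, the wake_q pins the task across the delay; a sketch of
what the wake_q helpers do internally:

	/* wake_q_add(): */
	get_task_struct(task);		/* reference held until the wakeup */
	...
	/* wake_up_q(), later: */
	wake_up_process(task);
	put_task_struct(task);		/* dropped only after the wakeup */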


* Re: [PATCH -tip 0/4] locking/rwsem (xadd): Reader waiter optimizations
  2016-05-09  4:56 [PATCH -tip 0/4] locking/rwsem (xadd): Reader waiter optimizations Davidlohr Bueso
                   ` (3 preceding siblings ...)
  2016-05-09  4:56 ` [PATCH 4/4] locking/rwsem: Rework zeroing reader waiter->task Davidlohr Bueso
@ 2016-05-10  1:31 ` Waiman Long
From: Waiman Long @ 2016-05-10  1:31 UTC (permalink / raw)
  To: Davidlohr Bueso; +Cc: mingo, peterz, tglx, jason.low2, linux-kernel

On 05/09/2016 12:56 AM, Davidlohr Bueso wrote:
> Hi,
>
> This is a follow-up series from reviewing Waiman's reader-owned
> state work[1]. While I have based it on -tip instead of that change,
> I can certainly rebase the series in a future iteration.

My patch was also based on -tip, but I used 4.6-rc1 for testing purposes.

Cheers,
Longman

