linux-kernel.vger.kernel.org archive mirror
* [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
@ 2018-11-29 12:50 Yongji Xie
  2018-11-29 13:12 ` Peter Zijlstra
  2019-01-21 11:28 ` [tip:locking/core] locking/rwsem: Fix (possible) missed wakeup tip-bot for Xie Yongji
  0 siblings, 2 replies; 50+ messages in thread
From: Yongji Xie @ 2018-11-29 12:50 UTC (permalink / raw)
  To: peterz, mingo, will.deacon
  Cc: linux-kernel, xieyongji, zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24

From: Xie Yongji <xieyongji@baidu.com>

Our system encountered a problem recently: khungtaskd detected some
processes hung on mmap_sem. The odd thing was that one task, which was
not on mmap_sem.wait_list, was still sleeping in rwsem_down_read_failed().
Through code inspection, we found a potential bug that can lead to this.

Imagine this:

Thread 1                                  Thread 2
                                          down_write();
rwsem_down_read_failed()
 raw_spin_lock_irq(&sem->wait_lock);
 list_add_tail(&waiter.list, &wait_list);
 raw_spin_unlock_irq(&sem->wait_lock);
                                          __up_write();
                                           rwsem_wake();
                                            __rwsem_mark_wake();
                                             wake_q_add();
                                             list_del(&waiter->list);
                                             waiter->task = NULL;
 while (true) {
  set_current_state(TASK_UNINTERRUPTIBLE);
  if (!waiter.task) // true
      break;
 }
 __set_current_state(TASK_RUNNING);

Now Thread 1 is queued in Thread 2's wake_q without having slept. Then
Thread 1 calls rwsem_down_read_failed() again because Thread 3 holds the
lock. If Thread 3 tries to queue Thread 1 before Thread 2 does the
wakeup, it will fail, and the wakeup is missed:

Thread 1                                  Thread 2      Thread 3
                                                        down_write();
rwsem_down_read_failed()
 raw_spin_lock_irq(&sem->wait_lock);
 list_add_tail(&waiter.list, &wait_list);
 raw_spin_unlock_irq(&sem->wait_lock);
                                                        __rwsem_mark_wake();
                                                         wake_q_add();
                                          wake_up_q();
                                                         waiter->task = NULL;
 while (true) {
  set_current_state(TASK_UNINTERRUPTIBLE);
  if (!waiter.task) // false
      break;
  schedule();
 }
                                                        wake_up_q(&wake_q);

In other words, we might issue the wakeup before setting the reader
waiter to nil. If so, the wakeup may do nothing, because it can be
issued before the reader sets its task state to TASK_UNINTERRUPTIBLE.
Then we would have no chance to wake up the reader any more, and other
writers, such as a "ps" command, would get stuck on the lock.
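
For reference, the behaviour this hinges on is wake_q_add()'s cmpxchg()
fast path. A simplified sketch (modeled on kernel/sched/core.c of this
era, not the verbatim source):

void wake_q_add(struct wake_q_head *head, struct task_struct *task)
{
	struct wake_q_node *node = &task->wake_q;

	/*
	 * If ->wake_q is already non-nil, the task is queued on some
	 * other wake_q and the cmpxchg() fails; we return without
	 * queueing it, silently relying on that other, earlier wakeup.
	 */
	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
		return;

	/* Pin the task so it cannot exit before wake_up_q() runs. */
	get_task_struct(task);

	/* The head is context-local, so the list itself needs no locking. */
	*head->lastp = node;
	head->lastp = &node->next;
}

That silent failure is exactly what Thread 3 runs into above.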

This patch is not verified because we still have no way to reproduce the
problem, but I'd like to ask for some comments from the community first.

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
---
 kernel/locking/rwsem-xadd.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 09b1800..50d9af6 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -198,15 +198,22 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		woken++;
 		tsk = waiter->task;
 
-		wake_q_add(wake_q, tsk);
+		get_task_struct(tsk);
 		list_del(&waiter->list);
 		/*
-		 * Ensure that the last operation is setting the reader
+		 * Ensure calling get_task_struct() before setting the reader
 		 * waiter to nil such that rwsem_down_read_failed() cannot
 		 * race with do_exit() by always holding a reference count
 		 * to the task to wakeup.
 		 */
 		smp_store_release(&waiter->task, NULL);
+		/*
+		 * Ensure issuing the wakeup (either by us or someone else)
+		 * after setting the reader waiter to nil.
+		 */
+		wake_q_add(wake_q, tsk);
+		/* wake_q_add() already takes the task ref */
+		put_task_struct(tsk);
 	}
 
 	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
-- 
2.2.3



* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 12:50 [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil Yongji Xie
@ 2018-11-29 13:12 ` Peter Zijlstra
  2018-11-29 13:44   ` Peter Zijlstra
  2018-11-29 15:21   ` Waiman Long
  2019-01-21 11:28 ` [tip:locking/core] locking/rwsem: Fix (possible) missed wakeup tip-bot for Xie Yongji
  1 sibling, 2 replies; 50+ messages in thread
From: Peter Zijlstra @ 2018-11-29 13:12 UTC (permalink / raw)
  To: Yongji Xie
  Cc: mingo, will.deacon, linux-kernel, xieyongji, zhangyu31, liuqi16,
	yuanlinsi01, nixun, lilin24, Davidlohr Bueso, Waiman Long


+Cc davidlohr and waiman

On Thu, Nov 29, 2018 at 08:50:30PM +0800, Yongji Xie wrote:
> From: Xie Yongji <xieyongji@baidu.com>
> 
> Our system encountered a problem recently: khungtaskd detected some
> processes hung on mmap_sem. The odd thing was that one task, which was
> not on mmap_sem.wait_list, was still sleeping in rwsem_down_read_failed().
> Through code inspection, we found a potential bug that can lead to this.
> 
> Imagine this:
> 
> Thread 1                                  Thread 2
>                                           down_write();
> rwsem_down_read_failed()
>  raw_spin_lock_irq(&sem->wait_lock);
>  list_add_tail(&waiter.list, &wait_list);
>  raw_spin_unlock_irq(&sem->wait_lock);
>                                           __up_write();
>                                            rwsem_wake();
>                                             __rwsem_mark_wake();
>                                              wake_q_add();
>                                              list_del(&waiter->list);
>                                              waiter->task = NULL;
>  while (true) {
>   set_current_state(TASK_UNINTERRUPTIBLE);
>   if (!waiter.task) // true
>       break;
>  }
>  __set_current_state(TASK_RUNNING);
> 
> Now Thread 1 is queued in Thread 2's wake_q without having slept. Then
> Thread 1 calls rwsem_down_read_failed() again because Thread 3 holds the
> lock. If Thread 3 tries to queue Thread 1 before Thread 2 does the
> wakeup, it will fail, and the wakeup is missed:
> 
> Thread 1                                  Thread 2      Thread 3
>                                                         down_write();
> rwsem_down_read_failed()
>  raw_spin_lock_irq(&sem->wait_lock);
>  list_add_tail(&waiter.list, &wait_list);
>  raw_spin_unlock_irq(&sem->wait_lock);
>                                                         __rwsem_mark_wake();
>                                                          wake_q_add();
>                                           wake_up_q();
>                                                          waiter->task = NULL;
>  while (true) {
>   set_current_state(TASK_UNINTERRUPTIBLE);
>   if (!waiter.task) // false
>       break;
>   schedule();
>  }
>                                                         wake_up_q(&wake_q);
> 
> In other words, we might issue the wakeup before setting the reader
> waiter to nil. If so, the wakeup may do nothing, because it can be
> issued before the reader sets its task state to TASK_UNINTERRUPTIBLE.
> Then we would have no chance to wake up the reader any more, and other
> writers, such as a "ps" command, would get stuck on the lock.
> 
> This patch is not verified because we still have no way to reproduce the
> problem, but I'd like to ask for some comments from the community first.

Urgh; so the case where the cmpxchg() fails because it already has a
wakeup in progress, which then 'violates' our expectation of when the
wakeup happens.

Yes, I think this is real, and worse, I think we need to go audit all
wake_q_add() users and document this behaviour.

In the ideal case we'd delay the actual wakeup to the last wake_up_q(),
but I don't think we can easily fix that.

> Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> ---
>  kernel/locking/rwsem-xadd.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index 09b1800..50d9af6 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -198,15 +198,22 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>  		woken++;
>  		tsk = waiter->task;
>  
> -		wake_q_add(wake_q, tsk);
> +		get_task_struct(tsk);
>  		list_del(&waiter->list);
>  		/*
> -		 * Ensure that the last operation is setting the reader
> +		 * Ensure calling get_task_struct() before setting the reader
>  		 * waiter to nil such that rwsem_down_read_failed() cannot
>  		 * race with do_exit() by always holding a reference count
>  		 * to the task to wakeup.
>  		 */
>  		smp_store_release(&waiter->task, NULL);
> +		/*
> +		 * Ensure issuing the wakeup (either by us or someone else)
> +		 * after setting the reader waiter to nil.
> +		 */
> +		wake_q_add(wake_q, tsk);
> +		/* wake_q_add() already takes the task ref */
> +		put_task_struct(tsk);
>  	}
>  
>  	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
> -- 
> 2.2.3
> 


* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 13:12 ` Peter Zijlstra
@ 2018-11-29 13:44   ` Peter Zijlstra
  2018-11-29 14:02     ` Yongji Xie
  2018-11-29 18:43     ` Davidlohr Bueso
  2018-11-29 15:21   ` Waiman Long
  1 sibling, 2 replies; 50+ messages in thread
From: Peter Zijlstra @ 2018-11-29 13:44 UTC (permalink / raw)
  To: Yongji Xie
  Cc: mingo, will.deacon, linux-kernel, xieyongji, zhangyu31, liuqi16,
	yuanlinsi01, nixun, lilin24, Davidlohr Bueso, Waiman Long,
	Thomas Gleixner

On Thu, Nov 29, 2018 at 02:12:32PM +0100, Peter Zijlstra wrote:
> 
> +Cc davidlohr and waiman

> Urgh; so the case where the cmpxchg() fails because it already has a
> wakeup in progress, which then 'violates' our expectation of when the
> wakeup happens.
> 
> Yes, I think this is real, and worse, I think we need to go audit all
> wake_q_add() users and document this behaviour.
> 
> In the ideal case we'd delay the actual wakeup to the last wake_up_q(),
> but I don't think we can easily fix that.

See commit: 1d0dcb3ad9d3 ("futex: Implement lockless wakeups"), I think
that introduces the exact same bug.

Something like the below perhaps, although this pattern seems to want a
wake_q_add() variant that already assumes get_task_struct().

diff --git a/kernel/futex.c b/kernel/futex.c
index f423f9b6577e..d14971f6ed3d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1387,11 +1387,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n"))
 		return;
 
-	/*
-	 * Queue the task for later wakeup for after we've released
-	 * the hb->lock. wake_q_add() grabs reference to p.
-	 */
-	wake_q_add(wake_q, p);
+	get_task_struct(p);
 	__unqueue_futex(q);
 	/*
 	 * The waiting task can free the futex_q as soon as q->lock_ptr = NULL
@@ -1401,6 +1397,13 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	 * plist_del in __unqueue_futex().
 	 */
 	smp_store_release(&q->lock_ptr, NULL);
+
+	/*
+	 * Queue the task for later wakeup for after we've released
+	 * the hb->lock. wake_q_add() grabs reference to p.
+	 */
+	wake_q_add(wake_q, p);
+	put_task_struct(p);
 }
 
 /*


* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 13:44   ` Peter Zijlstra
@ 2018-11-29 14:02     ` Yongji Xie
  2018-11-29 18:43     ` Davidlohr Bueso
  1 sibling, 0 replies; 50+ messages in thread
From: Yongji Xie @ 2018-11-29 14:02 UTC (permalink / raw)
  To: peterz
  Cc: mingo, will.deacon, linux-kernel, xieyongji, zhangyu31, liuqi16,
	yuanlinsi01, nixun, lilin24, dave, longman, tglx

On Thu, 29 Nov 2018 at 21:45, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 29, 2018 at 02:12:32PM +0100, Peter Zijlstra wrote:
> >
> > +Cc davidlohr and waiman
>
> > Urgh; so the case where the cmpxchg() fails because it already has a
> > wakeup in progress, which then 'violates' our expectation of when the
> > wakeup happens.
> >
> > Yes, I think this is real, and worse, I think we need to go audit all
> > wake_q_add() users and document this behaviour.
> >
> > In the ideal case we'd delay the actual wakeup to the last wake_up_q(),
> > but I don't think we can easily fix that.
>
> See commit: 1d0dcb3ad9d3 ("futex: Implement lockless wakeups"), I think
> that introduces the exact same bug.
>

Hmm... Yes, the thread may even be in the futex's wake_q, leading to a
missed rwsem wakeup.

Seems like fixing this problem case by case and documenting the
behaviour is easier than delaying the actual wakeup to the last
wake_up_q()...
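
Such documentation might read roughly like this (a sketch, not wording
from any tree):

/*
 * wake_q_add() fails silently if @task is already queued on another
 * wake_q. Callers must therefore make their wakeup condition visible
 * *before* calling wake_q_add(), and sleepers must re-check that
 * condition after every wakeup, spurious or not.
 */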

Thanks,
Yongji

> Something like the below perhaps, although this pattern seems to want a
> wake_q_add() variant that already assumes get_task_struct().
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index f423f9b6577e..d14971f6ed3d 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1387,11 +1387,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
>         if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n"))
>                 return;
>
> -       /*
> -        * Queue the task for later wakeup for after we've released
> -        * the hb->lock. wake_q_add() grabs reference to p.
> -        */
> -       wake_q_add(wake_q, p);
> +       get_task_struct(p);
>         __unqueue_futex(q);
>         /*
>          * The waiting task can free the futex_q as soon as q->lock_ptr = NULL
> @@ -1401,6 +1397,13 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
>          * plist_del in __unqueue_futex().
>          */
>         smp_store_release(&q->lock_ptr, NULL);
> +
> +       /*
> +        * Queue the task for later wakeup for after we've released
> +        * the hb->lock. wake_q_add() grabs reference to p.
> +        */
> +       wake_q_add(wake_q, p);
> +       put_task_struct(p);
>  }
>
>  /*


* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 13:12 ` Peter Zijlstra
  2018-11-29 13:44   ` Peter Zijlstra
@ 2018-11-29 15:21   ` Waiman Long
  2018-11-29 15:29     ` Waiman Long
  2018-11-29 16:06     ` Peter Zijlstra
  1 sibling, 2 replies; 50+ messages in thread
From: Waiman Long @ 2018-11-29 15:21 UTC (permalink / raw)
  To: Peter Zijlstra, Yongji Xie
  Cc: mingo, will.deacon, linux-kernel, xieyongji, zhangyu31, liuqi16,
	yuanlinsi01, nixun, lilin24, Davidlohr Bueso

On 11/29/2018 08:12 AM, Peter Zijlstra wrote:
> +Cc davidlohr and waiman
>
> On Thu, Nov 29, 2018 at 08:50:30PM +0800, Yongji Xie wrote:
>> From: Xie Yongji <xieyongji@baidu.com>
>>
>> Our system encountered a problem recently: khungtaskd detected some
>> processes hung on mmap_sem. The odd thing was that one task, which was
>> not on mmap_sem.wait_list, was still sleeping in rwsem_down_read_failed().
>> Through code inspection, we found a potential bug that can lead to this.
>>
>> Imagine this:
>>
>> Thread 1                                  Thread 2
>>                                           down_write();
>> rwsem_down_read_failed()
>>  raw_spin_lock_irq(&sem->wait_lock);
>>  list_add_tail(&waiter.list, &wait_list);
>>  raw_spin_unlock_irq(&sem->wait_lock);
>>                                           __up_write();
>>                                            rwsem_wake();
>>                                             __rwsem_mark_wake();
>>                                              wake_q_add();
>>                                              list_del(&waiter->list);
>>                                              waiter->task = NULL;
>>  while (true) {
>>   set_current_state(TASK_UNINTERRUPTIBLE);
>>   if (!waiter.task) // true
>>       break;
>>  }
>>  __set_current_state(TASK_RUNNING);
>>
>> Now Thread 1 is queued in Thread 2's wake_q without having slept. Then
>> Thread 1 calls rwsem_down_read_failed() again because Thread 3 holds the
>> lock. If Thread 3 tries to queue Thread 1 before Thread 2 does the
>> wakeup, it will fail, and the wakeup is missed:
>>
>> Thread 1                                  Thread 2      Thread 3
>>                                                         down_write();
>> rwsem_down_read_failed()
>>  raw_spin_lock_irq(&sem->wait_lock);
>>  list_add_tail(&waiter.list, &wait_list);
>>  raw_spin_unlock_irq(&sem->wait_lock);
>>                                                         __rwsem_mark_wake();
>>                                                          wake_q_add();
>>                                           wake_up_q();
>>                                                          waiter->task = NULL;
>>  while (true) {
>>   set_current_state(TASK_UNINTERRUPTIBLE);
>>   if (!waiter.task) // false
>>       break;
>>   schedule();
>>  }
>>                                                         wake_up_q(&wake_q);
>>
>> In other words, we might issue the wakeup before setting the reader
>> waiter to nil. If so, the wakeup may do nothing, because it can be
>> issued before the reader sets its task state to TASK_UNINTERRUPTIBLE.
>> Then we would have no chance to wake up the reader any more, and other
>> writers, such as a "ps" command, would get stuck on the lock.
>>
>> This patch is not verified because we still have no way to reproduce the
>> problem, but I'd like to ask for some comments from the community first.
> Urgh; so the case where the cmpxchg() fails because it already has a
> wakeup in progress, which then 'violates' our expectation of when the
> wakeup happens.
>
> Yes, I think this is real, and worse, I think we need to go audit all
> wake_q_add() users and document this behaviour.

Yes, I also think this is a valid race scenario that can cause a missed
wakeup. Actually, I had bug reports with a similar symptom of a sleeping
reader not being in a wait queue. I was puzzled by how that could
happen; this is clearly one possible cause.


> In the ideal case we'd delay the actual wakeup to the last wake_up_q(),
> but I don't think we can easily fix that.
>
>> Signed-off-by: Xie Yongji <xieyongji@baidu.com>
>> Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
>> ---
>>  kernel/locking/rwsem-xadd.c | 11 +++++++++--
>>  1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
>> index 09b1800..50d9af6 100644
>> --- a/kernel/locking/rwsem-xadd.c
>> +++ b/kernel/locking/rwsem-xadd.c
>> @@ -198,15 +198,22 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>>  		woken++;
>>  		tsk = waiter->task;
>>  
>> -		wake_q_add(wake_q, tsk);
>> +		get_task_struct(tsk);
>>  		list_del(&waiter->list);
>>  		/*
>> -		 * Ensure that the last operation is setting the reader
>> +		 * Ensure calling get_task_struct() before setting the reader
>>  		 * waiter to nil such that rwsem_down_read_failed() cannot
>>  		 * race with do_exit() by always holding a reference count
>>  		 * to the task to wakeup.
>>  		 */
>>  		smp_store_release(&waiter->task, NULL);
>> +		/*
>> +		 * Ensure issuing the wakeup (either by us or someone else)
>> +		 * after setting the reader waiter to nil.
>> +		 */
>> +		wake_q_add(wake_q, tsk);
>> +		/* wake_q_add() already takes the task ref */
>> +		put_task_struct(tsk);
>>  	}
>>  
>>  	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;

I doubt putting wake_q_add() after clearing waiter->task can really fix
the problem. The wake_up_q() function happens asynchronously to the
detection of NULL waiter->task in __rwsem_down_read_failed_common(). I
believe the same scenario may still happen.

One possible solution that I can think of is as follows:

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index 10b19a1..1513cdc 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -47,6 +47,14 @@ static inline void wake_q_init(struct wake_q_head *head)
        head->lastp = &head->first;
 }
 
+/*
+ * Return true if the current task is on a wake_q.
+ */
+static inline bool wake_q_pending(void)
+{
+       return !!current->wake_q.next;
+}
+
 extern void wake_q_add(struct wake_q_head *head,
                       struct task_struct *task);
 extern void wake_up_q(struct wake_q_head *head);
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 3dbe593..b656777 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -269,7 +269,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
        /* wait to be given the lock */
        while (true) {
                set_current_state(state);
-               if (!waiter.task)
+               if (!smp_load_acquire(&waiter.task))
                        break;
                if (signal_pending_state(state, current)) {
                        raw_spin_lock_irq(&sem->wait_lock);
@@ -282,6 +282,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
        }
 
        __set_current_state(TASK_RUNNING);
+
+       /*
+        * If waiter is still queuing in a wake_q somewhere, we have to wait
+        * until the wake_up_q() process is complete as the memory of the
+        * waiter structure will no longer be valid when we return.
+        */
+       while (wake_q_pending())
+               cpu_relax();
+
        return sem;
 out_nolock:
        list_del(&waiter.list);




* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 15:21   ` Waiman Long
@ 2018-11-29 15:29     ` Waiman Long
  2018-11-29 16:06     ` Peter Zijlstra
  1 sibling, 0 replies; 50+ messages in thread
From: Waiman Long @ 2018-11-29 15:29 UTC (permalink / raw)
  To: Peter Zijlstra, Yongji Xie
  Cc: mingo, will.deacon, linux-kernel, xieyongji, zhangyu31, liuqi16,
	yuanlinsi01, nixun, lilin24, Davidlohr Bueso

On 11/29/2018 10:21 AM, Waiman Long wrote:
> On 11/29/2018 08:12 AM, Peter Zijlstra wrote:
>> +Cc davidlohr and waiman
>>
>> On Thu, Nov 29, 2018 at 08:50:30PM +0800, Yongji Xie wrote:
>>> From: Xie Yongji <xieyongji@baidu.com>
>>>
>>> Our system encountered a problem recently: khungtaskd detected some
>>> processes hung on mmap_sem. The odd thing was that one task, which was
>>> not on mmap_sem.wait_list, was still sleeping in rwsem_down_read_failed().
>>> Through code inspection, we found a potential bug that can lead to this.
>>>
>>> Imagine this:
>>>
>>> Thread 1                                  Thread 2
>>>                                           down_write();
>>> rwsem_down_read_failed()
>>>  raw_spin_lock_irq(&sem->wait_lock);
>>>  list_add_tail(&waiter.list, &wait_list);
>>>  raw_spin_unlock_irq(&sem->wait_lock);
>>>                                           __up_write();
>>>                                            rwsem_wake();
>>>                                             __rwsem_mark_wake();
>>>                                              wake_q_add();
>>>                                              list_del(&waiter->list);
>>>                                              waiter->task = NULL;
>>>  while (true) {
>>>   set_current_state(TASK_UNINTERRUPTIBLE);
>>>   if (!waiter.task) // true
>>>       break;
>>>  }
>>>  __set_current_state(TASK_RUNNING);
>>>
>>> Now Thread 1 is queued in Thread 2's wake_q without having slept. Then
>>> Thread 1 calls rwsem_down_read_failed() again because Thread 3 holds the
>>> lock. If Thread 3 tries to queue Thread 1 before Thread 2 does the
>>> wakeup, it will fail, and the wakeup is missed:
>>>
>>> Thread 1                                  Thread 2      Thread 3
>>>                                                         down_write();
>>> rwsem_down_read_failed()
>>>  raw_spin_lock_irq(&sem->wait_lock);
>>>  list_add_tail(&waiter.list, &wait_list);
>>>  raw_spin_unlock_irq(&sem->wait_lock);
>>>                                                         __rwsem_mark_wake();
>>>                                                          wake_q_add();
>>>                                           wake_up_q();
>>>                                                          waiter->task = NULL;
>>>  while (true) {
>>>   set_current_state(TASK_UNINTERRUPTIBLE);
>>>   if (!waiter.task) // false
>>>       break;
>>>   schedule();
>>>  }
>>>                                                         wake_up_q(&wake_q);
>>>
>>> In other words, we might issue the wakeup before setting the reader
>>> waiter to nil. If so, the wakeup may do nothing, because it can be
>>> issued before the reader sets its task state to TASK_UNINTERRUPTIBLE.
>>> Then we would have no chance to wake up the reader any more, and other
>>> writers, such as a "ps" command, would get stuck on the lock.
>>>
>>> This patch is not verified because we still have no way to reproduce the
>>> problem, but I'd like to ask for some comments from the community first.
>> Urgh; so the case where the cmpxchg() fails because it already has a
>> wakeup in progress, which then 'violates' our expectation of when the
>> wakeup happens.
>>
>> Yes, I think this is real, and worse, I think we need to go audit all
>> wake_q_add() users and document this behaviour.
> Yes, I also think this is a valid race scenario that can cause a missed
> wakeup. Actually, I had bug reports with a similar symptom of a sleeping
> reader not being in a wait queue. I was puzzled by how that could
> happen; this is clearly one possible cause.
>
>
>> In the ideal case we'd delay the actual wakeup to the last wake_up_q(),
>> but I don't think we can easily fix that.
>>
>>> Signed-off-by: Xie Yongji <xieyongji@baidu.com>
>>> Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
>>> ---
>>>  kernel/locking/rwsem-xadd.c | 11 +++++++++--
>>>  1 file changed, 9 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
>>> index 09b1800..50d9af6 100644
>>> --- a/kernel/locking/rwsem-xadd.c
>>> +++ b/kernel/locking/rwsem-xadd.c
>>> @@ -198,15 +198,22 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>>>  		woken++;
>>>  		tsk = waiter->task;
>>>  
>>> -		wake_q_add(wake_q, tsk);
>>> +		get_task_struct(tsk);
>>>  		list_del(&waiter->list);
>>>  		/*
>>> -		 * Ensure that the last operation is setting the reader
>>> +		 * Ensure calling get_task_struct() before setting the reader
>>>  		 * waiter to nil such that rwsem_down_read_failed() cannot
>>>  		 * race with do_exit() by always holding a reference count
>>>  		 * to the task to wakeup.
>>>  		 */
>>>  		smp_store_release(&waiter->task, NULL);
>>> +		/*
>>> +		 * Ensure issuing the wakeup (either by us or someone else)
>>> +		 * after setting the reader waiter to nil.
>>> +		 */
>>> +		wake_q_add(wake_q, tsk);
>>> +		/* wake_q_add() already takes the task ref */
>>> +		put_task_struct(tsk);
>>>  	}
>>>  
>>>  	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
> I doubt putting wake_q_add() after clearing waiter->task can really fix
> the problem. The wake_up_q() function happens asynchronously to the
> detection of NULL waiter->task in __rwsem_down_read_failed_common(). I
> believe the same scenario may still happen.
>
> One possible solution that I can think of is as follows:
>
> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
> index 10b19a1..1513cdc 100644
> --- a/include/linux/sched/wake_q.h
> +++ b/include/linux/sched/wake_q.h
> @@ -47,6 +47,14 @@ static inline void wake_q_init(struct wake_q_head *head)
>         head->lastp = &head->first;
>  }
>  
> +/*
> + * Return true if the current task is on a wake_q.
> + */
> +static inline bool wake_q_pending(void)
> +{
> +       return !!current->wake_q.next;
> +}
> +
>  extern void wake_q_add(struct wake_q_head *head,
>                        struct task_struct *task);
>  extern void wake_up_q(struct wake_q_head *head);
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index 3dbe593..b656777 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -269,7 +269,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>         /* wait to be given the lock */
>         while (true) {
>                 set_current_state(state);
> -               if (!waiter.task)
> +               if (!smp_load_acquire(&waiter.task))
>                         break;
>                 if (signal_pending_state(state, current)) {
>                         raw_spin_lock_irq(&sem->wait_lock);
> @@ -282,6 +282,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>         }
>  
>         __set_current_state(TASK_RUNNING);
> +
> +       /*
> +        * If waiter is still queuing in a wake_q somewhere, we have to wait
> +        * until the wake_up_q() process is complete as the memory of the
> +        * waiter structure will no longer be valid when we return.
> +        */

Sorry, the comment is wrong. I should say something like:
/*
 * If we are still queuing in a wake_q somewhere, we have to wait until
 * the wake_up_q() function is complete to protect against concurrent
 * wake_q operations.
 */
> +       while (wake_q_pending())
> +               cpu_relax();
> +
>         return sem;
>  out_nolock:
>         list_del(&waiter.list);
>
>
Cheers,
Longman



* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 15:21   ` Waiman Long
  2018-11-29 15:29     ` Waiman Long
@ 2018-11-29 16:06     ` Peter Zijlstra
  2018-11-29 17:02       ` Waiman Long
  1 sibling, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2018-11-29 16:06 UTC (permalink / raw)
  To: Waiman Long
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, Davidlohr Bueso

On Thu, Nov 29, 2018 at 10:21:58AM -0500, Waiman Long wrote:

> >> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> >> index 09b1800..50d9af6 100644
> >> --- a/kernel/locking/rwsem-xadd.c
> >> +++ b/kernel/locking/rwsem-xadd.c
> >> @@ -198,15 +198,22 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
> >>  		woken++;
> >>  		tsk = waiter->task;
> >>  
> >> -		wake_q_add(wake_q, tsk);
> >> +		get_task_struct(tsk);
> >>  		list_del(&waiter->list);
> >>  		/*
> >> -		 * Ensure that the last operation is setting the reader
> >> +		 * Ensure calling get_task_struct() before setting the reader
> >>  		 * waiter to nil such that rwsem_down_read_failed() cannot
> >>  		 * race with do_exit() by always holding a reference count
> >>  		 * to the task to wakeup.
> >>  		 */
> >>  		smp_store_release(&waiter->task, NULL);
> >> +		/*
> >> +		 * Ensure issuing the wakeup (either by us or someone else)
> >> +		 * after setting the reader waiter to nil.
> >> +		 */
> >> +		wake_q_add(wake_q, tsk);
> >> +		/* wake_q_add() already takes the task ref */
> >> +		put_task_struct(tsk);
> >>  	}
> >>  
> >>  	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
> 
> I doubt putting wake_q_add() after clearing waiter->task can really fix

Why; at that point we know the wakeup will happen after, which is all we
require.




* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 16:06     ` Peter Zijlstra
@ 2018-11-29 17:02       ` Waiman Long
  2018-11-29 17:27         ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Waiman Long @ 2018-11-29 17:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, Davidlohr Bueso

On 11/29/2018 11:06 AM, Peter Zijlstra wrote:
> On Thu, Nov 29, 2018 at 10:21:58AM -0500, Waiman Long wrote:
>
>>>> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
>>>> index 09b1800..50d9af6 100644
>>>> --- a/kernel/locking/rwsem-xadd.c
>>>> +++ b/kernel/locking/rwsem-xadd.c
>>>> @@ -198,15 +198,22 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>>>>  		woken++;
>>>>  		tsk = waiter->task;
>>>>  
>>>> -		wake_q_add(wake_q, tsk);
>>>> +		get_task_struct(tsk);
>>>>  		list_del(&waiter->list);
>>>>  		/*
>>>> -		 * Ensure that the last operation is setting the reader
>>>> +		 * Ensure calling get_task_struct() before setting the reader
>>>>  		 * waiter to nil such that rwsem_down_read_failed() cannot
>>>>  		 * race with do_exit() by always holding a reference count
>>>>  		 * to the task to wakeup.
>>>>  		 */
>>>>  		smp_store_release(&waiter->task, NULL);
>>>> +		/*
>>>> +		 * Ensure issuing the wakeup (either by us or someone else)
>>>> +		 * after setting the reader waiter to nil.
>>>> +		 */
>>>> +		wake_q_add(wake_q, tsk);
>>>> +		/* wake_q_add() already takes the task ref */
>>>> +		put_task_struct(tsk);
>>>>  	}
>>>>  
>>>>  	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
>> I doubt putting wake_q_add() after clearing waiter->task can really fix
> Why; at that point we know the wakeup will happen after, which is all we
> require.
>
>

Thread 1                       Thread 2      Thread 3
                                             down_write();
rwsem_down_read_failed()
 raw_spin_lock_irq(&sem->wait_lock);
 list_add_tail(&waiter.list, &wait_list);
 raw_spin_unlock_irq(&sem->wait_lock);
                                             __rwsem_mark_wake();
                                              wake_q_add();
                               wake_up_q();
                                              waiter->task = NULL;-+
 while (true) {                                                    |
  set_current_state(TASK_UNINTERRUPTIBLE);                         |
  if (!waiter.task) // false                                       |
      break;                                                       |
  schedule();                                                      |
 }                                                               <-+
                                             wake_up_q(&wake_q);

OK, I got confused by the thread racing chart shown in the patch. It
will be clearer if the clearing of waiter->task is moved down as shown.
Otherwise, moving the clearing of waiter->task before wake_q_add() won't
make a difference. So the patch can be a possible fix.

Still, we are talking about three threads racing with each other. The
clearing of wake_q.next in wake_up_q() is not atomic, and it is hard to
predict the outcome of the concurrent wake_q operations between threads
2 and 3. The essence of my tentative patch is to prevent the concurrent
wake_q operations in the first place.

The second wake_q_add() above will fail to add the task to the second
wake_q because it is still in the first wake_q. So the second
wake_up_q() will not wake up the task because it is not in its wake_q.

Cheers,
Longman




* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 17:02       ` Waiman Long
@ 2018-11-29 17:27         ` Peter Zijlstra
  2018-11-29 17:58           ` Waiman Long
  2018-11-29 18:08           ` Peter Zijlstra
  0 siblings, 2 replies; 50+ messages in thread
From: Peter Zijlstra @ 2018-11-29 17:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, Davidlohr Bueso

On Thu, Nov 29, 2018 at 12:02:19PM -0500, Waiman Long wrote:
> On 11/29/2018 11:06 AM, Peter Zijlstra wrote:

> > Why; at that point we know the wakeup will happen after, which is all we
> > require.
> >

> Thread 1                                  Thread 2      Thread 3
> 
>     rwsem_down_read_failed()
>  raw_spin_lock_irq(&sem->wait_lock);
>  list_add_tail(&waiter.list, &wait_list);
>  raw_spin_unlock_irq(&sem->wait_lock);
>                                                         __rwsem_mark_wake();
>                                                          wake_q_add();
>                                           wake_up_q();
>                                                          waiter->task =
> NULL; --+
>  while (true)
> {                                                                 |
>  
> set_current_state(TASK_UNINTERRUPTIBLE);                                     
> |
>   if (!waiter.task) //
> false                                                    |
>      
> break;                                                                    |
>  
> schedule();                                                                  
> |
>  }                                                                       
> <-----+
>                                                         wake_up_q(&wake_q);

I think that thing is horribly whitespace damaged. At least, it's not
making sense to me.

> OK, I got confused by the thread racing chart shown in the patch. It
> will be clearer if the clearing of waiter->task is moved down as shown.
> Otherwise, moving the clearing of waiter->task before wake_q_add() won't
> make a difference. So the patch can be a possible fix.
> 
> Still, we are talking about three threads racing with each other. The
> clearing of wake_q.next in wake_up_q() is not atomic, and it is hard to
> predict the outcome of the concurrent wake_q operations between threads
> 2 and 3. The essence of my tentative patch is to prevent the concurrent
> wake_q operations in the first place.

wake_up_q() should, per the barriers in wake_up_process(), ensure that
if wake_q_add() fails, there will be a wakeup of that task after that
point.

So if we put wake_up_q() at the location where wake_up_process() should
be, it should all work.

The bug in question is that it can happen at any time after
wake_q_add(), not necessarily at wake_up_q().
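
Concretely, the only pattern that guarantee supports is to make the
wakeup condition visible before wake_q_add(), so that whichever wakeup
the sleeper ends up getting, it observes the condition. A sketch
(waiter->task stands in for the condition; not verbatim kernel code):

	/* waker */
	smp_store_release(&waiter->task, NULL);	/* publish the condition */
	wake_q_add(wake_q, tsk);	/* may no-op if tsk is queued elsewhere */
	...
	wake_up_q(wake_q);	/* or any other wakeup of tsk after this point */

	/* sleeper */
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (!smp_load_acquire(&waiter.task))	/* pairs with the release */
			break;
		schedule();	/* a spurious wakeup just loops and re-checks */
	}
	__set_current_state(TASK_RUNNING);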


* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 17:27         ` Peter Zijlstra
@ 2018-11-29 17:58           ` Waiman Long
  2018-11-29 18:13             ` Peter Zijlstra
  2018-11-29 18:08           ` Peter Zijlstra
  1 sibling, 1 reply; 50+ messages in thread
From: Waiman Long @ 2018-11-29 17:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, Davidlohr Bueso

On 11/29/2018 12:27 PM, Peter Zijlstra wrote:
> On Thu, Nov 29, 2018 at 12:02:19PM -0500, Waiman Long wrote:
>> On 11/29/2018 11:06 AM, Peter Zijlstra wrote:
>>> Why; at that point we know the wakeup will happen after, which is all we
>>> require.
>>>
>> Thread 1                                  Thread 2      Thread 3
>>
>>     rwsem_down_read_failed()
>>  raw_spin_lock_irq(&sem->wait_lock);
>>  list_add_tail(&waiter.list, &wait_list);
>>  raw_spin_unlock_irq(&sem->wait_lock);
>>                                                         __rwsem_mark_wake();
>>                                                          wake_q_add();
>>                                           wake_up_q();
>>                                                          waiter->task =
>> NULL; --+
>>  while (true)
>> {                                                                 |
>>  
>> set_current_state(TASK_UNINTERRUPTIBLE);                                     
>> |
>>   if (!waiter.task) //
>> false                                                    |
>>      
>> break;                                                                    |
>>  
>> schedule();                                                                  
>> |
>>  }                                                                       
>> <-----+
>>                                                         wake_up_q(&wake_q);
> I think that thing is horribly whitespace damaged. At least, it's not
> making sense to me.

I am sorry; it was line-wrapped by my email client. The chart was:

Thread 1                       Thread 2      Thread 3
                                             down_write();
rwsem_down_read_failed()
 raw_spin_lock_irq(&sem->wait_lock);
 list_add_tail(&waiter.list, &wait_list);
 raw_spin_unlock_irq(&sem->wait_lock);
                                             __rwsem_mark_wake();
                                              wake_q_add();
                               wake_up_q();
                                              waiter->task = NULL;-+
 while (true) {                                                    |
  set_current_state(TASK_UNINTERRUPTIBLE);                         |
  if (!waiter.task) // false                                       |
      break;                                                       |
  schedule();                                                      |
 }                                                               <-+
                                             wake_up_q(&wake_q);

>> OK, I got confused by the thread racing chart shown in the patch. It
>> will be clearer if the clearing of waiter->task is moved down as shown.
>> Otherwise, moving the clearing of waiter->task before wake_q_add() won't
>> make a difference. So the patch can be a possible fix.
>>
>> Still, we are talking about three threads racing with each other. The
>> clearing of wake_q.next in wake_up_q() is not atomic, and it is hard to
>> predict the outcome of the concurrent wake_q operations between threads
>> 2 and 3. The essence of my tentative patch is to prevent the concurrent
>> wake_q operations in the first place.
> wake_up_q() should, per the barriers in wake_up_process(), ensure that
> if wake_q_add() fails, there will be a wakeup of that task after that
> point.
>
> So if we put wake_up_q() at the location where wake_up_process() should
> be, it should all work.
>
> The bug in question is that it can happen at any time after
> wake_q_add(), not necessarily at wake_up_q().

OK, you convinced me. However, that can still lead to anonymous wakeups
that can be problematic if they happen in certain places. Should we try
to reduce anonymous wakeups as much as possible?

Cheers,
Longman



* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 17:27         ` Peter Zijlstra
  2018-11-29 17:58           ` Waiman Long
@ 2018-11-29 18:08           ` Peter Zijlstra
  2018-11-29 18:26             ` Waiman Long
  1 sibling, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2018-11-29 18:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, Davidlohr Bueso

On Thu, Nov 29, 2018 at 06:27:00PM +0100, Peter Zijlstra wrote:
> 
> wake_up_q() should, per the barriers in wake_up_process(), ensure that
> if wake_q_add() fails, there will be a wakeup of that task after that
> point.
> 
> So if we put wake_up_q() at the location where wake_up_process() should
> be, it should all work.
> 
> The bug in question is that it can happen at any time after
> wake_q_add(), not necessarily at wake_up_q().

Hmm, I think we're missing a barrier in wake_q_add(); when cmpxchg()
fails we still need an smp_mb().

Something like so.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3d87a28da378..69def558edf6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -400,6 +400,13 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 {
 	struct wake_q_node *node = &task->wake_q;
 
+	/*
+	 * Ensure that, when the below cmpxchg() fails, the corresponding
+	 * wake_up_q() will observe our prior state.
+	 *
+	 * Pairs with the smp_mb() from wake_up_q()'s wake_up_process().
+	 */
+	smp_mb();
 	/*
 	 * Atomically grab the task, if ->wake_q is !nil already it means
 	 * its already queued (either by us or someone else) and will get the
@@ -408,7 +415,7 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	 * This cmpxchg() executes a full barrier, which pairs with the full
 	 * barrier executed by the wakeup in wake_up_q().
 	 */
-	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
+	if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
 		return;
 
 	get_task_struct(task);


* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 17:58           ` Waiman Long
@ 2018-11-29 18:13             ` Peter Zijlstra
  2018-11-29 18:17               ` Davidlohr Bueso
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2018-11-29 18:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, Davidlohr Bueso

On Thu, Nov 29, 2018 at 12:58:26PM -0500, Waiman Long wrote:
> OK, you convinced me. However, that can still lead to anonymous wakeups
> that can be problematic if they happen in certain places. Should we try
> to reduce anonymous wakeups as much as possible?

No, we should at all times accept and expect spurious wakeups.


* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 18:13             ` Peter Zijlstra
@ 2018-11-29 18:17               ` Davidlohr Bueso
  0 siblings, 0 replies; 50+ messages in thread
From: Davidlohr Bueso @ 2018-11-29 18:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Yongji Xie, mingo, will.deacon, linux-kernel,
	xieyongji, zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24

On Thu, 29 Nov 2018, Peter Zijlstra wrote:

>On Thu, Nov 29, 2018 at 12:58:26PM -0500, Waiman Long wrote:
>> OK, you convinced me. However, that can still lead to anonymous wakeups
>> that can be problematic if it happens in certain places. Should we try
>> to reduce anonymous wakeup as much as possible?
>
>No, we should at all times accept and expect spurious wakeups.

Right, when this was merged, spurious wakeups were acknowledged as quite
possible.


* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 18:08           ` Peter Zijlstra
@ 2018-11-29 18:26             ` Waiman Long
  2018-11-29 18:31               ` Will Deacon
  2018-11-29 21:30               ` Davidlohr Bueso
  0 siblings, 2 replies; 50+ messages in thread
From: Waiman Long @ 2018-11-29 18:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, Davidlohr Bueso

On 11/29/2018 01:08 PM, Peter Zijlstra wrote:
> On Thu, Nov 29, 2018 at 06:27:00PM +0100, Peter Zijlstra wrote:
>> wake_up_q() should, per the barriers in wake_up_process(), ensure that
>> if wake_q_add() fails, there will be a wakeup of that task after that
>> point.
>>
>> So if we put wake_up_q() at the location where wake_up_process() should
>> be, it should all work.
>>
>> The bug in question is that it can happen at any time after
>> wake_q_add(), not necessarily at wake_up_q().
> Hmm, I think we're missing a barrier in wake_q_add(); when cmpxchg()
> fails we still need an smp_mb().
>
> Something like so.
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3d87a28da378..69def558edf6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -400,6 +400,13 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>  {
>  	struct wake_q_node *node = &task->wake_q;
>  
> +	/*
> +	 * Ensure that, when the below cmpxchg() fails, the corresponding
> +	 * wake_up_q() will observe our prior state.
> +	 *
> +	 * Pairs with the smp_mb() from wake_up_q()'s wake_up_process().
> +	 */
> +	smp_mb();
>  	/*
>  	 * Atomically grab the task, if ->wake_q is !nil already it means
>  	 * its already queued (either by us or someone else) and will get the
> @@ -408,7 +415,7 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>  	 * This cmpxchg() executes a full barrier, which pairs with the full
>  	 * barrier executed by the wakeup in wake_up_q().
>  	 */
> -	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
> +	if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
>  		return;
>  
>  	get_task_struct(task);

That can be costly for x86 which will now have 2 locked instructions.
Should we introduce a kind of special cmpxchg (e.g. cmpxchg_mb) that
will guarantee a memory barrier whether the operation fails or not?

Cheers,
Longman




* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 18:26             ` Waiman Long
@ 2018-11-29 18:31               ` Will Deacon
  2018-11-29 18:34                 ` Waiman Long
  2018-11-29 21:30               ` Davidlohr Bueso
  1 sibling, 1 reply; 50+ messages in thread
From: Will Deacon @ 2018-11-29 18:31 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Yongji Xie, mingo, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, Davidlohr Bueso

On Thu, Nov 29, 2018 at 01:26:34PM -0500, Waiman Long wrote:
> On 11/29/2018 01:08 PM, Peter Zijlstra wrote:
> > Hmm, I think we're missing a barrier in wake_q_add(); when cmpxchg()
> > fails we still need an smp_mb().
> >
> > Something like so.
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 3d87a28da378..69def558edf6 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -400,6 +400,13 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
> >  {
> >  	struct wake_q_node *node = &task->wake_q;
> >  
> > +	/*
> > +	 * Ensure that, when the below cmpxchg() fails, the corresponding
> > +	 * wake_up_q() will observe our prior state.
> > +	 *
> > +	 * Pairs with the smp_mb() from wake_up_q()'s wake_up_process().
> > +	 */
> > +	smp_mb();
> >  	/*
> >  	 * Atomically grab the task, if ->wake_q is !nil already it means
> >  	 * its already queued (either by us or someone else) and will get the
> > @@ -408,7 +415,7 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
> >  	 * This cmpxchg() executes a full barrier, which pairs with the full
> >  	 * barrier executed by the wakeup in wake_up_q().
> >  	 */
> > -	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
> > +	if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
> >  		return;
> >  
> >  	get_task_struct(task);
> 
> That can be costly for x86 which will now have 2 locked instructions.
> Should we introduce a kind of special cmpxchg (e.g. cmpxchg_mb) that
> will guarantee a memory barrier whether the operation fails or not?

I thought smp_mb__before_atomic() was designed for this sort of thing?

Will
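
For concreteness, that variant might look like the sketch below. This
assumes smp_mb__before_atomic() is strong enough to order the prior
stores even when the relaxed cmpxchg() fails, which is the intent but
worth double-checking per architecture:

	/*
	 * Full barrier up front, then a relaxed cmpxchg(): our prior
	 * stores are ordered before the (possibly failing) queue
	 * attempt, so a concurrent wake_up_q() observes our state
	 * whether or not we win the cmpxchg().
	 */
	smp_mb__before_atomic();
	if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
		return;

	get_task_struct(task);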


* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 18:31               ` Will Deacon
@ 2018-11-29 18:34                 ` Waiman Long
  2018-11-29 22:05                   ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Waiman Long @ 2018-11-29 18:34 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Yongji Xie, mingo, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, Davidlohr Bueso

On 11/29/2018 01:31 PM, Will Deacon wrote:
> On Thu, Nov 29, 2018 at 01:26:34PM -0500, Waiman Long wrote:
>> On 11/29/2018 01:08 PM, Peter Zijlstra wrote:
>>> Hmm, I think we're missing a barrier in wake_q_add(); when cmpxchg()
>>> fails we still need an smp_mb().
>>>
>>> Something like so.
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 3d87a28da378..69def558edf6 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -400,6 +400,13 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>>>  {
>>>  	struct wake_q_node *node = &task->wake_q;
>>>  
>>> +	/*
>>> +	 * Ensure that, when the below cmpxchg() fails, the corresponding
>>> +	 * wake_up_q() will observe our prior state.
>>> +	 *
>>> +	 * Pairs with the smp_mb() from wake_up_q()'s wake_up_process().
>>> +	 */
>>> +	smp_mb();
>>>  	/*
>>>  	 * Atomically grab the task, if ->wake_q is !nil already it means
>>>  	 * its already queued (either by us or someone else) and will get the
>>> @@ -408,7 +415,7 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>>>  	 * This cmpxchg() executes a full barrier, which pairs with the full
>>>  	 * barrier executed by the wakeup in wake_up_q().
>>>  	 */
>>> -	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
>>> +	if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
>>>  		return;
>>>  
>>>  	get_task_struct(task);
>> That can be costly for x86 which will now have 2 locked instructions.
>> Should we introduce a kind of special cmpxchg (e.g. cmpxchg_mb) that
>> will guarantee a memory barrier whether the operation fails or not?
> I thought smp_mb__before_atomic() was designed for this sort of thing?
>
> Will

That will certainly work for x86. However, I am not sure if that will be
optimal for arm and powerpc.

Cheers,
Longman



* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 13:44   ` Peter Zijlstra
  2018-11-29 14:02     ` Yongji Xie
@ 2018-11-29 18:43     ` Davidlohr Bueso
  2018-11-29 18:49       ` Waiman Long
  1 sibling, 1 reply; 50+ messages in thread
From: Davidlohr Bueso @ 2018-11-29 18:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, Waiman Long,
	Thomas Gleixner

On Thu, 29 Nov 2018, Peter Zijlstra wrote:

>On Thu, Nov 29, 2018 at 02:12:32PM +0100, Peter Zijlstra wrote:
>> Yes, I think this is real, and worse, I think we need to go audit all
>> wake_q_add() users and document this behaviour.
>>
>> In the ideal case we'd delay the actual wakeup to the last wake_up_q(),
>> but I don't think we can easily fix that.
>
>See commit: 1d0dcb3ad9d3 ("futex: Implement lockless wakeups"), I think
>that introduces the exact same bug.
>
>Something like the below perhaps, although this pattern seems to want a
>wake_q_add() variant that already assumes get_task_struct().

So I was looking at ways to avoid the redundant reference counting,
but given how wake_q_add() and wake_up_q() are so loose I can't
see how to avoid it -- we hold a reference across the calls to keep
the memory valid.

For example, wake_q will grab a reference iff the cmpxchg succeeds;
likewise it will enter the wakeup loop in wake_up_q(), and there is no
awareness of which caller had the failed cmpxchg because another wakeup
was in progress.

And yes, afaict all wake_q users suffer from the same issue, so we have
to move the wake_q_add() after the condition, while explicitly doing
the task ref counting.
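
That is, every user would need roughly the shape below (a sketch; "cond"
and "DONE" stand in for whatever the sleeper checks, e.g. waiter->task):

	get_task_struct(p);		/* keep p alive across the update */
	smp_store_release(&cond, DONE);	/* publish the wakeup condition */
	wake_q_add(wake_q, p);		/* takes its own reference on success */
	put_task_struct(p);		/* drop ours; wake_up_q() uses its own */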

Thanks,
Davidlohr


* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 18:43     ` Davidlohr Bueso
@ 2018-11-29 18:49       ` Waiman Long
  0 siblings, 0 replies; 50+ messages in thread
From: Waiman Long @ 2018-11-29 18:49 UTC (permalink / raw)
  To: Davidlohr Bueso, Peter Zijlstra
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, Thomas Gleixner

On 11/29/2018 01:43 PM, Davidlohr Bueso wrote:
> On Thu, 29 Nov 2018, Peter Zijlstra wrote:
>
>> On Thu, Nov 29, 2018 at 02:12:32PM +0100, Peter Zijlstra wrote:
>>> Yes, I think this is real, and worse, I think we need to go audit all
>>> wake_q_add() users and document this behaviour.
>>>
>>> In the ideal case we'd delay the actual wakeup to the last wake_up_q(),
>>> but I don't think we can easily fix that.
>>
>> See commit: 1d0dcb3ad9d3 ("futex: Implement lockless wakeups"), I think
>> that introduces the exact same bug.
>>
>> Something like the below perhaps, although this pattern seems to want a
>> wake_q_add() variant that already assumes get_task_struct().
>
> So I was looking at ways to avoid the redundant reference counting,
> but given how wake_q_add() and wake_up_q() are so loose I can't
> see how to avoid it -- we hold a reference across the calls to keep
> the memory valid.
>
> For example, wake_q will grab a reference iff the cmpxchg succeeds,
> and likewise will only enter the wakeup loop in wake_up_q(); there is no
> awareness of which caller had the failed cmpxchg because another wakeup
> was in progress.
>

I think it would be useful to make wake_q_add() return the status of
the wake_q insertion so that we can respond appropriately if it fails.

> And yes, afaict all wake_q users suffer from the same issue, so we have
> to move the wake_q_add() after the condition, while explicitly doing
> the task ref counting.

How about adding two helpers - wake_q_enter() and wake_q_exit() - that
can hide all the reference counting?
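
Purely as a sketch of the idea -- none of this exists yet, and it
assumes an internal bool-returning __wake_q_add() helper:

	static inline void wake_q_enter(struct wake_q_head *head,
					struct task_struct *task)
	{
		get_task_struct(task);		/* pin across queue/wakeup */
		if (!__wake_q_add(head, task))
			put_task_struct(task);	/* already queued, drop it */
	}

	/* and wake_up_q() would end each iteration with: */
	static inline void wake_q_exit(struct task_struct *task)
	{
		put_task_struct(task);		/* ref taken in wake_q_enter() */
	}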

Thanks,
Longman

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 18:26             ` Waiman Long
  2018-11-29 18:31               ` Will Deacon
@ 2018-11-29 21:30               ` Davidlohr Bueso
  2018-11-29 21:34                 ` Davidlohr Bueso
  1 sibling, 1 reply; 50+ messages in thread
From: Davidlohr Bueso @ 2018-11-29 21:30 UTC (permalink / raw)
  Cc: Peter Zijlstra, Yongji Xie, mingo, will.deacon, linux-kernel,
	xieyongji, zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24

On Thu, 29 Nov 2018, Waiman Long wrote:

>That can be costly for x86 which will now have 2 locked instructions.

Yeah, and when used as an actual queue we should really start to notice.
Some users just have a single task in the wake_q because avoiding the cost
of wake_up_process() with locks held is significant.

How about instead of adding the barrier before the cmpxchg, we do it
in the failed branch, right before we return. This is the uncommon
path.

Thanks,
Davidlohr

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 091e089063be..0d844a18a9dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -408,8 +408,14 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	 * This cmpxchg() executes a full barrier, which pairs with the full
 	 * barrier executed by the wakeup in wake_up_q().
 	 */
-	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
+	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL)) {
+		/*
+		 * Ensure that, when the cmpxchg() fails, the corresponding
+		 * wake_up_q() will observe our prior state.
+		 */
+		smp_mb__after_atomic();
 		return;
+	}
 
 	get_task_struct(task);
 

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 21:30               ` Davidlohr Bueso
@ 2018-11-29 21:34                 ` Davidlohr Bueso
  2018-11-29 22:17                   ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Davidlohr Bueso @ 2018-11-29 21:34 UTC (permalink / raw)
  To: Peter Zijlstra, Yongji Xie, mingo, will.deacon, linux-kernel,
	xieyongji, zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24,
	longman

I messed up something such that waiman was not in the thread. Ccing.

>On Thu, 29 Nov 2018, Waiman Long wrote:
>
>>That can be costly for x86 which will now have 2 locked instructions.
>
>Yeah, and when used as an actual queue we should really start to notice.
>Some users just have a single task in the wake_q because avoiding the cost
>of wake_up_process() with locks held is significant.
>
>How about instead of adding the barrier before the cmpxchg, we do it
>in the failed branch, right before we return. This is the uncommon
>path.
>
>Thanks,
>Davidlohr
>
>diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>index 091e089063be..0d844a18a9dc 100644
>--- a/kernel/sched/core.c
>+++ b/kernel/sched/core.c
>@@ -408,8 +408,14 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>	 * This cmpxchg() executes a full barrier, which pairs with the full
>	 * barrier executed by the wakeup in wake_up_q().
>	 */
>-	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
>+	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL)) {
>+		/*
>+		 * Ensure that, when the cmpxchg() fails, the corresponding
>+		 * wake_up_q() will observe our prior state.
>+		 */
>+		smp_mb__after_atomic();
>		return;
>+	}
>
>	get_task_struct(task);
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 18:34                 ` Waiman Long
@ 2018-11-29 22:05                   ` Peter Zijlstra
  2018-11-30  9:34                     ` Re: " Liu,Qi(ACU-T1)
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2018-11-29 22:05 UTC (permalink / raw)
  To: Waiman Long
  Cc: Will Deacon, Yongji Xie, mingo, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, Davidlohr Bueso

On Thu, Nov 29, 2018 at 01:34:05PM -0500, Waiman Long wrote:

> That will certainly work for x86. However, I am not sure if that will be
> optimal for arm and powerpc.

	smp_mb__before_atomic()
	cmpxchg_relaxed()

works. Also see pv_kick_node()'s commit:

  34d54f3d6917 ("locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures")

you might know the author of that :-)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 21:34                 ` Davidlohr Bueso
@ 2018-11-29 22:17                   ` Peter Zijlstra
  2018-11-30  9:30                     ` Andrea Parri
                                       ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Peter Zijlstra @ 2018-11-29 22:17 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, longman,
	andrea.parri

On Thu, Nov 29, 2018 at 01:34:21PM -0800, Davidlohr Bueso wrote:
> I messed up something such that waiman was not in the thread. Ccing.
> 
> > On Thu, 29 Nov 2018, Waiman Long wrote:
> > 
> > > That can be costly for x86 which will now have 2 locked instructions.
> > 
> > Yeah, and when used as an actual queue we should really start to notice.
> > Some users just have a single task in the wake_q because avoiding the cost
> > of wake_up_process() with locks held is significant.
> > 
> > How about instead of adding the barrier before the cmpxchg, we do it
> > in the failed branch, right before we return. This is the uncommon
> > path.
> > 
> > Thanks,
> > Davidlohr
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 091e089063be..0d844a18a9dc 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -408,8 +408,14 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
> > 	 * This cmpxchg() executes a full barrier, which pairs with the full
> > 	 * barrier executed by the wakeup in wake_up_q().
> > 	 */
> > -	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
> > +	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL)) {
> > +		/*
> > +		 * Ensure that, when the cmpxchg() fails, the corresponding
> > +		 * wake_up_q() will observe our prior state.
> > +		 */
> > +		smp_mb__after_atomic();
> > 		return;
> > +	}

So wake_up_q() does:

  wake_up_q():
	node->next = NULL;
	/* implied smp_mb */
	wake_up_process();

So per the 'cross your variables' rule, this side should then do:

  wake_q_add():
	/* wake_cond = true */
	smp_mb()
	cmpxchg_relaxed(&node->next, ...);

So that the ordering pivots around node->next.

Either we see NULL and win the cmpxchg (in which case we'll do the
wakeup later) or, when we fail the cmpxchg, we must observe what came
before the failure.

If it wasn't so damn late, I'd try and write a litmus test for this,
because now I'm starting to get confused -- also probably because it's
late.

In any case, I think your patch is 'wrong' because it puts the barrier on
the wrong side of the cmpxchg() (after, as opposed to before).
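
IOW, the add side would become something like this sketch:

	/* all 'wake condition' stores happen before this point */
	smp_mb__before_atomic();
	if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
		return;	/* failed: the concurrent wake_up_q() must see our state */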

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 22:17                   ` Peter Zijlstra
@ 2018-11-30  9:30                     ` Andrea Parri
  2018-12-03  5:31                     ` [PATCH -tip] kernel/sched,wake_q: Branch predict wake_q_add() cmpxchg Davidlohr Bueso
  2018-12-10 15:12                     ` [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil Yongji Xie
  2 siblings, 0 replies; 50+ messages in thread
From: Andrea Parri @ 2018-11-30  9:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, Yongji Xie, mingo, will.deacon, linux-kernel,
	xieyongji, zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24,
	longman

On Thu, Nov 29, 2018 at 11:17:14PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 29, 2018 at 01:34:21PM -0800, Davidlohr Bueso wrote:
> > I messed up something such that waiman was not in the thread. Ccing.
> > 
> > > On Thu, 29 Nov 2018, Waiman Long wrote:
> > > 
> > > > That can be costly for x86 which will now have 2 locked instructions.
> > > 
> > > Yeah, and when used as an actual queue we should really start to notice.
> > > Some users just have a single task in the wake_q because avoiding the cost
> > > of wake_up_process() with locks held is significant.
> > > 
> > > How about instead of adding the barrier before the cmpxchg, we do it
> > > in the failed branch, right before we return. This is the uncommon
> > > path.
> > > 
> > > Thanks,
> > > Davidlohr
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 091e089063be..0d844a18a9dc 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -408,8 +408,14 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
> > > 	 * This cmpxchg() executes a full barrier, which pairs with the full
> > > 	 * barrier executed by the wakeup in wake_up_q().
> > > 	 */
> > > -	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
> > > +	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL)) {
> > > +		/*
> > > +		 * Ensure that, when the cmpxchg() fails, the corresponding
> > > +		 * wake_up_q() will observe our prior state.
> > > +		 */
> > > +		smp_mb__after_atomic();
> > > 		return;
> > > +	}
> 
> So wake_up_q() does:
> 
>   wake_up_q():
> 	node->next = NULL;
> 	/* implied smp_mb */
> 	wake_up_process();
> 
> So per the 'cross your variables' rule, this side should then do:
> 
>   wake_q_add():
> 	/* wake_cond = true */
> 	smp_mb()
> 	cmpxchg_relaxed(&node->next, ...);
> 
> So that the ordering pivots around node->next.
> 
> Either we see NULL and win the cmpxchg (in which case we'll do the
> wakeup later) or, when we fail the cmpxchg, we must observe what came
> before the failure.
> 
> If it wasn't so damn late, I'd try and write a litmus test for this,
> because now I'm starting to get confused -- also probably because it's
> late.

The above description suggests:

C wake_up_q-wake_q_add

{
	int next = 0;
	int y = 0;
}

P0(int *next, int *y)
{
	int r0;

	/* in wake_up_q() */

	WRITE_ONCE(*next, 1);	/* node->next = NULL */
	smp_mb();		/* implied by wake_up_process() */
	r0 = READ_ONCE(*y);
}

P1(int *next, int *y)
{
	int r1;

	/* in wake_q_add() */

	WRITE_ONCE(*y, 1);	/* wake_cond = true */
	smp_mb__before_atomic();
	r1 = cmpxchg_relaxed(next, 1, 2);
}

exists (0:r0=0 /\ 1:r1=0)


This "exists" clause cannot be satisfied according to the LKMM:

Test wake_up_q-wake_q_add Allowed
States 3
0:r0=0; 1:r1=1;
0:r0=1; 1:r1=0;
0:r0=1; 1:r1=1;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (0:r0=0 /\ 1:r1=0)
Observation wake_up_q-wake_q_add Never 0 3
Time wake_up_q-wake_q_add 0.00
Hash=72d85545f97ef7fd35c8928259225ee0
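
(For anyone wanting to reproduce this: save the test as, say,
 wake_up_q-wake_q_add.litmus and run it against the LKMM from
 tools/memory-model, i.e. "herd7 -conf linux-kernel.cfg
 wake_up_q-wake_q_add.litmus".)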


(TBH, I'm not sure what "y" (you denoted it "wake_cond") is modeling
 here, but I might have missed some previous remarks...)

  Andrea


> 
> In any case, I think your patch is 'wrong' because it puts the barrier on
> the wrong side of the cmpxchg() (after, as opposed to before).

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 22:05                   ` Peter Zijlstra
@ 2018-11-30  9:34                     ` Liu,Qi(ACU-T1)
  2018-11-30 14:15                       ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Liu,Qi(ACU-T1) @ 2018-11-30  9:34 UTC (permalink / raw)
  To: 'Peter Zijlstra', Waiman Long
  Cc: Will Deacon, Yongji Xie, mingo, linux-kernel, Xie,Yongji,
	Zhang,Yu(ACU-T1), Yuan,Linsi, Ni,Xun, Li,Lin(ACU-T1),
	Davidlohr Bueso

Is there a semantic divergence between the x86 instruction "LOCK CMPXCHG" and the cmpxchg() macro defined in the Linux kernel? The former guarantees a full barrier in any case, while the latter only implies a barrier on success?
So, we use
    smp_mb__before_atomic()
    cmpxchg_relaxed()  // no barrier
to get rid of the redundant barrier in
    smp_mb__before_atomic()
    cmpxchg()          // implies a redundant barrier on success
-----Original Message-----
From: Peter Zijlstra [mailto:peterz@infradead.org]
Sent: 2018-11-30 06:06
To: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>; Yongji Xie <elohimes@gmail.com>; mingo@redhat.com; linux-kernel@vger.kernel.org; Xie,Yongji <xieyongji@baidu.com>; Zhang,Yu(ACU-T1) <zhangyu31@baidu.com>; Liu,Qi(ACU-T1) <liuqi16@baidu.com>; Yuan,Linsi <yuanlinsi01@baidu.com>; Ni,Xun <nixun@baidu.com>; Li,Lin(ACU-T1) <lilin24@baidu.com>; Davidlohr Bueso <dave@stgolabs.net>
Subject: Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil

On Thu, Nov 29, 2018 at 01:34:05PM -0500, Waiman Long wrote:

> That will certainly work for x86. However, I am not sure if that will 
> be optimal for arm and powerpc.

	smp_mb__before_atomic()
	cmpxchg_relaxed()

works. Also see pv_kick_node()'s commit:

  34d54f3d6917 ("locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures")

you might know the author of that :-)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-30  9:34                     ` Re: " Liu,Qi(ACU-T1)
@ 2018-11-30 14:15                       ` Peter Zijlstra
  0 siblings, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2018-11-30 14:15 UTC (permalink / raw)
  To: Liu,Qi(ACU-T1)
  Cc: Waiman Long, Will Deacon, Yongji Xie, mingo, linux-kernel,
	Xie,Yongji, Zhang,Yu(ACU-T1), Yuan,Linsi, Ni,Xun, Li,Lin(ACU-T1),
	Davidlohr Bueso

On Fri, Nov 30, 2018 at 09:34:19AM +0000, Liu,Qi(ACU-T1) wrote:
> Is there a semantic divergence between x86 instruction "LOCK cmpxchg"
> and the macro cmpxchg defined in linux kernel? The former guarantee
> full barrier in any case, and the latter only imply barrier in case of
> success?

> So, we use 	
>     smp_mb__before_atomic()
> 	cmpxchg_relaxed()  // no barrier
> to get rid of the redundant barrier
>    smp_mb__before_atomic()
>    cmpxchg()          // imply a redundant barrier when successing  

No, it is all horribly more complicated :-)

On x86: cmpxchg_relaxed() == cmpxchg() == LOCK CMPXCHG, however
smp_mb__{before,after}_atomic() is a no-op.

On say arm OTOH: cmpxchg_relaxed() == LL/SC, but then
smp_mb__{before,after}_atomic() is smp_mb(), such that: cmpxchg() := {
smp_mb__before_atomic(); cmpxchg_relaxed(); smp_mb__after_atomic(); }

The difference is that on x86 atomic instructions unconditionally imply
memory ordering; whereas on ARM they do not.
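
As a rough mental model (not the actual arch headers):

	/*
	 * x86:   cmpxchg_relaxed() == cmpxchg() == LOCK CMPXCHG, always a
	 *        full barrier; smp_mb__before_atomic() == barrier(),
	 *        compiler-only.
	 *
	 * arm64: cmpxchg_relaxed() == LL/SC without ordering;
	 *        smp_mb__before_atomic() == smp_mb().
	 *
	 * So smp_mb__before_atomic() + cmpxchg_relaxed() is at least as
	 * strong as cmpxchg() everywhere, while x86 keeps its single
	 * atomic instruction instead of gaining an extra fence.
	 */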



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH -tip] kernel/sched,wake_q: Branch predict wake_q_add() cmpxchg
  2018-11-29 22:17                   ` Peter Zijlstra
  2018-11-30  9:30                     ` Andrea Parri
@ 2018-12-03  5:31                     ` Davidlohr Bueso
  2018-12-03 16:10                       ` Waiman Long
  2019-01-21 11:28                       ` [tip:locking/core] sched/wake_q: Add branch prediction hint to " tip-bot for Davidlohr Bueso
  2018-12-10 15:12                     ` [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil Yongji Xie
  2 siblings, 2 replies; 50+ messages in thread
From: Davidlohr Bueso @ 2018-12-03  5:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, longman,
	andrea.parri

The cmpxchg() will fail when the task is already in the process
of waking up, which is an extremely rare occurrence.
Micro-optimize the call and put an unlikely() around it.

To no surprise, when using CONFIG_PROFILE_ANNOTATED_BRANCHES
under a number of workloads the incorrect rate was a mere 1-2%.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 091e089063be..f7747cf6e427 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -408,7 +408,7 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	 * This cmpxchg() executes a full barrier, which pairs with the full
 	 * barrier executed by the wakeup in wake_up_q().
 	 */
-	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
+	if (unlikely(cmpxchg(&node->next, NULL, WAKE_Q_TAIL)))
 		return;
 
 	get_task_struct(task);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH -tip] kernel/sched,wake_q: Branch predict wake_q_add() cmpxchg
  2018-12-03  5:31                     ` [PATCH -tip] kernel/sched,wake_q: Branch predict wake_q_add() cmpxchg Davidlohr Bueso
@ 2018-12-03 16:10                       ` Waiman Long
  2019-01-21 11:28                       ` [tip:locking/core] sched/wake_q: Add branch prediction hint to " tip-bot for Davidlohr Bueso
  1 sibling, 0 replies; 50+ messages in thread
From: Waiman Long @ 2018-12-03 16:10 UTC (permalink / raw)
  To: Davidlohr Bueso, Peter Zijlstra
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, xieyongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, andrea.parri

On 12/03/2018 12:31 AM, Davidlohr Bueso wrote:
> The cmpxchg() will fail when the task is already in the process
> of waking up, which is an extremely rare occurrence.
> Micro-optimize the call and put an unlikely() around it.
>
> To no surprise, when using CONFIG_PROFILE_ANNOTATED_BRANCHES
> under a number of workloads the incorrect rate was a mere 1-2%.
>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
> kernel/sched/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 091e089063be..f7747cf6e427 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -408,7 +408,7 @@ void wake_q_add(struct wake_q_head *head, struct
> task_struct *task)
>      * This cmpxchg() executes a full barrier, which pairs with the full
>      * barrier executed by the wakeup in wake_up_q().
>      */
> -    if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
> +    if (unlikely(cmpxchg(&node->next, NULL, WAKE_Q_TAIL)))
>         return;
>
>     get_task_struct(task);

Acked-by: Waiman Long <longman@redhat.com>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-11-29 22:17                   ` Peter Zijlstra
  2018-11-30  9:30                     ` Andrea Parri
  2018-12-03  5:31                     ` [PATCH -tip] kernel/sched,wake_q: Branch predict wake_q_add() cmpxchg Davidlohr Bueso
@ 2018-12-10 15:12                     ` Yongji Xie
  2018-12-17 11:37                       ` Peter Zijlstra
  2 siblings, 1 reply; 50+ messages in thread
From: Yongji Xie @ 2018-12-10 15:12 UTC (permalink / raw)
  To: peterz
  Cc: dave, mingo, will.deacon, linux-kernel, Xie Yongji, zhangyu31,
	liuqi16, yuanlinsi01, nixun, lilin24, longman, andrea.parri

On Fri, 30 Nov 2018 at 06:17, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 29, 2018 at 01:34:21PM -0800, Davidlohr Bueso wrote:
> > I messed up something such that waiman was not in the thread. Ccing.
> >
> > > On Thu, 29 Nov 2018, Waiman Long wrote:
> > >
> > > > That can be costly for x86 which will now have 2 locked instructions.
> > >
> > > Yeah, and when used as an actual queue we should really start to notice.
> > > Some users just have a single task in the wake_q because avoiding the cost
> > > of wake_up_process() with locks held is significant.
> > >
> > > How about instead of adding the barrier before the cmpxchg, we do it
> > > in the failed branch, right before we return. This is the uncommon
> > > path.
> > >
> > > Thanks,
> > > Davidlohr
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 091e089063be..0d844a18a9dc 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -408,8 +408,14 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
> > >      * This cmpxchg() executes a full barrier, which pairs with the full
> > >      * barrier executed by the wakeup in wake_up_q().
> > >      */
> > > -   if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
> > > +   if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL)) {
> > > +           /*
> > > +            * Ensure that, when the cmpxchg() fails, the corresponding
> > > +            * wake_up_q() will observe our prior state.
> > > +            */
> > > +           smp_mb__after_atomic();
> > >             return;
> > > +   }
>
> So wake_up_q() does:
>
>   wake_up_q():
>         node->next = NULL;
>         /* implied smp_mb */
>         wake_up_process();
>
> So per the 'cross your variables' rule, this side should then do:
>
>   wake_q_add():
>         /* wake_cond = true */
>         smp_mb()
>         cmpxchg_relaxed(&node->next, ...);
>
> So that the ordering pivots around node->next.
>
> Either we see NULL and win the cmpxchg (in which case we'll do the
> wakeup later) or, when we fail the cmpxchg, we must observe what came
> before the failure.
>
> If it wasn't so damn late, I'd try and write a litmus test for this,
> because now I'm starting to get confused -- also probably because it's
> late.
>

Hi Peter,

Please let me know if there is any progress on this issue. Thank you!

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-12-10 15:12                     ` [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil Yongji Xie
@ 2018-12-17 11:37                       ` Peter Zijlstra
  2018-12-17 13:12                         ` Yongji Xie
  2018-12-17 20:53                         ` Davidlohr Bueso
  0 siblings, 2 replies; 50+ messages in thread
From: Peter Zijlstra @ 2018-12-17 11:37 UTC (permalink / raw)
  To: Yongji Xie
  Cc: dave, mingo, will.deacon, linux-kernel, Xie Yongji, zhangyu31,
	liuqi16, yuanlinsi01, nixun, lilin24, longman, andrea.parri

On Mon, Dec 10, 2018 at 11:12:52PM +0800, Yongji Xie wrote:
> Hi Peter,
> 
> Please let me know if there is any progress on this issue. Thank you!

Right, sorry, my brain was filled with snot and didn't want to make
sense of things.

I've put some patches here:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git locking/core

Could you have a look?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-12-17 11:37                       ` Peter Zijlstra
@ 2018-12-17 13:12                         ` Yongji Xie
  2019-01-07 14:35                           ` Waiman Long
  2018-12-17 20:53                         ` Davidlohr Bueso
  1 sibling, 1 reply; 50+ messages in thread
From: Yongji Xie @ 2018-12-17 13:12 UTC (permalink / raw)
  To: peterz
  Cc: dave, mingo, will.deacon, linux-kernel, Xie Yongji, zhangyu31,
	liuqi16, yuanlinsi01, nixun, lilin24, longman, andrea.parri

On Mon, 17 Dec 2018 at 19:37, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Dec 10, 2018 at 11:12:52PM +0800, Yongji Xie wrote:
> > Hi Peter,
> >
> > Please let me know if there is any progress on this issue. Thank you!
>
> Right, sorry, my brain was filled with snot and didn't want to make
> sense of things.
>
> I've put some patches here:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git locking/core
>
> Could you have a look?

That looks good to me. Thank you!

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-12-17 11:37                       ` Peter Zijlstra
  2018-12-17 13:12                         ` Yongji Xie
@ 2018-12-17 20:53                         ` Davidlohr Bueso
  2018-12-18 13:10                           ` Peter Zijlstra
  1 sibling, 1 reply; 50+ messages in thread
From: Davidlohr Bueso @ 2018-12-17 20:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, Xie Yongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, longman,
	andrea.parri

On Mon, 17 Dec 2018, Peter Zijlstra wrote:
>
>I've put some patches here:
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git locking/core
>
>Could you have a look?

So how about the following to reduce some of the performance penalty (at
the cost of more complexity)?

Thanks,
Davidlohr

----------8<-----------------------------------------------------------
[PATCH] sched/wake_q: Reduce reference counting for special users

Some users, specifically futexes and rwsems, required fixes
that allowed the callers to be safe when wakeups occur before
they are expected by wake_up_q(). Such scenarios also play
games and rely on reference counting, and until now were
pivoting on wake_q doing it. With the wake_q_add() call being
moved down, this can no longer be the case. As such we end up
with a double task refcounting overhead; and these callers
care enough about this (being rather core-ish).

This patch introduces a wake_q_add_tasksafe() call that serves
for callers that have already done refcounting and therefore the
task is 'safe' from wake_q's point of view (in that it requires a
reference throughout the entire queue/wakeup cycle). These
users also need to check the return value of the operation and
do the put() if necessary when the cmpxchg() fails. Regular users
of wake_q_add() that don't care about when the wakeup actually
happens can just ignore the return value.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 include/linux/sched/wake_q.h |  7 ++++--
 kernel/futex.c               |  4 ++--
 kernel/locking/rwsem-xadd.c  |  7 +++---
 kernel/sched/core.c          | 53 +++++++++++++++++++++++++++++++-------------
 4 files changed, 49 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index 545f37138057..8c1fc6434c6c 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -51,8 +51,11 @@ static inline void wake_q_init(struct wake_q_head *head)
 	head->lastp = &head->first;
 }
 
-extern void wake_q_add(struct wake_q_head *head,
-		       struct task_struct *task);
+extern bool wake_q_add(struct wake_q_head *head,
+		      struct task_struct *task);
+extern bool wake_q_add_tasksafe(struct wake_q_head *head,
+				struct task_struct *task);
+
 extern void wake_up_q(struct wake_q_head *head);
 
 #endif /* _LINUX_SCHED_WAKE_Q_H */
diff --git a/kernel/futex.c b/kernel/futex.c
index d14971f6ed3d..2ff7e811f13b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1402,8 +1402,8 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	 * Queue the task for later wakeup for after we've released
 	 * the hb->lock. wake_q_add() grabs reference to p.
 	 */
-	wake_q_add(wake_q, p);
-	put_task_struct(p);
+	if (!wake_q_add_tasksafe(wake_q, p))
+		put_task_struct(p);
 }
 
 /*
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 50d9af615dc4..dea4dcf9d8f5 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -211,9 +211,10 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		 * Ensure issuing the wakeup (either by us or someone else)
 		 * after setting the reader waiter to nil.
 		 */
-		wake_q_add(wake_q, tsk);
-		/* wake_q_add() already take the task ref */
-		put_task_struct(tsk);
+		if (!wake_q_add_tasksafe(wake_q, tsk)) {
+			/* wake_q_add() already take the task ref */
+			put_task_struct(tsk);
+		}
 	}
 
 	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d740d7a3608d..2c1825fe46e6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -396,19 +396,8 @@ static bool set_nr_if_polling(struct task_struct *p)
 #endif
 #endif
 
-/**
- * wake_q_add() - queue a wakeup for 'later' waking.
- * @head: the wake_q_head to add @task to
- * @task: the task to queue for 'later' wakeup
- *
- * Queue a task for later wakeup, most likely by the wake_up_q() call in the
- * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
- * instantly.
- *
- * This function must be used as-if it were wake_up_process(); IOW the task
- * must be ready to be woken at this location.
- */
-void wake_q_add(struct wake_q_head *head, struct task_struct *task)
+bool __wake_q_add(struct wake_q_head *head,
+		  struct task_struct *task, bool tasksafe)
 {
 	struct wake_q_node *node = &task->wake_q;
 
@@ -422,15 +411,49 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	 */
 	smp_mb__before_atomic();
 	if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
-		return;
+		return false;
 
-	get_task_struct(task);
+	if (!tasksafe)
+		get_task_struct(task);
 
 	/*
 	 * The head is context local, there can be no concurrency.
 	 */
 	*head->lastp = node;
 	head->lastp = &node->next;
+	return true;
+}
+
+/**
+ * wake_q_add() - queue a wakeup for 'later' waking.
+ * @head: the wake_q_head to add @task to
+ * @task: the task to queue for 'later' wakeup
+ *
+ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
+ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
+ * instantly.
+ *
+ * This function must be used as-if it were wake_up_process(); IOW the task
+ * must be ready to be woken at this location.
+ *
+ * Returns whether or not the task was successfully queued for wakeup.
+ * If false, the task is already queued and can happen at any time after
+ * this call.
+ */
+bool wake_q_add(struct wake_q_head *head, struct task_struct *task)
+{
+	return __wake_q_add(head, task, false);
+}
+
+/*
+ * wake_q_add_tasksafe() is the same as the above wake_q_add(), except that
+ * the caller has already done the task reference counting for us. Normally
+ * the 'tasksafe' caller will check the return value and cleanup refcounting
+ * accordingly.
+ */
+bool wake_q_add_tasksafe(struct wake_q_head *head, struct task_struct *task)
+{
+	return __wake_q_add(head, task, true);
 }
 
 void wake_up_q(struct wake_q_head *head)
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-12-17 20:53                         ` Davidlohr Bueso
@ 2018-12-18 13:10                           ` Peter Zijlstra
  2018-12-18 13:14                             ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2018-12-18 13:10 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, Xie Yongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, longman,
	andrea.parri

On Mon, Dec 17, 2018 at 12:53:10PM -0800, Davidlohr Bueso wrote:
> On Mon, 17 Dec 2018, Peter Zijlstra wrote:
> > 
> > I've put some patches here:
> > 
> >  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git locking/core
> > 
> > Could you have a look?
> 
> So how about the following to reduce some of the performance penalty (at
> the cost of more complexity)?

I'd rather do it like so, except I'm still conflicted on the naming.

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index 545f37138057..ad826d2a4557 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -51,8 +51,8 @@ static inline void wake_q_init(struct wake_q_head *head)
 	head->lastp = &head->first;
 }
 
-extern void wake_q_add(struct wake_q_head *head,
-		       struct task_struct *task);
+extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
+extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task);
 extern void wake_up_q(struct wake_q_head *head);
 
 #endif /* _LINUX_SCHED_WAKE_Q_H */
diff --git a/kernel/futex.c b/kernel/futex.c
index d14971f6ed3d..6218d98f649b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1402,8 +1402,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	 * Queue the task for later wakeup for after we've released
 	 * the hb->lock. wake_q_add() grabs reference to p.
 	 */
-	wake_q_add(wake_q, p);
-	put_task_struct(p);
+	wake_q_add_safe(wake_q, p);
 }
 
 /*
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 50d9af615dc4..fbe96341beee 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -211,9 +211,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		 * Ensure issuing the wakeup (either by us or someone else)
 		 * after setting the reader waiter to nil.
 		 */
-		wake_q_add(wake_q, tsk);
-		/* wake_q_add() already take the task ref */
-		put_task_struct(tsk);
+		wake_q_add_safe(wake_q, tsk);
 	}
 
 	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d740d7a3608d..72d82ce73714 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -408,7 +408,7 @@ static bool set_nr_if_polling(struct task_struct *p)
  * This function must be used as-if it were wake_up_process(); IOW the task
  * must be ready to be woken at this location.
  */
-void wake_q_add(struct wake_q_head *head, struct task_struct *task)
+static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
 {
 	struct wake_q_node *node = &task->wake_q;
 
@@ -422,15 +422,27 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	 */
 	smp_mb__before_atomic();
 	if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
-		return;
-
-	get_task_struct(task);
+		return false;
 
 	/*
 	 * The head is context local, there can be no concurrency.
 	 */
 	*head->lastp = node;
 	head->lastp = &node->next;
+
+	return true;
+}
+
+void wake_q_add(struct wake_q_head *head, struct task_struct *task)
+{
+	if (__wake_q_add(head, task))
+		get_task_struct(task);
+}
+
+void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
+{
+	if (!__wake_q_add(head, task))
+		put_task_struct(task);
 }
 
 void wake_up_q(struct wake_q_head *head)

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-12-18 13:10                           ` Peter Zijlstra
@ 2018-12-18 13:14                             ` Peter Zijlstra
  2018-12-18 17:27                               ` Davidlohr Bueso
  2018-12-18 18:54                               ` [PATCH v2] sched/wake_q: Reduce reference counting for special users Davidlohr Bueso
  0 siblings, 2 replies; 50+ messages in thread
From: Peter Zijlstra @ 2018-12-18 13:14 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, Xie Yongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, longman,
	andrea.parri

On Tue, Dec 18, 2018 at 02:10:31PM +0100, Peter Zijlstra wrote:
> On Mon, Dec 17, 2018 at 12:53:10PM -0800, Davidlohr Bueso wrote:

> > So how about the following to reduce some of the performance penalty (at
> > the cost of more complexity)?
> 
> I'd rather do it like so, except I'm still conflicted on the naming.

> +void wake_q_add(struct wake_q_head *head, struct task_struct *task)
> +{
> +	if (__wake_q_add(head, task))
> +		get_task_struct(task);
> +}
> +
> +void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
> +{
> +	if (!__wake_q_add(head, task))
> +		put_task_struct(task);
>  }

That is, in the one case it has internal reference counting, in the
other case it consumes the reference counting.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-12-18 13:14                             ` Peter Zijlstra
@ 2018-12-18 17:27                               ` Davidlohr Bueso
  2018-12-18 18:54                               ` [PATCH v2] sched/wake_q: Reduce reference counting for special users Davidlohr Bueso
  1 sibling, 0 replies; 50+ messages in thread
From: Davidlohr Bueso @ 2018-12-18 17:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, Xie Yongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, longman,
	andrea.parri

On Tue, 18 Dec 2018, Peter Zijlstra wrote:
>> I'd rather do it like so, except I'm still conflicted on the naming.
>
>> +void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>> +{
>> +	if (__wake_q_add(head, task))
>> +		get_task_struct(task);
>> +}
>> +
>> +void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
>> +{
>> +	if (!__wake_a_add(head, task))
>> +		put_task_struct(task);
>>  }
>
>That is, in the one case it has internal reference counting, in the
>other case it consumes the reference counting.

Yeah I like that better for an interface. Also no concurrency in head
so delaying the get() should be ok.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v2] sched/wake_q: Reduce reference counting for special users
  2018-12-18 13:14                             ` Peter Zijlstra
  2018-12-18 17:27                               ` Davidlohr Bueso
@ 2018-12-18 18:54                               ` Davidlohr Bueso
  2018-12-18 19:17                                 ` Waiman Long
  1 sibling, 1 reply; 50+ messages in thread
From: Davidlohr Bueso @ 2018-12-18 18:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, Xie Yongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, longman,
	andrea.parri

Some users, specifically futexes and rwsems, required fixes
that allowed the callers to be safe when wakeups occur before
they are expected by wake_up_q(). Such scenarios also play
games and rely on reference counting, and until now were
pivoting on wake_q doing it. With the wake_q_add() call being
moved down, this can no longer be the case. As such we end up
with a double task refcounting overhead; and these callers
care enough about this (being rather core-ish).

This patch introduces a wake_q_add_safe() call that serves
for callers that have already done refcounting and therefore the
task is 'safe' from wake_q's point of view (in that it requires a
reference throughout the entire queue/wakeup cycle). In the one
case it has internal reference counting, in the other case it
consumes the reference counting.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---

Changes from v1:
 - Simplify some of the interface, allowing callers to not worry
   about the return value of the add operation. This makes the caller
   refcounting two-level/asymmetric by putting the refcounting in
   wake_q's hands, but the benefits are worth it.

Applies on peterz' queue tree.

 include/linux/sched/wake_q.h |  4 +--
 kernel/futex.c               |  3 +--
 kernel/locking/rwsem-xadd.c  |  4 +--
 kernel/sched/core.c          | 62 ++++++++++++++++++++++++++++++++------------
 4 files changed, 50 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index 545f37138057..ad826d2a4557 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -51,8 +51,8 @@ static inline void wake_q_init(struct wake_q_head *head)
 	head->lastp = &head->first;
 }
 
-extern void wake_q_add(struct wake_q_head *head,
-		       struct task_struct *task);
+extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
+extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task);
 extern void wake_up_q(struct wake_q_head *head);
 
 #endif /* _LINUX_SCHED_WAKE_Q_H */
diff --git a/kernel/futex.c b/kernel/futex.c
index d14971f6ed3d..6218d98f649b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1402,8 +1402,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	 * Queue the task for later wakeup for after we've released
 	 * the hb->lock. wake_q_add() grabs reference to p.
 	 */
-	wake_q_add(wake_q, p);
-	put_task_struct(p);
+	wake_q_add_safe(wake_q, p);
 }
 
 /*
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 50d9af615dc4..fbe96341beee 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -211,9 +211,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		 * Ensure issuing the wakeup (either by us or someone else)
 		 * after setting the reader waiter to nil.
 		 */
-		wake_q_add(wake_q, tsk);
-		/* wake_q_add() already take the task ref */
-		put_task_struct(tsk);
+		wake_q_add_safe(wake_q, tsk);
 	}
 
 	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d740d7a3608d..ef85c6ad4caa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -396,19 +396,7 @@ static bool set_nr_if_polling(struct task_struct *p)
 #endif
 #endif
 
-/**
- * wake_q_add() - queue a wakeup for 'later' waking.
- * @head: the wake_q_head to add @task to
- * @task: the task to queue for 'later' wakeup
- *
- * Queue a task for later wakeup, most likely by the wake_up_q() call in the
- * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
- * instantly.
- *
- * This function must be used as-if it were wake_up_process(); IOW the task
- * must be ready to be woken at this location.
- */
-void wake_q_add(struct wake_q_head *head, struct task_struct *task)
+static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
 {
 	struct wake_q_node *node = &task->wake_q;
 
@@ -422,15 +410,57 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	 */
 	smp_mb__before_atomic();
 	if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
-		return;
-
-	get_task_struct(task);
+		return false;
 
 	/*
 	 * The head is context local, there can be no concurrency.
 	 */
 	*head->lastp = node;
 	head->lastp = &node->next;
+	return true;
+}
+
+/**
+ * wake_q_add() - queue a wakeup for 'later' waking.
+ * @head: the wake_q_head to add @task to
+ * @task: the task to queue for 'later' wakeup
+ *
+ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
+ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
+ * instantly.
+ *
+ * This function must be used as-if it were wake_up_process(); IOW the task
+ * must be ready to be woken at this location.
+ *
+ * The @task will be refcounted upon a
+ */
+void wake_q_add(struct wake_q_head *head, struct task_struct *task)
+{
+	if (__wake_q_add(head, task))
+		get_task_struct(task);
+}
+
+/**
+ * wake_q_add_safe() - queue a wakeup for 'later' waking.
+ * @head: the wake_q_head to add @task to
+ * @task: the task to queue for 'later' wakeup
+ *
+ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
+ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
+ * instantly.
+ *
+ * This function must be used as-if it were wake_up_process(); IOW the task
+ * must be ready to be woken at this location.
+ *
+ * This function is essentially a task-safe equivalent to wake_q_add(). Callers
+ * that already hold reference to @task can call the 'safe' version and trust
+ * wake_q to do the right thing depending whether or not the @task is already
+ * queued for wakeup.
+ */
+void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
+{
+	if (!__wake_q_add(head, task))
+		get_task_struct(task);
 }
 
 void wake_up_q(struct wake_q_head *head)
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v2] sched/wake_q: Reduce reference counting for special users
  2018-12-18 18:54                               ` [PATCH v2] sched/wake_q: Reduce reference counting for special users Davidlohr Bueso
@ 2018-12-18 19:17                                 ` Waiman Long
  2018-12-18 19:30                                   ` Davidlohr Bueso
  0 siblings, 1 reply; 50+ messages in thread
From: Waiman Long @ 2018-12-18 19:17 UTC (permalink / raw)
  To: Davidlohr Bueso, Peter Zijlstra
  Cc: Yongji Xie, mingo, will.deacon, linux-kernel, Xie Yongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, andrea.parri

On 12/18/2018 01:54 PM, Davidlohr Bueso wrote:
> Some users, specifically futexes and rwsems, required fixes
> that allowed the callers to be safe when wakeups occur before
> they are expected by wake_up_q(). Such scenarios also play
> games and rely on reference counting, and until now were
> pivoting on wake_q doing it. With the wake_q_add() call being
> moved down, this can no longer be the case. As such we end up
> with a double task refcounting overhead; and these callers
> care enough about this (being rather core-ish).
>
> This patch introduces a wake_q_add_safe() call that serves
> for callers that have already done refcounting and therefore the
> task is 'safe' from wake_q's point of view (in that it requires a
> reference throughout the entire queue/wakeup cycle). In the one
> case it has internal reference counting, in the other case it
> consumes the reference counting.
>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>
> Changes from v1:
> - Simplify some of the interface allowing callers to not worry
>   about the return value of the add operation. This makes caller
>   refcouting two levels/asymm by putting the refcouting in the
>   wake_q's hands, but the benefits are worth it.
>
> Applies on peterz' queue tree.
>
> include/linux/sched/wake_q.h |  4 +--
> kernel/futex.c               |  3 +--
> kernel/locking/rwsem-xadd.c  |  4 +--
> kernel/sched/core.c          | 62
> ++++++++++++++++++++++++++++++++------------
> 4 files changed, 50 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
> index 545f37138057..ad826d2a4557 100644
> --- a/include/linux/sched/wake_q.h
> +++ b/include/linux/sched/wake_q.h
> @@ -51,8 +51,8 @@ static inline void wake_q_init(struct wake_q_head
> *head)
>     head->lastp = &head->first;
> }
>
> -extern void wake_q_add(struct wake_q_head *head,
> -               struct task_struct *task);
> +extern void wake_q_add(struct wake_q_head *head, struct task_struct
> *task);
> +extern void wake_q_add_safe(struct wake_q_head *head, struct
> task_struct *task);
> extern void wake_up_q(struct wake_q_head *head);
>
> #endif /* _LINUX_SCHED_WAKE_Q_H */
> diff --git a/kernel/futex.c b/kernel/futex.c
> index d14971f6ed3d..6218d98f649b 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1402,8 +1402,7 @@ static void mark_wake_futex(struct wake_q_head
> *wake_q, struct futex_q *q)
>      * Queue the task for later wakeup for after we've released
>      * the hb->lock. wake_q_add() grabs reference to p.
>      */
> -    wake_q_add(wake_q, p);
> -    put_task_struct(p);
> +    wake_q_add_safe(wake_q, p);
> }
>
> /*
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index 50d9af615dc4..fbe96341beee 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -211,9 +211,7 @@ static void __rwsem_mark_wake(struct rw_semaphore
> *sem,
>          * Ensure issuing the wakeup (either by us or someone else)
>          * after setting the reader waiter to nil.
>          */
> -        wake_q_add(wake_q, tsk);
> -        /* wake_q_add() already take the task ref */
> -        put_task_struct(tsk);
> +        wake_q_add_safe(wake_q, tsk);
>     }
>
>     adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d740d7a3608d..ef85c6ad4caa 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -396,19 +396,7 @@ static bool set_nr_if_polling(struct task_struct *p)
> #endif
> #endif
>
> -/**
> - * wake_q_add() - queue a wakeup for 'later' waking.
> - * @head: the wake_q_head to add @task to
> - * @task: the task to queue for 'later' wakeup
> - *
> - * Queue a task for later wakeup, most likely by the wake_up_q() call
> in the
> - * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
> - * instantly.
> - *
> - * This function must be used as-if it were wake_up_process(); IOW
> the task
> - * must be ready to be woken at this location.
> - */
> -void wake_q_add(struct wake_q_head *head, struct task_struct *task)
> +static bool __wake_q_add(struct wake_q_head *head, struct task_struct
> *task)
> {
>     struct wake_q_node *node = &task->wake_q;
>
> @@ -422,15 +410,57 @@ void wake_q_add(struct wake_q_head *head, struct
> task_struct *task)
>      */
>     smp_mb__before_atomic();
>     if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
> -        return;
> -
> -    get_task_struct(task);
> +        return false;
>
>     /*
>      * The head is context local, there can be no concurrency.
>      */
>     *head->lastp = node;
>     head->lastp = &node->next;
> +    return true;
> +}
> +
> +/**
> + * wake_q_add() - queue a wakeup for 'later' waking.
> + * @head: the wake_q_head to add @task to
> + * @task: the task to queue for 'later' wakeup
> + *
> + * Queue a task for later wakeup, most likely by the wake_up_q() call
> in the
> + * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
> + * instantly.
> + *
> + * This function must be used as-if it were wake_up_process(); IOW
> the task
> + * must be ready to be woken at this location.
> + *
> + * The @task will be refcounted upon a

Are you missing some text here?

-Longman

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2] sched/wake_q: Reduce reference counting for special users
  2018-12-18 19:17                                 ` Waiman Long
@ 2018-12-18 19:30                                   ` Davidlohr Bueso
  2018-12-18 19:39                                     ` Davidlohr Bueso
  0 siblings, 1 reply; 50+ messages in thread
From: Davidlohr Bueso @ 2018-12-18 19:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Yongji Xie, mingo, will.deacon, linux-kernel,
	Xie Yongji, zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24,
	andrea.parri

>Are you missing some text here?

Nah this was a leftover, I was just removing it in v3.

Thanks.

----8<---------------------------------------------------------------

[PATCH v3] sched/wake_q: Reduce reference counting for special users

Some users, specifically futexes and rwsems, required fixes
that allowed the callers to be safe when wakeups occur before
they are expected by wake_up_q(). Such scenarios also play
games and rely on reference counting, and until now were
pivoting on wake_q doing it. With the wake_q_add() call being
moved down, this can no longer be the case. As such we end up
with a double task refcounting overhead; and these callers
care enough about this (being rather core-ish).

This patch introduces a wake_q_add_safe() call that serves
for callers that have already done refcounting and therefore the
task is 'safe' from wake_q's point of view (in that it requires a
reference throughout the entire queue/wakeup cycle). In the one
case it has internal reference counting, in the other case it
consumes the reference counting.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
Changes from v2: got rid of some bogus/incomplete leftover comments
in wake_q_add().

 include/linux/sched/wake_q.h |  4 +--
 kernel/futex.c               |  3 +--
 kernel/locking/rwsem-xadd.c  |  4 +--
 kernel/sched/core.c          | 60 ++++++++++++++++++++++++++++++++------------
 4 files changed, 48 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index 545f37138057..ad826d2a4557 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -51,8 +51,8 @@ static inline void wake_q_init(struct wake_q_head *head)
 	head->lastp = &head->first;
 }
 
-extern void wake_q_add(struct wake_q_head *head,
-		       struct task_struct *task);
+extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
+extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task);
 extern void wake_up_q(struct wake_q_head *head);
 
 #endif /* _LINUX_SCHED_WAKE_Q_H */
diff --git a/kernel/futex.c b/kernel/futex.c
index d14971f6ed3d..6218d98f649b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1402,8 +1402,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	 * Queue the task for later wakeup for after we've released
 	 * the hb->lock. wake_q_add() grabs reference to p.
 	 */
-	wake_q_add(wake_q, p);
-	put_task_struct(p);
+	wake_q_add_safe(wake_q, p);
 }
 
 /*
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 50d9af615dc4..fbe96341beee 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -211,9 +211,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		 * Ensure issuing the wakeup (either by us or someone else)
 		 * after setting the reader waiter to nil.
 		 */
-		wake_q_add(wake_q, tsk);
-		/* wake_q_add() already take the task ref */
-		put_task_struct(tsk);
+		wake_q_add_safe(wake_q, tsk);
 	}
 
 	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d740d7a3608d..8611c9a3ddd8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -396,19 +396,7 @@ static bool set_nr_if_polling(struct task_struct *p)
 #endif
 #endif
 
-/**
- * wake_q_add() - queue a wakeup for 'later' waking.
- * @head: the wake_q_head to add @task to
- * @task: the task to queue for 'later' wakeup
- *
- * Queue a task for later wakeup, most likely by the wake_up_q() call in the
- * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
- * instantly.
- *
- * This function must be used as-if it were wake_up_process(); IOW the task
- * must be ready to be woken at this location.
- */
-void wake_q_add(struct wake_q_head *head, struct task_struct *task)
+static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
 {
 	struct wake_q_node *node = &task->wake_q;
 
@@ -422,15 +410,55 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	 */
 	smp_mb__before_atomic();
 	if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
-		return;
-
-	get_task_struct(task);
+		return false;
 
 	/*
 	 * The head is context local, there can be no concurrency.
 	 */
 	*head->lastp = node;
 	head->lastp = &node->next;
+	return true;
+}
+
+/**
+ * wake_q_add() - queue a wakeup for 'later' waking.
+ * @head: the wake_q_head to add @task to
+ * @task: the task to queue for 'later' wakeup
+ *
+ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
+ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
+ * instantly.
+ *
+ * This function must be used as-if it were wake_up_process(); IOW the task
+ * must be ready to be woken at this location.
+ */
+void wake_q_add(struct wake_q_head *head, struct task_struct *task)
+{
+	if (__wake_q_add(head, task))
+		get_task_struct(task);
+}
+
+/**
+ * wake_q_add_safe() - queue a wakeup for 'later' waking.
+ * @head: the wake_q_head to add @task to
+ * @task: the task to queue for 'later' wakeup
+ *
+ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
+ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
+ * instantly.
+ *
+ * This function must be used as-if it were wake_up_process(); IOW the task
+ * must be ready to be woken at this location.
+ *
+ * This function is essentially a task-safe equivalent to wake_q_add(). Callers
+ * that already hold reference to @task can call the 'safe' version and trust
+ * wake_q to do the right thing depending whether or not the @task is already
+ * queued for wakeup.
+ */
+void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
+{
+	if (!__wake_q_add(head, task))
+		get_task_struct(task);
 }
 
 void wake_up_q(struct wake_q_head *head)
-- 
2.16.4


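For context on where the queued reference is eventually dropped: wake_up_q() is
the consumer side of this. A minimal sketch, paraphrased from kernel/sched/core.c
of this era (simplified; details may differ from the exact tree):

void wake_up_q(struct wake_q_head *head)
{
	struct wake_q_node *node = head->first;

	while (node != WAKE_Q_TAIL) {
		struct task_struct *task;

		task = container_of(node, struct task_struct, wake_q);

		/* Clear ->next first so the task may be re-queued. */
		node = node->next;
		task->wake_q.next = NULL;

		/* Wake the task, then drop the reference the queuer took. */
		wake_up_process(task);
		put_task_struct(task);
	}
}

A task can therefore sit on only one wake_q at a time, and whoever wins the
cmpxchg in __wake_q_add() must guarantee a reference is held on the task until
wake_up_q() runs.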
^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v2] sched/wake_q: Reduce reference counting for special users
  2018-12-18 19:30                                   ` Davidlohr Bueso
@ 2018-12-18 19:39                                     ` Davidlohr Bueso
  2018-12-18 19:53                                       ` [PATCH v4] " Davidlohr Bueso
  0 siblings, 1 reply; 50+ messages in thread
From: Davidlohr Bueso @ 2018-12-18 19:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Yongji Xie, mingo, will.deacon, linux-kernel,
	Xie Yongji, zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24,
	andrea.parri

On Tue, 18 Dec 2018, Davidlohr Bueso wrote:
>+void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
>+{
>+	if (!__wake_q_add(head, task))
>+		get_task_struct(task);

*sigh* and this should be put().

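That is, on the failed-cmpxchg path the task was already queued by a concurrent
waker, who also supplied the queued reference; the caller's own reference must
therefore be dropped, not duplicated. The corrected form, as it appears in v4
below:

void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
{
	if (!__wake_q_add(head, task))
		put_task_struct(task);
}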
^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4] sched/wake_q: Reduce reference counting for special users
  2018-12-18 19:39                                     ` Davidlohr Bueso
@ 2018-12-18 19:53                                       ` Davidlohr Bueso
  2018-12-18 20:35                                         ` Waiman Long
  2019-02-04  8:57                                         ` [tip:locking/core] " tip-bot for Davidlohr Bueso
  0 siblings, 2 replies; 50+ messages in thread
From: Davidlohr Bueso @ 2018-12-18 19:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Yongji Xie, mingo, will.deacon, linux-kernel,
	Xie Yongji, zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24,
	andrea.parri

Some users, specifically futexes and rwsems, required fixes
that allowed the callers to be safe when wakeups occur before
they are expected by wake_up_q(). Such scenarios also play
games with reference counting, and until now were pivoting
on wake_q doing it for them. With the wake_q_add() call being
moved down, this can no longer be the case. As such we end up
with a double task-refcounting overhead; and these callers
care enough about this (being rather core-ish).

This patch introduces a wake_q_add_safe() call that serves
callers that have already done the refcounting and therefore the
task is 'safe' from the wake_q point of view (in that a reference
is held throughout the entire queue/wakeup cycle). In the one
case wake_q keeps its internal reference counting, in the other
it consumes the caller's reference.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---

- Changes from v3: fixed wake_q_add_safe. While the previous version
  had been tested with a bootup, the failed cmpxchg path obviously
  hadn't been exercised.  Sorry about the noise.

 include/linux/sched/wake_q.h |  4 +--
 kernel/futex.c               |  3 +--
 kernel/locking/rwsem-xadd.c  |  4 +--
 kernel/sched/core.c          | 60 ++++++++++++++++++++++++++++++++------------
 4 files changed, 48 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index 545f37138057..ad826d2a4557 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -51,8 +51,8 @@ static inline void wake_q_init(struct wake_q_head *head)
 	head->lastp = &head->first;
 }
 
-extern void wake_q_add(struct wake_q_head *head,
-		       struct task_struct *task);
+extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
+extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task);
 extern void wake_up_q(struct wake_q_head *head);
 
 #endif /* _LINUX_SCHED_WAKE_Q_H */
diff --git a/kernel/futex.c b/kernel/futex.c
index d14971f6ed3d..6218d98f649b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1402,8 +1402,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	 * Queue the task for later wakeup for after we've released
 	 * the hb->lock. wake_q_add() grabs reference to p.
 	 */
-	wake_q_add(wake_q, p);
-	put_task_struct(p);
+	wake_q_add_safe(wake_q, p);
 }
 
 /*
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 50d9af615dc4..fbe96341beee 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -211,9 +211,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		 * Ensure issuing the wakeup (either by us or someone else)
 		 * after setting the reader waiter to nil.
 		 */
-		wake_q_add(wake_q, tsk);
-		/* wake_q_add() already take the task ref */
-		put_task_struct(tsk);
+		wake_q_add_safe(wake_q, tsk);
 	}
 
 	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d740d7a3608d..be977df66a21 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -396,19 +396,7 @@ static bool set_nr_if_polling(struct task_struct *p)
 #endif
 #endif
 
-/**
- * wake_q_add() - queue a wakeup for 'later' waking.
- * @head: the wake_q_head to add @task to
- * @task: the task to queue for 'later' wakeup
- *
- * Queue a task for later wakeup, most likely by the wake_up_q() call in the
- * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
- * instantly.
- *
- * This function must be used as-if it were wake_up_process(); IOW the task
- * must be ready to be woken at this location.
- */
-void wake_q_add(struct wake_q_head *head, struct task_struct *task)
+static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
 {
 	struct wake_q_node *node = &task->wake_q;
 
@@ -422,15 +410,55 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	 */
 	smp_mb__before_atomic();
 	if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
-		return;
-
-	get_task_struct(task);
+		return false;
 
 	/*
 	 * The head is context local, there can be no concurrency.
 	 */
 	*head->lastp = node;
 	head->lastp = &node->next;
+	return true;
+}
+
+/**
+ * wake_q_add() - queue a wakeup for 'later' waking.
+ * @head: the wake_q_head to add @task to
+ * @task: the task to queue for 'later' wakeup
+ *
+ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
+ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
+ * instantly.
+ *
+ * This function must be used as-if it were wake_up_process(); IOW the task
+ * must be ready to be woken at this location.
+ */
+void wake_q_add(struct wake_q_head *head, struct task_struct *task)
+{
+	if (__wake_q_add(head, task))
+		get_task_struct(task);
+}
+
+/**
+ * wake_q_add_safe() - safely queue a wakeup for 'later' waking.
+ * @head: the wake_q_head to add @task to
+ * @task: the task to queue for 'later' wakeup
+ *
+ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
+ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
+ * instantly.
+ *
+ * This function must be used as-if it were wake_up_process(); IOW the task
+ * must be ready to be woken at this location.
+ *
+ * This function is essentially a task-safe equivalent to wake_q_add(). Callers
+ * that already hold a reference to @task can call the 'safe' version and trust
+ * wake_q to do the right thing depending on whether or not the @task is
+ * already queued for wakeup.
+ */
+void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
+{
+	if (!__wake_q_add(head, task))
+		put_task_struct(task);
 }
 
 void wake_up_q(struct wake_q_head *head)
-- 
2.16.4

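To make the two calling conventions concrete, here is a sketch of the caller
patterns (hypothetical helper functions for illustration, not kernel code):

/* Caller holds no reference of its own: plain wake_q_add(). */
static void wake_later(struct wake_q_head *q, struct task_struct *p)
{
	/* On success wake_q_add() takes a reference; wake_up_q() drops it. */
	wake_q_add(q, p);
}

/* Caller already pinned the task (the futex/rwsem pattern above). */
static void wake_later_pinned(struct wake_q_head *q, struct task_struct *p)
{
	get_task_struct(p);	/* pinned for the caller's own reasons */

	/*
	 * Hand that reference over. If p was already queued by a concurrent
	 * waker, wake_q_add_safe() drops it instead; either way exactly one
	 * reference stays queued, to be dropped later by wake_up_q().
	 */
	wake_q_add_safe(q, p);
}

This is what eliminates the extra get/put pair per wakeup that plain
wake_q_add() would otherwise impose on these hot paths.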
^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v4] sched/wake_q: Reduce reference counting for special users
  2018-12-18 19:53                                       ` [PATCH v4] " Davidlohr Bueso
@ 2018-12-18 20:35                                         ` Waiman Long
  2019-01-21 16:02                                           ` Davidlohr Bueso
  2019-02-04  8:57                                         ` [tip:locking/core] " tip-bot for Davidlohr Bueso
  1 sibling, 1 reply; 50+ messages in thread
From: Waiman Long @ 2018-12-18 20:35 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Yongji Xie, mingo, will.deacon, linux-kernel,
	Xie Yongji, zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24,
	andrea.parri

On 12/18/2018 02:53 PM, Davidlohr Bueso wrote:
> Some users, specifically futexes and rwsems, required fixes
> that allowed the callers to be safe when wakeups occur before
> they are expected by wake_up_q(). Such scenarios also play
> games with reference counting, and until now were pivoting
> on wake_q doing it for them. With the wake_q_add() call being
> moved down, this can no longer be the case. As such we end up
> with a double task-refcounting overhead; and these callers
> care enough about this (being rather core-ish).
>
> This patch introduces a wake_q_add_safe() call that serves
> callers that have already done the refcounting and therefore the
> task is 'safe' from the wake_q point of view (in that a reference
> is held throughout the entire queue/wakeup cycle). In the one
> case wake_q keeps its internal reference counting, in the other
> it consumes the caller's reference.
>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>
> - Changes from v3: fixed wake_q_add_safe. While the previous version
>  had been tested with a bootup, the failed cmpxchg path obviously
>  hadn't been exercised.  Sorry about the noise.
>
> include/linux/sched/wake_q.h |  4 +--
> kernel/futex.c               |  3 +--
> kernel/locking/rwsem-xadd.c  |  4 +--
> kernel/sched/core.c          | 60
> ++++++++++++++++++++++++++++++++------------
> 4 files changed, 48 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
> index 545f37138057..ad826d2a4557 100644
> --- a/include/linux/sched/wake_q.h
> +++ b/include/linux/sched/wake_q.h
> @@ -51,8 +51,8 @@ static inline void wake_q_init(struct wake_q_head
> *head)
>     head->lastp = &head->first;
> }
>
> -extern void wake_q_add(struct wake_q_head *head,
> -               struct task_struct *task);
> +extern void wake_q_add(struct wake_q_head *head, struct task_struct
> *task);
> +extern void wake_q_add_safe(struct wake_q_head *head, struct
> task_struct *task);
> extern void wake_up_q(struct wake_q_head *head);
>
> #endif /* _LINUX_SCHED_WAKE_Q_H */
> diff --git a/kernel/futex.c b/kernel/futex.c
> index d14971f6ed3d..6218d98f649b 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1402,8 +1402,7 @@ static void mark_wake_futex(struct wake_q_head
> *wake_q, struct futex_q *q)
>      * Queue the task for later wakeup for after we've released
>      * the hb->lock. wake_q_add() grabs reference to p.
>      */
> -    wake_q_add(wake_q, p);
> -    put_task_struct(p);
> +    wake_q_add_safe(wake_q, p);
> }
>
> /*
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index 50d9af615dc4..fbe96341beee 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -211,9 +211,7 @@ static void __rwsem_mark_wake(struct rw_semaphore
> *sem,
>          * Ensure issuing the wakeup (either by us or someone else)
>          * after setting the reader waiter to nil.
>          */
> -        wake_q_add(wake_q, tsk);
> -        /* wake_q_add() already take the task ref */
> -        put_task_struct(tsk);
> +        wake_q_add_safe(wake_q, tsk);
>     }
>
>     adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d740d7a3608d..be977df66a21 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -396,19 +396,7 @@ static bool set_nr_if_polling(struct task_struct *p)
> #endif
> #endif
>
> -/**
> - * wake_q_add() - queue a wakeup for 'later' waking.
> - * @head: the wake_q_head to add @task to
> - * @task: the task to queue for 'later' wakeup
> - *
> - * Queue a task for later wakeup, most likely by the wake_up_q() call
> in the
> - * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
> - * instantly.
> - *
> - * This function must be used as-if it were wake_up_process(); IOW
> the task
> - * must be ready to be woken at this location.
> - */
> -void wake_q_add(struct wake_q_head *head, struct task_struct *task)
> +static bool __wake_q_add(struct wake_q_head *head, struct task_struct
> *task)
> {
>     struct wake_q_node *node = &task->wake_q;
>
> @@ -422,15 +410,55 @@ void wake_q_add(struct wake_q_head *head, struct
> task_struct *task)
>      */
>     smp_mb__before_atomic();
>     if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
> -        return;
> -
> -    get_task_struct(task);
> +        return false;
>
>     /*
>      * The head is context local, there can be no concurrency.
>      */
>     *head->lastp = node;
>     head->lastp = &node->next;
> +    return true;
> +}
> +
> +/**
> + * wake_q_add() - queue a wakeup for 'later' waking.
> + * @head: the wake_q_head to add @task to
> + * @task: the task to queue for 'later' wakeup
> + *
> + * Queue a task for later wakeup, most likely by the wake_up_q() call
> in the
> + * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
> + * instantly.
> + *
> + * This function must be used as-if it were wake_up_process(); IOW
> the task
> + * must be ready to be woken at this location.
> + */
> +void wake_q_add(struct wake_q_head *head, struct task_struct *task)
> +{
> +    if (__wake_q_add(head, task))
> +        get_task_struct(task);
> +}
> +
> +/**
> + * wake_q_add_safe() - safely queue a wakeup for 'later' waking.
> + * @head: the wake_q_head to add @task to
> + * @task: the task to queue for 'later' wakeup
> + *
> + * Queue a task for later wakeup, most likely by the wake_up_q() call
> in the
> + * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
> + * instantly.
> + *
> + * This function must be used as-if it were wake_up_process(); IOW
> the task
> + * must be ready to be woken at this location.
> + *
> + * This function is essentially a task-safe equivalent to
> wake_q_add(). Callers
> + * that already hold a reference to @task can call the 'safe' version
> and trust
> + * wake_q to do the right thing depending on whether or not the @task
> is already
> + * queued for wakeup.
> + */
> +void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
> +{
> +    if (!__wake_q_add(head, task))
> +        put_task_struct(task);
> }
>
> void wake_up_q(struct wake_q_head *head)

Acked-by: Waiman Long <longman@redhat.com>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2018-12-17 13:12                         ` Yongji Xie
@ 2019-01-07 14:35                           ` Waiman Long
  2019-01-07 15:31                             ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Waiman Long @ 2019-01-07 14:35 UTC (permalink / raw)
  To: peterz
  Cc: Yongji Xie, dave, mingo, will.deacon, linux-kernel, Xie Yongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, andrea.parri

On 12/17/2018 08:12 AM, Yongji Xie wrote:
> On Mon, 17 Dec 2018 at 19:37, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Mon, Dec 10, 2018 at 11:12:52PM +0800, Yongji Xie wrote:
>>> Hi Peter,
>>>
>>> Please let me know if there is any progress on this issue. Thank you!
>> Right, sorry, my brain was filled with snot and didn't want to make
>> sense of things.
>>
>> I've put some patches here:
>>
>>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git locking/core
>>
>> Could you have a look?
> That looks good to me. Thank you!
>
> Thanks,
> Yongji

Peter,

Are you planning to move the patch over to tip as the merge window has
just closed?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2019-01-07 14:35                           ` Waiman Long
@ 2019-01-07 15:31                             ` Peter Zijlstra
  2019-01-07 15:35                               ` Waiman Long
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2019-01-07 15:31 UTC (permalink / raw)
  To: Waiman Long
  Cc: Yongji Xie, dave, mingo, will.deacon, linux-kernel, Xie Yongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, andrea.parri

On Mon, Jan 07, 2019 at 09:35:09AM -0500, Waiman Long wrote:
> On 12/17/2018 08:12 AM, Yongji Xie wrote:
> > On Mon, 17 Dec 2018 at 19:37, Peter Zijlstra <peterz@infradead.org> wrote:
> >> On Mon, Dec 10, 2018 at 11:12:52PM +0800, Yongji Xie wrote:
> >>> Hi Peter,
> >>>
> >>> Please let me know if there is any progress on this issue. Thank you!
> >> Right, sorry, my brain was filled with snot and didn't want to make
> >> sense of things.
> >>
> >> I've put some patches here:
> >>
> >>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git locking/core
> >>
> >> Could you have a look?
> > That looks good to me. Thank you!
> >
> > Thanks,
> > Yongji
> 
> Peter,
> 
> Are you planning to move the patch over to tip as the merge window has
> just closed?

Sorry; yes, probably. I just got back from a 2-week holiday and am
somewhat scrambling to remember where I left off and what all landed in
my inbox.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil
  2019-01-07 15:31                             ` Peter Zijlstra
@ 2019-01-07 15:35                               ` Waiman Long
  0 siblings, 0 replies; 50+ messages in thread
From: Waiman Long @ 2019-01-07 15:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yongji Xie, dave, mingo, will.deacon, linux-kernel, Xie Yongji,
	zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24, andrea.parri

On 01/07/2019 10:31 AM, Peter Zijlstra wrote:
> On Mon, Jan 07, 2019 at 09:35:09AM -0500, Waiman Long wrote:
>> On 12/17/2018 08:12 AM, Yongji Xie wrote:
>>> On Mon, 17 Dec 2018 at 19:37, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> On Mon, Dec 10, 2018 at 11:12:52PM +0800, Yongji Xie wrote:
>>>>> Hi Peter,
>>>>>
>>>>> Please let me know if there is any progress on this issue. Thank you!
>>>> Right, sorry, my brain was filled with snot and didn't want to make
>>>> sense of things.
>>>>
>>>> I've put some patches here:
>>>>
>>>>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git locking/core
>>>>
>>>> Could you have a look?
>>> That looks good to me. Thank you!
>>>
>>> Thanks,
>>> Yongji
>> Peter,
>>
>> Are you planning to move the patch over to tip as the merge window has
>> just closed?
> Sorry; yes, probably, I just got back from a 2 week holiday and am
> somewhat scrambling to remember where I left off and what all landed in
> my inbox.

Happy New Year!

I have bug reports on some mysterious rwsem-related hangs which I think
can probably be fixed by this patch. That is why I am looking forward
to it landing in the tip tree.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [tip:locking/core] locking/rwsem: Fix (possible) missed wakeup
  2018-11-29 12:50 [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil Yongji Xie
  2018-11-29 13:12 ` Peter Zijlstra
@ 2019-01-21 11:28 ` tip-bot for Xie Yongji
  1 sibling, 0 replies; 50+ messages in thread
From: tip-bot for Xie Yongji @ 2019-01-21 11:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, torvalds, mingo, peterz, linux-kernel, zhangyu31, tglx, xieyongji

Commit-ID:  e158488be27b157802753a59b336142dc0eb0380
Gitweb:     https://git.kernel.org/tip/e158488be27b157802753a59b336142dc0eb0380
Author:     Xie Yongji <xieyongji@baidu.com>
AuthorDate: Thu, 29 Nov 2018 20:50:30 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 21 Jan 2019 11:15:39 +0100

locking/rwsem: Fix (possible) missed wakeup

Because wake_q_add() can imply an immediate wakeup (cmpxchg failure
case), we must not rely on the wakeup being delayed. However, commit:

  e38513905eea ("locking/rwsem: Rework zeroing reader waiter->task")

relies on exactly that behaviour in that the wakeup must not happen
until after we clear waiter->task.

[ peterz: Added changelog. ]

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e38513905eea ("locking/rwsem: Rework zeroing reader waiter->task")
Link: https://lkml.kernel.org/r/1543495830-2644-1-git-send-email-xieyongji@baidu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/rwsem-xadd.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 09b180063ee1..50d9af615dc4 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -198,15 +198,22 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		woken++;
 		tsk = waiter->task;
 
-		wake_q_add(wake_q, tsk);
+		get_task_struct(tsk);
 		list_del(&waiter->list);
 		/*
-		 * Ensure that the last operation is setting the reader
+		 * Ensure calling get_task_struct() before setting the reader
 		 * waiter to nil such that rwsem_down_read_failed() cannot
 		 * race with do_exit() by always holding a reference count
 		 * to the task to wakeup.
 		 */
 		smp_store_release(&waiter->task, NULL);
+		/*
+		 * Ensure issuing the wakeup (either by us or someone else)
+		 * after setting the reader waiter to nil.
+		 */
+		wake_q_add(wake_q, tsk);
+		/* wake_q_add() already take the task ref */
+		put_task_struct(tsk);
 	}
 
 	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;

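Condensing the hunk above, the ordering that matters is (an annotated
restatement of the fix, not new code):

	tsk = waiter->task;
	get_task_struct(tsk);			/* pin tsk against do_exit() */
	list_del(&waiter->list);
	smp_store_release(&waiter->task, NULL);	/* reader may now return */
	wake_q_add(wake_q, tsk);		/* may wake immediately on
						   cmpxchg failure */
	put_task_struct(tsk);			/* drop our pin; the queued
						   reference keeps tsk alive */

Because the reader spins on waiter->task and may exit as soon as it observes
NULL, the explicit reference is what keeps tsk valid across the whole
wake_q_add()/wakeup window, even in the immediate-wakeup case.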
^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [tip:locking/core] sched/wake_q: Add branch prediction hint to wake_q_add() cmpxchg
  2018-12-03  5:31                     ` [PATCH -tip] kernel/sched,wake_q: Branch predict wake_q_add() cmpxchg Davidlohr Bueso
  2018-12-03 16:10                       ` Waiman Long
@ 2019-01-21 11:28                       ` tip-bot for Davidlohr Bueso
  1 sibling, 0 replies; 50+ messages in thread
From: tip-bot for Davidlohr Bueso @ 2019-01-21 11:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, linux-kernel, longman, akpm, dave, peterz, tglx,
	will.deacon, elohimes, dbueso, hpa, paulmck, torvalds

Commit-ID:  87ff19cb2f1aa55a5d8b691e6690cc059a59d2ec
Gitweb:     https://git.kernel.org/tip/87ff19cb2f1aa55a5d8b691e6690cc059a59d2ec
Author:     Davidlohr Bueso <dave@stgolabs.net>
AuthorDate: Sun, 2 Dec 2018 21:31:30 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 21 Jan 2019 11:18:50 +0100

sched/wake_q: Add branch prediction hint to wake_q_add() cmpxchg

The cmpxchg() will fail when the task is already in the process
of waking up, and as such is an extremely rare occurrence.
Micro-optimize the call and put an unlikely() around it.

Unsurprisingly, when using CONFIG_PROFILE_ANNOTATED_BRANCHES
under a number of workloads, the incorrect rate was a mere 1-2%.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yongji Xie <elohimes@gmail.com>
Cc: andrea.parri@amarulasolutions.com
Cc: lilin24@baidu.com
Cc: liuqi16@baidu.com
Cc: nixun@baidu.com
Cc: xieyongji@baidu.com
Cc: yuanlinsi01@baidu.com
Cc: zhangyu31@baidu.com
Link: https://lkml.kernel.org/r/20181203053130.gwkw6kg72azt2npb@linux-r8p5
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d8d76a65cfdd..b05eef7d7a1f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -421,7 +421,7 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	 * state, even in the failed case, an explicit smp_mb() must be used.
 	 */
 	smp_mb__before_atomic();
-	if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
+	if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
 		return;
 
 	get_task_struct(task);

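For reference, unlikely() is a thin wrapper around GCC's __builtin_expect();
the kernel's definitions in include/linux/compiler.h are essentially:

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

The hint only steers block layout, keeping the rare cmpxchg-failure path out
of line; it does not change semantics, which is why the 1-2% incorrect rate
measured above is an acceptable trade.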
^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v4] sched/wake_q: Reduce reference counting for special users
  2018-12-18 20:35                                         ` Waiman Long
@ 2019-01-21 16:02                                           ` Davidlohr Bueso
  2019-01-22  8:55                                             ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Davidlohr Bueso @ 2019-01-21 16:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Yongji Xie, mingo, will.deacon, linux-kernel,
	Xie Yongji, zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24,
	andrea.parri

Hi - considering that the wake_q patches were picked up for tip/urgent, can
this one make it in as well?

Thanks,
Davidlohr

On Tue, 18 Dec 2018, Waiman Long wrote:

>On 12/18/2018 02:53 PM, Davidlohr Bueso wrote:
>> Some users, specifically futexes and rwsems, required fixes
>> that allowed the callers to be safe when wakeups occur before
>> they are expected by wake_up_q(). Such scenarios also play
>> games with reference counting, and until now were pivoting
>> on wake_q doing it for them. With the wake_q_add() call being
>> moved down, this can no longer be the case. As such we end up
>> with a double task-refcounting overhead; and these callers
>> care enough about this (being rather core-ish).
>>
>> This patch introduces a wake_q_add_safe() call that serves
>> callers that have already done the refcounting and therefore the
>> task is 'safe' from the wake_q point of view (in that a reference
>> is held throughout the entire queue/wakeup cycle). In the one
>> case wake_q keeps its internal reference counting, in the other
>> it consumes the caller's reference.
>>
>> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
>> ---
>>
>> - Changes from v3: fixed wake_q_add_safe. While the previous version
>>  had been tested with a bootup, the failed cmpxchg path obviously
>>  hadn't been exercised.  Sorry about the noise.
>>
>> include/linux/sched/wake_q.h |  4 +--
>> kernel/futex.c               |  3 +--
>> kernel/locking/rwsem-xadd.c  |  4 +--
>> kernel/sched/core.c          | 60
>> ++++++++++++++++++++++++++++++++------------
>> 4 files changed, 48 insertions(+), 23 deletions(-)
>>
>> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
>> index 545f37138057..ad826d2a4557 100644
>> --- a/include/linux/sched/wake_q.h
>> +++ b/include/linux/sched/wake_q.h
>> @@ -51,8 +51,8 @@ static inline void wake_q_init(struct wake_q_head
>> *head)
>>     head->lastp = &head->first;
>> }
>>
>> -extern void wake_q_add(struct wake_q_head *head,
>> -               struct task_struct *task);
>> +extern void wake_q_add(struct wake_q_head *head, struct task_struct
>> *task);
>> +extern void wake_q_add_safe(struct wake_q_head *head, struct
>> task_struct *task);
>> extern void wake_up_q(struct wake_q_head *head);
>>
>> #endif /* _LINUX_SCHED_WAKE_Q_H */
>> diff --git a/kernel/futex.c b/kernel/futex.c
>> index d14971f6ed3d..6218d98f649b 100644
>> --- a/kernel/futex.c
>> +++ b/kernel/futex.c
>> @@ -1402,8 +1402,7 @@ static void mark_wake_futex(struct wake_q_head
>> *wake_q, struct futex_q *q)
>>      * Queue the task for later wakeup for after we've released
>>      * the hb->lock. wake_q_add() grabs reference to p.
>>      */
>> -    wake_q_add(wake_q, p);
>> -    put_task_struct(p);
>> +    wake_q_add_safe(wake_q, p);
>> }
>>
>> /*
>> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
>> index 50d9af615dc4..fbe96341beee 100644
>> --- a/kernel/locking/rwsem-xadd.c
>> +++ b/kernel/locking/rwsem-xadd.c
>> @@ -211,9 +211,7 @@ static void __rwsem_mark_wake(struct rw_semaphore
>> *sem,
>>          * Ensure issuing the wakeup (either by us or someone else)
>>          * after setting the reader waiter to nil.
>>          */
>> -        wake_q_add(wake_q, tsk);
>> -        /* wake_q_add() already take the task ref */
>> -        put_task_struct(tsk);
>> +        wake_q_add_safe(wake_q, tsk);
>>     }
>>
>>     adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index d740d7a3608d..be977df66a21 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -396,19 +396,7 @@ static bool set_nr_if_polling(struct task_struct *p)
>> #endif
>> #endif
>>
>> -/**
>> - * wake_q_add() - queue a wakeup for 'later' waking.
>> - * @head: the wake_q_head to add @task to
>> - * @task: the task to queue for 'later' wakeup
>> - *
>> - * Queue a task for later wakeup, most likely by the wake_up_q() call
>> in the
>> - * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
>> - * instantly.
>> - *
>> - * This function must be used as-if it were wake_up_process(); IOW
>> the task
>> - * must be ready to be woken at this location.
>> - */
>> -void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>> +static bool __wake_q_add(struct wake_q_head *head, struct task_struct
>> *task)
>> {
>>     struct wake_q_node *node = &task->wake_q;
>>
>> @@ -422,15 +410,55 @@ void wake_q_add(struct wake_q_head *head, struct
>> task_struct *task)
>>      */
>>     smp_mb__before_atomic();
>>     if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
>> -        return;
>> -
>> -    get_task_struct(task);
>> +        return false;
>>
>>     /*
>>      * The head is context local, there can be no concurrency.
>>      */
>>     *head->lastp = node;
>>     head->lastp = &node->next;
>> +    return true;
>> +}
>> +
>> +/**
>> + * wake_q_add() - queue a wakeup for 'later' waking.
>> + * @head: the wake_q_head to add @task to
>> + * @task: the task to queue for 'later' wakeup
>> + *
>> + * Queue a task for later wakeup, most likely by the wake_up_q() call
>> in the
>> + * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
>> + * instantly.
>> + *
>> + * This function must be used as-if it were wake_up_process(); IOW
>> the task
>> + * must be ready to be woken at this location.
>> + */
>> +void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>> +{
>> +    if (__wake_q_add(head, task))
>> +        get_task_struct(task);
>> +}
>> +
>> +/**
>> + * wake_q_add_safe() - safely queue a wakeup for 'later' waking.
>> + * @head: the wake_q_head to add @task to
>> + * @task: the task to queue for 'later' wakeup
>> + *
>> + * Queue a task for later wakeup, most likely by the wake_up_q() call
>> in the
>> + * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
>> + * instantly.
>> + *
>> + * This function must be used as-if it were wake_up_process(); IOW
>> the task
>> + * must be ready to be woken at this location.
>> + *
>> + * This function is essentially a task-safe equivalent to
>> wake_q_add(). Callers
>> + * that already hold a reference to @task can call the 'safe' version
>> and trust
>> + * wake_q to do the right thing depending on whether or not the @task
>> is already
>> + * queued for wakeup.
>> + */
>> +void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
>> +{
>> +    if (!__wake_q_add(head, task))
>> +        put_task_struct(task);
>> }
>>
>> void wake_up_q(struct wake_q_head *head)
>
>Acked-by: Waiman Long <longman@redhat.com>
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4] sched/wake_q: Reduce reference counting for special users
  2019-01-21 16:02                                           ` Davidlohr Bueso
@ 2019-01-22  8:55                                             ` Peter Zijlstra
  0 siblings, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2019-01-22  8:55 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Waiman Long, Yongji Xie, mingo, will.deacon, linux-kernel,
	Xie Yongji, zhangyu31, liuqi16, yuanlinsi01, nixun, lilin24,
	andrea.parri

On Mon, Jan 21, 2019 at 08:02:18AM -0800, Davidlohr Bueso wrote:
> Hi - considering that the wake_q patches were picked up for tip/urgent, can
> this one make it in as well?

Ah, here it is. Yes, got it now.

I was actually wondering what happened, but couldn't find it in a hurry.

All sorted now.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [tip:locking/core] sched/wake_q: Reduce reference counting for special users
  2018-12-18 19:53                                       ` [PATCH v4] " Davidlohr Bueso
  2018-12-18 20:35                                         ` Waiman Long
@ 2019-02-04  8:57                                         ` tip-bot for Davidlohr Bueso
  2019-02-07 19:30                                           ` Davidlohr Bueso
  2019-02-12 14:14                                           ` Daniel Vacek
  1 sibling, 2 replies; 50+ messages in thread
From: tip-bot for Davidlohr Bueso @ 2019-02-04  8:57 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: akpm, will.deacon, longman, xieyongji, dbueso, torvalds, tglx,
	dave, linux-kernel, elohimes, peterz, mingo, paulmck, hpa

Commit-ID:  07879c6a3740fbbf3c8891a0ab484c20a12794d8
Gitweb:     https://git.kernel.org/tip/07879c6a3740fbbf3c8891a0ab484c20a12794d8
Author:     Davidlohr Bueso <dave@stgolabs.net>
AuthorDate: Tue, 18 Dec 2018 11:53:52 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 4 Feb 2019 09:03:28 +0100

sched/wake_q: Reduce reference counting for special users

Some users, specifically futexes and rwsems, required fixes
that allowed the callers to be safe when wakeups occur before
they are expected by wake_up_q(). Such scenarios also play
games with reference counting, and until now were pivoting
on wake_q doing it for them. With the wake_q_add() call being
moved down, this can no longer be the case. As such we end up
with a double task-refcounting overhead; and these callers
care enough about this (being rather core-ish).

This patch introduces a wake_q_add_safe() call that serves
callers that have already done the refcounting and therefore the
task is 'safe' from the wake_q point of view (in that a reference
is held throughout the entire queue/wakeup cycle). In the one
case wake_q keeps its internal reference counting, in the other
it consumes the caller's reference.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Xie Yongji <xieyongji@baidu.com>
Cc: Yongji Xie <elohimes@gmail.com>
Cc: andrea.parri@amarulasolutions.com
Cc: lilin24@baidu.com
Cc: liuqi16@baidu.com
Cc: nixun@baidu.com
Cc: yuanlinsi01@baidu.com
Cc: zhangyu31@baidu.com
Link: https://lkml.kernel.org/r/20181218195352.7orq3upiwfdbrdne@linux-r8p5
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched/wake_q.h |  4 +--
 kernel/futex.c               |  3 +--
 kernel/locking/rwsem-xadd.c  |  4 +--
 kernel/sched/core.c          | 60 ++++++++++++++++++++++++++++++++------------
 4 files changed, 48 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index 545f37138057..ad826d2a4557 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -51,8 +51,8 @@ static inline void wake_q_init(struct wake_q_head *head)
 	head->lastp = &head->first;
 }
 
-extern void wake_q_add(struct wake_q_head *head,
-		       struct task_struct *task);
+extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
+extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task);
 extern void wake_up_q(struct wake_q_head *head);
 
 #endif /* _LINUX_SCHED_WAKE_Q_H */
diff --git a/kernel/futex.c b/kernel/futex.c
index 69e619baf709..2abe1a0b3062 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1463,8 +1463,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	 * Queue the task for later wakeup for after we've released
 	 * the hb->lock. wake_q_add() grabs reference to p.
 	 */
-	wake_q_add(wake_q, p);
-	put_task_struct(p);
+	wake_q_add_safe(wake_q, p);
 }
 
 /*
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 50d9af615dc4..fbe96341beee 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -211,9 +211,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		 * Ensure issuing the wakeup (either by us or someone else)
 		 * after setting the reader waiter to nil.
 		 */
-		wake_q_add(wake_q, tsk);
-		/* wake_q_add() already take the task ref */
-		put_task_struct(tsk);
+		wake_q_add_safe(wake_q, tsk);
 	}
 
 	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c8b4dba3d2d..64ceaa5158c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -396,19 +396,7 @@ static bool set_nr_if_polling(struct task_struct *p)
 #endif
 #endif
 
-/**
- * wake_q_add() - queue a wakeup for 'later' waking.
- * @head: the wake_q_head to add @task to
- * @task: the task to queue for 'later' wakeup
- *
- * Queue a task for later wakeup, most likely by the wake_up_q() call in the
- * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
- * instantly.
- *
- * This function must be used as-if it were wake_up_process(); IOW the task
- * must be ready to be woken at this location.
- */
-void wake_q_add(struct wake_q_head *head, struct task_struct *task)
+static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
 {
 	struct wake_q_node *node = &task->wake_q;
 
@@ -422,15 +410,55 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	 */
 	smp_mb__before_atomic();
 	if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
-		return;
-
-	get_task_struct(task);
+		return false;
 
 	/*
 	 * The head is context local, there can be no concurrency.
 	 */
 	*head->lastp = node;
 	head->lastp = &node->next;
+	return true;
+}
+
+/**
+ * wake_q_add() - queue a wakeup for 'later' waking.
+ * @head: the wake_q_head to add @task to
+ * @task: the task to queue for 'later' wakeup
+ *
+ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
+ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
+ * instantly.
+ *
+ * This function must be used as-if it were wake_up_process(); IOW the task
+ * must be ready to be woken at this location.
+ */
+void wake_q_add(struct wake_q_head *head, struct task_struct *task)
+{
+	if (__wake_q_add(head, task))
+		get_task_struct(task);
+}
+
+/**
+ * wake_q_add_safe() - safely queue a wakeup for 'later' waking.
+ * @head: the wake_q_head to add @task to
+ * @task: the task to queue for 'later' wakeup
+ *
+ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
+ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
+ * instantly.
+ *
+ * This function must be used as-if it were wake_up_process(); IOW the task
+ * must be ready to be woken at this location.
+ *
+ * This function is essentially a task-safe equivalent to wake_q_add(). Callers
+ * that already hold a reference to @task can call the 'safe' version and trust
+ * wake_q to do the right thing depending on whether or not the @task is
+ * already queued for wakeup.
+ */
+void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
+{
+	if (!__wake_q_add(head, task))
+		put_task_struct(task);
 }
 
 void wake_up_q(struct wake_q_head *head)

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [tip:locking/core] sched/wake_q: Reduce reference counting for special users
  2019-02-04  8:57                                         ` [tip:locking/core] " tip-bot for Davidlohr Bueso
@ 2019-02-07 19:30                                           ` Davidlohr Bueso
  2019-02-12 14:14                                           ` Daniel Vacek
  1 sibling, 0 replies; 50+ messages in thread
From: Davidlohr Bueso @ 2019-02-07 19:30 UTC (permalink / raw)
  To: torvalds, dbueso, longman, will.deacon, akpm, xieyongji, mingo,
	paulmck, hpa, tglx, linux-kernel, peterz, elohimes
  Cc: linux-tip-commits

Could this change be pushed to v5.0 (tip/urgent) just like the wake_q fixes
that are already in Linus' tree? This will help backporting efforts as
most distros will want to avoid the performance hit and include this
patch.

Thanks,
Davidlohr

On Mon, 04 Feb 2019, tip-bot for Davidlohr Bueso wrote:

>Commit-ID:  07879c6a3740fbbf3c8891a0ab484c20a12794d8
>Gitweb:     https://git.kernel.org/tip/07879c6a3740fbbf3c8891a0ab484c20a12794d8
>Author:     Davidlohr Bueso <dave@stgolabs.net>
>AuthorDate: Tue, 18 Dec 2018 11:53:52 -0800
>Committer:  Ingo Molnar <mingo@kernel.org>
>CommitDate: Mon, 4 Feb 2019 09:03:28 +0100
>
>sched/wake_q: Reduce reference counting for special users
>
>Some users, specifically futexes and rwsems, required fixes
>that allowed the callers to be safe when wakeups occur before
>they are expected by wake_up_q(). Such scenarios also play
>games with reference counting, and until now were pivoting
>on wake_q doing it for them. With the wake_q_add() call being
>moved down, this can no longer be the case. As such we end up
>with a double task-refcounting overhead; and these callers
>care enough about this (being rather core-ish).
>
>This patch introduces a wake_q_add_safe() call that serves
>callers that have already done the refcounting and therefore the
>task is 'safe' from the wake_q point of view (in that a reference
>is held throughout the entire queue/wakeup cycle). In the one
>case wake_q keeps its internal reference counting, in the other
>it consumes the caller's reference.
>
>Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
>Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>Cc: Andrew Morton <akpm@linux-foundation.org>
>Cc: Linus Torvalds <torvalds@linux-foundation.org>
>Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>Cc: Peter Zijlstra <peterz@infradead.org>
>Cc: Thomas Gleixner <tglx@linutronix.de>
>Cc: Waiman Long <longman@redhat.com>
>Cc: Will Deacon <will.deacon@arm.com>
>Cc: Xie Yongji <xieyongji@baidu.com>
>Cc: Yongji Xie <elohimes@gmail.com>
>Cc: andrea.parri@amarulasolutions.com
>Cc: lilin24@baidu.com
>Cc: liuqi16@baidu.com
>Cc: nixun@baidu.com
>Cc: yuanlinsi01@baidu.com
>Cc: zhangyu31@baidu.com
>Link: https://lkml.kernel.org/r/20181218195352.7orq3upiwfdbrdne@linux-r8p5
>Signed-off-by: Ingo Molnar <mingo@kernel.org>
>---
> include/linux/sched/wake_q.h |  4 +--
> kernel/futex.c               |  3 +--
> kernel/locking/rwsem-xadd.c  |  4 +--
> kernel/sched/core.c          | 60 ++++++++++++++++++++++++++++++++------------
> 4 files changed, 48 insertions(+), 23 deletions(-)
>
>diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
>index 545f37138057..ad826d2a4557 100644
>--- a/include/linux/sched/wake_q.h
>+++ b/include/linux/sched/wake_q.h
>@@ -51,8 +51,8 @@ static inline void wake_q_init(struct wake_q_head *head)
> 	head->lastp = &head->first;
> }
>
>-extern void wake_q_add(struct wake_q_head *head,
>-		       struct task_struct *task);
>+extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
>+extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task);
> extern void wake_up_q(struct wake_q_head *head);
>
> #endif /* _LINUX_SCHED_WAKE_Q_H */
>diff --git a/kernel/futex.c b/kernel/futex.c
>index 69e619baf709..2abe1a0b3062 100644
>--- a/kernel/futex.c
>+++ b/kernel/futex.c
>@@ -1463,8 +1463,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
> 	 * Queue the task for later wakeup for after we've released
> 	 * the hb->lock. wake_q_add() grabs reference to p.
> 	 */
>-	wake_q_add(wake_q, p);
>-	put_task_struct(p);
>+	wake_q_add_safe(wake_q, p);
> }
>
> /*
>diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
>index 50d9af615dc4..fbe96341beee 100644
>--- a/kernel/locking/rwsem-xadd.c
>+++ b/kernel/locking/rwsem-xadd.c
>@@ -211,9 +211,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
> 		 * Ensure issuing the wakeup (either by us or someone else)
> 		 * after setting the reader waiter to nil.
> 		 */
>-		wake_q_add(wake_q, tsk);
>-		/* wake_q_add() already take the task ref */
>-		put_task_struct(tsk);
>+		wake_q_add_safe(wake_q, tsk);
> 	}
>
> 	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
>diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>index 3c8b4dba3d2d..64ceaa5158c5 100644
>--- a/kernel/sched/core.c
>+++ b/kernel/sched/core.c
>@@ -396,19 +396,7 @@ static bool set_nr_if_polling(struct task_struct *p)
> #endif
> #endif
>
>-/**
>- * wake_q_add() - queue a wakeup for 'later' waking.
>- * @head: the wake_q_head to add @task to
>- * @task: the task to queue for 'later' wakeup
>- *
>- * Queue a task for later wakeup, most likely by the wake_up_q() call in the
>- * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
>- * instantly.
>- *
>- * This function must be used as-if it were wake_up_process(); IOW the task
>- * must be ready to be woken at this location.
>- */
>-void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>+static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task)
> {
> 	struct wake_q_node *node = &task->wake_q;
>
>@@ -422,15 +410,55 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
> 	 */
> 	smp_mb__before_atomic();
> 	if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
>-		return;
>-
>-	get_task_struct(task);
>+		return false;
>
> 	/*
> 	 * The head is context local, there can be no concurrency.
> 	 */
> 	*head->lastp = node;
> 	head->lastp = &node->next;
>+	return true;
>+}
>+
>+/**
>+ * wake_q_add() - queue a wakeup for 'later' waking.
>+ * @head: the wake_q_head to add @task to
>+ * @task: the task to queue for 'later' wakeup
>+ *
>+ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
>+ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
>+ * instantly.
>+ *
>+ * This function must be used as-if it were wake_up_process(); IOW the task
>+ * must be ready to be woken at this location.
>+ */
>+void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>+{
>+	if (__wake_q_add(head, task))
>+		get_task_struct(task);
>+}
>+
>+/**
>+ * wake_q_add_safe() - safely queue a wakeup for 'later' waking.
>+ * @head: the wake_q_head to add @task to
>+ * @task: the task to queue for 'later' wakeup
>+ *
>+ * Queue a task for later wakeup, most likely by the wake_up_q() call in the
>+ * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
>+ * instantly.
>+ *
>+ * This function must be used as-if it were wake_up_process(); IOW the task
>+ * must be ready to be woken at this location.
>+ *
>+ * This function is essentially a task-safe equivalent to wake_q_add(). Callers
>+ * that already hold a reference to @task can call the 'safe' version and trust
>+ * wake_q to do the right thing depending on whether or not the @task is
>+ * already queued for wakeup.
>+ */
>+void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task)
>+{
>+	if (!__wake_q_add(head, task))
>+		put_task_struct(task);
> }
>
> void wake_up_q(struct wake_q_head *head)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [tip:locking/core] sched/wake_q: Reduce reference counting for special users
  2019-02-04  8:57                                         ` [tip:locking/core] " tip-bot for Davidlohr Bueso
  2019-02-07 19:30                                           ` Davidlohr Bueso
@ 2019-02-12 14:14                                           ` Daniel Vacek
  1 sibling, 0 replies; 50+ messages in thread
From: Daniel Vacek @ 2019-02-12 14:14 UTC (permalink / raw)
  To: tipbot
  Cc: akpm, dave, dbueso, elohimes, hpa, linux-kernel,
	linux-tip-commits, longman, mingo, paulmck, peterz, tglx,
	torvalds, will.deacon, xieyongji, neelx

> diff --git a/kernel/futex.c b/kernel/futex.c
> index 69e619baf709..2abe1a0b3062 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1463,8 +1463,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
>  	 * Queue the task for later wakeup for after we've released
>  	 * the hb->lock. wake_q_add() grabs reference to p.

Should this comment be fixed as well?

--nX

>  	 */
> -	wake_q_add(wake_q, p);
> -	put_task_struct(p);
> +	wake_q_add_safe(wake_q, p);
>  }
>  
>  /*

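A possible rewording of that stale comment, given that wake_q_add_safe() now
consumes the caller's reference rather than taking a new one (a suggestion
only, not a committed change):

	/*
	 * Queue the task for later wakeup for after we've released
	 * the hb->lock. wake_q_add_safe() consumes our reference to p.
	 */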
^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread

Thread overview: 50+ messages
2018-11-29 12:50 [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil Yongji Xie
2018-11-29 13:12 ` Peter Zijlstra
2018-11-29 13:44   ` Peter Zijlstra
2018-11-29 14:02     ` Yongji Xie
2018-11-29 18:43     ` Davidlohr Bueso
2018-11-29 18:49       ` Waiman Long
2018-11-29 15:21   ` Waiman Long
2018-11-29 15:29     ` Waiman Long
2018-11-29 16:06     ` Peter Zijlstra
2018-11-29 17:02       ` Waiman Long
2018-11-29 17:27         ` Peter Zijlstra
2018-11-29 17:58           ` Waiman Long
2018-11-29 18:13             ` Peter Zijlstra
2018-11-29 18:17               ` Davidlohr Bueso
2018-11-29 18:08           ` Peter Zijlstra
2018-11-29 18:26             ` Waiman Long
2018-11-29 18:31               ` Will Deacon
2018-11-29 18:34                 ` Waiman Long
2018-11-29 22:05                   ` Peter Zijlstra
2018-11-30  9:34                     ` 答复: " Liu,Qi(ACU-T1)
2018-11-30 14:15                       ` Peter Zijlstra
2018-11-29 21:30               ` Davidlohr Bueso
2018-11-29 21:34                 ` Davidlohr Bueso
2018-11-29 22:17                   ` Peter Zijlstra
2018-11-30  9:30                     ` Andrea Parri
2018-12-03  5:31                     ` [PATCH -tip] kernel/sched,wake_q: Branch predict wake_q_add() cmpxchg Davidlohr Bueso
2018-12-03 16:10                       ` Waiman Long
2019-01-21 11:28                       ` [tip:locking/core] sched/wake_q: Add branch prediction hint to " tip-bot for Davidlohr Bueso
2018-12-10 15:12                     ` [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil Yongji Xie
2018-12-17 11:37                       ` Peter Zijlstra
2018-12-17 13:12                         ` Yongji Xie
2019-01-07 14:35                           ` Waiman Long
2019-01-07 15:31                             ` Peter Zijlstra
2019-01-07 15:35                               ` Waiman Long
2018-12-17 20:53                         ` Davidlohr Bueso
2018-12-18 13:10                           ` Peter Zijlstra
2018-12-18 13:14                             ` Peter Zijlstra
2018-12-18 17:27                               ` Davidlohr Bueso
2018-12-18 18:54                               ` [PATCH v2] sched/wake_q: Reduce reference counting for special users Davidlohr Bueso
2018-12-18 19:17                                 ` Waiman Long
2018-12-18 19:30                                   ` Davidlohr Bueso
2018-12-18 19:39                                     ` Davidlohr Bueso
2018-12-18 19:53                                       ` [PATCH v4] " Davidlohr Bueso
2018-12-18 20:35                                         ` Waiman Long
2019-01-21 16:02                                           ` Davidlohr Bueso
2019-01-22  8:55                                             ` Peter Zijlstra
2019-02-04  8:57                                         ` [tip:locking/core] " tip-bot for Davidlohr Bueso
2019-02-07 19:30                                           ` Davidlohr Bueso
2019-02-12 14:14                                           ` Daniel Vacek
2019-01-21 11:28 ` [tip:locking/core] locking/rwsem: Fix (possible) missed wakeup tip-bot for Xie Yongji
