linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched: Clear ttwu_pending after enqueue_task
@ 2022-11-01  7:36 Tianchen Ding
  2022-11-01 10:34 ` Peter Zijlstra
  2022-11-04  2:36 ` [PATCH v2] " Tianchen Ding
  0 siblings, 2 replies; 11+ messages in thread
From: Tianchen Ding @ 2022-11-01  7:36 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider
  Cc: linux-kernel

We found a long tail latency in schbench when m*t is close to nr_cpus.
(e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)

This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
too early, and idle_cpu() will return true until the wakee task is
enqueued. This will mislead the waker when selecting an idle cpu, and
wake multiple worker threads on the same wakee cpu. This situation is
amplified by commit f3dd3f674555 ("sched: Remove the limitation of
WF_ON_CPU on wakelist if wakee cpu is idle") because it tends to use
the wakelist.
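
For reference, idle_cpu() is what the waker's cpu selection relies on; a
simplified sketch (abridged from kernel/sched/core.c; exact details may
vary by kernel version):

int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (rq->curr != rq->idle)
		return 0;

	if (rq->nr_running)
		return 0;

#ifdef CONFIG_SMP
	if (rq->ttwu_pending)
		return 0;
#endif

	return 1;
}

In the window after ttwu_pending is cleared but before the wakee is
enqueued, all three checks pass, so the cpu looks idle even though a
wakeup is already in flight.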

Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
(Intel(R) Xeon(R) Platinum 8369B).

Latency percentiles (usec):
                base      base+revert_f3dd3f674555   base+this_patch
50.0000th:         9                            13                 9
75.0000th:        12                            19                12
90.0000th:        15                            22                15
95.0000th:        18                            24                17
*99.0000th:       27                            31                24
99.5000th:      3364                            33                27
99.9000th:     12560                            36                30

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
 kernel/sched/core.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 87c9cdf37a26..b07de1753be5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3739,13 +3739,6 @@ void sched_ttwu_pending(void *arg)
 	if (!llist)
 		return;
 
-	/*
-	 * rq::ttwu_pending racy indication of out-standing wakeups.
-	 * Races such that false-negatives are possible, since they
-	 * are shorter lived that false-positives would be.
-	 */
-	WRITE_ONCE(rq->ttwu_pending, 0);
-
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
@@ -3760,6 +3753,7 @@ void sched_ttwu_pending(void *arg)
 	}
 
 	rq_unlock_irqrestore(rq, &rf);
+	WRITE_ONCE(rq->ttwu_pending, 0);
 }
 
 void send_call_function_single_ipi(int cpu)
-- 
2.27.0



* Re: [PATCH] sched: Clear ttwu_pending after enqueue_task
  2022-11-01  7:36 [PATCH] sched: Clear ttwu_pending after enqueue_task Tianchen Ding
@ 2022-11-01 10:34 ` Peter Zijlstra
  2022-11-01 13:51   ` Chen Yu
  2022-11-02  6:40   ` Tianchen Ding
  2022-11-04  2:36 ` [PATCH v2] " Tianchen Ding
  1 sibling, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2022-11-01 10:34 UTC
  To: Tianchen Ding
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel

On Tue, Nov 01, 2022 at 03:36:30PM +0800, Tianchen Ding wrote:
> We found a long tail latency in schbench when m*t is close to nr_cpus.
> (e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)
> 
> This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
> too early, and idle_cpu() will return true until the wakee task is
> enqueued. This will mislead the waker when selecting an idle cpu, and
> wake multiple worker threads on the same wakee cpu. This situation is
> amplified by commit f3dd3f674555 ("sched: Remove the limitation of
> WF_ON_CPU on wakelist if wakee cpu is idle") because it tends to use
> the wakelist.
> 
> Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
> (Intel(R) Xeon(R) Platinum 8369B).
> 
> Latency percentiles (usec):
>                 base      base+revert_f3dd3f674555   base+this_patch
> 50.0000th:         9                            13                 9
> 75.0000th:        12                            19                12
> 90.0000th:        15                            22                15
> 95.0000th:        18                            24                17
> *99.0000th:       27                            31                24
> 99.5000th:      3364                            33                27
> 99.9000th:     12560                            36                30

Nice; but have you also run other benchmarks and confirmed it doesn't
negatively affect those?

If so; mentioning that is very helpful. If not; best go do so :-)

> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
> ---
>  kernel/sched/core.c | 8 +-------
>  1 file changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 87c9cdf37a26..b07de1753be5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3739,13 +3739,6 @@ void sched_ttwu_pending(void *arg)
>  	if (!llist)
>  		return;
>  
> -	/*
> -	 * rq::ttwu_pending racy indication of out-standing wakeups.
> -	 * Races such that false-negatives are possible, since they
> -	 * are shorter lived that false-positives would be.
> -	 */
> -	WRITE_ONCE(rq->ttwu_pending, 0);
> -
>  	rq_lock_irqsave(rq, &rf);
>  	update_rq_clock(rq);
>  

Could you try the below instead? Also note the comment; since you did
the work to figure out why -- best record that for posterity.

@@ -3737,6 +3730,13 @@ void sched_ttwu_pending(void *arg)
 			set_task_cpu(p, cpu_of(rq));
 
 		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
+		/*
+		 * Must be after enqueueing at least one task such that
+		 * idle_cpu() does not observe a false-negative -- if it does,
+		 * it is possible for select_idle_siblings() to stack a number
+		 * of tasks on this CPU during that window.
+		 */
+		WRITE_ONCE(rq->ttwu_pending, 0);
 	}
 
 	rq_unlock_irqrestore(rq, &rf);


* Re: [PATCH] sched: Clear ttwu_pending after enqueue_task
  2022-11-01 10:34 ` Peter Zijlstra
@ 2022-11-01 13:51   ` Chen Yu
  2022-11-01 14:59     ` Peter Zijlstra
  2022-11-02  6:40   ` Tianchen Ding
  1 sibling, 1 reply; 11+ messages in thread
From: Chen Yu @ 2022-11-01 13:51 UTC
  To: Peter Zijlstra
  Cc: Tianchen Ding, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel

On 2022-11-01 at 11:34:04 +0100, Peter Zijlstra wrote:
> On Tue, Nov 01, 2022 at 03:36:30PM +0800, Tianchen Ding wrote:
> > We found a long tail latency in schbench when m*t is close to nr_cpus.
> > (e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)
> > 
> > This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
> > too early, and idle_cpu() will return true until the wakee task is
> > enqueued. This will mislead the waker when selecting an idle cpu, and
> > wake multiple worker threads on the same wakee cpu. This situation is
> > amplified by commit f3dd3f674555 ("sched: Remove the limitation of
> > WF_ON_CPU on wakelist if wakee cpu is idle") because it tends to use
> > the wakelist.
> > 
> > Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
> > (Intel(R) Xeon(R) Platinum 8369B).
> > 
> > Latency percentiles (usec):
> >                 base      base+revert_f3dd3f674555   base+this_patch
> > 50.0000th:         9                            13                 9
> > 75.0000th:        12                            19                12
> > 90.0000th:        15                            22                15
> > 95.0000th:        18                            24                17
> > *99.0000th:       27                            31                24
> > 99.5000th:      3364                            33                27
> > 99.9000th:     12560                            36                30
> 
> Nice; but have you also run other benchmarks and confirmed it doesn't
> negatively affect those?
> 
> If so; mentioning that is very helpful. If not; best go do so :-)
> 
> > Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
> > ---
> >  kernel/sched/core.c | 8 +-------
> >  1 file changed, 1 insertion(+), 7 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 87c9cdf37a26..b07de1753be5 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3739,13 +3739,6 @@ void sched_ttwu_pending(void *arg)
> >  	if (!llist)
> >  		return;
> >  
> > -	/*
> > -	 * rq::ttwu_pending racy indication of out-standing wakeups.
> > -	 * Races such that false-negatives are possible, since they
> > -	 * are shorter lived that false-positives would be.
> > -	 */
> > -	WRITE_ONCE(rq->ttwu_pending, 0);
> > -
> >  	rq_lock_irqsave(rq, &rf);
> >  	update_rq_clock(rq);
> >  
> 
> Could you try the below instead? Also note the comment; since you did
> the work to figure out why -- best record that for posterity.
> 
> @@ -3737,6 +3730,13 @@ void sched_ttwu_pending(void *arg)
>  			set_task_cpu(p, cpu_of(rq));
>  
>  		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
> +		/*
> +		 * Must be after enqueueing at least one task such that
> +		 * idle_cpu() does not observe a false-negative -- if it does,
> +		 * it is possible for select_idle_siblings() to stack a number
> +		 * of tasks on this CPU during that window.
> +		 */
> +		WRITE_ONCE(rq->ttwu_pending, 0);
Just curious, why do we put the above code inside the llist_for_each_entry_safe() loop?
My understanding is that once one task is queued, select_idle_cpu() would not
treat this rq as idle anymore because nr_running is not 0. But would this bring
overhead by writing rq->ttwu_pending multiple times? Do I miss something?

thanks,
Chenyu
>  	}
>  
>  	rq_unlock_irqrestore(rq, &rf);


* Re: [PATCH] sched: Clear ttwu_pending after enqueue_task
  2022-11-01 13:51   ` Chen Yu
@ 2022-11-01 14:59     ` Peter Zijlstra
  2022-11-02  3:01       ` Chen Yu
  2022-11-02  6:40       ` Tianchen Ding
  0 siblings, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2022-11-01 14:59 UTC
  To: Chen Yu
  Cc: Tianchen Ding, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel

On Tue, Nov 01, 2022 at 09:51:25PM +0800, Chen Yu wrote:

> > Could you try the below instead? Also note the comment; since you did
> > the work to figure out why -- best record that for posterity.
> > 
> > @@ -3737,6 +3730,13 @@ void sched_ttwu_pending(void *arg)
> >  			set_task_cpu(p, cpu_of(rq));
> >  
> >  		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
> > +		/*
> > +		 * Must be after enqueueing at least one task such that
> > +		 * idle_cpu() does not observe a false-negative -- if it does,
> > +		 * it is possible for select_idle_siblings() to stack a number
> > +		 * of tasks on this CPU during that window.
> > +		 */
> > +		WRITE_ONCE(rq->ttwu_pending, 0);
> Just curious, why do we put the above code inside the llist_for_each_entry_safe() loop?

> My understanding is that once one task is queued, select_idle_cpu() would not
> treat this rq as idle anymore because nr_running is not 0. But would this bring
> overhead by writing rq->ttwu_pending multiple times? Do I miss something?

So the consideration is that by clearing it late, you might also clear a
next set; consider something like:


	cpu0			cpu1			cpu2

	ttwu_queue()
	  ->ttwu_pending = 1;
	  llist_add()

				sched_ttwu_pending()
				  llist_del_all()
				  ... long ...
							ttwu_queue()
							  ->ttwu_pending = 1
							  llist_add()

				  ... time ...
				  ->ttwu_pending = 0

Which leaves you with a non-empty list but with ttwu_pending == 0.
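
(For completeness, the producer side that the cpu0/cpu2 columns stand for
is __ttwu_queue_wakelist(); roughly, modulo kernel version:

static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
	struct rq *rq = cpu_rq(cpu);

	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);

	/* lockless store; nothing orders it against the remote clear */
	WRITE_ONCE(rq->ttwu_pending, 1);
	__smp_call_single_queue(cpu, &p->wake_entry.llist);
}

so the flag store and the llist_add() can race with sched_ttwu_pending()
on the target cpu exactly as sketched above.)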

But I suppose that's not actually better with my variant, since it keeps
writing 0s. We can make it more complicated again, but perhaps it
doesn't matter and your version is good enough.

But please update with a comment on why it needs to be after
ttwu_do_activate().



* Re: [PATCH] sched: Clear ttwu_pending after enqueue_task
  2022-11-01 14:59     ` Peter Zijlstra
@ 2022-11-02  3:01       ` Chen Yu
  2022-11-02  6:40       ` Tianchen Ding
  1 sibling, 0 replies; 11+ messages in thread
From: Chen Yu @ 2022-11-02  3:01 UTC
  To: Peter Zijlstra
  Cc: Tianchen Ding, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel

On 2022-11-01 at 15:59:25 +0100, Peter Zijlstra wrote:
> On Tue, Nov 01, 2022 at 09:51:25PM +0800, Chen Yu wrote:
> 
> > > Could you try the below instead? Also note the comment; since you did
> > > the work to figure out why -- best record that for posterity.
> > > 
> > > @@ -3737,6 +3730,13 @@ void sched_ttwu_pending(void *arg)
> > >  			set_task_cpu(p, cpu_of(rq));
> > >  
> > >  		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
> > > +		/*
> > > +		 * Must be after enqueueing at least one task such that
> > > +		 * idle_cpu() does not observe a false-negative -- if it does,
> > > +		 * it is possible for select_idle_siblings() to stack a number
> > > +		 * of tasks on this CPU during that window.
> > > +		 */
> > > +		WRITE_ONCE(rq->ttwu_pending, 0);
> > Just curious, why do we put the above code inside the llist_for_each_entry_safe() loop?
> 
> > My understanding is that once one task is queued, select_idle_cpu() would not
> > treat this rq as idle anymore because nr_running is not 0. But would this bring
> > overhead by writing rq->ttwu_pending multiple times? Do I miss something?
> 
> So the consideration is that by clearing it late, you might also clear a
> next set; consider something like:
> 
> 
> 	cpu0			cpu1			cpu2
> 
> 	ttwu_queue()
> 	  ->ttwu_pending = 1;
> 	  llist_add()
> 
> 				sched_ttwu_pending()
> 				  llist_del_all()
> 				  ... long ...
> 							ttwu_queue()
> 							  ->ttwu_pending = 1
> 							  llist_add()
> 
> 				  ... time ...
> 				  ->ttwu_pending = 0
> 
> Which leaves you with a non-empty list but with ttwu_pending == 0.
>
Thanks for the explanation. In theory the race window could
be shrunk but not closed, because ttwu_pending is not
protected by a lock in ttwu_queue() -> __ttwu_queue_wakelist(),
I suppose.
> But I suppose that's not actually better with my variant, since it keeps
> writing 0s. We can make it more complicated again, but perhaps it
> doesn't matter and your version is good enough.
I see, although I'm not the author of this patch :)

thanks,
Chenyu
> 
> But please update with a comment on why it needs to be after
> ttwu_do_activate().
> 


* Re: [PATCH] sched: Clear ttwu_pending after enqueue_task
  2022-11-01 10:34 ` Peter Zijlstra
  2022-11-01 13:51   ` Chen Yu
@ 2022-11-02  6:40   ` Tianchen Ding
  1 sibling, 0 replies; 11+ messages in thread
From: Tianchen Ding @ 2022-11-02  6:40 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel

On 2022/11/1 18:34, Peter Zijlstra wrote:
> On Tue, Nov 01, 2022 at 03:36:30PM +0800, Tianchen Ding wrote:
>> We found a long tail latency in schbench when m*t is close to nr_cpus.
>> (e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)
>>
>> This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
>> too early, and idle_cpu() will return true until the wakee task is
>> enqueued. This will mislead the waker when selecting an idle cpu, and
>> wake multiple worker threads on the same wakee cpu. This situation is
>> amplified by commit f3dd3f674555 ("sched: Remove the limitation of
>> WF_ON_CPU on wakelist if wakee cpu is idle") because it tends to use
>> the wakelist.
>>
>> Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
>> (Intel(R) Xeon(R) Platinum 8369B).
>>
>> Latency percentiles (usec):
>>                  base      base+revert_f3dd3f674555   base+this_patch
>> 50.0000th:         9                            13                 9
>> 75.0000th:        12                            19                12
>> 90.0000th:        15                            22                15
>> 95.0000th:        18                            24                17
>> *99.0000th:       27                            31                24
>> 99.5000th:      3364                            33                27
>> 99.9000th:     12560                            36                30
> 
> Nice; but have you also run other benchmarks and confirmed it doesn't
> negatively affect those?
> 
> If so; mentioning that is very helpful. If not; best go do so :-)
> 

Thanks for the review.

We've tested with unixbench and hackbench (they show the average scores), and
the performance results show no difference.
We didn't mention this because what we found is a specific case in schbench
(where m*t == nr_cpus). It only affects long tail latency, so the problem and
the fix should also take effect only in this case, not in the average scores.

>> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
>> ---
>>   kernel/sched/core.c | 8 +-------
>>   1 file changed, 1 insertion(+), 7 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 87c9cdf37a26..b07de1753be5 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3739,13 +3739,6 @@ void sched_ttwu_pending(void *arg)
>>   	if (!llist)
>>   		return;
>>   
>> -	/*
>> -	 * rq::ttwu_pending racy indication of out-standing wakeups.
>> -	 * Races such that false-negatives are possible, since they
>> -	 * are shorter lived that false-positives would be.
>> -	 */
>> -	WRITE_ONCE(rq->ttwu_pending, 0);
>> -
>>   	rq_lock_irqsave(rq, &rf);
>>   	update_rq_clock(rq);
>>   
> 
> Could you try the below instead? Also note the comment; since you did
> the work to figure out why -- best record that for posterity.
> 

It works well for me. But I have the same thought as Chen Yu, and will explain
in detail in my next reply.

Thanks.

> @@ -3737,6 +3730,13 @@ void sched_ttwu_pending(void *arg)
>   			set_task_cpu(p, cpu_of(rq));
>   
>   		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
> +		/*
> +		 * Must be after enqueueing at least one task such that
> +		 * idle_cpu() does not observe a false-negative -- if it does,
> +		 * it is possible for select_idle_siblings() to stack a number
> +		 * of tasks on this CPU during that window.
> +		 */
> +		WRITE_ONCE(rq->ttwu_pending, 0);
>   	}
>   
>   	rq_unlock_irqrestore(rq, &rf);



* Re: [PATCH] sched: Clear ttwu_pending after enqueue_task
  2022-11-01 14:59     ` Peter Zijlstra
  2022-11-02  3:01       ` Chen Yu
@ 2022-11-02  6:40       ` Tianchen Ding
  1 sibling, 0 replies; 11+ messages in thread
From: Tianchen Ding @ 2022-11-02  6:40 UTC
  To: Peter Zijlstra, Chen Yu
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel

On 2022/11/1 22:59, Peter Zijlstra wrote:
> On Tue, Nov 01, 2022 at 09:51:25PM +0800, Chen Yu wrote:
> 
>>> Could you try the below instead? Also note the comment; since you did
>>> the work to figure out why -- best record that for posterity.
>>>
>>> @@ -3737,6 +3730,13 @@ void sched_ttwu_pending(void *arg)
>>>   			set_task_cpu(p, cpu_of(rq));
>>>   
>>>   		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
>>> +		/*
>>> +		 * Must be after enqueueing at least one task such that
>>> +		 * idle_cpu() does not observe a false-negative -- if it does,
>>> +		 * it is possible for select_idle_siblings() to stack a number
>>> +		 * of tasks on this CPU during that window.
>>> +		 */
>>> +		WRITE_ONCE(rq->ttwu_pending, 0);
>> Just curious, why do we put the above code inside the llist_for_each_entry_safe() loop?
> 
>> My understanding is that once one task is queued, select_idle_cpu() would not
>> treat this rq as idle anymore because nr_running is not 0. But would this bring
>> overhead by writing rq->ttwu_pending multiple times? Do I miss something?
> 
> So the consideration is that by clearing it late, you might also clear a
> next set; consider something like:
> 
> 
> 	cpu0			cpu1			cpu2
> 
> 	ttwu_queue()
> 	  ->ttwu_pending = 1;
> 	  llist_add()
> 
> 				sched_ttwu_pending()
> 				  llist_del_all()
> 				  ... long ...
> 							ttwu_queue()
> 							  ->ttwu_pending = 1
> 							  llist_add()
> 
> 				  ... time ...
> 				  ->ttwu_pending = 0
> 
> Which leaves you with a non-empty list but with ttwu_pending == 0.
> 
> But I suppose that's not actually better with my variant, since it keeps
> writing 0s. We can make it more complicated again, but perhaps it
> doesn't matter and your version is good enough.
> 

Yeah. Since your version repeatedly writes 0 to ttwu_pending, it finally
reaches the same effect as mine. Although the performance results in my tests
show no difference, it may still bring more overhead.

IMO, according to the latest linux-next code, all callers querying
rq->ttwu_pending only care about whether the cpu is idle, because they always
combine it with querying nr_running. Actually no one cares about whether
wake_entry.llist is empty. So for the purpose of checking cpu idle state,
moving rq->ttwu_pending = 0 to after the tasks are enqueued fully covers the
whole window.

For your case, although ttwu_pending is set to 0 while some tasks are still
pending, at that time nr_running is sure to be > 0, so callers that query both
ttwu_pending and nr_running will know this cpu is not idle.
(Now the callers querying these two values are lockless, so there may be a
race in a really small window? But this case is extremely rare, and I think we
should not make it more complicated.)
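
(A sketch of the pattern I mean; cpu_looks_idle() is a made-up name that
just summarizes what those callers do:

/* waker side: both reads are lockless */
static bool cpu_looks_idle(struct rq *rq)
{
	if (rq->nr_running)			/* tasks already enqueued */
		return false;
	if (READ_ONCE(rq->ttwu_pending))	/* wakeups still in flight */
		return false;
	return true;
}

With the clear moved to after the enqueue loop, at least one of the two
values is nonzero at every instant of a wakeup, so the cpu is never seen
as idle while work is pending.)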

> But please update with a comment on why it needs to be after
> ttwu_do_activate().

OK. Should I send v2, or will you add the comment directly?

Thanks.



* [PATCH v2] sched: Clear ttwu_pending after enqueue_task
  2022-11-01  7:36 [PATCH] sched: Clear ttwu_pending after enqueue_task Tianchen Ding
  2022-11-01 10:34 ` Peter Zijlstra
@ 2022-11-04  2:36 ` Tianchen Ding
  2022-11-04  8:00   ` Chen Yu
                     ` (2 more replies)
  1 sibling, 3 replies; 11+ messages in thread
From: Tianchen Ding @ 2022-11-04  2:36 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider
  Cc: linux-kernel

We found a long tail latency in schbench when m*t is close to nr_cpus.
(e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)

This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
too early, and idle_cpu() will return true until the wakee task is
enqueued. This will mislead the waker when selecting an idle cpu, and
wake multiple worker threads on the same wakee cpu. This situation is
amplified by commit f3dd3f674555 ("sched: Remove the limitation of
WF_ON_CPU on wakelist if wakee cpu is idle") because it tends to use
the wakelist.

Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
(Intel(R) Xeon(R) Platinum 8369B).

Latency percentiles (usec):
                base      base+revert_f3dd3f674555   base+this_patch
50.0000th:         9                            13                 9
75.0000th:        12                            19                12
90.0000th:        15                            22                15
95.0000th:        18                            24                17
*99.0000th:       27                            31                24
99.5000th:      3364                            33                27
99.9000th:     12560                            36                30

We also tested on unixbench and hackbench, and saw no performance
change.

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
v2:
Update the commit log with other benchmark results.
Add a comment in the code.
Move the code to before rq_unlock. This makes ttwu_pending update a bit
earlier than in v1, so that it may reflect the real condition more
promptly.

v1: https://lore.kernel.org/all/20221101073630.2797-1-dtcccc@linux.alibaba.com/
---
 kernel/sched/core.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 87c9cdf37a26..7a04b5565389 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3739,13 +3739,6 @@ void sched_ttwu_pending(void *arg)
 	if (!llist)
 		return;
 
-	/*
-	 * rq::ttwu_pending racy indication of out-standing wakeups.
-	 * Races such that false-negatives are possible, since they
-	 * are shorter lived that false-positives would be.
-	 */
-	WRITE_ONCE(rq->ttwu_pending, 0);
-
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
@@ -3759,6 +3752,17 @@ void sched_ttwu_pending(void *arg)
 		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
 	}
 
+	/*
+	 * Must be after enqueueing at least one task such that
+	 * idle_cpu() does not observe a false-negative -- if it does,
+	 * it is possible for select_idle_siblings() to stack a number
+	 * of tasks on this CPU during that window.
+	 *
+	 * It is ok to clear ttwu_pending when another task is pending.
+	 * We will receive an IPI after local irq is enabled and then enqueue it.
+	 * Since now nr_running > 0, idle_cpu() will always get the correct result.
+	 */
+	WRITE_ONCE(rq->ttwu_pending, 0);
 	rq_unlock_irqrestore(rq, &rf);
 }
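
For clarity, with this applied the tail of sched_ttwu_pending() ends up
roughly as below -- a sketch assembled from the diff, not a verbatim copy
of any tree:

	llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
		if (WARN_ON_ONCE(p->on_cpu))
			smp_cond_load_acquire(&p->on_cpu, !VAL);

		if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
			set_task_cpu(p, cpu_of(rq));

		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
	}

	/* all wakees enqueued; idle_cpu() may now safely see ttwu_pending == 0 */
	WRITE_ONCE(rq->ttwu_pending, 0);
	rq_unlock_irqrestore(rq, &rf);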
 
-- 
2.27.0



* Re: [PATCH v2] sched: Clear ttwu_pending after enqueue_task
  2022-11-04  2:36 ` [PATCH v2] " Tianchen Ding
@ 2022-11-04  8:00   ` Chen Yu
  2022-11-14 15:27   ` Mel Gorman
  2022-11-16  9:22   ` [tip: sched/core] sched: Clear ttwu_pending after enqueue_task() tip-bot2 for Tianchen Ding
  2 siblings, 0 replies; 11+ messages in thread
From: Chen Yu @ 2022-11-04  8:00 UTC
  To: Tianchen Ding
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel

On 2022-11-04 at 10:36:01 +0800, Tianchen Ding wrote:
> We found a long tail latency in schbench when m*t is close to nr_cpus.
> (e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)
> 
> This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
> too early, and idle_cpu() will return true until the wakee task is
> enqueued. This will mislead the waker when selecting an idle cpu, and
> wake multiple worker threads on the same wakee cpu. This situation is
> amplified by commit f3dd3f674555 ("sched: Remove the limitation of
> WF_ON_CPU on wakelist if wakee cpu is idle") because it tends to use
> the wakelist.
> 
> Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
> (Intel(R) Xeon(R) Platinum 8369B).
> 
> Latency percentiles (usec):
>                 base      base+revert_f3dd3f674555   base+this_patch
> 50.0000th:         9                            13                 9
> 75.0000th:        12                            19                12
> 90.0000th:        15                            22                15
> 95.0000th:        18                            24                17
> *99.0000th:       27                            31                24
> 99.5000th:      3364                            33                27
> 99.9000th:     12560                            36                30
> 
> We also tested on unixbench and hackbench, and saw no performance
> change.
> 
> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
> ---
> v2:
> Update the commit log with other benchmark results.
> Add a comment in the code.
> Move the code to before rq_unlock. This makes ttwu_pending update a bit
> earlier than in v1, so that it may reflect the real condition more
> promptly.
> 
> v1: https://lore.kernel.org/all/20221101073630.2797-1-dtcccc@linux.alibaba.com/
> ---
>  kernel/sched/core.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 87c9cdf37a26..7a04b5565389 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3739,13 +3739,6 @@ void sched_ttwu_pending(void *arg)
>  	if (!llist)
>  		return;
>  
> -	/*
> -	 * rq::ttwu_pending racy indication of out-standing wakeups.
> -	 * Races such that false-negatives are possible, since they
> -	 * are shorter lived that false-positives would be.
> -	 */
> -	WRITE_ONCE(rq->ttwu_pending, 0);
> -
>  	rq_lock_irqsave(rq, &rf);
>  	update_rq_clock(rq);
>  
> @@ -3759,6 +3752,17 @@ void sched_ttwu_pending(void *arg)
>  		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
>  	}
>  
> +	/*
> +	 * Must be after enqueueing at least one task such that
> +	 * idle_cpu() does not observe a false-negative -- if it does,
> +	 * it is possible for select_idle_siblings() to stack a number
> +	 * of tasks on this CPU during that window.
> +	 *
> +	 * It is ok to clear ttwu_pending when another task is pending.
> +	 * We will receive an IPI after local irq is enabled and then enqueue it.
> +	 * Since now nr_running > 0, idle_cpu() will always get the correct result.
> +	 */
> +	WRITE_ONCE(rq->ttwu_pending, 0);
>  	rq_unlock_irqrestore(rq, &rf);
>  }
>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>

thanks,
Chenyu
  


* Re: [PATCH v2] sched: Clear ttwu_pending after enqueue_task
  2022-11-04  2:36 ` [PATCH v2] " Tianchen Ding
  2022-11-04  8:00   ` Chen Yu
@ 2022-11-14 15:27   ` Mel Gorman
  2022-11-16  9:22   ` [tip: sched/core] sched: Clear ttwu_pending after enqueue_task() tip-bot2 for Tianchen Ding
  2 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2022-11-14 15:27 UTC
  To: Tianchen Ding
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel

On Fri, Nov 04, 2022 at 10:36:01AM +0800, Tianchen Ding wrote:
> We found a long tail latency in schbench when m*t is close to nr_cpus.
> (e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)
> 
> This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
> too early, and idle_cpu() will return true until the wakee task is
> enqueued. This will mislead the waker when selecting an idle cpu, and
> wake multiple worker threads on the same wakee cpu. This situation is
> amplified by commit f3dd3f674555 ("sched: Remove the limitation of
> WF_ON_CPU on wakelist if wakee cpu is idle") because it tends to use
> the wakelist.
> 
> Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
> (Intel(R) Xeon(R) Platinum 8369B).
> 
> Latency percentiles (usec):
>                 base      base+revert_f3dd3f674555   base+this_patch
> 50.0000th:         9                            13                 9
> 75.0000th:        12                            19                12
> 90.0000th:        15                            22                15
> 95.0000th:        18                            24                17
> *99.0000th:       27                            31                24
> 99.5000th:      3364                            33                27
> 99.9000th:     12560                            36                30
> 
> We also tested on unixbench and hackbench, and saw no performance
> change.
> 
> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>

I tested this on bare metal across a range of machines. The impact of the
patch is nowhere near as obvious as it is on a VM but even then, schbench
generally benefits (not by as much and not always at all percentiles). The
only workload that appeared to suffer was specjbb2015 but *only* on NUMA
machines, on UMA it was fine and the benchmark can be a little flaky for
getting stable results anyway. In the few cases where it showed a problem,
the NUMA balancing behaviour was also different so I think it can be ignored.

In most cases it was better than vanilla and better than a revert or at
least made marginal differences that were borderline noise. However, avoiding
stacking tasks due to false positives is also important because even though
that can help performance in some cases (strictly sync wakeups), it's not
necessarily a good idea. So while it's not a universal win, it wins more
than it loses and it appears to be more clearly a win on VMs so on that basis

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


* [tip: sched/core] sched: Clear ttwu_pending after enqueue_task()
  2022-11-04  2:36 ` [PATCH v2] " Tianchen Ding
  2022-11-04  8:00   ` Chen Yu
  2022-11-14 15:27   ` Mel Gorman
@ 2022-11-16  9:22   ` tip-bot2 for Tianchen Ding
  2 siblings, 0 replies; 11+ messages in thread
From: tip-bot2 for Tianchen Ding @ 2022-11-16  9:22 UTC
  To: linux-tip-commits
  Cc: Tianchen Ding, Peter Zijlstra (Intel), Mel Gorman, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     d6962c4fe8f96f7d384d6489b6b5ab5bf3e35991
Gitweb:        https://git.kernel.org/tip/d6962c4fe8f96f7d384d6489b6b5ab5bf3e35991
Author:        Tianchen Ding <dtcccc@linux.alibaba.com>
AuthorDate:    Fri, 04 Nov 2022 10:36:01 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 16 Nov 2022 10:13:05 +01:00

sched: Clear ttwu_pending after enqueue_task()

We found a long tail latency in schbench when m*t is close to nr_cpus.
(e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)

This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
too early, and idle_cpu() will return true until the wakee task is
enqueued. This will mislead the waker when selecting an idle cpu, and
wake multiple worker threads on the same wakee cpu. This situation is
amplified by commit f3dd3f674555 ("sched: Remove the limitation of
WF_ON_CPU on wakelist if wakee cpu is idle") because it tends to use
the wakelist.

Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
(Intel(R) Xeon(R) Platinum 8369B).

Latency percentiles (usec):
                base      base+revert_f3dd3f674555   base+this_patch
50.0000th:         9                            13                 9
75.0000th:        12                            19                12
90.0000th:        15                            22                15
95.0000th:        18                            24                17
*99.0000th:       27                            31                24
99.5000th:      3364                            33                27
99.9000th:     12560                            36                30

We also tested on unixbench and hackbench, and saw no performance
change.

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20221104023601.12844-1-dtcccc@linux.alibaba.com
---
 kernel/sched/core.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 07ac08c..314c2c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3739,13 +3739,6 @@ void sched_ttwu_pending(void *arg)
 	if (!llist)
 		return;
 
-	/*
-	 * rq::ttwu_pending racy indication of out-standing wakeups.
-	 * Races such that false-negatives are possible, since they
-	 * are shorter lived that false-positives would be.
-	 */
-	WRITE_ONCE(rq->ttwu_pending, 0);
-
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
@@ -3759,6 +3752,17 @@ void sched_ttwu_pending(void *arg)
 		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
 	}
 
+	/*
+	 * Must be after enqueueing at least one task such that
+	 * idle_cpu() does not observe a false-negative -- if it does,
+	 * it is possible for select_idle_siblings() to stack a number
+	 * of tasks on this CPU during that window.
+	 *
+	 * It is ok to clear ttwu_pending when another task is pending.
+	 * We will receive an IPI after local irq is enabled and then enqueue it.
+	 * Since now nr_running > 0, idle_cpu() will always get the correct result.
+	 */
+	WRITE_ONCE(rq->ttwu_pending, 0);
 	rq_unlock_irqrestore(rq, &rf);
 }
 


Thread overview: 11+ messages
2022-11-01  7:36 [PATCH] sched: Clear ttwu_pending after enqueue_task Tianchen Ding
2022-11-01 10:34 ` Peter Zijlstra
2022-11-01 13:51   ` Chen Yu
2022-11-01 14:59     ` Peter Zijlstra
2022-11-02  3:01       ` Chen Yu
2022-11-02  6:40       ` Tianchen Ding
2022-11-02  6:40   ` Tianchen Ding
2022-11-04  2:36 ` [PATCH v2] " Tianchen Ding
2022-11-04  8:00   ` Chen Yu
2022-11-14 15:27   ` Mel Gorman
2022-11-16  9:22   ` [tip: sched/core] sched: Clear ttwu_pending after enqueue_task() tip-bot2 for Tianchen Ding
