linux-kernel.vger.kernel.org archive mirror
* [PATCH v2] sched: Queue task on wakelist in the same llc if the wakee cpu is idle
@ 2022-05-27  9:05 Tianchen Ding
  2022-05-30 16:24 ` Valentin Schneider
  0 siblings, 1 reply; 10+ messages in thread
From: Tianchen Ding @ 2022-05-27  9:05 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider
  Cc: linux-kernel

The main idea of the wakelist is to avoid cache bouncing. However,
commit 518cd6234178 ("sched: Only queue remote wakeups when
crossing cache boundaries") disabled queuing tasks on the wakelist when
the cpus share llc. This is because, at that time, the scheduler had to
send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
supports TIF_POLLING, so this is no longer a problem when the wakee cpu
is idle polling.

Benefits:
  Queuing the task on an idle cpu helps improve performance on the waker
  cpu and utilization on the wakee cpu, and further improves locality
  because the wakee cpu handles its own rq. This patch helps improve rt
  on our real java workloads where wakeups happen frequently.

  Consider the normal condition (CPU0 and CPU1 share same llc)
  Before this patch:

         CPU0                                       CPU1

    select_task_rq()                                idle
    rq_lock(CPU1->rq)
    enqueue_task(CPU1->rq)
    notify CPU1 (by sending IPI or CPU1 polling)

                                                    resched()

  After this patch:

         CPU0                                       CPU1

    select_task_rq()                                idle
    add to wakelist of CPU1
    notify CPU1 (by sending IPI or CPU1 polling)

                                                    rq_lock(CPU1->rq)
                                                    enqueue_task(CPU1->rq)
                                                    resched()

  We see that CPU0 can finish its work earlier: it only needs to put the
  task on the wakelist and return.
  CPU1 is idle, so let it handle its own runqueue data.

This patch makes no difference in terms of IPIs.
  It only takes effect when the wakee cpu is:
  1) idle polling
  2) idle not polling

  For 1), there will be no IPI with or without this patch.

  For 2), there will always be an IPI, with or without this patch.
  Before this patch: the waker cpu enqueues the task and checks preemption.
  Since "idle" is sure to be preempted, the waker cpu must send a resched
  IPI.
  After this patch: the waker cpu puts the task on the wakelist of the
  wakee cpu and sends an IPI.
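
  For reference, the no-IPI behaviour in case 1) comes from the polling
  support in the wakelist path. A simplified sketch, based on
  send_call_function_single_ipi() in kernel/sched/core.c (details may
  differ across kernel versions):

	void send_call_function_single_ipi(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		/*
		 * If the remote cpu is idle polling (TIF_POLLING_NRFLAG),
		 * setting TIF_NEED_RESCHED on its idle task is enough: the
		 * idle loop notices the queued wakelist entry and resched
		 * request without taking an interrupt.
		 */
		if (!set_nr_if_polling(rq->idle))
			arch_send_call_function_single_ipi(cpu);
		else
			trace_sched_wake_idle_without_ipi(cpu);
	}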

Benchmark:
We've tested schbench, unixbench, and hackbench on both x86 and arm64.

On x86 (Intel Xeon Platinum 8269CY):
  schbench -m 2 -t 8

    Latency percentiles (usec)             before         after
        50.0000th:                            8             6
        75.0000th:                           10             7
        90.0000th:                           11             8
        95.0000th:                           12             8
        *99.0000th:                          15            10
        99.5000th:                           16            11
        99.9000th:                           20            14

  Unixbench with full threads (104)
                                            before        after
    Dhrystone 2 using register variables  3004614211    3004725417   0.00%
    Double-Precision Whetstone              616764.3      617355.9   0.10%
    Execl Throughput                         26449.2       26468.6   0.07%
    File Copy 1024 bufsize 2000 maxblocks   832763.3      824099.4  -1.04%
    File Copy 256 bufsize 500 maxblocks     210718.7      211775.1   0.50%
    File Copy 4096 bufsize 8000 maxblocks  2393528.2     2398755.4   0.22%
    Pipe Throughput                      144559102.7   144605068.8   0.03%
    Pipe-based Context Switching           3192192.9     3571238.1  11.87%
    Process Creation                         95270.5       98865.4   3.77%
    Shell Scripts (1 concurrent)            113780.6      113924.7   0.13%
    Shell Scripts (8 concurrent)             15557.2       15508.9  -0.31%
    System Call Overhead                   5359984.1     5356711.4  -0.06%

  hackbench -g 1 -l 100000
                                            before        after
    Time                                     3.246        2.251

On arm64 (Ampere Altra):
  schbench -m 2 -t 8

    Latency percentiles (usec)             before         after
        50.0000th:                           14            10
        75.0000th:                           19            14
        90.0000th:                           22            16
        95.0000th:                           23            16
        *99.0000th:                          24            17
        99.5000th:                           24            17
        99.9000th:                           31            25

  Unixbench with full threads (80)
                                            before        after
    Dhrystone 2 using register variables  3536787968    3536476016  -0.01%
    Double-Precision Whetstone              629370.6      629333.3  -0.01%
    Execl Throughput                         66615.9       66288.8  -0.49%
    File Copy 1024 bufsize 2000 maxblocks  1038402.1     1050181.2   1.13%
    File Copy 256 bufsize 500 maxblocks     311054.2      310317.2  -0.24%
    File Copy 4096 bufsize 8000 maxblocks  2276795.6       2297703   0.92%
    Pipe Throughput                      130409359.9   130390848.7  -0.01%
    Pipe-based Context Switching           3148440.7     3383705.1   7.47%
    Process Creation                        111574.3      119728.6   7.31%
    Shell Scripts (1 concurrent)            122980.7      122657.4  -0.26%
    Shell Scripts (8 concurrent)             17482.8       17476.8  -0.03%
    System Call Overhead                   4424103.4     4430062.6   0.13%

  hackbench -g 1 -l 100000
                                            before        after
    Time                                     4.217        2.916

Our patch improves schbench, hackbench and the Pipe-based Context
Switching test of unixbench when there are idle cpus, and shows no
obvious regression on the other unixbench tests. This can help improve
rt in scenarios where wakeups happen frequently.

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
v2:
Modify the commit log to describe the key points in detail.
Add more benchmark results on more archs.

v1: https://lore.kernel.org/all/20220513062427.2375743-1-dtcccc@linux.alibaba.com/

---
 kernel/sched/core.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bfa7452ca92e..8764ad152f6e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3817,6 +3817,9 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 	if (!cpu_active(cpu))
 		return false;
 
+	if (cpu == smp_processor_id())
+		return false;
+
 	/*
 	 * If the CPU does not share cache, then queue the task on the
 	 * remote rqs wakelist to avoid accessing remote data.
@@ -3824,6 +3827,12 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 	if (!cpus_share_cache(smp_processor_id(), cpu))
 		return true;
 
+	/*
+	 * If the CPU is idle, let itself do activation to improve utilization.
+	 */
+	if (available_idle_cpu(cpu))
+		return true;
+
 	/*
 	 * If the task is descheduling and the only running task on the
 	 * CPU then use the wakelist to offload the task activation to
@@ -3839,9 +3848,6 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
 	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
-		if (WARN_ON_ONCE(cpu == smp_processor_id()))
-			return false;
-
 		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
 		__ttwu_queue_wakelist(p, cpu, wake_flags);
 		return true;
-- 
2.27.0


* Re: [PATCH v2] sched: Queue task on wakelist in the same llc if the wakee cpu is idle
  2022-05-27  9:05 [PATCH v2] sched: Queue task on wakelist in the same llc if the wakee cpu is idle Tianchen Ding
@ 2022-05-30 16:24 ` Valentin Schneider
  2022-05-31  7:20   ` Tianchen Ding
  0 siblings, 1 reply; 10+ messages in thread
From: Valentin Schneider @ 2022-05-30 16:24 UTC (permalink / raw)
  To: Tianchen Ding, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira
  Cc: linux-kernel

On 27/05/22 17:05, Tianchen Ding wrote:
> The main idea of wakelist is to avoid cache bouncing. However,
> commit 518cd6234178 ("sched: Only queue remote wakeups when
> crossing cache boundaries") disabled queuing tasks on wakelist when
> the cpus share llc. This is because, at that time, the scheduler must
> send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
> supports TIF_POLLING, so this is not a problem now when the wakee cpu is
> in idle polling.

[...]

> Our patch has improvement on schbench, hackbench
> and Pipe-based Context Switching of unixbench
> when there exists idle cpus,
> and no obvious regression on other tests of unixbench.
> This can help improve rt in scenes where wakeup happens frequently.
>
> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>

This feels a bit like a generalization of

  2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling")

Given rq->curr is updated before prev->on_cpu is cleared, the waker
executing ttwu_queue_cond() can observe:

  p->on_rq=0
  p->on_cpu=1
  rq->curr=swapper/x (aka idle task)
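
That ordering comes from __schedule() on prev's CPU; roughly, and
simplified rather than exact:

  deactivate_task(rq, prev);        /* prev->on_rq = 0 */
  next = pick_next_task(rq, prev);  /* may well be the idle task */
  rq->curr = next;
  context_switch(rq, prev, next);
    ...
    finish_task(prev);
      smp_store_release(&prev->on_cpu, 0);

so a waker that still observes p->on_cpu == 1 may already see
rq->curr == idle.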

So your addition of available_idle_cpu() in ttwu_queue_cond() (sort of)
matches that when invoked via:

        if (smp_load_acquire(&p->on_cpu) &&
            ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
                goto unlock;

but it also affects

        ttwu_queue(p, cpu, wake_flags);

at the tail end of try_to_wake_up().

With all that in mind, I'm curious whether your patch is functionally
close to the below.

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 66c4e5922fe1..ffd43264722a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3836,7 +3836,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 	 * the soon-to-be-idle CPU as the current CPU is likely busy.
 	 * nr_running is checked to avoid unnecessary task stacking.
 	 */
-	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
+	if (cpu_rq(cpu)->nr_running <= 1)
 		return true;
 
 	return false;


* Re: [PATCH v2] sched: Queue task on wakelist in the same llc if the wakee cpu is idle
  2022-05-30 16:24 ` Valentin Schneider
@ 2022-05-31  7:20   ` Tianchen Ding
  2022-05-31 11:50     ` Valentin Schneider
  0 siblings, 1 reply; 10+ messages in thread
From: Tianchen Ding @ 2022-05-31  7:20 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel

On 2022/5/31 00:24, Valentin Schneider wrote:
> On 27/05/22 17:05, Tianchen Ding wrote:
>> The main idea of wakelist is to avoid cache bouncing. However,
>> commit 518cd6234178 ("sched: Only queue remote wakeups when
>> crossing cache boundaries") disabled queuing tasks on wakelist when
>> the cpus share llc. This is because, at that time, the scheduler must
>> send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
>> supports TIF_POLLING, so this is not a problem now when the wakee cpu is
>> in idle polling.
> 
> [...]
> 
>> Our patch has improvement on schbench, hackbench
>> and Pipe-based Context Switching of unixbench
>> when there exists idle cpus,
>> and no obvious regression on other tests of unixbench.
>> This can help improve rt in scenes where wakeup happens frequently.
>>
>> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
> 
> This feels a bit like a generalization of
> 
>    2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling")
> 
> Given rq->curr is updated before prev->on_cpu is cleared, the waker
> executing ttwu_queue_cond() can observe:
> 
>    p->on_rq=0
>    p->on_cpu=1
>    rq->curr=swapper/x (aka idle task)
> 
> So your addition of available_idle_cpu() in ttwu_queue_cond() (sort of)
> matches that when invoked via:
> 
>          if (smp_load_acquire(&p->on_cpu) &&
>              ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
>                  goto unlock;
> 
> but it also affects
> 
>          ttwu_queue(p, cpu, wake_flags);
> 
> at the tail end of try_to_wake_up().

Yes. This part is what we mainly want to affect. The WF_ON_CPU path
above is not our main point.

> 
> With all that in mind, I'm curious whether your patch is functionaly close
> to the below.
> 
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 66c4e5922fe1..ffd43264722a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3836,7 +3836,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
>   	 * the soon-to-be-idle CPU as the current CPU is likely busy.
>   	 * nr_running is checked to avoid unnecessary task stacking.
>   	 */
> -	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
> +	if (cpu_rq(cpu)->nr_running <= 1)
>   		return true;
>   
>   	return false;

It's a little different. This may bring extra IPIs when nr_running == 1
and the current task on the wakee cpu is not the target wakeup task
(i.e., rq->curr == another_task && rq->curr != p). Then another_task may
be disturbed by an IPI, which is not expected. So IMO the guarantee
provided by WF_ON_CPU is necessary.
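
For example (a hypothetical scenario just to illustrate the concern):

  CPU1: rq->curr == another_task, rq->nr_running == 1, not idle polling
  CPU0: try_to_wake_up(p)
          ttwu_queue(p, CPU1, wake_flags)
            ttwu_queue_cond(): nr_running <= 1, so take the wakelist path
            __ttwu_queue_wakelist() -> IPI to CPU1

another_task is always interrupted, while with the direct enqueue an IPI
is only needed if p should preempt another_task.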

* Re: [PATCH v2] sched: Queue task on wakelist in the same llc if the wakee cpu is idle
  2022-05-31  7:20   ` Tianchen Ding
@ 2022-05-31 11:50     ` Valentin Schneider
  2022-05-31 13:55       ` Mel Gorman
  0 siblings, 1 reply; 10+ messages in thread
From: Valentin Schneider @ 2022-05-31 11:50 UTC (permalink / raw)
  To: Tianchen Ding
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel

On 31/05/22 15:20, Tianchen Ding wrote:
> On 2022/5/31 00:24, Valentin Schneider wrote:
>> 
>> This feels a bit like a generalization of
>> 
>>    2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling")
>> 
>> Given rq->curr is updated before prev->on_cpu is cleared, the waker
>> executing ttwu_queue_cond() can observe:
>> 
>>    p->on_rq=0
>>    p->on_cpu=1
>>    rq->curr=swapper/x (aka idle task)
>> 
>> So your addition of available_idle_cpu() in ttwu_queue_cond() (sort of)
>> matches that when invoked via:
>> 
>>          if (smp_load_acquire(&p->on_cpu) &&
>>              ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
>>                  goto unlock;
>> 
>> but it also affects
>> 
>>          ttwu_queue(p, cpu, wake_flags);
>> 
>> at the tail end of try_to_wake_up().
>
> Yes. This part is what we mainly want to affect. The above WF_ON_CPU is 
> not our point.
>

Right.

>> 
>> With all that in mind, I'm curious whether your patch is functionaly close
>> to the below.
>> 
>> ---
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 66c4e5922fe1..ffd43264722a 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3836,7 +3836,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
>>   	 * the soon-to-be-idle CPU as the current CPU is likely busy.
>>   	 * nr_running is checked to avoid unnecessary task stacking.
>>   	 */
>> -	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
>> +	if (cpu_rq(cpu)->nr_running <= 1)
>>   		return true;
>>   
>>   	return false;
>
> It's a little different. This may bring extra IPIs when nr_running == 1 
> and the current task on wakee cpu is not the target wakeup task (i.e., 
> rq->curr == another_task && rq->curr != p). Then this another_task may 
> be disturbed by IPI which is not expected. So IMO the promise by 
> WF_ON_CPU is necessary.

You're right. Actually, taking a second look at that WF_ON_CPU path,
shouldn't the existing condition be:

	if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)

? Per the p->on_rq and p->on_cpu ordering, if we have WF_ON_CPU here then
we must have !p->on_rq, so the deactivate has happened, thus the task
being alone on the rq implies nr_running==0.

@Mel, do you remember why you went for <=1 here? I couldn't find any
clues in the original posting.


* Re: [PATCH v2] sched: Queue task on wakelist in the same llc if the wakee cpu is idle
  2022-05-31 11:50     ` Valentin Schneider
@ 2022-05-31 13:55       ` Mel Gorman
  2022-05-31 15:38         ` Tianchen Ding
  2022-05-31 15:56         ` Valentin Schneider
  0 siblings, 2 replies; 10+ messages in thread
From: Mel Gorman @ 2022-05-31 13:55 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Tianchen Ding, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel

On Tue, May 31, 2022 at 12:50:49PM +0100, Valentin Schneider wrote:
> >> With all that in mind, I'm curious whether your patch is functionaly close
> >> to the below.
> >> 
> >> ---
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 66c4e5922fe1..ffd43264722a 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -3836,7 +3836,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
> >>   	 * the soon-to-be-idle CPU as the current CPU is likely busy.
> >>   	 * nr_running is checked to avoid unnecessary task stacking.
> >>   	 */
> >> -	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
> >> +	if (cpu_rq(cpu)->nr_running <= 1)
> >>   		return true;
> >>   
> >>   	return false;
> >
> > It's a little different. This may bring extra IPIs when nr_running == 1 
> > and the current task on wakee cpu is not the target wakeup task (i.e., 
> > rq->curr == another_task && rq->curr != p). Then this another_task may 
> > be disturbed by IPI which is not expected. So IMO the promise by 
> > WF_ON_CPU is necessary.
> 
> You're right, actually taking a second look at that WF_ON_CPU path,
> shouldn't the existing condition be:
> 
> 	if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
> 
> ? Per the p->on_rq and p->on_cpu ordering, if we have WF_ON_CPU here then
> we must have !p->on_rq, so the deactivate has happened, thus the task
> being alone on the rq implies nr_running==0.
> 
> @Mel, do you remember why you went for <=1 here? I couldn't find any clues
> on the original posting.
> 

I don't recall exactly why I went with <= 1 there, but I may not have
considered the memory ordering of on_rq and nr_running; the comment
above it is literally what I was thinking at the time. I think you're
right and that check can be !cpu_rq(cpu)->nr_running.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH v2] sched: Queue task on wakelist in the same llc if the wakee cpu is idle
  2022-05-31 13:55       ` Mel Gorman
@ 2022-05-31 15:38         ` Tianchen Ding
  2022-05-31 15:56         ` Valentin Schneider
  1 sibling, 0 replies; 10+ messages in thread
From: Tianchen Ding @ 2022-05-31 15:38 UTC (permalink / raw)
  To: Mel Gorman, Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel

On 2022/5/31 21:55, Mel Gorman wrote:
> On Tue, May 31, 2022 at 12:50:49PM +0100, Valentin Schneider wrote:
>>>> With all that in mind, I'm curious whether your patch is functionaly close
>>>> to the below.
>>>>
>>>> ---
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 66c4e5922fe1..ffd43264722a 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -3836,7 +3836,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
>>>>    	 * the soon-to-be-idle CPU as the current CPU is likely busy.
>>>>    	 * nr_running is checked to avoid unnecessary task stacking.
>>>>    	 */
>>>> -	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
>>>> +	if (cpu_rq(cpu)->nr_running <= 1)
>>>>    		return true;
>>>>    
>>>>    	return false;
>>>
>>> It's a little different. This may bring extra IPIs when nr_running == 1
>>> and the current task on wakee cpu is not the target wakeup task (i.e.,
>>> rq->curr == another_task && rq->curr != p). Then this another_task may
>>> be disturbed by IPI which is not expected. So IMO the promise by
>>> WF_ON_CPU is necessary.
>>
>> You're right, actually taking a second look at that WF_ON_CPU path,
>> shouldn't the existing condition be:
>>
>> 	if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
>>
>> ? Per the p->on_rq and p->on_cpu ordering, if we have WF_ON_CPU here then
>> we must have !p->on_rq, so the deactivate has happened, thus the task
>> being alone on the rq implies nr_running==0.
>>
>> @Mel, do you remember why you went for <=1 here? I couldn't find any clues
>> on the original posting.
>>
> 
> I don't recall exactly why I went with <= 1 there but I may not have
> considered the memory ordering of on_rq and nr_running and the comment
> above it is literally what I was thinking at the time. I think you're
> right and that check can be !cpu_rq(cpu)->nr_running.
> 

If the check becomes !cpu_rq(cpu)->nr_running, my patch would change,
too. Shall we remove WF_ON_CPU completely?


* Re: [PATCH v2] sched: Queue task on wakelist in the same llc if the wakee cpu is idle
  2022-05-31 13:55       ` Mel Gorman
  2022-05-31 15:38         ` Tianchen Ding
@ 2022-05-31 15:56         ` Valentin Schneider
  2022-06-01  5:54           ` Tianchen Ding
  1 sibling, 1 reply; 10+ messages in thread
From: Valentin Schneider @ 2022-05-31 15:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Tianchen Ding, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel

On 31/05/22 14:55, Mel Gorman wrote:
> On Tue, May 31, 2022 at 12:50:49PM +0100, Valentin Schneider wrote:
>> >> With all that in mind, I'm curious whether your patch is functionaly close
>> >> to the below.
>> >> 
>> >> ---
>> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> >> index 66c4e5922fe1..ffd43264722a 100644
>> >> --- a/kernel/sched/core.c
>> >> +++ b/kernel/sched/core.c
>> >> @@ -3836,7 +3836,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
>> >>   	 * the soon-to-be-idle CPU as the current CPU is likely busy.
>> >>   	 * nr_running is checked to avoid unnecessary task stacking.
>> >>   	 */
>> >> -	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
>> >> +	if (cpu_rq(cpu)->nr_running <= 1)
>> >>   		return true;
>> >>   
>> >>   	return false;
>> >
>> > It's a little different. This may bring extra IPIs when nr_running == 1 
>> > and the current task on wakee cpu is not the target wakeup task (i.e., 
>> > rq->curr == another_task && rq->curr != p). Then this another_task may 
>> > be disturbed by IPI which is not expected. So IMO the promise by 
>> > WF_ON_CPU is necessary.
>> 
>> You're right, actually taking a second look at that WF_ON_CPU path,
>> shouldn't the existing condition be:
>> 
>> 	if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
>> 
>> ? Per the p->on_rq and p->on_cpu ordering, if we have WF_ON_CPU here then
>> we must have !p->on_rq, so the deactivate has happened, thus the task
>> being alone on the rq implies nr_running==0.
>> 
>> @Mel, do you remember why you went for <=1 here? I couldn't find any clues
>> on the original posting.
>> 
>
> I don't recall exactly why I went with <= 1 there but I may not have
> considered the memory ordering of on_rq and nr_running and the comment
> above it is literally what I was thinking at the time. I think you're
> right and that check can be !cpu_rq(cpu)->nr_running.
>

Thanks!

So I'm thinking we could first make that into

	if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)

Then building on this, we can generalize using the wakelist to any remote
idle CPU (which on paper isn't as clear a win as just WF_ON_CPU,
depending on how deeply idle the CPU is...)

We need the cpu != this_cpu check, as that's currently served by the
WF_ON_CPU check (AFAIU we can only observe p->on_cpu in there for remote
tasks).

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 66c4e5922fe1..60038743f2f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3830,13 +3830,20 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 	if (!cpus_share_cache(smp_processor_id(), cpu))
 		return true;
 
+	if (cpu == smp_processor_id())
+		return false;
+
 	/*
 	 * If the task is descheduling and the only running task on the
 	 * CPU then use the wakelist to offload the task activation to
 	 * the soon-to-be-idle CPU as the current CPU is likely busy.
 	 * nr_running is checked to avoid unnecessary task stacking.
+	 *
+	 * Note that we can only get here with (wakee) p->on_rq=0,
+	 * p->on_cpu can be whatever, we've done the dequeue, so
+	 * the wakee has been accounted out of ->nr_running
 	 */
-	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
+	if (!cpu_rq(cpu)->nr_running)
 		return true;
 
 	return false;


* Re: [PATCH v2] sched: Queue task on wakelist in the same llc if the wakee cpu is idle
  2022-05-31 15:56         ` Valentin Schneider
@ 2022-06-01  5:54           ` Tianchen Ding
  2022-06-01 10:58             ` Valentin Schneider
  0 siblings, 1 reply; 10+ messages in thread
From: Tianchen Ding @ 2022-06-01  5:54 UTC (permalink / raw)
  To: Valentin Schneider, Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel

On 2022/5/31 23:56, Valentin Schneider wrote:

> Thanks!
> 
> So I'm thinking we could first make that into
> 
> 	if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
> 
> Then building on this, we can generalize using the wakelist to any remote
> idle CPU (which on paper isn't as much as a clear win as just WF_ON_CPU,
> depending on how deeply idle the CPU is...)
> 
> We need the cpu != this_cpu check, as that's currently served by the
> WF_ON_CPU check (AFAIU we can only observe p->on_cpu in there for remote
> tasks).
> 
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 66c4e5922fe1..60038743f2f1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3830,13 +3830,20 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
>   	if (!cpus_share_cache(smp_processor_id(), cpu))
>   		return true;
>   
> +	if (cpu == smp_processor_id())
> +		return false;
> +
>   	/*
>   	 * If the task is descheduling and the only running task on the
>   	 * CPU then use the wakelist to offload the task activation to
>   	 * the soon-to-be-idle CPU as the current CPU is likely busy.
>   	 * nr_running is checked to avoid unnecessary task stacking.
> +	 *
> +	 * Note that we can only get here with (wakee) p->on_rq=0,
> +	 * p->on_cpu can be whatever, we've done the dequeue, so
> +	 * the wakee has been accounted out of ->nr_running
>   	 */
> -	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
> +	if (!cpu_rq(cpu)->nr_running)
>   		return true;
>   
>   	return false;

Hi Valentin. I've done a simple unixbench test (Pipe-based Context 
Switching) on my x86 machine with full threads (104).

              old            patch1           patch1+patch2
score       7825.4     7500(more)-8000          9061.6

patch1: use !cpu_rq(cpu)->nr_running instead of cpu_rq(cpu)->nr_running <= 1
patch2: ignore WF_ON_CPU check

The score of patch1 is not stable. I've tested many times and the score
floats between about 7500 and 8000 (mostly around 7500).

patch1 means a stricter limit on using the wakelist, but it may cause a
performance regression.

It seems that using the wakelist properly can help improve wakeup
performance, but using it too much may cause more IPIs. It's a trade-off
in how strict ttwu_queue_cond() is.

Anyhow, I think patch2 should be a pure improvement. What's your idea?

* Re: [PATCH v2] sched: Queue task on wakelist in the same llc if the wakee cpu is idle
  2022-06-01  5:54           ` Tianchen Ding
@ 2022-06-01 10:58             ` Valentin Schneider
  2022-06-01 12:02               ` Tianchen Ding
  0 siblings, 1 reply; 10+ messages in thread
From: Valentin Schneider @ 2022-06-01 10:58 UTC (permalink / raw)
  To: Tianchen Ding, Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel

On 01/06/22 13:54, Tianchen Ding wrote:
> On 2022/5/31 23:56, Valentin Schneider wrote:
>
>> Thanks!
>> 
>> So I'm thinking we could first make that into
>> 
>> 	if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
>> 
>> Then building on this, we can generalize using the wakelist to any remote
>> idle CPU (which on paper isn't as much as a clear win as just WF_ON_CPU,
>> depending on how deeply idle the CPU is...)
>> 
>> We need the cpu != this_cpu check, as that's currently served by the
>> WF_ON_CPU check (AFAIU we can only observe p->on_cpu in there for remote
>> tasks).
>> 
>> ---
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 66c4e5922fe1..60038743f2f1 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3830,13 +3830,20 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
>>   	if (!cpus_share_cache(smp_processor_id(), cpu))
>>   		return true;
>>   
>> +	if (cpu == smp_processor_id())
>> +		return false;
>> +
>>   	/*
>>   	 * If the task is descheduling and the only running task on the
>>   	 * CPU then use the wakelist to offload the task activation to
>>   	 * the soon-to-be-idle CPU as the current CPU is likely busy.
>>   	 * nr_running is checked to avoid unnecessary task stacking.
>> +	 *
>> +	 * Note that we can only get here with (wakee) p->on_rq=0,
>> +	 * p->on_cpu can be whatever, we've done the dequeue, so
>> +	 * the wakee has been accounted out of ->nr_running
>>   	 */
>> -	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
>> +	if (!cpu_rq(cpu)->nr_running)
>>   		return true;
>>   
>>   	return false;
>
> Hi Valentin. I've done a simple unixbench test (Pipe-based Context 
> Switching) on my x86 machine with full threads (104).
>
>               old            patch1           patch1+patch2
> score       7825.4     7500(more)-8000          9061.6
>
> patch1: use !cpu_rq(cpu)->nr_running instead of cpu_rq(cpu)->nr_running <= 1
> patch2: ignore WF_ON_CPU check
>
> The score of patch1 is not stable. I've tested for many times and the 
> score is floating between about 7500-8000 (more at 7500).
>
> patch1 means more strict limit on using wakelist. But it may cause 
> performance regression.
>
> It seems that, using wakelist properly can help improve wakeup 
> performance, but using it too much may cause more IPIs. It's a trade-off 
> about how strict the ttwu_queue_cond() is.
>
> Anyhow, I think patch2 should be a pure improvement. What's your idea?

Thanks for separately testing these two.

I take it the results for patch1 are noticeably more swingy than the
baseline? (FWIW boxplots are usually a nice way to summarize that sort
of result.)

WF_ON_CPU && nr_running == 1 means the wakee is scheduling out *and* there
is another task queued; I'm guessing that's relatively common in your
unixbench scenario...

Either way, I think we want to keep the two changes separate for the sake
of testing and bisecting.


* Re: [PATCH v2] sched: Queue task on wakelist in the same llc if the wakee cpu is idle
  2022-06-01 10:58             ` Valentin Schneider
@ 2022-06-01 12:02               ` Tianchen Ding
  0 siblings, 0 replies; 10+ messages in thread
From: Tianchen Ding @ 2022-06-01 12:02 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Mel Gorman, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, linux-kernel

On 2022/6/1 18:58, Valentin Schneider wrote:
> On 01/06/22 13:54, Tianchen Ding wrote:
>> On 2022/5/31 23:56, Valentin Schneider wrote:
>>
>>> Thanks!
>>>
>>> So I'm thinking we could first make that into
>>>
>>> 	if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
>>>
>>> Then building on this, we can generalize using the wakelist to any remote
>>> idle CPU (which on paper isn't as much as a clear win as just WF_ON_CPU,
>>> depending on how deeply idle the CPU is...)
>>>
>>> We need the cpu != this_cpu check, as that's currently served by the
>>> WF_ON_CPU check (AFAIU we can only observe p->on_cpu in there for remote
>>> tasks).
>>>
>>> ---
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 66c4e5922fe1..60038743f2f1 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -3830,13 +3830,20 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
>>>    	if (!cpus_share_cache(smp_processor_id(), cpu))
>>>    		return true;
>>>    
>>> +	if (cpu == smp_processor_id())
>>> +		return false;
>>> +
>>>    	/*
>>>    	 * If the task is descheduling and the only running task on the
>>>    	 * CPU then use the wakelist to offload the task activation to
>>>    	 * the soon-to-be-idle CPU as the current CPU is likely busy.
>>>    	 * nr_running is checked to avoid unnecessary task stacking.
>>> +	 *
>>> +	 * Note that we can only get here with (wakee) p->on_rq=0,
>>> +	 * p->on_cpu can be whatever, we've done the dequeue, so
>>> +	 * the wakee has been accounted out of ->nr_running
>>>    	 */
>>> -	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
>>> +	if (!cpu_rq(cpu)->nr_running)
>>>    		return true;
>>>    
>>>    	return false;
>>
>> Hi Valentin. I've done a simple unixbench test (Pipe-based Context
>> Switching) on my x86 machine with full threads (104).
>>
>>                old            patch1           patch1+patch2
>> score       7825.4     7500(more)-8000          9061.6
>>
>> patch1: use !cpu_rq(cpu)->nr_running instead of cpu_rq(cpu)->nr_running <= 1
>> patch2: ignore WF_ON_CPU check
>>
>> The score of patch1 is not stable. I've tested for many times and the
>> score is floating between about 7500-8000 (more at 7500).
>>
>> patch1 means more strict limit on using wakelist. But it may cause
>> performance regression.
>>
>> It seems that, using wakelist properly can help improve wakeup
>> performance, but using it too much may cause more IPIs. It's a trade-off
>> about how strict the ttwu_queue_cond() is.
>>
>> Anyhow, I think patch2 should be a pure improvement. What's your idea?
> 
> Thanks for separately testing these two.
> 
> I take it the results for patch1 are noticeably more swingy than the
> baseline? (FWIW boxplots are usually a nice way to summarize those sort of
> results).
> 

Hmm... I'm not familiar with this...
I want to say that I'm not sure about the performance impact of patch1,
but from a logical point of view patch1 should be correct.

> WF_ON_CPU && nr_running == 1 means the wakee is scheduling out *and* there
> is another task queued, I'm guessing that's relatively common in your
> unixbench scenario...
> 
> Either way, I think we want to keep the two changes separate for the sake
> of testing and bisecting.

Yes. I'll split the patch into 2 parts: one for the logic fix and another
for the performance improvement.

