From: Tianchen Ding <dtcccc@linux.alibaba.com>
To: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 2/2] sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle
Date: Thu, 9 Jun 2022 07:30:54 +0800	[thread overview]
Message-ID: <3f87bc5c-1611-8e8d-0ab1-288b516530b2@linux.alibaba.com> (raw)
In-Reply-To: <20220608163518.324276-3-dtcccc@linux.alibaba.com>

On 2022/6/9 00:35, Tianchen Ding wrote:
> Wakelist can help avoid cache bouncing and offload work from the waker
> cpu. So far, using the wakelist within the same llc happens only for
> WF_ON_CPU; this limitation can be removed to further improve wakeup
> performance.
> 
> Commit 518cd6234178 ("sched: Only queue remote wakeups when
> crossing cache boundaries") disabled queuing tasks on the wakelist when
> the cpus share an llc. This was because, at that time, the scheduler had
> to send an IPI to do ttwu_queue_wakelist. Nowadays ttwu_queue_wakelist
> also supports TIF_POLLING, so this is no longer a problem when the
> wakee cpu is idle polling.
> 
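The TIF_POLLING case can be sketched in simplified userspace C (illustrative names and flags, not the kernel's actual thread_info code): the waker atomically sets a "wake up" bit, and only falls back to an IPI if the wakee is not polling.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified model of a wakee cpu's idle-polling state.  In the kernel
 * the polling bit is TIF_POLLING_NRFLAG in thread_info->flags; here we
 * model just the two bits this discussion cares about. */
#define FLAG_POLLING   (1u << 0)  /* wakee spins in idle, watching flags */
#define FLAG_NEED_WAKE (1u << 1)  /* stand-in for TIF_NEED_RESCHED */

struct cpu_model {
    _Atomic unsigned int flags;
};

/* Mirror of the set_nr_if_polling() idea: atomically set the wake bit,
 * but only if the wakee is still polling.  Returns true when the
 * polling wakee will notice the bit by itself, so no IPI is needed;
 * false means the caller must send an IPI. */
static bool set_wake_if_polling(struct cpu_model *cpu)
{
    unsigned int old = atomic_load(&cpu->flags);
    for (;;) {
        if (!(old & FLAG_POLLING))
            return false;                 /* not polling: IPI required */
        if (atomic_compare_exchange_weak(&cpu->flags, &old,
                                         old | FLAG_NEED_WAKE))
            return true;                  /* wakee will see the bit */
    }
}
```

With this, queuing on the wakelist of an idle-polling cpu never needs an IPI, which is why the old "must send IPIs" argument no longer holds.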
> Benefits:
>    Queuing the task on the idle cpu improves performance on the waker
>    cpu and utilization on the wakee cpu, and further improves locality
>    because the wakee cpu handles its own rq. This patch helps improve
>    rt on our real Java workloads, where wakeups happen frequently.
> 
>    Consider the normal condition (CPU0 and CPU1 share same llc)
>    Before this patch:
> 
>           CPU0                                       CPU1
> 
>      select_task_rq()                                idle
>      rq_lock(CPU1->rq)
>      enqueue_task(CPU1->rq)
>      notify CPU1 (by sending IPI or CPU1 polling)
> 
>                                                      resched()
> 
>    After this patch:
> 
>           CPU0                                       CPU1
> 
>      select_task_rq()                                idle
>      add to wakelist of CPU1
>      notify CPU1 (by sending IPI or CPU1 polling)
> 
>                                                      rq_lock(CPU1->rq)
>                                                      enqueue_task(CPU1->rq)
>                                                      resched()
> 
>    CPU0 can finish its work earlier: it only needs to put the task on
>    the wakelist and return. Since CPU1 is idle, it handles its own
>    runqueue data itself.
> 
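The handoff in the "after" diagram rests on the wakelist being a lock-free list the waker can push to without touching the wakee's rq lock. A minimal sketch, modeled loosely on the kernel's llist (names here are illustrative, not the kernel's):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal model of a lock-free wake list: the waker pushes with a
 * single compare-and-swap and returns; the wakee later detaches the
 * whole list in one exchange and enqueues the tasks under its own
 * rq lock. */
struct wake_node {
    struct wake_node *next;
};

struct wake_list {
    _Atomic(struct wake_node *) first;
};

/* Waker side: O(1) push, no rq lock taken.  Returns true if the list
 * was empty before, i.e. the wakee still needs to be notified. */
static bool wakelist_add(struct wake_list *wl, struct wake_node *n)
{
    struct wake_node *head = atomic_load(&wl->first);
    do {
        n->next = head;
    } while (!atomic_compare_exchange_weak(&wl->first, &head, n));
    return head == NULL;
}

/* Wakee side: grab all pending entries in one atomic exchange, then
 * process them while holding only the local rq lock. */
static struct wake_node *wakelist_del_all(struct wake_list *wl)
{
    return atomic_exchange(&wl->first, NULL);
}
```

This is why CPU0's work in the diagram shrinks to "add to wakelist, notify, return": the rq_lock/enqueue pair moves entirely to CPU1.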
> This patch makes no difference in the number of IPIs.
>    It only takes effect when the wakee cpu is:
>    1) idle polling
>    2) idle not polling
> 
>    For 1), there is no IPI with or without this patch.
> 
>    For 2), there is always exactly one IPI, before and after this patch.
>    Before this patch: the waker cpu enqueues the task and checks
>    preempt. Since "idle" is sure to be preempted, the waker cpu must
>    send a resched IPI.
>    After this patch: the waker cpu puts the task on the wakelist of the
>    wakee cpu and sends an IPI.
> 
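The IPI parity argument above can be condensed into a trivial decision model (purely illustrative, not kernel code): in both worlds the only thing that decides whether an IPI is sent is whether the idle wakee is polling.

```c
#include <stdbool.h>

/* IPI count for waking an idle cpu in the same llc.
 * Before the patch: the waker enqueues directly and, since idle is
 * always preempted, sends a resched IPI unless the wakee is polling.
 * After the patch: the waker queues on the wakelist and sends an IPI
 * unless the wakee is polling.  The count is identical either way. */
static int ipis_before_patch(bool wakee_polling)
{
    return wakee_polling ? 0 : 1;   /* resched IPI from preempt check */
}

static int ipis_after_patch(bool wakee_polling)
{
    return wakee_polling ? 0 : 1;   /* wakelist IPI from ttwu_queue */
}
```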
> Benchmark:
> We've tested schbench, unixbench, and hackbench on both x86 and arm64.
> 
> On x86 (Intel Xeon Platinum 8269CY):
>    schbench -m 2 -t 8
> 
>      Latency percentiles (usec)             before         after
>          50.0000th:                            8             6
>          75.0000th:                           10             7
>          90.0000th:                           11             8
>          95.0000th:                           12             8
>          *99.0000th:                          13            10
>          99.5000th:                           15            11
>          99.9000th:                           18            14
> 
>    Unixbench with full threads (104)
>                                              before        after
>      Dhrystone 2 using register variables  3011862938    3009935994  -0.06%
>      Double-Precision Whetstone              617119.3      617298.5   0.03%
>      Execl Throughput                         27667.3       27627.3  -0.14%
>      File Copy 1024 bufsize 2000 maxblocks   785871.4      784906.2  -0.12%
>      File Copy 256 bufsize 500 maxblocks     210113.6      212635.4   1.20%
>      File Copy 4096 bufsize 8000 maxblocks  2328862.2     2320529.1  -0.36%
>      Pipe Throughput                      145535622.8   145323033.2  -0.15%
>      Pipe-based Context Switching           3221686.4     3583975.4  11.25%
>      Process Creation                        101347.1      103345.4   1.97%
>      Shell Scripts (1 concurrent)            120193.5      123977.8   3.15%
>      Shell Scripts (8 concurrent)             17233.4       17138.4  -0.55%
>      System Call Overhead                   5300604.8     5312213.6   0.22%
> 
>    hackbench -g 1 -l 100000
>                                              before        after
>      Time                                     3.246        2.251
> 
> On arm64 (Ampere Altra):
>    schbench -m 2 -t 8
> 
>      Latency percentiles (usec)             before         after
>          50.0000th:                           14            10
>          75.0000th:                           19            14
>          90.0000th:                           22            16
>          95.0000th:                           23            16
>          *99.0000th:                          24            17
>          99.5000th:                           24            17
>          99.9000th:                           28            25
> 
>    Unixbench with full threads (80)
>                                              before        after
>      Dhrystone 2 using register variables  3536194249    3536476016  -0.01%
>      Double-Precision Whetstone              629383.6      629333.3  -0.01%
>      Execl Throughput                         65920.5       66288.8  -0.49%
>      File Copy 1024 bufsize 2000 maxblocks  1038402.1     1050181.2   1.13%
>      File Copy 256 bufsize 500 maxblocks     311054.2      310317.2  -0.24%
>      File Copy 4096 bufsize 8000 maxblocks  2276795.6       2297703   0.92%
>      Pipe Throughput                      130409359.9   130390848.7  -0.01%
>      Pipe-based Context Switching           3148440.7     3383705.1   7.47%
>      Process Creation                        111574.3      119728.6   7.31%
>      Shell Scripts (1 concurrent)            122980.7      122657.4  -0.26%
>      Shell Scripts (8 concurrent)             17482.8       17476.8  -0.03%
>      System Call Overhead                   4424103.4     4430062.6   0.13%
> 
>      Dhrystone 2 using register variables  3536194249    3537019613   0.02%
>      Double-Precision Whetstone              629383.6      629431.6   0.01%
>      Execl Throughput                         65920.5       65846.2  -0.11%
>      File Copy 1024 bufsize 2000 maxblocks  1063722.8     1064026.8   0.03%
>      File Copy 256 bufsize 500 maxblocks     322684.5      318724.5  -1.23%
>      File Copy 4096 bufsize 8000 maxblocks  2348285.3     2328804.8  -0.83%
>      Pipe Throughput                      133542875.3   131619389.8  -1.44%
>      Pipe-based Context Switching           3215356.1     3576945.1  11.25%
>      Process Creation                        108520.5      120184.6  10.75%
>      Shell Scripts (1 concurrent)            122636.3        121888  -0.61%
>      Shell Scripts (8 concurrent)             17462.1       17381.4  -0.46%
>      System Call Overhead                   4429998.9     4435006.7   0.11%

Oops... I forgot to remove the previous result.
Let me resend one.


Thread overview: 6+ messages
2022-06-08 16:35 [PATCH v4 0/2] sched: Queue task on wakelist in the same llc if the wakee cpu is idle Tianchen Ding
2022-06-08 16:35 ` [PATCH v4 1/2] sched: Fix the check of nr_running at queue wakelist Tianchen Ding
2022-06-08 16:35 ` [PATCH v4 2/2] sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle Tianchen Ding
2022-06-08 23:30   ` Tianchen Ding [this message]
2022-06-19 14:34   ` [sched] 5e442b260e: aim7.jobs-per-min 9.3% improvement kernel test robot
2022-06-19 14:34     ` kernel test robot
