From: Tianchen Ding <dtcccc@linux.alibaba.com>
To: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 2/2] sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle
Date: Thu, 9 Jun 2022 07:30:54 +0800	[thread overview]
Message-ID: <3f87bc5c-1611-8e8d-0ab1-288b516530b2@linux.alibaba.com> (raw)
In-Reply-To: <20220608163518.324276-3-dtcccc@linux.alibaba.com>

On 2022/6/9 00:35, Tianchen Ding wrote:
> Wakelist can help avoid cache bouncing and offload work from the waker
> cpu. So far, using the wakelist within the same llc happens only for
> WF_ON_CPU; this limitation can be removed to further improve wakeup
> performance.
> 
> Commit 518cd6234178 ("sched: Only queue remote wakeups when
> crossing cache boundaries") disabled queuing tasks on the wakelist when
> the cpus share an llc. This was because, at that time, the scheduler had
> to send an IPI to do ttwu_queue_wakelist. Nowadays ttwu_queue_wakelist
> also supports TIF_POLLING, so this is no longer a problem when the
> wakee cpu is idle polling.
> 
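The TIF_POLLING case can be sketched in simplified userspace C (illustrative names and flags, not the kernel's actual thread_info code): the waker atomically sets a "wake up" bit, and only falls back to an IPI if the wakee is not polling.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified model of a wakee cpu's idle-polling state.  In the kernel
 * the polling bit is TIF_POLLING_NRFLAG in thread_info->flags; here we
 * model just the two bits this discussion cares about. */
#define FLAG_POLLING   (1u << 0)  /* wakee spins in idle, watching flags */
#define FLAG_NEED_WAKE (1u << 1)  /* stand-in for TIF_NEED_RESCHED */

struct cpu_model {
    _Atomic unsigned int flags;
};

/* Mirror of the set_nr_if_polling() idea: atomically set the wake bit,
 * but only if the wakee is still polling.  Returns true when the
 * polling wakee will notice the bit by itself, so no IPI is needed;
 * false means the caller must send an IPI. */
static bool set_wake_if_polling(struct cpu_model *cpu)
{
    unsigned int old = atomic_load(&cpu->flags);
    for (;;) {
        if (!(old & FLAG_POLLING))
            return false;                 /* not polling: IPI required */
        if (atomic_compare_exchange_weak(&cpu->flags, &old,
                                         old | FLAG_NEED_WAKE))
            return true;                  /* wakee will see the bit */
    }
}
```

With this, queuing on the wakelist of an idle-polling cpu never needs an IPI, which is why the old "must send IPIs" argument no longer holds.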
> Benefits:
>    Queuing the task on the idle cpu improves performance on the waker
>    cpu and utilization on the wakee cpu, and further improves locality
>    because the wakee cpu handles its own rq. This patch helps improve
>    rt on our real Java workloads, where wakeups happen frequently.
> 
>    Consider the normal condition (CPU0 and CPU1 share same llc)
>    Before this patch:
> 
>           CPU0                                       CPU1
> 
>      select_task_rq()                                idle
>      rq_lock(CPU1->rq)
>      enqueue_task(CPU1->rq)
>      notify CPU1 (by sending IPI or CPU1 polling)
> 
>                                                      resched()
> 
>    After this patch:
> 
>           CPU0                                       CPU1
> 
>      select_task_rq()                                idle
>      add to wakelist of CPU1
>      notify CPU1 (by sending IPI or CPU1 polling)
> 
>                                                      rq_lock(CPU1->rq)
>                                                      enqueue_task(CPU1->rq)
>                                                      resched()
> 
>    CPU0 can finish its work earlier: it only needs to put the task on
>    the wakelist and return. Since CPU1 is idle, it handles its own
>    runqueue data itself.
> 
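The handoff in the "after" diagram rests on the wakelist being a lock-free list the waker can push to without touching the wakee's rq lock. A minimal sketch, modeled loosely on the kernel's llist (names here are illustrative, not the kernel's):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal model of a lock-free wake list: the waker pushes with a
 * single compare-and-swap and returns; the wakee later detaches the
 * whole list in one exchange and enqueues the tasks under its own
 * rq lock. */
struct wake_node {
    struct wake_node *next;
};

struct wake_list {
    _Atomic(struct wake_node *) first;
};

/* Waker side: O(1) push, no rq lock taken.  Returns true if the list
 * was empty before, i.e. the wakee still needs to be notified. */
static bool wakelist_add(struct wake_list *wl, struct wake_node *n)
{
    struct wake_node *head = atomic_load(&wl->first);
    do {
        n->next = head;
    } while (!atomic_compare_exchange_weak(&wl->first, &head, n));
    return head == NULL;
}

/* Wakee side: grab all pending entries in one atomic exchange, then
 * process them while holding only the local rq lock. */
static struct wake_node *wakelist_del_all(struct wake_list *wl)
{
    return atomic_exchange(&wl->first, NULL);
}
```

This is why CPU0's work in the diagram shrinks to "add to wakelist, notify, return": the rq_lock/enqueue pair moves entirely to CPU1.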
> This patch makes no difference in the number of IPIs.
>    It only takes effect when the wakee cpu is:
>    1) idle polling
>    2) idle not polling
> 
>    For 1), there is no IPI with or without this patch.
> 
>    For 2), there is always exactly one IPI, before and after this patch.
>    Before this patch: the waker cpu enqueues the task and checks
>    preempt. Since "idle" is sure to be preempted, the waker cpu must
>    send a resched IPI.
>    After this patch: the waker cpu puts the task on the wakelist of the
>    wakee cpu and sends an IPI.
> 
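The IPI parity argument above can be condensed into a trivial decision model (purely illustrative, not kernel code): in both worlds the only thing that decides whether an IPI is sent is whether the idle wakee is polling.

```c
#include <stdbool.h>

/* IPI count for waking an idle cpu in the same llc.
 * Before the patch: the waker enqueues directly and, since idle is
 * always preempted, sends a resched IPI unless the wakee is polling.
 * After the patch: the waker queues on the wakelist and sends an IPI
 * unless the wakee is polling.  The count is identical either way. */
static int ipis_before_patch(bool wakee_polling)
{
    return wakee_polling ? 0 : 1;   /* resched IPI from preempt check */
}

static int ipis_after_patch(bool wakee_polling)
{
    return wakee_polling ? 0 : 1;   /* wakelist IPI from ttwu_queue */
}
```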
> Benchmark:
> We've tested schbench, unixbench, and hackbench on both x86 and arm64.
> 
> On x86 (Intel Xeon Platinum 8269CY):
>    schbench -m 2 -t 8
> 
>      Latency percentiles (usec)             before         after
>          50.0000th:                            8             6
>          75.0000th:                           10             7
>          90.0000th:                           11             8
>          95.0000th:                           12             8
>          *99.0000th:                          13            10
>          99.5000th:                           15            11
>          99.9000th:                           18            14
> 
>    Unixbench with full threads (104)
>                                              before        after
>      Dhrystone 2 using register variables  3011862938    3009935994  -0.06%
>      Double-Precision Whetstone              617119.3      617298.5   0.03%
>      Execl Throughput                         27667.3       27627.3  -0.14%
>      File Copy 1024 bufsize 2000 maxblocks   785871.4      784906.2  -0.12%
>      File Copy 256 bufsize 500 maxblocks     210113.6      212635.4   1.20%
>      File Copy 4096 bufsize 8000 maxblocks  2328862.2     2320529.1  -0.36%
>      Pipe Throughput                      145535622.8   145323033.2  -0.15%
>      Pipe-based Context Switching           3221686.4     3583975.4  11.25%
>      Process Creation                        101347.1      103345.4   1.97%
>      Shell Scripts (1 concurrent)            120193.5      123977.8   3.15%
>      Shell Scripts (8 concurrent)             17233.4       17138.4  -0.55%
>      System Call Overhead                   5300604.8     5312213.6   0.22%
> 
>    hackbench -g 1 -l 100000
>                                              before        after
>      Time                                     3.246        2.251
> 
> On arm64 (Ampere Altra):
>    schbench -m 2 -t 8
> 
>      Latency percentiles (usec)             before         after
>          50.0000th:                           14            10
>          75.0000th:                           19            14
>          90.0000th:                           22            16
>          95.0000th:                           23            16
>          *99.0000th:                          24            17
>          99.5000th:                           24            17
>          99.9000th:                           28            25
> 
>    Unixbench with full threads (80)
>                                              before        after
>      Dhrystone 2 using register variables  3536194249    3536476016  -0.01%
>      Double-Precision Whetstone              629383.6      629333.3  -0.01%
>      Execl Throughput                         65920.5       66288.8  -0.49%
>      File Copy 1024 bufsize 2000 maxblocks  1038402.1     1050181.2   1.13%
>      File Copy 256 bufsize 500 maxblocks     311054.2      310317.2  -0.24%
>      File Copy 4096 bufsize 8000 maxblocks  2276795.6       2297703   0.92%
>      Pipe Throughput                      130409359.9   130390848.7  -0.01%
>      Pipe-based Context Switching           3148440.7     3383705.1   7.47%
>      Process Creation                        111574.3      119728.6   7.31%
>      Shell Scripts (1 concurrent)            122980.7      122657.4  -0.26%
>      Shell Scripts (8 concurrent)             17482.8       17476.8  -0.03%
>      System Call Overhead                   4424103.4     4430062.6   0.13%
> 
>      Dhrystone 2 using register variables  3536194249    3537019613   0.02%
>      Double-Precision Whetstone              629383.6      629431.6   0.01%
>      Execl Throughput                         65920.5       65846.2  -0.11%
>      File Copy 1024 bufsize 2000 maxblocks  1063722.8     1064026.8   0.03%
>      File Copy 256 bufsize 500 maxblocks     322684.5      318724.5  -1.23%
>      File Copy 4096 bufsize 8000 maxblocks  2348285.3     2328804.8  -0.83%
>      Pipe Throughput                      133542875.3   131619389.8  -1.44%
>      Pipe-based Context Switching           3215356.1     3576945.1  11.25%
>      Process Creation                        108520.5      120184.6  10.75%
>      Shell Scripts (1 concurrent)            122636.3        121888  -0.61%
>      Shell Scripts (8 concurrent)             17462.1       17381.4  -0.46%
>      System Call Overhead                   4429998.9     4435006.7   0.11%

Oops... I forgot to remove the previous result.
Let me resend one.


Thread overview: 6+ messages
2022-06-08 16:35 [PATCH v4 0/2] sched: Queue task on wakelist in the same llc if the wakee cpu is idle Tianchen Ding
2022-06-08 16:35 ` [PATCH v4 1/2] sched: Fix the check of nr_running at queue wakelist Tianchen Ding
2022-06-08 16:35 ` [PATCH v4 2/2] sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle Tianchen Ding
2022-06-08 23:30   ` Tianchen Ding [this message]
2022-06-19 14:34   ` [sched] 5e442b260e: aim7.jobs-per-min 9.3% improvement kernel test robot
2022-06-19 14:34     ` kernel test robot
