Message-ID: <3f87bc5c-1611-8e8d-0ab1-288b516530b2@linux.alibaba.com>
Date: Thu, 9 Jun 2022 07:30:54 +0800
Subject: Re: [PATCH v4 2/2] sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle
From: Tianchen Ding
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider
Cc: linux-kernel@vger.kernel.org
References: <20220608163518.324276-1-dtcccc@linux.alibaba.com> <20220608163518.324276-3-dtcccc@linux.alibaba.com>
In-Reply-To: <20220608163518.324276-3-dtcccc@linux.alibaba.com>

On 2022/6/9 00:35, Tianchen Ding wrote:
> Wakelist can help avoid cache bouncing and offload the overhead of waker
> cpu. So far, using wakelist within the same llc only happens on
> WF_ON_CPU, and this limitation could be removed to further improve
> wakeup performance.
>
> The commit 518cd6234178 ("sched: Only queue remote wakeups when
> crossing cache boundaries") disabled queuing tasks on wakelist when
> the cpus share llc. This is because, at that time, the scheduler had to
> send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
> supports TIF_POLLING, so this is no longer a problem when the wakee cpu
> is in idle polling.
>
> Benefits:
> Queuing the task on idle cpu can help improve performance on waker cpu
> and utilization on wakee cpu, and further improve locality because
> the wakee cpu can handle its own rq. This patch helps improve rt on
> our real java workloads where wakeup happens frequently.
>
> Consider the normal condition (CPU0 and CPU1 share same llc)
> Before this patch:
>
>          CPU0                                       CPU1
>
>     select_task_rq()                                idle
>     rq_lock(CPU1->rq)
>     enqueue_task(CPU1->rq)
>     notify CPU1 (by sending IPI or CPU1 polling)
>
>                                                     resched()
>
> After this patch:
>
>          CPU0                                       CPU1
>
>     select_task_rq()                                idle
>     add to wakelist of CPU1
>     notify CPU1 (by sending IPI or CPU1 polling)
>
>                                                     rq_lock(CPU1->rq)
>                                                     enqueue_task(CPU1->rq)
>                                                     resched()
>
> We see CPU0 can finish its work earlier. It only needs to put the task on
> the wakelist and return.
> Meanwhile CPU1 is idle, so let it handle its own runqueue data.
>
> This patch makes no difference to IPI behavior.
> This patch only takes effect when the wakee cpu is:
> 1) idle polling
> 2) idle not polling
>
> For 1), there will be no IPI with or without this patch.
>
> For 2), there will always be an IPI before or after this patch.
> Before this patch: waker cpu will enqueue task and check preempt. Since
> "idle" is sure to be preempted, waker cpu must send a resched IPI.
> After this patch: waker cpu will put the task on the wakelist of wakee
> cpu, and send an IPI.
>
> Benchmark:
> We've tested schbench, unixbench, and hackbench on both x86 and arm64.
>
> On x86 (Intel Xeon Platinum 8269CY):
> schbench -m 2 -t 8
>
> Latency percentiles (usec)       before        after
>         50.0000th:                    8            6
>         75.0000th:                   10            7
>         90.0000th:                   11            8
>         95.0000th:                   12            8
>         *99.0000th:                  13           10
>         99.5000th:                   15           11
>         99.9000th:                   18           14
>
> Unixbench with full threads (104)
>                                              before        after
> Dhrystone 2 using register variables     3011862938   3009935994   -0.06%
> Double-Precision Whetstone                 617119.3     617298.5    0.03%
> Execl Throughput                            27667.3      27627.3   -0.14%
> File Copy 1024 bufsize 2000 maxblocks      785871.4     784906.2   -0.12%
> File Copy 256 bufsize 500 maxblocks        210113.6     212635.4    1.20%
> File Copy 4096 bufsize 8000 maxblocks     2328862.2    2320529.1   -0.36%
> Pipe Throughput                         145535622.8  145323033.2   -0.15%
> Pipe-based Context Switching              3221686.4    3583975.4   11.25%
> Process Creation                           101347.1     103345.4    1.97%
> Shell Scripts (1 concurrent)               120193.5     123977.8    3.15%
> Shell Scripts (8 concurrent)                17233.4      17138.4   -0.55%
> System Call Overhead                      5300604.8    5312213.6    0.22%
>
> hackbench -g 1 -l 100000
>                                              before        after
> Time                                          3.246        2.251
>
> On arm64 (Ampere Altra):
> schbench -m 2 -t 8
>
> Latency percentiles (usec)       before        after
>         50.0000th:                   14           10
>         75.0000th:                   19           14
>         90.0000th:                   22           16
>         95.0000th:                   23           16
>         *99.0000th:                  24           17
>         99.5000th:                   24           17
>         99.9000th:                   28           25
>
> Unixbench with full threads (80)
>                                              before        after
> Dhrystone 2 using register variables     3536194249   3536476016   -0.01%
> Double-Precision Whetstone                 629383.6     629333.3   -0.01%
> Execl Throughput                            65920.5      66288.8   -0.49%
> File Copy 1024 bufsize 2000 maxblocks     1038402.1    1050181.2    1.13%
> File Copy 256 bufsize 500 maxblocks        311054.2     310317.2   -0.24%
> File Copy 4096 bufsize 8000 maxblocks     2276795.6      2297703    0.92%
> Pipe Throughput                         130409359.9  130390848.7   -0.01%
> Pipe-based Context Switching              3148440.7    3383705.1    7.47%
> Process Creation                           111574.3     119728.6    7.31%
> Shell Scripts (1 concurrent)               122980.7     122657.4   -0.26%
> Shell Scripts (8 concurrent)                17482.8      17476.8   -0.03%
> System Call Overhead                      4424103.4    4430062.6    0.13%
>
> Dhrystone 2 using register variables     3536194249   3537019613    0.02%
> Double-Precision Whetstone                 629383.6     629431.6    0.01%
> Execl Throughput                            65920.5      65846.2   -0.11%
> File Copy 1024 bufsize 2000 maxblocks     1063722.8    1064026.8    0.03%
> File Copy 256 bufsize 500 maxblocks        322684.5     318724.5   -1.23%
> File Copy 4096 bufsize 8000 maxblocks     2348285.3    2328804.8   -0.83%
> Pipe Throughput                         133542875.3  131619389.8   -1.44%
> Pipe-based Context Switching              3215356.1    3576945.1   11.25%
> Process Creation                           108520.5     120184.6   10.75%
> Shell Scripts (1 concurrent)               122636.3       121888   -0.61%
> Shell Scripts (8 concurrent)                17462.1      17381.4   -0.46%
> System Call Overhead                      4429998.9    4435006.7    0.11%

Oops... I forgot to remove the previous result. Let me resend one.
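
For anyone skimming the thread, the wakelist decision described in the quoted changelog can be modeled roughly as below. This is only a userspace sketch under stated assumptions, not the kernel code: model_use_wakelist(), model_needs_ipi(), model_self_check() and all their parameters are hypothetical names made up for illustration, and the model leaves out every condition the changelog does not mention.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only. All names here are hypothetical; this is not the
 * kernel's wakeup path, just the rule as worded in the changelog.
 */
static bool model_use_wakelist(bool share_llc, bool wf_on_cpu,
			       bool wakee_idle, bool patched)
{
	/* Crossing a cache boundary: queue on the remote wakelist. */
	if (!share_llc)
		return true;

	/* Same llc: WF_ON_CPU was the only case before the patch. */
	if (wf_on_cpu)
		return true;

	/* With the patch, an idle wakee cpu also qualifies. */
	return patched && wakee_idle;
}

/*
 * An idle wakee cpu needs an IPI only when it is not polling.
 * The answer does not depend on whether the patch is applied.
 */
static bool model_needs_ipi(bool wakee_polling)
{
	return !wakee_polling;
}

/* Check the model against the cases spelled out in the changelog. */
static void model_self_check(void)
{
	/* Same llc, idle wakee, no WF_ON_CPU: only the patched rule queues. */
	assert(!model_use_wakelist(true, false, true, false));
	assert(model_use_wakelist(true, false, true, true));

	/* 1) idle polling: no IPI; 2) idle not polling: always an IPI. */
	assert(!model_needs_ipi(true));
	assert(model_needs_ipi(false));
}
```

Calling model_self_check() from a small test program exercises the two cases from the changelog. The only behavioural change in this model is the same-llc, idle-wakee case, and the IPI requirement is unchanged.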