From: Chen Yu <yu.c.chen@intel.com>
To: Yicong Yang <yangyicong@huawei.com>
Cc: <yangyicong@hisilicon.com>, <peterz@infradead.org>,
	<mingo@redhat.com>, <juri.lelli@redhat.com>,
	<vincent.guittot@linaro.org>, <tim.c.chen@linux.intel.com>,
	<gautham.shenoy@amd.com>, <linux-kernel@vger.kernel.org>,
	<linux-arm-kernel@lists.infradead.org>,
	<dietmar.eggemann@arm.com>, <rostedt@goodmis.org>,
	<bsegall@google.com>, <bristot@redhat.com>,
	<prime.zeng@huawei.com>, <jonathan.cameron@huawei.com>,
	<ego@linux.vnet.ibm.com>, <srikar@linux.vnet.ibm.com>,
	<linuxarm@huawei.com>, <21cnbao@gmail.com>,
	<guodong.xu@linaro.org>, <hesham.almatary@huawei.com>,
	<john.garry@huawei.com>, <shenyang39@huawei.com>,
	<kprateek.nayak@amd.com>, <wuyun.abel@bytedance.com>,
	<tim.c.chen@intel.com>
Subject: Re: [RESEND PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
Date: Tue, 23 May 2023 21:44:14 +0800	[thread overview]
Message-ID: <ZGzDLuVaHR1PAYDt@chenyu5-mobl1> (raw)
In-Reply-To: <46e8d4fc-993b-e1d6-5e4c-cb33513d7888@huawei.com>

On 2023-05-22 at 20:42:19 +0800, Yicong Yang wrote:
> Hi Chen,
> 
> On 2023/5/22 14:29, Chen Yu wrote:
> > Hi Yicong,
> > On 2022-09-15 at 15:34:23 +0800, Yicong Yang wrote:
> >> From: Barry Song <song.bao.hua@hisilicon.com>
> >>
[snip...]
> 
> Thanks for the further information. The netperf/tbench results look as good as we
> expected: cluster wakeup is expected to gain more benefit when the system is
> under-loaded or well-loaded. May I know how many CPUs share a cluster on Jacobsville?
>
There are 4 CPUs per cluster on Jacobsville.
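For reference, the cluster span can also be seen from sysfs; assuming contiguous
CPU numbering on that box, it should look something like:

	$ cat /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list
	0-3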
[snip...]
> >> @@ -6550,7 +6574,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >>  	/*
> >>  	 * If the previous CPU is cache affine and idle, don't be stupid:
> >>  	 */
> >> -	if (prev != target && cpus_share_cache(prev, target) &&
> >> +	if (prev != target && cpus_share_lowest_cache(prev, target) &&
> > This change impacts hackbench in socket mode a bit. It seems that for hackbench, even
> > putting the wakee on its previous CPU in the same LLC is better than putting it in the
> > current cluster. But this seems to be hackbench-specific.
> > 
> 
> ...without this, do you still see the same improvement in the under-loaded case (threads less than
> or equal to the number of CPUs) for tbench/netperf?
> The idea here is to always try to wake up in the same cluster as the
> target to benefit from the cluster cache, but the early tests for the prev and recent used CPU may break
> that. If we keep them as is, at low load the prev CPU or recent used CPU has more chance to be idle, so we
> have less chance to benefit from the cluster and gain less performance improvement.
> 
Right. Without the above change I saw a lower improvement in the lightly loaded case for netperf/tbench.
> In the hackbench case, as you noticed, the utilization can ideally reach 100%, so SIS_UTIL
> will regulate the scan number down to around 4. If the prev/recent used CPU is not in the same
> cluster as the target, we start scanning the cluster, and when no idle CPU is found and the scan
> number has run out, we fall back to waking up on the target. That may be the reason
> why you observed more wakeups on the target rather than on the previous CPU.
> 
Looks reasonable. When the scan budget is low, we may fail to find an idle CPU in the local
cluster and terminate the scan before reaching an idle prev in a remote cluster, although that
prev could be a better choice than the target CPU.
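To illustrate (a simplified sketch of the scan path only, not the real
select_idle_cpu() code -- the idle-core case and other details are omitted):

	/*
	 * SIS_UTIL derives the scan budget from the LLC utilization; it can
	 * drop to ~4 when the LLC is close to 100% utilized, as in hackbench.
	 */
	int nr = READ_ONCE(sd_share->nr_idle_scan);

	/*
	 * With the cluster patch, the CPUs of the target's cluster are
	 * scanned before the rest of the LLC.
	 */
	for_each_cpu_wrap(cpu, cpus, target + 1) {
		if (!--nr)
			return -1;	/* budget exhausted, caller falls back to target */
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			return cpu;
	}

So when the whole budget is spent inside the target's cluster, an idle prev
sitting in another cluster of the same LLC is never reached, unless we give
it a second chance afterwards.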
> In this case I wonder whether choosing the prev CPU or recent used CPU after scanning the cluster can
> help the situation here, like the snippet below (kind of messy though).
> 
This change makes sense to me. I only modified it a little bit, so that only prev gets a second
chance. With your patch applied, the netperf/tbench improvement remains while the big hackbench
regression is gone.

hackbench
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1-groups	 1.00 (  2.35)	 -0.65 (  1.81)
process-pipe    	2-groups	 1.00 (  0.42)	 -2.16 (  1.12)
process-pipe    	4-groups	 1.00 (  1.84)	 +0.72 (  1.34)
process-pipe    	8-groups	 1.00 (  2.81)	 +1.12 (  3.88)
process-sockets 	1-groups	 1.00 (  1.88)	 -0.99 (  4.84)
process-sockets 	2-groups	 1.00 (  5.49)	 -4.50 (  4.09)
process-sockets 	4-groups	 1.00 (  3.54)	 +2.28 (  3.13)
process-sockets 	8-groups	 1.00 (  0.79)	 -0.13 (  1.28)
threads-pipe    	1-groups	 1.00 (  1.73)	 -2.39 (  0.40)
threads-pipe    	2-groups	 1.00 (  0.73)	 +2.88 (  1.94)
threads-pipe    	4-groups	 1.00 (  0.64)	 +1.12 (  1.82)
threads-pipe    	8-groups	 1.00 (  1.55)	 -1.59 (  1.20)
threads-sockets 	1-groups	 1.00 (  3.76)	 +3.21 (  3.56)
threads-sockets 	2-groups	 1.00 (  1.20)	 -5.56 (  2.64)
threads-sockets 	4-groups	 1.00 (  2.65)	 +1.48 (  4.91)
threads-sockets 	8-groups	 1.00 (  0.08)	 +0.18 (  0.15)

netperf
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	6-threads	 1.00 (  0.91)	 +2.87 (  0.83)
TCP_RR          	12-threads	 1.00 (  0.22)	 +3.48 (  0.31)
TCP_RR          	18-threads	 1.00 (  0.41)	 +7.81 (  0.48)
TCP_RR          	24-threads	 1.00 (  1.02)	 -0.32 (  1.25)
TCP_RR          	30-threads	 1.00 (  4.67)	 -0.04 (  5.14)
TCP_RR          	36-threads	 1.00 (  4.53)	 -0.13 (  4.37)
TCP_RR          	42-threads	 1.00 (  3.92)	 -0.15 (  3.07)
TCP_RR          	48-threads	 1.00 (  2.07)	 -0.17 (  1.52)
UDP_RR          	6-threads	 1.00 (  0.98)	 +4.50 (  2.38)
UDP_RR          	12-threads	 1.00 (  0.26)	 +3.64 (  0.25)
UDP_RR          	18-threads	 1.00 (  0.27)	 +9.93 (  0.55)
UDP_RR          	24-threads	 1.00 (  1.22)	 +0.13 (  1.33)
UDP_RR          	30-threads	 1.00 (  3.86)	 -0.03 (  5.05)
UDP_RR          	36-threads	 1.00 (  2.81)	 +0.10 (  3.37)
UDP_RR          	42-threads	 1.00 (  3.51)	 -0.26 (  2.94)
UDP_RR          	48-threads	 1.00 ( 12.54)	 +0.74 (  9.44)

tbench
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	6-threads	 1.00 (  0.04)	 +2.94 (  0.26)
loopback        	12-threads	 1.00 (  0.30)	 +4.58 (  0.12)
loopback        	18-threads	 1.00 (  0.37)	+12.38 (  0.10)
loopback        	24-threads	 1.00 (  0.56)	 -0.27 (  0.50)
loopback        	30-threads	 1.00 (  0.17)	 -0.18 (  0.06)
loopback        	36-threads	 1.00 (  0.25)	 -0.73 (  0.44)
loopback        	42-threads	 1.00 (  0.10)	 -0.22 (  0.18)
loopback        	48-threads	 1.00 (  0.29)	 -0.48 (  0.19)

schbench
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1-mthreads	 1.00 (  0.00)	 +0.00 (  0.00)
normal          	2-mthreads	 1.00 (  0.00)	 +0.00 (  0.00)
normal          	4-mthreads	 1.00 (  6.80)	 +2.78 (  8.08)
normal          	8-mthreads	 1.00 (  3.65)	 -0.23 (  4.30)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0989116b0796..07495b44c68f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7127,7 +7127,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	bool has_idle_core = false;
 	struct sched_domain *sd;
 	unsigned long task_util, util_min, util_max;
-	int i, recent_used_cpu;
+	int i, recent_used_cpu, prev_aff = -1;
 
 	/*
 	 * On asymmetric system, update task utilization because we will check
@@ -7152,10 +7152,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
-	if (prev != target && cpus_share_lowest_cache(prev, target) &&
+	if (prev != target && cpus_share_cache(prev, target) &&
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
-	    asym_fits_cpu(task_util, util_min, util_max, prev))
-		return prev;
+	    asym_fits_cpu(task_util, util_min, util_max, prev)) {
+		if (cpus_share_lowest_cache(prev, target))
+			return prev;
+		prev_aff = prev;
+	}
 
 	/*
 	 * Allow a per-cpu kthread to stack with the wakee if the
@@ -7223,6 +7226,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	/*
+	 * Give prev another chance, in case prev has not been
+	 * scanned in select_idle_cpu() due to the nr constraint.
+	 */
+	if (prev_aff != -1)
+		return prev_aff;
+
 	return target;
 }
 

thanks,
Chenyu
