linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] sched/fair: SIS improvements and cleanups
@ 2022-07-12  8:20 Abel Wu
  2022-07-12  8:20 ` [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core Abel Wu
                   ` (5 more replies)
  0 siblings, 6 replies; 34+ messages in thread
From: Abel Wu @ 2022-07-12  8:20 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, linux-kernel, Abel Wu

The first patch ignores the new sched_feat SIS_UTIL when
there are idle cores available, to make the fullest possible
use of the cpus.

The other patches are split out from the SIS filter patchset [1],
since they are independent of the filter.

[1] https://lore.kernel.org/lkml/20220619120451.95251-1-wuyun.abel@bytedance.com/

Abel Wu (5):
  sched/fair: ignore SIS_UTIL when has idle core
  sched/fair: default to false in test_idle_cores
  sched/fair: remove redundant check in select_idle_smt
  sched/fair: avoid double search on same cpu
  sched/fair: remove useless check in select_idle_core

 kernel/sched/fair.c | 45 ++++++++++++++++++++++-----------------------
 1 file changed, 22 insertions(+), 23 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-07-12  8:20 [PATCH 0/5] sched/fair: SIS improvements and cleanups Abel Wu
@ 2022-07-12  8:20 ` Abel Wu
  2022-07-13  3:47   ` Chen Yu
                     ` (3 more replies)
  2022-07-12  8:20 ` [PATCH 2/5] sched/fair: default to false in test_idle_cores Abel Wu
                   ` (4 subsequent siblings)
  5 siblings, 4 replies; 34+ messages in thread
From: Abel Wu @ 2022-07-12  8:20 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, linux-kernel, Abel Wu

When SIS_UTIL is enabled, SIS domain scan will be skipped if
the LLC is overloaded. Since the overloaded status is checked
in the load balancing at LLC level, the interval is llc_size
milliseconds. The duration might be long enough to affect the
overall system throughput if idle cores are out of reach in
SIS domain scan.
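
For context, the SIS_UTIL gate in question sits right in front of the
domain scan and looks roughly like below (a simplified sketch of the
base kernel, not the exact hunk; nr_idle_scan is only refreshed by the
LLC-level load balancing mentioned above):

	if (sched_feat(SIS_UTIL)) {
		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
		if (sd_share) {
			/* because !--nr is the condition to stop scan */
			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
			/* overloaded LLC is unlikely to have idle cpu/core */
			if (nr == 1)
				return -1;	/* no scan until the next LLC load balance */
		}
	}

With this patch, has_idle_core takes the new "goto scan" path before
reaching this block, so idle cores stay reachable even while the LLC
is considered overloaded.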

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 kernel/sched/fair.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a78d2e3b9d49..cd758b3616bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	struct sched_domain *this_sd;
 	u64 time = 0;
 
-	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
-	if (!this_sd)
-		return -1;
-
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
-	if (sched_feat(SIS_PROP) && !has_idle_core) {
+	if (has_idle_core)
+		goto scan;
+
+	if (sched_feat(SIS_PROP)) {
 		u64 avg_cost, avg_idle, span_avg;
 		unsigned long now = jiffies;
 
+		this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+		if (!this_sd)
+			return -1;
+
 		/*
 		 * If we're busy, the assumption that the last idle period
 		 * predicts the future is flawed; age away the remaining
@@ -6436,7 +6439,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 				return -1;
 		}
 	}
-
+scan:
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/5] sched/fair: default to false in test_idle_cores
  2022-07-12  8:20 [PATCH 0/5] sched/fair: SIS improvements and cleanups Abel Wu
  2022-07-12  8:20 ` [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core Abel Wu
@ 2022-07-12  8:20 ` Abel Wu
  2022-08-29 12:36   ` Mel Gorman
  2022-07-12  8:20 ` [PATCH 3/5] sched/fair: remove redundant check in select_idle_smt Abel Wu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 34+ messages in thread
From: Abel Wu @ 2022-07-12  8:20 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, linux-kernel, Abel Wu

It's uncertain whether idle cores exist or not if the shared
sched-domains are not ready, so returning "no idle cores"
usually makes sense.

__update_idle_core() is an exception: it checks the status of
this core and sets the flag in the shared sched-domain if
necessary. So the whole logic depends on the existence of sds,
and can bail out early if !sds.

It's somewhat tricky, and as Josh suggested, this should be
transient while the domain isn't ready. So remove the
caller-supplied default value to make things clearer, while
introducing only a little overhead in the idle path.
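
For reference, the idle-path exception above is roughly the following
(a simplified sketch of __update_idle_core(), assuming the mainline
shape of the function; with the default now false, the sibling walk
below may run before sds is set up, but set_idle_cores() is then
simply a no-op):

	void __update_idle_core(struct rq *rq)
	{
		int core = cpu_of(rq);
		int cpu;

		rcu_read_lock();
		if (test_idle_cores(core))	/* false while !sds */
			goto unlock;

		for_each_cpu(cpu, cpu_smt_mask(core)) {
			if (cpu == core)
				continue;
			if (!available_idle_cpu(cpu))
				goto unlock;
		}

		set_idle_cores(core, 1);	/* ignored while !sds */
	unlock:
		rcu_read_unlock();
	}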

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Reviewed-by: Josh Don <joshdon@google.com>
---
 kernel/sched/fair.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd758b3616bd..c72093f282ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1592,11 +1592,11 @@ numa_type numa_classify(unsigned int imbalance_pct,
 
 #ifdef CONFIG_SCHED_SMT
 /* Forward declarations of select_idle_sibling helpers */
-static inline bool test_idle_cores(int cpu, bool def);
+static inline bool test_idle_cores(int cpu);
 static inline int numa_idle_core(int idle_core, int cpu)
 {
 	if (!static_branch_likely(&sched_smt_present) ||
-	    idle_core >= 0 || !test_idle_cores(cpu, false))
+	    idle_core >= 0 || !test_idle_cores(cpu))
 		return idle_core;
 
 	/*
@@ -6260,7 +6260,7 @@ static inline void set_idle_cores(int cpu, int val)
 		WRITE_ONCE(sds->has_idle_cores, val);
 }
 
-static inline bool test_idle_cores(int cpu, bool def)
+static inline bool test_idle_cores(int cpu)
 {
 	struct sched_domain_shared *sds;
 
@@ -6268,7 +6268,7 @@ static inline bool test_idle_cores(int cpu, bool def)
 	if (sds)
 		return READ_ONCE(sds->has_idle_cores);
 
-	return def;
+	return false;
 }
 
 /*
@@ -6284,7 +6284,7 @@ void __update_idle_core(struct rq *rq)
 	int cpu;
 
 	rcu_read_lock();
-	if (test_idle_cores(core, true))
+	if (test_idle_cores(core))
 		goto unlock;
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -6360,9 +6360,9 @@ static inline void set_idle_cores(int cpu, int val)
 {
 }
 
-static inline bool test_idle_cores(int cpu, bool def)
+static inline bool test_idle_cores(int cpu)
 {
-	return def;
+	return false;
 }
 
 static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
@@ -6604,7 +6604,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		return target;
 
 	if (sched_smt_active()) {
-		has_idle_core = test_idle_cores(target, false);
+		has_idle_core = test_idle_cores(target);
 
 		if (!has_idle_core && cpus_share_cache(prev, target)) {
 			i = select_idle_smt(p, sd, prev);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/5] sched/fair: remove redundant check in select_idle_smt
  2022-07-12  8:20 [PATCH 0/5] sched/fair: SIS improvements and cleanups Abel Wu
  2022-07-12  8:20 ` [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core Abel Wu
  2022-07-12  8:20 ` [PATCH 2/5] sched/fair: default to false in test_idle_cores Abel Wu
@ 2022-07-12  8:20 ` Abel Wu
  2022-08-29 12:36   ` Mel Gorman
  2022-07-12  8:20 ` [PATCH 4/5] sched/fair: avoid double search on same cpu Abel Wu
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 34+ messages in thread
From: Abel Wu @ 2022-07-12  8:20 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, linux-kernel, Abel Wu

If two cpus share LLC cache, then the two cores they belong to
are also in the same LLC domain.
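
In other words (a minimal sketch of the reasoning; note that
select_idle_smt() is invoked with prev as its 'target' argument, and
the caller has already verified cpus_share_cache(prev, target)):
prev's SMT siblings necessarily share the LLC too, so they all lie in
sched_domain_span(sd) and the removed span test could never fail:

	for_each_cpu_and(cpu, cpu_smt_mask(target), p->cpus_ptr) {
		/* cpumask_test_cpu(cpu, sched_domain_span(sd)) is always true here */
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			return cpu;
	}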

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Reviewed-by: Josh Don <joshdon@google.com>
---
 kernel/sched/fair.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c72093f282ec..0d7e8555bcf9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6339,14 +6339,11 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 /*
  * Scan the local SMT mask for idle CPUs.
  */
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_smt(struct task_struct *p, int target)
 {
 	int cpu;
 
-	for_each_cpu(cpu, cpu_smt_mask(target)) {
-		if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
-		    !cpumask_test_cpu(cpu, sched_domain_span(sd)))
-			continue;
+	for_each_cpu_and(cpu, cpu_smt_mask(target), p->cpus_ptr) {
 		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
 			return cpu;
 	}
@@ -6370,7 +6367,7 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
 	return __select_idle_cpu(core, p);
 }
 
-static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+static inline int select_idle_smt(struct task_struct *p, int target)
 {
 	return -1;
 }
@@ -6607,7 +6604,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		has_idle_core = test_idle_cores(target);
 
 		if (!has_idle_core && cpus_share_cache(prev, target)) {
-			i = select_idle_smt(p, sd, prev);
+			i = select_idle_smt(p, prev);
 			if ((unsigned int)i < nr_cpumask_bits)
 				return i;
 		}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 4/5] sched/fair: avoid double search on same cpu
  2022-07-12  8:20 [PATCH 0/5] sched/fair: SIS improvements and cleanups Abel Wu
                   ` (2 preceding siblings ...)
  2022-07-12  8:20 ` [PATCH 3/5] sched/fair: remove redundant check in select_idle_smt Abel Wu
@ 2022-07-12  8:20 ` Abel Wu
  2022-08-29 12:36   ` Mel Gorman
  2022-07-12  8:20 ` [PATCH 5/5] sched/fair: remove useless check in select_idle_core Abel Wu
  2022-08-15 13:31 ` [PATCH 0/5] sched/fair: SIS improvements and cleanups Abel Wu
  5 siblings, 1 reply; 34+ messages in thread
From: Abel Wu @ 2022-07-12  8:20 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, linux-kernel, Abel Wu

The prev cpu is checked at the beginning of SIS, and it's unlikely
to be idle before the second check in select_idle_smt(). So we'd
better focus on its SMT siblings.
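
For context, the first check referred to above sits near the top of
select_idle_sibling() and looks roughly like this (simplified sketch;
the asym-capacity test is omitted):

	/*
	 * If the previous CPU is cache affine and idle, don't be stupid:
	 */
	if (prev != target && cpus_share_cache(prev, target) &&
	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
		return prev;

By the time select_idle_smt() runs, prev has already been found busy,
so re-testing it there is almost always wasted work.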

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Reviewed-by: Josh Don <joshdon@google.com>
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d7e8555bcf9..e4cf000604fc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6344,6 +6344,8 @@ static int select_idle_smt(struct task_struct *p, int target)
 	int cpu;
 
 	for_each_cpu_and(cpu, cpu_smt_mask(target), p->cpus_ptr) {
+		if (cpu == target)
+			continue;
 		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
 			return cpu;
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 5/5] sched/fair: remove useless check in select_idle_core
  2022-07-12  8:20 [PATCH 0/5] sched/fair: SIS improvements and cleanups Abel Wu
                   ` (3 preceding siblings ...)
  2022-07-12  8:20 ` [PATCH 4/5] sched/fair: avoid double search on same cpu Abel Wu
@ 2022-07-12  8:20 ` Abel Wu
  2022-08-29 12:37   ` Mel Gorman
  2022-08-15 13:31 ` [PATCH 0/5] sched/fair: SIS improvements and cleanups Abel Wu
  5 siblings, 1 reply; 34+ messages in thread
From: Abel Wu @ 2022-07-12  8:20 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, linux-kernel, Abel Wu

The function only gets called when sds->has_idle_cores is true,
which is possible only when sched_smt_present is enabled. This
change also aligns select_idle_core() with select_idle_smt() in
that the caller does the check if necessary.
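
For reference, the only call path is roughly the following (as also
visible in the diffs of the earlier patches), which is why
sched_smt_present is implied whenever select_idle_core() runs:

	/* select_idle_sibling() */
	if (sched_smt_active()) {
		has_idle_core = test_idle_cores(target);
		...
	}

	/* select_idle_cpu() */
	for_each_cpu_wrap(cpu, cpus, target + 1) {
		if (has_idle_core) {
			i = select_idle_core(p, cpu, cpus, &idle_cpu);
			...
		}
	}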

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 kernel/sched/fair.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4cf000604fc..7c47621ccd40 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6310,9 +6310,6 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 	bool idle = true;
 	int cpu;
 
-	if (!static_branch_likely(&sched_smt_present))
-		return __select_idle_cpu(core, p);
-
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
 		if (!available_idle_cpu(cpu)) {
 			idle = false;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-07-12  8:20 ` [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core Abel Wu
@ 2022-07-13  3:47   ` Chen Yu
  2022-07-13 16:14     ` Abel Wu
  2022-07-14  6:19   ` Yicong Yang
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Chen Yu @ 2022-07-13  3:47 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, linux-kernel

On Tue, Jul 12, 2022 at 04:20:32PM +0800, Abel Wu wrote:
> When SIS_UTIL is enabled, SIS domain scan will be skipped if
> the LLC is overloaded. Since the overloaded status is checked
> in the load balancing at LLC level, the interval is llc_size
> milliseconds. The duration might be long enough to affect the
> overall system throughput if idle cores are out of reach in
> SIS domain scan.
The idle core scan was skipped in SIS_UTIL because we saw an
improvement in some benchmarks. But yes, we could let the
has_idle_core case scan anyway no matter what the system load is,
if we have some data to support it. I'll test this patch on top of
the latest sched/core branch to see if this makes a difference.

thanks,
Chenyu
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>  kernel/sched/fair.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a78d2e3b9d49..cd758b3616bd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  	struct sched_domain *this_sd;
>  	u64 time = 0;
>  
> -	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> -	if (!this_sd)
> -		return -1;
> -
>  	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>  
> -	if (sched_feat(SIS_PROP) && !has_idle_core) {
> +	if (has_idle_core)
> +		goto scan;
> +
> +	if (sched_feat(SIS_PROP)) {
>  		u64 avg_cost, avg_idle, span_avg;
>  		unsigned long now = jiffies;
>  
> +		this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> +		if (!this_sd)
> +			return -1;
> +
>  		/*
>  		 * If we're busy, the assumption that the last idle period
>  		 * predicts the future is flawed; age away the remaining
> @@ -6436,7 +6439,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  				return -1;
>  		}
>  	}
> -
> +scan:
>  	for_each_cpu_wrap(cpu, cpus, target + 1) {
>  		if (has_idle_core) {
>  			i = select_idle_core(p, cpu, cpus, &idle_cpu);
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-07-13  3:47   ` Chen Yu
@ 2022-07-13 16:14     ` Abel Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Abel Wu @ 2022-07-13 16:14 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, linux-kernel


On 7/13/22 11:47 AM, Chen Yu Wrote:
> On Tue, Jul 12, 2022 at 04:20:32PM +0800, Abel Wu wrote:
>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>> the LLC is overloaded. Since the overloaded status is checked
>> in the load balancing at LLC level, the interval is llc_size
>> milliseconds. The duration might be long enough to affect the
>> overall system throughput if idle cores are out of reach in
>> SIS domain scan.
> The idle core scan was skipped in SIS_UTIL because we saw an
> improvement in some benchmarks. But yes, we could let the
> has_idle_core case scan anyway no matter what the system load is,
> if we have some data to support it. I'll test this patch on top of
> the latest sched/core branch to see if this makes a difference.

I did some benchmarks on my machine: netperf showed a ~3% regression
and a considerable improvement in hackbench pipe.

Maybe some workloads benefit from selecting recently used cpus
rather than idle ones, but I don't think the recent cpu can
out-perform an idle core by much, since the two cpus still share
the LLC cache and an idle core has more capacity than an idle cpu.
Apart from that case, idle cores are usually the better choice I
think. So it might be worth a full scan if idle cores are available.

Thanks,
Abel

-------8<---

Tests are done in an Intel Xeon(R) Platinum 8260 CPU@2.40GHz machine
with 2 NUMA nodes each of which has 24 cores with SMT2 enabled, so 96
CPUs in total.

All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled.

Based on tip sched/core c02d5546ea34 (v5.19-rc2).

1) hackbench-process-pipes

Amean     1        0.2307 (   0.00%)      0.2277 (   1.30%)
Amean     4        0.6277 (   0.00%)      0.6037 (   3.82%)
Amean     7        0.7897 (   0.00%)      0.7710 *   2.36%*
Amean     12       1.2067 (   0.00%)      1.3087 (  -8.45%)
Amean     21       2.6717 (   0.00%)      2.5077 (   6.14%)
Amean     30       5.0893 (   0.00%)      3.9610 (  22.17%)
Amean     48       9.2407 (   0.00%)      7.4030 *  19.89%*
Amean     79       9.5607 (   0.00%)      9.3960 (   1.72%)
Amean     110     10.6053 (   0.00%)     10.6453 (  -0.38%)
Amean     141     12.8210 (   0.00%)     12.5253 (   2.31%)
Amean     172     14.9777 (   0.00%)     14.4317 (   3.65%)
Amean     203     18.1320 (   0.00%)     16.9753 *   6.38%*
Amean     234     20.1207 (   0.00%)     19.0730 *   5.21%*
Amean     265     22.7037 (   0.00%)     21.7953 (   4.00%)
Amean     296     25.6753 (   0.00%)     23.6877 *   7.74%*

2) hackbench-process-sockets

Amean     1        0.4223 (   0.00%)      0.4300 (  -1.82%)
Amean     4        1.4470 (   0.00%)      1.4593 *  -0.85%*
Amean     7        2.4803 (   0.00%)      2.4843 (  -0.16%)
Amean     12       4.1457 (   0.00%)      4.1170 *   0.69%*
Amean     21       7.0223 (   0.00%)      7.0053 (   0.24%)
Amean     30       9.9570 (   0.00%)      9.8683 *   0.89%*
Amean     48      16.0213 (   0.00%)     15.6657 *   2.22%*
Amean     79      26.8140 (   0.00%)     26.3657 *   1.67%*
Amean     110     37.3530 (   0.00%)     36.8800 *   1.27%*
Amean     141     47.6123 (   0.00%)     47.1627 (   0.94%)
Amean     172     57.6757 (   0.00%)     57.1430 *   0.92%*
Amean     203     68.2310 (   0.00%)     67.6030 (   0.92%)
Amean     234     78.1990 (   0.00%)     77.7073 *   0.63%*
Amean     265     88.9657 (   0.00%)     87.9833 *   1.10%*
Amean     296     99.8617 (   0.00%)     98.1073 *   1.76%*

3) hackbench-thread-pipes

Amean     1        0.2650 (   0.00%)      0.2807 (  -5.91%)
Amean     4        0.6650 (   0.00%)      0.6257 *   5.91%*
Amean     7        0.8283 (   0.00%)      0.8207 (   0.93%)
Amean     12       1.3343 (   0.00%)      1.3050 (   2.20%)
Amean     21       3.7053 (   0.00%)      2.7530 *  25.70%*
Amean     30       7.2817 (   0.00%)      4.2013 *  42.30%*
Amean     48       9.9037 (   0.00%)      8.8483 *  10.66%*
Amean     79      10.0790 (   0.00%)     10.1603 (  -0.81%)
Amean     110     11.5837 (   0.00%)     11.1297 (   3.92%)
Amean     141     13.4760 (   0.00%)     12.9903 (   3.60%)
Amean     172     16.2357 (   0.00%)     15.5903 (   3.97%)
Amean     203     18.2873 (   0.00%)     17.8690 (   2.29%)
Amean     234     21.2920 (   0.00%)     20.4680 (   3.87%)
Amean     265     23.9393 (   0.00%)     22.3933 *   6.46%*
Amean     296     26.6683 (   0.00%)     26.1260 (   2.03%)

4) hackbench-thread-sockets

Amean     1        0.4700 (   0.00%)      0.4437 (   5.60%)
Amean     4        1.4837 (   0.00%)      1.4960 (  -0.83%)
Amean     7        2.5497 (   0.00%)      2.5240 *   1.01%*
Amean     12       4.2473 (   0.00%)      4.2137 *   0.79%*
Amean     21       7.2530 (   0.00%)      7.1800 *   1.01%*
Amean     30      10.1973 (   0.00%)     10.1483 (   0.48%)
Amean     48      16.2163 (   0.00%)     16.0870 *   0.80%*
Amean     79      27.4460 (   0.00%)     27.0770 *   1.34%*
Amean     110     38.1103 (   0.00%)     37.5573 *   1.45%*
Amean     141     48.4513 (   0.00%)     48.0347 (   0.86%)
Amean     172     59.4410 (   0.00%)     58.4020 *   1.75%*
Amean     203     70.0873 (   0.00%)     69.0470 *   1.48%*
Amean     234     80.3570 (   0.00%)     79.4163 *   1.17%*
Amean     265     91.8550 (   0.00%)     90.3417 *   1.65%*
Amean     296    102.7420 (   0.00%)    100.8933 *   1.80%*

5) netperf-udp

Hmean     send-64         210.14 (   0.00%)      202.26 *  -3.75%*
Hmean     send-128        418.23 (   0.00%)      404.14 *  -3.37%*
Hmean     send-256        821.31 (   0.00%)      789.94 *  -3.82%*
Hmean     send-1024      3161.96 (   0.00%)     3025.23 *  -4.32%*
Hmean     send-2048      5959.67 (   0.00%)     5725.35 *  -3.93%*
Hmean     send-3312      9196.99 (   0.00%)     8830.47 *  -3.99%*
Hmean     send-4096     11061.30 (   0.00%)    10675.83 *  -3.48%*
Hmean     send-8192     17320.16 (   0.00%)    17601.11 *   1.62%*
Hmean     send-16384    27936.12 (   0.00%)    27859.52 (  -0.27%)
Hmean     recv-64         210.14 (   0.00%)      202.26 *  -3.75%*
Hmean     recv-128        418.23 (   0.00%)      404.14 *  -3.37%*
Hmean     recv-256        821.31 (   0.00%)      789.94 *  -3.82%*
Hmean     recv-1024      3161.95 (   0.00%)     3025.23 *  -4.32%*
Hmean     recv-2048      5959.44 (   0.00%)     5725.35 *  -3.93%*
Hmean     recv-3312      9196.99 (   0.00%)     8830.44 *  -3.99%*
Hmean     recv-4096     11061.26 (   0.00%)    10675.83 *  -3.48%*
Hmean     recv-8192     17319.62 (   0.00%)    17601.10 *   1.63%*
Hmean     recv-16384    27935.00 (   0.00%)    27859.40 (  -0.27%)

6) netperf-tcp

Hmean     64        1222.35 (   0.00%)     1195.62 *  -2.19%*
Hmean     128       2372.82 (   0.00%)     2311.15 *  -2.60%*
Hmean     256       4305.82 (   0.00%)     4244.00 (  -1.44%)
Hmean     1024     13645.83 (   0.00%)    13299.13 *  -2.54%*
Hmean     2048     21119.56 (   0.00%)    21042.51 (  -0.36%)
Hmean     3312     24835.87 (   0.00%)    25282.73 *   1.80%*
Hmean     4096     26705.57 (   0.00%)    26851.09 (   0.54%)
Hmean     8192     30889.25 (   0.00%)    31436.74 *   1.77%*
Hmean     16384    35108.55 (   0.00%)    35172.76 (   0.18%)

7) tbench4 Throughput

Hmean     1        289.23 (   0.00%)      289.58 (   0.12%)
Hmean     2        577.35 (   0.00%)      576.40 *  -0.17%*
Hmean     4       1141.02 (   0.00%)     1155.58 *   1.28%*
Hmean     8       2258.36 (   0.00%)     2287.82 *   1.30%*
Hmean     16      4410.02 (   0.00%)     4510.76 *   2.28%*
Hmean     32      7414.89 (   0.00%)     7474.74 *   0.81%*
Hmean     64      9011.49 (   0.00%)     8973.50 *  -0.42%*
Hmean     128    19892.18 (   0.00%)    19913.82 *   0.11%*
Hmean     256    19854.73 (   0.00%)    20239.21 *   1.94%*
Hmean     384    19808.80 (   0.00%)    19709.59 *  -0.50%*



> 
> thanks,
> Chenyu
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>> ---
>>   kernel/sched/fair.c | 15 +++++++++------
>>   1 file changed, 9 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index a78d2e3b9d49..cd758b3616bd 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>   	struct sched_domain *this_sd;
>>   	u64 time = 0;
>>   
>> -	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> -	if (!this_sd)
>> -		return -1;
>> -
>>   	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>   
>> -	if (sched_feat(SIS_PROP) && !has_idle_core) {
>> +	if (has_idle_core)
>> +		goto scan;
>> +
>> +	if (sched_feat(SIS_PROP)) {
>>   		u64 avg_cost, avg_idle, span_avg;
>>   		unsigned long now = jiffies;
>>   
>> +		this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> +		if (!this_sd)
>> +			return -1;
>> +
>>   		/*
>>   		 * If we're busy, the assumption that the last idle period
>>   		 * predicts the future is flawed; age away the remaining
>> @@ -6436,7 +6439,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>   				return -1;
>>   		}
>>   	}
>> -
>> +scan:
>>   	for_each_cpu_wrap(cpu, cpus, target + 1) {
>>   		if (has_idle_core) {
>>   			i = select_idle_core(p, cpu, cpus, &idle_cpu);
>> -- 
>> 2.31.1
>>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-07-12  8:20 ` [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core Abel Wu
  2022-07-13  3:47   ` Chen Yu
@ 2022-07-14  6:19   ` Yicong Yang
  2022-07-14  6:58     ` Abel Wu
  2022-08-10 13:50   ` Chen Yu
  2022-08-29 13:08   ` Mel Gorman
  3 siblings, 1 reply; 34+ messages in thread
From: Yicong Yang @ 2022-07-14  6:19 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: yangyicong, Josh Don, Chen Yu, linux-kernel

On 2022/7/12 16:20, Abel Wu wrote:
> When SIS_UTIL is enabled, SIS domain scan will be skipped if
> the LLC is overloaded. Since the overloaded status is checked
> in the load balancing at LLC level, the interval is llc_size
> milliseconds. The duration might be long enough to affect the
> overall system throughput if idle cores are out of reach in
> SIS domain scan.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>  kernel/sched/fair.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a78d2e3b9d49..cd758b3616bd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  	struct sched_domain *this_sd;
>  	u64 time = 0;
>  
> -	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> -	if (!this_sd)
> -		return -1;
> -
>  	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>  
> -	if (sched_feat(SIS_PROP) && !has_idle_core) {
> +	if (has_idle_core)
> +		goto scan;
> +
> +	if (sched_feat(SIS_PROP)) {
>  		u64 avg_cost, avg_idle, span_avg;
>  		unsigned long now = jiffies;
>  
> +		this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> +		if (!this_sd)
> +			return -1;
> +

I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
commit. Does the position of this make any performance difference?

Thanks.

>  		/*
>  		 * If we're busy, the assumption that the last idle period
>  		 * predicts the future is flawed; age away the remaining
> @@ -6436,7 +6439,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  				return -1;
>  		}
>  	}
> -
> +scan:
>  	for_each_cpu_wrap(cpu, cpus, target + 1) {
>  		if (has_idle_core) {
>  			i = select_idle_core(p, cpu, cpus, &idle_cpu);
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-07-14  6:19   ` Yicong Yang
@ 2022-07-14  6:58     ` Abel Wu
  2022-07-14  7:15       ` Yicong Yang
  2022-08-04  9:59       ` Chen Yu
  0 siblings, 2 replies; 34+ messages in thread
From: Abel Wu @ 2022-07-14  6:58 UTC (permalink / raw)
  To: Yicong Yang, Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: yangyicong, Josh Don, Chen Yu, linux-kernel


On 7/14/22 2:19 PM, Yicong Yang Wrote:
> On 2022/7/12 16:20, Abel Wu wrote:
>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>> the LLC is overloaded. Since the overloaded status is checked
>> in the load balancing at LLC level, the interval is llc_size
>> milliseconds. The duration might be long enough to affect the
>> overall system throughput if idle cores are out of reach in
>> SIS domain scan.
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>> ---
>>   kernel/sched/fair.c | 15 +++++++++------
>>   1 file changed, 9 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index a78d2e3b9d49..cd758b3616bd 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>   	struct sched_domain *this_sd;
>>   	u64 time = 0;
>>   
>> -	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> -	if (!this_sd)
>> -		return -1;
>> -
>>   	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>   
>> -	if (sched_feat(SIS_PROP) && !has_idle_core) {
>> +	if (has_idle_core)
>> +		goto scan;
>> +
>> +	if (sched_feat(SIS_PROP)) {
>>   		u64 avg_cost, avg_idle, span_avg;
>>   		unsigned long now = jiffies;
>>   
>> +		this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> +		if (!this_sd)
>> +			return -1;
>> +
> 
> I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
> commit. Does the position of this make any performance difference?

No, this change doesn't make much difference to performance. Are
you suggesting that I should make this a separate patch?

Thanks,
Abel

> 
> Thanks.
> 
>>   		/*
>>   		 * If we're busy, the assumption that the last idle period
>>   		 * predicts the future is flawed; age away the remaining
>> @@ -6436,7 +6439,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>   				return -1;
>>   		}
>>   	}
>> -
>> +scan:
>>   	for_each_cpu_wrap(cpu, cpus, target + 1) {
>>   		if (has_idle_core) {
>>   			i = select_idle_core(p, cpu, cpus, &idle_cpu);
>>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-07-14  6:58     ` Abel Wu
@ 2022-07-14  7:15       ` Yicong Yang
  2022-07-14  8:00         ` Abel Wu
  2022-08-04  9:59       ` Chen Yu
  1 sibling, 1 reply; 34+ messages in thread
From: Yicong Yang @ 2022-07-14  7:15 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: yangyicong, Josh Don, Chen Yu, linux-kernel

On 2022/7/14 14:58, Abel Wu wrote:
> 
> On 7/14/22 2:19 PM, Yicong Yang Wrote:
>> On 2022/7/12 16:20, Abel Wu wrote:
>>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>>> the LLC is overloaded. Since the overloaded status is checked
>>> in the load balancing at LLC level, the interval is llc_size
>>> milliseconds. The duration might be long enough to affect the
>>> overall system throughput if idle cores are out of reach in
>>> SIS domain scan.
>>>
>>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>>> ---
>>>   kernel/sched/fair.c | 15 +++++++++------
>>>   1 file changed, 9 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index a78d2e3b9d49..cd758b3616bd 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>       struct sched_domain *this_sd;
>>>       u64 time = 0;
>>>   -    this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>> -    if (!this_sd)
>>> -        return -1;
>>> -
>>>       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>   -    if (sched_feat(SIS_PROP) && !has_idle_core) {
>>> +    if (has_idle_core)
>>> +        goto scan;
>>> +
>>> +    if (sched_feat(SIS_PROP)) {
>>>           u64 avg_cost, avg_idle, span_avg;
>>>           unsigned long now = jiffies;
>>>   +        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>> +        if (!this_sd)
>>> +            return -1;
>>> +
>>
>> I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
>> commit. Does the position of this make any performance difference?
> 
> No, this change doesn't make much difference to performance. Are
> you suggesting that I should make this a separate patch?
> 

It just makes me think that the dereference is unnecessary if this_cpu and target are
located in the same LLC, since the sd has already been passed in. But since you noticed
no difference it may have little effect. :)

> Thanks,
> Abel
> 
>>
>> Thanks.
>>
>>>           /*
>>>            * If we're busy, the assumption that the last idle period
>>>            * predicts the future is flawed; age away the remaining
>>> @@ -6436,7 +6439,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>                   return -1;
>>>           }
>>>       }
>>> -
>>> +scan:
>>>       for_each_cpu_wrap(cpu, cpus, target + 1) {
>>>           if (has_idle_core) {
>>>               i = select_idle_core(p, cpu, cpus, &idle_cpu);
>>>
> .

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-07-14  7:15       ` Yicong Yang
@ 2022-07-14  8:00         ` Abel Wu
  2022-07-14  8:16           ` Yicong Yang
  0 siblings, 1 reply; 34+ messages in thread
From: Abel Wu @ 2022-07-14  8:00 UTC (permalink / raw)
  To: Yicong Yang, Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: yangyicong, Josh Don, Chen Yu, linux-kernel


On 7/14/22 3:15 PM, Yicong Yang Wrote:
> On 2022/7/14 14:58, Abel Wu wrote:
>>
>> On 7/14/22 2:19 PM, Yicong Yang Wrote:
>>> On 2022/7/12 16:20, Abel Wu wrote:
>>>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>>>> the LLC is overloaded. Since the overloaded status is checked
>>>> in the load balancing at LLC level, the interval is llc_size
>>>> milliseconds. The duration might be long enough to affect the
>>>> overall system throughput if idle cores are out of reach in
>>>> SIS domain scan.
>>>>
>>>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>>>> ---
>>>>    kernel/sched/fair.c | 15 +++++++++------
>>>>    1 file changed, 9 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index a78d2e3b9d49..cd758b3616bd 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>>        struct sched_domain *this_sd;
>>>>        u64 time = 0;
>>>>    -    this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>> -    if (!this_sd)
>>>> -        return -1;
>>>> -
>>>>        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>>    -    if (sched_feat(SIS_PROP) && !has_idle_core) {
>>>> +    if (has_idle_core)
>>>> +        goto scan;
>>>> +
>>>> +    if (sched_feat(SIS_PROP)) {
>>>>            u64 avg_cost, avg_idle, span_avg;
>>>>            unsigned long now = jiffies;
>>>>    +        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>> +        if (!this_sd)
>>>> +            return -1;
>>>> +
>>>
>>> I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
>>> commit. Does the position of this make any performance difference?
>>
>> No, this change doesn't make much difference to performance. Are
>> you suggesting that I should make this a separate patch?
>>
> 
> It just makes me think that the dereference is unnecessary if this_cpu and target are
> located in the same LLC, since the sd has already been passed in. But since you noticed
> no difference it may have little effect. :)
> 

Hmm.. Not exactly. The sched-domains are cpu private, and this_cpu can
be in another LLC than target.
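
To spell it out (roughly, using the mainline lookups): the sd handed
to select_idle_cpu() is the target's LLC domain, while this_sd is the
waking cpu's own, and the two may well differ:

	/* in select_idle_sibling(): */
	sd = rcu_dereference(per_cpu(sd_llc, target));

	/* in select_idle_cpu(): */
	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));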

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-07-14  8:00         ` Abel Wu
@ 2022-07-14  8:16           ` Yicong Yang
  2022-07-14  8:34             ` Yicong Yang
  0 siblings, 1 reply; 34+ messages in thread
From: Yicong Yang @ 2022-07-14  8:16 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: yangyicong, Josh Don, Chen Yu, linux-kernel

On 2022/7/14 16:00, Abel Wu wrote:
> 
> On 7/14/22 3:15 PM, Yicong Yang Wrote:
>> On 2022/7/14 14:58, Abel Wu wrote:
>>>
>>> On 7/14/22 2:19 PM, Yicong Yang Wrote:
>>>> On 2022/7/12 16:20, Abel Wu wrote:
>>>>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>>>>> the LLC is overloaded. Since the overloaded status is checked
>>>>> in the load balancing at LLC level, the interval is llc_size
>>>>> milliseconds. The duration might be long enough to affect the
>>>>> overall system throughput if idle cores are out of reach in
>>>>> SIS domain scan.
>>>>>
>>>>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>>>>> ---
>>>>>    kernel/sched/fair.c | 15 +++++++++------
>>>>>    1 file changed, 9 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index a78d2e3b9d49..cd758b3616bd 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>>>        struct sched_domain *this_sd;
>>>>>        u64 time = 0;
>>>>>    -    this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>>> -    if (!this_sd)
>>>>> -        return -1;
>>>>> -
>>>>>        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>>>    -    if (sched_feat(SIS_PROP) && !has_idle_core) {
>>>>> +    if (has_idle_core)
>>>>> +        goto scan;
>>>>> +
>>>>> +    if (sched_feat(SIS_PROP)) {
>>>>>            u64 avg_cost, avg_idle, span_avg;
>>>>>            unsigned long now = jiffies;
>>>>>    +        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>>> +        if (!this_sd)
>>>>> +            return -1;
>>>>> +
>>>>
>>>> I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
>>>> commit. Does the position of this make any performance difference?
>>>
>>> No, this change doesn't make much difference to performance. Are
>>> you suggesting that I should make this a separate patch?
>>>
>>
>> It just makes me think that the dereference is unnecessary if this_cpu and target are
>> located in the same LLC, since the sd has already been passed in. But since you noticed
>> no difference it may have little effect. :)
>>
> 
> Hmm.. Not exactly. The sched-domains are cpu private, and this_cpu can
> be in another LLC than target.
> .

Yes, you're right. Sorry for getting this messed up.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-07-14  8:16           ` Yicong Yang
@ 2022-07-14  8:34             ` Yicong Yang
  0 siblings, 0 replies; 34+ messages in thread
From: Yicong Yang @ 2022-07-14  8:34 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: yangyicong, Josh Don, Chen Yu, linux-kernel

On 2022/7/14 16:16, Yicong Yang wrote:
> On 2022/7/14 16:00, Abel Wu wrote:
>>
>> On 7/14/22 3:15 PM, Yicong Yang Wrote:
>>> On 2022/7/14 14:58, Abel Wu wrote:
>>>>
>>>> On 7/14/22 2:19 PM, Yicong Yang Wrote:
>>>>> On 2022/7/12 16:20, Abel Wu wrote:
>>>>>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>>>>>> the LLC is overloaded. Since the overloaded status is checked
>>>>>> in the load balancing at LLC level, the interval is llc_size
>>>>>> milliseconds. The duration might be long enough to affect the
>>>>>> overall system throughput if idle cores are out of reach in
>>>>>> SIS domain scan.
>>>>>>
>>>>>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>>>>>> ---
>>>>>>    kernel/sched/fair.c | 15 +++++++++------
>>>>>>    1 file changed, 9 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index a78d2e3b9d49..cd758b3616bd 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>>>>        struct sched_domain *this_sd;
>>>>>>        u64 time = 0;
>>>>>>    -    this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>>>> -    if (!this_sd)
>>>>>> -        return -1;
>>>>>> -
>>>>>>        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>>>>    -    if (sched_feat(SIS_PROP) && !has_idle_core) {
>>>>>> +    if (has_idle_core)
>>>>>> +        goto scan;
>>>>>> +
>>>>>> +    if (sched_feat(SIS_PROP)) {
>>>>>>            u64 avg_cost, avg_idle, span_avg;
>>>>>>            unsigned long now = jiffies;
>>>>>>    +        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>>>> +        if (!this_sd)
>>>>>> +            return -1;
>>>>>> +
>>>>>
>>>>> I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
>>>>> commit. Does the position of this make any performance difference?
>>>>
>>>> No, this change doesn't make much difference to performance. Are
>>>> you suggesting that I should make this a separate patch?
>>>>
>>>
>>> It just makes me think that the dereference is unnecessary if this_cpu and target are
>>> located in the same LLC, since the sd has already been passed in. But since you noticed
>>> no difference it may have little effect. :)
>>>
>>
>> Hmm.. Not exactly. The sched-domains are cpu private

>>, and this_cpu can be in another LLC than target.

This is not the condition I meant; I would have thought it would make sense only when they're in
the same LLC. Anyway, since there's no difference whether we dereference or not, it doesn't matter at all.
Thanks for the explanation.

>> .
> 
> Yes, you're right. Sorry for getting this messed up.
> 
> .
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-07-14  6:58     ` Abel Wu
  2022-07-14  7:15       ` Yicong Yang
@ 2022-08-04  9:59       ` Chen Yu
  2022-08-15  2:54         ` Abel Wu
  1 sibling, 1 reply; 34+ messages in thread
From: Chen Yu @ 2022-08-04  9:59 UTC (permalink / raw)
  To: Abel Wu
  Cc: Yicong Yang, Peter Zijlstra, Mel Gorman, Vincent Guittot,
	Yicong Yang, Josh Don, Chen Yu, Linux Kernel Mailing List

On Thu, Jul 14, 2022 at 3:11 PM Abel Wu <wuyun.abel@bytedance.com> wrote:
>
>
> On 7/14/22 2:19 PM, Yicong Yang Wrote:
> > On 2022/7/12 16:20, Abel Wu wrote:
> >> When SIS_UTIL is enabled, SIS domain scan will be skipped if
> >> the LLC is overloaded. Since the overloaded status is checked
> >> in the load balancing at LLC level, the interval is llc_size
> >> milliseconds. The duration might be long enough to affect the
> >> overall system throughput if idle cores are out of reach in
> >> SIS domain scan.
> >>
> >> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> >> ---
> >>   kernel/sched/fair.c | 15 +++++++++------
> >>   1 file changed, 9 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index a78d2e3b9d49..cd758b3616bd 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >>      struct sched_domain *this_sd;
> >>      u64 time = 0;
> >>
> >> -    this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> >> -    if (!this_sd)
> >> -            return -1;
> >> -
> >>      cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >>
> >> -    if (sched_feat(SIS_PROP) && !has_idle_core) {
> >> +    if (has_idle_core)
> >> +            goto scan;
> >> +
> >> +    if (sched_feat(SIS_PROP)) {
> >>              u64 avg_cost, avg_idle, span_avg;
> >>              unsigned long now = jiffies;
> >>
> >> +            this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> >> +            if (!this_sd)
> >> +                    return -1;
> >> +
> >
> > I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
> > commit. Does the position of this make any performance difference?
>
> No, this change doesn't make much difference to performance. Are
> you suggesting that I should make this a separate patch?
>
I took a look at this patch again before I start an OLTP test. I
thought the repositioning of the sd_llc dereference might not be
closely connected with the current change, as Yicong mentioned.
Besides, after moving the dereference inside SIS_PROP, we might do
the cpumask_and() no matter whether sd_llc is valid or not, which
might be of extra cost?
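
To illustrate the ordering concern (a rough sketch of the flow after
this patch when no idle core is reported):

	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);	/* done unconditionally */

	if (has_idle_core)
		goto scan;

	if (sched_feat(SIS_PROP)) {
		this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
		if (!this_sd)
			return -1;	/* the cpumask_and() above was wasted */
		...
	}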

thanks,
Chenyu

> Thanks,
> Abel
>
> >
> > Thanks.
> >
> >>              /*
> >>               * If we're busy, the assumption that the last idle period
> >>               * predicts the future is flawed; age away the remaining
> >> @@ -6436,7 +6439,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >>                              return -1;
> >>              }
> >>      }
> >> -
> >> +scan:
> >>      for_each_cpu_wrap(cpu, cpus, target + 1) {
> >>              if (has_idle_core) {
> >>                      i = select_idle_core(p, cpu, cpus, &idle_cpu);
> >>



-- 
Thanks,
Chenyu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-07-12  8:20 ` [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core Abel Wu
  2022-07-13  3:47   ` Chen Yu
  2022-07-14  6:19   ` Yicong Yang
@ 2022-08-10 13:50   ` Chen Yu
  2022-08-15  2:44     ` Abel Wu
  2022-08-29 13:08   ` Mel Gorman
  3 siblings, 1 reply; 34+ messages in thread
From: Chen Yu @ 2022-08-10 13:50 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	Linux Kernel Mailing List

On Tue, Jul 12, 2022 at 4:45 PM Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> When SIS_UTIL is enabled, SIS domain scan will be skipped if
> the LLC is overloaded. Since the overloaded status is checked
> in the load balancing at LLC level, the interval is llc_size
> milliseconds. The duration might be long enough to affect the
> overall system throughput if idle cores are out of reach in
> SIS domain scan.
>
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>
Tested schbench and netperf on the latest 5.19 vanilla; it seems that
there is a latency improvement in schbench when the load is low, and
no performance difference on netperf.

 ./report.py -b 5.19.0+ -c 5.19.0-skip-sis-util+ -t schbench

schbench
========
case            load    baseline(std%) compare%( std%)
normal          mthread-1 1.00 (  0.00) +7.69 (  0.00)
normal          mthread-2 1.00 (  0.00) +13.24 (  0.00)
normal          mthread-4 1.00 (  0.00) -5.88 (  0.00)
normal          mthread-8 1.00 (  0.00) -0.25 (  0.00)


./report.py -b 5.19.0+ -c 5.19.0-skip-sis-util+ -t netperf
netperf
=======
case            load    baseline(std%) compare%( std%)
TCP_RR          thread-28 1.00 (  0.62) +0.15 (  0.55)
TCP_RR          thread-56 1.00 (  0.42) -0.26 (  0.40)
TCP_RR          thread-84 1.00 (  0.29) +0.39 (  0.29)
TCP_RR          thread-112 1.00 (  0.22) +0.44 (  0.23)
TCP_RR          thread-140 1.00 (  0.17) +0.33 (  0.18)
TCP_RR          thread-168 1.00 (  0.17) +0.19 (  0.16)
TCP_RR          thread-196 1.00 ( 13.65) -0.62 ( 14.83)
TCP_RR          thread-224 1.00 (  9.80) -0.65 (  9.67)
UDP_RR          thread-28 1.00 (  0.89) +0.92 (  0.81)
UDP_RR          thread-56 1.00 (  0.78) +0.38 (  0.73)
UDP_RR          thread-84 1.00 ( 14.03) +0.78 ( 16.85)
UDP_RR          thread-112 1.00 ( 12.26) -0.42 ( 11.95)
UDP_RR          thread-140 1.00 (  9.86) -0.89 (  6.93)
UDP_RR          thread-168 1.00 ( 11.62) -0.82 (  8.80)
UDP_RR          thread-196 1.00 ( 19.47) +0.42 ( 16.50)
UDP_RR          thread-224 1.00 ( 18.68) +0.72 ( 18.50)


Tested-by: Chen Yu <yu.c.chen@intel.com>

-- 
Thanks,
Chenyu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-08-10 13:50   ` Chen Yu
@ 2022-08-15  2:44     ` Abel Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Abel Wu @ 2022-08-15  2:44 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	Linux Kernel Mailing List

Hi Chen, thanks for your testing!

On 8/10/22 9:50 PM, Chen Yu Wrote:
> On Tue, Jul 12, 2022 at 4:45 PM Abel Wu <wuyun.abel@bytedance.com> wrote:
>>
>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>> the LLC is overloaded. Since the overloaded status is checked
>> in the load balancing at LLC level, the interval is llc_size
>> milliseconds. The duration might be long enough to affect the
>> overall system throughput if idle cores are out of reach in
>> SIS domain scan.
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>> ---
>>
> Tested schbench and netperf on the latest 5.19 vanilla; it seems that
> there is a latency improvement in schbench when the load is low, and
> no performance difference on netperf.
> 
>   ./report.py -b 5.19.0+ -c 5.19.0-skip-sis-util+ -t schbench
> 
> schbench
> ========
> case            load    baseline(std%) compare%( std%)
> normal          mthread-1 1.00 (  0.00) +7.69 (  0.00)
> normal          mthread-2 1.00 (  0.00) +13.24 (  0.00)
> normal          mthread-4 1.00 (  0.00) -5.88 (  0.00)
> normal          mthread-8 1.00 (  0.00) -0.25 (  0.00)
> 
> 
> ./report.py -b 5.19.0+ -c 5.19.0-skip-sis-util+ -t netperf
> netperf
> =======
> case            load    baseline(std%) compare%( std%)
> TCP_RR          thread-28 1.00 (  0.62) +0.15 (  0.55)
> TCP_RR          thread-56 1.00 (  0.42) -0.26 (  0.40)
> TCP_RR          thread-84 1.00 (  0.29) +0.39 (  0.29)
> TCP_RR          thread-112 1.00 (  0.22) +0.44 (  0.23)
> TCP_RR          thread-140 1.00 (  0.17) +0.33 (  0.18)
> TCP_RR          thread-168 1.00 (  0.17) +0.19 (  0.16)
> TCP_RR          thread-196 1.00 ( 13.65) -0.62 ( 14.83)
> TCP_RR          thread-224 1.00 (  9.80) -0.65 (  9.67)
> UDP_RR          thread-28 1.00 (  0.89) +0.92 (  0.81)
> UDP_RR          thread-56 1.00 (  0.78) +0.38 (  0.73)
> UDP_RR          thread-84 1.00 ( 14.03) +0.78 ( 16.85)
> UDP_RR          thread-112 1.00 ( 12.26) -0.42 ( 11.95)
> UDP_RR          thread-140 1.00 (  9.86) -0.89 (  6.93)
> UDP_RR          thread-168 1.00 ( 11.62) -0.82 (  8.80)
> UDP_RR          thread-196 1.00 ( 19.47) +0.42 ( 16.50)
> UDP_RR          thread-224 1.00 ( 18.68) +0.72 ( 18.50)
> 
> 
> Tested-by: Chen Yu <yu.c.chen@intel.com>
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-08-04  9:59       ` Chen Yu
@ 2022-08-15  2:54         ` Abel Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Abel Wu @ 2022-08-15  2:54 UTC (permalink / raw)
  To: Chen Yu
  Cc: Yicong Yang, Peter Zijlstra, Mel Gorman, Vincent Guittot,
	Yicong Yang, Josh Don, Chen Yu, Linux Kernel Mailing List

On 8/4/22 5:59 PM, Chen Yu Wrote:
> On Thu, Jul 14, 2022 at 3:11 PM Abel Wu <wuyun.abel@bytedance.com> wrote:
>>
>>
>> On 7/14/22 2:19 PM, Yicong Yang Wrote:
>>> On 2022/7/12 16:20, Abel Wu wrote:
>>>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>>>> the LLC is overloaded. Since the overloaded status is checked
>>>> in the load balancing at LLC level, the interval is llc_size
>>>> milliseconds. The duration might be long enough to affect the
>>>> overall system throughput if idle cores are out of reach in
>>>> SIS domain scan.
>>>>
>>>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>>>> ---
>>>>    kernel/sched/fair.c | 15 +++++++++------
>>>>    1 file changed, 9 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index a78d2e3b9d49..cd758b3616bd 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6392,16 +6392,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>>       struct sched_domain *this_sd;
>>>>       u64 time = 0;
>>>>
>>>> -    this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>> -    if (!this_sd)
>>>> -            return -1;
>>>> -
>>>>       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>>
>>>> -    if (sched_feat(SIS_PROP) && !has_idle_core) {
>>>> +    if (has_idle_core)
>>>> +            goto scan;
>>>> +
>>>> +    if (sched_feat(SIS_PROP)) {
>>>>               u64 avg_cost, avg_idle, span_avg;
>>>>               unsigned long now = jiffies;
>>>>
>>>> +            this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>>> +            if (!this_sd)
>>>> +                    return -1;
>>>> +
>>>
>>> I don't follow the change here. True that this_sd is used only in SIS_PROP, but it seems irrelevant with your
>>> commit. Does the position of this make any performance difference?
>>
>> No, this change doesn't make much difference to performance. Are
>> you suggesting that I should make this a separate patch?
>>
> I took a look at this patch again before I start an OLTP test. I
> thought the repositioning of the sd_llc dereference might not be
> closely connected with the current change, as Yicong mentioned.

OK, I will make it a separate patch. But before that I'd prefer to wait
for more comments :)

> Besides, after moving the dereference inside SIS_PROP, we might do
> the cpumask_and() no matter whether sd_llc is valid or not, which
> might be of extra cost?
> 
I think it might be irrelevant whether the local sd_llc is valid or
not, since all we care about is the target's sd_llc if !SIS_PROP.

Best Regards,
Abel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/5] sched/fair: SIS improvements and cleanups
  2022-07-12  8:20 [PATCH 0/5] sched/fair: SIS improvements and cleanups Abel Wu
                   ` (4 preceding siblings ...)
  2022-07-12  8:20 ` [PATCH 5/5] sched/fair: remove useless check in select_idle_core Abel Wu
@ 2022-08-15 13:31 ` Abel Wu
  5 siblings, 0 replies; 34+ messages in thread
From: Abel Wu @ 2022-08-15 13:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Don, Chen Yu, linux-kernel, Mel Gorman, Vincent Guittot

Hi Peter, it would be appreciated if you could offer some advice!

Thanks & Best,
Abel

On 7/12/22 4:20 PM, Abel Wu Wrote:
> The first patch ignores the new sched_feat SIS_UTIL when
> there are idle cores available, to make the fullest possible
> use of the cpus.
> 
> The other patches are split out from the SIS filter patchset [1],
> since they are independent of the filter.
> 
> [1] https://lore.kernel.org/lkml/20220619120451.95251-1-wuyun.abel@bytedance.com/
> 
> Abel Wu (5):
>    sched/fair: ignore SIS_UTIL when has idle core
>    sched/fair: default to false in test_idle_cores
>    sched/fair: remove redundant check in select_idle_smt
>    sched/fair: avoid double search on same cpu
>    sched/fair: remove useless check in select_idle_core
> 
>   kernel/sched/fair.c | 45 ++++++++++++++++++++++-----------------------
>   1 file changed, 22 insertions(+), 23 deletions(-)
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/5] sched/fair: default to false in test_idle_cores
  2022-07-12  8:20 ` [PATCH 2/5] sched/fair: default to false in test_idle_cores Abel Wu
@ 2022-08-29 12:36   ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2022-08-29 12:36 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	linux-kernel

On Tue, Jul 12, 2022 at 04:20:33PM +0800, Abel Wu wrote:
> It's uncertain whether idle cores exist or not if the shared
> sched-domains are not ready, so returning "no idle cores"
> usually makes sense.
> 
> __update_idle_core() is an exception: it checks the status of
> this core and sets the flag in the shared sched-domain if
> necessary. So the whole logic depends on the existence of sds,
> and can bail out early if !sds.
> 
> It's somewhat tricky, and as Josh suggested, this should be
> transient while the domain isn't ready. So remove the
> caller-supplied default value to make things clearer, while
> introducing only a little overhead in the idle path.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> Reviewed-by: Josh Don <joshdon@google.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] sched/fair: remove redundant check in select_idle_smt
  2022-07-12  8:20 ` [PATCH 3/5] sched/fair: remove redundant check in select_idle_smt Abel Wu
@ 2022-08-29 12:36   ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2022-08-29 12:36 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	linux-kernel

On Tue, Jul 12, 2022 at 04:20:34PM +0800, Abel Wu wrote:
> If two cpus share LLC cache, then the two cores they belong to
> are also in the same LLC domain.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> Reviewed-by: Josh Don <joshdon@google.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/5] sched/fair: avoid double search on same cpu
  2022-07-12  8:20 ` [PATCH 4/5] sched/fair: avoid double search on same cpu Abel Wu
@ 2022-08-29 12:36   ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2022-08-29 12:36 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	linux-kernel

On Tue, Jul 12, 2022 at 04:20:35PM +0800, Abel Wu wrote:
> The prev cpu is checked at the beginning of SIS, and it's unlikely
> to be idle before the second check in select_idle_smt(). So we'd
> better focus on its SMT siblings.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> Reviewed-by: Josh Don <joshdon@google.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] sched/fair: remove useless check in select_idle_core
  2022-07-12  8:20 ` [PATCH 5/5] sched/fair: remove useless check in select_idle_core Abel Wu
@ 2022-08-29 12:37   ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2022-08-29 12:37 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	linux-kernel

On Tue, Jul 12, 2022 at 04:20:36PM +0800, Abel Wu wrote:
> The function only gets called when sds->has_idle_cores is true
> which can be possible only when sched_smt_present is enabled.
> This change also aligns select_idle_core with select_idle_smt
> that the caller do the check if necessary.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-07-12  8:20 ` [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core Abel Wu
                     ` (2 preceding siblings ...)
  2022-08-10 13:50   ` Chen Yu
@ 2022-08-29 13:08   ` Mel Gorman
  2022-08-29 14:11     ` Abel Wu
  3 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2022-08-29 13:08 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	linux-kernel

On Tue, Jul 12, 2022 at 04:20:32PM +0800, Abel Wu wrote:
> When SIS_UTIL is enabled, SIS domain scan will be skipped if
> the LLC is overloaded. Since the overloaded status is checked
> in the load balancing at LLC level, the interval is llc_size
> miliseconds. The duration might be long enough to affect the
> overall system throughput if idle cores are out of reach in
> SIS domain scan.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>

Split this patch to move the this_sd lookup into the SIS_PROP section.

Otherwise, this is the most controversial patch in the series and the
most likely to cause problems where it wins on some machines and
workloads and loses on others.

The corner case to worry about is a workload that is rapidly idling and
the has_idle_core hint is often wrong. This can happen with hackbench for
example, or at least this was true when I last looked which is quite some
time ago. If this hint is often wrong, then there will be full scan cost
incurred regardless of SIS_UTIL that often fails to find a core.

So, first suggestion is to move this patch to the end of the series as
the other patches are relatively harmless. They could even be merged in
isolation as a cleanup.

Second, using the other patches as your baseline, include in the
changelog what you tested that showed a benefit, what type of machine
it was and in particular include the number of cores, nodes and the
span of the LLC.  If you measured any regressions, include that as well
and make a call on whether you think the patch wins more than it loses.
The reason to include that information is because the worst corner case
(all CPUs scanned uselessly) costs more the larger the span of LLC is.
If all the testing was done on a 2-core SMT-2 machine, the overhead of the
patch would be negligible but very different if the LLC span is 20 cores.
While the patch is not obviously wrong, it definitely needs better data,
Even if you do not have a large test machine available, it's still helpful
to have it in the changelog because a reviewer like me can decide "this
needs testing on a larger machine".

I did queue up the entire series for testing and while it sometimes
benefitted, there were large counter-examples. tbench4 on Zen3, which has
multiple small L3 caches per node, showed some large regressions (e.g. 41%
loss on 64 clients with a massive increase in variability). tbench4
(slightly modified in mmtests to produce a usable metric) in general showed
issues across multiple x86-64 machines, both AMD and Intel and multiple
generations, with a noticeable increase in system CPU usage when the client
counts reach the stage where the machine is close to saturated. perfpipe
for some weird reason showed a large apparent regression on Broadwell but
the variability was also crazy so it can probably be ignored. hackbench
overall looked ok, so it's possible I'm wrong about the idle_cores hint
being often wrong on that workload and I should double check that. It's
possible the hint is wrong some of the time but right often enough to
either benefit from using an idle core or by finding an idle sibling, which
may be preferable to stacking tasks on the same CPU.

The lack of data and the lack of a note on the potential downside are the
main reasons why I'm not acking the patch. tbench4 was a particular concern
in my own tests and it's possible a better patch would be a hybrid approach
where a limited search of two cores (excluding target) is allowed even if
SIS_UTIL indicates overload but has_idle_cores is true.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-08-29 13:08   ` Mel Gorman
@ 2022-08-29 14:11     ` Abel Wu
  2022-08-29 14:56       ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Abel Wu @ 2022-08-29 14:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	linux-kernel

Hi Mel, thanks for reviewing!

On 8/29/22 9:08 PM, Mel Gorman Wrote:
> On Tue, Jul 12, 2022 at 04:20:32PM +0800, Abel Wu wrote:
>> When SIS_UTIL is enabled, SIS domain scan will be skipped if
>> the LLC is overloaded. Since the overloaded status is checked
>> in the load balancing at LLC level, the interval is llc_size
>> miliseconds. The duration might be long enough to affect the
>> overall system throughput if idle cores are out of reach in
>> SIS domain scan.
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> 
> Split this patch to move the this_sd lookup into the SIS_PROP section.

OK

> 
> Otherwise, this is the most controversial patch in the series and the
> most likely to cause problems where it wins on some machines and
> workloads and loses on others.
> 
> The corner case to worry about is a workload that is rapidly idling and
> the has_idle_core hint is often wrong. This can happen with hackbench for
> example, or at least this was true when I last looked which is quite some
> time ago. If this hint is often wrong, then there will be full scan cost
> incurred regardless of SIS_UTIL that often fails to find a core.

Yes, I can't agree more. And the situation can become worse as the LLC
gets bigger, as you mentioned below. I will exclude this part from v2.

> 
> So, first suggestion is to move this patch to the end of the series as
> the other patches are relatively harmless. They could even be merged in
> isolation as a cleanup.
> 
> Second, using the other patches as your baseline, include in the
> changelog what you tested that showed a benefit, what type of machine
> it was and in particular include the number of cores, nodes and the
> span of the LLC.  If you measured any regressions, include that as well
> and make a call on whether you think the patch wins more than it loses.
> The reason to include that information is because the worst corner case
> (all CPUs scanned uselessly) costs more the larger the span of LLC is.
> If all the testing was done on a 2-core SMT-2 machine, the overhead of the
> patch would be negligible but very different if the LLC span is 20 cores.
> While the patch is not obviously wrong, it definitely needs better data,
> Even if you do not have a large test machine available, it's still helpful
> to have it in the changelog because a reviewer like me can decide "this
> needs testing on a larger machine".

Thanks for your detailed suggestions. I will attach benchmark results
along with some analysis next time when posting performance related
patches.

> 
> I did queue this up the entire series for testing and while it sometimes
> benefitted, there were large counter-examples. tbench4 on Zen3 showed some
> large regressions (e.g. 41% loss on 64 clients with a massive increase in
> variability) which has multiple small L3 caches per node. tbench4 (slightly
> modified in mmtests to produce a usable metric) in general showed issues
> across multiple x86-64 machines both AMD and Intel, multiple generations
> with a noticable increase in system CPU usage when the client counts reach
> the stage where the machine is close to saturated. perfpipe for some
> weird reason showed a large regression apparent regresion on Broadwell
> but the variability was also crazy so probably can be ignored. hackbench
> overall looked ok so it's possible I'm wrong about the idle_cores hint
> being often wrong on that workload and I should double check that. It's
> possible the hint is wrong some of the times but right often enough to
> either benefit from using an idle core or by finding an idle sibling which
> may be preferable to stacking tasks on the same CPU.

I attached my test results in one of my replies[1]: netperf showed ~3.5%
regression, hackbench improved a lot, and tbench4 was neutral. I tested
several times and the results didn't seem to vary much.

> 
> The lack of data and the lack of a note on the potential downside is the
> main reason why I'm not acking patch. tbench4 was a particular concern on
> my own tests and it's possible a better patch would be a hybrid approach
> where a limited search of two cores (excluding target) is allowed even if
> SIS_UTIL indicates overload but has_idle_cores is true.
> 

Agreed. And the reason I will exclude this part from v2 is that I plan to
make it part of another feature, the SIS filter [2]. The latest version of
the SIS filter (not posted yet but soon) will contain all the idle cpus, so
we won't need a full scan when has_idle_core: scanning the filter is
enough. It may still cost too much if there are too many false-positive
bits in the filter, though. Does this direction make sense to you?
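
Just to illustrate the direction (a rough sketch of the idea only, not
the actual filter patch; the "idle_cpus_mask" field is made up here),
scanning such a per-LLC idle mask could look roughly like:

/*
 * Sketch: walk only the CPUs the filter believes are idle instead of
 * the whole LLC span. Bits can be stale, so each candidate is still
 * verified before being returned.
 */
static int select_idle_cpu_filtered(struct task_struct *p,
				    struct sched_domain_shared *sds,
				    int target)
{
	int cpu;

	for_each_cpu_wrap(cpu, sds->idle_cpus_mask, target + 1) {
		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
			continue;
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			return cpu;
		/* stale bit (false positive): keep scanning the filter */
	}

	return -1;
}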

[1] 
https://lore.kernel.org/lkml/eaa543fa-421d-2194-be94-6a5e24a33b37@bytedance.com/
[2] 
https://lore.kernel.org/lkml/20220619120451.95251-1-wuyun.abel@bytedance.com/

Thanks & Best Regards,
Abel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-08-29 14:11     ` Abel Wu
@ 2022-08-29 14:56       ` Mel Gorman
  2022-09-01 13:08         ` Abel Wu
  2022-09-02  4:12         ` Abel Wu
  0 siblings, 2 replies; 34+ messages in thread
From: Mel Gorman @ 2022-08-29 14:56 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	linux-kernel

On Mon, Aug 29, 2022 at 10:11:39PM +0800, Abel Wu wrote:
> Hi Mel, thanks for reviewing!
> 

No problem, sorry it took so long.

> > So, first suggestion is to move this patch to the end of the series as
> > the other patches are relatively harmless. They could even be merged in
> > isolation as a cleanup.
> > 
> > Second, using the other patches as your baseline, include in the
> > changelog what you tested that showed a benefit, what type of machine
> > it was and in particular include the number of cores, nodes and the
> > span of the LLC.  If you measured any regressions, include that as well
> > and make a call on whether you think the patch wins more than it loses.
> > The reason to include that information is because the worst corner case
> > (all CPUs scanned uselessly) costs more the larger the span of LLC is.
> > If all the testing was done on a 2-core SMT-2 machine, the overhead of the
> > patch would be negligible but very different if the LLC span is 20 cores.
> > While the patch is not obviously wrong, it definitely needs better data,
> > Even if you do not have a large test machine available, it's still helpful
> > to have it in the changelog because a reviewer like me can decide "this
> > needs testing on a larger machine".
> 
> Thanks for your detailed suggestions. I will attach benchmark results
> along with some analysis next time when posting performance related
> patches.
> 

Thanks, include this in the changelog. While I had different figures for
hackbench, the figures are still fine. I had similar figures for netperf
(~3-4% regression on some machines but not universal). The tbench figures
are interesting because for you they were mostly neutral, but I tested on
a Cascade Lake machine and had worse results, and that machine is smaller
in terms of core count than yours.

> > 
> > I did queue this up the entire series for testing and while it sometimes
> > benefitted, there were large counter-examples. tbench4 on Zen3 showed some
> > large regressions (e.g. 41% loss on 64 clients with a massive increase in
> > variability) which has multiple small L3 caches per node. tbench4 (slightly
> > modified in mmtests to produce a usable metric) in general showed issues
> > across multiple x86-64 machines both AMD and Intel, multiple generations
> > with a noticable increase in system CPU usage when the client counts reach
> > the stage where the machine is close to saturated. perfpipe for some
> > weird reason showed a large regression apparent regresion on Broadwell
> > but the variability was also crazy so probably can be ignored. hackbench
> > overall looked ok so it's possible I'm wrong about the idle_cores hint
> > being often wrong on that workload and I should double check that. It's
> > possible the hint is wrong some of the times but right often enough to
> > either benefit from using an idle core or by finding an idle sibling which
> > may be preferable to stacking tasks on the same CPU.
> 
> I attached my test result in one of my replies[1]: netperf showed ~3.5%
> regression, hackbench improved a lot, and tbench4 drew. I tested several
> times and the results didn't seem vary much.
> 
> > 
> > The lack of data and the lack of a note on the potential downside is the
> > main reason why I'm not acking patch. tbench4 was a particular concern on
> > my own tests and it's possible a better patch would be a hybrid approach
> > where a limited search of two cores (excluding target) is allowed even if
> > SIS_UTIL indicates overload but has_idle_cores is true.
> > 
> 
> Agreed. And the reason I will exclude this part in v2 is that I plan to
> make it part of another feature, SIS filter [2]. The latest version of
> SIS filter (not posted yet but soon) will contain all the idle cpus so
> we don't need a full scan when has_idle_core. Scanning the filter then
> is enough. While it may still cost too much if too many false positive
> bits in the filter. Does this direction make sense to you?
> 
> [1] https://lore.kernel.org/lkml/eaa543fa-421d-2194-be94-6a5e24a33b37@bytedance.com/
> [2]
> https://lore.kernel.org/lkml/20220619120451.95251-1-wuyun.abel@bytedance.com/
> 

The filter idea is tricky; it will always struggle with "accurate
information" vs "cost of cpumask update", and what you link indicates that
it's updated on the tick boundary. That will be relatively cheap but could
mean that searches near the point of saturation or overload will have
false positives and negatives, which you are already aware of given the
older series. It does not kill the idea but I would strongly suggest that
you do the simple thing first. That would yield potentially 3-4 patches
at the end of the series:

1. Full scan for cores (this patch effectively)
2. Limited scan of 2 cores when SIS_UTIL cuts off but has_idle_cores is true
   (Compare 1 vs 2 in the changelog, not expected to be a universal win but
    should have better "worst case" behaviour to be worth merging)
3. Filtered scan tracking idle CPUs as a hint while removing the
   "limited scan"
   (Compare 2 vs 3)
4. The SIS "de-entrophy" patch that tries to cope with false positives
   and negatives
   (Compare 3 vs 4)

That way if the later patches do not work as expected then they can be
easily dropped without affecting the rest of the series.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-08-29 14:56       ` Mel Gorman
@ 2022-09-01 13:08         ` Abel Wu
  2022-09-02  4:12         ` Abel Wu
  1 sibling, 0 replies; 34+ messages in thread
From: Abel Wu @ 2022-09-01 13:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	linux-kernel

On 8/29/22 10:56 PM, Mel Gorman Wrote:
> On Mon, Aug 29, 2022 at 10:11:39PM +0800, Abel Wu wrote:
>> Hi Mel, thanks for reviewing!
>>
> 
> No problem, sorry it took so long.
> 
>>> So, first suggestion is to move this patch to the end of the series as
>>> the other patches are relatively harmless. They could even be merged in
>>> isolation as a cleanup.
>>>
>>> Second, using the other patches as your baseline, include in the
>>> changelog what you tested that showed a benefit, what type of machine
>>> it was and in particular include the number of cores, nodes and the
>>> span of the LLC.  If you measured any regressions, include that as well
>>> and make a call on whether you think the patch wins more than it loses.
>>> The reason to include that information is because the worst corner case
>>> (all CPUs scanned uselessly) costs more the larger the span of LLC is.
>>> If all the testing was done on a 2-core SMT-2 machine, the overhead of the
>>> patch would be negligible but very different if the LLC span is 20 cores.
>>> While the patch is not obviously wrong, it definitely needs better data,
>>> Even if you do not have a large test machine available, it's still helpful
>>> to have it in the changelog because a reviewer like me can decide "this
>>> needs testing on a larger machine".
>>
>> Thanks for your detailed suggestions. I will attach benchmark results
>> along with some analysis next time when posting performance related
>> patches.
>>
> 
> Thanks, include this in the changelog. While I had different figures for
> hackbench, the figures are still fine. I had similar figures for netperf
> (~3-4% regression on some machines but not universal). The tbench figures
> are interesting because for you, it was mostly neutral but I did test
> it with a CascadeLake machine and had worse results and that machine is
> smaller in terms of core counts than yours.

Interesting, my test machine model is Intel(R) Xeon(R) Platinum 8260 CPU
@ 2.40GHz, which is based on Cascade Lake. This time I re-tested with
SNC enabled, so each NUMA node now has 12 cores (was 24), but the tbench
result is still neutral.

I just remembered that the Conclusion part of the SIS filter v2 patchset
mentioned a suspicious 50%~90% improvement in the netperf & tbench tests.
It seems those results can be misleading under certain conditions.

https://lore.kernel.org/lkml/20220409135104.3733193-1-wuyun.abel@bytedance.com/

> 
>>>
>>> I did queue this up the entire series for testing and while it sometimes
>>> benefitted, there were large counter-examples. tbench4 on Zen3 showed some
>>> large regressions (e.g. 41% loss on 64 clients with a massive increase in
>>> variability) which has multiple small L3 caches per node. tbench4 (slightly
>>> modified in mmtests to produce a usable metric) in general showed issues
>>> across multiple x86-64 machines both AMD and Intel, multiple generations
>>> with a noticable increase in system CPU usage when the client counts reach
>>> the stage where the machine is close to saturated. perfpipe for some
>>> weird reason showed a large regression apparent regresion on Broadwell
>>> but the variability was also crazy so probably can be ignored. hackbench
>>> overall looked ok so it's possible I'm wrong about the idle_cores hint
>>> being often wrong on that workload and I should double check that. It's
>>> possible the hint is wrong some of the times but right often enough to
>>> either benefit from using an idle core or by finding an idle sibling which
>>> may be preferable to stacking tasks on the same CPU.
>>
>> I attached my test result in one of my replies[1]: netperf showed ~3.5%
>> regression, hackbench improved a lot, and tbench4 drew. I tested several
>> times and the results didn't seem vary much.
>>
>>>
>>> The lack of data and the lack of a note on the potential downside is the
>>> main reason why I'm not acking patch. tbench4 was a particular concern on
>>> my own tests and it's possible a better patch would be a hybrid approach
>>> where a limited search of two cores (excluding target) is allowed even if
>>> SIS_UTIL indicates overload but has_idle_cores is true.
>>>
>>
>> Agreed. And the reason I will exclude this part in v2 is that I plan to
>> make it part of another feature, SIS filter [2]. The latest version of
>> SIS filter (not posted yet but soon) will contain all the idle cpus so
>> we don't need a full scan when has_idle_core. Scanning the filter then
>> is enough. While it may still cost too much if too many false positive
>> bits in the filter. Does this direction make sense to you?
>>
>> [1] https://lore.kernel.org/lkml/eaa543fa-421d-2194-be94-6a5e24a33b37@bytedance.com/
>> [2]
>> https://lore.kernel.org/lkml/20220619120451.95251-1-wuyun.abel@bytedance.com/
>>
> 
> The filter idea is tricky, it will always struggle with "accurate
> information" vs "cost of cpumask update" and what you link indicates that
> it's updated on the tick boundary. That will be relatively cheap but could
> mean that searches near the point of saturation or overload will have
> false positives and negatives which you are already aware of given the

I didn't exclude the CPU_NEWLY_IDLE case, yet the filter still worked in
an unexpected way, as I found recently. The propagation of the idle or
sched-idle cpus can be skipped if rq->avg_idle is small enough (smaller
than sysctl_sched_migration_cost, 500us by default). This can result in
benchmarks with frequent wakeups losing some throughput.

So the filter in my latest version will be updated at __update_idle_core
rather than in load_balance, and reset when a full domain scan fails. I
will include more details in the changelog when posting, and I hope you
can give some advice then.
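
Roughly (again only a sketch with a made-up per-LLC "idle_cpus_mask"
field, not the code I will post), the update would piggy-back on
__update_idle_core() like this:

/*
 * Sketch: when a CPU enters idle, flag it in the hypothetical per-LLC
 * idle mask alongside the existing has_idle_cores update. The mask is
 * reset elsewhere when a full domain scan fails.
 */
void __update_idle_core(struct rq *rq)
{
	int core = cpu_of(rq);
	struct sched_domain_shared *sds;
	int cpu;

	rcu_read_lock();
	sds = rcu_dereference(per_cpu(sd_llc_shared, core));
	if (!sds)
		goto unlock;

	/* hypothetical filter update: this CPU is idle now */
	cpumask_set_cpu(core, sds->idle_cpus_mask);

	if (READ_ONCE(sds->has_idle_cores))
		goto unlock;

	for_each_cpu(cpu, cpu_smt_mask(core)) {
		if (cpu == core)
			continue;
		if (!available_idle_cpu(cpu))
			goto unlock;
	}

	WRITE_ONCE(sds->has_idle_cores, 1);
unlock:
	rcu_read_unlock();
}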

> older series. It does not kill the idea but I would strongly suggest that
> you do the simple thing first. That would yield potentially 3-4 patches
> at the end of the series
> 
> 1. Full scan for cores (this patch effectively)
> 2. Limited scan of 2 cores when SIS_UTIL cuts off but has_idle_cores is true
>     (Compare 1 vs 2 in the changelog, not expected to be a universal win but
>      should have better "worst case" behaviour to be worth merging)
> 3. Filtered scan tracking idle CPUs as a hint while removing the
>     "limited scan"
>     (Compare 2 vs 3)
> 4. The SIS "de-entrophy" patch that tries to cope with false positives
>     and negatives
>     (Compare 3 vs 4)
> 
> That way if the later patches do not work as expected then they can be
> easily dropped without affecting the rest of the series.
> 

Copy. Thanks for your advice!

Best Regards,
Abel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-08-29 14:56       ` Mel Gorman
  2022-09-01 13:08         ` Abel Wu
@ 2022-09-02  4:12         ` Abel Wu
  2022-09-02 10:25           ` Mel Gorman
  1 sibling, 1 reply; 34+ messages in thread
From: Abel Wu @ 2022-09-02  4:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	linux-kernel

On 8/29/22 10:56 PM, Mel Gorman Wrote:
> On Mon, Aug 29, 2022 at 10:11:39PM +0800, Abel Wu wrote:
>> Hi Mel, thanks for reviewing!
>>
> 
> No problem, sorry it took so long.
> 
>>> So, first suggestion is to move this patch to the end of the series as
>>> the other patches are relatively harmless. They could even be merged in
>>> isolation as a cleanup.
>>>
>>> Second, using the other patches as your baseline, include in the
>>> changelog what you tested that showed a benefit, what type of machine
>>> it was and in particular include the number of cores, nodes and the
>>> span of the LLC.  If you measured any regressions, include that as well
>>> and make a call on whether you think the patch wins more than it loses.
>>> The reason to include that information is because the worst corner case
>>> (all CPUs scanned uselessly) costs more the larger the span of LLC is.
>>> If all the testing was done on a 2-core SMT-2 machine, the overhead of the
>>> patch would be negligible but very different if the LLC span is 20 cores.
>>> While the patch is not obviously wrong, it definitely needs better data,
>>> Even if you do not have a large test machine available, it's still helpful
>>> to have it in the changelog because a reviewer like me can decide "this
>>> needs testing on a larger machine".
>>
>> Thanks for your detailed suggestions. I will attach benchmark results
>> along with some analysis next time when posting performance related
>> patches.
>>
> 
> Thanks, include this in the changelog. While I had different figures for
> hackbench, the figures are still fine. I had similar figures for netperf
> (~3-4% regression on some machines but not universal). The tbench figures
> are interesting because for you, it was mostly neutral but I did test
> it with a CascadeLake machine and had worse results and that machine is
> smaller in terms of core counts than yours.
> 
>>>
>>> I did queue this up the entire series for testing and while it sometimes
>>> benefitted, there were large counter-examples. tbench4 on Zen3 showed some
>>> large regressions (e.g. 41% loss on 64 clients with a massive increase in
>>> variability) which has multiple small L3 caches per node. tbench4 (slightly
>>> modified in mmtests to produce a usable metric) in general showed issues
>>> across multiple x86-64 machines both AMD and Intel, multiple generations
>>> with a noticable increase in system CPU usage when the client counts reach
>>> the stage where the machine is close to saturated. perfpipe for some
>>> weird reason showed a large regression apparent regresion on Broadwell
>>> but the variability was also crazy so probably can be ignored. hackbench
>>> overall looked ok so it's possible I'm wrong about the idle_cores hint
>>> being often wrong on that workload and I should double check that. It's
>>> possible the hint is wrong some of the times but right often enough to
>>> either benefit from using an idle core or by finding an idle sibling which
>>> may be preferable to stacking tasks on the same CPU.
>>
>> I attached my test result in one of my replies[1]: netperf showed ~3.5%
>> regression, hackbench improved a lot, and tbench4 drew. I tested several
>> times and the results didn't seem vary much.
>>
>>>
>>> The lack of data and the lack of a note on the potential downside is the
>>> main reason why I'm not acking patch. tbench4 was a particular concern on
>>> my own tests and it's possible a better patch would be a hybrid approach
>>> where a limited search of two cores (excluding target) is allowed even if
>>> SIS_UTIL indicates overload but has_idle_cores is true.
>>>
>>
>> Agreed. And the reason I will exclude this part in v2 is that I plan to
>> make it part of another feature, SIS filter [2]. The latest version of
>> SIS filter (not posted yet but soon) will contain all the idle cpus so
>> we don't need a full scan when has_idle_core. Scanning the filter then
>> is enough. While it may still cost too much if too many false positive
>> bits in the filter. Does this direction make sense to you?
>>
>> [1] https://lore.kernel.org/lkml/eaa543fa-421d-2194-be94-6a5e24a33b37@bytedance.com/
>> [2]
>> https://lore.kernel.org/lkml/20220619120451.95251-1-wuyun.abel@bytedance.com/
>>
> 
> The filter idea is tricky, it will always struggle with "accurate
> information" vs "cost of cpumask update" and what you link indicates that
> it's updated on the tick boundary. That will be relatively cheap but could
> mean that searches near the point of saturation or overload will have
> false positives and negatives which you are already aware of given the
> older series. It does not kill the idea but I would strongly suggest that
> you do the simple thing first. That would yield potentially 3-4 patches
> at the end of the series
> 
> 1. Full scan for cores (this patch effectively)
> 2. Limited scan of 2 cores when SIS_UTIL cuts off but has_idle_cores is true
>     (Compare 1 vs 2 in the changelog, not expected to be a universal win but
>      should have better "worst case" behaviour to be worth merging)

I'm afraid I had a hard time with this part, and it would be nice if you
could shed some light on it...

Currently the mainline behavior when has_idle_core:

	nr_idle_scan	0	1	2	...
	nr		0	max	max	...

while this patch makes:

	nr_idle_scan	0	1	2	...
	nr		max	max	max	...

and if limit scan:

	nr_idle_scan	0	1	2	...
	nr		2	2	max	...
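
To make the comparison concrete, the three tables above (has_idle_core
case only) correspond to picking the scan depth roughly like this
(illustrative helper only, names made up):

/*
 * Sketch: scan depth ("nr") chosen by each variant, given nr_idle_scan
 * from SIS_UTIL. 0 means no scan at all, llc_span means a full scan.
 */
static unsigned int sis_core_scan_depth(unsigned int nr_idle_scan,
					unsigned int llc_span, int variant)
{
	switch (variant) {
	case 0: /* mainline:    0, max, max, ... */
		return nr_idle_scan ? llc_span : 0;
	case 1: /* this patch:  max, max, max, ... */
		return llc_span;
	case 2: /* limited:     2, 2, max, ... */
		return nr_idle_scan > 1 ? llc_span : 2;
	}
	return 0;
}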

Anyway, the scan depth in the has_idle_core case is not well aligned to
the load. And honestly I'm not sure whether the depth should align to
the load or not, since a lower depth can easily render sds->has_idle_core
useless. Thoughts?

Thanks,
Abel

> 3. Filtered scan tracking idle CPUs as a hint while removing the
>     "limited scan"
>     (Compare 2 vs 3)
> 4. The SIS "de-entrophy" patch that tries to cope with false positives
>     and negatives
>     (Compare 3 vs 4)
> 
> That way if the later patches do not work as expected then they can be
> easily dropped without affecting the rest of the series.
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-09-02  4:12         ` Abel Wu
@ 2022-09-02 10:25           ` Mel Gorman
  2022-09-05 14:40             ` Abel Wu
  0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2022-09-02 10:25 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	linux-kernel

On Fri, Sep 02, 2022 at 12:12:31PM +0800, Abel Wu wrote:
> On 8/29/22 10:56 PM, Mel Gorman Wrote:
> > On Mon, Aug 29, 2022 at 10:11:39PM +0800, Abel Wu wrote:
> > > Hi Mel, thanks for reviewing!
> > > 
> > 
> > No problem, sorry it took so long.
> > 
> > > > So, first suggestion is to move this patch to the end of the series as
> > > > the other patches are relatively harmless. They could even be merged in
> > > > isolation as a cleanup.
> > > > 
> > > > Second, using the other patches as your baseline, include in the
> > > > changelog what you tested that showed a benefit, what type of machine
> > > > it was and in particular include the number of cores, nodes and the
> > > > span of the LLC.  If you measured any regressions, include that as well
> > > > and make a call on whether you think the patch wins more than it loses.
> > > > The reason to include that information is because the worst corner case
> > > > (all CPUs scanned uselessly) costs more the larger the span of LLC is.
> > > > If all the testing was done on a 2-core SMT-2 machine, the overhead of the
> > > > patch would be negligible but very different if the LLC span is 20 cores.
> > > > While the patch is not obviously wrong, it definitely needs better data,
> > > > Even if you do not have a large test machine available, it's still helpful
> > > > to have it in the changelog because a reviewer like me can decide "this
> > > > needs testing on a larger machine".
> > > 
> > > Thanks for your detailed suggestions. I will attach benchmark results
> > > along with some analysis next time when posting performance related
> > > patches.
> > > 
> > 
> > Thanks, include this in the changelog. While I had different figures for
> > hackbench, the figures are still fine. I had similar figures for netperf
> > (~3-4% regression on some machines but not universal). The tbench figures
> > are interesting because for you, it was mostly neutral but I did test
> > it with a CascadeLake machine and had worse results and that machine is
> > smaller in terms of core counts than yours.
> > 
> > > > 
> > > > I did queue this up the entire series for testing and while it sometimes
> > > > benefitted, there were large counter-examples. tbench4 on Zen3 showed some
> > > > large regressions (e.g. 41% loss on 64 clients with a massive increase in
> > > > variability) which has multiple small L3 caches per node. tbench4 (slightly
> > > > modified in mmtests to produce a usable metric) in general showed issues
> > > > across multiple x86-64 machines both AMD and Intel, multiple generations
> > > > with a noticable increase in system CPU usage when the client counts reach
> > > > the stage where the machine is close to saturated. perfpipe for some
> > > > weird reason showed a large regression apparent regresion on Broadwell
> > > > but the variability was also crazy so probably can be ignored. hackbench
> > > > overall looked ok so it's possible I'm wrong about the idle_cores hint
> > > > being often wrong on that workload and I should double check that. It's
> > > > possible the hint is wrong some of the times but right often enough to
> > > > either benefit from using an idle core or by finding an idle sibling which
> > > > may be preferable to stacking tasks on the same CPU.
> > > 
> > > I attached my test result in one of my replies[1]: netperf showed ~3.5%
> > > regression, hackbench improved a lot, and tbench4 drew. I tested several
> > > times and the results didn't seem vary much.
> > > 
> > > > 
> > > > The lack of data and the lack of a note on the potential downside is the
> > > > main reason why I'm not acking patch. tbench4 was a particular concern on
> > > > my own tests and it's possible a better patch would be a hybrid approach
> > > > where a limited search of two cores (excluding target) is allowed even if
> > > > SIS_UTIL indicates overload but has_idle_cores is true.
> > > > 
> > > 
> > > Agreed. And the reason I will exclude this part in v2 is that I plan to
> > > make it part of another feature, SIS filter [2]. The latest version of
> > > SIS filter (not posted yet but soon) will contain all the idle cpus so
> > > we don't need a full scan when has_idle_core. Scanning the filter then
> > > is enough. While it may still cost too much if too many false positive
> > > bits in the filter. Does this direction make sense to you?
> > > 
> > > [1] https://lore.kernel.org/lkml/eaa543fa-421d-2194-be94-6a5e24a33b37@bytedance.com/
> > > [2]
> > > https://lore.kernel.org/lkml/20220619120451.95251-1-wuyun.abel@bytedance.com/
> > > 
> > 
> > The filter idea is tricky, it will always struggle with "accurate
> > information" vs "cost of cpumask update" and what you link indicates that
> > it's updated on the tick boundary. That will be relatively cheap but could
> > mean that searches near the point of saturation or overload will have
> > false positives and negatives which you are already aware of given the
> > older series. It does not kill the idea but I would strongly suggest that
> > you do the simple thing first. That would yield potentially 3-4 patches
> > at the end of the series
> > 
> > 1. Full scan for cores (this patch effectively)
> > 2. Limited scan of 2 cores when SIS_UTIL cuts off but has_idle_cores is true
> >     (Compare 1 vs 2 in the changelog, not expected to be a universal win but
> >      should have better "worst case" behaviour to be worth merging)
> 
> I am afraid I had a hard time on this part, and it would be nice if you
> can shed some light on this..
> 
> Currently the mainline behavior when has_idle_core:
> 
> 	nr_idle_scan	0	1	2	...
> 	nr		0	max	max	...
> 
> while this patch makes:
> 
> 	nr_idle_scan	0	1	2	...
> 	nr		max	max	max	...
> 
> and if limit scan:
> 
> 	nr_idle_scan	0	1	2	...
> 	nr		2	2	max	...
> 
> anyway, the scan depth for has_idle_core case is not well aligned to
> the load. And honestly I'm not sure whether the depth should align to
> the load or not, since lower depth can easily put sds->has_idle_core
> in vain. Thoughts?
> 

For the simple case, I was expecting the static depth to *not* match load
because it's unclear what the scaling should be for load or if it had a
benefit. If investigating scaling the scan depth to load, it would still
make sense to compare it to a static depth. The depth of 2 cores was to
partially match the old SIS_PROP behaviour of the minimum depth to scan.

                if (span_avg > 4*avg_cost)
                        nr = div_u64(span_avg, avg_cost);
                else
                        nr = 4;

nr is not proportional to cores although it could be
https://lore.kernel.org/all/20210726102247.21437-7-mgorman@techsingularity.net/

This is not tested or properly checked for correctness but, for
illustrative purposes, something like this should conduct a limited scan
when overloaded. It has the side-effect that the has_idle_cores hint gets
cleared after a partial scan for idle cores, but the hint is probably wrong
anyway.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6089251a4720..59b27a2ef465 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6427,21 +6427,36 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		if (sd_share) {
 			/* because !--nr is the condition to stop scan */
 			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
-			/* overloaded LLC is unlikely to have idle cpu/core */
-			if (nr == 1)
-				return -1;
+
+			/*
+			 * Non-overloaded case: Scan full domain if there is
+			 * 	an idle core. Otherwise, scan for an idle
+			 * 	CPU based on nr_idle_scan
+			 * Overloaded case: Unlikely to have an idle CPU but
+			 * 	conduct a limited scan if there is potentially
+			 * 	an idle core.
+			 */
+			if (nr > 1) {
+				if (has_idle_core)
+					nr = sd->span_weight;
+			} else {
+				if (!has_idle_core)
+					return -1;
+				nr = 2;
+			}
 		}
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
+		if (!--nr)
+			break;
+
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned int)i < nr_cpumask_bits)
 				return i;
 
 		} else {
-			if (!--nr)
-				return -1;
 			idle_cpu = __select_idle_cpu(cpu, p);
 			if ((unsigned int)idle_cpu < nr_cpumask_bits)
 				break;

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-09-02 10:25           ` Mel Gorman
@ 2022-09-05 14:40             ` Abel Wu
  2022-09-06  9:57               ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Abel Wu @ 2022-09-05 14:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	linux-kernel

On 9/2/22 6:25 PM, Mel Gorman Wrote:
> For the simple case, I was expecting the static depth to *not* match load
> because it's unclear what the scaling should be for load or if it had a
> benefit. If investigating scaling the scan depth to load, it would still
> make sense to compare it to a static depth. The depth of 2 cores was to
> partially match the old SIS_PROP behaviour of the minimum depth to scan.
> 
>                  if (span_avg > 4*avg_cost)
>                          nr = div_u64(span_avg, avg_cost);
>                  else
>                          nr = 4;
> 
> nr is not proportional to cores although it could be
> https://lore.kernel.org/all/20210726102247.21437-7-mgorman@techsingularity.net/
> 
> This is not tested or properly checked for correctness but for
> illustrative purposes something like this should conduct a limited scan when
> overloaded. It has a side-effect that the has_idle_cores hint gets cleared
> for a partial scan for idle cores but the hint is probably wrong anyway.
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6089251a4720..59b27a2ef465 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6427,21 +6427,36 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>   		if (sd_share) {
>   			/* because !--nr is the condition to stop scan */
>   			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> -			/* overloaded LLC is unlikely to have idle cpu/core */
> -			if (nr == 1)
> -				return -1;
> +
> +			/*
> +			 * Non-overloaded case: Scan full domain if there is
> +			 * 	an idle core. Otherwise, scan for an idle
> +			 * 	CPU based on nr_idle_scan
> +			 * Overloaded case: Unlikely to have an idle CPU but
> +			 * 	conduct a limited scan if there is potentially
> +			 * 	an idle core.
> +			 */
> +			if (nr > 1) {
> +				if (has_idle_core)
> +					nr = sd->span_weight;
> +			} else {
> +				if (!has_idle_core)
> +					return -1;
> +				nr = 2;
> +			}
>   		}
>   	}
>   
>   	for_each_cpu_wrap(cpu, cpus, target + 1) {
> +		if (!--nr)
> +			break;
> +
>   		if (has_idle_core) {
>   			i = select_idle_core(p, cpu, cpus, &idle_cpu);
>   			if ((unsigned int)i < nr_cpumask_bits)
>   				return i;
>   
>   		} else {
> -			if (!--nr)
> -				return -1;
>   			idle_cpu = __select_idle_cpu(cpu, p);
>   			if ((unsigned int)idle_cpu < nr_cpumask_bits)
>   				break;

I spent the last few days testing this, with 3 variations (assuming
has_idle_core):

  a) full or limited (2 cores) scan when !nr_idle_scan
  b) whether to clear sds->has_idle_core when a partial scan fails
  c) whether to scale scan depth with load

Some observations:

  1) It always seems bad not to clear sds->has_idle_core when a
     partial scan fails, because we keep paying for partial scans
     that still cannot find an idle core. (The following are based
     on clearing has_idle_core even in partial scans.)

  2) An unconditional full scan when has_idle_core is not good
     for netperf_{udp,tcp} and tbench4. It is probably because
     the SIS success rate of these workloads is already high
     enough (netperf ~= 100%, tbench4 ~= 50%, compared to
     hackbench ~= 3.5%), which negates a lot of the benefit a
     full scan brings.

  3) Scaling scan depth with load seems good for the hackbench
     socket tests, and neutral in the pipe tests. I think this
     is just the case you mentioned before: under fast wake-up
     workloads has_idle_core becomes not that reliable, so a
     full scan won't always win.

Best Regards,
Abel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-09-05 14:40             ` Abel Wu
@ 2022-09-06  9:57               ` Mel Gorman
  2022-09-07  7:27                 ` Chen Yu
  2022-09-07  7:52                 ` Abel Wu
  0 siblings, 2 replies; 34+ messages in thread
From: Mel Gorman @ 2022-09-06  9:57 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	linux-kernel

On Mon, Sep 05, 2022 at 10:40:00PM +0800, Abel Wu wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6089251a4720..59b27a2ef465 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6427,21 +6427,36 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >   		if (sd_share) {
> >   			/* because !--nr is the condition to stop scan */
> >   			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> > -			/* overloaded LLC is unlikely to have idle cpu/core */
> > -			if (nr == 1)
> > -				return -1;
> > +
> > +			/*
> > +			 * Non-overloaded case: Scan full domain if there is
> > +			 * 	an idle core. Otherwise, scan for an idle
> > +			 * 	CPU based on nr_idle_scan
> > +			 * Overloaded case: Unlikely to have an idle CPU but
> > +			 * 	conduct a limited scan if there is potentially
> > +			 * 	an idle core.
> > +			 */
> > +			if (nr > 1) {
> > +				if (has_idle_core)
> > +					nr = sd->span_weight;
> > +			} else {
> > +				if (!has_idle_core)
> > +					return -1;
> > +				nr = 2;
> > +			}
> >   		}
> >   	}
> >   	for_each_cpu_wrap(cpu, cpus, target + 1) {
> > +		if (!--nr)
> > +			break;
> > +
> >   		if (has_idle_core) {
> >   			i = select_idle_core(p, cpu, cpus, &idle_cpu);
> >   			if ((unsigned int)i < nr_cpumask_bits)
> >   				return i;
> >   		} else {
> > -			if (!--nr)
> > -				return -1;
> >   			idle_cpu = __select_idle_cpu(cpu, p);
> >   			if ((unsigned int)idle_cpu < nr_cpumask_bits)
> >   				break;
> 
> I spent last few days testing this, with 3 variations (assume
> has_idle_core):
> 
>  a) full or limited (2cores) scan when !nr_idle_scan
>  b) whether clear sds->has_idle_core when partial scan failed
>  c) scale scan depth with load or not
> 
> some observations:
> 
>  1) It seems always bad if not clear sds->has_idle_core when
>     partial scan fails. It is due to over partially scanned
>     but still can not find an idle core. (Following ones are
>     based on clearing has_idle_core even in partial scans.)
> 

Ok, that's rational. There will be corner cases where there is no idle
CPU near the target but there is an idle core far away; however, it would
be counter to the purpose of SIS_UTIL to care about that corner case.

>  2) Unconditionally full scan when has_idle_core is not good
>     for netperf_{udp,tcp} and tbench4. It is probably because
>     the SIS success rate of these workloads is already high
>     enough (netperf ~= 100%, tbench4 ~= 50%, compared to that
>     hackbench ~= 3.5%) which negate a lot of the benefit full
>     scan brings.
> 

That's also rational. For a single client/server on netperf, it's expected
that the SIS success rate is high and scanning is minimal. As the client
and server are sharing data on localhost and somewhat synchronous, it may
even partially benefit from SMT sharing.

So the basic approach would be "always clear sds->has_idle_core" + "limit
scan even when has_idle_core is true", right?

If so, stick a changelog on it and resend!

>  3) Scaling scan depth with load seems good for the hackbench
>     socket tests, and neutral in pipe tests. And I think this
>     is just the case you mentioned before, under fast wake-up
>     workloads the has_idle_core will become not that reliable,
>     so a full scan won't always win.
> 

My recommendation is to leave this out for now but, assuming the rest of
the patches get picked up, consider posting it for the next major kernel
version (i.e. separate the basic and clever approaches by one major kernel
version). By separating them, there is less chance of a false positive
bisection pointing to the wrong patch. Any regression will not be perfectly
reproducible, so the chances of a false positive bisection are relatively
high.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-09-06  9:57               ` Mel Gorman
@ 2022-09-07  7:27                 ` Chen Yu
  2022-09-07  8:41                   ` Mel Gorman
  2022-09-07  7:52                 ` Abel Wu
  1 sibling, 1 reply; 34+ messages in thread
From: Chen Yu @ 2022-09-07  7:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Abel Wu, Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don,
	linux-kernel

Hi Mel,
On 2022-09-06 at 10:57:17 +0100, Mel Gorman wrote:
> On Mon, Sep 05, 2022 at 10:40:00PM +0800, Abel Wu wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 6089251a4720..59b27a2ef465 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6427,21 +6427,36 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > >   		if (sd_share) {
> > >   			/* because !--nr is the condition to stop scan */
> > >   			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> > > -			/* overloaded LLC is unlikely to have idle cpu/core */
> > > -			if (nr == 1)
> > > -				return -1;
> > > +
> > > +			/*
> > > +			 * Non-overloaded case: Scan full domain if there is
> > > +			 * 	an idle core. Otherwise, scan for an idle
> > > +			 * 	CPU based on nr_idle_scan
> > > +			 * Overloaded case: Unlikely to have an idle CPU but
> > > +			 * 	conduct a limited scan if there is potentially
> > > +			 * 	an idle core.
> > > +			 */
> > > +			if (nr > 1) {
> > > +				if (has_idle_core)
> > > +					nr = sd->span_weight;
> > > +			} else {
> > > +				if (!has_idle_core)
> > > +					return -1;
> > > +				nr = 2;
> > > +			}
> > >   		}
> > >   	}
> > >   	for_each_cpu_wrap(cpu, cpus, target + 1) {
> > > +		if (!--nr)
> > > +			break;
> > > +
> > >   		if (has_idle_core) {
> > >   			i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > >   			if ((unsigned int)i < nr_cpumask_bits)
> > >   				return i;
> > >   		} else {
> > > -			if (!--nr)
> > > -				return -1;
> > >   			idle_cpu = __select_idle_cpu(cpu, p);
> > >   			if ((unsigned int)idle_cpu < nr_cpumask_bits)
> > >   				break;
> > 
> > I spent last few days testing this, with 3 variations (assume
> > has_idle_core):
> > 
> >  a) full or limited (2cores) scan when !nr_idle_scan
> >  b) whether clear sds->has_idle_core when partial scan failed
> >  c) scale scan depth with load or not
> > 
> > some observations:
> > 
> >  1) It seems always bad if not clear sds->has_idle_core when
> >     partial scan fails. It is due to over partially scanned
> >     but still can not find an idle core. (Following ones are
> >     based on clearing has_idle_core even in partial scans.)
> > 
> 
> Ok, that's rational. There will be corner cases where there was no idle
> CPU near the target when there is an idle core far away but it would be
> counter to the purpose of SIS_UTIL to care about that corner case.
> 
> >  2) Unconditionally full scan when has_idle_core is not good
> >     for netperf_{udp,tcp} and tbench4. It is probably because
> >     the SIS success rate of these workloads is already high
> >     enough (netperf ~= 100%, tbench4 ~= 50%, compared to that
> >     hackbench ~= 3.5%) which negate a lot of the benefit full
> >     scan brings.
> > 
> 
> That's also rational. For a single client/server on netperf, it's expected
> that the SIS success rate is high and scanning is minimal. As the client
> and server are sharing data on localhost and somewhat synchronous, it may
> even partially benefit from SMT sharing.
>
Maybe off-topic: since we monitor the success rate and also other metrics
for each optimization in the SIS path, is it possible to merge your
statistics patch [1] upstream so we don't need to rebase it in the future
(although it is targeting kernel development)?

Link: https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
[cut]


thanks,
Chenyu 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-09-06  9:57               ` Mel Gorman
  2022-09-07  7:27                 ` Chen Yu
@ 2022-09-07  7:52                 ` Abel Wu
  1 sibling, 0 replies; 34+ messages in thread
From: Abel Wu @ 2022-09-07  7:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	linux-kernel

On 9/6/22 5:57 PM, Mel Gorman wrote:
> On Mon, Sep 05, 2022 at 10:40:00PM +0800, Abel Wu wrote:
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 6089251a4720..59b27a2ef465 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6427,21 +6427,36 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>    		if (sd_share) {
>>>    			/* because !--nr is the condition to stop scan */
>>>    			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
>>> -			/* overloaded LLC is unlikely to have idle cpu/core */
>>> -			if (nr == 1)
>>> -				return -1;
>>> +
>>> +			/*
>>> +			 * Non-overloaded case: Scan full domain if there is
>>> +			 * 	an idle core. Otherwise, scan for an idle
>>> +			 * 	CPU based on nr_idle_scan
>>> +			 * Overloaded case: Unlikely to have an idle CPU but
>>> +			 * 	conduct a limited scan if there is potentially
>>> +			 * 	an idle core.
>>> +			 */
>>> +			if (nr > 1) {
>>> +				if (has_idle_core)
>>> +					nr = sd->span_weight;
>>> +			} else {
>>> +				if (!has_idle_core)
>>> +					return -1;
>>> +				nr = 2;
>>> +			}
>>>    		}
>>>    	}
>>>    	for_each_cpu_wrap(cpu, cpus, target + 1) {
>>> +		if (!--nr)
>>> +			break;
>>> +
>>>    		if (has_idle_core) {
>>>    			i = select_idle_core(p, cpu, cpus, &idle_cpu);
>>>    			if ((unsigned int)i < nr_cpumask_bits)
>>>    				return i;
>>>    		} else {
>>> -			if (!--nr)
>>> -				return -1;
>>>    			idle_cpu = __select_idle_cpu(cpu, p);
>>>    			if ((unsigned int)idle_cpu < nr_cpumask_bits)
>>>    				break;
>>
>> I spent last few days testing this, with 3 variations (assume
>> has_idle_core):
>>
>>   a) full or limited (2cores) scan when !nr_idle_scan
>>   b) whether clear sds->has_idle_core when partial scan failed
>>   c) scale scan depth with load or not
>>
>> some observations:
>>
>>   1) It seems always bad if not clear sds->has_idle_core when
>>      partial scan fails. It is due to over partially scanned
>>      but still can not find an idle core. (Following ones are
>>      based on clearing has_idle_core even in partial scans.)
>>
> 
> Ok, that's rational. There will be corner cases where there was no idle
> CPU near the target when there is an idle core far away but it would be
> counter to the purpose of SIS_UTIL to care about that corner case.

Yes, and this corner case (which may actually become the normal case
sometimes) will continue to exist as long as we scan in a linear fashion
while the scan depth is limited (SIS_UTIL/SIS_PROP/...), especially as
LLCs are getting larger nowadays.

> 
>>   2) Unconditionally full scan when has_idle_core is not good
>>      for netperf_{udp,tcp} and tbench4. It is probably because
>>      the SIS success rate of these workloads is already high
>>      enough (netperf ~= 100%, tbench4 ~= 50%, compared to that
>>      hackbench ~= 3.5%) which negate a lot of the benefit full
>>      scan brings.
>>
> 
> That's also rational. For a single client/server on netperf, it's expected
> that the SIS success rate is high and scanning is minimal. As the client
> and server are sharing data on localhost and somewhat synchronous, it may
> even partially benefit from SMT sharing.
> 
> So basic approach would be "always clear sds->has_idle_core" + "limit
> scan even when has_idle_core is true", right?

Yes, exactly the same as what you suggested in the first place.

> 
> If so, stick a changelog on it and resend!

I will include this in the SIS_FILTER patchset.

> 
>>   3) Scaling scan depth with load seems good for the hackbench
>>      socket tests, and neutral in pipe tests. And I think this
>>      is just the case you mentioned before, under fast wake-up
>>      workloads the has_idle_core will become not that reliable,
>>      so a full scan won't always win.
>>
> 
> My recommendation is leave this out for now but assuming the rest of
> the patches get picked up, consider posting it for the next major kernel
> version (i.e. separate the basic and clever approaches by one major kernel
> version). By separating them, there is less chance of a false positive
> bisection pointing to the wrong patch. Any regression will not be perfectly
> reproducible so the changes of a false positive bisection is relatively high.

Makes sense. I will just include the basic part first.

Thanks,
Abel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core
  2022-09-07  7:27                 ` Chen Yu
@ 2022-09-07  8:41                   ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2022-09-07  8:41 UTC (permalink / raw)
  To: Chen Yu
  Cc: Abel Wu, Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don,
	linux-kernel

On Wed, Sep 07, 2022 at 03:27:41PM +0800, Chen Yu wrote:
> > Ok, that's rational. There will be corner cases where there was no idle
> > CPU near the target when there is an idle core far away but it would be
> > counter to the purpose of SIS_UTIL to care about that corner case.
> > 
> > >  2) Unconditionally full scan when has_idle_core is not good
> > >     for netperf_{udp,tcp} and tbench4. It is probably because
> > >     the SIS success rate of these workloads is already high
> > >     enough (netperf ~= 100%, tbench4 ~= 50%, compared to that
> > >     hackbench ~= 3.5%) which negate a lot of the benefit full
> > >     scan brings.
> > > 
> > 
> > That's also rational. For a single client/server on netperf, it's expected
> > that the SIS success rate is high and scanning is minimal. As the client
> > and server are sharing data on localhost and somewhat synchronous, it may
> > even partially benefit from SMT sharing.
> >
> Maybe off-topic, since we monitor the success rate and also other metrics
> for each optimization in SIS path, is it possible to merge your statistics
> patch [1] into upstream so we don't need to rebase in the future(although
> it is targeting kernel development)?
> 

I am doubtful it is a merge candidate. While it's very useful when modifying
SIS and gathering data on whether SIS is behaving as expected, it has little
practical benefit when debugging problems on normal systems. Crude estimates
can be obtained by other methods. Probing when select_idle_sibling and
select_idle_cpu are called reveals the SIS fast and slow paths and the
ratio between them. Tracking the time spent in select_idle_cpu reveals
how much time is spent finding idle cores and cpus.

I would not object to someone trying but the changelog would benefit from
explaining why it's practically useful. Every time I've used it, it was to
justify another patch being merged or to investigate various SIS corner
cases.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2022-09-07  8:50 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-12  8:20 [PATCH 0/5] sched/fair: SIS improvements and cleanups Abel Wu
2022-07-12  8:20 ` [PATCH 1/5] sched/fair: ignore SIS_UTIL when has idle core Abel Wu
2022-07-13  3:47   ` Chen Yu
2022-07-13 16:14     ` Abel Wu
2022-07-14  6:19   ` Yicong Yang
2022-07-14  6:58     ` Abel Wu
2022-07-14  7:15       ` Yicong Yang
2022-07-14  8:00         ` Abel Wu
2022-07-14  8:16           ` Yicong Yang
2022-07-14  8:34             ` Yicong Yang
2022-08-04  9:59       ` Chen Yu
2022-08-15  2:54         ` Abel Wu
2022-08-10 13:50   ` Chen Yu
2022-08-15  2:44     ` Abel Wu
2022-08-29 13:08   ` Mel Gorman
2022-08-29 14:11     ` Abel Wu
2022-08-29 14:56       ` Mel Gorman
2022-09-01 13:08         ` Abel Wu
2022-09-02  4:12         ` Abel Wu
2022-09-02 10:25           ` Mel Gorman
2022-09-05 14:40             ` Abel Wu
2022-09-06  9:57               ` Mel Gorman
2022-09-07  7:27                 ` Chen Yu
2022-09-07  8:41                   ` Mel Gorman
2022-09-07  7:52                 ` Abel Wu
2022-07-12  8:20 ` [PATCH 2/5] sched/fair: default to false in test_idle_cores Abel Wu
2022-08-29 12:36   ` Mel Gorman
2022-07-12  8:20 ` [PATCH 3/5] sched/fair: remove redundant check in select_idle_smt Abel Wu
2022-08-29 12:36   ` Mel Gorman
2022-07-12  8:20 ` [PATCH 4/5] sched/fair: avoid double search on same cpu Abel Wu
2022-08-29 12:36   ` Mel Gorman
2022-07-12  8:20 ` [PATCH 5/5] sched/fair: remove useless check in select_idle_core Abel Wu
2022-08-29 12:37   ` Mel Gorman
2022-08-15 13:31 ` [PATCH 0/5] sched/fair: SIS improvements and cleanups Abel Wu
