* [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
@ 2022-03-10  0:52 Chen Yu
  2022-03-14  4:53 ` Abel Wu
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Chen Yu @ 2022-03-10  0:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Tim Chen, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Viresh Kumar, Barry Song, Barry Song, Yicong Yang,
	Srikar Dronamraju, Len Brown, Ben Segall,
	Daniel Bristot de Oliveira, Aubrey Li, Chen Yu, K Prateek Nayak

[Problem Statement]
Currently select_idle_cpu() uses the percpu average idle time to
estimate the total LLC domain idle time, and calculates the number
of CPUs to be scanned. This can be inaccurate because the idle time
of a single CPU does not necessarily correlate with the idle time of
the whole domain. As a result, the load could be underestimated, which
causes over-searching when the system is very busy.

The following histogram shows the time spent in select_idle_cpu()
when running 224 instances of netperf on a system with 112 CPUs
per LLC domain:

@usecs:
[0]                  533 |                                                    |
[1]                 5495 |                                                    |
[2, 4)             12008 |                                                    |
[4, 8)            239252 |                                                    |
[8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
[16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
[512, 1K)        2600472 |@@@@@@@@@                                           |
[1K, 2K)          927912 |@@@                                                 |
[2K, 4K)          218720 |                                                    |
[4K, 8K)           98161 |                                                    |
[8K, 16K)          37722 |                                                    |
[16K, 32K)          6715 |                                                    |
[32K, 64K)           477 |                                                    |
[64K, 128K)            7 |                                                    |

netperf latency:
=======
case            	load    	    Lat_99th	    std%
TCP_RR          	thread-224	      257.39	(  0.21)
UDP_RR          	thread-224	      242.83	(  6.29)

The netperf 99th percentile latency (usec) above is comparable to the
time spent in select_idle_cpu(). That is to say, when the system is
overloaded, searching for an idle CPU could be a bottleneck.

[Proposal]
The main idea is to replace the percpu average idle time with a
domain-based metric, and average CPU utilization (util_avg) is chosen
as the candidate. In general, the number of CPUs to be scanned should
be inversely proportional to the sum of util_avg in this domain. That
is, the lower the util_avg is, the more CPUs select_idle_cpu() should
scan for an idle one, and vice versa. The benefit of choosing util_avg
is that it is a metric of accumulated historic activity, which seems
to be more accurate than instantaneous metrics (such as
rq->nr_running).

Furthermore, the util_avg is borrowed from periodic load balance,
which offloads the overhead from select_idle_cpu().

According to the last discussion[1], a linear function is introduced
for experimental purposes:

f(x) = a - bx

     llc_size
x = \Sum      util_avg[cpu] / llc_cpu_capacity
     1
f(x) is the number of CPUs to be scanned, and x is the util_avg sum
of the LLC domain scaled by the LLC capacity.
To determine a and b, the following conditions should be met:

[1] f(0) = llc_size
[2] f(x) = 4,  x >= 50%

That is to say, when the util_avg is 0, we should search the whole
LLC domain. And if the util_avg ratio reaches 50% or higher, at most
4 CPUs should be searched.
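
As an illustration (using the 112-CPU LLC from the tests above, not
part of the formal description), plugging these conditions into
f(x) = a - bx gives:

   a = llc_size = 112
   b = (llc_size - 4) / 50 = (112 - 4) / 50 = 2.16

so f(x) = 112 - 2.16x, that is, f(0) = 112, f(25) = 58 and
f(50) = 4 CPUs to be scanned.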

Naturally there will be questions such as:
Why use this linear function to calculate the number of CPUs to be
scanned? Why choose 50% as the threshold? These questions are
discussed in the [Limitations] section.

[Benchmark]
netperf, hackbench, schbench and tbench were tested with 25% 50% 75%
100% 125% 150% 175% 200% as many instances as there are CPUs (these
ratios are not CPU utilization). Each test lasts for 100 seconds and
repeats 3 times. The system reboots into a fresh environment for each
benchmark.

The following is the benchmark result comparison between
baseline (vanilla) and compare (patched) kernel. A positive compare%
indicates better performance.

netperf
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	28 threads	 1.00 (  0.30)	 -1.26 (  0.32)
TCP_RR          	56 threads	 1.00 (  0.35)	 -1.26 (  0.41)
TCP_RR          	84 threads	 1.00 (  0.46)	 -0.15 (  0.60)
TCP_RR          	112 threads	 1.00 (  0.36)	 +0.44 (  0.41)
TCP_RR          	140 threads	 1.00 (  0.23)	 +0.95 (  0.21)
TCP_RR          	168 threads	 1.00 (  0.20)	+177.77 (  3.78)
TCP_RR          	196 threads	 1.00 (  0.18)	+185.43 ( 10.08)
TCP_RR          	224 threads	 1.00 (  0.16)	+187.86 (  7.32)
UDP_RR          	28 threads	 1.00 (  0.43)	 -0.93 (  0.27)
UDP_RR          	56 threads	 1.00 (  0.17)	 -0.39 ( 10.91)
UDP_RR          	84 threads	 1.00 (  6.36)	 +1.03 (  0.92)
UDP_RR          	112 threads	 1.00 (  5.55)	 +1.47 ( 17.67)
UDP_RR          	140 threads	 1.00 ( 18.17)	 +0.31 ( 15.48)
UDP_RR          	168 threads	 1.00 ( 15.00)	+153.87 ( 13.20)
UDP_RR          	196 threads	 1.00 ( 16.26)	+169.19 ( 13.78)
UDP_RR          	224 threads	 1.00 ( 51.81)	+76.72 ( 10.95)

hackbench
=========
(each group has 1/4 * 112 tasks)
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1 group 	 1.00 (  0.47)	 -0.46 (  0.16)
process-pipe    	2 groups 	 1.00 (  0.42)	 -0.61 (  0.74)
process-pipe    	3 groups 	 1.00 (  0.42)	 +0.38 (  0.20)
process-pipe    	4 groups 	 1.00 (  0.15)	 -0.36 (  0.56)
process-pipe    	5 groups 	 1.00 (  0.20)	 -5.08 (  0.01)
process-pipe    	6 groups 	 1.00 (  0.28)	 -2.98 (  0.29)
process-pipe    	7 groups 	 1.00 (  0.08)	 -1.18 (  0.28)
process-pipe    	8 groups 	 1.00 (  0.11)	 -0.40 (  0.07)
process-sockets 	1 group 	 1.00 (  0.43)	 -1.93 (  0.58)
process-sockets 	2 groups 	 1.00 (  0.23)	 -1.10 (  0.49)
process-sockets 	3 groups 	 1.00 (  1.10)	 -0.96 (  1.12)
process-sockets 	4 groups 	 1.00 (  0.59)	 -0.08 (  0.88)
process-sockets 	5 groups 	 1.00 (  0.45)	 +0.31 (  0.34)
process-sockets 	6 groups 	 1.00 (  0.23)	 +0.06 (  0.66)
process-sockets 	7 groups 	 1.00 (  0.12)	 +1.72 (  0.20)
process-sockets 	8 groups 	 1.00 (  0.11)	 +1.98 (  0.02)
threads-pipe    	1 group 	 1.00 (  1.07)	 +0.03 (  0.40)
threads-pipe    	2 groups 	 1.00 (  1.05)	 +0.19 (  1.27)
threads-pipe    	3 groups 	 1.00 (  0.32)	 -0.42 (  0.48)
threads-pipe    	4 groups 	 1.00 (  0.42)	 -0.76 (  0.79)
threads-pipe    	5 groups 	 1.00 (  0.19)	 -4.97 (  0.07)
threads-pipe    	6 groups 	 1.00 (  0.05)	 -4.11 (  0.04)
threads-pipe    	7 groups 	 1.00 (  0.10)	 -1.13 (  0.16)
threads-pipe    	8 groups 	 1.00 (  0.03)	 -0.08 (  0.05)
threads-sockets 	1 group 	 1.00 (  0.33)	 -1.93 (  0.69)
threads-sockets 	2 groups 	 1.00 (  0.20)	 -1.55 (  0.30)
threads-sockets 	3 groups 	 1.00 (  0.37)	 -1.29 (  0.59)
threads-sockets 	4 groups 	 1.00 (  1.83)	 +0.31 (  1.17)
threads-sockets 	5 groups 	 1.00 (  0.28)	+15.73 (  0.24)
threads-sockets 	6 groups 	 1.00 (  0.15)	 +5.02 (  0.34)
threads-sockets 	7 groups 	 1.00 (  0.10)	 +2.29 (  0.14)
threads-sockets 	8 groups 	 1.00 (  0.17)	 +2.22 (  0.12)

tbench
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	28 threads	 1.00 (  0.05)	 -1.39 (  0.04)
loopback        	56 threads	 1.00 (  0.08)	 -0.37 (  0.04)
loopback        	84 threads	 1.00 (  0.03)	 +0.20 (  0.13)
loopback        	112 threads	 1.00 (  0.04)	 +0.69 (  0.04)
loopback        	140 threads	 1.00 (  0.13)	 +1.15 (  0.21)
loopback        	168 threads	 1.00 (  0.03)	 +1.62 (  0.08)
loopback        	196 threads	 1.00 (  0.08)	 +1.50 (  0.30)
loopback        	224 threads	 1.00 (  0.05)	 +1.62 (  0.05)

schbench
========
(each mthread group has 1/4 * 112 tasks)
case            	load    	baseline(std%)	compare%( std%)
normal          	1 mthread group	 1.00 ( 17.92)	+19.23 ( 23.67)
normal          	2 mthread groups 1.00 ( 21.10)	 +8.32 ( 16.92)
normal          	3 mthread groups 1.00 ( 10.80)	+10.03 (  9.21)
normal          	4 mthread groups 1.00 (  2.67)	 +0.11 (  3.00)
normal          	5 mthread groups 1.00 (  0.08)	 +0.00 (  0.13)
normal          	6 mthread groups 1.00 (  2.99)	 -2.66 (  3.87)
normal          	7 mthread groups 1.00 (  2.16)	 -0.83 (  2.24)
normal          	8 mthread groups 1.00 (  1.75)	 +0.18 (  3.18)

According to the results above, when the workload is heavy, the
throughput of netperf improves a lot. It might be interesting to look
into why this patch benefits netperf so significantly. Further
investigation has shown that this might be a 'side effect' of the
patch: the CPU utilization is around 90% on the vanilla kernel, while
it is nearly 100% on the patched kernel. According to the perf
profile, with the patch applied the scheduler is more likely to choose
the waking task's previous CPU, which reduces runqueue lock
contention, so the CPU utilization is higher and the performance is
better.

[Limitations]
Q: Why use 50% as the util_avg/capacity threshold to search at most 4 CPUs?

A: 50% is chosen because it corresponds to almost full CPU utilization
   when the CPU is fixed to run at its base frequency with turbo enabled.
   4 is the minimum number of CPUs to be scanned in the current
   select_idle_cpu().

   A synthetic workload was used to simulate different levels of
   load. This workload takes 10ms as the sample period, and in
   each sample period runs (a runnable sketch of this loop is
   given at the end of this answer):

   while (!timeout_10ms) {
        loop(busy_pct_ms);          /* spin for busy_pct% of 10ms */
        sleep(10ms - busy_pct_ms);  /* sleep for the remainder */
   }

   to simulate busy_pct% of CPU utilization. When the workload is
   running, the percpu runqueue util_avg was monitored. The
   following is the result from turbostat's Busy% on CPU2 and
   cfs_rq[2].util_avg from /sys/kernel/debug/sched/debug:

   Busy%    util_avg   util_avg/cpu_capacity%
   10.06    35         3.42
   19.97    99         9.67
   29.93    154        15.04
   39.86    213        20.80
   49.79    256        25.00
   59.73    325        31.74
   69.77    437        42.68
   79.69    458        44.73
   89.62    519        50.68
   99.54    598        58.39

   The reason why the util_avg ratio is not consistent with Busy% might
   be CPU frequency invariance: the CPU runs at a fixed frequency that is
   lower than the turbo frequency, so util_avg scales lower than
   SCHED_CAPACITY_SCALE. On our test platform, the base frequency is
   1.9GHz and the max turbo frequency is 3.7GHz, so 1.9/3.7 is around 50%.
   In the future maybe we could use arch_scale_freq_capacity()
   instead of sds->total_capacity, so as to remove the impact of
   frequency. Then the 50% could be adjusted higher. For now, 50% is an
   aggressive threshold to restrict the idle CPU search and it shows
   benchmark improvement.
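
   For reference, a minimal user-space sketch of the loop above could
   look like the following. This is an illustrative reconstruction
   rather than the exact tool used for the measurements; the file name,
   the argv[1] busy-percentage argument and the spin/usleep pair are
   assumptions:

   /* busyload.c - hypothetical sketch, build with: gcc -o busyload busyload.c */
   #include <stdlib.h>
   #include <time.h>
   #include <unistd.h>

   static long long now_ns(void)
   {
           struct timespec ts;

           clock_gettime(CLOCK_MONOTONIC, &ts);
           return ts.tv_sec * 1000000000LL + ts.tv_nsec;
   }

   int main(int argc, char **argv)
   {
           const long long period_ns = 10 * 1000000LL;   /* 10ms sample period */
           int busy_pct = argc > 1 ? atoi(argv[1]) : 50;
           long long busy_ns = period_ns * busy_pct / 100;

           for (;;) {
                   long long start = now_ns();

                   /* spin for busy_pct% of the period ... */
                   while (now_ns() - start < busy_ns)
                           ;
                   /* ... and sleep for the remainder */
                   usleep((period_ns - busy_ns) / 1000);
           }
           return 0;
   }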

Q: Why use nr_scan = a - b * sum_util_avg, a linear function, to limit the search?

A: Ideally the nr_scan could be:

   nr_scan = sum_util_avg / pelt_avg_scan_cost

   However, considering the overhead of calculating a PELT average of
   the scan cost on each wakeup, a heuristic search seems to be an
   acceptable trade-off.

   f(sum_util_avg) could be of any form, as long as it is a monotonically
   decreasing function. At first f(x) = a - 2^(bx) was chosen, because when
   sum_util_avg is low, the system should try very hard to find an idle CPU,
   and as sum_util_avg goes higher, the system should rapidly lose interest
   in searching for an idle CPU. But the exponential function does have its
   drawback:

   Consider a system with 112 CPUs. Let f(x) = 112 when x = 0 and
   f(x) = 4 when x = 50, with x in [0, 100]. Then we have:

   f1(x) = 113 - 2^(x / 7.35)
   and
   f2(x) = 112 - 2.16 * x

   Since the kernel does not support floating point, the above functions are converted into:
   nr_scan1(x) = 113 - 2^(x / 7)
   and
   nr_scan2(x) = 112 - 2 * x

   util_avg%      0     1     2  ...   8     9   ... 47   48   49
   nr_scan1     112   112   112      111   111       49   49    4
   nr_scan2     112   110   108       96    94       18   16   14

   According to the results above, the exponential function is
   coarse-grained, while the linear function is fine-grained.

   So the linear function was chosen in the end. After all, it has shown
   benchmark benefit without noticeable regression so far.
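
   As a sanity check (just a sketch, assuming llc_size = 112, the 50%
   threshold, and the same lower clamp of 4 that the patch applies), the
   comparison table above can be reproduced with integer-only arithmetic:

   #include <stdio.h>

   int main(void)
   {
           int x;                                          /* x: util_avg% */

           for (x = 0; x <= 49; x++) {
                   int nr_scan1 = 113 - (1 << (x / 7));    /* exponential form */
                   int nr_scan2 = 112 - 2 * x;             /* linear form */

                   if (nr_scan1 < 4)
                           nr_scan1 = 4;
                   if (nr_scan2 < 4)
                           nr_scan2 = 4;

                   printf("util_avg%%=%2d  nr_scan1=%3d  nr_scan2=%3d\n",
                          x, nr_scan1, nr_scan2);
           }
           return 0;
   }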

Q: How to deal with the following corner case:

   It is possible that tasks are distributed unevenly among CPUs due to
   CPU affinity. For example, suppose the LLC domain is composed of 6 CPUs,
   and 5 tasks are bound to CPU0~CPU4 while CPU5 is idle:

		CPU0	CPU1	CPU2	CPU3	CPU4	CPU5
   util_avg	1024	1024	1024	1024	1024	0

   Since the util_avg ratio is 83% ( = 5/6 ), which is higher than 50%,
   select_idle_cpu() only searches 4 CPUs starting from CPU0, thus leaving
   idle CPU5 undetected.

   A possible workaround to mitigate this problem is to increase nr_scan
   by the number of idle CPUs found during periodic load balance in
   update_sd_lb_stats(). In the above example, nr_scan would be adjusted
   to 4 + 1 = 5. Currently I don't have a better solution in mind to deal
   with this gracefully.
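
   To make the workaround concrete, a hypothetical adjustment in
   update_nr_idle_scan() could look like the sketch below. 'nr_idle'
   is an assumed count of idle CPUs that would have to be accumulated
   in update_sd_lb_stats(); it does not exist in this patch:

   nr_scan = llc_size - (llc_size - 4) * 2 * sum_util / sds->total_capacity;
   nr_scan += nr_idle;     /* credit idle CPUs seen by load balance, e.g. 4 + 1 = 5 */
   if (nr_scan < 4)
           nr_scan = 4;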

Any comment is appreciated.

Link: https://lore.kernel.org/lkml/20220207135253.GF23216@worktop.programming.kicks-ass.net/ # [1]
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched/topology.h |   1 +
 kernel/sched/fair.c            | 107 +++++++++++++++++++--------------
 2 files changed, 63 insertions(+), 45 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8054641c0a7b..aae558459f00 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,6 +81,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	int		nr_idle_scan;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5146163bfabb..59f5f8432c21 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6271,43 +6271,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
-	struct rq *this_rq = this_rq();
-	int this = smp_processor_id();
-	struct sched_domain *this_sd;
-	u64 time = 0;
-
-	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
-	if (!this_sd)
-		return -1;
+	struct sched_domain_shared *sd_share;
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
-		u64 avg_cost, avg_idle, span_avg;
-		unsigned long now = jiffies;
-
-		/*
-		 * If we're busy, the assumption that the last idle period
-		 * predicts the future is flawed; age away the remaining
-		 * predicted idle time.
-		 */
-		if (unlikely(this_rq->wake_stamp < now)) {
-			while (this_rq->wake_stamp < now && this_rq->wake_avg_idle) {
-				this_rq->wake_stamp++;
-				this_rq->wake_avg_idle >>= 1;
-			}
-		}
-
-		avg_idle = this_rq->wake_avg_idle;
-		avg_cost = this_sd->avg_scan_cost + 1;
-
-		span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
-			nr = div_u64(span_avg, avg_cost);
-		else
-			nr = 4;
-
-		time = cpu_clock(this);
+		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
+		if (sd_share)
+			nr = READ_ONCE(sd_share->nr_idle_scan);
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
@@ -6328,18 +6299,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	if (has_idle_core)
 		set_idle_cores(target, false);
 
-	if (sched_feat(SIS_PROP) && !has_idle_core) {
-		time = cpu_clock(this) - time;
-
-		/*
-		 * Account for the scan cost of wakeups against the average
-		 * idle time.
-		 */
-		this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
-
-		update_avg(&this_sd->avg_scan_cost, time);
-	}
-
 	return idle_cpu;
 }
 
@@ -9199,6 +9158,60 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+static inline void update_nr_idle_scan(struct lb_env *env, struct sd_lb_stats *sds,
+				       unsigned long sum_util)
+{
+	struct sched_domain_shared *sd_share;
+	int llc_size = per_cpu(sd_llc_size, env->dst_cpu);
+	int nr_scan;
+
+	/*
+	 * Update the number of CPUs to scan in LLC domain, which could
+	 * be used as a hint in select_idle_cpu(). The update of this hint
+	 * occurs during periodic load balancing, rather than frequent
+	 * newidle balance.
+	 */
+	if (env->idle == CPU_NEWLY_IDLE || env->sd->span_weight != llc_size)
+		return;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
+	if (!sd_share)
+		return;
+
+	/*
+	 * In general, the number of cpus to be scanned should be
+	 * inversely proportional to the sum_util. That is, the lower
+	 * the sum_util is, the harder select_idle_cpu() should scan
+	 * for idle CPU, and vice versa. Let x be the sum_util ratio
+	 * [0-100] of the LLC domain, f(x) be the number of CPUs scanned:
+	 *
+	 * f(x) = a - bx                                              [1]
+	 *
+	 * Consider that f(x) = nr_llc when x = 0, and f(x) = 4 when
+	 * x >= threshold('h' below) then:
+	 *
+	 * a = llc_size;
+	 * b = (nr_llc - 4) / h                                       [2]
+	 *
+	 * then [2] becomes:
+	 *
+	 * f(x) = llc_size - (llc_size -4)x/h                         [3]
+	 *
+	 * Choose 50 (50%) for h as the threshold from experiment result.
+	 * And since x = 100 * sum_util / total_cap, [3] becomes:
+	 *
+	 * f(sum_util)
+	 *  = llc_size - (llc_size - 4) * 100 * sum_util / total_cap * 50
+	 *  = llc_size - (llc_size - 4) * 2 * sum_util / total_cap
+	 *
+	 */
+	nr_scan = llc_size - (llc_size - 4) * 2 * sum_util / sds->total_capacity;
+	if (nr_scan < 4)
+		nr_scan = 4;
+
+	WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -9212,6 +9225,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
 	int sg_status = 0;
+	unsigned long sum_util = 0;
 
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
@@ -9242,6 +9256,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
+		sum_util += sgs->group_util;
 
 		sg = sg->next;
 	} while (sg != env->sd->groups);
@@ -9268,6 +9283,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
 		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
 	}
+
+	update_nr_idle_scan(env, sds, sum_util);
 }
 
 #define NUMA_IMBALANCE_MIN 2
-- 
2.25.1



* Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
  2022-03-10  0:52 [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg Chen Yu
@ 2022-03-14  4:53 ` Abel Wu
  2022-03-14 12:56   ` Chen Yu
  2022-03-15 11:37 ` K Prateek Nayak
  2022-03-17 17:39 ` Yicong Yang
  2 siblings, 1 reply; 15+ messages in thread
From: Abel Wu @ 2022-03-14  4:53 UTC (permalink / raw)
  To: Chen Yu, linux-kernel
  Cc: Tim Chen, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Viresh Kumar, Barry Song, Barry Song, Yicong Yang,
	Srikar Dronamraju, Len Brown, Ben Segall,
	Daniel Bristot de Oliveira, Aubrey Li, K Prateek Nayak

Hi Chen,

On 3/10/22 8:52 AM, Chen Yu wrote:
> [Problem Statement]
> Currently select_idle_cpu() uses the percpu average idle time to
> estimate the total LLC domain idle time, and calculate the number
> of CPUs to be scanned. This might be inconsistent because idle time
> of a CPU does not necessarily correlate with idle time of a domain.
> As a result, the load could be underestimated and causes over searching
> when the system is very busy.
> 
> The following histogram is the time spent in select_idle_cpu(),
> when running 224 instance of netperf on a system with 112 CPUs
> per LLC domain:
> 
> @usecs:
> [0]                  533 |                                                    |
> [1]                 5495 |                                                    |
> [2, 4)             12008 |                                                    |
> [4, 8)            239252 |                                                    |
> [8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
> [16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
> [32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
> [128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
> [256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
> [512, 1K)        2600472 |@@@@@@@@@                                           |
> [1K, 2K)          927912 |@@@                                                 |
> [2K, 4K)          218720 |                                                    |
> [4K, 8K)           98161 |                                                    |
> [8K, 16K)          37722 |                                                    |
> [16K, 32K)          6715 |                                                    |
> [32K, 64K)           477 |                                                    |
> [64K, 128K)            7 |                                                    |
> 
> netperf latency:
> =======
> case            	load    	    Lat_99th	    std%
> TCP_RR          	thread-224	      257.39	(  0.21)
> UDP_RR          	thread-224	      242.83	(  6.29)
> 
> The netperf 99th latency(usec) above is comparable with the time spent in
> select_idle_cpu(). That is to say, when the system is overloaded, searching
> for idle CPU could be a bottleneck.
> 
> [Proposal]
> The main idea is to replace percpu average idle time with the domain
> based metric. Choose average CPU utilization(util_avg) as the candidate.
> In general, the number of CPUs to be scanned should be inversely
> proportional to the sum of util_avg in this domain. That is, the lower
> the util_avg is, the more select_idle_cpu() should scan for idle CPU,
> and vice versa. The benefit of choosing util_avg is that, it is a metric
> of accumulated historic activity, which seems to be more accurate than
> instantaneous metrics(such as rq->nr_running).
> 
> Furthermore, borrow the util_avg from periodic load balance,
> which could offload the overhead of select_idle_cpu().
> 
> According to last discussion[1], introduced the linear function
> for experimental purpose:

It would be better if you could show that a linear model holds, using
the SIS efficiency statistics :)

> 
> f(x) = a - bx
> 
>       llc_size
> x = \Sum      util_avg[cpu] / llc_cpu_capacity
>       1
> f(x) is the number of CPUs to be scanned, x is the sum util_avg.
> To decide a and b, the following condition should be met:
> 
> [1] f(0) = llc_size
> [2] f(x) = 4,  x >= 50%
> 
> That is to say, when the util_avg is 0, we should search for
> the whole LLC domain. And if util_avg ratio reaches 50% or higher,
> it should search at most 4 CPUs.
> 
> Yes, there would be questions like:
> Why using this linear function to calculate the number of CPUs to
> be scanned? Why choosing 50% as the threshold? These questions will
> be discussed in the [Limitations] section.
> 
> [Benchmark]
> netperf, hackbench, schbench, tbench
> were tested with 25% 50% 75% 100% 125% 150% 175% 200% instance
> of CPU number (these ratios are not CPU utilization). Each test lasts
> for 100 seconds, and repeats 3 times. The system would reboot into a
> fresh environment for each benchmark.
> 
> The following is the benchmark result comparison between
> baseline:vanilla and compare:patched kernel. Positive compare%
> indicates better performance.
> 
> netperf
> =======
> case            	load    	baseline(std%)	compare%( std%)
> TCP_RR          	28 threads	 1.00 (  0.30)	 -1.26 (  0.32)
> TCP_RR          	56 threads	 1.00 (  0.35)	 -1.26 (  0.41)
> TCP_RR          	84 threads	 1.00 (  0.46)	 -0.15 (  0.60)
> TCP_RR          	112 threads	 1.00 (  0.36)	 +0.44 (  0.41)
> TCP_RR          	140 threads	 1.00 (  0.23)	 +0.95 (  0.21)
> TCP_RR          	168 threads	 1.00 (  0.20)	+177.77 (  3.78)
> TCP_RR          	196 threads	 1.00 (  0.18)	+185.43 ( 10.08)
> TCP_RR          	224 threads	 1.00 (  0.16)	+187.86 (  7.32)
> UDP_RR          	28 threads	 1.00 (  0.43)	 -0.93 (  0.27)
> UDP_RR          	56 threads	 1.00 (  0.17)	 -0.39 ( 10.91)
> UDP_RR          	84 threads	 1.00 (  6.36)	 +1.03 (  0.92)
> UDP_RR          	112 threads	 1.00 (  5.55)	 +1.47 ( 17.67)
> UDP_RR          	140 threads	 1.00 ( 18.17)	 +0.31 ( 15.48)
> UDP_RR          	168 threads	 1.00 ( 15.00)	+153.87 ( 13.20)
> UDP_RR          	196 threads	 1.00 ( 16.26)	+169.19 ( 13.78)
> UDP_RR          	224 threads	 1.00 ( 51.81)	+76.72 ( 10.95)
> 
> hackbench
> =========
> (each group has 1/4 * 112 tasks)
> case            	load    	baseline(std%)	compare%( std%)
> process-pipe    	1 group 	 1.00 (  0.47)	 -0.46 (  0.16)
> process-pipe    	2 groups 	 1.00 (  0.42)	 -0.61 (  0.74)
> process-pipe    	3 groups 	 1.00 (  0.42)	 +0.38 (  0.20)
> process-pipe    	4 groups 	 1.00 (  0.15)	 -0.36 (  0.56)
> process-pipe    	5 groups 	 1.00 (  0.20)	 -5.08 (  0.01)
> process-pipe    	6 groups 	 1.00 (  0.28)	 -2.98 (  0.29)
> process-pipe    	7 groups 	 1.00 (  0.08)	 -1.18 (  0.28)
> process-pipe    	8 groups 	 1.00 (  0.11)	 -0.40 (  0.07)
> process-sockets 	1 group 	 1.00 (  0.43)	 -1.93 (  0.58)
> process-sockets 	2 groups 	 1.00 (  0.23)	 -1.10 (  0.49)
> process-sockets 	3 groups 	 1.00 (  1.10)	 -0.96 (  1.12)
> process-sockets 	4 groups 	 1.00 (  0.59)	 -0.08 (  0.88)
> process-sockets 	5 groups 	 1.00 (  0.45)	 +0.31 (  0.34)
> process-sockets 	6 groups 	 1.00 (  0.23)	 +0.06 (  0.66)
> process-sockets 	7 groups 	 1.00 (  0.12)	 +1.72 (  0.20)
> process-sockets 	8 groups 	 1.00 (  0.11)	 +1.98 (  0.02)
> threads-pipe    	1 group 	 1.00 (  1.07)	 +0.03 (  0.40)
> threads-pipe    	2 groups 	 1.00 (  1.05)	 +0.19 (  1.27)
> threads-pipe    	3 groups 	 1.00 (  0.32)	 -0.42 (  0.48)
> threads-pipe    	4 groups 	 1.00 (  0.42)	 -0.76 (  0.79)
> threads-pipe    	5 groups 	 1.00 (  0.19)	 -4.97 (  0.07)
> threads-pipe    	6 groups 	 1.00 (  0.05)	 -4.11 (  0.04)
> threads-pipe    	7 groups 	 1.00 (  0.10)	 -1.13 (  0.16)
> threads-pipe    	8 groups 	 1.00 (  0.03)	 -0.08 (  0.05)
> threads-sockets 	1 group 	 1.00 (  0.33)	 -1.93 (  0.69)
> threads-sockets 	2 groups 	 1.00 (  0.20)	 -1.55 (  0.30)
> threads-sockets 	3 groups 	 1.00 (  0.37)	 -1.29 (  0.59)
> threads-sockets 	4 groups 	 1.00 (  1.83)	 +0.31 (  1.17)
> threads-sockets 	5 groups 	 1.00 (  0.28)	+15.73 (  0.24)
> threads-sockets 	6 groups 	 1.00 (  0.15)	 +5.02 (  0.34)
> threads-sockets 	7 groups 	 1.00 (  0.10)	 +2.29 (  0.14)
> threads-sockets 	8 groups 	 1.00 (  0.17)	 +2.22 (  0.12)
> 
> tbench
> ======
> case            	load    	baseline(std%)	compare%( std%)
> loopback        	28 threads	 1.00 (  0.05)	 -1.39 (  0.04)
> loopback        	56 threads	 1.00 (  0.08)	 -0.37 (  0.04)
> loopback        	84 threads	 1.00 (  0.03)	 +0.20 (  0.13)
> loopback        	112 threads	 1.00 (  0.04)	 +0.69 (  0.04)
> loopback        	140 threads	 1.00 (  0.13)	 +1.15 (  0.21)
> loopback        	168 threads	 1.00 (  0.03)	 +1.62 (  0.08)
> loopback        	196 threads	 1.00 (  0.08)	 +1.50 (  0.30)
> loopback        	224 threads	 1.00 (  0.05)	 +1.62 (  0.05)
> 
> schbench
> ========
> (each mthread group has 1/4 * 112 tasks)
> case            	load    	baseline(std%)	compare%( std%)
> normal          	1 mthread group	 1.00 ( 17.92)	+19.23 ( 23.67)
> normal          	2 mthread groups 1.00 ( 21.10)	 +8.32 ( 16.92)
> normal          	3 mthread groups 1.00 ( 10.80)	+10.03 (  9.21)
> normal          	4 mthread groups 1.00 (  2.67)	 +0.11 (  3.00)
> normal          	5 mthread groups 1.00 (  0.08)	 +0.00 (  0.13)
> normal          	6 mthread groups 1.00 (  2.99)	 -2.66 (  3.87)
> normal          	7 mthread groups 1.00 (  2.16)	 -0.83 (  2.24)
> normal          	8 mthread groups 1.00 (  1.75)	 +0.18 (  3.18)
> 
> According to the results above, when the workloads is heavy, the throughput
> of netperf improves a lot. It might be interesting to look into the reason
> why this patch benefits netperf significantly. Further investigation has
> shown that, this might be a 'side effect' of this patch. It is found that,
> the CPU utilization is around 90% on vanilla kernel, while it is nearly
> 100% on patched kernel. According to the perf profile, with the patch
> applied, the scheduler would likely to choose previous running CPU for the
> waking task, thus reduces runqueue lock contention, so the CPU utilization
> is higher and get better performance.
> 
> [Limitations]
> Q:Why using 50% as the util_avg/capacity threshold to search at most 4 CPUs?
> 
> A: 50% is chosen as that corresponds to almost full CPU utilization, when
>     the CPU is fixed to run at its base frequency, with turbo enabled.
>     4 is the minimal number of CPUs to be scanned in current select_idle_cpu().
> 
>     A synthetic workload was used to simulate different level of
>     load. This workload takes every 10ms as the sample period, and in
>     each sample period:
> 
>     while (!timeout_10ms) {
>          loop(busy_pct_ms);
>     	sleep(10ms-busy_pct_ms)
>     }
> 
>     to simulate busy_pct% of CPU utilization. When the workload is
>     running, the percpu runqueue util_avg was monitored. The
>     following is the result from turbostat's Busy% on CPU2 and
>     cfs_rq[2].util_avg from /sys/kernel/debug/sched/debug:
> 
>     Busy%    util_avg   util_avg/cpu_capacity%
>     10.06    35         3.42
>     19.97    99         9.67
>     29.93    154        15.04
>     39.86    213        20.80
>     49.79    256        25.00
>     59.73    325        31.74
>     69.77    437        42.68
>     79.69    458        44.73
>     89.62    519        50.68
>     99.54    598        58.39
> 
>     The reason why util_avg ratio is not consistent with Busy% might be due
>     to CPU frequency invariance. The CPU is running at fixed lower frequency
>     than the turbo frequency, then the util_avg scales lower than
>     SCHED_CAPACITY_SCALE. In our test platform, the base frequency is 1.9GHz,
>     and the max turbo frequency is 3.7GHz, so 1.9/3.7 is around 50%.
>     In the future maybe we could use arch_scale_freq_capacity()
>     instead of sds->total_capacity, so as to remove the impact from frequency.
>     Then the 50% could be adjusted higher. For now, 50% is an aggressive
>     threshold to restric the idle CPU searching and shows benchmark
>     improvement.
> 
> Q: Why using nr_scan = a - b * sum_util_avg to do linear search?
> 
> A: Ideally the nr_scan could be:
> 
>     nr_scan = sum_util_avg / pelt_avg_scan_cost
> 
>     However consider the overhead of calculating pelt on avg_scan_cost
>     in each wake up, choosing heuristic search for evaluation seems to
>     be an acceptable trade-off.
> 
>     The f(sum_util_avg) could be of any form, as long as it is a monotonically
>     decreasing function. At first f(x) = a - 2^(bx) was chosen. Because when the
>     sum_util_avg is low, the system should try very hard to find an idle CPU. And
>     if sum_util_avg goes higher, the system dramatically lose its interest to search
>     for the idle CPU. But exponential function does have its drawback:
> 
>     Consider a system with 112 CPUs, let f(x) =  112 when x = 0,
>     f(x) = 4 when x = 50, x belongs to [0, 100], then we have:
> 
>     f1(x) = 113 - 2^(x / 7.35)
>     and
>     f2(x) = 112 - 2.16 * x
> 
>     Since kernel does not support floating point, above functions are converted into:
>     nr_scan1(x) = 113 - 2^(x / 7)
>     and
>     nr_scan2(x) = 112 - 2 * x
> 
>     util_avg%      0     1     2  ...   8     9   ... 47   48   49
>     nr_scan1     112   112   112      111   111       49   49    4
>     nr_scan2     112   110   108       96    94       18   16   14
> 
>     According to above result, the granularity of exponential function
>     is coarse-grained, while the linear function is fine-grained.
> 
>     So finally choose linear function. After all, it has shown benchmark
>     benefit without noticeable regression so far.
> 
> Q: How to deal with the following corner case:
> 
>     It is possible that there is unbalanced tasks among CPUs due to CPU affinity.
>     For example, suppose the LLC domain is composed of 6 CPUs, and 5 tasks are bound
>     to CPU0~CPU4, while CPU5 is idle:
> 
> 		CPU0	CPU1	CPU2	CPU3	CPU4	CPU5
>     util_avg	1024	1024	1024	1024	1024	0
> 
>     Since the util_avg ratio is 83%( = 5/6 ), which is higher than 50%, select_idle_cpu()
>     only searches 4 CPUs starting from CPU0, thus leaves idle CPU5 undetected.
> 
>     A possible workaround to mitigate this problem is that, the nr_scan should
>     be increased by the number of idle CPUs found during periodic load balance
>     in update_sd_lb_stats(). In above example, the nr_scan will be adjusted to
>     4 + 1 = 5. Currently I don't have better solution in mind to deal with it
>     gracefully.
> 
> Any comment is appreciated.
> 
> Link: https://lore.kernel.org/lkml/20220207135253.GF23216@worktop.programming.kicks-ass.net/ # [1]
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>   include/linux/sched/topology.h |   1 +
>   kernel/sched/fair.c            | 107 +++++++++++++++++++--------------
>   2 files changed, 63 insertions(+), 45 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8054641c0a7b..aae558459f00 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -81,6 +81,7 @@ struct sched_domain_shared {
>   	atomic_t	ref;
>   	atomic_t	nr_busy_cpus;
>   	int		has_idle_cores;
> +	int		nr_idle_scan;
>   };
>   
>   struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5146163bfabb..59f5f8432c21 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6271,43 +6271,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>   {
>   	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>   	int i, cpu, idle_cpu = -1, nr = INT_MAX;
> -	struct rq *this_rq = this_rq();
> -	int this = smp_processor_id();
> -	struct sched_domain *this_sd;
> -	u64 time = 0;
> -
> -	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> -	if (!this_sd)
> -		return -1;
> +	struct sched_domain_shared *sd_share;
>   
>   	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>   
>   	if (sched_feat(SIS_PROP) && !has_idle_core) {
> -		u64 avg_cost, avg_idle, span_avg;
> -		unsigned long now = jiffies;
> -
> -		/*
> -		 * If we're busy, the assumption that the last idle period
> -		 * predicts the future is flawed; age away the remaining
> -		 * predicted idle time.
> -		 */
> -		if (unlikely(this_rq->wake_stamp < now)) {
> -			while (this_rq->wake_stamp < now && this_rq->wake_avg_idle) {
> -				this_rq->wake_stamp++;
> -				this_rq->wake_avg_idle >>= 1;
> -			}
> -		}
> -
> -		avg_idle = this_rq->wake_avg_idle;
> -		avg_cost = this_sd->avg_scan_cost + 1;
> -
> -		span_avg = sd->span_weight * avg_idle;
> -		if (span_avg > 4*avg_cost)
> -			nr = div_u64(span_avg, avg_cost);
> -		else
> -			nr = 4;
> -
> -		time = cpu_clock(this);
> +		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
> +		if (sd_share)
> +			nr = READ_ONCE(sd_share->nr_idle_scan);
>   	}
>   
>   	for_each_cpu_wrap(cpu, cpus, target + 1) {
> @@ -6328,18 +6299,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>   	if (has_idle_core)
>   		set_idle_cores(target, false);
>   
> -	if (sched_feat(SIS_PROP) && !has_idle_core) {
> -		time = cpu_clock(this) - time;
> -
> -		/*
> -		 * Account for the scan cost of wakeups against the average
> -		 * idle time.
> -		 */
> -		this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
> -
> -		update_avg(&this_sd->avg_scan_cost, time);
> -	}
> -
>   	return idle_cpu;
>   }
>   
> @@ -9199,6 +9158,60 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>   	return idlest;
>   }
>   
> +static inline void update_nr_idle_scan(struct lb_env *env, struct sd_lb_stats *sds,
> +				       unsigned long sum_util)
> +{
> +	struct sched_domain_shared *sd_share;
> +	int llc_size = per_cpu(sd_llc_size, env->dst_cpu);
> +	int nr_scan;
> +
> +	/*
> +	 * Update the number of CPUs to scan in LLC domain, which could
> +	 * be used as a hint in select_idle_cpu(). The update of this hint
> +	 * occurs during periodic load balancing, rather than frequent
> +	 * newidle balance.
> +	 */
> +	if (env->idle == CPU_NEWLY_IDLE || env->sd->span_weight != llc_size)

So nr_scan will probably be updated at llc-domain-lb-interval, which
is llc_size milliseconds. Since the load can vary a lot during such
a period, would this bring accuracy issues?

Best regards
Abel

> +		return;
> +
> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
> +	if (!sd_share)
> +		return;
> +
> +	/*
> +	 * In general, the number of cpus to be scanned should be
> +	 * inversely proportional to the sum_util. That is, the lower
> +	 * the sum_util is, the harder select_idle_cpu() should scan
> +	 * for idle CPU, and vice versa. Let x be the sum_util ratio
> +	 * [0-100] of the LLC domain, f(x) be the number of CPUs scanned:
> +	 *
> +	 * f(x) = a - bx                                              [1]
> +	 *
> +	 * Consider that f(x) = nr_llc when x = 0, and f(x) = 4 when
> +	 * x >= threshold('h' below) then:
> +	 *
> +	 * a = llc_size;
> +	 * b = (nr_llc - 4) / h                                       [2]
> +	 *
> +	 * then [2] becomes:
> +	 *
> +	 * f(x) = llc_size - (llc_size -4)x/h                         [3]
> +	 *
> +	 * Choose 50 (50%) for h as the threshold from experiment result.
> +	 * And since x = 100 * sum_util / total_cap, [3] becomes:
> +	 *
> +	 * f(sum_util)
> +	 *  = llc_size - (llc_size - 4) * 100 * sum_util / total_cap * 50
> +	 *  = llc_size - (llc_size - 4) * 2 * sum_util / total_cap
> +	 *
> +	 */
> +	nr_scan = llc_size - (llc_size - 4) * 2 * sum_util / sds->total_capacity;
> +	if (nr_scan < 4)
> +		nr_scan = 4;
> +
> +	WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> +}
> +
>   /**
>    * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
>    * @env: The load balancing environment.
> @@ -9212,6 +9225,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>   	struct sg_lb_stats *local = &sds->local_stat;
>   	struct sg_lb_stats tmp_sgs;
>   	int sg_status = 0;
> +	unsigned long sum_util = 0;
>   
>   	do {
>   		struct sg_lb_stats *sgs = &tmp_sgs;
> @@ -9242,6 +9256,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>   		/* Now, start updating sd_lb_stats */
>   		sds->total_load += sgs->group_load;
>   		sds->total_capacity += sgs->group_capacity;
> +		sum_util += sgs->group_util;
>   
>   		sg = sg->next;
>   	} while (sg != env->sd->groups);
> @@ -9268,6 +9283,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>   		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
>   		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
>   	}
> +
> +	update_nr_idle_scan(env, sds, sum_util);
>   }
>   
>   #define NUMA_IMBALANCE_MIN 2


* Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
  2022-03-14  4:53 ` Abel Wu
@ 2022-03-14 12:56   ` Chen Yu
  2022-03-14 17:34     ` Tim Chen
  0 siblings, 1 reply; 15+ messages in thread
From: Chen Yu @ 2022-03-14 12:56 UTC (permalink / raw)
  To: Abel Wu
  Cc: linux-kernel, Tim Chen, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Viresh Kumar, Barry Song, Barry Song, Yicong Yang,
	Srikar Dronamraju, Len Brown, Ben Segall,
	Daniel Bristot de Oliveira, Aubrey Li, K Prateek Nayak

Hi Abel,
On Mon, Mar 14, 2022 at 12:53:54PM +0800, Abel Wu wrote:
> Hi Chen,
> 
> On 3/10/22 8:52 AM, Chen Yu wrote:
> > [Problem Statement]
> > Currently select_idle_cpu() uses the percpu average idle time to
> > estimate the total LLC domain idle time, and calculate the number
> > of CPUs to be scanned. This might be inconsistent because idle time
> > of a CPU does not necessarily correlate with idle time of a domain.
> > As a result, the load could be underestimated and causes over searching
> > when the system is very busy.
> > 
> > The following histogram is the time spent in select_idle_cpu(),
> > when running 224 instance of netperf on a system with 112 CPUs
> > per LLC domain:
> > 
> > @usecs:
> > [0]                  533 |                                                    |
> > [1]                 5495 |                                                    |
> > [2, 4)             12008 |                                                    |
> > [4, 8)            239252 |                                                    |
> > [8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
> > [16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
> > [32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
> > [128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
> > [256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
> > [512, 1K)        2600472 |@@@@@@@@@                                           |
> > [1K, 2K)          927912 |@@@                                                 |
> > [2K, 4K)          218720 |                                                    |
> > [4K, 8K)           98161 |                                                    |
> > [8K, 16K)          37722 |                                                    |
> > [16K, 32K)          6715 |                                                    |
> > [32K, 64K)           477 |                                                    |
> > [64K, 128K)            7 |                                                    |
> > 
> > netperf latency:
> > =======
> > case            	load    	    Lat_99th	    std%
> > TCP_RR          	thread-224	      257.39	(  0.21)
> > UDP_RR          	thread-224	      242.83	(  6.29)
> > 
> > The netperf 99th latency(usec) above is comparable with the time spent in
> > select_idle_cpu(). That is to say, when the system is overloaded, searching
> > for idle CPU could be a bottleneck.
> > 
> > [Proposal]
> > The main idea is to replace percpu average idle time with the domain
> > based metric. Choose average CPU utilization(util_avg) as the candidate.
> > In general, the number of CPUs to be scanned should be inversely
> > proportional to the sum of util_avg in this domain. That is, the lower
> > the util_avg is, the more select_idle_cpu() should scan for idle CPU,
> > and vice versa. The benefit of choosing util_avg is that, it is a metric
> > of accumulated historic activity, which seems to be more accurate than
> > instantaneous metrics(such as rq->nr_running).
> > 
> > Furthermore, borrow the util_avg from periodic load balance,
> > which could offload the overhead of select_idle_cpu().
> > 
> > According to last discussion[1], introduced the linear function
> > for experimental purpose:
> 
> It would be better if you can prove it's a linear model by the
> SIS efficiency statistics :)
>
Thanks for your comments!
 
Good point, the SIS efficiency statistics could be used to verify whether
the real search number fits this linear model well; I'll collect them.
But TBH I'm not sure whether there is a convergent/accurate mapping
between the sum of util_avg and the number of CPUs to be scanned (unless we
use sum_util/pelt_scan_cost to approximate it). Choosing a model always
seems to be a heuristic search policy.
> > 
> > f(x) = a - bx
> > 
> >       llc_size
> > x = \Sum      util_avg[cpu] / llc_cpu_capacity
> >       1
> > f(x) is the number of CPUs to be scanned, x is the sum util_avg.
> > To decide a and b, the following condition should be met:
> > 
> > [1] f(0) = llc_size
> > [2] f(x) = 4,  x >= 50%
> > 
> > +	 * newidle balance.
> > +	 */
> > +	if (env->idle == CPU_NEWLY_IDLE || env->sd->span_weight != llc_size)
> 
> So nr_scan will probably be updated at llc-domain-lb-interval, which
> is llc_size milliseconds. Since load can be varied a lot during such
> a period, would this brought accuracy issues?
>
I agree there might be a delay in reflecting the latest utilization.
The sum_util calculated by periodic load balance would, after 112ms,
decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75% of its value.
But considering that this is a server platform, my impression is that
CPU utilization jitter over such a small period of time is not a common
scenario. It seems to be a trade-off: checking the util_avg in the newidle
load balance path would be more frequent, but it also brings overhead -
multiple CPUs write/read the per-LLC shared variable, which introduces
cache false sharing. To make this more robust, maybe we can add a time
interval control in the newidle load balance path too.

thanks,
Chenyu


> Best regards
> Abel


* Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
  2022-03-14 12:56   ` Chen Yu
@ 2022-03-14 17:34     ` Tim Chen
  2022-03-16 16:13       ` Chen Yu
  0 siblings, 1 reply; 15+ messages in thread
From: Tim Chen @ 2022-03-14 17:34 UTC (permalink / raw)
  To: Chen Yu, Abel Wu
  Cc: linux-kernel, Tim Chen, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Viresh Kumar, Barry Song, Barry Song, Yicong Yang,
	Srikar Dronamraju, Len Brown, Ben Segall,
	Daniel Bristot de Oliveira, Aubrey Li, K Prateek Nayak

On Mon, 2022-03-14 at 20:56 +0800, Chen Yu wrote:
> 
> > 
> > So nr_scan will probably be updated at llc-domain-lb-interval, which
> > is llc_size milliseconds. Since load can be varied a lot during such
> > a period, would this brought accuracy issues?
> > 
> I agree there might be delay in reflecting the latest utilization.
> The sum_util calculated by periodic load balance after 112ms would be
> decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%.
> But consider that this is a server platform, I have an impression that
> the CPU utilization jitter during a small period of time is not a regular
> scenario? It seems to be a trade-off. Checking the util_avg in newidle
> load balance path would be more frequent, but it also brings overhead -
> multiple CPUs write/read the per-LLC shared variable and introduces cache
> false sharing. But to make this more robust, maybe we can add time interval
> control in newidle load balance too.
> 
> 

Also, the idea is that we allow ourselves to be non-optimal in terms of
scheduling for the short-term variations.  But we want to make sure that if
there's a long-term trend in the load behavior, the scheduler should
adjust for that.  I think if you see high utilization and the CPUs are
all close to fully busy for quite a while, that is a long-term trend
that overwhelms any short load jitters.

Tim



* Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
  2022-03-10  0:52 [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg Chen Yu
  2022-03-14  4:53 ` Abel Wu
@ 2022-03-15 11:37 ` K Prateek Nayak
  2022-03-16 11:54   ` Chen Yu
  2022-03-17 17:39 ` Yicong Yang
  2 siblings, 1 reply; 15+ messages in thread
From: K Prateek Nayak @ 2022-03-15 11:37 UTC (permalink / raw)
  To: Chen Yu, linux-kernel
  Cc: Tim Chen, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Viresh Kumar, Barry Song, Barry Song, Yicong Yang,
	Srikar Dronamraju, Len Brown, Ben Segall,
	Daniel Bristot de Oliveira, Aubrey Li

Hello Chenyu,

On 3/10/2022 6:22 AM, Chen Yu wrote:
> [Problem Statement]
> Currently select_idle_cpu() uses the percpu average idle time to
> estimate the total LLC domain idle time, and calculate the number
> of CPUs to be scanned. This might be inconsistent because idle time
> of a CPU does not necessarily correlate with idle time of a domain.
> As a result, the load could be underestimated and causes over searching
> when the system is very busy.
>
> The following histogram is the time spent in select_idle_cpu(),
> when running 224 instance of netperf on a system with 112 CPUs
> per LLC domain:
>
> @usecs:
> [0]                  533 |                                                    |
> [1]                 5495 |                                                    |
> [2, 4)             12008 |                                                    |
> [4, 8)            239252 |                                                    |
> [8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
> [16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
> [32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
> [128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
> [256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
> [512, 1K)        2600472 |@@@@@@@@@                                           |
> [1K, 2K)          927912 |@@@                                                 |
> [2K, 4K)          218720 |                                                    |
> [4K, 8K)           98161 |                                                    |
> [8K, 16K)          37722 |                                                    |
> [16K, 32K)          6715 |                                                    |
> [32K, 64K)           477 |                                                    |
> [64K, 128K)            7 |                                                    |
>
> netperf latency:
> =======
> case            	load    	    Lat_99th	    std%
> TCP_RR          	thread-224	      257.39	(  0.21)
> UDP_RR          	thread-224	      242.83	(  6.29)
>
> The netperf 99th latency(usec) above is comparable with the time spent in
> select_idle_cpu(). That is to say, when the system is overloaded, searching
> for idle CPU could be a bottleneck.
>
> [Proposal]
> The main idea is to replace percpu average idle time with the domain
> based metric. Choose average CPU utilization(util_avg) as the candidate.
> In general, the number of CPUs to be scanned should be inversely
> proportional to the sum of util_avg in this domain. That is, the lower
> the util_avg is, the more select_idle_cpu() should scan for idle CPU,
> and vice versa. The benefit of choosing util_avg is that, it is a metric
> of accumulated historic activity, which seems to be more accurate than
> instantaneous metrics(such as rq->nr_running).
>
> Furthermore, borrow the util_avg from periodic load balance,
> which could offload the overhead of select_idle_cpu().
>
> According to last discussion[1], introduced the linear function
> for experimental purpose:
>
> f(x) = a - bx
>
>      llc_size
> x = \Sum      util_avg[cpu] / llc_cpu_capacity
>      1
> f(x) is the number of CPUs to be scanned, x is the sum util_avg.
> To decide a and b, the following condition should be met:
>
> [1] f(0) = llc_size
> [2] f(x) = 4,  x >= 50%
>
> That is to say, when the util_avg is 0, we should search for
> the whole LLC domain. And if util_avg ratio reaches 50% or higher,
> it should search at most 4 CPUs.
>
> Yes, there would be questions like:
> Why using this linear function to calculate the number of CPUs to
> be scanned? Why choosing 50% as the threshold? These questions will
> be discussed in the [Limitations] section.
>
> [Benchmark]
> netperf, hackbench, schbench, tbench
> were tested with 25% 50% 75% 100% 125% 150% 175% 200% instance
> of CPU number (these ratios are not CPU utilization). Each test lasts
> for 100 seconds, and repeats 3 times. The system would reboot into a
> fresh environment for each benchmark.
>
> The following is the benchmark result comparison between
> baseline:vanilla and compare:patched kernel. Positive compare%
> indicates better performance.
>
> netperf
> =======
> case            	load    	baseline(std%)	compare%( std%)
> TCP_RR          	28 threads	 1.00 (  0.30)	 -1.26 (  0.32)
> TCP_RR          	56 threads	 1.00 (  0.35)	 -1.26 (  0.41)
> TCP_RR          	84 threads	 1.00 (  0.46)	 -0.15 (  0.60)
> TCP_RR          	112 threads	 1.00 (  0.36)	 +0.44 (  0.41)
> TCP_RR          	140 threads	 1.00 (  0.23)	 +0.95 (  0.21)
> TCP_RR          	168 threads	 1.00 (  0.20)	+177.77 (  3.78)
> TCP_RR          	196 threads	 1.00 (  0.18)	+185.43 ( 10.08)
> TCP_RR          	224 threads	 1.00 (  0.16)	+187.86 (  7.32)
> UDP_RR          	28 threads	 1.00 (  0.43)	 -0.93 (  0.27)
> UDP_RR          	56 threads	 1.00 (  0.17)	 -0.39 ( 10.91)
> UDP_RR          	84 threads	 1.00 (  6.36)	 +1.03 (  0.92)
> UDP_RR          	112 threads	 1.00 (  5.55)	 +1.47 ( 17.67)
> UDP_RR          	140 threads	 1.00 ( 18.17)	 +0.31 ( 15.48)
> UDP_RR          	168 threads	 1.00 ( 15.00)	+153.87 ( 13.20)
> UDP_RR          	196 threads	 1.00 ( 16.26)	+169.19 ( 13.78)
> UDP_RR          	224 threads	 1.00 ( 51.81)	+76.72 ( 10.95)
>
> hackbench
> =========
> (each group has 1/4 * 112 tasks)
> case            	load    	baseline(std%)	compare%( std%)
> process-pipe    	1 group 	 1.00 (  0.47)	 -0.46 (  0.16)
> process-pipe    	2 groups 	 1.00 (  0.42)	 -0.61 (  0.74)
> process-pipe    	3 groups 	 1.00 (  0.42)	 +0.38 (  0.20)
> process-pipe    	4 groups 	 1.00 (  0.15)	 -0.36 (  0.56)
> process-pipe    	5 groups 	 1.00 (  0.20)	 -5.08 (  0.01)
> process-pipe    	6 groups 	 1.00 (  0.28)	 -2.98 (  0.29)
> process-pipe    	7 groups 	 1.00 (  0.08)	 -1.18 (  0.28)
> process-pipe    	8 groups 	 1.00 (  0.11)	 -0.40 (  0.07)
> process-sockets 	1 group 	 1.00 (  0.43)	 -1.93 (  0.58)
> process-sockets 	2 groups 	 1.00 (  0.23)	 -1.10 (  0.49)
> process-sockets 	3 groups 	 1.00 (  1.10)	 -0.96 (  1.12)
> process-sockets 	4 groups 	 1.00 (  0.59)	 -0.08 (  0.88)
> process-sockets 	5 groups 	 1.00 (  0.45)	 +0.31 (  0.34)
> process-sockets 	6 groups 	 1.00 (  0.23)	 +0.06 (  0.66)
> process-sockets 	7 groups 	 1.00 (  0.12)	 +1.72 (  0.20)
> process-sockets 	8 groups 	 1.00 (  0.11)	 +1.98 (  0.02)
> threads-pipe    	1 group 	 1.00 (  1.07)	 +0.03 (  0.40)
> threads-pipe    	2 groups 	 1.00 (  1.05)	 +0.19 (  1.27)
> threads-pipe    	3 groups 	 1.00 (  0.32)	 -0.42 (  0.48)
> threads-pipe    	4 groups 	 1.00 (  0.42)	 -0.76 (  0.79)
> threads-pipe    	5 groups 	 1.00 (  0.19)	 -4.97 (  0.07)
> threads-pipe    	6 groups 	 1.00 (  0.05)	 -4.11 (  0.04)
> threads-pipe    	7 groups 	 1.00 (  0.10)	 -1.13 (  0.16)
> threads-pipe    	8 groups 	 1.00 (  0.03)	 -0.08 (  0.05)
> threads-sockets 	1 group 	 1.00 (  0.33)	 -1.93 (  0.69)
> threads-sockets 	2 groups 	 1.00 (  0.20)	 -1.55 (  0.30)
> threads-sockets 	3 groups 	 1.00 (  0.37)	 -1.29 (  0.59)
> threads-sockets 	4 groups 	 1.00 (  1.83)	 +0.31 (  1.17)
> threads-sockets 	5 groups 	 1.00 (  0.28)	+15.73 (  0.24)
> threads-sockets 	6 groups 	 1.00 (  0.15)	 +5.02 (  0.34)
> threads-sockets 	7 groups 	 1.00 (  0.10)	 +2.29 (  0.14)
> threads-sockets 	8 groups 	 1.00 (  0.17)	 +2.22 (  0.12)
>
> tbench
> ======
> case            	load    	baseline(std%)	compare%( std%)
> loopback        	28 threads	 1.00 (  0.05)	 -1.39 (  0.04)
> loopback        	56 threads	 1.00 (  0.08)	 -0.37 (  0.04)
> loopback        	84 threads	 1.00 (  0.03)	 +0.20 (  0.13)
> loopback        	112 threads	 1.00 (  0.04)	 +0.69 (  0.04)
> loopback        	140 threads	 1.00 (  0.13)	 +1.15 (  0.21)
> loopback        	168 threads	 1.00 (  0.03)	 +1.62 (  0.08)
> loopback        	196 threads	 1.00 (  0.08)	 +1.50 (  0.30)
> loopback        	224 threads	 1.00 (  0.05)	 +1.62 (  0.05)
>
> schbench
> ========
> (each mthread group has 1/4 * 112 tasks)
> case            	load    	baseline(std%)	compare%( std%)
> normal          	1 mthread group	 1.00 ( 17.92)	+19.23 ( 23.67)
> normal          	2 mthread groups 1.00 ( 21.10)	 +8.32 ( 16.92)
> normal          	3 mthread groups 1.00 ( 10.80)	+10.03 (  9.21)
> normal          	4 mthread groups 1.00 (  2.67)	 +0.11 (  3.00)
> normal          	5 mthread groups 1.00 (  0.08)	 +0.00 (  0.13)
> normal          	6 mthread groups 1.00 (  2.99)	 -2.66 (  3.87)
> normal          	7 mthread groups 1.00 (  2.16)	 -0.83 (  2.24)
> normal          	8 mthread groups 1.00 (  1.75)	 +0.18 (  3.18)
Following are the benchmark results from a Zen3 system (2 x 64C/128T)
in different NPS modes.

NPS Configurations:

NPS modes are used to logically divide a single socket into
multiple NUMA regions.
The following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total of 2 NUMA nodes in the dual-socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total of 4 NUMA nodes exist over the 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total of 8 NUMA nodes exist over the 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255

Benchmark Results:

Kernel versions:
- sched-tip:      5.17-rc5 tip sched/core
- v2_sis_prop:    5.17-rc5 tip sched/core + this patch

~~~~~~~~~
hackbench
~~~~~~~~~

NPS1

Test:                   sched-tip              v2_sis_prop
 1-groups:         5.05 (0.00 pct)         5.00 (0.99 pct)
 2-groups:         5.72 (0.00 pct)         5.63 (1.57 pct)
 4-groups:         6.34 (0.00 pct)         6.21 (2.05 pct)
 8-groups:         7.89 (0.00 pct)         7.80 (1.14 pct)
16-groups:        11.80 (0.00 pct)        11.62 (1.52 pct)

NPS2

Test:                   sched-tip              v2_sis_prop
 1-groups:         4.94 (0.00 pct)         4.93 (0.20 pct)
 2-groups:         5.64 (0.00 pct)         5.55 (1.59 pct)
 4-groups:         6.23 (0.00 pct)         6.07 (2.56 pct)
 8-groups:         7.70 (0.00 pct)         7.46 (3.11 pct)
16-groups:        10.49 (0.00 pct)        10.12 (3.52 pct)

NPS4

Test:                   sched-tip              v2_sis_prop
 1-groups:         4.89 (0.00 pct)         4.97 (-1.63 pct)
 2-groups:         5.43 (0.00 pct)         5.48 (-0.92 pct)
 4-groups:         6.15 (0.00 pct)         6.21 (-0.97 pct)
 8-groups:         7.54 (0.00 pct)         8.07 (-7.02 pct)
16-groups:        10.20 (0.00 pct)        10.13 ( 0.68 pct)

~~~~~~~~
schbench
~~~~~~~~

NPS 1

#workers:        sched-tip               v2_sis_prop
  1:      13.00 (0.00 pct)        14.50 (-11.53 pct)
  2:      31.50 (0.00 pct)        35.00 (-11.11 pct)
  4:      43.00 (0.00 pct)        44.50 (-3.48 pct)
  8:      52.50 (0.00 pct)        52.00 (0.95 pct)
 16:      69.00 (0.00 pct)        68.00 (1.44 pct)
 32:     108.50 (0.00 pct)       108.50 (0.00 pct)
 64:     195.00 (0.00 pct)       192.50 (1.28 pct)
128:     395.50 (0.00 pct)       399.00 (-0.88 pct)
256:     950.00 (0.00 pct)       944.00 (0.63 pct)
512:     60352.00 (0.00 pct)     60608.00 (-0.42 pct)

NPS2

#workers:        sched-tip             v2_sis_prop
  1:      10.00 (0.00 pct)        10.00 (0.00 pct)
  2:      25.00 (0.00 pct)        28.00 (-12.00 pct)
  4:      37.00 (0.00 pct)        37.50 (-1.35 pct)
  8:      50.50 (0.00 pct)        52.50 (-3.96 pct)
 16:      65.50 (0.00 pct)        67.50 (-3.05 pct)
 32:     104.00 (0.00 pct)       104.50 (-0.48 pct)
 64:     190.00 (0.00 pct)       186.50 (1.84 pct)
128:     394.50 (0.00 pct)       398.50 (-1.01 pct)
256:     959.00 (0.00 pct)       979.00 (-2.08 pct)
512:     60352.00 (0.00 pct)     60480.00 (-0.21 pct)

NPS4

#workers:        sched-tip              v2_sis_prop
  1:       9.50 (0.00 pct)        10.00 (-5.26 pct)
  2:      28.00 (0.00 pct)        30.00 (-7.14 pct)
  4:      32.00 (0.00 pct)        36.50 (-14.06 pct)
  8:      42.00 (0.00 pct)        43.00 (-2.38 pct)
 16:      68.00 (0.00 pct)        75.50 (-11.02 pct)
 32:     104.50 (0.00 pct)       106.50 (-1.91 pct)
 64:     186.00 (0.00 pct)       191.50 (-2.95 pct)
128:     382.50 (0.00 pct)       392.50 (-2.61 pct)
256:     966.00 (0.00 pct)       963.00 (0.31 pct)
512:     60480.00 (0.00 pct)     60416.00 (0.10 pct)

~~~~~~
tbench
~~~~~~

NPS 1

Clients:          sched-tip              v2_sis_prop
    1    477.85 (0.00 pct)       470.68 (-1.50 pct)
    2    924.07 (0.00 pct)       910.82 (-1.43 pct)
    4    1778.95 (0.00 pct)      1743.64 (-1.98 pct)
    8    3244.81 (0.00 pct)      3200.35 (-1.37 pct)
   16    5837.06 (0.00 pct)      5808.36 (-0.49 pct)
   32    9339.33 (0.00 pct)      8648.03 (-7.40 pct)
   64    14761.19 (0.00 pct)     15803.13 (7.05 pct)
  128    27806.11 (0.00 pct)     27510.69 (-1.06 pct)
  256    35262.03 (0.00 pct)     34135.78 (-3.19 pct)
  512    52459.78 (0.00 pct)     51630.53 (-1.58 pct)
 1024    52480.67 (0.00 pct)     52439.37 (-0.07 pct)

NPS 2

Clients:          sched-tip              v2_sis_prop
    1    478.98 (0.00 pct)       472.98 (-1.25 pct)
    2    930.52 (0.00 pct)       914.48 (-1.72 pct)
    4    1743.26 (0.00 pct)      1711.16 (-1.84 pct)
    8    3297.07 (0.00 pct)      3161.12 (-4.12 pct)
   16    5779.10 (0.00 pct)      5738.38 (-0.70 pct)
   32    10708.42 (0.00 pct)     10748.26 (0.37 pct)
   64    16965.21 (0.00 pct)     16894.42 (-0.41 pct)
  128    29152.49 (0.00 pct)     28287.31 (-2.96 pct)
  256    27408.75 (0.00 pct)     33680.59 (22.88 pct)
  512    51453.64 (0.00 pct)     47546.87 (-7.59 pct)
 1024    52156.85 (0.00 pct)     51233.28 (-1.77 pct)

NPS 4

Clients:          sched-tip              v2_sis_prop
    1    480.29 (0.00 pct)       473.75 (-1.36 pct)
    2    940.23 (0.00 pct)       915.60 (-2.61 pct)
    4    1760.21 (0.00 pct)      1687.99 (-4.10 pct)
    8    3269.75 (0.00 pct)      3154.02 (-3.53 pct)
   16    5503.71 (0.00 pct)      5485.01 (-0.33 pct)
   32    10633.93 (0.00 pct)     10276.21 (-3.36 pct)
   64    16304.44 (0.00 pct)     15351.17 (-5.84 pct)
  128    26893.95 (0.00 pct)     25337.08 (-5.78 pct)
  256    24469.94 (0.00 pct)     32178.33 (31.50 pct)
  512    46343.65 (0.00 pct)     49607.28 (7.04 pct)
 1024    51496.80 (0.00 pct)     51791.27 (0.57 pct)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS1

Test:               sched-tip             v2_sis_prop
 Copy:   204020.95 (0.00 pct)    211802.20 (3.81 pct)
Scale:   208261.03 (0.00 pct)    210809.40 (1.22 pct)
  Add:   243944.45 (0.00 pct)    249801.81 (2.40 pct)
Triad:   237033.90 (0.00 pct)    242273.45 (2.21 pct)

NPS2

Test:               sched-tip              v2_sis_prop
 Copy:   171679.21 (0.00 pct)    153853.24 (-10.38 pct)
Scale:   191362.43 (0.00 pct)    188219.32 (-1.64 pct)
  Add:   218986.47 (0.00 pct)    204766.66 (-6.49 pct)
Triad:   215118.01 (0.00 pct)    202370.69 (-5.92 pct)

NPS4

Test:               sched-tip              v2_sis_prop
 Copy:   133829.00 (0.00 pct)    125722.97 (-6.05 pct)
Scale:   192074.89 (0.00 pct)    187164.95 (-2.55 pct)
  Add:   186288.73 (0.00 pct)    175316.23 (-5.89 pct)
Triad:   185469.53 (0.00 pct)    175985.74 (-5.11 pct)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS1

Test:               sched-tip             v2_sis_prop
 Copy:   217592.30 (0.00 pct)    231684.51 (6.47 pct)
Scale:   217269.78 (0.00 pct)    218632.93 (0.62 pct)
  Add:   265842.95 (0.00 pct)    266692.50 (0.31 pct)
Triad:   252647.10 (0.00 pct)    253462.69 (0.32 pct)

NPS2

Test:               sched-tip             v2_sis_prop
 Copy:   226948.55 (0.00 pct)    242576.78 (6.88 pct)
Scale:   217968.87 (0.00 pct)    220613.96 (1.21 pct)
  Add:   274839.22 (0.00 pct)    277389.36 (0.92 pct)
Triad:   261920.73 (0.00 pct)    263849.95 (0.73 pct)

NPS4

Test:               sched-tip              v2_sis_prop
 Copy:   256162.84 (0.00 pct)    238881.35 (-6.74 pct)
Scale:   228251.12 (0.00 pct)    228669.16 (0.18 pct)
  Add:   292794.77 (0.00 pct)    293700.42 (0.30 pct)
Triad:   274152.69 (0.00 pct)    274900.77 (0.27 pct)

~~~~~~~~~~~~
ycsb-mongodb
~~~~~~~~~~~~

NPS1

sched-tip:      304934.67 (var: 0.88)
v2_sis_prop:    301560.0  (var: 2.0)    (-1.1%)

NPS2

sched-tip:      303757.0 (var: 1.0)
v2_sis_prop:    302283.0 (var: 0.58)    (-0.48%)

NPS4

sched-tip:      308106.67 (var: 2.88)
v2_sis_prop:    302302.67 (var: 1.12)   (-1.88%)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
schbench, tbench - System 50% loaded to fully loaded
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~
schbench
~~~~~~~~

NPS1

#workers:        sched-tip              v2_sis_prop
128:     395.50 (0.00 pct)       399.00 (-0.88 pct)
144:     441.50 (0.00 pct)       450.00 (-1.92 pct)
160:     488.00 (0.00 pct)       513.00 (-5.12 pct)
176:     549.00 (0.00 pct)       547.00 (0.36 pct)
192:     601.00 (0.00 pct)       589.00 (1.99 pct)
208:     653.00 (0.00 pct)       736.00 (-12.71 pct)
224:     740.00 (0.00 pct)       757.00 (-2.29 pct)
240:     835.00 (0.00 pct)       808.00 (3.23 pct)
256:     950.00 (0.00 pct)       944.00 (0.63 pct)

NPS2:
#workers:        sched-tip              v2_sis_prop
128:     394.50 (0.00 pct)       398.50 (-1.01 pct)
144:     454.50 (0.00 pct)       443.50 (2.42 pct)
160:     500.00 (0.00 pct)       494.00 (1.20 pct)
176:     547.00 (0.00 pct)       530.00 (3.10 pct)
192:     595.00 (0.00 pct)       609.00 (-2.35 pct)
208:     739.00 (0.00 pct)       738.00 (0.13 pct)
224:     781.00 (0.00 pct)       794.00 (-1.66 pct)
240:     894.00 (0.00 pct)       890.00 (0.44 pct)
256:     959.00 (0.00 pct)       979.00 (-2.08 pct)

NPS4

#workers:        sched-tip              v2_sis_prop
128:     382.50 (0.00 pct)       392.50 (-2.61 pct)
144:     451.50 (0.00 pct)       459.00 (-1.66 pct)
160:     491.00 (0.00 pct)       497.00 (-1.22 pct)
176:     578.00 (0.00 pct)       564.00 (2.42 pct)
192:     593.00 (0.00 pct)       612.00 (-3.20 pct)
208:     737.00 (0.00 pct)       720.00 (2.30 pct)
224:     786.00 (0.00 pct)       796.00 (-1.27 pct)
240:     893.00 (0.00 pct)       890.00 (0.33 pct)
256:     966.00 (0.00 pct)       963.00 (0.31 pct)

~~~~~~
tbench
~~~~~~

NPS1

Clients:         sched-tip              v2_sis_prop
128    28369.07 (0.00 pct)     25649.26 (-9.58 pct)
144    25794.95 (0.00 pct)     25464.33 (-1.28 pct)
160    23905.48 (0.00 pct)     23507.18 (-1.66 pct)
176    24219.13 (0.00 pct)     22664.99 (-6.41 pct)
192    23978.71 (0.00 pct)     22506.50 (-6.13 pct)
208    24045.91 (0.00 pct)     22592.62 (-6.04 pct)
224    21961.21 (0.00 pct)     22154.28 (0.87 pct)
240    22001.05 (0.00 pct)     26245.06 (19.29 pct)
256    26866.60 (0.00 pct)     33064.10 (23.06 pct)

NPS2

Clients:         sched-tip              v2_sis_prop
128    25229.75 (0.00 pct)     26396.32 (4.62 pct)
144    27488.16 (0.00 pct)     24596.76 (-10.51 pct)
160    23765.03 (0.00 pct)     23945.55 (0.75 pct)
176    22230.05 (0.00 pct)     22207.84 (-0.09 pct)
192    21383.39 (0.00 pct)     22385.72 (4.68 pct)
208    23920.96 (0.00 pct)     22323.24 (-6.67 pct)
224    22212.38 (0.00 pct)     24108.90 (8.53 pct)
240    22143.36 (0.00 pct)     25655.54 (15.86 pct)
256    29923.16 (0.00 pct)     32570.60 (8.84 pct)

NPS4

Clients:         sched-tip              v2_sis_prop
128    26336.35 (0.00 pct)     24604.90 (-6.57 pct)
144    24469.64 (0.00 pct)     24685.57 (0.88 pct)
160    23742.98 (0.00 pct)     23381.86 (-1.52 pct)
176    22512.46 (0.00 pct)     22602.49 (0.39 pct)
192    21141.71 (0.00 pct)     22752.25 (7.61 pct)
208    21803.07 (0.00 pct)     22280.24 (2.18 pct)
224    21174.10 (0.00 pct)     23582.49 (11.37 pct)
240    21510.36 (0.00 pct)     26295.92 (22.24 pct)
256    25170.50 (0.00 pct)     34215.08 (35.93 pct)

Hackbench shows improvements in NPS1 and NPS2 modes. tbench sees a good
improvement when the system is close to fully loaded. Stream values are very
close to the baseline. YCSB MongoDB shows a slight regression and schbench
is a mixed bag.

This strategy would probably work better on systems with a large number of
CPUs per LLC, but from our observations, I believe it is worthwhile to
search for an idle CPU in the entire LLC on Zen3 systems, as we have only
16 CPUs per LLC.
Only when the system is close to fully loaded do we see a good uplift from
not searching the entire LLC in tbench, which is what we saw with v1 too.
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
  2022-03-15 11:37 ` K Prateek Nayak
@ 2022-03-16 11:54   ` Chen Yu
  2022-03-17 10:39     ` K Prateek Nayak
  0 siblings, 1 reply; 15+ messages in thread
From: Chen Yu @ 2022-03-16 11:54 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-kernel, Tim Chen, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Viresh Kumar, Barry Song, Barry Song, Yicong Yang,
	Srikar Dronamraju, Len Brown, Ben Segall,
	Daniel Bristot de Oliveira, Aubrey Li

Hi Prateek, 
On Tue, Mar 15, 2022 at 05:07:35PM +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 3/10/2022 6:22 AM, Chen Yu wrote:
> > [Problem Statement]
> > Currently select_idle_cpu() uses the percpu average idle time to
> > estimate the total LLC domain idle time, and calculate the number
> > of CPUs to be scanned. This might be inconsistent because idle time
> > of a CPU does not necessarily correlate with idle time of a domain.
> > As a result, the load could be underestimated and causes over searching
> > when the system is very busy.
> >
> > The following histogram is the time spent in select_idle_cpu(),
> > when running 224 instance of netperf on a system with 112 CPUs
> > per LLC domain:
> >
> > @usecs:
> > [0]                  533 |                                                    |
> > [1]                 5495 |                                                    |
> > [2, 4)             12008 |                                                    |
> > [4, 8)            239252 |                                                    |
> > [8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
> > [16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
> > [32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
> > [128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
> > [256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
> > [512, 1K)        2600472 |@@@@@@@@@                                           |
> > [1K, 2K)          927912 |@@@                                                 |
> > [2K, 4K)          218720 |                                                    |
> > [4K, 8K)           98161 |                                                    |
> > [8K, 16K)          37722 |                                                    |
> > [16K, 32K)          6715 |                                                    |
> > [32K, 64K)           477 |                                                    |
> > [64K, 128K)            7 |                                                    |
> >
> > netperf latency:
> > =======
> > case            	load    	    Lat_99th	    std%
> > TCP_RR          	thread-224	      257.39	(  0.21)
> > UDP_RR          	thread-224	      242.83	(  6.29)
> >
> > The netperf 99th latency(usec) above is comparable with the time spent in
> > select_idle_cpu(). That is to say, when the system is overloaded, searching
> > for idle CPU could be a bottleneck.
> >
> > [Proposal]
> > The main idea is to replace percpu average idle time with the domain
> > based metric. Choose average CPU utilization(util_avg) as the candidate.
> > In general, the number of CPUs to be scanned should be inversely
> > proportional to the sum of util_avg in this domain. That is, the lower
> > the util_avg is, the more select_idle_cpu() should scan for idle CPU,
> > and vice versa. The benefit of choosing util_avg is that, it is a metric
> > of accumulated historic activity, which seems to be more accurate than
> > instantaneous metrics(such as rq->nr_running).
> >
> > Furthermore, borrow the util_avg from periodic load balance,
> > which could offload the overhead of select_idle_cpu().
> >
> > According to last discussion[1], introduced the linear function
> > for experimental purpose:
> >
> > f(x) = a - bx
> >
> >      llc_size
> > x = \Sum      util_avg[cpu] / llc_cpu_capacity
> >      1
> > f(x) is the number of CPUs to be scanned, x is the sum util_avg.
> > To decide a and b, the following condition should be met:
> >
> > [1] f(0) = llc_size
> > [2] f(x) = 4,  x >= 50%
> >
> > That is to say, when the util_avg is 0, we should search for
> > the whole LLC domain. And if util_avg ratio reaches 50% or higher,
> > it should search at most 4 CPUs.
> >
> > Yes, there would be questions like:
> > Why using this linear function to calculate the number of CPUs to
> > be scanned? Why choosing 50% as the threshold? These questions will
> > be discussed in the [Limitations] section.
> >
> > [Benchmark]
> > netperf, hackbench, schbench, tbench
> > were tested with 25% 50% 75% 100% 125% 150% 175% 200% instance
> > of CPU number (these ratios are not CPU utilization). Each test lasts
> > for 100 seconds, and repeats 3 times. The system would reboot into a
> > fresh environment for each benchmark.
> >
> > The following is the benchmark result comparison between
> > baseline:vanilla and compare:patched kernel. Positive compare%
> > indicates better performance.
> >
> > netperf
> > =======
> > case            	load    	baseline(std%)	compare%( std%)
> > TCP_RR          	28 threads	 1.00 (  0.30)	 -1.26 (  0.32)
> > TCP_RR          	56 threads	 1.00 (  0.35)	 -1.26 (  0.41)
> > TCP_RR          	84 threads	 1.00 (  0.46)	 -0.15 (  0.60)
> > TCP_RR          	112 threads	 1.00 (  0.36)	 +0.44 (  0.41)
> > TCP_RR          	140 threads	 1.00 (  0.23)	 +0.95 (  0.21)
> > TCP_RR          	168 threads	 1.00 (  0.20)	+177.77 (  3.78)
> > TCP_RR          	196 threads	 1.00 (  0.18)	+185.43 ( 10.08)
> > TCP_RR          	224 threads	 1.00 (  0.16)	+187.86 (  7.32)
> > UDP_RR          	28 threads	 1.00 (  0.43)	 -0.93 (  0.27)
> > UDP_RR          	56 threads	 1.00 (  0.17)	 -0.39 ( 10.91)
> > UDP_RR          	84 threads	 1.00 (  6.36)	 +1.03 (  0.92)
> > UDP_RR          	112 threads	 1.00 (  5.55)	 +1.47 ( 17.67)
> > UDP_RR          	140 threads	 1.00 ( 18.17)	 +0.31 ( 15.48)
> > UDP_RR          	168 threads	 1.00 ( 15.00)	+153.87 ( 13.20)
> > UDP_RR          	196 threads	 1.00 ( 16.26)	+169.19 ( 13.78)
> > UDP_RR          	224 threads	 1.00 ( 51.81)	+76.72 ( 10.95)
> >
> > hackbench
> > =========
> > (each group has 1/4 * 112 tasks)
> > case            	load    	baseline(std%)	compare%( std%)
> > process-pipe    	1 group 	 1.00 (  0.47)	 -0.46 (  0.16)
> > process-pipe    	2 groups 	 1.00 (  0.42)	 -0.61 (  0.74)
> > process-pipe    	3 groups 	 1.00 (  0.42)	 +0.38 (  0.20)
> > process-pipe    	4 groups 	 1.00 (  0.15)	 -0.36 (  0.56)
> > process-pipe    	5 groups 	 1.00 (  0.20)	 -5.08 (  0.01)
> > process-pipe    	6 groups 	 1.00 (  0.28)	 -2.98 (  0.29)
> > process-pipe    	7 groups 	 1.00 (  0.08)	 -1.18 (  0.28)
> > process-pipe    	8 groups 	 1.00 (  0.11)	 -0.40 (  0.07)
> > process-sockets 	1 group 	 1.00 (  0.43)	 -1.93 (  0.58)
> > process-sockets 	2 groups 	 1.00 (  0.23)	 -1.10 (  0.49)
> > process-sockets 	3 groups 	 1.00 (  1.10)	 -0.96 (  1.12)
> > process-sockets 	4 groups 	 1.00 (  0.59)	 -0.08 (  0.88)
> > process-sockets 	5 groups 	 1.00 (  0.45)	 +0.31 (  0.34)
> > process-sockets 	6 groups 	 1.00 (  0.23)	 +0.06 (  0.66)
> > process-sockets 	7 groups 	 1.00 (  0.12)	 +1.72 (  0.20)
> > process-sockets 	8 groups 	 1.00 (  0.11)	 +1.98 (  0.02)
> > threads-pipe    	1 group 	 1.00 (  1.07)	 +0.03 (  0.40)
> > threads-pipe    	2 groups 	 1.00 (  1.05)	 +0.19 (  1.27)
> > threads-pipe    	3 groups 	 1.00 (  0.32)	 -0.42 (  0.48)
> > threads-pipe    	4 groups 	 1.00 (  0.42)	 -0.76 (  0.79)
> > threads-pipe    	5 groups 	 1.00 (  0.19)	 -4.97 (  0.07)
> > threads-pipe    	6 groups 	 1.00 (  0.05)	 -4.11 (  0.04)
> > threads-pipe    	7 groups 	 1.00 (  0.10)	 -1.13 (  0.16)
> > threads-pipe    	8 groups 	 1.00 (  0.03)	 -0.08 (  0.05)
> > threads-sockets 	1 group 	 1.00 (  0.33)	 -1.93 (  0.69)
> > threads-sockets 	2 groups 	 1.00 (  0.20)	 -1.55 (  0.30)
> > threads-sockets 	3 groups 	 1.00 (  0.37)	 -1.29 (  0.59)
> > threads-sockets 	4 groups 	 1.00 (  1.83)	 +0.31 (  1.17)
> > threads-sockets 	5 groups 	 1.00 (  0.28)	+15.73 (  0.24)
> > threads-sockets 	6 groups 	 1.00 (  0.15)	 +5.02 (  0.34)
> > threads-sockets 	7 groups 	 1.00 (  0.10)	 +2.29 (  0.14)
> > threads-sockets 	8 groups 	 1.00 (  0.17)	 +2.22 (  0.12)
> >
> > tbench
> > ======
> > case            	load    	baseline(std%)	compare%( std%)
> > loopback        	28 threads	 1.00 (  0.05)	 -1.39 (  0.04)
> > loopback        	56 threads	 1.00 (  0.08)	 -0.37 (  0.04)
> > loopback        	84 threads	 1.00 (  0.03)	 +0.20 (  0.13)
> > loopback        	112 threads	 1.00 (  0.04)	 +0.69 (  0.04)
> > loopback        	140 threads	 1.00 (  0.13)	 +1.15 (  0.21)
> > loopback        	168 threads	 1.00 (  0.03)	 +1.62 (  0.08)
> > loopback        	196 threads	 1.00 (  0.08)	 +1.50 (  0.30)
> > loopback        	224 threads	 1.00 (  0.05)	 +1.62 (  0.05)
> >
> > schbench
> > ========
> > (each mthread group has 1/4 * 112 tasks)
> > case            	load    	baseline(std%)	compare%( std%)
> > normal          	1 mthread group	 1.00 ( 17.92)	+19.23 ( 23.67)
> > normal          	2 mthread groups 1.00 ( 21.10)	 +8.32 ( 16.92)
> > normal          	3 mthread groups 1.00 ( 10.80)	+10.03 (  9.21)
> > normal          	4 mthread groups 1.00 (  2.67)	 +0.11 (  3.00)
> > normal          	5 mthread groups 1.00 (  0.08)	 +0.00 (  0.13)
> > normal          	6 mthread groups 1.00 (  2.99)	 -2.66 (  3.87)
> > normal          	7 mthread groups 1.00 (  2.16)	 -0.83 (  2.24)
> > normal          	8 mthread groups 1.00 (  1.75)	 +0.18 (  3.18)
> Following are the results on benchmarks from Zen3 system (2 x 64C/128T)
> on different NPS modes.
> 
Thanks for the testing.
> NPS Configurations:
> 
> NPS Modes are used to logically divide single socket into
> multiple NUMA region.
> Following is the NUMA configuration for each NPS mode on the system:
> 
> NPS1: Each socket is a NUMA node.
>     Total 2 NUMA nodes in the dual socket machine.
> 
>     Node 0: 0-63,   128-191
>     Node 1: 64-127, 192-255
> 
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>     Total 4 NUMA nodes exist over 2 socket.
>    
>     Node 0: 0-31,   128-159
>     Node 1: 32-63,  160-191
>     Node 2: 64-95,  192-223
>     Node 3: 96-127, 224-255
> 
> NPS4: Each socket is logically divided into 4 NUMA regions.
>     Total 8 NUMA nodes exist over 2 socket.
>    
>     Node 0: 0-15,    128-143
>     Node 1: 16-31,   144-159
>     Node 2: 32-47,   160-175
>     Node 3: 48-63,   176-191
>     Node 4: 64-79,   192-207
>     Node 5: 80-95,   208-223
>     Node 6: 96-111,  224-239
>     Node 7: 112-127, 240-255
> 
> Benchmark Results:
> 
> Kernel versions:
> - sched-tip:      5.17-rc5 tip sched/core
> - v2_sis_prop:    5.17-rc5 tip sched/core + this patch
> 
Just wondering what the kernel version was when you tested v1?
https://lore.kernel.org/lkml/4ca9ba48-20f0-84d5-6a38-11f9d4c7a028@amd.com/
It seems that there is a slight performance difference between the old baseline
and the current 5.17-rc5 tip sched/core.
> ~~~~~~~~~
> hackbench
> ~~~~~~~~~
> 
> NPS1
> 
> Test:                   sched-tip              v2_sis_prop
>  1-groups:         5.05 (0.00 pct)         5.00 (0.99 pct)
>  2-groups:         5.72 (0.00 pct)         5.63 (1.57 pct)
>  4-groups:         6.34 (0.00 pct)         6.21 (2.05 pct)
>  8-groups:         7.89 (0.00 pct)         7.80 (1.14 pct)
> 16-groups:        11.80 (0.00 pct)        11.62 (1.52 pct)
> 
> NPS2
> 
> Test:                   sched-tip              v2_sis_prop
>  1-groups:         4.94 (0.00 pct)         4.93 (0.20 pct)
>  2-groups:         5.64 (0.00 pct)         5.55 (1.59 pct)
>  4-groups:         6.23 (0.00 pct)         6.07 (2.56 pct)
>  8-groups:         7.70 (0.00 pct)         7.46 (3.11 pct)
> 16-groups:        10.49 (0.00 pct)        10.12 (3.52 pct)
> 
> NPS4
> 
> Test:                   sched-tip              v2_sis_prop
>  1-groups:         4.89 (0.00 pct)         4.97 (-1.63 pct)
>  2-groups:         5.43 (0.00 pct)         5.48 (-0.92 pct)
>  4-groups:         6.15 (0.00 pct)         6.21 (-0.97 pct)
>  8-groups:         7.54 (0.00 pct)         8.07 (-7.02 pct)
> 16-groups:        10.20 (0.00 pct)        10.13 ( 0.68 pct)
> 
> ~~~~~~~~
> schbench
> ~~~~~~~~
> 
> NPS 1
> 
> #workers:        sched-tip               v2_sis_prop
>   1:      13.00 (0.00 pct)        14.50 (-11.53 pct)
>   2:      31.50 (0.00 pct)        35.00 (-11.11 pct)
It seems that in the old result:
NPS Mode - NPS1
#workers:       sched-tip                util-avg
  1:      13.00 (0.00 pct)        14.50 (-11.53 pct)
  2:      31.50 (0.00 pct)        34.00 (-7.93 pct)
we still saw some regressions, although in the v1 patch
there is no logic change when the utilization is below 85%.
I think this might be measurement deviation when the load is low.
For example, in the v2 test of schbench, 3 cycles of testing were
launched:
case                load            baseline(std%)  compare%( std%)
normal              1 mthread group  1.00 ( 17.92)  +19.23 ( 23.67)
The standard deviation ratio is 23%, which seems to be relatively
large. But considering that the v2 patch has changed how aggressively
we search for an idle CPU, even at low utilization, this result
needs to be evaluated.
>   4:      43.00 (0.00 pct)        44.50 (-3.48 pct)
>   8:      52.50 (0.00 pct)        52.00 (0.95 pct)
>  16:      69.00 (0.00 pct)        68.00 (1.44 pct)
>  32:     108.50 (0.00 pct)       108.50 (0.00 pct)
>  64:     195.00 (0.00 pct)       192.50 (1.28 pct)
> 128:     395.50 (0.00 pct)       399.00 (-0.88 pct)
> 256:     950.00 (0.00 pct)       944.00 (0.63 pct)
> 512:     60352.00 (0.00 pct)     60608.00 (-0.42 pct)
> 
> NPS2
> 
> #workers:        sched-tip             v2_sis_prop
>   1:      10.00 (0.00 pct)        10.00 (0.00 pct)
>   2:      25.00 (0.00 pct)        28.00 (-12.00 pct)
>   4:      37.00 (0.00 pct)        37.50 (-1.35 pct)
>   8:      50.50 (0.00 pct)        52.50 (-3.96 pct)
>  16:      65.50 (0.00 pct)        67.50 (-3.05 pct)
>  32:     104.00 (0.00 pct)       104.50 (-0.48 pct)
>  64:     190.00 (0.00 pct)       186.50 (1.84 pct)
> 128:     394.50 (0.00 pct)       398.50 (-1.01 pct)
> 256:     959.00 (0.00 pct)       979.00 (-2.08 pct)
> 512:     60352.00 (0.00 pct)     60480.00 (-0.21 pct)
> 
> NPS4
> 
> #workers:        sched-tip              v2_sis_prop
>   1:       9.50 (0.00 pct)        10.00 (-5.26 pct)
>   2:      28.00 (0.00 pct)        30.00 (-7.14 pct)
>   4:      32.00 (0.00 pct)        36.50 (-14.06 pct)
>   8:      42.00 (0.00 pct)        43.00 (-2.38 pct)
>  16:      68.00 (0.00 pct)        75.50 (-11.02 pct)

>  32:     104.50 (0.00 pct)       106.50 (-1.91 pct)
>  64:     186.00 (0.00 pct)       191.50 (-2.95 pct)
> 128:     382.50 (0.00 pct)       392.50 (-2.61 pct)
> 256:     966.00 (0.00 pct)       963.00 (0.31 pct)
> 512:     60480.00 (0.00 pct)     60416.00 (0.10 pct)
> 
> ~~~~~~
> tbench
> ~~~~~~
> 
> NPS 1
> 
> Clients:          sched-tip              v2_sis_prop
>     1    477.85 (0.00 pct)       470.68 (-1.50 pct)
>     2    924.07 (0.00 pct)       910.82 (-1.43 pct)
>     4    1778.95 (0.00 pct)      1743.64 (-1.98 pct)
>     8    3244.81 (0.00 pct)      3200.35 (-1.37 pct)
>    16    5837.06 (0.00 pct)      5808.36 (-0.49 pct)
>    32    9339.33 (0.00 pct)      8648.03 (-7.40 pct)
>    64    14761.19 (0.00 pct)     15803.13 (7.05 pct)
>   128    27806.11 (0.00 pct)     27510.69 (-1.06 pct)
>   256    35262.03 (0.00 pct)     34135.78 (-3.19 pct)
The result from v1 patch:
NPS Mode - NPS1
Clients:        sched-tip               util-avg
256    26726.29 (0.00 pct)     52502.83 (96.44 pct)
>   512    52459.78 (0.00 pct)     51630.53 (-1.58 pct)
>  1024    52480.67 (0.00 pct)     52439.37 (-0.07 pct)
> 
> NPS 2
> 
> Clients:          sched-tip              v2_sis_prop
>     1    478.98 (0.00 pct)       472.98 (-1.25 pct)
>     2    930.52 (0.00 pct)       914.48 (-1.72 pct)
>     4    1743.26 (0.00 pct)      1711.16 (-1.84 pct)
>     8    3297.07 (0.00 pct)      3161.12 (-4.12 pct)
>    16    5779.10 (0.00 pct)      5738.38 (-0.70 pct)
>    32    10708.42 (0.00 pct)     10748.26 (0.37 pct)
>    64    16965.21 (0.00 pct)     16894.42 (-0.41 pct)
>   128    29152.49 (0.00 pct)     28287.31 (-2.96 pct)
>   256    27408.75 (0.00 pct)     33680.59 (22.88 pct)
The result from v1 patch:
256    27654.49 (0.00 pct)     47126.18 (70.41 pct)
>   512    51453.64 (0.00 pct)     47546.87 (-7.59 pct)
>  1024    52156.85 (0.00 pct)     51233.28 (-1.77 pct)
> 
> NPS 4
> 
> Clients:          sched-tip              v2_sis_prop
>     1    480.29 (0.00 pct)       473.75 (-1.36 pct)
>     2    940.23 (0.00 pct)       915.60 (-2.61 pct)
>     4    1760.21 (0.00 pct)      1687.99 (-4.10 pct)
>     8    3269.75 (0.00 pct)      3154.02 (-3.53 pct)
>    16    5503.71 (0.00 pct)      5485.01 (-0.33 pct)
>    32    10633.93 (0.00 pct)     10276.21 (-3.36 pct)
>    64    16304.44 (0.00 pct)     15351.17 (-5.84 pct)
>   128    26893.95 (0.00 pct)     25337.08 (-5.78 pct)
>   256    24469.94 (0.00 pct)     32178.33 (31.50 pct)
The result from v1 patch:
256    25997.38 (0.00 pct)     47735.83 (83.61 pct)

In the above three cases, v2 shows a smaller improvement compared to
v1. In both v1 and v2, the improvement mainly comes from choosing the
previous running CPU as the target when the system is busy. But
v2 is more likely to choose a previous CPU than v1, because its 50%
threshold is lower than the 85% in v1. Then why does v2 show less improvement
than v1? It seems that the v2 patch only changes the SIS_PROP logic for
the single idle CPU search, but does not touch the idle core search.
Meanwhile, v1 limits both the idle CPU and idle core searches, and this
might explain the extra benefit from the v1 patch IMO.
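
To make that structural difference concrete, below is a toy user-space
model (all names are illustrative, not the fair.c implementation): the
util_avg based nr_scan limit throttles only the single idle-CPU walk,
while the idle-core walk covers the whole LLC.

#include <stdbool.h>
#include <stdio.h>

#define LLC_SIZE 16

/* Toy model: walk the LLC looking for an idle CPU.  The nr_scan limit
 * only applies when there is no idle core, i.e. to the single
 * idle-CPU scan that v2 throttles. */
static int scan_llc(const bool idle[LLC_SIZE], bool has_idle_core, int nr_scan)
{
	for (int cpu = 0; cpu < LLC_SIZE; cpu++) {
		if (!has_idle_core && nr_scan-- <= 0)
			return -1;	/* scan depth exhausted, give up */
		if (idle[cpu])
			return cpu;
	}
	return -1;
}

int main(void)
{
	bool idle[LLC_SIZE] = { [14] = true };	/* only CPU 14 is idle */

	/* the unlimited idle-core style walk finds CPU 14 ... */
	printf("idle-core search: %d\n", scan_llc(idle, true, 4));
	/* ... while the throttled idle-CPU walk gives up after 4 CPUs */
	printf("idle-CPU search:  %d\n", scan_llc(idle, false, 4));
	return 0;
}
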
>   512    46343.65 (0.00 pct)     49607.28 (7.04 pct)
>  1024    51496.80 (0.00 pct)     51791.27 (0.57 pct)
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Stream with 16 threads.
> built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> NPS1
> 
> Test:               sched-tip             v2_sis_prop
>  Copy:   204020.95 (0.00 pct)    211802.20 (3.81 pct)
> Scale:   208261.03 (0.00 pct)    210809.40 (1.22 pct)
>   Add:   243944.45 (0.00 pct)    249801.81 (2.40 pct)
> Triad:   237033.90 (0.00 pct)    242273.45 (2.21 pct)
> 
> NPS2
> 
> Test:               sched-tip              v2_sis_prop
>  Copy:   171679.21 (0.00 pct)    153853.24 (-10.38 pct)
> Scale:   191362.43 (0.00 pct)    188219.32 (-1.64 pct)
>   Add:   218986.47 (0.00 pct)    204766.66 (-6.49 pct)
> Triad:   215118.01 (0.00 pct)    202370.69 (-5.92 pct)
> 
> NPS4
> 
> Test:               sched-tip              v2_sis_prop
>  Copy:   133829.00 (0.00 pct)    125722.97 (-6.05 pct)
> Scale:   192074.89 (0.00 pct)    187164.95 (-2.55 pct)
>   Add:   186288.73 (0.00 pct)    175316.23 (-5.89 pct)
> Triad:   185469.53 (0.00 pct)    175985.74 (-5.11 pct)
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Stream with 16 threads.
> built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> NPS1
> 
> Test:               sched-tip             v2_sis_prop
>  Copy:   217592.30 (0.00 pct)    231684.51 (6.47 pct)
> Scale:   217269.78 (0.00 pct)    218632.93 (0.62 pct)
>   Add:   265842.95 (0.00 pct)    266692.50 (0.31 pct)
> Triad:   252647.10 (0.00 pct)    253462.69 (0.32 pct)
> 
> NPS2
> 
> Test:               sched-tip             v2_sis_prop
>  Copy:   226948.55 (0.00 pct)    242576.78 (6.88 pct)
> Scale:   217968.87 (0.00 pct)    220613.96 (1.21 pct)
>   Add:   274839.22 (0.00 pct)    277389.36 (0.92 pct)
> Triad:   261920.73 (0.00 pct)    263849.95 (0.73 pct)
> 
> NPS4
> 
> Test:               sched-tip              v2_sis_prop
>  Copy:   256162.84 (0.00 pct)    238881.35 (-6.74 pct)
> Scale:   228251.12 (0.00 pct)    228669.16 (0.18 pct)
>   Add:   292794.77 (0.00 pct)    293700.42 (0.30 pct)
> Triad:   274152.69 (0.00 pct)    274900.77 (0.27 pct)
> 
> ~~~~~~~~~~~~
> ycsb-mongodb
> ~~~~~~~~~~~~
> 
> NPS1
> 
> sched-tip:      304934.67 (var: 0.88)
> v2_sis_prop:    301560.0  (var: 2.0)    (-1.1%)
> 
> NPS2
> 
> sched-tip:      303757.0 (var: 1.0)
> v2_sis_prop:    302283.0 (var: 0.58)    (-0.48%)
> 
> NPS4
> 
> sched-tip:      308106.67 (var: 2.88)
> v2_sis_prop:    302302.67 (var: 1.12)   (-1.88%)
>
May I know the average CPU utilization of this benchmark?
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> schbench, tbench - System 50% loaded to fully loaded
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> ~~~~~~~~
> schbench
> ~~~~~~~~
> 
> NPS1
> 
> #workers:        sched-tip              v2_sis_prop
> 128:     395.50 (0.00 pct)       399.00 (-0.88 pct)
> 144:     441.50 (0.00 pct)       450.00 (-1.92 pct)
> 160:     488.00 (0.00 pct)       513.00 (-5.12 pct)
> 176:     549.00 (0.00 pct)       547.00 (0.36 pct)
> 192:     601.00 (0.00 pct)       589.00 (1.99 pct)
> 208:     653.00 (0.00 pct)       736.00 (-12.71 pct)
> 224:     740.00 (0.00 pct)       757.00 (-2.29 pct)
> 240:     835.00 (0.00 pct)       808.00 (3.23 pct)
> 256:     950.00 (0.00 pct)       944.00 (0.63 pct)
> 
> NPS2:
> #workers:        sched-tip              v2_sis_prop
> 128:     394.50 (0.00 pct)       398.50 (-1.01 pct)
> 144:     454.50 (0.00 pct)       443.50 (2.42 pct)
> 160:     500.00 (0.00 pct)       494.00 (1.20 pct)
> 176:     547.00 (0.00 pct)       530.00 (3.10 pct)
> 192:     595.00 (0.00 pct)       609.00 (-2.35 pct)
> 208:     739.00 (0.00 pct)       738.00 (0.13 pct)
> 224:     781.00 (0.00 pct)       794.00 (-1.66 pct)
> 240:     894.00 (0.00 pct)       890.00 (0.44 pct)
> 256:     959.00 (0.00 pct)       979.00 (-2.08 pct)
> 
> NPS4
> 
> #workers:        sched-tip              v2_sis_prop
> 128:     382.50 (0.00 pct)       392.50 (-2.61 pct)
> 144:     451.50 (0.00 pct)       459.00 (-1.66 pct)
> 160:     491.00 (0.00 pct)       497.00 (-1.22 pct)
> 176:     578.00 (0.00 pct)       564.00 (2.42 pct)
> 192:     593.00 (0.00 pct)       612.00 (-3.20 pct)
> 208:     737.00 (0.00 pct)       720.00 (2.30 pct)
> 224:     786.00 (0.00 pct)       796.00 (-1.27 pct)
> 240:     893.00 (0.00 pct)       890.00 (0.33 pct)
> 256:     966.00 (0.00 pct)       963.00 (0.31 pct)
> 
> ~~~~~~
> tbench
> ~~~~~~
> 
> NPS1
> 
> Clients:         sched-tip              v2_sis_prop
> 128    28369.07 (0.00 pct)     25649.26 (-9.58 pct)
> 144    25794.95 (0.00 pct)     25464.33 (-1.28 pct)
> 160    23905.48 (0.00 pct)     23507.18 (-1.66 pct)
> 176    24219.13 (0.00 pct)     22664.99 (-6.41 pct)
> 192    23978.71 (0.00 pct)     22506.50 (-6.13 pct)
> 208    24045.91 (0.00 pct)     22592.62 (-6.04 pct)
> 224    21961.21 (0.00 pct)     22154.28 (0.87 pct)
> 240    22001.05 (0.00 pct)     26245.06 (19.29 pct)
> 256    26866.60 (0.00 pct)     33064.10 (23.06 pct)
> 
> NPS2
> 
> Clients:         sched-tip              v2_sis_prop
> 128    25229.75 (0.00 pct)     26396.32 (4.62 pct)
> 144    27488.16 (0.00 pct)     24596.76 (-10.51 pct)
> 160    23765.03 (0.00 pct)     23945.55 (0.75 pct)
> 176    22230.05 (0.00 pct)     22207.84 (-0.09 pct)
> 192    21383.39 (0.00 pct)     22385.72 (4.68 pct)
> 208    23920.96 (0.00 pct)     22323.24 (-6.67 pct)
> 224    22212.38 (0.00 pct)     24108.90 (8.53 pct)
> 240    22143.36 (0.00 pct)     25655.54 (15.86 pct)
> 256    29923.16 (0.00 pct)     32570.60 (8.84 pct)
> 
> NPS4
> 
> Clients:         sched-tip              v2_sis_prop
> 128    26336.35 (0.00 pct)     24604.90 (-6.57 pct)
> 144    24469.64 (0.00 pct)     24685.57 (0.88 pct)
> 160    23742.98 (0.00 pct)     23381.86 (-1.52 pct)
> 176    22512.46 (0.00 pct)     22602.49 (0.39 pct)
> 192    21141.71 (0.00 pct)     22752.25 (7.61 pct)
> 208    21803.07 (0.00 pct)     22280.24 (2.18 pct)
> 224    21174.10 (0.00 pct)     23582.49 (11.37 pct)
> 240    21510.36 (0.00 pct)     26295.92 (22.24 pct)
> 256    25170.50 (0.00 pct)     34215.08 (35.93 pct)
> 
> Hackbench shows improvements in NPS1 and NPS2 modes. tbench sees good
> improvement when system is close to fully loaded. Stream values are very
> close to the baseline. YCSB MongoDB shows slight regression and schbench
> is a mixed bag.
> 
> This strategy would probably work better on systems with large number of
> CPUs per LLC but from our observations, I believe it is worthwhile to
> search for an idle CPUs in the entire LLC on Zen3 systems as we have
> 16 CPUs per LLC.
> Only when the system is close to fully loaded, we see a good uplift by not
> searching the entire LLC in tbench which is what we saw with the v1 too.
I see. But we might have to make this per-LLC search generic, for both smaller
and bigger LLC sizes. The exponential descent function currently used could
increase the number of CPUs to be searched when the system is not busy. I'll
think about it and do some investigation.

thanks,
Chenyu
> --
> Thanks and Regards,
> Prateek

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
  2022-03-14 17:34     ` Tim Chen
@ 2022-03-16 16:13       ` Chen Yu
  0 siblings, 0 replies; 15+ messages in thread
From: Chen Yu @ 2022-03-16 16:13 UTC (permalink / raw)
  To: Tim Chen
  Cc: Abel Wu, linux-kernel, Tim Chen, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Viresh Kumar, Barry Song, Barry Song, Yicong Yang,
	Srikar Dronamraju, Len Brown, Ben Segall,
	Daniel Bristot de Oliveira, Aubrey Li, K Prateek Nayak

On Mon, Mar 14, 2022 at 10:34:30AM -0700, Tim Chen wrote:
> On Mon, 2022-03-14 at 20:56 +0800, Chen Yu wrote:
> > 
> > > 
> > > So nr_scan will probably be updated at llc-domain-lb-interval, which
> > > is llc_size milliseconds. Since load can be varied a lot during such
> > > a period, would this brought accuracy issues?
> > > 
> > I agree there might be a delay in reflecting the latest utilization.
> > The sum_util calculated by periodic load balance would, after 112ms, have
> > decayed to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%.
> > But considering that this is a server platform, my impression is that
> > CPU utilization jitter during a small period of time is not a regular
> > scenario. It seems to be a trade-off. Checking the util_avg in the newidle
> > load balance path would be more frequent, but it also brings overhead -
> > multiple CPUs write/read the per-LLC shared variable and introduce cache
> > false sharing. But to make this more robust, maybe we can add time interval
> > control in the newidle load balance too.
> > 
> > 
> 
> Also the idea is we allow ourselves to be non-optimal in terms of
> scheduling for the short term variations.  But we want to make sure that if
> there's a long term trend in the load behavior, the scheduler should
> adjust for that.  I think if you see high utilization and CPUs are
> all close to fully busy for quite a while, that is a long term trend 
> that overwhelms any short load jitters.
>
Agreed. Following the long-term trend, the scheduler should approach an
optimal state.
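
As a quick back-of-the-envelope check of the ~8.75% decay figure quoted
above (assuming PELT's 32ms half-life; plain user-space math, not kernel
code):

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* PELT half-life is 32ms; after 112ms the remaining fraction is
	 * 0.5^(112/32) = 0.5^3.5, close to the 8.75% estimate above. */
	double remaining = pow(0.5, 112.0 / 32.0);

	printf("remaining after 112ms: %.2f%%\n", remaining * 100.0); /* ~8.84% */
	return 0;
}
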

thanks,
Chenyu 
> Tim
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
  2022-03-16 11:54   ` Chen Yu
@ 2022-03-17 10:39     ` K Prateek Nayak
  2022-04-02 14:35       ` K Prateek Nayak
  0 siblings, 1 reply; 15+ messages in thread
From: K Prateek Nayak @ 2022-03-17 10:39 UTC (permalink / raw)
  To: Chen Yu
  Cc: linux-kernel, Tim Chen, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Viresh Kumar, Barry Song, Barry Song, Yicong Yang,
	Srikar Dronamraju, Len Brown, Ben Segall,
	Daniel Bristot de Oliveira, Aubrey Li

Hello Chenyu,

Thank you for looking into the results.

On 3/16/2022 5:24 PM, Chen Yu wrote:
> [..snip..]
> Just wonder what the kernel version was when you tested v1?
> https://lore.kernel.org/lkml/4ca9ba48-20f0-84d5-6a38-11f9d4c7a028@amd.com/
> It seems that there is slight performance difference between the old baseline
> and current 5.17-rc5 tip sched/core.
I'll make a point to include the HEAD commit from next time onward to
remove this ambiguity.

- While testing v1, the sched-tip was at:
  commit: 3624ba7b5e2a ("sched/numa-balancing: Move some document to make it consistent with the code")

- While testing v2, the sched-tip was at:
  commit: a0a7e453b502 ("sched/preempt: Tell about PREEMPT_DYNAMIC on kernel headers")
>> [..snip..]
>>
>> ~~~~~~~~
>> schbench
>> ~~~~~~~~
>>
>> NPS 1
>>
>> #workers:        sched-tip               v2_sis_prop
>>   1:      13.00 (0.00 pct)        14.50 (-11.53 pct)
>>   2:      31.50 (0.00 pct)        35.00 (-11.11 pct)
> It seems that in the old result:
> NPS Mode - NPS1
> #workers:       sched-tip                util-avg
>   1:      13.00 (0.00 pct)        14.50 (-11.53 pct)
>   2:      31.50 (0.00 pct)        34.00 (-7.93 pct)
> we still saw some downgradings. Although in the v1 patch,
> there is no logic change when the utilization is below 85%.
> I'm thinking of this might be deviation when the load is low.
> For example in v2 test of schbench, 3 cycles of testings were
> launched:
> case                load            baseline(std%)  compare%( std%)
> normal              1 mthread group  1.00 ( 17.92)  +19.23 ( 23.67)
> The standard deviation ratio is 23%, which seams to be relatively
> large. But consider that v2 patch has changed the logic of how aggressive
> we searching for a idle CPU, even in low utilization, this result
> needs to be evaluated.
We too see a lot of variation for schbench. For the two-worker case, the
following is the data from 10 runs in NPS1 mode:

- sched-tip data

    Min           : 23.00
    Max           : 40.00
    Median        : 31.50
    AMean         : 30.50
    GMean         : 29.87
    HMean         : 29.25
    AMean Stddev  : 6.55
    GMean Stddev  : 6.59
    HMean Stddev  : 6.68
    AMean CoefVar : 21.49 pct
    GMean CoefVar : 22.05 pct
    HMean CoefVar : 22.85 pct

- v2_sis_prop data

    Min           : 22.00
    Max           : 41.00
    Median        : 35.00
    AMean         : 33.50
    GMean         : 32.84
    HMean         : 32.13
    AMean Stddev  : 6.64
    GMean Stddev  : 6.67
    HMean Stddev  : 6.79
    AMean CoefVar : 19.81 pct
    GMean CoefVar : 20.32 pct
    HMean CoefVar : 21.14 pct

The median of the data was reported previously.
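
For reference, this is how summary statistics like the ones above are
typically computed; the run values in the snippet are made up for
illustration, not the actual 10 measurements:

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* hypothetical per-run latencies, not the real data */
	double runs[] = { 23, 25, 28, 30, 31, 32, 34, 36, 38, 40 };
	int n = sizeof(runs) / sizeof(runs[0]);
	double sum = 0, log_sum = 0, inv_sum = 0, var = 0;

	for (int i = 0; i < n; i++) {
		sum += runs[i];
		log_sum += log(runs[i]);
		inv_sum += 1.0 / runs[i];
	}

	double amean = sum / n;			/* arithmetic mean */
	double gmean = exp(log_sum / n);	/* geometric mean */
	double hmean = n / inv_sum;		/* harmonic mean */

	for (int i = 0; i < n; i++)
		var += (runs[i] - amean) * (runs[i] - amean);
	double stddev = sqrt(var / (n - 1));	/* sample standard deviation */

	printf("AMean %.2f  GMean %.2f  HMean %.2f\n", amean, gmean, hmean);
	printf("Stddev %.2f  CoefVar %.2f pct\n", stddev, 100.0 * stddev / amean);
	return 0;
}
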
> [..snip..]
>> ~~~~~~
>> tbench
>> ~~~~~~
>>
>> NPS 1
>>
>> Clients:          sched-tip              v2_sis_prop
>>     1    477.85 (0.00 pct)       470.68 (-1.50 pct)
>>     2    924.07 (0.00 pct)       910.82 (-1.43 pct)
>>     4    1778.95 (0.00 pct)      1743.64 (-1.98 pct)
>>     8    3244.81 (0.00 pct)      3200.35 (-1.37 pct)
>>    16    5837.06 (0.00 pct)      5808.36 (-0.49 pct)
>>    32    9339.33 (0.00 pct)      8648.03 (-7.40 pct)
>>    64    14761.19 (0.00 pct)     15803.13 (7.05 pct)
>>   128    27806.11 (0.00 pct)     27510.69 (-1.06 pct)
>>   256    35262.03 (0.00 pct)     34135.78 (-3.19 pct)
> The result from v1 patch:
> NPS Mode - NPS1
> Clients:        sched-tip               util-avg
> 256    26726.29 (0.00 pct)     52502.83 (96.44 pct)
>>   512    52459.78 (0.00 pct)     51630.53 (-1.58 pct)
>>  1024    52480.67 (0.00 pct)     52439.37 (-0.07 pct)
>>
>> NPS 2
>>
>> Clients:          sched-tip              v2_sis_prop
>>     1    478.98 (0.00 pct)       472.98 (-1.25 pct)
>>     2    930.52 (0.00 pct)       914.48 (-1.72 pct)
>>     4    1743.26 (0.00 pct)      1711.16 (-1.84 pct)
>>     8    3297.07 (0.00 pct)      3161.12 (-4.12 pct)
>>    16    5779.10 (0.00 pct)      5738.38 (-0.70 pct)
>>    32    10708.42 (0.00 pct)     10748.26 (0.37 pct)
>>    64    16965.21 (0.00 pct)     16894.42 (-0.41 pct)
>>   128    29152.49 (0.00 pct)     28287.31 (-2.96 pct)
>>   256    27408.75 (0.00 pct)     33680.59 (22.88 pct)
> The result from v1 patch:
> 256    27654.49 (0.00 pct)     47126.18 (70.41 pct)
>>   512    51453.64 (0.00 pct)     47546.87 (-7.59 pct)
>>  1024    52156.85 (0.00 pct)     51233.28 (-1.77 pct)
>>
>> NPS 4
>>
>> Clients:          sched-tip              v2_sis_prop
>>     1    480.29 (0.00 pct)       473.75 (-1.36 pct)
>>     2    940.23 (0.00 pct)       915.60 (-2.61 pct)
>>     4    1760.21 (0.00 pct)      1687.99 (-4.10 pct)
>>     8    3269.75 (0.00 pct)      3154.02 (-3.53 pct)
>>    16    5503.71 (0.00 pct)      5485.01 (-0.33 pct)
>>    32    10633.93 (0.00 pct)     10276.21 (-3.36 pct)
>>    64    16304.44 (0.00 pct)     15351.17 (-5.84 pct)
>>   128    26893.95 (0.00 pct)     25337.08 (-5.78 pct)
>>   256    24469.94 (0.00 pct)     32178.33 (31.50 pct)
> The result from v1 patch:
> 256    25997.38 (0.00 pct)     47735.83 (83.61 pct)
>
> In above three cases, v2 has smaller improvement compared to
> v1. In both v1 and v2, the improvement mainly comes from choosing
> previous running CPU as the target, when the system is busy. But
> v2 is more likely to choose a previous CPU than v1, because its
> threshold 50% is lower than 85% from v2. Then why v2 has less improvement
> than v1? It seems that v2 patch only changes the logic of SIS_PRO for
> single idle CPU searching, but not touches the idle Core searching.
> Meanwhile v1 limits both idle CPU and idle Core searching, and this
> might explain the extra benefit from v1 patch IMO.
Yes, this might be the case.
>> [..snip..]
>> ~~~~~~~~~~~~
>> ycsb-mongodb
>> ~~~~~~~~~~~~
>>
>> NPS1
>>
>> sched-tip:      304934.67 (var: 0.88)
>> v2_sis_prop:    301560.0  (var: 2.0)    (-1.1%)
>>
>> NPS2
>>
>> sched-tip:      303757.0 (var: 1.0)
>> v2_sis_prop:    302283.0 (var: 0.58)    (-0.48%)
>>
>> NPS4
>>
>> sched-tip:      308106.67 (var: 2.88)
>> v2_sis_prop:    302302.67 (var: 1.12)   (-1.88%)
>>
> May I know the average CPU utilization of this benchmark?
I don't have this data at hand. I'll get back to you soon with the data.
> [..snip..]
> I see. But we might have to make this per-LLC search generic, both for smaller
> size and bigger size. Current using exponential descent function could increase the
> number of CPUs to be searched when the system is not busy. I'll think about it
> and do some investigation.
It would indeed be great to have this work well for all LLC sizes.
Thank you for looking into it :)
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
  2022-03-10  0:52 [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg Chen Yu
  2022-03-14  4:53 ` Abel Wu
  2022-03-15 11:37 ` K Prateek Nayak
@ 2022-03-17 17:39 ` Yicong Yang
  2022-03-18  3:43   ` Chen Yu
  2 siblings, 1 reply; 15+ messages in thread
From: Yicong Yang @ 2022-03-17 17:39 UTC (permalink / raw)
  To: Chen Yu, linux-kernel
  Cc: Tim Chen, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Viresh Kumar, Barry Song, Barry Song, Yicong Yang,
	Srikar Dronamraju, Len Brown, Ben Segall,
	Daniel Bristot de Oliveira, Aubrey Li, K Prateek Nayak

Hi Chen,

Thanks for the update. I'm still testing this along with the sched cluster patches.
I'll share some results when I have enough data. Some questions below.

On 2022/3/10 8:52, Chen Yu wrote:
> [Problem Statement]
> Currently select_idle_cpu() uses the percpu average idle time to
> estimate the total LLC domain idle time, and calculate the number
> of CPUs to be scanned. This might be inconsistent because idle time
> of a CPU does not necessarily correlate with idle time of a domain.
> As a result, the load could be underestimated and causes over searching
> when the system is very busy.
>
> The following histogram is the time spent in select_idle_cpu(),
> when running 224 instance of netperf on a system with 112 CPUs
> per LLC domain:
>
> @usecs:
> [0]                  533 |                                                    |
> [1]                 5495 |                                                    |
> [2, 4)             12008 |                                                    |
> [4, 8)            239252 |                                                    |
> [8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
> [16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
> [32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
> [128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
> [256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
> [512, 1K)        2600472 |@@@@@@@@@                                           |
> [1K, 2K)          927912 |@@@                                                 |
> [2K, 4K)          218720 |                                                    |
> [4K, 8K)           98161 |                                                    |
> [8K, 16K)          37722 |                                                    |
> [16K, 32K)          6715 |                                                    |
> [32K, 64K)           477 |                                                    |
> [64K, 128K)            7 |                                                    |
>
> netperf latency:
> =======
> case            	load    	    Lat_99th	    std%
> TCP_RR          	thread-224	      257.39	(  0.21)
> UDP_RR          	thread-224	      242.83	(  6.29)
>
> The netperf 99th latency(usec) above is comparable with the time spent in
> select_idle_cpu(). That is to say, when the system is overloaded, searching
> for idle CPU could be a bottleneck.
>
> [Proposal]
> The main idea is to replace percpu average idle time with the domain
> based metric. Choose average CPU utilization(util_avg) as the candidate.
> In general, the number of CPUs to be scanned should be inversely
> proportional to the sum of util_avg in this domain. That is, the lower
> the util_avg is, the more select_idle_cpu() should scan for idle CPU,
> and vice versa. The benefit of choosing util_avg is that, it is a metric
> of accumulated historic activity, which seems to be more accurate than
> instantaneous metrics(such as rq->nr_running).
>
> Furthermore, borrow the util_avg from periodic load balance,
> which could offload the overhead of select_idle_cpu().
>
> According to last discussion[1], introduced the linear function
> for experimental purpose:
>
> f(x) = a - bx
>
>      llc_size
> x = \Sum      util_avg[cpu] / llc_cpu_capacity
>      1
> f(x) is the number of CPUs to be scanned, x is the sum util_avg.
> To decide a and b, the following condition should be met:
>
> [1] f(0) = llc_size
> [2] f(x) = 4,  x >= 50%
>
> That is to say, when the util_avg is 0, we should search for
> the whole LLC domain. And if util_avg ratio reaches 50% or higher,
> it should search at most 4 CPUs.

I might have a question here. In your v1 patch, we won't scan when the LLC
util is above 85%. But in this patch we'll always scan 4 CPUs no matter how
much the LLC is overloaded. When the LLC is rather busy the scan is probably
redundant, so would it be better to find a threshold for stopping the scan?
The util_avg cannot indicate how much the CPU is overloaded, so perhaps just
stop scanning when it is 100% utilized.
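
For illustration, here is a minimal user-space sketch of the linear mapping
described in the quoted text, with a hypothetical 100% cutoff added along the
lines suggested above (the cutoff and all names are illustrative, not part of
the posted patch):

#include <stdio.h>

/* f(x) = a - b*x with f(0) = llc_size and f(0.5) = 4, which gives
 * a = llc_size and b = 2 * (llc_size - 4). */
static int nr_cpus_to_scan(int llc_size, double util_ratio)
{
	double nr = llc_size - 2.0 * (llc_size - 4) * util_ratio;

	if (util_ratio >= 1.0)
		return 0;	/* hypothetical: LLC fully utilized, skip the scan */
	if (nr < 4)
		return 4;	/* scan at least 4 CPUs otherwise */
	return (int)nr;
}

int main(void)
{
	int llc_size = 112;	/* e.g. the 112-CPU LLC mentioned earlier */

	for (double x = 0.0; x <= 1.0; x += 0.25)
		printf("util %3.0f%% -> scan %d CPUs\n", x * 100,
		       nr_cpus_to_scan(llc_size, x));
	return 0;
}
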

>
> Yes, there would be questions like:
> Why using this linear function to calculate the number of CPUs to
> be scanned? Why choosing 50% as the threshold? These questions will
> be discussed in the [Limitations] section.
>
> [Benchmark]
> netperf, hackbench, schbench, tbench
> were tested with 25% 50% 75% 100% 125% 150% 175% 200% instance
> of CPU number (these ratios are not CPU utilization). Each test lasts
> for 100 seconds, and repeats 3 times. The system would reboot into a
> fresh environment for each benchmark.
>
> The following is the benchmark result comparison between
> baseline:vanilla and compare:patched kernel. Positive compare%
> indicates better performance.
>
> netperf
> =======
> case            	load    	baseline(std%)	compare%( std%)
> TCP_RR          	28 threads	 1.00 (  0.30)	 -1.26 (  0.32)
> TCP_RR          	56 threads	 1.00 (  0.35)	 -1.26 (  0.41)
> TCP_RR          	84 threads	 1.00 (  0.46)	 -0.15 (  0.60)
> TCP_RR          	112 threads	 1.00 (  0.36)	 +0.44 (  0.41)
> TCP_RR          	140 threads	 1.00 (  0.23)	 +0.95 (  0.21)
> TCP_RR          	168 threads	 1.00 (  0.20)	+177.77 (  3.78)
> TCP_RR          	196 threads	 1.00 (  0.18)	+185.43 ( 10.08)
> TCP_RR          	224 threads	 1.00 (  0.16)	+187.86 (  7.32)
> UDP_RR          	28 threads	 1.00 (  0.43)	 -0.93 (  0.27)
> UDP_RR          	56 threads	 1.00 (  0.17)	 -0.39 ( 10.91)
> UDP_RR          	84 threads	 1.00 (  6.36)	 +1.03 (  0.92)
> UDP_RR          	112 threads	 1.00 (  5.55)	 +1.47 ( 17.67)
> UDP_RR          	140 threads	 1.00 ( 18.17)	 +0.31 ( 15.48)
> UDP_RR          	168 threads	 1.00 ( 15.00)	+153.87 ( 13.20)
> UDP_RR          	196 threads	 1.00 ( 16.26)	+169.19 ( 13.78)
> UDP_RR          	224 threads	 1.00 ( 51.81)	+76.72 ( 10.95)
>
> hackbench
> =========
> (each group has 1/4 * 112 tasks)
> case            	load    	baseline(std%)	compare%( std%)
> process-pipe    	1 group 	 1.00 (  0.47)	 -0.46 (  0.16)
> process-pipe    	2 groups 	 1.00 (  0.42)	 -0.61 (  0.74)
> process-pipe    	3 groups 	 1.00 (  0.42)	 +0.38 (  0.20)
> process-pipe    	4 groups 	 1.00 (  0.15)	 -0.36 (  0.56)
> process-pipe    	5 groups 	 1.00 (  0.20)	 -5.08 (  0.01)
> process-pipe    	6 groups 	 1.00 (  0.28)	 -2.98 (  0.29)
> process-pipe    	7 groups 	 1.00 (  0.08)	 -1.18 (  0.28)
> process-pipe    	8 groups 	 1.00 (  0.11)	 -0.40 (  0.07)
> process-sockets 	1 group 	 1.00 (  0.43)	 -1.93 (  0.58)
> process-sockets 	2 groups 	 1.00 (  0.23)	 -1.10 (  0.49)
> process-sockets 	3 groups 	 1.00 (  1.10)	 -0.96 (  1.12)
> process-sockets 	4 groups 	 1.00 (  0.59)	 -0.08 (  0.88)
> process-sockets 	5 groups 	 1.00 (  0.45)	 +0.31 (  0.34)
> process-sockets 	6 groups 	 1.00 (  0.23)	 +0.06 (  0.66)
> process-sockets 	7 groups 	 1.00 (  0.12)	 +1.72 (  0.20)
> process-sockets 	8 groups 	 1.00 (  0.11)	 +1.98 (  0.02)
> threads-pipe    	1 group 	 1.00 (  1.07)	 +0.03 (  0.40)
> threads-pipe    	2 groups 	 1.00 (  1.05)	 +0.19 (  1.27)
> threads-pipe    	3 groups 	 1.00 (  0.32)	 -0.42 (  0.48)
> threads-pipe    	4 groups 	 1.00 (  0.42)	 -0.76 (  0.79)
> threads-pipe    	5 groups 	 1.00 (  0.19)	 -4.97 (  0.07)
> threads-pipe    	6 groups 	 1.00 (  0.05)	 -4.11 (  0.04)
> threads-pipe    	7 groups 	 1.00 (  0.10)	 -1.13 (  0.16)
> threads-pipe    	8 groups 	 1.00 (  0.03)	 -0.08 (  0.05)
> threads-sockets 	1 group 	 1.00 (  0.33)	 -1.93 (  0.69)
> threads-sockets 	2 groups 	 1.00 (  0.20)	 -1.55 (  0.30)
> threads-sockets 	3 groups 	 1.00 (  0.37)	 -1.29 (  0.59)
> threads-sockets 	4 groups 	 1.00 (  1.83)	 +0.31 (  1.17)
> threads-sockets 	5 groups 	 1.00 (  0.28)	+15.73 (  0.24)
> threads-sockets 	6 groups 	 1.00 (  0.15)	 +5.02 (  0.34)
> threads-sockets 	7 groups 	 1.00 (  0.10)	 +2.29 (  0.14)
> threads-sockets 	8 groups 	 1.00 (  0.17)	 +2.22 (  0.12)
>
> tbench
> ======
> case            	load    	baseline(std%)	compare%( std%)
> loopback        	28 threads	 1.00 (  0.05)	 -1.39 (  0.04)
> loopback        	56 threads	 1.00 (  0.08)	 -0.37 (  0.04)
> loopback        	84 threads	 1.00 (  0.03)	 +0.20 (  0.13)
> loopback        	112 threads	 1.00 (  0.04)	 +0.69 (  0.04)
> loopback        	140 threads	 1.00 (  0.13)	 +1.15 (  0.21)
> loopback        	168 threads	 1.00 (  0.03)	 +1.62 (  0.08)
> loopback        	196 threads	 1.00 (  0.08)	 +1.50 (  0.30)
> loopback        	224 threads	 1.00 (  0.05)	 +1.62 (  0.05)
>
> schbench
> ========
> (each mthread group has 1/4 * 112 tasks)
> case            	load    	baseline(std%)	compare%( std%)
> normal          	1 mthread group	 1.00 ( 17.92)	+19.23 ( 23.67)
> normal          	2 mthread groups 1.00 ( 21.10)	 +8.32 ( 16.92)
> normal          	3 mthread groups 1.00 ( 10.80)	+10.03 (  9.21)
> normal          	4 mthread groups 1.00 (  2.67)	 +0.11 (  3.00)
> normal          	5 mthread groups 1.00 (  0.08)	 +0.00 (  0.13)
> normal          	6 mthread groups 1.00 (  2.99)	 -2.66 (  3.87)
> normal          	7 mthread groups 1.00 (  2.16)	 -0.83 (  2.24)
> normal          	8 mthread groups 1.00 (  1.75)	 +0.18 (  3.18)
>
> According to the results above, when the workload is heavy, the throughput
> of netperf improves a lot. It might be interesting to look into the reason
> why this patch benefits netperf significantly. Further investigation has
> shown that this might be a 'side effect' of this patch: the CPU utilization
> is around 90% on the vanilla kernel, while it is nearly 100% on the patched
> kernel. According to the perf profile, with the patch applied the scheduler
> is more likely to choose the previous running CPU for the waking task, which
> reduces runqueue lock contention, so the CPU utilization is higher and the
> performance is better.
>
> [Limitations]
> Q: Why use 50% as the util_avg/capacity threshold for searching at most 4 CPUs?
>
> A: 50% is chosen because that corresponds to almost full CPU utilization when
>    the CPU is fixed to run at its base frequency, with turbo enabled.
>    4 is the minimal number of CPUs to be scanned in the current select_idle_cpu().
>
>    A synthetic workload was used to simulate different levels of
>    load. This workload uses 10ms as the sample period, and in
>    each sample period:
>
>    while (!timeout_10ms) {
>         loop(busy_pct_ms);
>         sleep(10ms - busy_pct_ms);
>    }
>
>    to simulate busy_pct% of CPU utilization. When the workload is
>    running, the percpu runqueue util_avg was monitored. The
>    following is the result from turbostat's Busy% on CPU2 and
>    cfs_rq[2].util_avg from /sys/kernel/debug/sched/debug:
>
>    Busy%    util_avg   util_avg/cpu_capacity%
>    10.06    35         3.42
>    19.97    99         9.67
>    29.93    154        15.04
>    39.86    213        20.80
>    49.79    256        25.00
>    59.73    325        31.74
>    69.77    437        42.68
>    79.69    458        44.73
>    89.62    519        50.68
>    99.54    598        58.39
>
>    The reason why the util_avg ratio is not consistent with Busy% might be
>    CPU frequency invariance: the CPU is running at a fixed frequency lower
>    than the turbo frequency, so the util_avg scales lower than
>    SCHED_CAPACITY_SCALE. On our test platform, the base frequency is 1.9GHz,
>    and the max turbo frequency is 3.7GHz, so 1.9/3.7 is around 50%.
>    In the future maybe we could use arch_scale_freq_capacity()
>    instead of sds->total_capacity, so as to remove the impact of frequency.
>    Then the 50% could be adjusted higher. For now, 50% is an aggressive
>    threshold to restrict the idle CPU search, and it shows benchmark
>    improvement.
>
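As a side note, for anyone who wants to reproduce the Busy%/util_avg table
above, a minimal userspace sketch of such a load generator could look like
the following. The 10ms period and the busy_pct split follow the pseudocode
above; the spin/sleep implementation and all names here are illustrative only:

#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define PERIOD_US	10000	/* 10ms sample period */

static long now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000L + ts.tv_nsec / 1000;
}

int main(int argc, char **argv)
{
	int busy_pct = argc > 1 ? atoi(argv[1]) : 50;
	long busy_us = PERIOD_US * busy_pct / 100;

	for (;;) {
		long start = now_us();

		/* burn busy_pct% of the period ... */
		while (now_us() - start < busy_us)
			;
		/* ... then sleep for the remainder */
		usleep(PERIOD_US - busy_us);
	}
	return 0;
}
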
> Q: Why use nr_scan = a - b * sum_util_avg to do a linear search?
>
> A: Ideally the nr_scan could be:
>
>    nr_scan = sum_util_avg / pelt_avg_scan_cost
>
>    However, considering the overhead of calculating PELT on avg_scan_cost
>    on each wakeup, choosing a heuristic search seems to be an acceptable
>    trade-off.
>
>    f(sum_util_avg) could be of any form, as long as it is a monotonically
>    decreasing function. At first f(x) = a - 2^(bx) was chosen, because when
>    sum_util_avg is low the system should try very hard to find an idle CPU,
>    and as sum_util_avg grows the system quickly loses interest in searching
>    for an idle CPU. But the exponential function has its drawback:
>
>    Consider a system with 112 CPUs, and let f(x) = 112 when x = 0,
>    f(x) = 4 when x = 50, with x in [0, 100]. Then we have:
>
>    f1(x) = 113 - 2^(x / 7.35)
>    and
>    f2(x) = 112 - 2.16 * x
>
>    Since the kernel does not support floating point, the above functions are converted into:
>    nr_scan1(x) = 113 - 2^(x / 7)
>    and
>    nr_scan2(x) = 112 - 2 * x
>
>    util_avg%      0     1     2  ...   8     9   ... 47   48   49
>    nr_scan1     112   112   112      111   111       49   49    4
>    nr_scan2     112   110   108       96    94       18   16   14
>
>    According to the above results, the exponential function is
>    coarse-grained, while the linear function is fine-grained.
>
>    So the linear function was finally chosen. After all, it has shown
>    benchmark benefit without noticeable regression so far.
>
> Q: How to deal with the following corner case:
>
>    It is possible that there are unbalanced tasks among CPUs due to CPU affinity.
>    For example, suppose the LLC domain is composed of 6 CPUs, and 5 tasks are bound
>    to CPU0~CPU4, while CPU5 is idle:
>
> 		CPU0	CPU1	CPU2	CPU3	CPU4	CPU5
>    util_avg	1024	1024	1024	1024	1024	0
>
>    Since the util_avg ratio is 83% (= 5/6), which is higher than 50%, select_idle_cpu()
>    only searches 4 CPUs starting from CPU0, thus leaving the idle CPU5 undetected.
>
>    A possible workaround to mitigate this problem is to increase nr_scan by
>    the number of idle CPUs found during periodic load balance in
>    update_sd_lb_stats(). In the above example, nr_scan would be adjusted to
>    4 + 1 = 5. Currently I don't have a better solution in mind to deal with
>    this gracefully.

Without CPU affinity, is it possible that we also meet this case? Considering we always
scan from the target CPU, and the farther CPUs have less chance to be checked, the scan
probability of each CPU is not equal. When the util_avg ratio is >50%, after several
wakeups from CPU0, CPUs 1~4 will be non-idle and the following scans may fail without
checking CPU5.
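
To make that concrete, a minimal userspace sketch (not the kernel code) of a
bounded wrap-around scan shows why a distant idle CPU can stay invisible when
nr is small; the 6-CPU layout and the helper names below are made up for
illustration:

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS	6

/* CPU0..CPU4 busy, CPU5 idle, as in the example above */
static bool cpu_idle[NR_CPUS] = { false, false, false, false, false, true };

/* inspect at most nr CPUs, wrapping around from target + 1 */
static int scan_for_idle(int target, int nr)
{
	for (int i = 1; i <= NR_CPUS && nr > 0; i++, nr--) {
		int cpu = (target + i) % NR_CPUS;

		if (cpu_idle[cpu])
			return cpu;
	}
	return -1;
}

int main(void)
{
	/* nr = 4: only CPU1..CPU4 are inspected, the idle CPU5 is missed */
	printf("nr=4 -> %d\n", scan_for_idle(0, 4));
	/* nr = 5 (4 + one idle CPU seen by load balance): CPU5 is found */
	printf("nr=5 -> %d\n", scan_for_idle(0, 5));
	return 0;
}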

>
> Any comment is appreciated.
>
> Link: https://lore.kernel.org/lkml/20220207135253.GF23216@worktop.programming.kicks-ass.net/ # [1]
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>  include/linux/sched/topology.h |   1 +
>  kernel/sched/fair.c            | 107 +++++++++++++++++++--------------
>  2 files changed, 63 insertions(+), 45 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8054641c0a7b..aae558459f00 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -81,6 +81,7 @@ struct sched_domain_shared {
>  	atomic_t	ref;
>  	atomic_t	nr_busy_cpus;
>  	int		has_idle_cores;
> +	int		nr_idle_scan;
>  };
>  
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5146163bfabb..59f5f8432c21 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6271,43 +6271,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  {
>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
> -	struct rq *this_rq = this_rq();
> -	int this = smp_processor_id();
> -	struct sched_domain *this_sd;
> -	u64 time = 0;
> -
> -	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> -	if (!this_sd)
> -		return -1;
> +	struct sched_domain_shared *sd_share;
>  
>  	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>  
>  	if (sched_feat(SIS_PROP) && !has_idle_core) {
> -		u64 avg_cost, avg_idle, span_avg;
> -		unsigned long now = jiffies;
> -
> -		/*
> -		 * If we're busy, the assumption that the last idle period
> -		 * predicts the future is flawed; age away the remaining
> -		 * predicted idle time.
> -		 */
> -		if (unlikely(this_rq->wake_stamp < now)) {
> -			while (this_rq->wake_stamp < now && this_rq->wake_avg_idle) {
> -				this_rq->wake_stamp++;
> -				this_rq->wake_avg_idle >>= 1;
> -			}
> -		}
> -
> -		avg_idle = this_rq->wake_avg_idle;
> -		avg_cost = this_sd->avg_scan_cost + 1;
> -

With this patch, sd->avg_scan_cost, rq->{wake_stamp, wake_avg_idle} may have no users.

Thanks,
Yicong

> -		span_avg = sd->span_weight * avg_idle;
> -		if (span_avg > 4*avg_cost)
> -			nr = div_u64(span_avg, avg_cost);
> -		else
> -			nr = 4;
> -
> -		time = cpu_clock(this);
> +		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
> +		if (sd_share)
> +			nr = READ_ONCE(sd_share->nr_idle_scan);
>  	}
>  
>  	for_each_cpu_wrap(cpu, cpus, target + 1) {
> @@ -6328,18 +6299,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  	if (has_idle_core)
>  		set_idle_cores(target, false);
>  
> -	if (sched_feat(SIS_PROP) && !has_idle_core) {
> -		time = cpu_clock(this) - time;
> -
> -		/*
> -		 * Account for the scan cost of wakeups against the average
> -		 * idle time.
> -		 */
> -		this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
> -
> -		update_avg(&this_sd->avg_scan_cost, time);
> -	}
> -
>  	return idle_cpu;
>  }
>  
> @@ -9199,6 +9158,60 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  	return idlest;
>  }
>  
> +static inline void update_nr_idle_scan(struct lb_env *env, struct sd_lb_stats *sds,
> +				       unsigned long sum_util)
> +{
> +	struct sched_domain_shared *sd_share;
> +	int llc_size = per_cpu(sd_llc_size, env->dst_cpu);
> +	int nr_scan;
> +
> +	/*
> +	 * Update the number of CPUs to scan in LLC domain, which could
> +	 * be used as a hint in select_idle_cpu(). The update of this hint
> +	 * occurs during periodic load balancing, rather than frequent
> +	 * newidle balance.
> +	 */
> +	if (env->idle == CPU_NEWLY_IDLE || env->sd->span_weight != llc_size)
> +		return;
> +
> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
> +	if (!sd_share)
> +		return;
> +
> +	/*
> +	 * In general, the number of cpus to be scanned should be
> +	 * inversely proportional to the sum_util. That is, the lower
> +	 * the sum_util is, the harder select_idle_cpu() should scan
> +	 * for idle CPU, and vice versa. Let x be the sum_util ratio
> +	 * [0-100] of the LLC domain, f(x) be the number of CPUs scanned:
> +	 *
> +	 * f(x) = a - bx                                              [1]
> +	 *
> +	 * Consider that f(x) = nr_llc when x = 0, and f(x) = 4 when
> +	 * x >= threshold('h' below) then:
> +	 *
> +	 * a = llc_size;
> +	 * b = (nr_llc - 4) / h                                       [2]
> +	 *
> +	 * then [2] becomes:
> +	 *
> +	 * f(x) = llc_size - (llc_size -4)x/h                         [3]
> +	 *
> +	 * Choose 50 (50%) for h as the threshold from experiment result.
> +	 * And since x = 100 * sum_util / total_cap, [3] becomes:
> +	 *
> +	 * f(sum_util)
> +	 *  = llc_size - (llc_size - 4) * 100 * sum_util / total_cap * 50
> +	 *  = llc_size - (llc_size - 4) * 2 * sum_util / total_cap
> +	 *
> +	 */
> +	nr_scan = llc_size - (llc_size - 4) * 2 * sum_util / sds->total_capacity;
> +	if (nr_scan < 4)
> +		nr_scan = 4;
> +
> +	WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> +}
> +
>  /**
>   * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
>   * @env: The load balancing environment.
> @@ -9212,6 +9225,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	struct sg_lb_stats *local = &sds->local_stat;
>  	struct sg_lb_stats tmp_sgs;
>  	int sg_status = 0;
> +	unsigned long sum_util = 0;
>  
>  	do {
>  		struct sg_lb_stats *sgs = &tmp_sgs;
> @@ -9242,6 +9256,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		/* Now, start updating sd_lb_stats */
>  		sds->total_load += sgs->group_load;
>  		sds->total_capacity += sgs->group_capacity;
> +		sum_util += sgs->group_util;
>  
>  		sg = sg->next;
>  	} while (sg != env->sd->groups);
> @@ -9268,6 +9283,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
>  		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
>  	}
> +
> +	update_nr_idle_scan(env, sds, sum_util);
>  }
>  
>  #define NUMA_IMBALANCE_MIN 2

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
  2022-03-17 17:39 ` Yicong Yang
@ 2022-03-18  3:43   ` Chen Yu
  2022-04-02 10:11     ` Yicong Yang
  0 siblings, 1 reply; 15+ messages in thread
From: Chen Yu @ 2022-03-18  3:43 UTC (permalink / raw)
  To: Yicong Yang
  Cc: linux-kernel, Tim Chen, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Viresh Kumar, Barry Song, Barry Song, Yicong Yang,
	Srikar Dronamraju, Len Brown, Ben Segall,
	Daniel Bristot de Oliveira, Aubrey Li, K Prateek Nayak

Hi Yicong,
On Fri, Mar 18, 2022 at 01:39:48AM +0800, Yicong Yang wrote:
> Hi Chen,
> 
> Thanks for the update. I'm still testing on this along with the sched cluster patches.
> I'll show some results when I get enough data. So some questions below.
> 
> On 2022/3/10 8:52, Chen Yu wrote:
> > [Problem Statement]
> > Currently select_idle_cpu() uses the percpu average idle time to
> > estimate the total LLC domain idle time, and calculate the number
> > of CPUs to be scanned. This might be inconsistent because idle time
> > of a CPU does not necessarily correlate with idle time of a domain.
> > As a result, the load could be underestimated and causes over searching
> > when the system is very busy.
> >
> > The following histogram is the time spent in select_idle_cpu(),
> > when running 224 instance of netperf on a system with 112 CPUs
> > per LLC domain:
> >
> > @usecs:
> > [0]                  533 |                                                    |
> > [1]                 5495 |                                                    |
> > [2, 4)             12008 |                                                    |
> > [4, 8)            239252 |                                                    |
> > [8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
> > [16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
> > [32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
> > [128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
> > [256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
> > [512, 1K)        2600472 |@@@@@@@@@                                           |
> > [1K, 2K)          927912 |@@@                                                 |
> > [2K, 4K)          218720 |                                                    |
> > [4K, 8K)           98161 |                                                    |
> > [8K, 16K)          37722 |                                                    |
> > [16K, 32K)          6715 |                                                    |
> > [32K, 64K)           477 |                                                    |
> > [64K, 128K)            7 |                                                    |
> >
> > netperf latency:
> > =======
> > case            	load    	    Lat_99th	    std%
> > TCP_RR          	thread-224	      257.39	(  0.21)
> > UDP_RR          	thread-224	      242.83	(  6.29)
> >
> > The netperf 99th latency(usec) above is comparable with the time spent in
> > select_idle_cpu(). That is to say, when the system is overloaded, searching
> > for idle CPU could be a bottleneck.
> >
> > [Proposal]
> > The main idea is to replace percpu average idle time with the domain
> > based metric. Choose average CPU utilization(util_avg) as the candidate.
> > In general, the number of CPUs to be scanned should be inversely
> > proportional to the sum of util_avg in this domain. That is, the lower
> > the util_avg is, the more select_idle_cpu() should scan for idle CPU,
> > and vice versa. The benefit of choosing util_avg is that, it is a metric
> > of accumulated historic activity, which seems to be more accurate than
> > instantaneous metrics(such as rq->nr_running).
> >
> > Furthermore, borrow the util_avg from periodic load balance,
> > which could offload the overhead of select_idle_cpu().
> >
> > According to last discussion[1], introduced the linear function
> > for experimental purpose:
> >
> > f(x) = a - bx
> >
> >      llc_size
> > x = \Sum      util_avg[cpu] / llc_cpu_capacity
> >      1
> > f(x) is the number of CPUs to be scanned, x is the sum util_avg.
> > To decide a and b, the following condition should be met:
> >
> > [1] f(0) = llc_size
> > [2] f(x) = 4,  x >= 50%
> >
> > That is to say, when the util_avg is 0, we should search for
> > the whole LLC domain. And if util_avg ratio reaches 50% or higher,
> > it should search at most 4 CPUs.
> 
> I might have a question here. In your V1 patch, we won't scan when the LLC
> util is >85%. But in this patch we'll always scan 4 CPUs no matter how much
> the LLC is overloaded. When the LLC is rather busy the scan is probably
> redundant, so is it better if we found a threshold for stopping the scan?
> The util_avg cannot indicate how much the CPU is overloaded, so perhaps just
> stop scanning when it is 100% utilized.
>
The reason we make the minimum scan number >= 4 is that:
1. In the tbench test result based on v1 in your environment, there seems to be
   a -8.49% regression with 128 threads. It is possible that with 128 threads
   your system is not fully busy, but we give up searching for an idle CPU,
   which causes the regression. Tim suggested that we can still search for a
   minimal number of CPUs even when the system is very busy.
2. This is consistent with the current kernel's logic: 4 is the minimal search
   number no matter how busy the system is.
https://lore.kernel.org/lkml/2627025ab96a315af0e76e5983c803578623c826.camel@linux.intel.com/
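
Just to illustrate how the floor of 4 interacts with the linear heuristic in
update_nr_idle_scan(), here is a rough userspace sketch; the 112-CPU LLC size
and the capacity numbers are only example values:

#include <stdio.h>

/* mirrors nr_scan = llc_size - (llc_size - 4) * 2 * sum_util / total_capacity */
static int nr_idle_scan(int llc_size, long sum_util, long total_capacity)
{
	long nr_scan = llc_size - (llc_size - 4) * 2 * sum_util / total_capacity;

	return nr_scan < 4 ? 4 : (int)nr_scan;
}

int main(void)
{
	int llc_size = 112;
	long total_cap = 112L * 1024;	/* SCHED_CAPACITY_SCALE per CPU */
	long utils[] = { 0, total_cap / 4, total_cap / 2, total_cap };

	/* prints 112, 58, 4, 4: the scan never drops below 4 CPUs */
	for (int i = 0; i < 4; i++)
		printf("sum_util=%ld -> nr_scan=%d\n", utils[i],
		       nr_idle_scan(llc_size, utils[i], total_cap));
	return 0;
}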

> >
> > Yes, there would be questions like:
> > Why using this linear function to calculate the number of CPUs to
> > be scanned? Why choosing 50% as the threshold? These questions will
> > be discussed in the [Limitations] section.
> >
> > [Benchmark]
> > netperf, hackbench, schbench, tbench
> > were tested with 25% 50% 75% 100% 125% 150% 175% 200% instance
> > of CPU number (these ratios are not CPU utilization). Each test lasts
> > for 100 seconds, and repeats 3 times. The system would reboot into a
> > fresh environment for each benchmark.
> >
> > The following is the benchmark result comparison between
> > baseline:vanilla and compare:patched kernel. Positive compare%
> > indicates better performance.
> >
> > netperf
> > =======
> > case            	load    	baseline(std%)	compare%( std%)
> > TCP_RR          	28 threads	 1.00 (  0.30)	 -1.26 (  0.32)
> > TCP_RR          	56 threads	 1.00 (  0.35)	 -1.26 (  0.41)
> > TCP_RR          	84 threads	 1.00 (  0.46)	 -0.15 (  0.60)
> > TCP_RR          	112 threads	 1.00 (  0.36)	 +0.44 (  0.41)
> > TCP_RR          	140 threads	 1.00 (  0.23)	 +0.95 (  0.21)
> > TCP_RR          	168 threads	 1.00 (  0.20)	+177.77 (  3.78)
> > TCP_RR          	196 threads	 1.00 (  0.18)	+185.43 ( 10.08)
> > TCP_RR          	224 threads	 1.00 (  0.16)	+187.86 (  7.32)
> > UDP_RR          	28 threads	 1.00 (  0.43)	 -0.93 (  0.27)
> > UDP_RR          	56 threads	 1.00 (  0.17)	 -0.39 ( 10.91)
> > UDP_RR          	84 threads	 1.00 (  6.36)	 +1.03 (  0.92)
> > UDP_RR          	112 threads	 1.00 (  5.55)	 +1.47 ( 17.67)
> > UDP_RR          	140 threads	 1.00 ( 18.17)	 +0.31 ( 15.48)
> > UDP_RR          	168 threads	 1.00 ( 15.00)	+153.87 ( 13.20)
> > UDP_RR          	196 threads	 1.00 ( 16.26)	+169.19 ( 13.78)
> > UDP_RR          	224 threads	 1.00 ( 51.81)	+76.72 ( 10.95)
> >
> > hackbench
> > =========
> > (each group has 1/4 * 112 tasks)
> > case            	load    	baseline(std%)	compare%( std%)
> > process-pipe    	1 group 	 1.00 (  0.47)	 -0.46 (  0.16)
> > process-pipe    	2 groups 	 1.00 (  0.42)	 -0.61 (  0.74)
> > process-pipe    	3 groups 	 1.00 (  0.42)	 +0.38 (  0.20)
> > process-pipe    	4 groups 	 1.00 (  0.15)	 -0.36 (  0.56)
> > process-pipe    	5 groups 	 1.00 (  0.20)	 -5.08 (  0.01)
> > process-pipe    	6 groups 	 1.00 (  0.28)	 -2.98 (  0.29)
> > process-pipe    	7 groups 	 1.00 (  0.08)	 -1.18 (  0.28)
> > process-pipe    	8 groups 	 1.00 (  0.11)	 -0.40 (  0.07)
> > process-sockets 	1 group 	 1.00 (  0.43)	 -1.93 (  0.58)
> > process-sockets 	2 groups 	 1.00 (  0.23)	 -1.10 (  0.49)
> > process-sockets 	3 groups 	 1.00 (  1.10)	 -0.96 (  1.12)
> > process-sockets 	4 groups 	 1.00 (  0.59)	 -0.08 (  0.88)
> > process-sockets 	5 groups 	 1.00 (  0.45)	 +0.31 (  0.34)
> > process-sockets 	6 groups 	 1.00 (  0.23)	 +0.06 (  0.66)
> > process-sockets 	7 groups 	 1.00 (  0.12)	 +1.72 (  0.20)
> > process-sockets 	8 groups 	 1.00 (  0.11)	 +1.98 (  0.02)
> > threads-pipe    	1 group 	 1.00 (  1.07)	 +0.03 (  0.40)
> > threads-pipe    	2 groups 	 1.00 (  1.05)	 +0.19 (  1.27)
> > threads-pipe    	3 groups 	 1.00 (  0.32)	 -0.42 (  0.48)
> > threads-pipe    	4 groups 	 1.00 (  0.42)	 -0.76 (  0.79)
> > threads-pipe    	5 groups 	 1.00 (  0.19)	 -4.97 (  0.07)
> > threads-pipe    	6 groups 	 1.00 (  0.05)	 -4.11 (  0.04)
> > threads-pipe    	7 groups 	 1.00 (  0.10)	 -1.13 (  0.16)
> > threads-pipe    	8 groups 	 1.00 (  0.03)	 -0.08 (  0.05)
> > threads-sockets 	1 group 	 1.00 (  0.33)	 -1.93 (  0.69)
> > threads-sockets 	2 groups 	 1.00 (  0.20)	 -1.55 (  0.30)
> > threads-sockets 	3 groups 	 1.00 (  0.37)	 -1.29 (  0.59)
> > threads-sockets 	4 groups 	 1.00 (  1.83)	 +0.31 (  1.17)
> > threads-sockets 	5 groups 	 1.00 (  0.28)	+15.73 (  0.24)
> > threads-sockets 	6 groups 	 1.00 (  0.15)	 +5.02 (  0.34)
> > threads-sockets 	7 groups 	 1.00 (  0.10)	 +2.29 (  0.14)
> > threads-sockets 	8 groups 	 1.00 (  0.17)	 +2.22 (  0.12)
> >
> > tbench
> > ======
> > case            	load    	baseline(std%)	compare%( std%)
> > loopback        	28 threads	 1.00 (  0.05)	 -1.39 (  0.04)
> > loopback        	56 threads	 1.00 (  0.08)	 -0.37 (  0.04)
> > loopback        	84 threads	 1.00 (  0.03)	 +0.20 (  0.13)
> > loopback        	112 threads	 1.00 (  0.04)	 +0.69 (  0.04)
> > loopback        	140 threads	 1.00 (  0.13)	 +1.15 (  0.21)
> > loopback        	168 threads	 1.00 (  0.03)	 +1.62 (  0.08)
> > loopback        	196 threads	 1.00 (  0.08)	 +1.50 (  0.30)
> > loopback        	224 threads	 1.00 (  0.05)	 +1.62 (  0.05)
> >
> > schbench
> > ========
> > (each mthread group has 1/4 * 112 tasks)
> > case            	load    	baseline(std%)	compare%( std%)
> > normal          	1 mthread group	 1.00 ( 17.92)	+19.23 ( 23.67)
> > normal          	2 mthread groups 1.00 ( 21.10)	 +8.32 ( 16.92)
> > normal          	3 mthread groups 1.00 ( 10.80)	+10.03 (  9.21)
> > normal          	4 mthread groups 1.00 (  2.67)	 +0.11 (  3.00)
> > normal          	5 mthread groups 1.00 (  0.08)	 +0.00 (  0.13)
> > normal          	6 mthread groups 1.00 (  2.99)	 -2.66 (  3.87)
> > normal          	7 mthread groups 1.00 (  2.16)	 -0.83 (  2.24)
> > normal          	8 mthread groups 1.00 (  1.75)	 +0.18 (  3.18)
> >
> > According to the results above, when the workloads is heavy, the throughput
> > of netperf improves a lot. It might be interesting to look into the reason
> > why this patch benefits netperf significantly. Further investigation has
> > shown that, this might be a 'side effect' of this patch. It is found that,
> > the CPU utilization is around 90% on vanilla kernel, while it is nearly
> > 100% on patched kernel. According to the perf profile, with the patch
> > applied, the scheduler would likely to choose previous running CPU for the
> > waking task, thus reduces runqueue lock contention, so the CPU utilization
> > is higher and get better performance.
> >
> > [Limitations]
> > Q:Why using 50% as the util_avg/capacity threshold to search at most 4 CPUs?
> >
> > A: 50% is chosen as that corresponds to almost full CPU utilization, when
> >    the CPU is fixed to run at its base frequency, with turbo enabled.
> >    4 is the minimal number of CPUs to be scanned in current select_idle_cpu().
> >
> >    A synthetic workload was used to simulate different level of
> >    load. This workload takes every 10ms as the sample period, and in
> >    each sample period:
> >
> >    while (!timeout_10ms) {
> >         loop(busy_pct_ms);
> >    	sleep(10ms-busy_pct_ms)
> >    }
> >
> >    to simulate busy_pct% of CPU utilization. When the workload is
> >    running, the percpu runqueue util_avg was monitored. The
> >    following is the result from turbostat's Busy% on CPU2 and
> >    cfs_rq[2].util_avg from /sys/kernel/debug/sched/debug:
> >
> >    Busy%    util_avg   util_avg/cpu_capacity%
> >    10.06    35         3.42
> >    19.97    99         9.67
> >    29.93    154        15.04
> >    39.86    213        20.80
> >    49.79    256        25.00
> >    59.73    325        31.74
> >    69.77    437        42.68
> >    79.69    458        44.73
> >    89.62    519        50.68
> >    99.54    598        58.39
> >
> >    The reason why util_avg ratio is not consistent with Busy% might be due
> >    to CPU frequency invariance. The CPU is running at fixed lower frequency
> >    than the turbo frequency, then the util_avg scales lower than
> >    SCHED_CAPACITY_SCALE. In our test platform, the base frequency is 1.9GHz,
> >    and the max turbo frequency is 3.7GHz, so 1.9/3.7 is around 50%.
> >    In the future maybe we could use arch_scale_freq_capacity()
> >    instead of sds->total_capacity, so as to remove the impact from frequency.
> >    Then the 50% could be adjusted higher. For now, 50% is an aggressive
> >    threshold to restric the idle CPU searching and shows benchmark
> >    improvement.
> >
> > Q: Why using nr_scan = a - b * sum_util_avg to do linear search?
> >
> > A: Ideally the nr_scan could be:
> >
> >    nr_scan = sum_util_avg / pelt_avg_scan_cost
> >
> >    However consider the overhead of calculating pelt on avg_scan_cost
> >    in each wake up, choosing heuristic search for evaluation seems to
> >    be an acceptable trade-off.
> >
> >    The f(sum_util_avg) could be of any form, as long as it is a monotonically
> >    decreasing function. At first f(x) = a - 2^(bx) was chosen. Because when the
> >    sum_util_avg is low, the system should try very hard to find an idle CPU. And
> >    if sum_util_avg goes higher, the system dramatically lose its interest to search
> >    for the idle CPU. But exponential function does have its drawback:
> >
> >    Consider a system with 112 CPUs, let f(x) =  112 when x = 0,
> >    f(x) = 4 when x = 50, x belongs to [0, 100], then we have:
> >
> >    f1(x) = 113 - 2^(x / 7.35)
> >    and
> >    f2(x) = 112 - 2.16 * x
> >
> >    Since kernel does not support floating point, above functions are converted into:
> >    nr_scan1(x) = 113 - 2^(x / 7)
> >    and
> >    nr_scan2(x) = 112 - 2 * x
> >
> >    util_avg%      0     1     2  ...   8     9   ... 47   48   49
> >    nr_scan1     112   112   112      111   111       49   49    4
> >    nr_scan2     112   110   108       96    94       18   16   14
> >
> >    According to above result, the granularity of exponential function
> >    is coarse-grained, while the linear function is fine-grained.
> >
> >    So finally choose linear function. After all, it has shown benchmark
> >    benefit without noticeable regression so far.
> >
> > Q: How to deal with the following corner case:
> >
> >    It is possible that there is unbalanced tasks among CPUs due to CPU affinity.
> >    For example, suppose the LLC domain is composed of 6 CPUs, and 5 tasks are bound
> >    to CPU0~CPU4, while CPU5 is idle:
> >
> > 		CPU0	CPU1	CPU2	CPU3	CPU4	CPU5
> >    util_avg	1024	1024	1024	1024	1024	0
> >
> >    Since the util_avg ratio is 83%( = 5/6 ), which is higher than 50%, select_idle_cpu()
> >    only searches 4 CPUs starting from CPU0, thus leaves idle CPU5 undetected.
> >
> >    A possible workaround to mitigate this problem is that, the nr_scan should
> >    be increased by the number of idle CPUs found during periodic load balance
> >    in update_sd_lb_stats(). In above example, the nr_scan will be adjusted to
> >    4 + 1 = 5. Currently I don't have better solution in mind to deal with it
> >    gracefully.
> 
> Without CPU affinity, is it possible that we also meet this case?
Yes, it is true.
> Considering we always scan from the target cpu and the further cpus have less
> chance to be checked, the scan possibility of each CPUs is not equal. When the
> util_avg ratio >50%, after several wakeups from CPU0 the CPU 1~4 will be non-idle
> andthe following scans may fail without checking CPU5.
In this case, we rely on the load balance path to migrate some tasks among the
CPUs and 'saturate' this LLC domain evenly.
 
> >
> > -		 * If we're busy, the assumption that the last idle period
> > -		 * predicts the future is flawed; age away the remaining
> > -		 * predicted idle time.
> > -		 */
> > -		if (unlikely(this_rq->wake_stamp < now)) {
> > -			while (this_rq->wake_stamp < now && this_rq->wake_avg_idle) {
> > -				this_rq->wake_stamp++;
> > -				this_rq->wake_avg_idle >>= 1;
> > -			}
> > -		}
> > -
> > -		avg_idle = this_rq->wake_avg_idle;
> > -		avg_cost = this_sd->avg_scan_cost + 1;
> > -
> 
> With this patch, sd->avg_scan_cost, rq->{wake_stamp, wake_avg_idle} may have no users.
> 
If we 'rebase' SIS_PROP to use sum_util_avg, it seems that avg_scan_cost and
rq->{wake_stamp, wake_avg_idle} are not needed IMO.
rq->{wake_stamp, wake_avg_idle} is used to reduce the search number when waking
up a task on a busy rq. However, this metric still uses one CPU's statistics to
predict the whole system's status, which is exactly what this patch tries to avoid.

sd->avg_scan_cost could still be leveraged to predict the number of CPUs to scan,
but only if we used something like sum_util_avg / pelt(sd->avg_scan_cost). I'm not
sure how much overhead calculating PELT on sd->avg_scan_cost on each wakeup would
add, but I can have a try and get some data.

thanks,
Chenyu
> Thanks,
> Yicong
> 
> > -		span_avg = sd->span_weight * avg_idle;
> > -		if (span_avg > 4*avg_cost)
> > -			nr = div_u64(span_avg, avg_cost);
> > -		else
> > -			nr = 4;
> > -
> > -		time = cpu_clock(this);
> > +		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
> > +		if (sd_share)
> > +			nr = READ_ONCE(sd_share->nr_idle_scan);
> >  	}
> >  
> >  	for_each_cpu_wrap(cpu, cpus, target + 1) {
> > @@ -6328,18 +6299,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >  	if (has_idle_core)
> >  		set_idle_cores(target, false);
> >  
> > -	if (sched_feat(SIS_PROP) && !has_idle_core) {
> > -		time = cpu_clock(this) - time;
> > -
> > -		/*
> > -		 * Account for the scan cost of wakeups against the average
> > -		 * idle time.
> > -		 */
> > -		this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
> > -
> > -		update_avg(&this_sd->avg_scan_cost, time);
> > -	}
> > -
> >  	return idle_cpu;
> >  }
> >  
> > @@ -9199,6 +9158,60 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >  	return idlest;
> >  }
> >  
> > +static inline void update_nr_idle_scan(struct lb_env *env, struct sd_lb_stats *sds,
> > +				       unsigned long sum_util)
> > +{
> > +	struct sched_domain_shared *sd_share;
> > +	int llc_size = per_cpu(sd_llc_size, env->dst_cpu);
> > +	int nr_scan;
> > +
> > +	/*
> > +	 * Update the number of CPUs to scan in LLC domain, which could
> > +	 * be used as a hint in select_idle_cpu(). The update of this hint
> > +	 * occurs during periodic load balancing, rather than frequent
> > +	 * newidle balance.
> > +	 */
> > +	if (env->idle == CPU_NEWLY_IDLE || env->sd->span_weight != llc_size)
> > +		return;
> > +
> > +	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
> > +	if (!sd_share)
> > +		return;
> > +
> > +	/*
> > +	 * In general, the number of cpus to be scanned should be
> > +	 * inversely proportional to the sum_util. That is, the lower
> > +	 * the sum_util is, the harder select_idle_cpu() should scan
> > +	 * for idle CPU, and vice versa. Let x be the sum_util ratio
> > +	 * [0-100] of the LLC domain, f(x) be the number of CPUs scanned:
> > +	 *
> > +	 * f(x) = a - bx                                              [1]
> > +	 *
> > +	 * Consider that f(x) = nr_llc when x = 0, and f(x) = 4 when
> > +	 * x >= threshold('h' below) then:
> > +	 *
> > +	 * a = llc_size;
> > +	 * b = (nr_llc - 4) / h                                       [2]
> > +	 *
> > +	 * then [2] becomes:
> > +	 *
> > +	 * f(x) = llc_size - (llc_size -4)x/h                         [3]
> > +	 *
> > +	 * Choose 50 (50%) for h as the threshold from experiment result.
> > +	 * And since x = 100 * sum_util / total_cap, [3] becomes:
> > +	 *
> > +	 * f(sum_util)
> > +	 *  = llc_size - (llc_size - 4) * 100 * sum_util / total_cap * 50
> > +	 *  = llc_size - (llc_size - 4) * 2 * sum_util / total_cap
> > +	 *
> > +	 */
> > +	nr_scan = llc_size - (llc_size - 4) * 2 * sum_util / sds->total_capacity;
> > +	if (nr_scan < 4)
> > +		nr_scan = 4;
> > +
> > +	WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> > +}
> > +
> >  /**
> >   * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
> >   * @env: The load balancing environment.
> > @@ -9212,6 +9225,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >  	struct sg_lb_stats *local = &sds->local_stat;
> >  	struct sg_lb_stats tmp_sgs;
> >  	int sg_status = 0;
> > +	unsigned long sum_util = 0;
> >  
> >  	do {
> >  		struct sg_lb_stats *sgs = &tmp_sgs;
> > @@ -9242,6 +9256,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >  		/* Now, start updating sd_lb_stats */
> >  		sds->total_load += sgs->group_load;
> >  		sds->total_capacity += sgs->group_capacity;
> > +		sum_util += sgs->group_util;
> >  
> >  		sg = sg->next;
> >  	} while (sg != env->sd->groups);
> > @@ -9268,6 +9283,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >  		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
> >  		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
> >  	}
> > +
> > +	update_nr_idle_scan(env, sds, sum_util);
> >  }
> >  
> >  #define NUMA_IMBALANCE_MIN 2

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
  2022-03-18  3:43   ` Chen Yu
@ 2022-04-02 10:11     ` Yicong Yang
  2022-04-09 15:09       ` Chen Yu
  0 siblings, 1 reply; 15+ messages in thread
From: Yicong Yang @ 2022-04-02 10:11 UTC (permalink / raw)
  To: Chen Yu, Yicong Yang
  Cc: yangyicong, linux-kernel, Tim Chen, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Viresh Kumar, Barry Song, Barry Song,
	Srikar Dronamraju, Len Brown, Ben Segall,
	Daniel Bristot de Oliveira, Aubrey Li, K Prateek Nayak,
	shenyang (M)

On 2022/3/18 11:43, Chen Yu wrote:
> Hi Yicong,
> On Fri, Mar 18, 2022 at 01:39:48AM +0800, Yicong Yang wrote:
>> Hi Chen,
>>
>> Thanks for the update. I'm still testing on this along with the sched cluster patches.
>> I'll show some results when I get enough data. So some questions below.
>>
>> On 2022/3/10 8:52, Chen Yu wrote:
>>> [Problem Statement]
>>> Currently select_idle_cpu() uses the percpu average idle time to
>>> estimate the total LLC domain idle time, and calculate the number
>>> of CPUs to be scanned. This might be inconsistent because idle time
>>> of a CPU does not necessarily correlate with idle time of a domain.
>>> As a result, the load could be underestimated and causes over searching
>>> when the system is very busy.
>>>
>>> The following histogram is the time spent in select_idle_cpu(),
>>> when running 224 instance of netperf on a system with 112 CPUs
>>> per LLC domain:
>>>
>>> @usecs:
>>> [0]                  533 |                                                    |
>>> [1]                 5495 |                                                    |
>>> [2, 4)             12008 |                                                    |
>>> [4, 8)            239252 |                                                    |
>>> [8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
>>> [16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
>>> [32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
>>> [128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
>>> [256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
>>> [512, 1K)        2600472 |@@@@@@@@@                                           |
>>> [1K, 2K)          927912 |@@@                                                 |
>>> [2K, 4K)          218720 |                                                    |
>>> [4K, 8K)           98161 |                                                    |
>>> [8K, 16K)          37722 |                                                    |
>>> [16K, 32K)          6715 |                                                    |
>>> [32K, 64K)           477 |                                                    |
>>> [64K, 128K)            7 |                                                    |
>>>
>>> netperf latency:
>>> =======
>>> case            	load    	    Lat_99th	    std%
>>> TCP_RR          	thread-224	      257.39	(  0.21)
>>> UDP_RR          	thread-224	      242.83	(  6.29)
>>>
>>> The netperf 99th latency(usec) above is comparable with the time spent in
>>> select_idle_cpu(). That is to say, when the system is overloaded, searching
>>> for idle CPU could be a bottleneck.
>>>
>>> [Proposal]
>>> The main idea is to replace percpu average idle time with the domain
>>> based metric. Choose average CPU utilization(util_avg) as the candidate.
>>> In general, the number of CPUs to be scanned should be inversely
>>> proportional to the sum of util_avg in this domain. That is, the lower
>>> the util_avg is, the more select_idle_cpu() should scan for idle CPU,
>>> and vice versa. The benefit of choosing util_avg is that, it is a metric
>>> of accumulated historic activity, which seems to be more accurate than
>>> instantaneous metrics(such as rq->nr_running).
>>>
>>> Furthermore, borrow the util_avg from periodic load balance,
>>> which could offload the overhead of select_idle_cpu().
>>>
>>> According to last discussion[1], introduced the linear function
>>> for experimental purpose:
>>>
>>> f(x) = a - bx
>>>
>>>      llc_size
>>> x = \Sum      util_avg[cpu] / llc_cpu_capacity
>>>      1
>>> f(x) is the number of CPUs to be scanned, x is the sum util_avg.
>>> To decide a and b, the following condition should be met:
>>>
>>> [1] f(0) = llc_size
>>> [2] f(x) = 4,  x >= 50%
>>>
>>> That is to say, when the util_avg is 0, we should search for
>>> the whole LLC domain. And if util_avg ratio reaches 50% or higher,
>>> it should search at most 4 CPUs.
>>
>> I might have a question here. In your V1 patch, we won't scan when the LLC
>> util >85%. But in this patch we'll always scan 4 cpus no matter how much the
>> LLC is overloaded. When the LLC is rather busy the scan is probably redundant
>> so is it better if we found a threadhold for stopping the scan? The util_avg
>> cannot indicate how much the cpu is overloaded so perhaps just stop scan when
>> it is 100% utilized.
>>
> The reason we makes the scan number >=4 is that:
> 1. In the tbench test result based on v1 in your environment, there seems to be
>    a -8.49% downgrading with 128 threads. It is possible that, when there is
>    128 thread in your system, it is not fully busy, but we give up searching for
>    an idle CPU, which causes downgrading. Tim suggested that we can still search
>    for a minimal number of CPU even the system is very busy.
> 2. This is consistent with the current kernel's logic, 4 is the minal search number
>    no matter how busy the system is.
> https://lore.kernel.org/lkml/2627025ab96a315af0e76e5983c803578623c826.camel@linux.intel.com/
> 

FYI, shenyang has done some investigation on whether we can get an idle CPU if the nr is 4.
For netperf running on node 0-1 (32 cores on each node) with 32, 64 and 128 threads, the
success rate of finding an idle CPU is about 61.8%, 7.4% and <0.1%, while the CPU
utilization is 70.7%, 87.4% and 99.9% respectively.

I have tested this patch based on 5.17-rc7 on Kunpeng 920. The benchmarks are bound to
node 0 or node 0-1. The tbench result has some oscillation so I need to have a further
check. For netperf I see a performance enhancement when the number of threads equals
the number of CPUs.

For netperf:
TCP_RR 2 nodes
threads		base		patched		pct
16		50335.56667	49970.63333	-0.73%
32		47281.53333	48191.93333	1.93%
64		18907.7		34263.63333	81.22%
128		14391.1		14480.8		0.62%
256		6905.286667	6853.83		-0.75%

TCP_RR 1 node
threads		base		patched		pct
16		50086.06667	49648.13333	-0.87%
32		24983.3		39489.43333	58.06%
64		18340.03333	18399.56667	0.32%
128		7174.713333	7390.09		3.00%
256		3433.696667	3404.956667	-0.84%

UDP_RR 2 nodes
threads		base		patched		pct
16		81448.7		82659.43333	1.49%
32		75351.13333	76812.36667	1.94%
64		25539.46667	41835.96667	63.81%
128		25081.56667	23595.56667	-5.92%
256		11848.23333	11017.13333	-7.01%

UDP_RR 1 node
threads		base		patched		pct
16		87288.96667	88719.83333	1.64%
32		22891.73333	68854.33333	200.78%
64		33853.4		35891.6		6.02%
128		12108.4		11885.76667	-1.84%
256		5620.403333	5531.006667	-1.59%

mysql on node 0-1
			base		patched		pct
16threads-TPS		7100.27		7224.31		1.75%
16threads-QPS		142005.45	144486.19	1.75%
16threads-avg lat	2.25		2.22		1.63%
16threads-99th lat	2.46		2.43		1.08%
24threads-TPS		10424.70	10312.20	-1.08%
24threads-QPS		208493.86	206243.93	-1.08%
24threads-avg lat	2.30		2.32		-0.87%
24threads-99th lat	2.52		2.57		-1.85%
32threads-TPS		12528.79	12228.88	-2.39%
32threads-QPS		250575.92	244577.59	-2.39%
32threads-avg lat	2.55		2.61		-2.35%
32threads-99th lat	2.88		2.99		-3.82%
64threads-TPS		21386.17	21789.99	1.89%
64threads-QPS		427723.41	435799.85	1.89%
64threads-avg lat	2.99		2.94		1.78%
64threads-99th lat	5.00		4.69		6.33%
128threads-TPS		20865.13	20781.24	-0.40%
128threads-QPS		417302.73	415624.83	-0.40%
128threads-avg lat	6.13		6.16		-0.38%
128threads-99th lat	8.90		8.95		-0.60%
256threads-TPS		19258.15	19295.11	0.19%
256threads-QPS		385162.92	385902.27	0.19%
256threads-avg lat	13.29		13.26		0.23%
256threads-99th lat	20.12		20.12		0.00%

I also had a look at a machine with 2-socket Xeon 6148 (80 threads in total).
For TCP_RR, the best enhancement also happens when the number of threads equals
the number of CPUs.

netperf TCP_RR
threads		base		patched		pct
20		36584.73333	36309		-0.75%
40		26679.6		25958.56667	-2.70%
80		13969.2		14669.13333	5.01%
160		9571.28		9669.026667	1.02%
240		6367.056667	6416.93		0.78%

tbench
                               base                patched
Hmean     1        255.73 (   0.00%)      255.44 *  -0.11%*
Hmean     2        508.75 (   0.00%)      511.09 *   0.46%*
Hmean     4       1014.11 (   0.00%)     1009.36 *  -0.47%*
Hmean     8       1989.11 (   0.00%)     1978.16 *  -0.55%*
Hmean     16      3772.11 (   0.00%)     3778.19 *   0.16%*
Hmean     32      6106.35 (   0.00%)     5969.34 *  -2.24%*
Hmean     64      6627.70 (   0.00%)     6680.38 *   0.79%*
Hmean     128    12345.44 (   0.00%)    12436.74 *   0.74%*
Hmean     256    12204.99 (   0.00%)    12271.06 *   0.54%*
Hmean     320    12037.84 (   0.00%)    12142.56 *   0.87%*

hackbench-process-pipes
                               base                patched
Amean     1        0.3482 (   0.00%)      0.3385 *   2.79%*
Amean     4        0.9519 (   0.00%)      0.9423 *   1.01%*
Amean     7        1.3334 (   0.00%)      1.3665 *  -2.48%*
Amean     12       2.1149 (   0.00%)      2.2927 *  -8.41%*
Amean     21       3.9327 (   0.00%)      4.3591 * -10.84%*
Amean     30       6.8436 (   0.00%)      6.2177 *   9.15%*
Amean     48      12.3377 (   0.00%)     11.6233 *   5.79%*
Amean     79      17.0142 (   0.00%)     16.8587 *   0.91%*
Amean     110     21.3452 (   0.00%)     21.6252 *  -1.31%*
Amean     141     26.9371 (   0.00%)     26.7137 *   0.83%*
Amean     160     30.2086 (   0.00%)     30.2942 *  -0.28%*

>>>
>>> Yes, there would be questions like:
>>> Why using this linear function to calculate the number of CPUs to
>>> be scanned? Why choosing 50% as the threshold? These questions will
>>> be discussed in the [Limitations] section.
>>>
>>> [Benchmark]
>>> netperf, hackbench, schbench, tbench
>>> were tested with 25% 50% 75% 100% 125% 150% 175% 200% instance
>>> of CPU number (these ratios are not CPU utilization). Each test lasts
>>> for 100 seconds, and repeats 3 times. The system would reboot into a
>>> fresh environment for each benchmark.
>>>
>>> The following is the benchmark result comparison between
>>> baseline:vanilla and compare:patched kernel. Positive compare%
>>> indicates better performance.
>>>
>>> netperf
>>> =======
>>> case            	load    	baseline(std%)	compare%( std%)
>>> TCP_RR          	28 threads	 1.00 (  0.30)	 -1.26 (  0.32)
>>> TCP_RR          	56 threads	 1.00 (  0.35)	 -1.26 (  0.41)
>>> TCP_RR          	84 threads	 1.00 (  0.46)	 -0.15 (  0.60)
>>> TCP_RR          	112 threads	 1.00 (  0.36)	 +0.44 (  0.41)
>>> TCP_RR          	140 threads	 1.00 (  0.23)	 +0.95 (  0.21)
>>> TCP_RR          	168 threads	 1.00 (  0.20)	+177.77 (  3.78)
>>> TCP_RR          	196 threads	 1.00 (  0.18)	+185.43 ( 10.08)
>>> TCP_RR          	224 threads	 1.00 (  0.16)	+187.86 (  7.32)
>>> UDP_RR          	28 threads	 1.00 (  0.43)	 -0.93 (  0.27)
>>> UDP_RR          	56 threads	 1.00 (  0.17)	 -0.39 ( 10.91)
>>> UDP_RR          	84 threads	 1.00 (  6.36)	 +1.03 (  0.92)
>>> UDP_RR          	112 threads	 1.00 (  5.55)	 +1.47 ( 17.67)
>>> UDP_RR          	140 threads	 1.00 ( 18.17)	 +0.31 ( 15.48)
>>> UDP_RR          	168 threads	 1.00 ( 15.00)	+153.87 ( 13.20)
>>> UDP_RR          	196 threads	 1.00 ( 16.26)	+169.19 ( 13.78)
>>> UDP_RR          	224 threads	 1.00 ( 51.81)	+76.72 ( 10.95)
>>>
>>> hackbench
>>> =========
>>> (each group has 1/4 * 112 tasks)
>>> case            	load    	baseline(std%)	compare%( std%)
>>> process-pipe    	1 group 	 1.00 (  0.47)	 -0.46 (  0.16)
>>> process-pipe    	2 groups 	 1.00 (  0.42)	 -0.61 (  0.74)
>>> process-pipe    	3 groups 	 1.00 (  0.42)	 +0.38 (  0.20)
>>> process-pipe    	4 groups 	 1.00 (  0.15)	 -0.36 (  0.56)
>>> process-pipe    	5 groups 	 1.00 (  0.20)	 -5.08 (  0.01)
>>> process-pipe    	6 groups 	 1.00 (  0.28)	 -2.98 (  0.29)
>>> process-pipe    	7 groups 	 1.00 (  0.08)	 -1.18 (  0.28)
>>> process-pipe    	8 groups 	 1.00 (  0.11)	 -0.40 (  0.07)
>>> process-sockets 	1 group 	 1.00 (  0.43)	 -1.93 (  0.58)
>>> process-sockets 	2 groups 	 1.00 (  0.23)	 -1.10 (  0.49)
>>> process-sockets 	3 groups 	 1.00 (  1.10)	 -0.96 (  1.12)
>>> process-sockets 	4 groups 	 1.00 (  0.59)	 -0.08 (  0.88)
>>> process-sockets 	5 groups 	 1.00 (  0.45)	 +0.31 (  0.34)
>>> process-sockets 	6 groups 	 1.00 (  0.23)	 +0.06 (  0.66)
>>> process-sockets 	7 groups 	 1.00 (  0.12)	 +1.72 (  0.20)
>>> process-sockets 	8 groups 	 1.00 (  0.11)	 +1.98 (  0.02)
>>> threads-pipe    	1 group 	 1.00 (  1.07)	 +0.03 (  0.40)
>>> threads-pipe    	2 groups 	 1.00 (  1.05)	 +0.19 (  1.27)
>>> threads-pipe    	3 groups 	 1.00 (  0.32)	 -0.42 (  0.48)
>>> threads-pipe    	4 groups 	 1.00 (  0.42)	 -0.76 (  0.79)
>>> threads-pipe    	5 groups 	 1.00 (  0.19)	 -4.97 (  0.07)
>>> threads-pipe    	6 groups 	 1.00 (  0.05)	 -4.11 (  0.04)
>>> threads-pipe    	7 groups 	 1.00 (  0.10)	 -1.13 (  0.16)
>>> threads-pipe    	8 groups 	 1.00 (  0.03)	 -0.08 (  0.05)
>>> threads-sockets 	1 group 	 1.00 (  0.33)	 -1.93 (  0.69)
>>> threads-sockets 	2 groups 	 1.00 (  0.20)	 -1.55 (  0.30)
>>> threads-sockets 	3 groups 	 1.00 (  0.37)	 -1.29 (  0.59)
>>> threads-sockets 	4 groups 	 1.00 (  1.83)	 +0.31 (  1.17)
>>> threads-sockets 	5 groups 	 1.00 (  0.28)	+15.73 (  0.24)
>>> threads-sockets 	6 groups 	 1.00 (  0.15)	 +5.02 (  0.34)
>>> threads-sockets 	7 groups 	 1.00 (  0.10)	 +2.29 (  0.14)
>>> threads-sockets 	8 groups 	 1.00 (  0.17)	 +2.22 (  0.12)
>>>
>>> tbench
>>> ======
>>> case            	load    	baseline(std%)	compare%( std%)
>>> loopback        	28 threads	 1.00 (  0.05)	 -1.39 (  0.04)
>>> loopback        	56 threads	 1.00 (  0.08)	 -0.37 (  0.04)
>>> loopback        	84 threads	 1.00 (  0.03)	 +0.20 (  0.13)
>>> loopback        	112 threads	 1.00 (  0.04)	 +0.69 (  0.04)
>>> loopback        	140 threads	 1.00 (  0.13)	 +1.15 (  0.21)
>>> loopback        	168 threads	 1.00 (  0.03)	 +1.62 (  0.08)
>>> loopback        	196 threads	 1.00 (  0.08)	 +1.50 (  0.30)
>>> loopback        	224 threads	 1.00 (  0.05)	 +1.62 (  0.05)
>>>
>>> schbench
>>> ========
>>> (each mthread group has 1/4 * 112 tasks)
>>> case            	load    	baseline(std%)	compare%( std%)
>>> normal          	1 mthread group	 1.00 ( 17.92)	+19.23 ( 23.67)
>>> normal          	2 mthread groups 1.00 ( 21.10)	 +8.32 ( 16.92)
>>> normal          	3 mthread groups 1.00 ( 10.80)	+10.03 (  9.21)
>>> normal          	4 mthread groups 1.00 (  2.67)	 +0.11 (  3.00)
>>> normal          	5 mthread groups 1.00 (  0.08)	 +0.00 (  0.13)
>>> normal          	6 mthread groups 1.00 (  2.99)	 -2.66 (  3.87)
>>> normal          	7 mthread groups 1.00 (  2.16)	 -0.83 (  2.24)
>>> normal          	8 mthread groups 1.00 (  1.75)	 +0.18 (  3.18)
>>>
>>> According to the results above, when the workloads is heavy, the throughput
>>> of netperf improves a lot. It might be interesting to look into the reason
>>> why this patch benefits netperf significantly. Further investigation has
>>> shown that, this might be a 'side effect' of this patch. It is found that,
>>> the CPU utilization is around 90% on vanilla kernel, while it is nearly
>>> 100% on patched kernel. According to the perf profile, with the patch
>>> applied, the scheduler would likely to choose previous running CPU for the
>>> waking task, thus reduces runqueue lock contention, so the CPU utilization
>>> is higher and get better performance.
>>>
>>> [Limitations]
>>> Q:Why using 50% as the util_avg/capacity threshold to search at most 4 CPUs?
>>>
>>> A: 50% is chosen as that corresponds to almost full CPU utilization, when
>>>    the CPU is fixed to run at its base frequency, with turbo enabled.
>>>    4 is the minimal number of CPUs to be scanned in current select_idle_cpu().
>>>
>>>    A synthetic workload was used to simulate different level of
>>>    load. This workload takes every 10ms as the sample period, and in
>>>    each sample period:
>>>
>>>    while (!timeout_10ms) {
>>>         loop(busy_pct_ms);
>>>    	sleep(10ms-busy_pct_ms)
>>>    }
>>>
>>>    to simulate busy_pct% of CPU utilization. When the workload is
>>>    running, the percpu runqueue util_avg was monitored. The
>>>    following is the result from turbostat's Busy% on CPU2 and
>>>    cfs_rq[2].util_avg from /sys/kernel/debug/sched/debug:
>>>
>>>    Busy%    util_avg   util_avg/cpu_capacity%
>>>    10.06    35         3.42
>>>    19.97    99         9.67
>>>    29.93    154        15.04
>>>    39.86    213        20.80
>>>    49.79    256        25.00
>>>    59.73    325        31.74
>>>    69.77    437        42.68
>>>    79.69    458        44.73
>>>    89.62    519        50.68
>>>    99.54    598        58.39
>>>
>>>    The reason why util_avg ratio is not consistent with Busy% might be due
>>>    to CPU frequency invariance. The CPU is running at fixed lower frequency
>>>    than the turbo frequency, then the util_avg scales lower than
>>>    SCHED_CAPACITY_SCALE. In our test platform, the base frequency is 1.9GHz,
>>>    and the max turbo frequency is 3.7GHz, so 1.9/3.7 is around 50%.
>>>    In the future maybe we could use arch_scale_freq_capacity()
>>>    instead of sds->total_capacity, so as to remove the impact from frequency.
>>>    Then the 50% could be adjusted higher. For now, 50% is an aggressive
>>>    threshold to restric the idle CPU searching and shows benchmark
>>>    improvement.
>>>
>>> Q: Why using nr_scan = a - b * sum_util_avg to do linear search?
>>>
>>> A: Ideally the nr_scan could be:
>>>
>>>    nr_scan = sum_util_avg / pelt_avg_scan_cost
>>>
>>>    However consider the overhead of calculating pelt on avg_scan_cost
>>>    in each wake up, choosing heuristic search for evaluation seems to
>>>    be an acceptable trade-off.
>>>
>>>    The f(sum_util_avg) could be of any form, as long as it is a monotonically
>>>    decreasing function. At first f(x) = a - 2^(bx) was chosen. Because when the
>>>    sum_util_avg is low, the system should try very hard to find an idle CPU. And
>>>    if sum_util_avg goes higher, the system dramatically lose its interest to search
>>>    for the idle CPU. But exponential function does have its drawback:
>>>
>>>    Consider a system with 112 CPUs, let f(x) =  112 when x = 0,
>>>    f(x) = 4 when x = 50, x belongs to [0, 100], then we have:
>>>
>>>    f1(x) = 113 - 2^(x / 7.35)
>>>    and
>>>    f2(x) = 112 - 2.16 * x
>>>
>>>    Since kernel does not support floating point, above functions are converted into:
>>>    nr_scan1(x) = 113 - 2^(x / 7)
>>>    and
>>>    nr_scan2(x) = 112 - 2 * x
>>>
>>>    util_avg%      0     1     2  ...   8     9   ... 47   48   49
>>>    nr_scan1     112   112   112      111   111       49   49    4
>>>    nr_scan2     112   110   108       96    94       18   16   14
>>>
>>>    According to above result, the granularity of exponential function
>>>    is coarse-grained, while the linear function is fine-grained.
>>>
>>>    So finally choose linear function. After all, it has shown benchmark
>>>    benefit without noticeable regression so far.
>>>
>>> Q: How to deal with the following corner case:
>>>
>>>    It is possible that there is unbalanced tasks among CPUs due to CPU affinity.
>>>    For example, suppose the LLC domain is composed of 6 CPUs, and 5 tasks are bound
>>>    to CPU0~CPU4, while CPU5 is idle:
>>>
>>> 		CPU0	CPU1	CPU2	CPU3	CPU4	CPU5
>>>    util_avg	1024	1024	1024	1024	1024	0
>>>
>>>    Since the util_avg ratio is 83%( = 5/6 ), which is higher than 50%, select_idle_cpu()
>>>    only searches 4 CPUs starting from CPU0, thus leaves idle CPU5 undetected.
>>>
>>>    A possible workaround to mitigate this problem is that, the nr_scan should
>>>    be increased by the number of idle CPUs found during periodic load balance
>>>    in update_sd_lb_stats(). In above example, the nr_scan will be adjusted to
>>>    4 + 1 = 5. Currently I don't have better solution in mind to deal with it
>>>    gracefully.
>>
>> Without CPU affinity, is it possible that we also meet this case?
> Yes, it is true.
>> Considering we always scan from the target cpu and the further cpus have less
>> chance to be checked, the scan possibility of each CPUs is not equal. When the
>> util_avg ratio >50%, after several wakeups from CPU0 the CPU 1~4 will be non-idle
>> andthe following scans may fail without checking CPU5.
> In this case, we relies on the load balance path to migrate some tasks among
> CPUs and 'saturate'this LLC domain equally.
>  
>>>
>>> -		 * If we're busy, the assumption that the last idle period
>>> -		 * predicts the future is flawed; age away the remaining
>>> -		 * predicted idle time.
>>> -		 */
>>> -		if (unlikely(this_rq->wake_stamp < now)) {
>>> -			while (this_rq->wake_stamp < now && this_rq->wake_avg_idle) {
>>> -				this_rq->wake_stamp++;
>>> -				this_rq->wake_avg_idle >>= 1;
>>> -			}
>>> -		}
>>> -
>>> -		avg_idle = this_rq->wake_avg_idle;
>>> -		avg_cost = this_sd->avg_scan_cost + 1;
>>> -
>>
>> With this patch, sd->avg_scan_cost, rq->{wake_stamp, wake_avg_idle} may have no users.
>>
> If we 'rebase' SIS_PROP to use sum_util_avg, it seems that avg_scan_cost and
> rq->{wake_stamp, wake_avg_idle} are not needed IMO.
> rq->{wake_stamp, wake_avg_idle} is used to reduce the number of CPUs searched
> when waking up a task on a busy rq. However, this metric still uses one CPU's
> statistics to predict the whole system's status, which is what this patch
> tries to avoid.
> 
> For sd->avg_scan_cost, it is not needed either, unless we use
> sum_util_avg / pelt(sd->avg_scan_cost) to predict the number of CPUs to scan.
> I'm not sure how much overhead calculating pelt on sd->avg_scan_cost during
> each wakeup would add, but I can have a try and get some data.
> 
> thanks,
> Chenyu
>> Thanks,
>> Yicong
>>
>>> -		span_avg = sd->span_weight * avg_idle;
>>> -		if (span_avg > 4*avg_cost)
>>> -			nr = div_u64(span_avg, avg_cost);
>>> -		else
>>> -			nr = 4;
>>> -
>>> -		time = cpu_clock(this);
>>> +		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
>>> +		if (sd_share)
>>> +			nr = READ_ONCE(sd_share->nr_idle_scan);
>>>  	}
>>>  
>>>  	for_each_cpu_wrap(cpu, cpus, target + 1) {
>>> @@ -6328,18 +6299,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>  	if (has_idle_core)
>>>  		set_idle_cores(target, false);
>>>  
>>> -	if (sched_feat(SIS_PROP) && !has_idle_core) {
>>> -		time = cpu_clock(this) - time;
>>> -
>>> -		/*
>>> -		 * Account for the scan cost of wakeups against the average
>>> -		 * idle time.
>>> -		 */
>>> -		this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
>>> -
>>> -		update_avg(&this_sd->avg_scan_cost, time);
>>> -	}
>>> -
>>>  	return idle_cpu;
>>>  }
>>>  
>>> @@ -9199,6 +9158,60 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>>  	return idlest;
>>>  }
>>>  
>>> +static inline void update_nr_idle_scan(struct lb_env *env, struct sd_lb_stats *sds,
>>> +				       unsigned long sum_util)
>>> +{
>>> +	struct sched_domain_shared *sd_share;
>>> +	int llc_size = per_cpu(sd_llc_size, env->dst_cpu);
>>> +	int nr_scan;
>>> +
>>> +	/*
>>> +	 * Update the number of CPUs to scan in LLC domain, which could
>>> +	 * be used as a hint in select_idle_cpu(). The update of this hint
>>> +	 * occurs during periodic load balancing, rather than frequent
>>> +	 * newidle balance.
>>> +	 */
>>> +	if (env->idle == CPU_NEWLY_IDLE || env->sd->span_weight != llc_size)
>>> +		return;
>>> +
>>> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
>>> +	if (!sd_share)
>>> +		return;
>>> +
>>> +	/*
>>> +	 * In general, the number of cpus to be scanned should be
>>> +	 * inversely proportional to the sum_util. That is, the lower
>>> +	 * the sum_util is, the harder select_idle_cpu() should scan
>>> +	 * for idle CPU, and vice versa. Let x be the sum_util ratio
>>> +	 * [0-100] of the LLC domain, f(x) be the number of CPUs scanned:
>>> +	 *
>>> +	 * f(x) = a - bx                                              [1]
>>> +	 *
>>> +	 * Consider that f(x) = nr_llc when x = 0, and f(x) = 4 when
>>> +	 * x >= threshold('h' below) then:
>>> +	 *
>>> +	 * a = llc_size;
>>> +	 * b = (nr_llc - 4) / h                                       [2]
>>> +	 *
>>> +	 * then [2] becomes:
>>> +	 *
>>> +	 * f(x) = llc_size - (llc_size -4)x/h                         [3]
>>> +	 *
>>> +	 * Choose 50 (50%) for h as the threshold from experiment result.
>>> +	 * And since x = 100 * sum_util / total_cap, [3] becomes:
>>> +	 *
>>> +	 * f(sum_util)
>>> +	 *  = llc_size - (llc_size - 4) * 100 * sum_util / (total_cap * 50)
>>> +	 *  = llc_size - (llc_size - 4) * 2 * sum_util / total_cap
>>> +	 *
>>> +	 */
>>> +	nr_scan = llc_size - (llc_size - 4) * 2 * sum_util / sds->total_capacity;
>>> +	if (nr_scan < 4)
>>> +		nr_scan = 4;
>>> +
>>> +	WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
>>> +}
>>> +
>>>  /**
>>>   * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
>>>   * @env: The load balancing environment.
>>> @@ -9212,6 +9225,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>>  	struct sg_lb_stats *local = &sds->local_stat;
>>>  	struct sg_lb_stats tmp_sgs;
>>>  	int sg_status = 0;
>>> +	unsigned long sum_util = 0;
>>>  
>>>  	do {
>>>  		struct sg_lb_stats *sgs = &tmp_sgs;
>>> @@ -9242,6 +9256,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>>  		/* Now, start updating sd_lb_stats */
>>>  		sds->total_load += sgs->group_load;
>>>  		sds->total_capacity += sgs->group_capacity;
>>> +		sum_util += sgs->group_util;
>>>  
>>>  		sg = sg->next;
>>>  	} while (sg != env->sd->groups);
>>> @@ -9268,6 +9283,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>>  		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
>>>  		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
>>>  	}
>>> +
>>> +	update_nr_idle_scan(env, sds, sum_util);
>>>  }
>>>  
>>>  #define NUMA_IMBALANCE_MIN 2
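As a quick sanity check of the nr_scan formula in update_nr_idle_scan() above,
taking llc_size = 112 as in the earlier example and treating
sum_util / total_capacity as the utilization ratio:

	ratio =  0%:  nr_scan = 112 - 108 * 2 * 0.00 = 112
	ratio = 25%:  nr_scan = 112 - 108 * 2 * 0.25 =  58
	ratio = 50%:  nr_scan = 112 - 108 * 2 * 0.50 =   4
	ratio > 50%:  clamped to the lower bound of 4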
> .
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
  2022-03-17 10:39     ` K Prateek Nayak
@ 2022-04-02 14:35       ` K Prateek Nayak
  2022-04-09 15:15         ` Chen Yu
  0 siblings, 1 reply; 15+ messages in thread
From: K Prateek Nayak @ 2022-04-02 14:35 UTC (permalink / raw)
  To: Chen Yu
  Cc: linux-kernel, Tim Chen, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Viresh Kumar, Barry Song, Barry Song, Yicong Yang,
	Srikar Dronamraju, Len Brown, Ben Segall,
	Daniel Bristot de Oliveira, Aubrey Li

Hello Chenyu,

On 3/17/2022 4:09 PM, K Prateek Nayak wrote:
>> May I know the average CPU utilization of this benchmark?
> I don't have this data at hand. I'll get back to you soon with the data.

This is the output of vmstat captured every minute during the runtime of the benchmark:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
  0  0      0 524179712  47956 623328    0    0    18     1   95   70  0  1 99  0  0
  0  0      0 524583488   6984 296760    0    0   869    32  765  613  0  0 100  0  0
  2  0      0 519057408  10840 413092    0    0  1413    97 40022 245964  2  1 98  0  0
 13  0      0 515806048  11036 1560292    0    0     1 15861 47291 481310  3  1 95  0  0
  0  0      0 519494240  11844 2073552    0    0    27 11934 7397 70297  1  0 99  0  0
 49  1      0 508753216  12064 4519468    0    0    22 82723 160775 1219361 11  5 84  0  0
 51  1      0 508787424  12188 4686136    0    0     3 177661 199218 1522425 13  5 81  0  0
 51  0      0 508717088  12464 5102900    0    0     8 172393 215555 1526566 14  6 81  0  0
 54  0      0 508974208  12584 5133116    0    0     0 196022 194294 1520359 14  5 81  0  0
 49  0      0 509171712  12720 5133420    0    0     1 189473 181195 1520990 14  5 81  0  0
 65  2      0 509624672  12820 5172400    0    0     0 192406 190311 1516981 14  5 80  0  0
 50  1      0 510052864  14228 5231312    0    0   886 173806 236733 1491140 14  6 80  0  0
 50  1      0 510455904  14340 5231412    0    0     0 191996 252078 1479435 14  6 80  0  0
 51  0      0 510845184  14464 5231436    0    0     0 191033 251433 1480865 14  6 80  0  0
 54  0      0 511137472  15128 5247844    0    0    12 198460 254401 1485615 14  6 80  0  0
 54  0      0 511242816  15240 5247868    0    0     0 205262 255299 1483994 14  6 80  0  0
  0  0      0 524682656   6488 302624    0    0   596 64012 55117 315385  3  1 96  0  0
 11  0      0 523307136  10096 350728    0    0   788    89 3350 30092  0  0 100  0  0
  9  0      0 516713440  10276 1222368    0    0     2  8941 51751 497130  3  1 95  0  0
  0  0      0 519524608  10428 2009192    0    0     8 18969 29798 277747  2  1 97  0  0
 50  0      0 510469184  10608 4463716    0    0    15 72376 99893 592582  6  2 92  0  0
 50  1      0 510225728  10760 4954988    0    0     3 175225 223641 1531010 13  5 81  0  0
 49  0      0 510185376  10856 4955084    0    0     0 167856 208568 1541377 13  5 81  0  0
 51  1      0 510101696  10964 5058408    0    0     0 185400 216839 1534992 13  5 81  0  0
 56  0      0 509908448  11084 5058524    0    0     0 184871 240666 1478386 14  6 81  0  0
 57  0      0 509593088  11188 5058624    0    0     0 174319 257078 1472289 14  6 80  0  0
 46  0      0 509449280  11320 5120364    0    0     0 199695 246095 1484076 14  6 81  0  0
 47  0      0 509569120  11440 5120504    0    0     0 195409 246954 1486632 14  6 81  0  0
 51  1      0 509394240  11564 5120640    0    0     0 187739 235864 1487453 14  5 81  0  0
 46  1      0 509165184  11668 5137480    0    0     0 190097 225997 1510812 14  5 81  0  0
 83  1      0 509214944  11788 5137564    0    0     0 185712 232513 1524289 14  5 81  0  0
  0  0      0 524685792   3940 268116    0    0    48 153174 152934 966090  9  3 88  0  0
  0  0      0 524574880   6560 317068    0    0   703    82  193  342  0  0 100  0  0
 11  0      0 515208960   8380 877736    0    0   606  8627 31043 324947  2  1 97  0  0
  1  0      0 519673440   9196 1845944    0    0    20 13871 41874 470542  3  1 96  0  0
 56  0      0 513421280   9328 2819588    0    0    13  6776 21005 106752  1  0 98  0  0
46  0      0 509829408   9520 4192212    0    0     4 157023 199117 1512516 13  5 82  0  0
49  1      0 509403200  10084 4596844    0    0    22 175970 205070 1512552 13  5 81  0  0
50  0      0 508933536  10220 4828368    0    0     0 180676 224455 1512344 14  6 81  0  0
47  0      0 508529472  10364 4940972    0    0     1 196158 250799 1502868 14  6 80  0  0
58  0      0 508565344  10476 4941012    0    0     0 194334 256721 1502510 14  6 80  0  0
50  0      0 508594176  10596 4941144    0    0     0 182299 257354 1501533 14  6 80  0  0
57  1      0 508721088  10700 4965532    0    0     0 186060 255378 1501167 14  6 80  0  0
49  0      0 508646144  10824 4965644    0    0     0 191029 257866 1500889 14  6 80  0  0
49  0      0 508739136  10920 4965672    0    0     0 182014 260923 1504201 14  6 80  0  0
50  0      0 509059616  11032 4974112    0    0     0 195981 260905 1503577 14  6 80  0  0
 0  1      0 516111104  11152 4975472    0    0     0 190756 186797 1452580 13  5 82  0  0
 0  0      0 519977088  17876 4985004    0    0   177 18948  747  650  0  0 100  0  0

If there is any specific monitoring program you would like me to run, please let me know.

--

Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
  2022-04-02 10:11     ` Yicong Yang
@ 2022-04-09 15:09       ` Chen Yu
  2022-04-11  8:04         ` Yicong Yang
  0 siblings, 1 reply; 15+ messages in thread
From: Chen Yu @ 2022-04-09 15:09 UTC (permalink / raw)
  To: Yicong Yang
  Cc: Chen Yu, Yicong Yang, yangyicong, Linux Kernel Mailing List,
	Tim Chen, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Viresh Kumar, Barry Song, Barry Song, Srikar Dronamraju,
	Len Brown, Ben Segall, Daniel Bristot de Oliveira, Aubrey Li,
	K Prateek Nayak, shenyang (M)

On Tue, Apr 5, 2022 at 9:05 AM Yicong Yang <yangyicong@huawei.com> wrote:
>
> FYI, shenyang has done some investigation on whether we can get an idle cpu if the nr is 4.
> For netperf running on node 0-1 (32 cores on each node) with 32, 64, 128 threads, the success
> rate of finding an idle cpu is about 61.8%, 7.4%, <0.1%, and the CPU utilization is 70.7%, 87.4%
> and 99.9% respectively.
>
Thanks for this testing. So this indicates that nr = 4 would not improve
the idle CPU search efficiency much when the load is extremely high.
Stopping the search entirely when the load is nearly 100% may be more
appropriate.
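One rough, untested sketch of that idea on top of update_nr_idle_scan() (the
98% cutoff is an arbitrary placeholder, and select_idle_cpu() would need to
treat a published value of 0 as "skip the scan entirely"):

	nr_scan = llc_size - (llc_size - 4) * 2 * sum_util / sds->total_capacity;
	if (sum_util * 100 >= sds->total_capacity * 98)
		nr_scan = 0;	/* domain nearly saturated: do not scan */
	else if (nr_scan < 4)
		nr_scan = 4;

	WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);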
> I have tested this patch based on 5.17-rc7 on Kunpeng 920. The benchmarks are bound to node 0
> or node 0-1. The tbench result has some oscillation so I need to have a further check.
> For netperf I see performance enhancement when the threads equals to the cpu number.
>
The benefit might come from returning the previous CPU earlier when
nr_threads equals nr_cpu. And when the number of threads exceeds the
number of CPUs, it might have already returned the previous CPU without
this patch, so we didn't see much improvement (in Shenyang's test, the
success rate is only 7.4% when the number of threads equals the number
of CPUs).
> For netperf:
> TCP_RR 2 nodes
> threads         base            patched         pct
> 16              50335.56667     49970.63333     -0.73%
> 32              47281.53333     48191.93333     1.93%
> 64              18907.7         34263.63333     81.22%
> 128             14391.1         14480.8         0.62%
> 256             6905.286667     6853.83         -0.75%
>
> TCP_RR 1 node
> threads         base            patched         pct
> 16              50086.06667     49648.13333     -0.87%
> 32              24983.3         39489.43333     58.06%
> 64              18340.03333     18399.56667     0.32%
> 128             7174.713333     7390.09         3.00%
> 256             3433.696667     3404.956667     -0.84%
>
> UDP_RR 2 nodes
> threads         base            patched         pct
> 16              81448.7         82659.43333     1.49%
> 32              75351.13333     76812.36667     1.94%
> 64              25539.46667     41835.96667     63.81%
> 128             25081.56667     23595.56667     -5.92%
> 256             11848.23333     11017.13333     -7.01%
>
> UDP_RR 1 node
> threads         base            patched         pct
> 16              87288.96667     88719.83333     1.64%
> 32              22891.73333     68854.33333     200.78%
> 64              33853.4         35891.6         6.02%
> 128             12108.4         11885.76667     -1.84%
> 256             5620.403333     5531.006667     -1.59%
>
> mysql on node 0-1
>                         base            patched         pct
> 16threads-TPS           7100.27         7224.31         1.75%
> 16threads-QPS           142005.45       144486.19       1.75%
> 16threads-avg lat       2.25            2.22            1.63%
> 16threads-99th lat      2.46            2.43            1.08%
> 24threads-TPS           10424.70        10312.20        -1.08%
> 24threads-QPS           208493.86       206243.93       -1.08%
> 24threads-avg lat       2.30            2.32            -0.87%
> 24threads-99th lat      2.52            2.57            -1.85%
> 32threads-TPS           12528.79        12228.88        -2.39%
> 32threads-QPS           250575.92       244577.59       -2.39%
> 32threads-avg lat       2.55            2.61            -2.35%
> 32threads-99th lat      2.88            2.99            -3.82%
> 64threads-TPS           21386.17        21789.99        1.89%
> 64threads-QPS           427723.41       435799.85       1.89%
> 64threads-avg lat       2.99            2.94            1.78%
> 64threads-99th lat      5.00            4.69            6.33%
> 128threads-TPS          20865.13        20781.24        -0.40%
> 128threads-QPS          417302.73       415624.83       -0.40%
> 128threads-avg lat      6.13            6.16            -0.38%
> 128threads-99th lat     8.90            8.95            -0.60%
> 256threads-TPS          19258.15        19295.11        0.19%
> 256threads-QPS          385162.92       385902.27       0.19%
> 256threads-avg lat      13.29           13.26           0.23%
> 256threads-99th lat     20.12           20.12           0.00%
>
> I also had a look on a machine with 2-socket Xeon 6148 (80 threads in total)
> For TCP_RR, the best enhancement also happens when the number of threads
> equals the cpu number.
>
May I know if the test is with turbo enabled or disabled? If the turbo
is disabled,
there might be some issues when calculating the util_avg. I had a workaround at
https://lore.kernel.org/all/20220407234258.569681-1-yu.c.chen@intel.com/
And I'm working on the v3 patch which would include the above workaround;
I will send it out later.

-- 
Thanks,
Chenyu

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
  2022-04-02 14:35       ` K Prateek Nayak
@ 2022-04-09 15:15         ` Chen Yu
  0 siblings, 0 replies; 15+ messages in thread
From: Chen Yu @ 2022-04-09 15:15 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Chen Yu, Linux Kernel Mailing List, Tim Chen, Peter Zijlstra,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Viresh Kumar, Barry Song, Barry Song,
	Yicong Yang, Srikar Dronamraju, Len Brown, Ben Segall,
	Daniel Bristot de Oliveira, Aubrey Li

On Tue, Apr 5, 2022 at 11:20 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Chenyu,
>
> This is the output of vmstat captured every minute during the runtime of the benchmark:
>
> [vmstat output snipped; quoted in full earlier in this thread]
>
> If there is any specific monitoring program you would like me to run, please let me know.
>
Thanks for providing this information. So the ycsb-mongodb had some
regression when the load was around 20%. As the v2 patch would change
the search depth when the load is low, these results might indicate
that we should search more CPUs when the system load is low.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg
  2022-04-09 15:09       ` Chen Yu
@ 2022-04-11  8:04         ` Yicong Yang
  0 siblings, 0 replies; 15+ messages in thread
From: Yicong Yang @ 2022-04-11  8:04 UTC (permalink / raw)
  To: Chen Yu
  Cc: yangyicong, Chen Yu, Yicong Yang, Linux Kernel Mailing List,
	Tim Chen, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Viresh Kumar, Barry Song, Barry Song, Srikar Dronamraju,
	Len Brown, Ben Segall, Daniel Bristot de Oliveira, Aubrey Li,
	K Prateek Nayak, shenyang (M)

On 2022/4/9 23:09, Chen Yu wrote:
> On Tue, Apr 5, 2022 at 9:05 AM Yicong Yang <yangyicong@huawei.com> wrote:
>>
>> FYI, shenyang has done some investigation on whether we can get an idle cpu if the nr is 4.
>> For netperf running on node 0-1 (32 cores on each node) with 32, 64, 128 threads, the success
>> rate of finding an idle cpu is about 61.8%, 7.4%, <0.1%, and the CPU utilization is 70.7%, 87.4%
>> and 99.9% respectively.
>>
> Thanks for this testing. So this indicates that nr = 4 would not improve
> the idle CPU search efficiency much when the load is extremely high.
> Stopping the search entirely when the load is nearly 100% may be more
> appropriate.
>> I have tested this patch based on 5.17-rc7 on Kunpeng 920. The benchmarks are bound to node 0
>> or node 0-1. The tbench result has some oscillation so I need to have a further check.
>> For netperf I see performance enhancement when the threads equals to the cpu number.
>>
> The benefit might come from returning the previous CPU earlier when
> nr_threads equals nr_cpu. And when the number of threads exceeds the
> number of CPUs, it might have already returned the previous CPU without
> this patch, so we didn't see much improvement (in Shenyang's test, the
> success rate is only 7.4% when the number of threads equals the number
> of CPUs).

Yes, I think that may be the case. When the system is fully loaded, the behaviour may stay
the same as with the previous approach; both scan 4 CPUs.

>> [netperf and mysql results snipped; quoted in full earlier in this thread]
>>
>> I also had a look on a machine with 2-socket Xeon 6148 (80 threads in total)
>> For TCP_RR, the best enhancement also happens when the number of threads
>> equals the cpu number.
>>
> May I know if the test is with turbo enabled or disabled? If the turbo
> is disabled,

p-state is controlled by the platform on my machine. I assume it's on, as the Avg_MHz
and Bzy_MHz vary according to the load. turbostat says:
# cpu 0 idle
Package Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IPC     IRQ     SMI     POLL    C1      POLL%   C1%     CPU%c1  CPU%c6  CoreTmp PkgTmp  PkgWatt RAMWatt PKG_%   RAM_%
-       -       -       0       0.01    3100    2400    1.41    1935    0       0       1831    0.00    100.00  99.99   0.00    69      69      150.55  44.41   0.00    0.00
0       0       0       2       0.06    3100    2400    0.99    46      0       0       45      0.00    99.95   99.94   0.00    66      68      74.12   21.71   0.00    0.00
# cpu 0 partly loaded
Package Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IPC     IRQ     SMI     POLL    C1      POLL%   C1%     CPU%c1  CPU%c6  CoreTmp PkgTmp  PkgWatt RAMWatt PKG_%   RAM_%
-       -       -       1559    50.34   3097    2400    1.44    56952   0       64      4729    0.00    49.67   49.66   0.00    83      84      291.81  43.79   0.00    0.00
0       0       0       1642    53.01   3098    2400    1.48    1610    0       0       620     0.00    47.51   46.99   0.00    75      78      142.62  21.44   0.00    0.00
# cpu 0 loaded by stress-ng
Package Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IPC     IRQ     SMI     POLL    C1      POLL%   C1%     CPU%c1  CPU%c6  CoreTmp PkgTmp  PkgWatt RAMWatt PKG_%   RAM_%
-       -       -       2917    99.52   2931    2400    1.06    101718  0       0       0       0.00    0.00    0.48    0.00    86      86      299.87  43.84   0.00    0.00
0       0       0       2943    99.39   2961    2400    1.05    1278    0       0       0       0.00    0.00    0.61    0.00    81      83      149.90  21.38   0.00    0.00


> there might be some issues when calculating the util_avg. I had a workaround at
> https://lore.kernel.org/all/20220407234258.569681-1-yu.c.chen@intel.com/
> And I'm working on the v3 patch which would include the above workaround;
> I will send it out later.
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-04-11  8:05 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-10  0:52 [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util_avg Chen Yu
2022-03-14  4:53 ` Abel Wu
2022-03-14 12:56   ` Chen Yu
2022-03-14 17:34     ` Tim Chen
2022-03-16 16:13       ` Chen Yu
2022-03-15 11:37 ` K Prateek Nayak
2022-03-16 11:54   ` Chen Yu
2022-03-17 10:39     ` K Prateek Nayak
2022-04-02 14:35       ` K Prateek Nayak
2022-04-09 15:15         ` Chen Yu
2022-03-17 17:39 ` Yicong Yang
2022-03-18  3:43   ` Chen Yu
2022-04-02 10:11     ` Yicong Yang
2022-04-09 15:09       ` Chen Yu
2022-04-11  8:04         ` Yicong Yang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).