linux-kernel.vger.kernel.org archive mirror
* [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
@ 2022-04-28 18:24 Chen Yu
  2022-05-06  8:38 ` Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Chen Yu @ 2022-04-28 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot, Mel Gorman, Yicong Yang,
	K Prateek Nayak, Tim Chen
  Cc: Chen Yu, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Barry Song, Srikar Dronamraju, Len Brown,
	Ben Segall, Aubrey Li, Abel Wu, Zhang Rui, linux-kernel,
	Daniel Bristot de Oliveira, Chen Yu

[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU
when the system is overloaded.

The following histogram is the time spent in select_idle_cpu(),
when running 224 instances of netperf on a system with 112 CPUs
per LLC domain:

@usecs:
[0]                  533 |                                                    |
[1]                 5495 |                                                    |
[2, 4)             12008 |                                                    |
[4, 8)            239252 |                                                    |
[8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
[16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
[512, 1K)        2600472 |@@@@@@@@@                                           |
[1K, 2K)          927912 |@@@                                                 |
[2K, 4K)          218720 |                                                    |
[4K, 8K)           98161 |                                                    |
[8K, 16K)          37722 |                                                    |
[16K, 32K)          6715 |                                                    |
[32K, 64K)           477 |                                                    |
[64K, 128K)            7 |                                                    |

netperf latency usecs:
=======
case            	load    	    Lat_99th	    std%
TCP_RR          	thread-224	      257.39	(  0.21)

The time spent in select_idle_cpu() is visible to netperf and might have a negative
impact.

[Symptom analysis]
Mel Gorman's patch [1] was applied to track the efficiency of
select_idle_sibling(). The indicators are copied here:

SIS Search Efficiency(se_eff%):
        A ratio expressed as a percentage of runqueues scanned versus
        idle CPUs found. A 100% efficiency indicates that the target,
        prev or recent CPU of a task was idle at wakeup. The lower the
        efficiency, the more runqueues were scanned before an idle CPU
        was found.

SIS Domain Search Efficiency(dom_eff%):
        Similar, except only for the slower SIS path.

SIS Fast Success Rate(fast_rate%):
        Percentage of SIS that used target, prev or recent CPUs.

SIS Success rate(success_rate%):
        Percentage of scans that found an idle CPU.

The test is based on Aubrey's schedtests tool: netperf, hackbench,
schbench and tbench were launched with 25% 50% 75% 100% 125% 150%
175% 200% of the CPU number respectively. Each test lasts for 100
seconds and repeats 3 times. The system reboots into a fresh
environment for each test.

Test on vanilla kernel:
schedstat_parse.py -f netperf_vanilla.log
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
TCP_RR	   28 threads	     99.978	      18.535	      99.995	     100.000
TCP_RR	   56 threads	     99.397	       5.671	      99.964	     100.000
TCP_RR	   84 threads	     21.721	       6.818	      73.632	     100.000
TCP_RR	  112 threads	     12.500	       5.533	      59.000	     100.000
TCP_RR	  140 threads	      8.524	       4.535	      49.020	     100.000
TCP_RR	  168 threads	      6.438	       3.945	      40.309	      99.999
TCP_RR	  196 threads	      5.397	       3.718	      32.320	      99.982
TCP_RR	  224 threads	      4.874	       3.661	      25.775	      99.767
UDP_RR	   28 threads	     99.988	      17.704	      99.997	     100.000
UDP_RR	   56 threads	     99.528	       5.977	      99.970	     100.000
UDP_RR	   84 threads	     24.219	       6.992	      76.479	     100.000
UDP_RR	  112 threads	     13.907	       5.706	      62.538	     100.000
UDP_RR	  140 threads	      9.408	       4.699	      52.519	     100.000
UDP_RR	  168 threads	      7.095	       4.077	      44.352	     100.000
UDP_RR	  196 threads	      5.757	       3.775	      35.764	      99.991
UDP_RR	  224 threads	      5.124	       3.704	      28.748	      99.860

schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
normal	   1   mthread	     99.152	       6.400	      99.941	     100.000
normal	   2   mthreads	     97.844	       4.003	      99.908	     100.000
normal	   3   mthreads	     96.395	       2.118	      99.917	      99.998
normal	   4   mthreads	     55.288	       1.451	      98.615	      99.804
normal	   5   mthreads	      7.004	       1.870	      45.597	      61.036
normal	   6   mthreads	      3.354	       1.346	      20.777	      34.230
normal	   7   mthreads	      2.183	       1.028	      11.257	      21.055
normal	   8   mthreads	      1.653	       0.825	       7.849	      15.549

schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case			load	        se_eff%	    dom_eff%	  fast_rate%	success_rate%
process-pipe	     1 group	         99.991	       7.692	      99.999	     100.000
process-pipe	    2 groups	         99.934	       4.615	      99.997	     100.000
process-pipe	    3 groups	         99.597	       3.198	      99.987	     100.000
process-pipe	    4 groups	         98.378	       2.464	      99.958	     100.000
process-pipe	    5 groups	         27.474	       3.653	      89.811	      99.800
process-pipe	    6 groups	         20.201	       4.098	      82.763	      99.570
process-pipe	    7 groups	         16.423	       4.156	      77.398	      99.316
process-pipe	    8 groups	         13.165	       3.920	      72.232	      98.828
process-sockets	     1 group	         99.977	       5.882	      99.999	     100.000
process-sockets	    2 groups	         99.927	       5.505	      99.996	     100.000
process-sockets	    3 groups	         99.397	       3.250	      99.980	     100.000
process-sockets	    4 groups	         79.680	       4.258	      98.864	      99.998
process-sockets	    5 groups	          7.673	       2.503	      63.659	      92.115
process-sockets	    6 groups	          4.642	       1.584	      58.946	      88.048
process-sockets	    7 groups	          3.493	       1.379	      49.816	      81.164
process-sockets	    8 groups	          3.015	       1.407	      40.845	      75.500
threads-pipe	     1 group	         99.997	       0.000	     100.000	     100.000
threads-pipe	    2 groups	         99.894	       2.932	      99.997	     100.000
threads-pipe	    3 groups	         99.611	       4.117	      99.983	     100.000
threads-pipe	    4 groups	         97.703	       2.624	      99.937	     100.000
threads-pipe	    5 groups	         22.919	       3.623	      87.150	      99.764
threads-pipe	    6 groups	         18.016	       4.038	      80.491	      99.557
threads-pipe	    7 groups	         14.663	       3.991	      75.239	      99.247
threads-pipe	    8 groups	         12.242	       3.808	      70.651	      98.644
threads-sockets	     1 group	         99.990	       6.667	      99.999	     100.000
threads-sockets	    2 groups	         99.940	       5.114	      99.997	     100.000
threads-sockets	    3 groups	         99.469	       4.115	      99.977	     100.000
threads-sockets	    4 groups	         87.528	       4.038	      99.400	     100.000
threads-sockets	    5 groups	          6.942	       2.398	      59.244	      88.337
threads-sockets	    6 groups	          4.359	       1.954	      49.448	      87.860
threads-sockets	    7 groups	          2.845	       1.345	      41.198	      77.102
threads-sockets	    8 groups	          2.871	       1.404	      38.512	      74.312

schedstat_parse.py -f tbench_vanilla.log
case			load	      se_eff%	    dom_eff%	  fast_rate%	success_rate%
loopback	  28 threads	       99.976	      18.369	      99.995	     100.000
loopback	  56 threads	       99.222	       7.799	      99.934	     100.000
loopback	  84 threads	       19.723	       6.819	      70.215	     100.000
loopback	 112 threads	       11.283	       5.371	      55.371	      99.999
loopback	 140 threads	        0.000	       0.000	       0.000	       0.000
loopback	 168 threads	        0.000	       0.000	       0.000	       0.000
loopback	 196 threads	        0.000	       0.000	       0.000	       0.000
loopback	 224 threads	        0.000	       0.000	       0.000	       0.000

According to the test above, the SIS Search Efficiency(se_eff%) drops
significantly as the system becomes busy. Although some benchmarks
eventually find an idle CPU(success_rate% = 100%), it is doubtful
whether searching the whole LLC domain is worth the cost.

[Proposal]
It would be ideal to have a crystal ball to answer this question:
How many CPUs must a wakeup path walk down, before it can find an idle
CPU? Many potential metrics could be used to predict the number.
One candidate is the sum of util_avg in this LLC domain. The benefit
of choosing util_avg is that it is a metric of accumulated historic
activity, which seems to be smoother than instantaneous metrics
(such as rq->nr_running). Besides, the sum of util_avg should predict
the load of the LLC domain more precisely than SIS_PROP, which uses
one CPU's idle time to estimate the idle time of the whole LLC domain.
As Peter suggested[2], the lower the util_avg is, the more
select_idle_cpu() should scan for an idle CPU, and vice versa.

Introduce the quadratic function:

y = a - bx^2

x is the sum_util ratio [0, 100] of this LLC domain, and y is the percentage
of CPUs to be scanned in the LLC domain. The number of CPUs to search drops
as sum_util increases. When sum_util hits 85% or above, the scan stops.
The 85% was chosen because it is the threshold of an overloaded LLC
sched group (imbalance_pct = 117). A quadratic function was chosen
because:

[1] Compared to a linear function, it scans more aggressively when the
    sum_util is low.
[2] Compared to an exponential function, it is easier to calculate.
[3] There seems to be no accurate mapping between the sum of util_avg
    and the number of CPUs to be scanned, so use a heuristic scan for now.

The steps to calculate scan_nr are as follows:
[1] scan_percent = 100 - (x/8.5)^2
    when utilization reaches 85%, scan_percent becomes 0.
[2] scan_nr = nr_llc * scan_percent / 100
[3] scan_nr = max(scan_nr, 0)

For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65   75   85 ...
scan_nr   112  112  108  103  92  80  64   47   24    0 ...

Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance.
As mentioned by Abel, on a platform with 112 CPUs per LLC, the
sum_util calculated by periodic load balance after 112ms would decay
to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in
reflecting the latest utilization. But it is a trade-off.
Checking the util_avg in newidle load balance would be more frequent,
but it brings overhead: multiple CPUs writing/reading the per-LLC
shared variable introduces cache false sharing. Tim also mentioned
that being non-optimal for short-term variations is acceptable; if
there is a long-term trend in the load behavior, the scheduler can
adjust for that.

SIS_UTIL is disabled by default. When it is enabled, select_idle_cpu()
uses the nr_scan calculated by SIS_UTIL instead of the one from
SIS_PROP. Later, SIS_UTIL and SIS_PROP could be made mutually exclusive.

[Test result]

The following is the benchmark result comparison between
baseline:vanilla and compare:patched kernel. Positive compare%
indicates better performance.

netperf.throughput
each thread: netperf -4 -H 127.0.0.1 -t TCP/UDP_RR -c -C -l 100
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	28 threads	 1.00 (  0.40)	 +1.14 (  0.37)
TCP_RR          	56 threads	 1.00 (  0.49)	 +0.62 (  0.31)
TCP_RR          	84 threads	 1.00 (  0.50)	 +0.26 (  0.55)
TCP_RR          	112 threads	 1.00 (  0.27)	 +0.29 (  0.28)
TCP_RR          	140 threads	 1.00 (  0.22)	 +0.14 (  0.23)
TCP_RR          	168 threads	 1.00 (  0.21)	 +0.40 (  0.19)
TCP_RR          	196 threads	 1.00 (  0.18)	+183.40 ( 16.43)
TCP_RR          	224 threads	 1.00 (  0.16)	+188.44 (  9.29)
UDP_RR          	28 threads	 1.00 (  0.47)	 +1.45 (  0.47)
UDP_RR          	56 threads	 1.00 (  0.28)	 -0.22 (  0.30)
UDP_RR          	84 threads	 1.00 (  0.38)	 +1.72 ( 27.10)
UDP_RR          	112 threads	 1.00 (  0.16)	 +0.01 (  0.18)
UDP_RR          	140 threads	 1.00 ( 14.10)	 +0.32 ( 11.15)
UDP_RR          	168 threads	 1.00 ( 12.75)	 +0.91 ( 11.62)
UDP_RR          	196 threads	 1.00 ( 14.41)	+191.97 ( 19.34)
UDP_RR          	224 threads	 1.00 ( 15.34)	+194.88 ( 17.06)

Taking the 224-thread case as an example, the changes in the SIS search
metrics are illustrated below:

    vanilla                    patched
   4544492          +237.5%   15338634        sched_debug.cpu.sis_domain_search.avg
     38539        +39686.8%   15333634        sched_debug.cpu.sis_failed.avg
  128300000          -87.9%   15551326        sched_debug.cpu.sis_scanned.avg
   5842896          +162.7%   15347978        sched_debug.cpu.sis_search.avg

There are 87.9% fewer CPU scans after patching, which indicates lower
overhead. Besides, with this patch applied, there is 13% less rq lock
contention in
perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function.
This could help explain the performance improvement: the patch allows
the waking task to remain on its previous CPU rather than grabbing
another CPU's lock.

Other benchmarks:

hackbench.throughput
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1 group 	 1.00 (  0.09)	 -0.54 (  0.82)
process-pipe    	2 groups 	 1.00 (  0.47)	 +0.89 (  0.61)
process-pipe    	4 groups 	 1.00 (  0.83)	 +0.90 (  0.15)
process-pipe    	8 groups 	 1.00 (  0.09)	 +0.31 (  0.07)
process-sockets 	1 group 	 1.00 (  0.13)	 -0.58 (  0.49)
process-sockets 	2 groups 	 1.00 (  0.41)	 -0.58 (  0.52)
process-sockets 	4 groups 	 1.00 (  0.61)	 -0.37 (  0.50)
process-sockets 	8 groups 	 1.00 (  0.22)	 +1.15 (  0.10)
threads-pipe    	1 group 	 1.00 (  0.35)	 -0.28 (  0.78)
threads-pipe    	2 groups 	 1.00 (  0.65)	 +0.03 (  0.96)
threads-pipe    	4 groups 	 1.00 (  0.43)	 +0.81 (  0.38)
threads-pipe    	8 groups 	 1.00 (  0.11)	 -1.56 (  0.07)
threads-sockets 	1 group 	 1.00 (  0.30)	 -0.39 (  0.41)
threads-sockets 	2 groups 	 1.00 (  0.21)	 -0.23 (  0.27)
threads-sockets 	4 groups 	 1.00 (  0.23)	 +0.36 (  0.19)
threads-sockets 	8 groups 	 1.00 (  0.13)	 +1.57 (  0.06)

tbench.throughput
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	28 threads	 1.00 (  0.15)	 +1.05 (  0.08)
loopback        	56 threads	 1.00 (  0.09)	 +0.36 (  0.04)
loopback        	84 threads	 1.00 (  0.12)	 +0.26 (  0.06)
loopback        	112 threads	 1.00 (  0.12)	 +0.04 (  0.09)
loopback        	140 threads	 1.00 (  0.04)	 +2.98 (  0.18)
loopback        	168 threads	 1.00 (  0.10)	 +2.88 (  0.30)
loopback        	196 threads	 1.00 (  0.06)	 +2.63 (  0.03)
loopback        	224 threads	 1.00 (  0.08)	 +2.60 (  0.06)

schbench.latency_90%_us
========
case            	load    	baseline	compare%
normal          	1 mthread	 1.00 	        -1.7%
normal          	2 mthreads	 1.00           +1.6%
normal          	4 mthreads	 1.00           +1.4%
normal          	8 mthreads	 1.00    	+21.0%

Limitations:
[1]
This patch is based on util_avg, which is very sensitive to CPU
frequency invariance. The util_avg decays quite fast when the CPU is
idle if the max frequency has been limited by the user. Patch [3]
should be applied if turbo is disabled manually on Intel platforms.

[2]
There may be unbalanced tasks among CPUs due to CPU affinity. For example,
suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to
CPU0~CPU6, while CPU7 is idle:

          CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
util_avg  1024    1024    1024    1024    1024    1024    1024    0

Since the util_avg ratio is 87.5% ( = 7/8 ), which is higher than 85%,
select_idle_cpu() will not scan, so CPU7 remains undetected.

A possible workaround is to increase nr_scan if idle CPUs are detected
during periodic load balance. The problem could also be mitigated by
idle load balance, which might let CPU7 pull some tasks to itself.

[3]
Prateek mentioned that we should scan aggressively in an LLC domain
with 16 CPUs, because the cost of searching for an idle CPU among 16 is
negligible. The current patch aims to be a generic solution and only
considers the util_avg. A follow-up change could enhance the scan policy
to adjust the scan_percent according to the number of CPUs in the LLC.

v2->v3:
 - Use 85% as the threshold again, because the CPU frequency invariance issue
   has been fixed and the patch is queued for 5.19.

 - Stop the scan if 85% is reached, rather than scanning for at least 4 CPUs.
   According to the feedback from Yicong, it might be better to stop scanning
   entirely when the LLC is overloaded.

 - Replace linear scan with quadratic function scan, to let the SIS scan
   aggressively when the LLC is not busy. Prateek mentioned there was slight
   regression from ycsb-mongodb in v2, which might be due to fewer CPUs
   scanned when the utilization is around 20%.

 - Add back the logic to stop the CPU scan even if has_idle_core is true.
   It might be a waste of time to search for an idle core if the LLC is
   overloaded. Besides, according to the tbench result from Prateek,
   stopping the idle core scan brings extra performance improvement.

 - Provide the SIS search statistics in the commit log, based on Mel Gorman's
   patch, as suggested by Abel.

 - Introduce the SIS_UTIL sched feature rather than changing the logic of
   SIS_PROP directly, which makes the change easier to review.

v1->v2:
 - As suggested by Peter, introduce an idle CPU scan strategy based on
   the util_avg metric. When util_avg is very low it scans more, while
   when util_avg hits the threshold the scan stops entirely. The
   threshold was decreased from 85% to 50%, because 50% is the threshold
   when the CPU is nearly 100% busy but with turbo disabled. Scan at
   least 4 CPUs even when the LLC is overloaded, to keep it consistent
   with the current logic of select_idle_cpu().

v1:
 - Stop scanning the idle CPU in select_idle_cpu(), if the sum of util_avg in
   the LLC domain has reached 85%.

[Resend to include the missing mailing list, sorry for any inconvenience.]

Link: https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net #1
Link: https://lore.kernel.org/lkml/20220207135253.GF23216@worktop.programming.kicks-ass.net #2
Link: https://lore.kernel.org/lkml/20220407234258.569681-1-yu.c.chen@intel.com #3
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 56 ++++++++++++++++++++++++++++++++++
 kernel/sched/features.h        |  1 +
 3 files changed, 58 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..816df6cc444e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,6 +81,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	int		nr_idle_scan;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23c7d0f617ee..50c9d5b2b338 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6327,6 +6327,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	struct sched_domain_shared *sd_share;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
@@ -6366,6 +6367,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
+	if (sched_feat(SIS_UTIL)) {
+		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
+		if (sd_share) {
+			/* because !--nr is the condition to stop scan */
+			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
+			/* overloaded LLC is unlikely to have idle cpu/core */
+			if (nr == 1)
+				return -1;
+		}
+	}
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -9267,6 +9279,46 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+static inline void update_idle_cpu_scan(struct lb_env *env,
+					unsigned long sum_util)
+{
+	struct sched_domain_shared *sd_share;
+	int nr_scan, nr_llc, llc_util_pct;
+
+	if (!sched_feat(SIS_UTIL))
+		return;
+	/*
+	 * Update the number of CPUs to scan in LLC domain, which could
+	 * be used as a hint in select_idle_cpu(). The update of this hint
+	 * occurs during periodic load balancing, rather than frequent
+	 * newidle balance.
+	 */
+	nr_llc = per_cpu(sd_llc_size, env->dst_cpu);
+	if (env->idle == CPU_NEWLY_IDLE ||
+	    env->sd->span_weight != nr_llc)
+		return;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
+	if (!sd_share)
+		return;
+
+	/*
+	 * The number of CPUs to search drops as sum_util increases, when
+	 * sum_util hits 85% or above, the scan stops.
+	 * The reason to choose 85% as the threshold is because this is the
+	 * imbalance_pct when a LLC sched group is overloaded.
+	 * let y = 100 - (x/8.5)^2 = 100 - x^2/72
+	 * y is the percentage of CPUs to be scanned in the LLC
+	 * domain, x is the ratio of sum_util compared to the
+	 * CPU capacity, which ranges in [0, 100], thus
+	 * nr_scan = nr_llc * y / 100
+	 */
+	llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
+	nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
+	nr_scan = max(nr_scan, 0);
+	WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -9279,6 +9331,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
+	unsigned long sum_util = 0;
 	int sg_status = 0;
 
 	do {
@@ -9311,6 +9364,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
 
+		sum_util += sgs->group_util;
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
@@ -9336,6 +9390,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
 		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
 	}
+
+	update_idle_cpu_scan(env, sum_util);
 }
 
 #define NUMA_IMBALANCE_MIN 2
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1cf435bbcd9c..69be099019f4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_UTIL, false)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.25.1



* Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-04-28 18:24 [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg Chen Yu
@ 2022-05-06  8:38 ` Peter Zijlstra
  2022-05-07 14:44   ` Chen Yu
  2022-05-10 12:41 ` Yicong Yang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2022-05-06  8:38 UTC (permalink / raw)
  To: Chen Yu
  Cc: Vincent Guittot, Mel Gorman, Yicong Yang, K Prateek Nayak,
	Tim Chen, Chen Yu, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Barry Song, Srikar Dronamraju, Len Brown,
	Ben Segall, Aubrey Li, Abel Wu, Zhang Rui, linux-kernel,
	Daniel Bristot de Oliveira


So in general I'm liking this.

But we appear to be in the position where there's two patches touching
this code, the other by Abel Wu.

However, I'm thinking that this patch is the simpler of the two, and
parts of Wu's patch can be done on top of this.

Specifically, Wu's patch reminds me of some of my earlier experiments
that added an additional cpumask to the SIS path, and there I found that
the cache miss from accessing an additional mask hurt enough to negate a
lot of the benefit it brought.

On Fri, Apr 29, 2022 at 02:24:42AM +0800, Chen Yu wrote:

> It would be ideal to have a crystal ball to answer this question:
> How many CPUs must a wakeup path walk down, before it can find an idle
> CPU? Many potential metrics could be used to predict the number.
> One candidate is the sum of util_avg in this LLC domain. The benefit
> of choosing util_avg is that it is a metric of accumulated historic
> activity, which seems to be smoother than instantaneous metrics
> (such as rq->nr_running). Besides, choosing the sum of util_avg
> would help predict the load of the LLC domain more precisely, because
> SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle
> time. As Peter suggested[2], the lower the util_avg is, the
> more select_idle_cpu() should scan for idle CPU, and vice versa.
> 
> Introduce the quadratic function:
> 
> y = a - bx^2
> 
> x is the sum_util ratio [0, 100] of this LLC domain, and y is the percentage
> of CPUs to be scanned in the LLC domain. The number of CPUs to search drops
> as sum_util increases. When sum_util hits 85% or above, the scan stops.
> Choosing 85% is because it is the threshold of an overloaded LLC sched group
> (imbalance_pct = 117). Choosing quadratic function is because:
> 
> [1] Compared to the linear function, it scans more aggressively when the
>     sum_util is low.
> [2] Compared to the exponential function, it is easier to calculate.
> [3] It seems that there is no accurate mapping between the sum of util_avg
>     and the number of CPUs to be scanned. Use heuristic scan for now.
> 
> The steps to calculate scan_nr are as followed:
> [1] scan_percent = 100 - (x/8.5)^2
>     when utilization reaches 85%, scan_percent becomes 0.
> [2] scan_nr = nr_llc * scan_percent / 100
> [3] scan_nr = max(scan_nr, 0)
> 
> For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
> sum_util%   0    5   15   25  35  45  55   65   75   85 ...
> scan_ns   112  112  108  103  92  80  64   47   24    0 ...
> 
> Furthermore, to minimize the overhead of calculating the metrics in
> select_idle_cpu(), borrow the statistics from periodic load balance.
> As mentioned by Abel, on a platform with 112 CPUs per LLC, the
> sum_util calculated by periodic load balance after 112ms would decay
> to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in
> reflecting the latest utilization. But it is a trade-off.
> Checking the util_avg in newidle load balance would be more frequent,
> but it brings overhead - multiple CPUs write/read the per-LLC shared
> variable and introduces cache false sharing. And Tim also mentioned
> that, it is allowed to be non-optimal in terms of scheduling for the
> short term variations, but if there is a long term trend in the load
> behavior, the scheduler can adjust for that.
> 
> SIS_UTIL is disabled by default. When it is enabled, the select_idle_cpu()
> will use the nr_scan calculated by SIS_UTIL instead of the one from
> SIS_PROP. Later SIS_UTIL and SIS_PROP could be made mutually exclusive.

This; I'm thinking that it should be default-enabled, otherwise what's
the point. And ideally we'd remove SIS_PROP after a few releases if this
works out.

A few more notes below.. mostly minor nits.

> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 56 ++++++++++++++++++++++++++++++++++
>  kernel/sched/features.h        |  1 +
>  3 files changed, 58 insertions(+)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 56cffe42abbc..816df6cc444e 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -81,6 +81,7 @@ struct sched_domain_shared {
>  	atomic_t	ref;
>  	atomic_t	nr_busy_cpus;
>  	int		has_idle_cores;
> +	int		nr_idle_scan;
>  };
>  
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 23c7d0f617ee..50c9d5b2b338 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6327,6 +6327,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  {
>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
> +	struct sched_domain_shared *sd_share;
>  	struct rq *this_rq = this_rq();
>  	int this = smp_processor_id();
>  	struct sched_domain *this_sd;
> @@ -6366,6 +6367,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  		time = cpu_clock(this);
>  	}
>  
> +	if (sched_feat(SIS_UTIL)) {
> +		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
> +		if (sd_share) {
> +			/* because !--nr is the condition to stop scan */
> +			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> +			/* overloaded LLC is unlikely to have idle cpu/core */
> +			if (nr == 1)
> +				return -1;
> +		}
> +	}
> +
>  	for_each_cpu_wrap(cpu, cpus, target + 1) {
>  		if (has_idle_core) {
>  			i = select_idle_core(p, cpu, cpus, &idle_cpu);
> @@ -9267,6 +9279,46 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  	return idlest;
>  }
>  
> +static inline void update_idle_cpu_scan(struct lb_env *env,
> +					unsigned long sum_util)
> +{
> +	struct sched_domain_shared *sd_share;
> +	int nr_scan, nr_llc, llc_util_pct;
> +
> +	if (!sched_feat(SIS_UTIL))
> +		return;
> +	/*
> +	 * Update the number of CPUs to scan in LLC domain, which could
> +	 * be used as a hint in select_idle_cpu(). The update of this hint
> +	 * occurs during periodic load balancing, rather than frequent
> +	 * newidle balance.
> +	 */
> +	nr_llc = per_cpu(sd_llc_size, env->dst_cpu);
> +	if (env->idle == CPU_NEWLY_IDLE ||
> +	    env->sd->span_weight != nr_llc)
> +		return;
> +
> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
> +	if (!sd_share)
> +		return;
> +
> +	/*
> +	 * The number of CPUs to search drops as sum_util increases, when
> +	 * sum_util hits 85% or above, the scan stops.
> +	 * The reason to choose 85% as the threshold is because this is the
> +	 * imbalance_pct when a LLC sched group is overloaded.
> +	 * let y = 100 - (x/8.5)^2 = 100 - x^2/72
> +	 * y is the percentage of CPUs to be scanned in the LLC
> +	 * domain, x is the ratio of sum_util compared to the
> +	 * CPU capacity, which ranges in [0, 100], thus
> +	 * nr_scan = nr_llc * y / 100
> +	 */
> +	llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
> +	nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
> +	nr_scan = max(nr_scan, 0);

The comment and code don't nicely match up, which makes it much harder
to read than is needed. Given the comment, I would expect the code to
look a little like:

	x = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
	y = 100 - (x*x)/72;
	nr_scan = max(0, (nr_llc * y) / 100);

Now a pet peeve of mine, computers *SUCK* at '100', that's just not a
good number for them, and AFAICT there's no reason what*so*ever to use
100 here, we're free to pick any random number. So lets be silly and
pick SCHED_CAPACITY_SCALE :-)

	x = sum_util / nr_llc;
	y = SCHED_CAPACITY_SCALE - (x*x)/737;
	nr_scan = max(0, (nr_llc * y) / SCHED_CAPACITY_SCALE);

How's that?

(In general, forget about percentages; that's just a daft human
convention because humans like 100 due to having 10 digits or something,
fractions work on any base.)
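For what it's worth, the two variants can be checked against each other in a small userspace sketch (hypothetical helpers, not kernel code); with integer math they agree on the resulting nr_scan, and both cut off near 85% utilization:

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024

/* Percentage-based variant, as in the comment/code above. */
static int nr_scan_pct(unsigned long sum_util, int nr_llc)
{
	int x = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
	int y = 100 - (x * x) / 72;
	int nr = (nr_llc * y) / 100;

	return nr > 0 ? nr : 0;
}

/* SCHED_CAPACITY_SCALE-based variant suggested above. */
static int nr_scan_cap(unsigned long sum_util, int nr_llc)
{
	int x = sum_util / nr_llc;	/* in [0, SCHED_CAPACITY_SCALE] */
	int y = SCHED_CAPACITY_SCALE - (x * x) / 737;
	int nr = (nr_llc * y) / SCHED_CAPACITY_SCALE;

	return nr > 0 ? nr : 0;
}
```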

> +	WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> +}
> +
>  /**
>   * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
>   * @env: The load balancing environment.
> @@ -9279,6 +9331,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	struct sched_group *sg = env->sd->groups;
>  	struct sg_lb_stats *local = &sds->local_stat;
>  	struct sg_lb_stats tmp_sgs;
> +	unsigned long sum_util = 0;
>  	int sg_status = 0;
>  
>  	do {
> @@ -9311,6 +9364,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		sds->total_load += sgs->group_load;
>  		sds->total_capacity += sgs->group_capacity;
>  
> +		sum_util += sgs->group_util;
>  		sg = sg->next;
>  	} while (sg != env->sd->groups);
>  
> @@ -9336,6 +9390,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
>  		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
>  	}
> +
> +	update_idle_cpu_scan(env, sum_util);
>  }
>  
>  #define NUMA_IMBALANCE_MIN 2
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 1cf435bbcd9c..69be099019f4 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
>   */
>  SCHED_FEAT(SIS_PROP, true)
> +SCHED_FEAT(SIS_UTIL, false)

As said above, flip those. Otherwise there's no point in merging this,
nobody will use it.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-05-06  8:38 ` Peter Zijlstra
@ 2022-05-07 14:44   ` Chen Yu
  0 siblings, 0 replies; 12+ messages in thread
From: Chen Yu @ 2022-05-07 14:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Mel Gorman, Yicong Yang, K Prateek Nayak,
	Tim Chen, Chen Yu, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Barry Song, Srikar Dronamraju, Len Brown,
	Ben Segall, Aubrey Li, Abel Wu, Zhang Rui, linux-kernel,
	Daniel Bristot de Oliveira

On Fri, May 06, 2022 at 10:38:39AM +0200, Peter Zijlstra wrote:
> 
> So in general I'm liking this.
>
Thanks Peter for the review. 
> But we appear to be in the position where there's two patches touching
> this code, the other by Abel Wu.
> 
> However, I'm thinking that this patch is the simpler and parts of Wu's
> patch can be done of top of this.
> 
> Specifically, Wu's patch reminds of me some of my earlier experiments
> that added an additional cpumask to the SIS path and there I found that
> the cache-miss from accessing an additional mask was hurting enough to
> negate a lot of the benefit it brought.
> 
I'll look at Abel Wu's proposal to see if we can co-work on that.
> On Fri, Apr 29, 2022 at 02:24:42AM +0800, Chen Yu wrote:
> 
> > It would be ideal to have a crystal ball to answer this question:
> > How many CPUs must a wakeup path walk down, before it can find an idle
> > CPU? Many potential metrics could be used to predict the number.
> > One candidate is the sum of util_avg in this LLC domain. The benefit
> > of choosing util_avg is that it is a metric of accumulated historic
> > activity, which seems to be smoother than instantaneous metrics
> > (such as rq->nr_running). Besides, choosing the sum of util_avg
> > would help predict the load of the LLC domain more precisely, because
> > SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle
> > time. As Peter suggested[2], the lower the util_avg is, the
> > more select_idle_cpu() should scan for idle CPU, and vice versa.
> > 
> > Introduce the quadratic function:
> > 
> > y = a - bx^2
> > 
> > x is the sum_util ratio [0, 100] of this LLC domain, and y is the percentage
> > of CPUs to be scanned in the LLC domain. The number of CPUs to search drops
> > as sum_util increases. When sum_util hits 85% or above, the scan stops.
> > Choosing 85% is because it is the threshold of an overloaded LLC sched group
> > (imbalance_pct = 117). Choosing quadratic function is because:
> > 
> > [1] Compared to the linear function, it scans more aggressively when the
> >     sum_util is low.
> > [2] Compared to the exponential function, it is easier to calculate.
> > [3] It seems that there is no accurate mapping between the sum of util_avg
> >     and the number of CPUs to be scanned. Use heuristic scan for now.
> > 
> > The steps to calculate scan_nr are as follows:
> > [1] scan_percent = 100 - (x/8.5)^2
> >     when utilization reaches 85%, scan_percent becomes 0.
> > [2] scan_nr = nr_llc * scan_percent / 100
> > [3] scan_nr = max(scan_nr, 0)
> > 
> > For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
> > sum_util%   0    5   15   25  35  45  55   65   75   85 ...
> > scan_nr   112  112  108  103  92  80  64   47   24    0 ...
> > 
> > Furthermore, to minimize the overhead of calculating the metrics in
> > select_idle_cpu(), borrow the statistics from periodic load balance.
> > As mentioned by Abel, on a platform with 112 CPUs per LLC, the
> > sum_util calculated by periodic load balance after 112ms would decay
> > to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in
> > reflecting the latest utilization. But it is a trade-off.
> > Checking the util_avg in newidle load balance would be more frequent,
> > but it brings overhead - multiple CPUs write/read the per-LLC shared
> > variable and introduces cache false sharing. And Tim also mentioned
> > that, it is allowed to be non-optimal in terms of scheduling for the
> > short term variations, but if there is a long term trend in the load
> > behavior, the scheduler can adjust for that.
> > 
> > SIS_UTIL is disabled by default. When it is enabled, the select_idle_cpu()
> > will use the nr_scan calculated by SIS_UTIL instead of the one from
> > SIS_PROP. Later SIS_UTIL and SIS_PROP could be made mutually exclusive.
> 
> This; I'm thinking that it should default-enable, otherwise what's the
> point. And ideally we're remove SIS_PROP after a few releases if this
> works out.
> 
Ok, will change it to enabled by default.
> A few more notes below.. mostly minor nits.
> 
> > ---
> >  include/linux/sched/topology.h |  1 +
> >  kernel/sched/fair.c            | 56 ++++++++++++++++++++++++++++++++++
> >  kernel/sched/features.h        |  1 +
> >  3 files changed, 58 insertions(+)
> > 
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 56cffe42abbc..816df6cc444e 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -81,6 +81,7 @@ struct sched_domain_shared {
> >  	atomic_t	ref;
> >  	atomic_t	nr_busy_cpus;
> >  	int		has_idle_cores;
> > +	int		nr_idle_scan;
> >  };
> >  
> >  struct sched_domain {
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 23c7d0f617ee..50c9d5b2b338 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6327,6 +6327,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >  {
> >  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> >  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > +	struct sched_domain_shared *sd_share;
> >  	struct rq *this_rq = this_rq();
> >  	int this = smp_processor_id();
> >  	struct sched_domain *this_sd;
> > @@ -6366,6 +6367,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >  		time = cpu_clock(this);
> >  	}
> >  
> > +	if (sched_feat(SIS_UTIL)) {
> > +		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
> > +		if (sd_share) {
> > +			/* because !--nr is the condition to stop scan */
> > +			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> > +			/* overloaded LLC is unlikely to have idle cpu/core */
> > +			if (nr == 1)
> > +				return -1;
> > +		}
> > +	}
> > +
> >  	for_each_cpu_wrap(cpu, cpus, target + 1) {
> >  		if (has_idle_core) {
> >  			i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > @@ -9267,6 +9279,46 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >  	return idlest;
> >  }
> >  
> > +static inline void update_idle_cpu_scan(struct lb_env *env,
> > +					unsigned long sum_util)
> > +{
> > +	struct sched_domain_shared *sd_share;
> > +	int nr_scan, nr_llc, llc_util_pct;
> > +
> > +	if (!sched_feat(SIS_UTIL))
> > +		return;
> > +	/*
> > +	 * Update the number of CPUs to scan in LLC domain, which could
> > +	 * be used as a hint in select_idle_cpu(). The update of this hint
> > +	 * occurs during periodic load balancing, rather than frequent
> > +	 * newidle balance.
> > +	 */
> > +	nr_llc = per_cpu(sd_llc_size, env->dst_cpu);
> > +	if (env->idle == CPU_NEWLY_IDLE ||
> > +	    env->sd->span_weight != nr_llc)
> > +		return;
> > +
> > +	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
> > +	if (!sd_share)
> > +		return;
> > +
> > +	/*
> > +	 * The number of CPUs to search drops as sum_util increases, when
> > +	 * sum_util hits 85% or above, the scan stops.
> > +	 * The reason to choose 85% as the threshold is because this is the
> > +	 * imbalance_pct when a LLC sched group is overloaded.
> > +	 * let y = 100 - (x/8.5)^2 = 100 - x^2/72
> > +	 * y is the percentage of CPUs to be scanned in the LLC
> > +	 * domain, x is the ratio of sum_util compared to the
> > +	 * CPU capacity, which ranges in [0, 100], thus
> > +	 * nr_scan = nr_llc * y / 100
> > +	 */
> > +	llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
> > +	nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
> > +	nr_scan = max(nr_scan, 0);
> 
> The comment and code don't nicely match up, which makes it much harder
> to read than is needed. Given the comment, I would expect the code to
> look a little like:
> 
> 	x = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
> 	y = 100 - (x*x)/72;
> 	nr_scan = max(0, (nr_llc * y) / 100);
> 
> Now a pet peeve of mine, computers *SUCK* at '100', that's just not a
> good number for them, and AFAICT there's no reason what*so*ever to use
> 100 here, we're free to pick any random number. So lets be silly and
> pick SCHED_CAPACITY_SCALE :-)
> 
> 	x = sum_util / nr_llc;
> 	y = SCHED_CAPACITY_SCALE - (x*x)/737;
> 	nr_scan = max(0, (nr_llc * y) / SCHED_CAPACITY_SCALE);
> 
> How's that?
> 
Ok, will change it to this. I calculated the parameter from scratch
using 72.25 instead of 72; the parameter becomes 740. I'll do some tests
based on this formula.
> (In general, forget about percentages; that's just a daft human
> convention because humans like 100 due to having 10 digits or something,
> fractions work on any base.)
> 
Thanks for the guidance.
> > +	WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> > +}
> > +
> >  /**
> >   * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
> >   * @env: The load balancing environment.
> > @@ -9279,6 +9331,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >  	struct sched_group *sg = env->sd->groups;
> >  	struct sg_lb_stats *local = &sds->local_stat;
> >  	struct sg_lb_stats tmp_sgs;
> > +	unsigned long sum_util = 0;
> >  	int sg_status = 0;
> >  
> >  	do {
> > @@ -9311,6 +9364,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >  		sds->total_load += sgs->group_load;
> >  		sds->total_capacity += sgs->group_capacity;
> >  
> > +		sum_util += sgs->group_util;
> >  		sg = sg->next;
> >  	} while (sg != env->sd->groups);
> >  
> > @@ -9336,6 +9390,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >  		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
> >  		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
> >  	}
> > +
> > +	update_idle_cpu_scan(env, sum_util);
> >  }
> >  
> >  #define NUMA_IMBALANCE_MIN 2
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index 1cf435bbcd9c..69be099019f4 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> >   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> >   */
> >  SCHED_FEAT(SIS_PROP, true)
> > +SCHED_FEAT(SIS_UTIL, false)
> 
> As said above, flip those. Otherwise there's no point in merging this,
> nobody will use it.
Ok, will do in next version.


Thanks,
Chenyu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-04-28 18:24 [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg Chen Yu
  2022-05-06  8:38 ` Peter Zijlstra
@ 2022-05-10 12:41 ` Yicong Yang
  2022-05-12  8:14   ` Chen Yu
  2022-05-13  6:37 ` K Prateek Nayak
  2022-05-17 12:50 ` Mel Gorman
  3 siblings, 1 reply; 12+ messages in thread
From: Yicong Yang @ 2022-05-10 12:41 UTC (permalink / raw)
  To: Chen Yu, Peter Zijlstra, Vincent Guittot, Mel Gorman,
	Yicong Yang, K Prateek Nayak, Tim Chen
  Cc: Chen Yu, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Barry Song, Srikar Dronamraju, Len Brown,
	Ben Segall, Aubrey Li, Abel Wu, Zhang Rui, linux-kernel,
	Daniel Bristot de Oliveira

On 2022/4/29 2:24, Chen Yu wrote:
> [Problem Statement]
> select_idle_cpu() might spend too much time searching for an idle CPU,
> when the system is overloaded.
> 
> The following histogram is the time spent in select_idle_cpu(),
> when running 224 instances of netperf on a system with 112 CPUs
> per LLC domain:
> 
> @usecs:
> [0]                  533 |                                                    |
> [1]                 5495 |                                                    |
> [2, 4)             12008 |                                                    |
> [4, 8)            239252 |                                                    |
> [8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
> [16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
> [32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
> [128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
> [256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
> [512, 1K)        2600472 |@@@@@@@@@                                           |
> [1K, 2K)          927912 |@@@                                                 |
> [2K, 4K)          218720 |                                                    |
> [4K, 8K)           98161 |                                                    |
> [8K, 16K)          37722 |                                                    |
> [16K, 32K)          6715 |                                                    |
> [32K, 64K)           477 |                                                    |
> [64K, 128K)            7 |                                                    |
> 
> netperf latency usecs:
> =======
> case            	load    	    Lat_99th	    std%
> TCP_RR          	thread-224	      257.39	(  0.21)
> 
> The time spent in select_idle_cpu() is visible to netperf and might have a negative
> impact.
> 
> [Symptom analysis]
> The patch [1] from Mel Gorman has been applied to track the efficiency
> of select_idle_sibling. Copy the indicators here:
> 
> SIS Search Efficiency(se_eff%):
>         A ratio expressed as a percentage of runqueues scanned versus
>         idle CPUs found. A 100% efficiency indicates that the target,
>         prev or recent CPU of a task was idle at wakeup. The lower the
>         efficiency, the more runqueues were scanned before an idle CPU
>         was found.
> 
> SIS Domain Search Efficiency(dom_eff%):
>         Similar, except only for the slower SIS path.
> 
> SIS Fast Success Rate(fast_rate%):
>         Percentage of SIS that used target, prev or recent CPUs.
> 
> SIS Success rate(success_rate%):
>         Percentage of scans that found an idle CPU.
> 
> The test is based on Aubrey's schedtests tool, netperf, hackbench,
> schbench and tbench were launched with 25% 50% 75% 100% 125% 150%
> 175% 200% of CPU number respectively. Each test lasts for 100 seconds
> and repeats 3 times. The system reboots into a fresh environment for
> each test.
> 
> Test on vanilla kernel:
> schedstat_parse.py -f netperf_vanilla.log
> case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
> TCP_RR	   28 threads	     99.978	      18.535	      99.995	     100.000
> TCP_RR	   56 threads	     99.397	       5.671	      99.964	     100.000
> TCP_RR	   84 threads	     21.721	       6.818	      73.632	     100.000
> TCP_RR	  112 threads	     12.500	       5.533	      59.000	     100.000
> TCP_RR	  140 threads	      8.524	       4.535	      49.020	     100.000
> TCP_RR	  168 threads	      6.438	       3.945	      40.309	      99.999
> TCP_RR	  196 threads	      5.397	       3.718	      32.320	      99.982
> TCP_RR	  224 threads	      4.874	       3.661	      25.775	      99.767
> UDP_RR	   28 threads	     99.988	      17.704	      99.997	     100.000
> UDP_RR	   56 threads	     99.528	       5.977	      99.970	     100.000
> UDP_RR	   84 threads	     24.219	       6.992	      76.479	     100.000
> UDP_RR	  112 threads	     13.907	       5.706	      62.538	     100.000
> UDP_RR	  140 threads	      9.408	       4.699	      52.519	     100.000
> UDP_RR	  168 threads	      7.095	       4.077	      44.352	     100.000
> UDP_RR	  196 threads	      5.757	       3.775	      35.764	      99.991
> UDP_RR	  224 threads	      5.124	       3.704	      28.748	      99.860
> 
> schedstat_parse.py -f schbench_vanilla.log
> (each group has 28 tasks)
> case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
> normal	   1   mthread	     99.152	       6.400	      99.941	     100.000
> normal	   2   mthreads	     97.844	       4.003	      99.908	     100.000
> normal	   3   mthreads	     96.395	       2.118	      99.917	      99.998
> normal	   4   mthreads	     55.288	       1.451	      98.615	      99.804
> normal	   5   mthreads	      7.004	       1.870	      45.597	      61.036
> normal	   6   mthreads	      3.354	       1.346	      20.777	      34.230
> normal	   7   mthreads	      2.183	       1.028	      11.257	      21.055
> normal	   8   mthreads	      1.653	       0.825	       7.849	      15.549
> 
> schedstat_parse.py -f hackbench_vanilla.log
> (each group has 28 tasks)
> case			load	        se_eff%	    dom_eff%	  fast_rate%	success_rate%
> process-pipe	     1 group	         99.991	       7.692	      99.999	     100.000
> process-pipe	    2 groups	         99.934	       4.615	      99.997	     100.000
> process-pipe	    3 groups	         99.597	       3.198	      99.987	     100.000
> process-pipe	    4 groups	         98.378	       2.464	      99.958	     100.000
> process-pipe	    5 groups	         27.474	       3.653	      89.811	      99.800
> process-pipe	    6 groups	         20.201	       4.098	      82.763	      99.570
> process-pipe	    7 groups	         16.423	       4.156	      77.398	      99.316
> process-pipe	    8 groups	         13.165	       3.920	      72.232	      98.828
> process-sockets	     1 group	         99.977	       5.882	      99.999	     100.000
> process-sockets	    2 groups	         99.927	       5.505	      99.996	     100.000
> process-sockets	    3 groups	         99.397	       3.250	      99.980	     100.000
> process-sockets	    4 groups	         79.680	       4.258	      98.864	      99.998
> process-sockets	    5 groups	          7.673	       2.503	      63.659	      92.115
> process-sockets	    6 groups	          4.642	       1.584	      58.946	      88.048
> process-sockets	    7 groups	          3.493	       1.379	      49.816	      81.164
> process-sockets	    8 groups	          3.015	       1.407	      40.845	      75.500
> threads-pipe	     1 group	         99.997	       0.000	     100.000	     100.000
> threads-pipe	    2 groups	         99.894	       2.932	      99.997	     100.000
> threads-pipe	    3 groups	         99.611	       4.117	      99.983	     100.000
> threads-pipe	    4 groups	         97.703	       2.624	      99.937	     100.000
> threads-pipe	    5 groups	         22.919	       3.623	      87.150	      99.764
> threads-pipe	    6 groups	         18.016	       4.038	      80.491	      99.557
> threads-pipe	    7 groups	         14.663	       3.991	      75.239	      99.247
> threads-pipe	    8 groups	         12.242	       3.808	      70.651	      98.644
> threads-sockets	     1 group	         99.990	       6.667	      99.999	     100.000
> threads-sockets	    2 groups	         99.940	       5.114	      99.997	     100.000
> threads-sockets	    3 groups	         99.469	       4.115	      99.977	     100.000
> threads-sockets	    4 groups	         87.528	       4.038	      99.400	     100.000
> threads-sockets	    5 groups	          6.942	       2.398	      59.244	      88.337
> threads-sockets	    6 groups	          4.359	       1.954	      49.448	      87.860
> threads-sockets	    7 groups	          2.845	       1.345	      41.198	      77.102
> threads-sockets	    8 groups	          2.871	       1.404	      38.512	      74.312
> 
> schedstat_parse.py -f tbench_vanilla.log
> case			load	      se_eff%	    dom_eff%	  fast_rate%	success_rate%
> loopback	  28 threads	       99.976	      18.369	      99.995	     100.000
> loopback	  56 threads	       99.222	       7.799	      99.934	     100.000
> loopback	  84 threads	       19.723	       6.819	      70.215	     100.000
> loopback	 112 threads	       11.283	       5.371	      55.371	      99.999
> loopback	 140 threads	        0.000	       0.000	       0.000	       0.000
> loopback	 168 threads	        0.000	       0.000	       0.000	       0.000
> loopback	 196 threads	        0.000	       0.000	       0.000	       0.000
> loopback	 224 threads	        0.000	       0.000	       0.000	       0.000
> 
> According to the test above, if the system becomes busy, the
> SIS Search Efficiency(se_eff%) drops significantly. Although some
> benchmarks would finally find an idle CPU(success_rate% = 100%), it is
> doubtful whether it is worth it to search the whole LLC domain.
> 
> [Proposal]
> It would be ideal to have a crystal ball to answer this question:
> How many CPUs must a wakeup path walk down, before it can find an idle
> CPU? Many potential metrics could be used to predict the number.
> One candidate is the sum of util_avg in this LLC domain. The benefit
> of choosing util_avg is that it is a metric of accumulated historic
> activity, which seems to be smoother than instantaneous metrics
> (such as rq->nr_running). Besides, choosing the sum of util_avg
> would help predict the load of the LLC domain more precisely, because
> SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle
> time. As Peter suggested[2], the lower the util_avg is, the
> more select_idle_cpu() should scan for idle CPU, and vice versa.
> 
> Introduce the quadratic function:
> 
> y = a - bx^2
> 
> x is the sum_util ratio [0, 100] of this LLC domain, and y is the percentage
> of CPUs to be scanned in the LLC domain. The number of CPUs to search drops
> as sum_util increases. When sum_util hits 85% or above, the scan stops.
> Choosing 85% is because it is the threshold of an overloaded LLC sched group
> (imbalance_pct = 117). Choosing quadratic function is because:
> 
> [1] Compared to the linear function, it scans more aggressively when the
>     sum_util is low.
> [2] Compared to the exponential function, it is easier to calculate.
> [3] It seems that there is no accurate mapping between the sum of util_avg
>     and the number of CPUs to be scanned. Use heuristic scan for now.
> 
> The steps to calculate scan_nr are as follows:
> [1] scan_percent = 100 - (x/8.5)^2
>     when utilization reaches 85%, scan_percent becomes 0.
> [2] scan_nr = nr_llc * scan_percent / 100
> [3] scan_nr = max(scan_nr, 0)
> 
> For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
> sum_util%   0    5   15   25  35  45  55   65   75   85 ...
> scan_nr   112  112  108  103  92  80  64   47   24    0 ...
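For what it's worth, steps [1]-[3] above reproduce this table exactly with integer math; a standalone sketch (userspace illustration, not kernel code):

```c
#include <assert.h>

/*
 * Standalone sketch of steps [1]-[3] above, with x = sum_util% in
 * [0, 100] and nr_llc CPUs in the LLC. Note (x/8.5)^2 == x^2/72.25,
 * approximated here as x^2/72.
 */
static int scan_nr(int x, int nr_llc)
{
	int scan_percent = 100 - (x * x) / 72;
	int nr = (nr_llc * scan_percent) / 100;

	return nr > 0 ? nr : 0;
}
```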
> 
> Furthermore, to minimize the overhead of calculating the metrics in
> select_idle_cpu(), borrow the statistics from periodic load balance.
> As mentioned by Abel, on a platform with 112 CPUs per LLC, the
> sum_util calculated by periodic load balance after 112ms would decay
> to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in
> reflecting the latest utilization. But it is a trade-off.
> Checking the util_avg in newidle load balance would be more frequent,
> but it brings overhead - multiple CPUs write/read the per-LLC shared
> variable and introduces cache false sharing. And Tim also mentioned
> that, it is allowed to be non-optimal in terms of scheduling for the
> short term variations, but if there is a long term trend in the load
> behavior, the scheduler can adjust for that.
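The 8.75% figure assumes the usual PELT half-life of 32 ms: 112 ms is three full half-lives plus roughly another 0.7x for the remaining ~16 ms. A quick sanity check of that arithmetic:

```c
#include <assert.h>

/*
 * Sanity check of the decay arithmetic above, assuming util_avg
 * halves every 32 ms (the PELT half-life): after 112 ms idle,
 * roughly 0.5 * 0.5 * 0.5 * 0.7 of the utilization remains.
 * Returns the remaining fraction in hundredths of a percent.
 */
static int decayed_after_112ms(void)
{
	double f = 1.0;
	int i;

	for (i = 0; i < 3; i++)
		f *= 0.5;	/* three full 32 ms half-lives */
	f *= 0.7;		/* ~sqrt(1/2) for the remaining 16 ms */
	return (int)(f * 10000 + 0.5);
}
```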
> 
> SIS_UTIL is disabled by default. When it is enabled, the select_idle_cpu()
> will use the nr_scan calculated by SIS_UTIL instead of the one from
> SIS_PROP. Later SIS_UTIL and SIS_PROP could be made mutually exclusive.
> 
> [Test result]
> 
> The following is the benchmark result comparison between
> baseline:vanilla and compare:patched kernel. Positive compare%
> indicates better performance.
> 
> netperf.throughput
> each thread: netperf -4 -H 127.0.0.1 -t TCP/UDP_RR -c -C -l 100
> =======
> case            	load    	baseline(std%)	compare%( std%)
> TCP_RR          	28 threads	 1.00 (  0.40)	 +1.14 (  0.37)
> TCP_RR          	56 threads	 1.00 (  0.49)	 +0.62 (  0.31)
> TCP_RR          	84 threads	 1.00 (  0.50)	 +0.26 (  0.55)
> TCP_RR          	112 threads	 1.00 (  0.27)	 +0.29 (  0.28)
> TCP_RR          	140 threads	 1.00 (  0.22)	 +0.14 (  0.23)
> TCP_RR          	168 threads	 1.00 (  0.21)	 +0.40 (  0.19)
> TCP_RR          	196 threads	 1.00 (  0.18)	+183.40 ( 16.43)
> TCP_RR          	224 threads	 1.00 (  0.16)	+188.44 (  9.29)
> UDP_RR          	28 threads	 1.00 (  0.47)	 +1.45 (  0.47)
> UDP_RR          	56 threads	 1.00 (  0.28)	 -0.22 (  0.30)
> UDP_RR          	84 threads	 1.00 (  0.38)	 +1.72 ( 27.10)
> UDP_RR          	112 threads	 1.00 (  0.16)	 +0.01 (  0.18)
> UDP_RR          	140 threads	 1.00 ( 14.10)	 +0.32 ( 11.15)
> UDP_RR          	168 threads	 1.00 ( 12.75)	 +0.91 ( 11.62)
> UDP_RR          	196 threads	 1.00 ( 14.41)	+191.97 ( 19.34)
> UDP_RR          	224 threads	 1.00 ( 15.34)	+194.88 ( 17.06)
> 
> Take the 224 threads as an example, the SIS search metrics changes are
> illustrated below:
> 
>     vanilla                    patched
>    4544492          +237.5%   15338634        sched_debug.cpu.sis_domain_search.avg
>      38539        +39686.8%   15333634        sched_debug.cpu.sis_failed.avg
>   128300000          -87.9%   15551326        sched_debug.cpu.sis_scanned.avg
>    5842896          +162.7%   15347978        sched_debug.cpu.sis_search.avg
> 
> There are 87.9% fewer CPU scans after the patch, which indicates lower overhead.
> Besides, with this patch applied, there is 13% less rq lock contention
> in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
> .try_to_wake_up.default_wake_function.woken_wake_function.
> This could help explain the performance improvement: the patch allows the
> waking task to remain on its previous CPU rather than grabbing another CPU's
> lock.
> 
> Other benchmarks:
> 
> hackbench.throughput
> =========
> case            	load    	baseline(std%)	compare%( std%)
> process-pipe    	1 group 	 1.00 (  0.09)	 -0.54 (  0.82)
> process-pipe    	2 groups 	 1.00 (  0.47)	 +0.89 (  0.61)
> process-pipe    	4 groups 	 1.00 (  0.83)	 +0.90 (  0.15)
> process-pipe    	8 groups 	 1.00 (  0.09)	 +0.31 (  0.07)
> process-sockets 	1 group 	 1.00 (  0.13)	 -0.58 (  0.49)
> process-sockets 	2 groups 	 1.00 (  0.41)	 -0.58 (  0.52)
> process-sockets 	4 groups 	 1.00 (  0.61)	 -0.37 (  0.50)
> process-sockets 	8 groups 	 1.00 (  0.22)	 +1.15 (  0.10)
> threads-pipe    	1 group 	 1.00 (  0.35)	 -0.28 (  0.78)
> threads-pipe    	2 groups 	 1.00 (  0.65)	 +0.03 (  0.96)
> threads-pipe    	4 groups 	 1.00 (  0.43)	 +0.81 (  0.38)
> threads-pipe    	8 groups 	 1.00 (  0.11)	 -1.56 (  0.07)
> threads-sockets 	1 group 	 1.00 (  0.30)	 -0.39 (  0.41)
> threads-sockets 	2 groups 	 1.00 (  0.21)	 -0.23 (  0.27)
> threads-sockets 	4 groups 	 1.00 (  0.23)	 +0.36 (  0.19)
> threads-sockets 	8 groups 	 1.00 (  0.13)	 +1.57 (  0.06)
> 
> tbench.throughput
> ======
> case            	load    	baseline(std%)	compare%( std%)
> loopback        	28 threads	 1.00 (  0.15)	 +1.05 (  0.08)
> loopback        	56 threads	 1.00 (  0.09)	 +0.36 (  0.04)
> loopback        	84 threads	 1.00 (  0.12)	 +0.26 (  0.06)
> loopback        	112 threads	 1.00 (  0.12)	 +0.04 (  0.09)
> loopback        	140 threads	 1.00 (  0.04)	 +2.98 (  0.18)
> loopback        	168 threads	 1.00 (  0.10)	 +2.88 (  0.30)
> loopback        	196 threads	 1.00 (  0.06)	 +2.63 (  0.03)
> loopback        	224 threads	 1.00 (  0.08)	 +2.60 (  0.06)
> 
> schbench.latency_90%_us
> ========
> case            	load    	baseline	compare%
> normal          	1 mthread	 1.00 	        -1.7%
> normal          	2 mthreads	 1.00           +1.6%
> normal          	4 mthreads	 1.00           +1.4%
> normal          	8 mthreads	 1.00    	+21.0%
> 
> Limitations:
> [1]
> This patch is based on the util_avg, which is very sensitive to the CPU
> frequency invariance. The util_avg would decay quite fast when the
> CPU is idle, if the max frequency has been limited by the user.
> Patch [3] should be applied if turbo is disabled manually on Intel
> platforms.
> 
> [2]
> There may be unbalanced tasks among CPUs due to CPU affinity. For example,
> suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to
> CPU0~CPU6, while CPU7 is idle:
> 
>           CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
> util_avg  1024    1024    1024    1024    1024    1024    1024    0
> 
> Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%,
> select_idle_cpu() will not scan, thus CPU7 is undetected.
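The numbers in this example can be checked against the quoted scan formula (standalone sketch, not kernel code): seven fully-busy CPUs out of eight put the LLC at 87.5% utilization, past the 85% cutoff, so the scan budget collapses to zero even though CPU7 is idle:

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024

/* Scan budget from the quoted quadratic formula, x = util% in [0, 100]. */
static int scan_nr(int x, int nr_llc)
{
	int scan_percent = 100 - (x * x) / 72;
	int nr = (nr_llc * scan_percent) / 100;

	return nr > 0 ? nr : 0;
}

/* LLC utilization (percent) for the 8-CPU example above. */
static int llc_util_pct_example(void)
{
	int util_avg[8] = { 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0 };
	long sum = 0;
	int i;

	for (i = 0; i < 8; i++)
		sum += util_avg[i];
	return (int)(sum * 100 / (8 * SCHED_CAPACITY_SCALE));	/* 87.5% truncated */
}
```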
> 
> A possible workaround is to increase nr_scan if idle CPUs are detected
> during periodic load balance. The problem could also be mitigated by idle
> load balance, during which CPU7 might pull some tasks onto itself.
> 
> [3]
> Prateek mentioned that we should scan aggressively in an LLC domain
> with 16 CPUs, because the cost of searching for an idle one among 16 CPUs is
> negligible. The current patch aims to propose a generic solution and only
> considers the util_avg. A follow-up change could enhance the scan policy
> to adjust the scan_percent according to the CPU number in LLC.
> 
> v2->v3:
>  - Use 85% as the threshold again, because the CPU frequency invariance issue
>    has been fixed and the patch is queued for 5.19.
> 
>  - Stop the scan if 85% is reached, rather than scanning for at least 4 CPUs.
>    According to the feedback from Yicong, it might be better to stop scanning
>    entirely when the LLC is overloaded.
> 
>  - Replace the linear scan with a quadratic function scan, to let SIS scan
>    aggressively when the LLC is not busy. Prateek mentioned there was a slight
>    regression in ycsb-mongodb with v2, which might be due to fewer CPUs being
>    scanned when the utilization is around 20%.
> 
>  - Add back the logic to stop the CPU scan even if has_idle_core is true.
>    It might be a waste of time to search for an idle core if the LLC is
>    overloaded. Besides, according to the tbench result from Prateek, stopping
>    the idle core scan brings extra performance improvement.
> 
>  - Provide the SIS search statistics in the commit log, based on Mel Gorman's
>    patch, as suggested by Abel.
> 
>  - Introduce the SIS_UTIL sched feature rather than changing the logic of SIS_PROP
>    directly, which is easier to review.
> 
> v1->v2:
>  - As suggested by Peter, introduce an idle CPU scan strategy that is based on
>    the util_avg metric. When util_avg is very low it scans more, while when
>    util_avg hits the threshold we naturally stop scanning entirely. The threshold
>    has been decreased from 85% to 50%, because this is the threshold when the
>    CPU is nearly 100% busy but with turbo disabled. Scan for at least 4 CPUs even
>    when the LLC is overloaded, to keep it consistent with the current logic of
>    select_idle_cpu().
> 
> v1:
>  - Stop scanning the idle CPU in select_idle_cpu(), if the sum of util_avg in
>    the LLC domain has reached 85%.
> 
> [Resend to include the missing mailing list, sorry for any inconvenience.]
> 
> Link: https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net #1
> Link: https://lore.kernel.org/lkml/20220207135253.GF23216@worktop.programming.kicks-ass.net #2
> Link: https://lore.kernel.org/lkml/20220407234258.569681-1-yu.c.chen@intel.com #3
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 56 ++++++++++++++++++++++++++++++++++
>  kernel/sched/features.h        |  1 +
>  3 files changed, 58 insertions(+)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 56cffe42abbc..816df6cc444e 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -81,6 +81,7 @@ struct sched_domain_shared {
>  	atomic_t	ref;
>  	atomic_t	nr_busy_cpus;
>  	int		has_idle_cores;
> +	int		nr_idle_scan;
>  };
>  
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 23c7d0f617ee..50c9d5b2b338 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6327,6 +6327,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  {
>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
> +	struct sched_domain_shared *sd_share;
>  	struct rq *this_rq = this_rq();
>  	int this = smp_processor_id();
>  	struct sched_domain *this_sd;
> @@ -6366,6 +6367,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  		time = cpu_clock(this);
>  	}
>  
> +	if (sched_feat(SIS_UTIL)) {
> +		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
> +		if (sd_share) {
> +			/* because !--nr is the condition to stop scan */
> +			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> +			/* overloaded LLC is unlikely to have idle cpu/core */
> +			if (nr == 1)
> +				return -1;
> +		}
> +	}
> +
>  	for_each_cpu_wrap(cpu, cpus, target + 1) {
>  		if (has_idle_core) {
>  			i = select_idle_core(p, cpu, cpus, &idle_cpu);
> @@ -9267,6 +9279,46 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  	return idlest;
>  }
>  
> +static inline void update_idle_cpu_scan(struct lb_env *env,
> +					unsigned long sum_util)
> +{
> +	struct sched_domain_shared *sd_share;
> +	int nr_scan, nr_llc, llc_util_pct;
> +
> +	if (!sched_feat(SIS_UTIL))
> +		return;
> +	/*
> +	 * Update the number of CPUs to scan in LLC domain, which could
> +	 * be used as a hint in select_idle_cpu(). The update of this hint
> +	 * occurs during periodic load balancing, rather than frequent
> +	 * newidle balance.
> +	 */
> +	nr_llc = per_cpu(sd_llc_size, env->dst_cpu);
> +	if (env->idle == CPU_NEWLY_IDLE ||
> +	    env->sd->span_weight != nr_llc)
> +		return;
> +
> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
> +	if (!sd_share)
> +		return;
> +
> +	/*
> +	 * The number of CPUs to search drops as sum_util increases, when
> +	 * sum_util hits 85% or above, the scan stops.
> +	 * The reason to choose 85% as the threshold is because this is the
> +	 * imbalance_pct when a LLC sched group is overloaded.
> +	 * let y = 100 - (x/8.5)^2 = 100 - x^2/72
> +	 * y is the percentage of CPUs to be scanned in the LLC
> +	 * domain, x is the ratio of sum_util compared to the
> +	 * CPU capacity, which ranges in [0, 100], thus
> +	 * nr_scan = nr_llc * y / 100
> +	 */
> +	llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
> +	nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
> +	nr_scan = max(nr_scan, 0);
> +	WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> +}
> +
>  /**
>   * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
>   * @env: The load balancing environment.
> @@ -9279,6 +9331,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	struct sched_group *sg = env->sd->groups;
>  	struct sg_lb_stats *local = &sds->local_stat;
>  	struct sg_lb_stats tmp_sgs;
> +	unsigned long sum_util = 0;
>  	int sg_status = 0;
>  
>  	do {
> @@ -9311,6 +9364,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		sds->total_load += sgs->group_load;
>  		sds->total_capacity += sgs->group_capacity;
>  
> +		sum_util += sgs->group_util;
>  		sg = sg->next;
>  	} while (sg != env->sd->groups);
>  
> @@ -9336,6 +9390,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
>  		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
>  	}
> +
> +	update_idle_cpu_scan(env, sum_util);
>  }
>  
>  #define NUMA_IMBALANCE_MIN 2
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 1cf435bbcd9c..69be099019f4 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
>   */
>  SCHED_FEAT(SIS_PROP, true)
> +SCHED_FEAT(SIS_UTIL, false)
>  

I see you mentioned they're mutually exclusive in the commit, worth a comment here?

One minor question: nr is updated during load balance, so there may be a delay because of
the load balancing interval. Furthermore, the LLC domain may not be balanced every time
if the lowest domain is not the LLC, as in CLS->LLC, so a bit more delay may be included.

The test results are fine and as expected, showing an improvement in netperf under heavy
load compared to your v2 version.

Thanks,
Yicong

TCP_RR node 0-1
threads
16	57559.56667	57930.03333 (+0.64%)
32	56373		57754.53333 (+2.45%)
64	18831.4		46234.76667 (+145.52%)
128	15658.9		19620.26667 (+25.30%)
256	7959.896667	8869.013333 (+11.42%)

TCP_RR node 0
threads
16	58389.43333	59026.03333 (+1.09%)
32	23779.6		51563.33333 (+116.84%)
64	20514.56667	23485.63333 (+14.48%)
128	8202.49		9205.483333 (+12.23%)
256	3843.163333	4304.8      (+12.01%)

tbench4 node 0-1
                           5.18-rc1                patched
Hmean     1        299.02 (   0.00%)      307.73 *   2.91%*
Hmean     2        597.88 (   0.00%)      619.10 *   3.55%*
Hmean     4       1207.11 (   0.00%)     1239.57 *   2.69%*
Hmean     8       2406.67 (   0.00%)     2463.63 *   2.37%*
Hmean     16      4755.52 (   0.00%)     4979.46 *   4.71%*
Hmean     32      9449.01 (   0.00%)     9709.59 *   2.76%*
Hmean     64     10538.89 (   0.00%)    10727.86 *   1.79%*
Hmean     128    13333.84 (   0.00%)    14580.63 *   9.35%*
Hmean     256    11735.24 (   0.00%)    11737.16 (   0.02%)

tbench4 node 0
                           5.18-rc1                patched
Hmean     1        302.26 (   0.00%)      313.43 *   3.70%*
Hmean     2        603.87 (   0.00%)      618.56 *   2.43%*
Hmean     4       1213.91 (   0.00%)     1249.63 *   2.94%*
Hmean     8       2469.72 (   0.00%)     2527.48 *   2.34%*
Hmean     16      4980.70 (   0.00%)     5099.62 *   2.39%*
Hmean     32      9001.88 (   0.00%)     9730.27 *   8.09%*
Hmean     64      7032.07 (   0.00%)     7691.56 *   9.38%*
Hmean     128     6037.76 (   0.00%)     6712.86 *  11.18%*
Hmean     256     8513.83 (   0.00%)     9117.79 *   7.09%*

>  /*
>   * Issue a WARN when we do multiple update_rq_clock() calls
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-05-10 12:41 ` Yicong Yang
@ 2022-05-12  8:14   ` Chen Yu
  2022-05-12 13:10     ` Yicong Yang
  0 siblings, 1 reply; 12+ messages in thread
From: Chen Yu @ 2022-05-12  8:14 UTC (permalink / raw)
  To: Yicong Yang
  Cc: Peter Zijlstra, Vincent Guittot, Mel Gorman, Yicong Yang,
	K Prateek Nayak, Tim Chen, Chen Yu, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Barry Song, Srikar Dronamraju,
	Len Brown, Ben Segall, Aubrey Li, Abel Wu, Zhang Rui,
	linux-kernel, Daniel Bristot de Oliveira

On Tue, May 10, 2022 at 08:41:57PM +0800, Yicong Yang wrote:
> On 2022/4/29 2:24, Chen Yu wrote:
> > @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> >   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> >   */
> >  SCHED_FEAT(SIS_PROP, true)
> > +SCHED_FEAT(SIS_UTIL, false)
> >  
> 
> I see you mentioned they're mutually exclusive in the commit, worth a comment here?
>
Yes, previously I thought it could be made mutually exclusive, and Peter has
suggested that we make SIS_UTIL enabled by default, so later we could
remove SIS_PROP if SIS_UTIL behaves stably. So I assume there is no need to
add a comment in the next version.
> One minor question: nr is updated during load balance, so there may be a delay because of
> the load balancing interval.
Yes, this is a good question. The default interval between two load balances is sd_weight ms,
which is 112 ms in my case. This interval is a trade-off to reduce cache contention. Besides,
every first idle CPU, or the balanced CPU in one sched group within the LLC domain, has the
chance to launch a periodic load balance. For example, although CPU0's and CPU1's periodic
load balances are both triggered every 112 ms, CPU1 could help launch the load balance when
CPU0 is not doing load balance work. Considering there are many CPUs in an LLC domain, the
effective interval to launch the periodic load balance becomes smaller.
> Furthermore, the LLC domain may not be balanced every time
> if the lowest domain is not the LLC, as in CLS->LLC, so a bit more delay may be included.
> 
I thought every domain has its chance to launch a load balance; the difference is that
different domains have different intervals. No?
> The test results are fine and as expected, showing an improvement in netperf under heavy
> load compared to your v2 version.
>
Thanks for the test. Would you mind if I add your Tested-by tag?

thanks,
Chenyu 
> Thanks,
> Yicong
> 
> TCP_RR node 0-1
> threads
> 16	57559.56667	57930.03333 (+0.64%)
> 32	56373		57754.53333 (+2.45%)
> 64	18831.4		46234.76667 (+145.52%)
> 128	15658.9		19620.26667 (+25.30%)
> 256	7959.896667	8869.013333 (+11.42%)
> 
> TCP_RR node 0
> threads
> 16	58389.43333	59026.03333 (+1.09%)
> 32	23779.6		51563.33333 (+116.84%)
> 64	20514.56667	23485.63333 (+14.48%)
> 128	8202.49		9205.483333 (+12.23%)
> 256	3843.163333	4304.8      (+12.01%)
> 
> tbench4 node 0-1
>                            5.18-rc1                patched
> Hmean     1        299.02 (   0.00%)      307.73 *   2.91%*
> Hmean     2        597.88 (   0.00%)      619.10 *   3.55%*
> Hmean     4       1207.11 (   0.00%)     1239.57 *   2.69%*
> Hmean     8       2406.67 (   0.00%)     2463.63 *   2.37%*
> Hmean     16      4755.52 (   0.00%)     4979.46 *   4.71%*
> Hmean     32      9449.01 (   0.00%)     9709.59 *   2.76%*
> Hmean     64     10538.89 (   0.00%)    10727.86 *   1.79%*
> Hmean     128    13333.84 (   0.00%)    14580.63 *   9.35%*
> Hmean     256    11735.24 (   0.00%)    11737.16 (   0.02%)
> 
> tbench4 node 0
>                            5.18-rc1                patched
> Hmean     1        302.26 (   0.00%)      313.43 *   3.70%*
> Hmean     2        603.87 (   0.00%)      618.56 *   2.43%*
> Hmean     4       1213.91 (   0.00%)     1249.63 *   2.94%*
> Hmean     8       2469.72 (   0.00%)     2527.48 *   2.34%*
> Hmean     16      4980.70 (   0.00%)     5099.62 *   2.39%*
> Hmean     32      9001.88 (   0.00%)     9730.27 *   8.09%*
> Hmean     64      7032.07 (   0.00%)     7691.56 *   9.38%*
> Hmean     128     6037.76 (   0.00%)     6712.86 *  11.18%*
> Hmean     256     8513.83 (   0.00%)     9117.79 *   7.09%*
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-05-12  8:14   ` Chen Yu
@ 2022-05-12 13:10     ` Yicong Yang
  0 siblings, 0 replies; 12+ messages in thread
From: Yicong Yang @ 2022-05-12 13:10 UTC (permalink / raw)
  To: Chen Yu
  Cc: yangyicong, Peter Zijlstra, Vincent Guittot, Mel Gorman,
	K Prateek Nayak, Tim Chen, Chen Yu, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Barry Song, Srikar Dronamraju,
	Len Brown, Ben Segall, Aubrey Li, Abel Wu, Zhang Rui,
	linux-kernel, Daniel Bristot de Oliveira

On 2022/5/12 16:14, Chen Yu wrote:
> On Tue, May 10, 2022 at 08:41:57PM +0800, Yicong Yang wrote:
>> On 2022/4/29 2:24, Chen Yu wrote:
>>> @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>>>   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
>>>   */
>>>  SCHED_FEAT(SIS_PROP, true)
>>> +SCHED_FEAT(SIS_UTIL, false)
>>>  
>>
>> I see you mentioned they're mutually exclusive in the commit, worth a comment here?
>>
> Yes, previously I thought it could be made mutually exclusive, and Peter has
> suggested that we make SIS_UTIL enabled by default, so later we could
> remove SIS_PROP if SIS_UTIL behaves stably. So I assume there is no need to
> add a comment in the next version.
>> One minor question: nr is updated during load balance, so there may be a delay because of
>> the load balancing interval.
> Yes, this is a good question. The default interval between two load balances is sd_weight ms,
> which is 112 ms in my case. This interval is a trade-off to reduce cache contention. Besides,
> every first idle CPU, or the balanced CPU in one sched group within the LLC domain, has the
> chance to launch a periodic load balance. For example, although CPU0's and CPU1's periodic
> load balances are both triggered every 112 ms, CPU1 could help launch the load balance when
> CPU0 is not doing load balance work. Considering there are many CPUs in an LLC domain, the
> effective interval to launch the periodic load balance becomes smaller.
>> Furthermore, the LLC domain may not be balanced every time
>> if the lowest domain is not the LLC, as in CLS->LLC, so a bit more delay may be included.
>>
> I thought every domain has its chance to launch a load balance; the difference is that
> different domains have different intervals. No?
I might have missed something. I think it's right here.
>> The test results are fine and as expected, showing an improvement in netperf under heavy
>> load compared to your v2 version.
>>
> Thanks for the test. Would you mind if I add your Tested-by tag?
> 

On Kunpeng920 for this patch,

Tested-by: Yicong Yang <yangyicong@hisilicon.com>

> thanks,
> Chenyu 
>> Thanks,
>> Yicong
>>
>> TCP_RR node 0-1
>> threads
>> 16	57559.56667	57930.03333 (+0.64%)
>> 32	56373		57754.53333 (+2.45%)
>> 64	18831.4		46234.76667 (+145.52%)
>> 128	15658.9		19620.26667 (+25.30%)
>> 256	7959.896667	8869.013333 (+11.42%)
>>
>> TCP_RR node 0
>> threads
>> 16	58389.43333	59026.03333 (+1.09%)
>> 32	23779.6		51563.33333 (+116.84%)
>> 64	20514.56667	23485.63333 (+14.48%)
>> 128	8202.49		9205.483333 (+12.23%)
>> 256	3843.163333	4304.8      (+12.01%)
>>
>> tbench4 node 0-1
>>                            5.18-rc1                patched
>> Hmean     1        299.02 (   0.00%)      307.73 *   2.91%*
>> Hmean     2        597.88 (   0.00%)      619.10 *   3.55%*
>> Hmean     4       1207.11 (   0.00%)     1239.57 *   2.69%*
>> Hmean     8       2406.67 (   0.00%)     2463.63 *   2.37%*
>> Hmean     16      4755.52 (   0.00%)     4979.46 *   4.71%*
>> Hmean     32      9449.01 (   0.00%)     9709.59 *   2.76%*
>> Hmean     64     10538.89 (   0.00%)    10727.86 *   1.79%*
>> Hmean     128    13333.84 (   0.00%)    14580.63 *   9.35%*
>> Hmean     256    11735.24 (   0.00%)    11737.16 (   0.02%)
>>
>> tbench4 node 0
>>                            5.18-rc1                patched
>> Hmean     1        302.26 (   0.00%)      313.43 *   3.70%*
>> Hmean     2        603.87 (   0.00%)      618.56 *   2.43%*
>> Hmean     4       1213.91 (   0.00%)     1249.63 *   2.94%*
>> Hmean     8       2469.72 (   0.00%)     2527.48 *   2.34%*
>> Hmean     16      4980.70 (   0.00%)     5099.62 *   2.39%*
>> Hmean     32      9001.88 (   0.00%)     9730.27 *   8.09%*
>> Hmean     64      7032.07 (   0.00%)     7691.56 *   9.38%*
>> Hmean     128     6037.76 (   0.00%)     6712.86 *  11.18%*
>> Hmean     256     8513.83 (   0.00%)     9117.79 *   7.09%*
>>
> .
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-04-28 18:24 [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg Chen Yu
  2022-05-06  8:38 ` Peter Zijlstra
  2022-05-10 12:41 ` Yicong Yang
@ 2022-05-13  6:37 ` K Prateek Nayak
  2022-05-14 10:55   ` Chen Yu
  2022-05-17 12:50 ` Mel Gorman
  3 siblings, 1 reply; 12+ messages in thread
From: K Prateek Nayak @ 2022-05-13  6:37 UTC (permalink / raw)
  To: Chen Yu, Peter Zijlstra, Vincent Guittot, Mel Gorman,
	Yicong Yang, Tim Chen
  Cc: Chen Yu, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Barry Song, Srikar Dronamraju, Len Brown,
	Ben Segall, Aubrey Li, Abel Wu, Zhang Rui, linux-kernel,
	Daniel Bristot de Oliveira

Hello Chenyu,

Sorry for the delay with the analysis.

On 4/28/2022 11:54 PM, Chen Yu wrote:
> [Problem Statement]
> select_idle_cpu() might spend too much time searching for an idle CPU,
> when the system is overloaded.
>
> [..snip..]
>
> [Test result]
>
> The following is the benchmark result comparison between
> baseline:vanilla and compare:patched kernel. Positive compare%
> indicates better performance.
>
> netperf.throughput
> each thread: netperf -4 -H 127.0.0.1 -t TCP/UDP_RR -c -C -l 100
> =======
> case            	load    	baseline(std%)	compare%( std%)
> TCP_RR          	28 threads	 1.00 (  0.40)	 +1.14 (  0.37)
> TCP_RR          	56 threads	 1.00 (  0.49)	 +0.62 (  0.31)
> TCP_RR          	84 threads	 1.00 (  0.50)	 +0.26 (  0.55)
> TCP_RR          	112 threads	 1.00 (  0.27)	 +0.29 (  0.28)
> TCP_RR          	140 threads	 1.00 (  0.22)	 +0.14 (  0.23)
> TCP_RR          	168 threads	 1.00 (  0.21)	 +0.40 (  0.19)
> TCP_RR          	196 threads	 1.00 (  0.18)	+183.40 ( 16.43)
> TCP_RR          	224 threads	 1.00 (  0.16)	+188.44 (  9.29)
> UDP_RR          	28 threads	 1.00 (  0.47)	 +1.45 (  0.47)
> UDP_RR          	56 threads	 1.00 (  0.28)	 -0.22 (  0.30)
> UDP_RR          	84 threads	 1.00 (  0.38)	 +1.72 ( 27.10)
> UDP_RR          	112 threads	 1.00 (  0.16)	 +0.01 (  0.18)
> UDP_RR          	140 threads	 1.00 ( 14.10)	 +0.32 ( 11.15)
> UDP_RR          	168 threads	 1.00 ( 12.75)	 +0.91 ( 11.62)
> UDP_RR          	196 threads	 1.00 ( 14.41)	+191.97 ( 19.34)
> UDP_RR          	224 threads	 1.00 ( 15.34)	+194.88 ( 17.06)
>
> Take the 224 threads as an example, the SIS search metrics changes are
> illustrated below:
>
>     vanilla                    patched
>    4544492          +237.5%   15338634        sched_debug.cpu.sis_domain_search.avg
>      38539        +39686.8%   15333634        sched_debug.cpu.sis_failed.avg
>   128300000          -87.9%   15551326        sched_debug.cpu.sis_scanned.avg
>    5842896          +162.7%   15347978        sched_debug.cpu.sis_search.avg
>
> There is -87.9% less CPU scans after patched, which indicates lower overhead.
> Besides, with this patch applied, there is -13% less rq lock contention
> in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
> .try_to_wake_up.default_wake_function.woken_wake_function.
> This could help explain the performance improvement - Because this patch allows
> the waking task to remain on the previous CPU, rather than grabbing other CPU's
> lock.
>
> Other benchmarks:
>
> hackbench.throughput
> =========
> case            	load    	baseline(std%)	compare%( std%)
> process-pipe    	1 group 	 1.00 (  0.09)	 -0.54 (  0.82)
> process-pipe    	2 groups 	 1.00 (  0.47)	 +0.89 (  0.61)
> process-pipe    	4 groups 	 1.00 (  0.83)	 +0.90 (  0.15)
> process-pipe    	8 groups 	 1.00 (  0.09)	 +0.31 (  0.07)
> process-sockets 	1 group 	 1.00 (  0.13)	 -0.58 (  0.49)
> process-sockets 	2 groups 	 1.00 (  0.41)	 -0.58 (  0.52)
> process-sockets 	4 groups 	 1.00 (  0.61)	 -0.37 (  0.50)
> process-sockets 	8 groups 	 1.00 (  0.22)	 +1.15 (  0.10)
> threads-pipe    	1 group 	 1.00 (  0.35)	 -0.28 (  0.78)
> threads-pipe    	2 groups 	 1.00 (  0.65)	 +0.03 (  0.96)
> threads-pipe    	4 groups 	 1.00 (  0.43)	 +0.81 (  0.38)
> threads-pipe    	8 groups 	 1.00 (  0.11)	 -1.56 (  0.07)
> threads-sockets 	1 group 	 1.00 (  0.30)	 -0.39 (  0.41)
> threads-sockets 	2 groups 	 1.00 (  0.21)	 -0.23 (  0.27)
> threads-sockets 	4 groups 	 1.00 (  0.23)	 +0.36 (  0.19)
> threads-sockets 	8 groups 	 1.00 (  0.13)	 +1.57 (  0.06)
>
> tbench.throughput
> ======
> case            	load    	baseline(std%)	compare%( std%)
> loopback        	28 threads	 1.00 (  0.15)	 +1.05 (  0.08)
> loopback        	56 threads	 1.00 (  0.09)	 +0.36 (  0.04)
> loopback        	84 threads	 1.00 (  0.12)	 +0.26 (  0.06)
> loopback        	112 threads	 1.00 (  0.12)	 +0.04 (  0.09)
> loopback        	140 threads	 1.00 (  0.04)	 +2.98 (  0.18)
> loopback        	168 threads	 1.00 (  0.10)	 +2.88 (  0.30)
> loopback        	196 threads	 1.00 (  0.06)	 +2.63 (  0.03)
> loopback        	224 threads	 1.00 (  0.08)	 +2.60 (  0.06)
>
> schbench.latency_90%_us
> ========
> case            	load    	baseline	compare%
> normal          	1 mthread	 1.00 	        -1.7%
> normal          	2 mthreads	 1.00           +1.6%
> normal          	4 mthreads	 1.00           +1.4%
> normal          	8 mthreads	 1.00    	+21.0%
>
> Limitations:
> [1]
> This patch is based on the util_avg, which is very sensitive to the CPU
> frequency invariance. The util_avg would decay quite fast when the
> CPU is idle, if the max frequency has been limited by the user.
> Patch [3] should be applied if turbo is disabled manually on Intel
> platforms.
>
> [2]
> There may be unbalanced tasks among CPUs due to CPU affinity. For example,
> suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to
> CPU0~CPU6, while CPU7 is idle:
>
>           CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
> util_avg  1024    1024    1024    1024    1024    1024    1024    0
>
> Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%,
> select_idle_cpu() will not scan, and thus CPU7 goes undetected.

Following are the results from a dual socket Zen3 platform (2 x 64C/128T) running with
various NPS configurations:

Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 223-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  223-231
    Node 7: 112-127, 232-255

Kernel versions:
- tip:      5.18-rc1 tip sched/core
- SIS_UTIL:    5.18-rc1 tip sched/core + this patch

When we began testing, tip was at:

commit: a658353167bf "sched/fair: Revise comment about lb decision matrix"

Following are the results from the benchmark:

* - Data points of concern

~~~~~~~~~
hackbench
~~~~~~~~~

NPS1

Test:                   tip                     SIS_UTIL
 1-groups:         4.64 (0.00 pct)         4.70 (-1.29 pct)
 2-groups:         5.38 (0.00 pct)         5.45 (-1.30 pct)
 4-groups:         6.15 (0.00 pct)         6.10 (0.81 pct)
 8-groups:         7.42 (0.00 pct)         7.42 (0.00 pct)
16-groups:        10.70 (0.00 pct)        11.69 (-9.25 pct)  *

NPS2

Test:                   tip                     SIS_UTIL
 1-groups:         4.70 (0.00 pct)         4.70 (0.00 pct)
 2-groups:         5.45 (0.00 pct)         5.46 (-0.18 pct)
 4-groups:         6.13 (0.00 pct)         6.05 (1.30 pct)
 8-groups:         7.30 (0.00 pct)         7.05 (3.42 pct)
16-groups:        10.30 (0.00 pct)        10.12 (1.74 pct)

NPS4

Test:                   tip                     SIS_UTIL
 1-groups:         4.60 (0.00 pct)         4.75 (-3.26 pct)  *
 2-groups:         5.41 (0.00 pct)         5.42 (-0.18 pct)
 4-groups:         6.12 (0.00 pct)         6.00 (1.96 pct)
 8-groups:         7.22 (0.00 pct)         7.10 (1.66 pct)
16-groups:        10.24 (0.00 pct)        10.11 (1.26 pct)

~~~~~~~~
schbench
~~~~~~~~

NPS 1

#workers:   tip                     SIS_UTIL
  1:      29.00 (0.00 pct)        21.00 (27.58 pct)
  2:      28.00 (0.00 pct)        28.00 (0.00 pct)
  4:      31.50 (0.00 pct)        31.00 (1.58 pct)
  8:      42.00 (0.00 pct)        39.00 (7.14 pct)
 16:      56.50 (0.00 pct)        54.50 (3.53 pct)
 32:      94.50 (0.00 pct)        94.00 (0.52 pct)
 64:     176.00 (0.00 pct)       175.00 (0.56 pct)
128:     404.00 (0.00 pct)       394.00 (2.47 pct)
256:     869.00 (0.00 pct)       863.00 (0.69 pct)
512:     58432.00 (0.00 pct)     55424.00 (5.14 pct)

NPS2

#workers:      tip                     SIS_UTIL
  1:      26.50 (0.00 pct)        25.00 (5.66 pct)
  2:      26.50 (0.00 pct)        25.50 (3.77 pct)
  4:      34.50 (0.00 pct)        34.00 (1.44 pct)
  8:      45.00 (0.00 pct)        46.00 (-2.22 pct)
 16:      56.50 (0.00 pct)        60.50 (-7.07 pct)        *
 32:      95.50 (0.00 pct)        93.00 (2.61 pct)
 64:     179.00 (0.00 pct)       179.00 (0.00 pct)
128:     369.00 (0.00 pct)       376.00 (-1.89 pct)
256:     898.00 (0.00 pct)       903.00 (-0.55 pct)
512:     56256.00 (0.00 pct)     57088.00 (-1.47 pct)

NPS4

#workers:    tip                     SIS_UTIL
  1:      25.00 (0.00 pct)        21.00 (16.00 pct)
  2:      28.00 (0.00 pct)        24.00 (14.28 pct)
  4:      29.50 (0.00 pct)        29.50 (0.00 pct)
  8:      41.00 (0.00 pct)        37.50 (8.53 pct)
 16:      65.50 (0.00 pct)        64.00 (2.29 pct)
 32:      93.00 (0.00 pct)        94.50 (-1.61 pct)
 64:     170.50 (0.00 pct)       175.50 (-2.93 pct)
128:     377.00 (0.00 pct)       368.50 (2.25 pct)
256:     867.00 (0.00 pct)       902.00 (-4.03 pct)
512:     58048.00 (0.00 pct)     55488.00 (4.41 pct)

~~~~~~
tbench
~~~~~~

NPS 1

Clients:     tip                     SIS_UTIL
    1    443.31 (0.00 pct)       456.19 (2.90 pct)
    2    877.32 (0.00 pct)       875.24 (-0.23 pct)
    4    1665.11 (0.00 pct)      1647.31 (-1.06 pct)
    8    3016.68 (0.00 pct)      2993.23 (-0.77 pct)
   16    5374.30 (0.00 pct)      5246.93 (-2.36 pct)
   32    8763.86 (0.00 pct)      7878.18 (-10.10 pct)     *
   64    15786.93 (0.00 pct)     12958.47 (-17.91 pct)    *
  128    26826.08 (0.00 pct)     26741.14 (-0.31 pct)
  256    24207.35 (0.00 pct)     52041.89 (114.98 pct)
  512    51740.58 (0.00 pct)     52084.44 (0.66 pct)
 1024    51177.82 (0.00 pct)     53126.29 (3.80 pct)

NPS 2

Clients:     tip                     SIS_UTIL
    1    449.49 (0.00 pct)       447.96 (-0.34 pct)
    2    867.28 (0.00 pct)       869.52 (0.25 pct)
    4    1643.60 (0.00 pct)      1625.91 (-1.07 pct)
    8    3047.35 (0.00 pct)      2952.82 (-3.10 pct)
   16    5340.77 (0.00 pct)      5251.41 (-1.67 pct)
   32    10536.85 (0.00 pct)     8843.49 (-16.07 pct)     *
   64    16543.23 (0.00 pct)     14265.35 (-13.76 pct)    *
  128    26400.40 (0.00 pct)     25595.42 (-3.04 pct)
  256    23436.75 (0.00 pct)     47090.03 (100.92 pct)
  512    50902.60 (0.00 pct)     50036.58 (-1.70 pct)
 1024    50216.10 (0.00 pct)     50639.74 (0.84 pct)

NPS 4

Clients:     tip                     SIS_UTIL
    1    443.82 (0.00 pct)       459.93 (3.62 pct)
    2    849.14 (0.00 pct)       882.17 (3.88 pct)
    4    1603.26 (0.00 pct)      1629.64 (1.64 pct)
    8    2972.37 (0.00 pct)      3003.09 (1.03 pct)
   16    5277.13 (0.00 pct)      5234.07 (-0.81 pct)
   32    9744.73 (0.00 pct)      9347.90 (-4.07 pct)      *
   64    15854.80 (0.00 pct)     14180.27 (-10.56 pct)    *
  128    26116.97 (0.00 pct)     24597.45 (-5.81 pct)     *
  256    22403.25 (0.00 pct)     47385.09 (111.50 pct)
  512    48317.20 (0.00 pct)     49781.02 (3.02 pct)
 1024    50445.41 (0.00 pct)     51607.53 (2.30 pct)

~~~~~~
Stream
~~~~~~

- 10 runs

NPS1

              tip                     SIS_UTIL
 Copy:   189113.11 (0.00 pct)    188490.27 (-0.32 pct)
Scale:   201190.61 (0.00 pct)    204526.15 (1.65 pct)
  Add:   232654.21 (0.00 pct)    234948.01 (0.98 pct)
Triad:   226583.57 (0.00 pct)    228844.43 (0.99 pct)

NPS2

Test:         tip                     SIS_UTIL
 Copy:   155347.14 (0.00 pct)    169386.29 (9.03 pct)
Scale:   191701.53 (0.00 pct)    196110.51 (2.29 pct)
  Add:   210013.97 (0.00 pct)    221088.45 (5.27 pct)
Triad:   207602.00 (0.00 pct)    218072.52 (5.04 pct)

NPS4

Test:         tip                     SIS_UTIL
 Copy:   136421.15 (0.00 pct)    140894.11 (3.27 pct)
Scale:   191217.59 (0.00 pct)    190554.17 (-0.34 pct)
  Add:   189229.52 (0.00 pct)    190871.88 (0.86 pct)
Triad:   188052.99 (0.00 pct)    188417.63 (0.19 pct)

- 100 runs

NPS1

Test:       tip                     SIS_UTIL
 Copy:   244693.32 (0.00 pct)    232328.05 (-5.05 pct)
Scale:   221874.99 (0.00 pct)    216858.39 (-2.26 pct)
  Add:   268363.89 (0.00 pct)    265449.16 (-1.08 pct)
Triad:   260945.24 (0.00 pct)    252240.56 (-3.33 pct)

NPS2

Test:       tip                     SIS_UTIL
 Copy:   211262.00 (0.00 pct)    225240.03 (6.61 pct)
Scale:   222493.34 (0.00 pct)    219094.65 (-1.52 pct)
  Add:   280277.17 (0.00 pct)    275677.73 (-1.64 pct)
Triad:   265860.49 (0.00 pct)    262584.22 (-1.23 pct)

NPS4

Test:       tip                     SIS_UTIL
 Copy:   250171.40 (0.00 pct)    230983.60 (-7.66 pct)
Scale:   222293.56 (0.00 pct)    215984.34 (-2.83 pct)
  Add:   279222.16 (0.00 pct)    270402.64 (-3.15 pct)
Triad:   262013.92 (0.00 pct)    254820.60 (-2.74 pct)

~~~~~~~~~~~~
ycsb-mongodb
~~~~~~~~~~~~

NPS1

sched-tip:      303718.33 (var: 1.31)
SIS_UTIL:       303529.33 (var: 0.67)    (-0.06%)

NPS2

sched-tip:      304536.33 (var: 2.46)
SIS_UTIL:       303730.33 (var: 1.57)    (-0.26%)

NPS4

sched-tip:      301192.33 (var: 1.81)
SIS_UTIL:       300101.33 (var: 0.35)   (-0.36%)

~~~~~~~~~~~~~~~~~~

Notes:

- There seems to be a noticeable regression for hackbench
  with 16 groups in NPS1 mode.
- There seems to be a regression in tbench for cases with the number
  of workers in the range 32-128 (12.5% loaded to 50% loaded).
- tbench reaches saturation early when the system is fully loaded.

This probably shows that the strategy in the initial v1 RFC
works better for our LLC, where the number of CPUs per LLC
is low compared to systems with a unified LLC. Given that this is
showing great results for a unified LLC, maybe SIS_PROP and SIS_UTIL
can be enabled based on the size of the LLC.

> [..snip..]
>
> [3]
> Prateek mentioned that we should scan aggressively in an LLC domain
> with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is
> negligible. The current patch aims to propose a generic solution and only
> considers the util_avg. A follow-up change could enhance the scan policy
> to adjust the scan_percent according to the CPU number in LLC.

Following are some additional numbers I would like to share comparing SIS_PROP and
SIS_UTIL:

o Hackbench with 1 group

With 1 group, following are the chances of SIS_PROP
and SIS_UTIL finding an idle CPU when an idle CPU
exists in LLC:

+-----------------+---------------------------+---------------------------+--------+
| Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU | Count  |
+-----------------+---------------------------+---------------------------+--------+
|        1        |             0             |             0             | 66444  |
|        1        |             0             |             1             | 34153  |
|        1        |             1             |             0             | 57204  |
|        1        |             1             |             1             | 119263 |
+-----------------+---------------------------+---------------------------+--------+

SIS_PROP vs no SIS_PROP CPU search stats:

Total time without SIS_PROP: 90653653
Total time with SIS_PROP: 53558942 (-40.92 pct)
Total time saved: 37094711

Following is the distribution of the number of CPUs SIS_UTIL will search when the SIS_PROP limit is >= 16 (the LLC size):

+--------------+-------+
| CPU Searched | Count |
+--------------+-------+
|      0       | 10520 |
|      1       |  7770 |
|      2       | 11976 |
|      3       | 17554 |
|      4       | 13932 |
|      5       | 15051 |
|      6       |  8398 |
|      7       |  4544 |
|      8       |  3712 |
|      9       |  2337 |
|      10      |  4541 |
|      11      |  1947 |
|      12      |  3846 |
|      13      |  3645 |
|      14      |  2686 |
|      15      |  8390 |
|      16      | 26157 |
+--------------+-------+

- SIS_UTIL might be bailing out too early in some of these cases.

o Hackbench with 16 groups

The success rate looks as follows:

+-----------------+---------------------------+---------------------------+---------+
| Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU |  Count  |
+-----------------+---------------------------+---------------------------+---------+
|        1        |             0             |             0             | 1313745 |
|        1        |             0             |             1             |  694132 |
|        1        |             1             |             0             | 2888450 |
|        1        |             1             |             1             | 5343065 |
+-----------------+---------------------------+---------------------------+---------+

SIS_PROP vs no SIS_PROP CPU search stats:

Total time without SIS_PROP: 5227299388
Total time with SIS_PROP: 3866575188 (-26.03 pct)
Total time saved: 1360724200

Following is the distribution of the number of CPUs SIS_UTIL will search when the SIS_PROP limit is >= 16 (the LLC size):

+--------------+---------+
| CPU Searched |  Count  |
+--------------+---------+
|      0       |  150351 |
|      1       |  105116 |
|      2       |  214291 |
|      3       |  440053 |
|      4       |  914116 |
|      5       | 1757984 |
|      6       | 2410484 |
|      7       | 1867668 |
|      8       |  379888 |
|      9       |  84055  |
|      10      |  55389  |
|      11      |  26795  |
|      12      |  43113  |
|      13      |  24579  |
|      14      |  32896  |
|      15      |  70059  |
|      16      |  150858 |
+--------------+---------+

- SIS_UTIL might be bailing out too early in most of these cases.

o tbench with 256 workers

For tbench with 256 workers, SIS_UTIL works great, as it drastically cuts down the
number of CPUs to search.

SIS_PROP vs no SIS_PROP CPU search stats:

Total time without SIS_PROP: 64004752959
Total time with SIS_PROP: 34695004390 (-45.79 pct)
Total time saved: 29309748569

Following is the distribution of the number of CPUs SIS_UTIL will search when the SIS_PROP limit is >= 16 (the LLC size):

+--------------+----------+
| CPU Searched |  Count   |
+--------------+----------+
|      0       |  500077  |
|      1       |  543865  |
|      2       | 4257684  |
|      3       | 27457498 |
|      4       | 40208673 |
|      5       | 3264358  |
|      6       |  191631  |
|      7       |  24658   |
|      8       |   2469   |
|      9       |   1374   |
|      10      |   2008   |
|      11      |   1300   |
|      12      |   1226   |
|      13      |   1179   |
|      14      |   1631   |
|      15      |  11678   |
|      16      |   7793   |
+--------------+----------+

- This is where SIS_UTIL shines for the tbench case with 256 workers, as it is
  effective at restricting the search space.

o Observations

SIS_PROP seems to have a higher chance of finding an idle CPU than SIS_UTIL
in the case of hackbench with 16 groups. The gap between SIS_PROP and SIS_UTIL
is wider with 16 groups than with 1 group.
Also, SIS_PROP is more aggressive at saving search time with 1 group than
with 16 groups.

The bailout from SIS_UTIL is fruitful for tbench with 256 workers, leading to a
massive performance gain in a fully loaded system.

Note: There might be some inaccuracies in the numbers presented for metrics that
directly compare SIS_PROP and SIS_UTIL, as both SIS_PROP and SIS_UTIL were enabled
when gathering these data points, and the result from SIS_PROP was the one returned
from select_idle_cpu(). All the numbers for the above analysis were gathered in NPS1 mode.

> [..snip..]

--
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-05-13  6:37 ` K Prateek Nayak
@ 2022-05-14 10:55   ` Chen Yu
  2022-05-16 10:52     ` K Prateek Nayak
  0 siblings, 1 reply; 12+ messages in thread
From: Chen Yu @ 2022-05-14 10:55 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Vincent Guittot, Mel Gorman, Yicong Yang,
	Tim Chen, Chen Yu, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Barry Song, Srikar Dronamraju, Len Brown,
	Ben Segall, Aubrey Li, Abel Wu, Zhang Rui, linux-kernel,
	Daniel Bristot de Oliveira

Hi Prateek,
On Fri, May 13, 2022 at 12:07:00PM +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> Sorry for the delay with analysis.
>
Thanks very much for the test and analysis in detail.
> 
> Following are the results from dual socket Zen3 platform (2 x 64C/128T) running with
> various NPS configuration:
May I know if, in all NPS modes, all LLC domains have 16 CPUs?
> 
> Following is the NUMA configuration for each NPS mode on the system:
> 
> NPS1: Each socket is a NUMA node.
>     Total 2 NUMA nodes in the dual socket machine.
> 
>     Node 0: 0-63,   128-191
>     Node 1: 64-127, 192-255
> 
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>     Total 4 NUMA nodes exist over 2 socket.
>    
>     Node 0: 0-31,   128-159
>     Node 1: 32-63,  160-191
>     Node 2: 64-95,  192-223
>     Node 3: 96-127, 224-255
> 
> NPS4: Each socket is logically divided into 4 NUMA regions.
>     Total 8 NUMA nodes exist over 2 socket.
>    
>     Node 0: 0-15,    128-143
>     Node 1: 16-31,   144-159
>     Node 2: 32-47,   160-175
>     Node 3: 48-63,   176-191
>     Node 4: 64-79,   192-207
>     Node 5: 80-95,   208-223
>     Node 6: 96-111,  224-239
>     Node 7: 112-127, 240-255
> 
> Kernel versions:
> - tip:      5.18-rc1 tip sched/core
> - SIS_UTIL:    5.18-rc1 tip sched/core + this patch
> 
> When we began testing, tip was at:
> 
> commit: a658353167bf "sched/fair: Revise comment about lb decision matrix"
> 
> Following are the results from the benchmark:
> 
> * - Data points of concern
> 
> ~~~~~~~~~
> hackbench
> ~~~~~~~~~
> 
> NPS1
> 
> Test:                   tip                     SIS_UTIL
>  1-groups:         4.64 (0.00 pct)         4.70 (-1.29 pct)
>  2-groups:         5.38 (0.00 pct)         5.45 (-1.30 pct)
>  4-groups:         6.15 (0.00 pct)         6.10 (0.81 pct)
>  8-groups:         7.42 (0.00 pct)         7.42 (0.00 pct)
> 16-groups:        10.70 (0.00 pct)        11.69 (-9.25 pct)  *
> 
> NPS2
> 
> Test:                   tip                     SIS_UTIL
>  1-groups:         4.70 (0.00 pct)         4.70 (0.00 pct)
>  2-groups:         5.45 (0.00 pct)         5.46 (-0.18 pct)
>  4-groups:         6.13 (0.00 pct)         6.05 (1.30 pct)
>  8-groups:         7.30 (0.00 pct)         7.05 (3.42 pct)
> 16-groups:        10.30 (0.00 pct)        10.12 (1.74 pct)
> 
> NPS4
> 
> Test:                   tip                     SIS_UTIL
>  1-groups:         4.60 (0.00 pct)         4.75 (-3.26 pct)  *
>  2-groups:         5.41 (0.00 pct)         5.42 (-0.18 pct)
>  4-groups:         6.12 (0.00 pct)         6.00 (1.96 pct)
>  8-groups:         7.22 (0.00 pct)         7.10 (1.66 pct)
> 16-groups:        10.24 (0.00 pct)        10.11 (1.26 pct)
> 
> [..snip..]
> 
> Notes:
> 
> - There seems to be some noticeable regression for hackbench
>   with 16 groups in NPS1 mode.
Did hackbench use the default fd number (20) in every group? If
that is the case, then there are 16 * 20 * 2 = 640 threads in the
system. This should be an overloaded case; with SIS_PROP and
SIS_UTIL, the search depth would be about 4 and 0 respectively. It
is also very likely that SIS_PROP will not find an idle CPU after
searching 4 CPUs. So in theory there should not be much performance
difference with vs. without the patch applied. But if the fd number is set
to a smaller value, the regression could be explained as you mentioned:
SIS_PROP searches more aggressively.
> - There seems to be regression in tbench for case with number
>   of workers in range 32-128 (12.5% loaded to 50% loaded)
> - tbench reaches saturation early when system is fully loaded
> 
> This probably show that the strategy in the initial v1 RFC
> seems to work better with our LLC where number of CPUs per LLC
> is low compared to systems with unified LLC. Given this is
> showing great results for unified LLC, maybe SIS_PROP and SIS_UTIL
> can be enabled based on the the size of LLC.
> 
Yes, SIS_PROP searches more aggressively, but we are attempting to replace
SIS_PROP with a more accurate policy.
> > [..snip..]
> >
> > [3]
> > Prateek mentioned that we should scan aggressively in an LLC domain
> > with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is
> > negligible. The current patch aims to propose a generic solution and only
> > considers the util_avg. A follow-up change could enhance the scan policy
> > to adjust the scan_percent according to the CPU number in LLC.
> 
> Following are some additional numbers I would like to share comparing SIS_PROP and
> SIS_UTIL:
> 
Nice analysis.
> o Hackbench with 1 group
> 
> With 1 group, following are the chances of SIS_PROP
> and SIS_UTIL finding an idle CPU when an idle CPU
> exists in LLC:
> 
> +-----------------+---------------------------+---------------------------+--------+
> | Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU | Count  |
> +-----------------+---------------------------+---------------------------+--------+
> |        1        |             0             |             0             | 66444  |
> |        1        |             0             |             1             | 34153  |
> |        1        |             1             |             0             | 57204  |
> |        1        |             1             |             1             | 119263 |
> +-----------------+---------------------------+---------------------------+--------+
> 
So SIS_PROP searches more, and gets a higher chance of finding an idle CPU in an
LLC with 16 CPUs.
> SIS_PROP vs no SIS_PROP CPU search stats:
> 
> Total time without SIS_PROP: 90653653
> Total time with SIS_PROP: 53558942 (-40.92 pct)
> Total time saved: 37094711
> 
What does no SIS_PROP mean? Is it with SIS_PROP disabled and
SIS_UTIL enabled? Or with both SIS_PROP and SIS_UTIL disabled?
If it is the latter, is there any performance difference between
the two?
> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
> 
> +--------------+-------+
> | CPU Searched | Count |
> +--------------+-------+
> |      0       | 10520 |
> |      1       |  7770 |
> |      2       | 11976 |
> |      3       | 17554 |
> |      4       | 13932 |
> |      5       | 15051 |
> |      6       |  8398 |
> |      7       |  4544 |
> |      8       |  3712 |
> |      9       |  2337 |
> |      10      |  4541 |
> |      11      |  1947 |
> |      12      |  3846 |
> |      13      |  3645 |
> |      14      |  2686 |
> |      15      |  8390 |
> |      16      | 26157 |
> +--------------+-------+
> 
> - SIS_UTIL might be bailing out too early in some of these cases.
>
Right. 
> o Hackbench with 16 group
> 
> the success rate looks as follows:
> 
> +-----------------+---------------------------+---------------------------+---------+
> | Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU |  Count  |
> +-----------------+---------------------------+---------------------------+---------+
> |        1        |             0             |             0             | 1313745 |
> |        1        |             0             |             1             |  694132 |
> |        1        |             1             |             0             | 2888450 |
> |        1        |             1             |             1             | 5343065 |
> +-----------------+---------------------------+---------------------------+---------+
> 
> SIS_PROP vs no SIS_PROP CPU search stats:
> 
> Total time without SIS_PROP: 5227299388
> Total time with SIS_PROP: 3866575188 (-26.03 pct)
> Total time saved: 1360724200
> 
> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
> 
> +--------------+---------+
> | CPU Searched |  Count  |
> +--------------+---------+
> |      0       |  150351 |
> |      1       |  105116 |
> |      2       |  214291 |
> |      3       |  440053 |
> |      4       |  914116 |
> |      5       | 1757984 |
> |      6       | 2410484 |
> |      7       | 1867668 |
> |      8       |  379888 |
> |      9       |  84055  |
> |      10      |  55389  |
> |      11      |  26795  |
> |      12      |  43113  |
> |      13      |  24579  |
> |      14      |  32896  |
> |      15      |  70059  |
> |      16      |  150858 |
> +--------------+---------+
> 
> - SIS_UTIL might be bailing out too early in most of these cases
> 
It might be interesting to see what the current sum of util_avg is. This suggests that,
even if util_avg is a little high, it might still be worthwhile to search more CPUs.
> o tbench with 256 workers
> 
> For tbench with 256 threads, SIS_UTIL works great as we have drastically cut down the number
> of CPUs to search.
> 
> SIS_PROP vs no SIS_PROP CPU search stats:
> 
> Total time without SIS_PROP: 64004752959
> Total time with SIS_PROP: 34695004390 (-45.79 pct)
> Total time saved: 29309748569
> 
> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
> 
> +--------------+----------+
> | CPU Searched |  Count   |
> +--------------+----------+
> |      0       |  500077  |
> |      1       |  543865  |
> |      2       | 4257684  |
> |      3       | 27457498 |
> |      4       | 40208673 |
> |      5       | 3264358  |
> |      6       |  191631  |
> |      7       |  24658   |
> |      8       |   2469   |
> |      9       |   1374   |
> |      10      |   2008   |
> |      11      |   1300   |
> |      12      |   1226   |
> |      13      |   1179   |
> |      14      |   1631   |
> |      15      |  11678   |
> |      16      |   7793   |
> +--------------+----------+
> 
> - This is where SIS_UTIL shines for tbench case with 256 workers as it is effective
>   at restricting search space well.
> 
> o Observations
> 
> SIS_PROP seems to have a higher chance of finding an idle CPU compared to SIS_UTIL
> in case of hackbench with 16-group. The gap between SIS_PROP and SIS_UTIL is wider
> with 16 groups compared to than with 1 group.
> Also SIS_PROP is more aggressive at saving time for 1-group compared to the
> case with 16-groups.
> 
> The bailout from SIS_UTIL is fruitful for tbench with 256 workers leading to massive
> performance gain in a fully loaded system.
> 
> Note: There might be some inaccuracies for the numbers presented for metrics that
> directly compare SIS_PROP and SIS_UTIL as both SIS_PROP and SIS_UTIL were enabled
> when gathering these data points and the results from SIS_PROP were returned from
> search_idle_cpu().
Do you mean the 'CPU Searched' calculated by SIS_PROP was collected with both SIS_UTIL
and SIS_PROP enabled?
> All the numbers for the above analysis were gathered in NPS1 mode.
> 
I'm thinking of taking nr_llc into consideration to adjust the search depth,
something like:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd52fc5a034b..39b914599dce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9302,6 +9302,9 @@ static inline void update_idle_cpu_scan(struct lb_env *env,
        llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
        nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
        nr_scan = max(nr_scan, 0);
+       if (nr_llc <= 16 && nr_scan)
+               nr_scan = nr_llc;
+
        WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
 }

I'll offline the CPUs to make it 16 CPUs per LLC, and check how hackbench behaves.

thanks,
Chenyu

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-05-14 10:55   ` Chen Yu
@ 2022-05-16 10:52     ` K Prateek Nayak
  2022-05-19 18:19       ` Chen Yu
  0 siblings, 1 reply; 12+ messages in thread
From: K Prateek Nayak @ 2022-05-16 10:52 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Vincent Guittot, Mel Gorman, Yicong Yang,
	Tim Chen, Chen Yu, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Barry Song, Srikar Dronamraju, Len Brown,
	Ben Segall, Aubrey Li, Abel Wu, Zhang Rui, linux-kernel,
	Daniel Bristot de Oliveira

Hello Chenyu,

Thank you taking a look at the results.

On 5/14/2022 4:25 PM, Chen Yu wrote:
> [..snip..]
> May I know if in all NPS mode, all LLC domains have 16 CPUs?
Yes. The number of CPUs in LLC domain is always 16 irrespective of the NPS mode.
>> [..snip..]
>>
>> Notes:
>>
>> - There seems to be some noticeable regression for hackbench
>>   with 16 groups in NPS1 mode.
> Did the hackbench use the default fd number (20) in every group? If
> this is the case, then there are 16 * 20 * 2 = 640 threads in the
> system. I thought this should be overloaded; under SIS_PROP and
> SIS_UTIL the search depth might be 4 and 0 respectively. And it
> is also very likely that SIS_PROP will not find an idle CPU after
> searching 4 CPUs. So in theory there should not be much performance
> difference with vs without the patch applied. But if the fd number is set
> to a smaller one, the regression could be explained as you mentioned:
> SIS_PROP searches more aggressively.
Yes, I'm using the default fd number (20). The logs from the hackbench run
show that it is indeed running 640 threads with 16 groups:

# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 16 groups == 640 threads run

This is indeed counterintuitive and I don't have
a good explanation for it other than that SIS_PROP
is probably slightly more successful at finding
an idle CPU even in this overloaded environment.

I've run the benchmark in two sets of 3 runs, rebooting
in between, on each kernel version:

- tip

Test:                   tip-r0                  tip-r1                  tip-r2
 1-groups:         4.64 (0.00 pct)         4.90 (-5.60 pct)        4.99 (-7.54 pct)
 2-groups:         5.54 (0.00 pct)         5.56 (-0.36 pct)        5.58 (-0.72 pct)
 4-groups:         6.24 (0.00 pct)         6.18 (0.96 pct)         6.20 (0.64 pct)
 8-groups:         7.54 (0.00 pct)         7.50 (0.53 pct)         7.54 (0.00 pct)
16-groups:        10.85 (0.00 pct)        11.17 (-2.94 pct)       10.91 (-0.55 pct)

Test:                   tip-r3                  tip-r4                  tip-r5
 1-groups:         4.68 (0.00 pct)         4.97 (-6.19 pct)        4.98 (-6.41 pct)
 2-groups:         5.60 (0.00 pct)         5.62 (-0.35 pct)        5.66 (-1.07 pct)
 4-groups:         6.24 (0.00 pct)         6.23 (0.16 pct)         6.24 (0.00 pct)
 8-groups:         7.54 (0.00 pct)         7.50 (0.53 pct)         7.46 (1.06 pct)
16-groups:        10.81 (0.00 pct)        10.84 (-0.27 pct)       10.81 (0.00 pct)

- SIS_UTIL


Test:                SIS_UTIL-r0              SIS_UTIL-r1             SIS_UTIL-r2
 1-groups:         4.68 (0.00 pct)         5.03 (-7.47 pct)        4.96 (-5.98 pct)
 2-groups:         5.45 (0.00 pct)         5.48 (-0.55 pct)        5.50 (-0.91 pct)
 4-groups:         6.10 (0.00 pct)         6.07 (0.49 pct)         6.14 (-0.65 pct)
 8-groups:         7.52 (0.00 pct)         7.51 (0.13 pct)         7.52 (0.00 pct)
16-groups:        11.63 (0.00 pct)        11.48 (1.28 pct)        11.51 (1.03 pct)

Test:                SIS_UTIL-r3              SIS_UTIL-r4             SIS_UTIL-r5
 1-groups:         4.80 (0.00 pct)         5.00 (-4.16 pct)        5.06 (-5.41 pct)
 2-groups:         5.51 (0.00 pct)         5.58 (-1.27 pct)        5.58 (-1.27 pct)
 4-groups:         6.14 (0.00 pct)         6.11 (0.48 pct)         6.06 (1.30 pct)
 8-groups:         7.35 (0.00 pct)         7.38 (-0.40 pct)        7.40 (-0.68 pct)
16-groups:        11.03 (0.00 pct)        11.29 (-2.35 pct)       11.14 (-0.99 pct)

- Comparing the good (best) and bad (worst) data points for 16-groups with
each kernel version:

Test:                   tip-good             SIS_UTIL-good
 1-groups:         4.68 (0.00 pct)         4.80 (-2.56 pct)
 2-groups:         5.60 (0.00 pct)         5.51 (1.60 pct)
 4-groups:         6.24 (0.00 pct)         6.14 (1.60 pct)
 8-groups:         7.54 (0.00 pct)         7.35 (2.51 pct)
16-groups:        10.81 (0.00 pct)        11.03 (-2.03 pct)

Test:                   tip-good             SIS_UTIL-bad
 1-groups:         4.68 (0.00 pct)         4.68 (0.00 pct)
 2-groups:         5.60 (0.00 pct)         5.45 (2.67 pct)
 4-groups:         6.24 (0.00 pct)         6.10 (2.24 pct)
 8-groups:         7.54 (0.00 pct)         7.52 (0.26 pct)
16-groups:        10.81 (0.00 pct)        11.63 (-7.58 pct)

Test:                   tip-bad             SIS_UTIL-good
 1-groups:         4.90 (0.00 pct)         4.80 (2.04 pct)
 2-groups:         5.56 (0.00 pct)         5.51 (0.89 pct)
 4-groups:         6.18 (0.00 pct)         6.14 (0.64 pct)
 8-groups:         7.50 (0.00 pct)         7.35 (2.00 pct)
16-groups:        11.17 (0.00 pct)        11.03 (1.25 pct)

Test:                   tip-bad             SIS_UTIL-bad
 1-groups:         4.90 (0.00 pct)         4.68 (4.48 pct)
 2-groups:         5.56 (0.00 pct)         5.45 (1.97 pct)
 4-groups:         6.18 (0.00 pct)         6.10 (1.29 pct)
 8-groups:         7.50 (0.00 pct)         7.52 (-0.26 pct)
16-groups:        11.17 (0.00 pct)        11.63 (-4.11 pct)

Hackbench consistently reports > 11 for the 16-group
case with SIS_UTIL, but only once with SIS_PROP.

>> - There seems to be a regression in tbench for cases with the number
>>   of workers in the range 32-128 (12.5% loaded to 50% loaded)
>> - tbench reaches saturation early when the system is fully loaded
>>
>> This probably shows that the strategy in the initial v1 RFC
>> works better with our LLC, where the number of CPUs per LLC
>> is low compared to systems with a unified LLC. Given this is
>> showing great results for a unified LLC, maybe SIS_PROP and SIS_UTIL
>> can be enabled based on the size of the LLC.
>>
> Yes, SIS_PROP searches more aggressively, but we are attempting to replace
> SIS_PROP with a more accurate policy.
>>> [..snip..]
>>>
>>> [3]
>>> Prateek mentioned that we should scan aggressively in an LLC domain
>>> with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is
>>> negligible. The current patch aims to propose a generic solution and only
>>> considers the util_avg. A follow-up change could enhance the scan policy
>>> to adjust the scan_percent according to the CPU number in LLC.
>> Following are some additional numbers I would like to share comparing SIS_PROP and
>> SIS_UTIL:
>>
> Nice analysis.
>> o Hackbench with 1 group
>>
>> With 1 group, following are the chances of SIS_PROP
>> and SIS_UTIL finding an idle CPU when an idle CPU
>> exists in LLC:
>>
>> +-----------------+---------------------------+---------------------------+--------+
>> | Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU | Count  |
>> +-----------------+---------------------------+---------------------------+--------+
>> |        1        |             0             |             0             | 66444  |
>> |        1        |             0             |             1             | 34153  |
>> |        1        |             1             |             0             | 57204  |
>> |        1        |             1             |             1             | 119263 |
>> +-----------------+---------------------------+---------------------------+--------+
>>
> So SIS_PROP searches more, and gets a higher chance of finding an idle CPU in an LLC with
> 16 CPUs.
Yes!
>> SIS_PROP vs no SIS_PROP CPU search stats:
>>
>> Total time without SIS_PROP: 90653653
>> Total time with SIS_PROP: 53558942 (-40.92 pct)
>> Total time saved: 37094711
>>
> What does no SIS_PROP mean? Is it with SIS_PROP disabled and
> SIS_UTIL enabled? Or with both SIS_PROP and SIS_UTIL disabled?
> If it is the latter, is there any performance difference between
> the two?

Sorry for not being clear here. No SIS_PROP means we are always searching the
entire LLC for an idle CPU. This data aims to show how much time SIS_PROP
saves compared to the case where it is disabled.

>> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
>>
>> +--------------+-------+
>> | CPU Searched | Count |
>> +--------------+-------+
>> |      0       | 10520 |
>> |      1       |  7770 |
>> |      2       | 11976 |
>> |      3       | 17554 |
>> |      4       | 13932 |
>> |      5       | 15051 |
>> |      6       |  8398 |
>> |      7       |  4544 |
>> |      8       |  3712 |
>> |      9       |  2337 |
>> |      10      |  4541 |
>> |      11      |  1947 |
>> |      12      |  3846 |
>> |      13      |  3645 |
>> |      14      |  2686 |
>> |      15      |  8390 |
>> |      16      | 26157 |
>> +--------------+-------+
>>
>> - SIS_UTIL might be bailing out too early in some of these cases.
>>
> Right. 
>> o Hackbench with 16 groups
>>
>> The success rate looks as follows:
>>
>> +-----------------+---------------------------+---------------------------+---------+
>> | Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU |  Count  |
>> +-----------------+---------------------------+---------------------------+---------+
>> |        1        |             0             |             0             | 1313745 |
>> |        1        |             0             |             1             |  694132 |
>> |        1        |             1             |             0             | 2888450 |
>> |        1        |             1             |             1             | 5343065 |
>> +-----------------+---------------------------+---------------------------+---------+
>>
>> SIS_PROP vs no SIS_PROP CPU search stats:
>>
>> Total time without SIS_PROP: 5227299388
>> Total time with SIS_PROP: 3866575188 (-26.03 pct)
>> Total time saved: 1360724200
>>
>> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
>>
>> +--------------+---------+
>> | CPU Searched |  Count  |
>> +--------------+---------+
>> |      0       |  150351 |
>> |      1       |  105116 |
>> |      2       |  214291 |
>> |      3       |  440053 |
>> |      4       |  914116 |
>> |      5       | 1757984 |
>> |      6       | 2410484 |
>> |      7       | 1867668 |
>> |      8       |  379888 |
>> |      9       |  84055  |
>> |      10      |  55389  |
>> |      11      |  26795  |
>> |      12      |  43113  |
>> |      13      |  24579  |
>> |      14      |  32896  |
>> |      15      |  70059  |
>> |      16      |  150858 |
>> +--------------+---------+
>>
>> - SIS_UTIL might be bailing out too early in most of these cases
>>
> It might be interesting to see what the current sum of util_avg is. This suggests that,
> even if util_avg is a little high, it might still be worthwhile to search more CPUs.
I agree. Let me know if there is any data you would like me to collect wrt this.
>> o tbench with 256 workers
>>
>> For tbench with 256 threads, SIS_UTIL works great as we have drastically cut down the number
>> of CPUs to search.
>>
>> SIS_PROP vs no SIS_PROP CPU search stats:
>>
>> Total time without SIS_PROP: 64004752959
>> Total time with SIS_PROP: 34695004390 (-45.79 pct)
>> Total time saved: 29309748569
>>
>> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
>>
>> +--------------+----------+
>> | CPU Searched |  Count   |
>> +--------------+----------+
>> |      0       |  500077  |
>> |      1       |  543865  |
>> |      2       | 4257684  |
>> |      3       | 27457498 |
>> |      4       | 40208673 |
>> |      5       | 3264358  |
>> |      6       |  191631  |
>> |      7       |  24658   |
>> |      8       |   2469   |
>> |      9       |   1374   |
>> |      10      |   2008   |
>> |      11      |   1300   |
>> |      12      |   1226   |
>> |      13      |   1179   |
>> |      14      |   1631   |
>> |      15      |  11678   |
>> |      16      |   7793   |
>> +--------------+----------+
>>
>> - This is where SIS_UTIL shines for the tbench case with 256 workers, as it
>>   is effective at restricting the search space.
>>
>> o Observations
>>
>> SIS_PROP seems to have a higher chance of finding an idle CPU compared to SIS_UTIL
>> in the case of hackbench with 16 groups. The gap between SIS_PROP and SIS_UTIL is
>> wider with 16 groups than with 1 group.
>> Also, SIS_PROP is more aggressive at saving time for 1 group compared to the
>> case with 16 groups.
>>
>> The bailout from SIS_UTIL is fruitful for tbench with 256 workers leading to massive
>> performance gain in a fully loaded system.
>>
>> Note: There might be some inaccuracies in the numbers presented for metrics that
>> directly compare SIS_PROP and SIS_UTIL, as both SIS_PROP and SIS_UTIL were enabled
>> when gathering these data points and the results from SIS_PROP were returned from
>> select_idle_cpu().
> Do you mean the 'CPU Searched' calculated by SIS_PROP was collected with both SIS_UTIL
> and SIS_PROP enabled?
Yes, the table
"Number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size)"
was obtained by enabling both features - SIS_PROP and SIS_UTIL - and
comparing the nr values suggested by SIS_UTIL when SIS_PROP allowed
searching the entire LLC.
>> All the numbers for the above analysis were gathered in NPS1 mode.
>>
> I'm thinking of taking nr_llc into consideration to adjust the search depth,
> something like:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dd52fc5a034b..39b914599dce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9302,6 +9302,9 @@ static inline void update_idle_cpu_scan(struct lb_env *env,
>         llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
>         nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
>         nr_scan = max(nr_scan, 0);
> +       if (nr_llc <= 16 && nr_scan)
> +               nr_scan = nr_llc;
> +
This will behave closer to the initial RFC on systems with smaller LLCs.
I can do some preliminary testing with this and get back to you.
>         WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
>  }
>
> I'll offline the CPUs to make it 16 CPUs per LLC, and check what hackbench behaves.
Thank you for looking into this.

--
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-04-28 18:24 [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg Chen Yu
                   ` (2 preceding siblings ...)
  2022-05-13  6:37 ` K Prateek Nayak
@ 2022-05-17 12:50 ` Mel Gorman
  2022-05-19 17:36   ` Chen Yu
  3 siblings, 1 reply; 12+ messages in thread
From: Mel Gorman @ 2022-05-17 12:50 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Vincent Guittot, Yicong Yang, K Prateek Nayak,
	Tim Chen, Chen Yu, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Barry Song, Srikar Dronamraju, Len Brown,
	Ben Segall, Aubrey Li, Abel Wu, Zhang Rui, linux-kernel,
	Daniel Bristot de Oliveira

On Fri, Apr 29, 2022 at 02:24:42AM +0800, Chen Yu wrote:
> Introduce the quadratic function:
> 
> y = a - bx^2
> 
> x is the sum_util ratio [0, 100] of this LLC domain, and y is the percentage
> of CPUs to be scanned in the LLC domain. The number of CPUs to search drops
> as sum_util increases. When sum_util hits 85% or above, the scan stops.
> Choosing 85% is because it is the threshold of an overloaded LLC sched group
> (imbalance_pct = 117). Choosing quadratic function is because:
> 
> [1] Compared to the linear function, it scans more aggressively when the
>     sum_util is low.
> [2] Compared to the exponential function, it is easier to calculate.
> [3] It seems that there is no accurate mapping between the sum of util_avg
>     and the number of CPUs to be scanned. Use heuristic scan for now.
> 
> The steps to calculate scan_nr are as follows:
> [1] scan_percent = 100 - (x/8.5)^2
>     when utilization reaches 85%, scan_percent becomes 0.
> [2] scan_nr = nr_llc * scan_percent / 100
> [3] scan_nr = max(scan_nr, 0)
> 
> For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
> sum_util%   0    5   15   25  35  45  55   65   75   85 ...
> scan_nr   112  112  108  103  92  80  64   47   24    0 ...
> 
> Furthermore, to minimize the overhead of calculating the metrics in
> select_idle_cpu(), borrow the statistics from periodic load balance.
> As mentioned by Abel, on a platform with 112 CPUs per LLC, the
> sum_util calculated by periodic load balance after 112ms would decay
> to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in
> reflecting the latest utilization. But it is a trade-off.
> Checking the util_avg in newidle load balance would be more frequent,
> but it brings overhead - multiple CPUs write/read the per-LLC shared
> variable and introduces cache false sharing. And Tim also mentioned
> that, it is allowed to be non-optimal in terms of scheduling for the
> short term variations, but if there is a long term trend in the load
> behavior, the scheduler can adjust for that.
> 

Seems fair, other than that the 85% is hardcoded based on an imbalance_pct of
117. If that value ever changes, the two will drift apart.

> SIS_UTIL is disabled by default. When it is enabled, the select_idle_cpu()
> will use the nr_scan calculated by SIS_UTIL instead of the one from
> SIS_PROP. Later SIS_UTIL and SIS_PROP could be made mutually exclusive.
> 

I agree with Peter, this should be enabled by default. I am somewhat
embarrassed that I initially queued this patch blindly for testing when it
was mailed, thinking "I would like to have some results before reviewing",
and then when I sit down to do the review, the results are all useless
because the feature was disabled. My initial thinking starting the review
was "Weird, none of these results are conclusive in any way."

I don't think you need to explicitly check for both being enabled given
that it's a sched_feat. Someone poking in there is meant to be debugging
something but the vast majority of people will never go looking.

> Limitations:
> [1]
> This patch is based on the util_avg, which is very sensitive to the CPU
> frequency invariance. The util_avg would decay quite fast when the
> CPU is idle, if the max frequency has been limited by the user.
> Patch [3] should be applied if turbo is disabled manually on Intel
> platforms.
> 

It is worth mentioning in the changelog if there is a patch that this
implicitly depends upon. It affects the ordering patches should be merged
or bisections for a regression may unfairly identify your patch as the
source of the problem.

At least then if they merge in the wrong order there will something
obvious to check.

> [2]
> There may be unbalanced tasks among CPUs due to CPU affinity. For example,
> suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to
> CPU0~CPU6, while CPU7 is idle:
> 
>           CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
> util_avg  1024    1024    1024    1024    1024    1024    1024    0
> 
> Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%,
> select_idle_cpu() will not scan, thus CPU7 is undetected.
> 
> A possible workaround to mitigate this problem is that the nr_scan should
> be increased if idle CPUs are detected during periodic load balance. And the
> problem could be mitigated by idle load balance that, CPU7 might pull some
> tasks on it.
> 

While a valid concern, it's no worse than what is there now and I think
this case is unlikely. It could naturally happen if there was 6 busy tasks
running entirely in userspace which is harmless. Normal load balancing
would use CPU7 if there was any stacking on CPU0 to CPU6 or NEWLY_IDLE
ILB on CPU7 would pull something if there was any other activity on CPU7.

> [3]
> Prateek mentioned that we should scan aggressively in an LLC domain
> with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is
> negligible. The current patch aims to propose a generic solution and only
> considers the util_avg. A follow-up change could enhance the scan policy
> to adjust the scan_percent according to the CPU number in LLC.
> 

Yes, anything along those lines is a separate patch.

> v2->v3:
>  - Use 85% as the threshold again, because the CPU frequency invariance issue
>    has been fixed and the patch is queued for 5.19.
> 

Note in the changelog exactly what this fix is in case the patches go in the
wrong order.

>  - Stop the scan if 85% is reached, rather than scanning for at least 4 CPUs.
>    According to the feedback from Yicong, it might be better to stop scanning
>    entirely when the LLC is overloaded.
> 

I think only workloads like hackbench benefit from "search 4 CPUs"
heuristic because the machine is heavily overloaded with tasks that are
rapidly idling. Be wary of a patch that sets it back to 4 with hackbench
being the only justification.

> Link: https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net #1
> Link: https://lore.kernel.org/lkml/20220207135253.GF23216@worktop.programming.kicks-ass.net #2
> Link: https://lore.kernel.org/lkml/20220407234258.569681-1-yu.c.chen@intel.com #3
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 56 ++++++++++++++++++++++++++++++++++
>  kernel/sched/features.h        |  1 +
>  3 files changed, 58 insertions(+)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 56cffe42abbc..816df6cc444e 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -81,6 +81,7 @@ struct sched_domain_shared {
>  	atomic_t	ref;
>  	atomic_t	nr_busy_cpus;
>  	int		has_idle_cores;
> +	int		nr_idle_scan;
>  };
>  
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 23c7d0f617ee..50c9d5b2b338 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6327,6 +6327,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  {
>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
> +	struct sched_domain_shared *sd_share;
>  	struct rq *this_rq = this_rq();
>  	int this = smp_processor_id();
>  	struct sched_domain *this_sd;
> @@ -6366,6 +6367,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  		time = cpu_clock(this);
>  	}
>  
> +	if (sched_feat(SIS_UTIL)) {
> +		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
> +		if (sd_share) {
> +			/* because !--nr is the condition to stop scan */
> +			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> +			/* overloaded LLC is unlikely to have idle cpu/core */
> +			if (nr == 1)
> +				return -1;
> +		}
> +	}
> +
>  	for_each_cpu_wrap(cpu, cpus, target + 1) {
>  		if (has_idle_core) {
>  			i = select_idle_core(p, cpu, cpus, &idle_cpu);
> @@ -9267,6 +9279,46 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  	return idlest;
>  }
>  
> +static inline void update_idle_cpu_scan(struct lb_env *env,
> +					unsigned long sum_util)
> +{

Don't inline. This is a long function with one callsite, the compiler
should be able to figure it out.

> +	struct sched_domain_shared *sd_share;
> +	int nr_scan, nr_llc, llc_util_pct;
> +
> +	if (!sched_feat(SIS_UTIL))
> +		return;

Move the CPU_NEWLY_IDLE check here because it's essentially a free check
and avoids the per_cpu lookup.

Also comment on why CPU_NEWLY_IDLE is ignored. I assume it's because you
are updating sd_shared which is a shared cache line write and
CPU_NEWLY_IDLE can fire way more frequently than periodic load
balancing.

> +	/*
> +	 * Update the number of CPUs to scan in LLC domain, which could
> +	 * be used as a hint in select_idle_cpu(). The update of this hint
> +	 * occurs during periodic load balancing, rather than frequent
> +	 * newidle balance.
> +	 */
> +	nr_llc = per_cpu(sd_llc_size, env->dst_cpu);
> +	if (env->idle == CPU_NEWLY_IDLE ||
> +	    env->sd->span_weight != nr_llc)
> +		return;
> +

This caught me because nr_llc made me think it was "the number
of last level caches in a NUMA node" because that's what it means
in kernel/sched/topology.c. This is the number of CPUs sharing an
LLC, so llc_weight would be appropriate given that it's compared to
span_weight. It's not mandatory for me but it's preferable.

> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
> +	if (!sd_share)
> +		return;
> +
> +	/*
> +	 * The number of CPUs to search drops as sum_util increases, when
> +	 * sum_util hits 85% or above, the scan stops.
> +	 * The reason to choose 85% as the threshold is because this is the
> +	 * imbalance_pct when a LLC sched group is overloaded.
> +	 * let y = 100 - (x/8.5)^2 = 100 - x^2/72
> +	 * y is the percentage of CPUs to be scanned in the LLC
> +	 * domain, x is the ratio of sum_util compared to the
> +	 * CPU capacity, which ranges in [0, 100], thus
> +	 * nr_scan = nr_llc * y / 100
> +	 */
> +	llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
> +	nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
> +	nr_scan = max(nr_scan, 0);

Peter commented on this already.

> +	WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> +}

To avoid excessive unnecessary cache line bounces;

	if (nr_scan != sd_share->nr_idle_scan)
		WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);

I expect the two very common values represent "machine is mostly idle
scan everything" and "machine is heavily overloaded, scan nothing" and
I'm betting the cost of the branch is less than the cost of rewriting
the same values.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-05-17 12:50 ` Mel Gorman
@ 2022-05-19 17:36   ` Chen Yu
  0 siblings, 0 replies; 12+ messages in thread
From: Chen Yu @ 2022-05-19 17:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Vincent Guittot, Yicong Yang, K Prateek Nayak,
	Tim Chen, Chen Yu, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Barry Song, Srikar Dronamraju, Len Brown,
	Ben Segall, Aubrey Li, Abel Wu, Zhang Rui, linux-kernel,
	Daniel Bristot de Oliveira

Hi Mel,
On Tue, May 17, 2022 at 01:50:47PM +0100, Mel Gorman wrote:
> On Fri, Apr 29, 2022 at 02:24:42AM +0800, Chen Yu wrote:
> > Introduce the quadratic function:
> > 
> > y = a - bx^2
> > 
> > x is the sum_util ratio [0, 100] of this LLC domain, and y is the percentage
> > of CPUs to be scanned in the LLC domain. The number of CPUs to search drops
> > as sum_util increases. When sum_util hits 85% or above, the scan stops.
> > Choosing 85% is because it is the threshold of an overloaded LLC sched group
> > (imbalance_pct = 117). Choosing quadratic function is because:
> > 
> > [1] Compared to the linear function, it scans more aggressively when the
> >     sum_util is low.
> > [2] Compared to the exponential function, it is easier to calculate.
> > [3] It seems that there is no accurate mapping between the sum of util_avg
> >     and the number of CPUs to be scanned. Use heuristic scan for now.
> > 
> > The steps to calculate scan_nr are as follows:
> > [1] scan_percent = 100 - (x/8.5)^2
> >     when utilization reaches 85%, scan_percent becomes 0.
> > [2] scan_nr = nr_llc * scan_percent / 100
> > [3] scan_nr = max(scan_nr, 0)
> > 
> > For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
> > sum_util%   0    5   15   25  35  45  55   65   75   85 ...
> > scan_nr   112  112  108  103  92  80  64   47   24    0 ...
> > 
> > Furthermore, to minimize the overhead of calculating the metrics in
> > select_idle_cpu(), borrow the statistics from periodic load balance.
> > As mentioned by Abel, on a platform with 112 CPUs per LLC, the
> > sum_util calculated by periodic load balance after 112ms would decay
> > to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in
> > reflecting the latest utilization. But it is a trade-off.
> > Checking the util_avg in newidle load balance would be more frequent,
> > but it brings overhead - multiple CPUs write/read the per-LLC shared
> > variable and introduces cache false sharing. And Tim also mentioned
> > that, it is allowed to be non-optimal in terms of scheduling for the
> > short term variations, but if there is a long term trend in the load
> > behavior, the scheduler can adjust for that.
> > 
> 
> Seems fair, other than that the 85% is hardcoded based on an imbalance_pct of
> 117. If that value ever changes, the two will drift apart.
> 
OK, I can change the calculation to be based on imbalance_pct rather than
hard-coding it to 85%.
> > SIS_UTIL is disabled by default. When it is enabled, the select_idle_cpu()
> > will use the nr_scan calculated by SIS_UTIL instead of the one from
> > SIS_PROP. Later SIS_UTIL and SIS_PROP could be made mutually exclusive.
> > 
> 
> I agree with Peter, this should be enabled by default. I am somewhat
> embarrassed that I initially queued this patch blindly for testing when it
> was mailed thinking "I would like to have some results before reviewing"
> and then when I sit down to do the review, the results are all useless
> because the feature was disabled. My initial thinking starting the review
> was "Weird, none of these results are conclusive in any way."
> 
Thank you for the review and testing this change, and sorry for the inconvenience.
> I don't think you need to explicitly check for both being enabled given
> that it's a sched_feat. Someone poking in there is meant to be debugging
> something but the vast majority of people will never go looking.
> 
OK, I'll change it to enabled by default.
> > Limitations:
> > [1]
> > This patch is based on the util_avg, which is very sensitive to the CPU
> > frequency invariance. The util_avg would decay quite fast when the
> > CPU is idle, if the max frequency has been limited by the user.
> > Patch [3] should be applied if turbo is disabled manually on Intel
> > platforms.
> > 
> 
> It is worth mentioning in the changelog if there is a patch that this
> implicitly depends upon. It affects the ordering patches should be merged
> or bisections for a regression may unfairly identify your patch as the
> source of the problem.
> 
> At least then if they merge in the wrong order there will something
> obvious to check.
>
OK, I'll keep this information in the changelog. The patch mentioned above
has been queued for 5.19.
> > [2]
> > There may be unbalanced tasks among CPUs due to CPU affinity. For example,
> > suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to
> > CPU0~CPU6, while CPU7 is idle:
> > 
> >           CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
> > util_avg  1024    1024    1024    1024    1024    1024    1024    0
> > 
> > Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%,
> > select_idle_cpu() will not scan, thus CPU7 is undetected.
> > 
> > A possible workaround to mitigate this problem is that the nr_scan should
> > be increased if idle CPUs are detected during periodic load balance. And the
> > problem could be mitigated by idle load balance that, CPU7 might pull some
> > tasks on it.
> > 
> 
> While a valid concern, it's no worse than what is there now and I think
> this case is unlikely. It could naturally happen if there was 6 busy tasks
> running entirely in userspace which is harmless. Normal load balancing
> would use CPU7 if there was any stacking on CPU0 to CPU6 or NEWLY_IDLE
> ILB on CPU7 would pull something if there was any other activity on CPU7.
> 
Yes, agreed.
> > [3]
> > Prateek mentioned that we should scan aggressively in an LLC domain
> > with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is
> > negligible. The current patch aims to propose a generic solution and only
> > considers the util_avg. A follow-up change could enhance the scan policy
> > to adjust the scan_percent according to the CPU number in LLC.
> > 
> 
> Yes, anything along those lines is a separate patch.
> 
OK, I'm working with Prateek on this, to collect test data and make the search
policy friendlier to platforms with a small LLC size.
> > v2->v3:
> >  - Use 85% as the threshold again, because the CPU frequency invariance issue
> >    has been fixed and the patch is queued for 5.19.
> > 
> 
> Note in changelog exactly what this fix is in case the patches go in the
> wrong order.
> 
OK, will do.
> >  - Stop the scan if 85% is reached, rather than scanning for at least 4 CPUs.
> >    According to the feedback from Yicong, it might be better to stop scanning
> >    entirely when the LLC is overloaded.
> > 
> 
> I think only workloads like hackbench benefit from "search 4 CPUs"
> heuristic because the machine is heavily overloaded with tasks that are
> rapidly idling. Be wary of a patch that sets it back to 4 with hackbench
> being the only justification.
> 
Yes, this is also my concern. As Prateek mentioned, in an LLC with 16 CPUs,
when the system is extremely overloaded, there is a slight regression on hackbench:
with this patch applied, SIS gives up scanning CPUs entirely, while SIS_PROP would
still scan at least 4 CPUs.
> >  }
> >
[snip]  
> > +static inline void update_idle_cpu_scan(struct lb_env *env,
> > +					unsigned long sum_util)
> > +{
> 
> Don't inline. This is a long function with one callsite, the compiler
> should be able to figure it out.
> 
OK, will do in next version.
> > +	struct sched_domain_shared *sd_share;
> > +	int nr_scan, nr_llc, llc_util_pct;
> > +
> > +	if (!sched_feat(SIS_UTIL))
> > +		return;
> 
> Move the CPU_NEWLY_IDLE check here because it's essentially a free check
> and avoids the per_cpu lookup.
> 
OK.
> Also comment on why CPU_NEWLY_IDLE is ignored. I assume it's because you
> are updating sd_shared which is a shared cache line write and
> CPU_NEWLY_IDLE can fire way more frequently than periodic load
> balancing.
> 
Yes, I'll add the comment here.
> > +	/*
> > +	 * Update the number of CPUs to scan in LLC domain, which could
> > +	 * be used as a hint in select_idle_cpu(). The update of this hint
> > +	 * occurs during periodic load balancing, rather than frequent
> > +	 * newidle balance.
> > +	 */
> > +	nr_llc = per_cpu(sd_llc_size, env->dst_cpu);
> > +	if (env->idle == CPU_NEWLY_IDLE ||
> > +	    env->sd->span_weight != nr_llc)
> > +		return;
> > +
> 
> This caught me because nr_llc made me think it was "the number
> of last level caches in a NUMA node" because that's what it means
> in kernel/sched/topology.c. This is the number of CPUs sharing an
> LLC to llc_weight would be appropriate given that it's compared to
> span_weight. It's not mandatory for me but it's preferable.
> 
OK, I'll change it to llc_weight.
> > +	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
> > +	if (!sd_share)
> > +		return;
> > +
> > +	/*
> > +	 * The number of CPUs to search drops as sum_util increases, when
> > +	 * sum_util hits 85% or above, the scan stops.
> > +	 * The reason to choose 85% as the threshold is because this is the
> > +	 * imbalance_pct when a LLC sched group is overloaded.
> > +	 * let y = 100 - (x/8.5)^2 = 100 - x^2/72
> > +	 * y is the percentage of CPUs to be scanned in the LLC
> > +	 * domain, x is the ratio of sum_util compared to the
> > +	 * CPU capacity, which ranges in [0, 100], thus
> > +	 * nr_scan = nr_llc * y / 100
> > +	 */
> > +	llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
> > +	nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
> > +	nr_scan = max(nr_scan, 0);
> 
> Peter commented on this already.
> 
> > +	WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> > +}
> 
> To avoid excessive unnecessary cache line bounces;
> 
> 	if (nr_scan != sd_share->nr_idle_scan)
> 		WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> 
> I expect the two very common values represent "machine is mostly idle
> scan everything" and "machine is heavily overloaded, scan nothing" and
> I'm betting the cost of the branch is less than the cost of rewriting
> the same values.
> 
OK, will do in next version.
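Putting the review feedback together, a userspace sketch of what the revised
helper might compute follows. This is only an illustration, not the kernel
implementation: `SCHED_CAPACITY_SCALE` mirrors the kernel's 1024 value, and the
function names are stand-ins.

```c
/* Stand-in for the kernel's SCHED_CAPACITY_SCALE. */
#define SCHED_CAPACITY_SCALE 1024

/*
 * Sketch of the nr_scan calculation discussed above:
 * y = 100 - x^2/72, where x is the LLC utilization percentage,
 * clamped at zero so the scan stops once utilization reaches ~85%.
 */
static int compute_nr_scan(unsigned long sum_util, int llc_weight)
{
	int util_pct = (sum_util * 100) /
		       ((unsigned long)llc_weight * SCHED_CAPACITY_SCALE);
	int nr_scan = (100 - (util_pct * util_pct / 72)) * llc_weight / 100;

	return nr_scan > 0 ? nr_scan : 0;
}

/* Mel's suggestion: skip the shared-cacheline write when unchanged. */
static void update_nr_idle_scan(int *nr_idle_scan, int nr_scan)
{
	if (nr_scan != *nr_idle_scan)
		*nr_idle_scan = nr_scan;	/* WRITE_ONCE() in the kernel */
}
```

Note that with integer arithmetic the CPU0-CPU6 affinity example above
(sum_util = 7 * 1024 on an 8-CPU LLC, i.e. 87.5% utilization) indeed yields
nr_scan = 0, while an idle 112-CPU LLC yields nr_scan = 112.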

thanks,
Chenyu


* Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-05-16 10:52     ` K Prateek Nayak
@ 2022-05-19 18:19       ` Chen Yu
  0 siblings, 0 replies; 12+ messages in thread
From: Chen Yu @ 2022-05-19 18:19 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Vincent Guittot, Mel Gorman, Yicong Yang,
	Tim Chen, Chen Yu, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Barry Song, Srikar Dronamraju, Len Brown,
	Ben Segall, Aubrey Li, Abel Wu, Zhang Rui, linux-kernel,
	Daniel Bristot de Oliveira

On Mon, May 16, 2022 at 04:22:34PM +0530, K Prateek Nayak wrote:
[snip] 
> I've ran the benchmark in two sets of 3 runs rebooting
> in between on each kernel version:
> 
> - tip
> 
> Test:                   tip-r0                  tip-r1                  tip-r2
>  1-groups:         4.64 (0.00 pct)         4.90 (-5.60 pct)        4.99 (-7.54 pct)
>  2-groups:         5.54 (0.00 pct)         5.56 (-0.36 pct)        5.58 (-0.72 pct)
>  4-groups:         6.24 (0.00 pct)         6.18 (0.96 pct)         6.20 (0.64 pct)
>  8-groups:         7.54 (0.00 pct)         7.50 (0.53 pct)         7.54 (0.00 pct)
> 16-groups:        10.85 (0.00 pct)        11.17 (-2.94 pct)       10.91 (-0.55 pct)
> 
> Test:                   tip-r3                  tip-r4                  tip-r5
>  1-groups:         4.68 (0.00 pct)         4.97 (-6.19 pct)        4.98 (-6.41 pct)
>  2-groups:         5.60 (0.00 pct)         5.62 (-0.35 pct)        5.66 (-1.07 pct)
>  4-groups:         6.24 (0.00 pct)         6.23 (0.16 pct)         6.24 (0.00 pct)
>  8-groups:         7.54 (0.00 pct)         7.50 (0.53 pct)         7.46 (1.06 pct)
> 16-groups:        10.81 (0.00 pct)        10.84 (-0.27 pct)       10.81 (0.00 pct)
> 
> - SIS_UTIL
> 
> 
> Test:                SIS_UTIL-r0              SIS_UTIL-r1             SIS_UTIL-r2
>  1-groups:         4.68 (0.00 pct)         5.03 (-7.47 pct)        4.96 (-5.98 pct)
>  2-groups:         5.45 (0.00 pct)         5.48 (-0.55 pct)        5.50 (-0.91 pct)
>  4-groups:         6.10 (0.00 pct)         6.07 (0.49 pct)         6.14 (-0.65 pct)
>  8-groups:         7.52 (0.00 pct)         7.51 (0.13 pct)         7.52 (0.00 pct)
> 16-groups:        11.63 (0.00 pct)        11.48 (1.28 pct)        11.51 (1.03 pct)
> 
> Test:                SIS_UTIL-r3              SIS_UTIL-r4             SIS_UTIL-r5
>  1-groups:         4.80 (0.00 pct)         5.00 (-4.16 pct)        5.06 (-5.41 pct)
>  2-groups:         5.51 (0.00 pct)         5.58 (-1.27 pct)        5.58 (-1.27 pct)
>  4-groups:         6.14 (0.00 pct)         6.11 (0.48 pct)         6.06 (1.30 pct)
>  8-groups:         7.35 (0.00 pct)         7.38 (-0.40 pct)        7.40 (-0.68 pct)
> 16-groups:        11.03 (0.00 pct)        11.29 (-2.35 pct)       11.14 (-0.99 pct)
> 
> - Comparing the best and bad data points for 16-groups with each
> kernel version:
> 
> Test:                   tip-good             SIS_UTIL-good
>  1-groups:         4.68 (0.00 pct)         4.80 (-2.56 pct)
>  2-groups:         5.60 (0.00 pct)         5.51 (1.60 pct)
>  4-groups:         6.24 (0.00 pct)         6.14 (1.60 pct)
>  8-groups:         7.54 (0.00 pct)         7.35 (2.51 pct)
> 16-groups:        10.81 (0.00 pct)        11.03 (-2.03 pct)
> 
> Test:                   tip-good             SIS_UTIL-bad
>  1-groups:         4.68 (0.00 pct)         4.68 (0.00 pct)
>  2-groups:         5.60 (0.00 pct)         5.45 (2.67 pct)
>  4-groups:         6.24 (0.00 pct)         6.10 (2.24 pct)
>  8-groups:         7.54 (0.00 pct)         7.52 (0.26 pct)
> 16-groups:        10.81 (0.00 pct)        11.63 (-7.58 pct)
> 
> Test:                   tip-bad             SIS_UTIL-good
>  1-groups:         4.90 (0.00 pct)         4.80 (2.04 pct)
>  2-groups:         5.56 (0.00 pct)         5.51 (0.89 pct)
>  4-groups:         6.18 (0.00 pct)         6.14 (0.64 pct)
>  8-groups:         7.50 (0.00 pct)         7.35 (2.00 pct)
> 16-groups:        11.17 (0.00 pct)        11.03 (1.25 pct)
> 
> Test:                   tip-bad             SIS_UTIL-bad
>  1-groups:         4.90 (0.00 pct)         4.68 (4.48 pct)
>  2-groups:         5.56 (0.00 pct)         5.45 (1.97 pct)
>  4-groups:         6.18 (0.00 pct)         6.10 (1.29 pct)
>  8-groups:         7.50 (0.00 pct)         7.52 (-0.26 pct)
> 16-groups:        11.17 (0.00 pct)        11.63 (-4.11 pct)
> 
> Hackbench consistently reports > 11 for 16-group
> case with SIS_UTIL however only once with SIS_PROP
>
Mel has mentioned that, although it is 'overloaded', the nature of hackbench
benefits from scanning more CPUs: the frequent context switches mean there is a
higher chance of finding an idle CPU.
> > I'm thinking of taking nr_llc number into consideration to adjust the search depth,
> > something like:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index dd52fc5a034b..39b914599dce 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9302,6 +9302,9 @@ static inline void update_idle_cpu_scan(struct lb_env *env,
> >         llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
> >         nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
> >         nr_scan = max(nr_scan, 0);
> > +       if (nr_llc <= 16 && nr_scan)
> > +               nr_scan = nr_llc;
> > +
> This will behave closer to the initial RFC on systems with smaller LLC.
> I can do some preliminary testing with this and get back to you.
> >         WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> >  }
> >
> > I'll offline the CPUs to make it 16 CPUs per LLC, and check what hackbench behaves.
> Thank you for looking into this.
>
OK, I've done some tests and recorded the number of CPUs that SIS_PROP and
SIS_UTIL want to scan (the nr_scan value).

1. 16 CPUs online, 16 groups of hackbench (total 64 threads)
   Most of the time SIS_PROP would scan 4 CPUs, while SIS_UTIL scans 2 CPUs.

2. 112 CPUs online, 16 groups of hackbench (total 448 threads)
   Most of the time SIS_PROP would scan 4 CPUs, but a small fraction of the time
   SIS_PROP would scan the entire LLC, which could be costly.
   The number of CPUs scanned by SIS_UTIL ranges within [2, 20] and resembles a
   normal distribution.

(I'll send you the plot picture offline so you can see what these data
look like.)

This means that, for the 16 CPUs case, extremely overloaded hackbench would
benefit from scanning more CPUs. But for the 112 CPUs platform, using SIS_UTIL
seems more reasonable because it does not 'jitter' into scanning the whole LLC.

Let me revise the patch according to Peter's and Mel's suggestions, and send
a v4 (then cook a complementary patch to deal with systems having 16
CPUs per LLC domain).
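For reference, the complementary small-LLC adjustment being discussed could be
sketched as below. This is only an illustration of the tweak from the diff
earlier in the thread; the function name is a stand-in and the 16-CPU cutoff is
the provisional value under test, not a settled threshold.

```c
/*
 * Sketch of the small-LLC adjustment: on an LLC with at most 16 CPUs
 * the cost of scanning everything is negligible, so whenever any scan
 * is warranted at all, scan the whole LLC. When utilization pushes
 * nr_scan to zero, still scan nothing.
 */
static int adjust_nr_scan_for_small_llc(int nr_scan, int llc_weight)
{
	if (llc_weight <= 16 && nr_scan)
		return llc_weight;
	return nr_scan;
}
```

With this, a 16-CPU LLC where the formula only asks for 2 CPUs would scan all
16, while a 112-CPU LLC keeps the formula's result unchanged.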

thanks,
Chenyu
> --
> Thanks and Regards,
> Prateek
> 


end of thread, other threads:[~2022-05-19 18:19 UTC | newest]

Thread overview: 12+ messages
2022-04-28 18:24 [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg Chen Yu
2022-05-06  8:38 ` Peter Zijlstra
2022-05-07 14:44   ` Chen Yu
2022-05-10 12:41 ` Yicong Yang
2022-05-12  8:14   ` Chen Yu
2022-05-12 13:10     ` Yicong Yang
2022-05-13  6:37 ` K Prateek Nayak
2022-05-14 10:55   ` Chen Yu
2022-05-16 10:52     ` K Prateek Nayak
2022-05-19 18:19       ` Chen Yu
2022-05-17 12:50 ` Mel Gorman
2022-05-19 17:36   ` Chen Yu
