* [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
@ 2022-06-12 16:34 Chen Yu
  2022-06-13  7:40 ` Yicong Yang
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Chen Yu @ 2022-06-12 16:34 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot, Mel Gorman
  Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Barry Song, Srikar Dronamraju, Len Brown, Ben Segall, Aubrey Li,
	Abel Wu, Daniel Bristot de Oliveira, Tim Chen, linux-kernel,
	Yicong Yang, K Prateek Nayak, Chen Yu, Mohini Narkhede

[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU,
when the system is overloaded.

The following histogram is the time spent in select_idle_cpu(),
when running 224 instances of netperf on a system with 112 CPUs
per LLC domain:

@usecs:
[0]                  533 |                                                    |
[1]                 5495 |                                                    |
[2, 4)             12008 |                                                    |
[4, 8)            239252 |                                                    |
[8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
[16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
[512, 1K)        2600472 |@@@@@@@@@                                           |
[1K, 2K)          927912 |@@@                                                 |
[2K, 4K)          218720 |                                                    |
[4K, 8K)           98161 |                                                    |
[8K, 16K)          37722 |                                                    |
[16K, 32K)          6715 |                                                    |
[32K, 64K)           477 |                                                    |
[64K, 128K)            7 |                                                    |

netperf latency usecs:
=======
case            	load    	    Lat_99th	    std%
TCP_RR          	thread-224	      257.39	(  0.21)

The time spent in select_idle_cpu() is visible to netperf and might
negatively impact its latency.

[Symptom analysis]
The patch [1] from Mel Gorman has been applied to track the efficiency
of select_idle_sibling. The indicators are copied here:

SIS Search Efficiency(se_eff%):
        A ratio expressed as a percentage of runqueues scanned versus
        idle CPUs found. A 100% efficiency indicates that the target,
        prev or recent CPU of a task was idle at wakeup. The lower the
        efficiency, the more runqueues were scanned before an idle CPU
        was found.

SIS Domain Search Efficiency(dom_eff%):
        Similar, except only for the slower SIS path.

SIS Fast Success Rate(fast_rate%):
        Percentage of SIS that used target, prev or recent CPUs.

SIS Success rate(success_rate%):
        Percentage of scans that found an idle CPU.

The test is based on Aubrey's schedtests tool, including netperf, hackbench,
schbench and tbench.

Test on vanilla kernel:
schedstat_parse.py -f netperf_vanilla.log
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
TCP_RR	   28 threads	     99.978	      18.535	      99.995	     100.000
TCP_RR	   56 threads	     99.397	       5.671	      99.964	     100.000
TCP_RR	   84 threads	     21.721	       6.818	      73.632	     100.000
TCP_RR	  112 threads	     12.500	       5.533	      59.000	     100.000
TCP_RR	  140 threads	      8.524	       4.535	      49.020	     100.000
TCP_RR	  168 threads	      6.438	       3.945	      40.309	      99.999
TCP_RR	  196 threads	      5.397	       3.718	      32.320	      99.982
TCP_RR	  224 threads	      4.874	       3.661	      25.775	      99.767
UDP_RR	   28 threads	     99.988	      17.704	      99.997	     100.000
UDP_RR	   56 threads	     99.528	       5.977	      99.970	     100.000
UDP_RR	   84 threads	     24.219	       6.992	      76.479	     100.000
UDP_RR	  112 threads	     13.907	       5.706	      62.538	     100.000
UDP_RR	  140 threads	      9.408	       4.699	      52.519	     100.000
UDP_RR	  168 threads	      7.095	       4.077	      44.352	     100.000
UDP_RR	  196 threads	      5.757	       3.775	      35.764	      99.991
UDP_RR	  224 threads	      5.124	       3.704	      28.748	      99.860

schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
normal	   1   mthread	     99.152	       6.400	      99.941	     100.000
normal	   2   mthreads	     97.844	       4.003	      99.908	     100.000
normal	   3   mthreads	     96.395	       2.118	      99.917	      99.998
normal	   4   mthreads	     55.288	       1.451	      98.615	      99.804
normal	   5   mthreads	      7.004	       1.870	      45.597	      61.036
normal	   6   mthreads	      3.354	       1.346	      20.777	      34.230
normal	   7   mthreads	      2.183	       1.028	      11.257	      21.055
normal	   8   mthreads	      1.653	       0.825	       7.849	      15.549

schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case			load	        se_eff%	    dom_eff%	  fast_rate%	success_rate%
process-pipe	     1 group	         99.991	       7.692	      99.999	     100.000
process-pipe	    2 groups	         99.934	       4.615	      99.997	     100.000
process-pipe	    3 groups	         99.597	       3.198	      99.987	     100.000
process-pipe	    4 groups	         98.378	       2.464	      99.958	     100.000
process-pipe	    5 groups	         27.474	       3.653	      89.811	      99.800
process-pipe	    6 groups	         20.201	       4.098	      82.763	      99.570
process-pipe	    7 groups	         16.423	       4.156	      77.398	      99.316
process-pipe	    8 groups	         13.165	       3.920	      72.232	      98.828
process-sockets	     1 group	         99.977	       5.882	      99.999	     100.000
process-sockets	    2 groups	         99.927	       5.505	      99.996	     100.000
process-sockets	    3 groups	         99.397	       3.250	      99.980	     100.000
process-sockets	    4 groups	         79.680	       4.258	      98.864	      99.998
process-sockets	    5 groups	          7.673	       2.503	      63.659	      92.115
process-sockets	    6 groups	          4.642	       1.584	      58.946	      88.048
process-sockets	    7 groups	          3.493	       1.379	      49.816	      81.164
process-sockets	    8 groups	          3.015	       1.407	      40.845	      75.500
threads-pipe	     1 group	         99.997	       0.000	     100.000	     100.000
threads-pipe	    2 groups	         99.894	       2.932	      99.997	     100.000
threads-pipe	    3 groups	         99.611	       4.117	      99.983	     100.000
threads-pipe	    4 groups	         97.703	       2.624	      99.937	     100.000
threads-pipe	    5 groups	         22.919	       3.623	      87.150	      99.764
threads-pipe	    6 groups	         18.016	       4.038	      80.491	      99.557
threads-pipe	    7 groups	         14.663	       3.991	      75.239	      99.247
threads-pipe	    8 groups	         12.242	       3.808	      70.651	      98.644
threads-sockets	     1 group	         99.990	       6.667	      99.999	     100.000
threads-sockets	    2 groups	         99.940	       5.114	      99.997	     100.000
threads-sockets	    3 groups	         99.469	       4.115	      99.977	     100.000
threads-sockets	    4 groups	         87.528	       4.038	      99.400	     100.000
threads-sockets	    5 groups	          6.942	       2.398	      59.244	      88.337
threads-sockets	    6 groups	          4.359	       1.954	      49.448	      87.860
threads-sockets	    7 groups	          2.845	       1.345	      41.198	      77.102
threads-sockets	    8 groups	          2.871	       1.404	      38.512	      74.312

schedstat_parse.py -f tbench_vanilla.log
case			load	      se_eff%	    dom_eff%	  fast_rate%	success_rate%
loopback	  28 threads	       99.976	      18.369	      99.995	     100.000
loopback	  56 threads	       99.222	       7.799	      99.934	     100.000
loopback	  84 threads	       19.723	       6.819	      70.215	     100.000
loopback	 112 threads	       11.283	       5.371	      55.371	      99.999
loopback	 140 threads	        0.000	       0.000	       0.000	       0.000
loopback	 168 threads	        0.000	       0.000	       0.000	       0.000
loopback	 196 threads	        0.000	       0.000	       0.000	       0.000
loopback	 224 threads	        0.000	       0.000	       0.000	       0.000

According to the test above, as the system becomes busy, the
SIS Search Efficiency(se_eff%) drops significantly. Although some
benchmarks eventually find an idle CPU(success_rate% = 100%), it is
doubtful whether scanning the whole LLC domain is worth the cost.

[Proposal]
It would be ideal to have a crystal ball to answer this question:
how many CPUs must a wakeup path walk down before it can find an idle
CPU? Many potential metrics could be used to predict this number.
One candidate is the sum of util_avg in the LLC domain. The benefit
of choosing util_avg is that it is a metric of accumulated historic
activity, which makes it smoother than instantaneous metrics (such as
rq->nr_running). Besides, the sum of util_avg helps predict the load
of the LLC domain more precisely than SIS_PROP, which uses one CPU's
idle time to estimate the idle time of the whole LLC domain.

In summary, the lower the util_avg is, the more CPUs select_idle_cpu()
should scan, and vice versa. When the sum of util_avg in the LLC domain
reaches 85% or above, the scan stops. The reason to choose 85% as the
threshold is that it corresponds to the imbalance_pct(117) used when an
LLC sched group is considered overloaded: 100 / 117 is roughly 85%.

Introduce the quadratic function:

y = SCHED_CAPACITY_SCALE - p * x^2
and y'= y / SCHED_CAPACITY_SCALE

x is the ratio of sum_util compared to the CPU capacity:
x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
y' is the ratio of CPUs to be scanned in the LLC domain,
and the number of CPUs to scan is calculated by:

nr_scan = llc_weight * y'

A quadratic function was chosen because:
[1] Compared to a linear function, it scans more aggressively when the
    sum_util is low.
[2] Compared to an exponential function, it is easier to calculate.
[3] There seems to be no accurate mapping between the sum of util_avg
    and the number of CPUs to be scanned, so use a heuristic scan for now.

For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
scan_nr   112  111  108  102  93  81  65   47   25    1    0 ...

For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
scan_nr    16   15   15   14  13  11   9    6    3    0    0 ...
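
The scan_nr tables above can be reproduced with a small standalone
user-space sketch of the same integer math (an illustration only, not
kernel code: the helper nr_scan() and the hardcoded imbalance_pct of
117 are stand-ins, and plain division replaces the kernel's do_div()):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024ULL

/* Compute the number of CPUs to scan from the quadratic function above. */
static int nr_scan(unsigned long long sum_util, int llc_weight, int pct)
{
	unsigned long long x, tmp, y;

	x = sum_util / llc_weight;	/* x scaled by SCHED_CAPACITY_SCALE */
	tmp = x * x * pct * pct;
	tmp /= 10000 * SCHED_CAPACITY_SCALE;
	if (tmp > SCHED_CAPACITY_SCALE)	/* clamp so y does not go negative */
		tmp = SCHED_CAPACITY_SCALE;
	y = SCHED_CAPACITY_SCALE - tmp;	/* y = SCHED_CAPACITY_SCALE - p * x^2 */

	return (int)(y * llc_weight / SCHED_CAPACITY_SCALE);	/* llc_weight * y' */
}

int main(void)
{
	static const int utils[] = { 0, 5, 15, 25, 35, 45, 55, 65, 75, 85, 86 };
	int i;

	for (i = 0; i < 11; i++) {
		/* sum_util at a given percentage of total LLC capacity */
		unsigned long long sum_util =
			utils[i] * 112 * SCHED_CAPACITY_SCALE / 100;
		printf("sum_util%% %2d -> scan_nr %d\n",
		       utils[i], nr_scan(sum_util, 112, 117));
	}
	return 0;
}

Running it prints 112, 111, 108, 102, 93, 81, 65, 47, 25, 1, 0, which
matches the first table above.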

Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance.
As mentioned by Abel, on a platform with 112 CPUs per LLC, the
sum_util calculated by periodic load balance after 112 ms would
decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus introducing a delay
in reflecting the latest utilization. But this is a trade-off:
checking util_avg in newidle load balance would be more frequent,
but it brings overhead - multiple CPUs write/read the per-LLC shared
variable, introducing cache contention. Tim also mentioned that it is
acceptable to be non-optimal in terms of scheduling for short-term
variations; if there is a long-term trend in the load behavior, the
scheduler can adjust for that.
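
As a quick sanity check of the 8.75% figure (assuming PELT's ~32 ms
half-life, so 112 ms covers three and a half halving periods; link
with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* 0.5^(112/32) = 0.5 * 0.5 * 0.5 * 0.5^0.5 ~= 0.0884, i.e. close
	 * to the 8.75% estimate above (0.7 approximates 0.5^0.5). */
	printf("%.4f\n", pow(0.5, 112.0 / 32.0));
	return 0;
}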

When SIS_UTIL is enabled, select_idle_cpu() uses the nr_scan
calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
Mel suggested, SIS_UTIL is enabled by default.
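
The interaction between the nr_idle_scan hint and the scan can be
illustrated with a standalone sketch (bounded_scan() and the idle flags
are hypothetical stand-ins; the real loop in select_idle_cpu(), shown
in the diff below, walks the LLC cpumask and also handles the
has_idle_core case):

#include <stdio.h>

/*
 * Toy model of the bounded wakeup scan: return the first idle CPU
 * found within the scan budget, or -1 if the budget runs out first.
 */
static int bounded_scan(const int *idle, int ncpus, int nr_idle_scan)
{
	int nr = nr_idle_scan + 1;	/* +1 because !--nr stops the scan */
	int cpu;

	if (nr == 1)			/* overloaded LLC: skip the scan */
		return -1;

	for (cpu = 0; cpu < ncpus; cpu++) {
		if (!--nr)		/* scan budget exhausted */
			return -1;
		if (idle[cpu])		/* stand-in for __select_idle_cpu() */
			return cpu;
	}
	return -1;
}

int main(void)
{
	int idle[8] = { 0, 0, 0, 0, 0, 0, 0, 1 };

	printf("%d\n", bounded_scan(idle, 8, 4));	/* -1: gives up after 4 CPUs */
	printf("%d\n", bounded_scan(idle, 8, 8));	/* 7: budget reaches CPU7 */
	return 0;
}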

This patch is based on util_avg, which is very sensitive to
CPU frequency invariance. There is an issue that, when the max frequency
has been clamped, util_avg decays unreasonably fast while the CPU is
idle. Commit addca285120b ("cpufreq: intel_pstate: Handle no_turbo
in frequency invariance") mitigates this symptom by adjusting
arch_max_freq_ratio when turbo is disabled. But the issue is still
not thoroughly fixed, because the current code is unaware of the
user-specified max CPU frequency.

[Test result]

netperf and tbench were launched with 25%, 50%, 75%, 100%, 125%, 150%,
175% and 200% of the CPU number respectively. hackbench and schbench
were launched with 1, 2, 4 and 8 groups. Each test lasts for 100 seconds
and repeats 3 times.

The following is the benchmark result comparison between
baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare%
indicates better performance.

Each netperf test is a:
netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100
netperf.throughput
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	28 threads	 1.00 (  0.34)	 -0.16 (  0.40)
TCP_RR          	56 threads	 1.00 (  0.19)	 -0.02 (  0.20)
TCP_RR          	84 threads	 1.00 (  0.39)	 -0.47 (  0.40)
TCP_RR          	112 threads	 1.00 (  0.21)	 -0.66 (  0.22)
TCP_RR          	140 threads	 1.00 (  0.19)	 -0.69 (  0.19)
TCP_RR          	168 threads	 1.00 (  0.18)	 -0.48 (  0.18)
TCP_RR          	196 threads	 1.00 (  0.16)	+194.70 ( 16.43)
TCP_RR          	224 threads	 1.00 (  0.16)	+197.30 (  7.85)
UDP_RR          	28 threads	 1.00 (  0.37)	 +0.35 (  0.33)
UDP_RR          	56 threads	 1.00 ( 11.18)	 -0.32 (  0.21)
UDP_RR          	84 threads	 1.00 (  1.46)	 -0.98 (  0.32)
UDP_RR          	112 threads	 1.00 ( 28.85)	 -2.48 ( 19.61)
UDP_RR          	140 threads	 1.00 (  0.70)	 -0.71 ( 14.04)
UDP_RR          	168 threads	 1.00 ( 14.33)	 -0.26 ( 11.16)
UDP_RR          	196 threads	 1.00 ( 12.92)	+186.92 ( 20.93)
UDP_RR          	224 threads	 1.00 ( 11.74)	+196.79 ( 18.62)

Taking 224 threads as an example, the changes in the SIS search metrics
are illustrated below:

    vanilla                    patched
   4544492          +237.5%   15338634        sched_debug.cpu.sis_domain_search.avg
     38539        +39686.8%   15333634        sched_debug.cpu.sis_failed.avg
  128300000          -87.9%   15551326        sched_debug.cpu.sis_scanned.avg
   5842896          +162.7%   15347978        sched_debug.cpu.sis_search.avg

There are 87.9% fewer CPU scans after the patch, which indicates lower
overhead. Besides, with this patch applied, there is 13% less rq lock
contention in
perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function.
This might help explain the performance improvement: the patch allows
the waking task to remain on its previous CPU rather than grabbing
other CPUs' locks.

Each hackbench test is a:
hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100
hackbench.throughput
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1 group 	 1.00 (  1.29)	 +0.57 (  0.47)
process-pipe    	2 groups 	 1.00 (  0.27)	 +0.77 (  0.81)
process-pipe    	4 groups 	 1.00 (  0.26)	 +1.17 (  0.02)
process-pipe    	8 groups 	 1.00 (  0.15)	 -4.79 (  0.02)
process-sockets 	1 group 	 1.00 (  0.63)	 -0.92 (  0.13)
process-sockets 	2 groups 	 1.00 (  0.03)	 -0.83 (  0.14)
process-sockets 	4 groups 	 1.00 (  0.40)	 +5.20 (  0.26)
process-sockets 	8 groups 	 1.00 (  0.04)	 +3.52 (  0.03)
threads-pipe    	1 group 	 1.00 (  1.28)	 +0.07 (  0.14)
threads-pipe    	2 groups 	 1.00 (  0.22)	 -0.49 (  0.74)
threads-pipe    	4 groups 	 1.00 (  0.05)	 +1.88 (  0.13)
threads-pipe    	8 groups 	 1.00 (  0.09)	 -4.90 (  0.06)
threads-sockets 	1 group 	 1.00 (  0.25)	 -0.70 (  0.53)
threads-sockets 	2 groups 	 1.00 (  0.10)	 -0.63 (  0.26)
threads-sockets 	4 groups 	 1.00 (  0.19)	+11.92 (  0.24)
threads-sockets 	8 groups 	 1.00 (  0.08)	 +4.31 (  0.11)

Each tbench test is a:
tbench -t 100 $job 127.0.0.1
tbench.throughput
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	28 threads	 1.00 (  0.06)	 -0.14 (  0.09)
loopback        	56 threads	 1.00 (  0.03)	 -0.04 (  0.17)
loopback        	84 threads	 1.00 (  0.05)	 +0.36 (  0.13)
loopback        	112 threads	 1.00 (  0.03)	 +0.51 (  0.03)
loopback        	140 threads	 1.00 (  0.02)	 -1.67 (  0.19)
loopback        	168 threads	 1.00 (  0.38)	 +1.27 (  0.27)
loopback        	196 threads	 1.00 (  0.11)	 +1.34 (  0.17)
loopback        	224 threads	 1.00 (  0.11)	 +1.67 (  0.22)

Each schbench test is a:
schbench -m $job -t 28 -r 100 -s 30000 -c 30000
schbench.latency_90%_us
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1 mthread	 1.00 ( 31.22)	 -7.36 ( 20.25)*
normal          	2 mthreads	 1.00 (  2.45)	 -0.48 (  1.79)
normal          	4 mthreads	 1.00 (  1.69)	 +0.45 (  0.64)
normal          	8 mthreads	 1.00 (  5.47)	 +9.81 ( 14.28)

*Considering the standard deviation, this -7.36% regression might not
be valid.

Also, an OLTP workload with a commercial RDBMS has been tested, and
there is no significant change.

There were concerns that unbalanced tasks among CPUs would cause problems.
For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are
bound to CPU0~CPU6, while CPU7 is idle:

          CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
util_avg  1024    1024    1024    1024    1024    1024    1024    0

Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%,
select_idle_cpu() will not scan, and thus CPU7 goes undetected.
But according to Mel, CPU7 is unlikely to stay idle all the time,
because it could pull some tasks via CPU_NEWLY_IDLE balancing.

lkp (kernel test robot) has reported a regression on stress-ng.sock on
a very busy system. According to the sched_debug statistics, it might
be caused by SIS_UTIL terminating the scan and choosing the previous
CPU earlier, which might introduce more context switches, especially
involuntary preemption, and that hurts a busy stress-ng. This
regression shows that not every benchmark in every scenario benefits
from limiting the idle CPU scan, and it needs further investigation.

Besides, there is a slight regression in hackbench's 16 groups case
when the LLC domain has 16 CPUs. Prateek mentioned that we should scan
aggressively in an LLC domain with 16 CPUs, because the cost of
searching for an idle one among 16 CPUs is negligible. The current
patch aims to be a generic solution and only considers util_avg.
Something like the below could be applied on top of the current patch
to fulfill the requirement:

	if (llc_weight <= 16)
		nr_scan = nr_scan * 32 / llc_weight;

For an LLC domain with 16 CPUs, nr_scan is doubled. The fewer CPUs the
LLC domain has, the more nr_scan is expanded. This needs further
investigation.

There is also ongoing work[2] from Abel to filter out the busy CPUs
during wakeup, to further speed up the idle CPU scan. It could be a
follow-up optimization on top of this change.

v3->v4:
   No fundamental change since v3.
 - Enable SIS_UTIL by default. (Peter Zijlstra, Mel Gorman)
 - Replace percentage with SCHED_CAPACITY_SCALE based calculation.
   (Peter Zijlstra)
 - Use imbalance_pct for the threshold rather than a hardcoded 85%.
   (Mel Gorman)
 - Add a description of the patch's dependence on frequency invariance
   to the changelog. (Mel Gorman)
 - Remove the inline of update_idle_cpu_scan(). (Mel Gorman)
 - Move the check of CPU_NEWLY_IDLE earlier, to avoid unnecessary
   percpu cache contention. (Mel Gorman)
 - Add comments on why CPU_NEWLY_IDLE is ignored in update_idle_cpu_scan():
   updating sd_share is a shared cache line write, and CPU_NEWLY_IDLE
   can fire way more frequently than periodic load balancing.
   (Mel Gorman)
 - Rename nr_llc to llc_weight to avoid confusion. (Mel Gorman)
 - Avoid writing the same value to sd_share->nr_idle_scan to reduce
   cache line bounces. (Mel Gorman)

v2->v3:
 - Use 85% as the threshold again, because the CPU frequency invariance issue
   has been fixed and the patch is queued for 5.19.

 - Stop the scan if 85% is reached, rather than scanning for at least 4 CPUs.
   (Yicong Yang)

 - Replace the linear scan with a quadratic function scan, to let SIS
   scan aggressively when the LLC is not busy. Prateek mentioned there
   was a slight regression from ycsb-mongodb in v2, which might be due
   to fewer CPUs being scanned when the utilization is around 20%.
   (K Prateek Nayak)

 - Add back the logic to stop the CPU scan even if has_idle_core is true.
   It might be a waste of time to search for an idle core if the LLC is
   overloaded. Besides, according to the tbench result from Prateek,
   stopping the idle core scan brings extra performance improvement.
   (K Prateek Nayak)

 - Provide the SIS search statistics in the commit log, based on Mel Gorman's
   patch. (Abel Wu)

 - Introduce the SIS_UTIL sched feature rather than changing the logic
   of SIS_PROP directly, which makes it easier to review.

v1->v2:
 - As suggested by Peter, introduce an idle CPU scan strategy that is based on
   the util_avg metric. When util_avg is very low it scans more, while when
   util_avg hits the threshold the scan stops entirely. The threshold was
   decreased from 85% to 50%, because 50% is the threshold when the CPU is
   nearly 100% utilized but with turbo disabled. Scan at least 4 CPUs even
   when the LLC is overloaded, to keep it consistent with the current logic
   of select_idle_cpu().

v1:
 - Stop scanning for an idle CPU in select_idle_cpu() if the sum of util_avg
   in the LLC domain has reached 85%.

Link: https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net #1
Link: https://lore.kernel.org/lkml/20220505122331.42696-1-wuyun.abel@bytedance.com #2
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 87 ++++++++++++++++++++++++++++++++++
 kernel/sched/features.h        |  1 +
 3 files changed, 89 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..816df6cc444e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,6 +81,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	int		nr_idle_scan;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 77b2048a9326..3fb857a35b16 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6336,6 +6336,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	struct sched_domain_shared *sd_share;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
@@ -6375,6 +6376,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
+	if (sched_feat(SIS_UTIL)) {
+		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
+		if (sd_share) {
+			/* because !--nr is the condition to stop scan */
+			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
+			/* overloaded LLC is unlikely to have idle cpu/core */
+			if (nr == 1)
+				return -1;
+		}
+	}
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -9222,6 +9234,77 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+static void update_idle_cpu_scan(struct lb_env *env,
+				 unsigned long sum_util)
+{
+	struct sched_domain_shared *sd_share;
+	int llc_weight, pct;
+	u64 x, y, tmp;
+	/*
+	 * Update the number of CPUs to scan in LLC domain, which could
+	 * be used as a hint in select_idle_cpu(). The update of sd_share
+	 * could be expensive because it is within a shared cache line.
+	 * So the write of this hint only occurs during periodic load
+	 * balancing, rather than CPU_NEWLY_IDLE, because the latter
+	 * can fire way more frequently than the former.
+	 */
+	if (!sched_feat(SIS_UTIL) || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	llc_weight = per_cpu(sd_llc_size, env->dst_cpu);
+	if (env->sd->span_weight != llc_weight)
+		return;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
+	if (!sd_share)
+		return;
+
+	/*
+	 * The number of CPUs to search drops as sum_util increases, when
+	 * sum_util hits 85% or above, the scan stops.
+	 * The reason to choose 85% as the threshold is because this is the
+	 * imbalance_pct(117) when a LLC sched group is overloaded.
+	 *
+	 * let y = SCHED_CAPACITY_SCALE - p * x^2                       [1]
+	 * and y'= y / SCHED_CAPACITY_SCALE
+	 *
+	 * x is the ratio of sum_util compared to the CPU capacity:
+	 * x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
+	 * y' is the ratio of CPUs to be scanned in the LLC domain,
+	 * and the number of CPUs to scan is calculated by:
+	 *
+	 * nr_scan = llc_weight * y'                                    [2]
+	 *
+	 * When x hits the threshold of overloaded, AKA, when
+	 * x = 100 / pct, y drops to 0. According to [1],
+	 * p should be SCHED_CAPACITY_SCALE * pct^2 / 10000
+	 *
+	 * Scale x by SCHED_CAPACITY_SCALE:
+	 * x' = sum_util / llc_weight;                                  [3]
+	 *
+	 * and finally [1] becomes:
+	 * y = SCHED_CAPACITY_SCALE -
+	 *     x'^2 * pct^2 / (10000 * SCHED_CAPACITY_SCALE)            [4]
+	 *
+	 */
+	/* equation [3] */
+	x = sum_util;
+	do_div(x, llc_weight);
+
+	/* equation [4] */
+	pct = env->sd->imbalance_pct;
+	tmp = x * x * pct * pct;
+	do_div(tmp, 10000 * SCHED_CAPACITY_SCALE);
+	tmp = min_t(long, tmp, SCHED_CAPACITY_SCALE);
+	y = SCHED_CAPACITY_SCALE - tmp;
+
+	/* equation [2] */
+	y *= llc_weight;
+	do_div(y, SCHED_CAPACITY_SCALE);
+	if ((int)y != sd_share->nr_idle_scan)
+		WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -9234,6 +9317,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
+	unsigned long sum_util = 0;
 	int sg_status = 0;
 
 	do {
@@ -9266,6 +9350,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
 
+		sum_util += sgs->group_util;
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
@@ -9291,6 +9376,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
 		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
 	}
+
+	update_idle_cpu_scan(env, sum_util);
 }
 
 #define NUMA_IMBALANCE_MIN 2
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1cf435bbcd9c..3334a1b93fc6 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_UTIL, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.25.1



* Re: [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-06-12 16:34 [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg Chen Yu
@ 2022-06-13  7:40 ` Yicong Yang
  2022-06-13  8:06   ` Chen Yu
  2022-06-22  6:36 ` K Prateek Nayak
  2022-06-28  7:16 ` [tip: sched/core] " tip-bot2 for Chen Yu
  2 siblings, 1 reply; 12+ messages in thread
From: Yicong Yang @ 2022-06-13  7:40 UTC (permalink / raw)
  To: Chen Yu, Peter Zijlstra, Vincent Guittot, Mel Gorman
  Cc: yangyicong, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Barry Song, Srikar Dronamraju, Len Brown,
	Ben Segall, Aubrey Li, Abel Wu, Daniel Bristot de Oliveira,
	Tim Chen, linux-kernel, K Prateek Nayak, Mohini Narkhede

On 2022/6/13 0:34, Chen Yu wrote:
[cut...]
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 1cf435bbcd9c..3334a1b93fc6 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
>   */
>  SCHED_FEAT(SIS_PROP, true)
> +SCHED_FEAT(SIS_UTIL, true)
>  

I'm confused here: shouldn't we have SCHED_FEAT(SIS_PROP, false)? With SIS_UTIL enabled, SIS_PROP will have no
effect, since nr is overridden by SIS_UTIL.
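
To make the scan-depth curve in update_idle_cpu_scan() above concrete, here is
a minimal userspace sketch of the same computation (illustration only, not
kernel code; llc_weight = 112 and imbalance_pct = 117 are taken from the
changelog, and a 64-bit intermediate is used to avoid overflow):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* Userspace re-implementation of the nr_scan math quoted above. */
static unsigned long nr_idle_scan(unsigned long sum_util,
				  unsigned long llc_weight, unsigned long pct)
{
	unsigned long x, y;
	unsigned long long tmp;

	x = sum_util / llc_weight;                       /* equation [3] */
	tmp = (unsigned long long)x * x * pct * pct;     /* equation [4] */
	tmp /= 10000 * SCHED_CAPACITY_SCALE;
	if (tmp > SCHED_CAPACITY_SCALE)
		tmp = SCHED_CAPACITY_SCALE;
	y = SCHED_CAPACITY_SCALE - (unsigned long)tmp;

	return y * llc_weight / SCHED_CAPACITY_SCALE;    /* equation [2] */
}

int main(void)
{
	unsigned long llc_weight = 112, pct = 117;

	/* Should reproduce the scan_nr table in the changelog. */
	for (int util_pct = 0; util_pct <= 100; util_pct += 5) {
		unsigned long sum_util =
			util_pct * llc_weight * SCHED_CAPACITY_SCALE / 100;
		printf("sum_util%%=%3d nr_scan=%3lu\n",
		       util_pct, nr_idle_scan(sum_util, llc_weight, pct));
	}
	return 0;
}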

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-06-13  7:40 ` Yicong Yang
@ 2022-06-13  8:06   ` Chen Yu
  2022-06-13  8:54     ` Mel Gorman
  0 siblings, 1 reply; 12+ messages in thread
From: Chen Yu @ 2022-06-13  8:06 UTC (permalink / raw)
  To: Yicong Yang
  Cc: Peter Zijlstra, Vincent Guittot, Mel Gorman, yangyicong,
	Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Barry Song, Srikar Dronamraju, Len Brown, Ben Segall, Aubrey Li,
	Abel Wu, Daniel Bristot de Oliveira, Tim Chen, linux-kernel,
	K Prateek Nayak, Mohini Narkhede

On Mon, Jun 13, 2022 at 03:40:52PM +0800, Yicong Yang wrote:
> On 2022/6/13 0:34, Chen Yu wrote:
> >  
[cut...]
> >  #define NUMA_IMBALANCE_MIN 2
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index 1cf435bbcd9c..3334a1b93fc6 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> >   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> >   */
> >  SCHED_FEAT(SIS_PROP, true)
> > +SCHED_FEAT(SIS_UTIL, true)
> >  
> 
> I'm confused here: shouldn't we have SCHED_FEAT(SIS_PROP, false)? With SIS_UTIL enabled, SIS_PROP will have no
> effect, since nr is overridden by SIS_UTIL.
Yes, no matter how SIS_PROP is set, the result of SIS_UTIL is used to decide
the scan depth. We don't change the default value of SIS_PROP here, as this patch
tries to touch only one feature at a time, and the options can be tuned by the user
manually via debugfs. Besides, the goal is to replace SIS_PROP with another search
policy; Peter mentioned that "ideally we'd remove SIS_PROP after a few releases if
this works out", so I assume that changing the default value of SIS_PROP does not
matter in the current patch.
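
For reference, these features can be flipped at run time without a rebuild.
With CONFIG_SCHED_DEBUG enabled and debugfs mounted at the usual
/sys/kernel/debug, something like:

  echo NO_SIS_PROP > /sys/kernel/debug/sched/features
  echo SIS_UTIL    > /sys/kernel/debug/sched/features

disables a feature (the NO_ prefix) or enables one, respectively.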

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-06-13  8:06   ` Chen Yu
@ 2022-06-13  8:54     ` Mel Gorman
  2022-06-13  9:08       ` Chen Yu
  0 siblings, 1 reply; 12+ messages in thread
From: Mel Gorman @ 2022-06-13  8:54 UTC (permalink / raw)
  To: Chen Yu
  Cc: Yicong Yang, Peter Zijlstra, Vincent Guittot, yangyicong,
	Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Barry Song, Srikar Dronamraju, Len Brown, Ben Segall, Aubrey Li,
	Abel Wu, Daniel Bristot de Oliveira, Tim Chen, linux-kernel,
	K Prateek Nayak, Mohini Narkhede

On Mon, Jun 13, 2022 at 04:06:36PM +0800, Chen Yu wrote:
> On Mon, Jun 13, 2022 at 03:40:52PM +0800, Yicong Yang wrote:
> > On 2022/6/13 0:34, Chen Yu wrote:
> > >  
> [cut...]
> > >  #define NUMA_IMBALANCE_MIN 2
> > > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > > index 1cf435bbcd9c..3334a1b93fc6 100644
> > > --- a/kernel/sched/features.h
> > > +++ b/kernel/sched/features.h
> > > @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> > >   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> > >   */
> > >  SCHED_FEAT(SIS_PROP, true)
> > > +SCHED_FEAT(SIS_UTIL, true)
> > >  
> > 
> > I'm confused here: shouldn't we have SCHED_FEAT(SIS_PROP, false)? With SIS_UTIL enabled, SIS_PROP will have no
> > effect, since nr is overridden by SIS_UTIL.
> Yes, no matter how SIS_PROP is set, the result of SIS_UTIL is used to decide
> the scan depth. We don't change the default value of SIS_PROP here, as this patch
> tries to touch only one feature at a time, and the options can be tuned by the user
> manually via debugfs. Besides, the goal is to replace SIS_PROP with another search
> policy; Peter mentioned that "ideally we'd remove SIS_PROP after a few releases if
> this works out", so I assume that changing the default value of SIS_PROP does not
> matter in the current patch.
> 

I had expected it to be disabled, given that SIS_PROP does the work to
calculate nr, then discards it and uses SIS_UTIL instead. If SIS_UTIL shows a
regression and someone reports a bug, the first step would be to disable
SIS_UTIL and enable SIS_PROP via sched_feat.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-06-13  8:54     ` Mel Gorman
@ 2022-06-13  9:08       ` Chen Yu
  0 siblings, 0 replies; 12+ messages in thread
From: Chen Yu @ 2022-06-13  9:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Yicong Yang, Peter Zijlstra, Vincent Guittot, yangyicong,
	Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Barry Song, Srikar Dronamraju, Len Brown, Ben Segall, Aubrey Li,
	Abel Wu, Daniel Bristot de Oliveira, Tim Chen, linux-kernel,
	K Prateek Nayak, Mohini Narkhede

On Mon, Jun 13, 2022 at 09:54:37AM +0100, Mel Gorman wrote:
> On Mon, Jun 13, 2022 at 04:06:36PM +0800, Chen Yu wrote:
> > On Mon, Jun 13, 2022 at 03:40:52PM +0800, Yicong Yang wrote:
> > > On 2022/6/13 0:34, Chen Yu wrote:
> > > >  
> > [cut...]
> > > >  #define NUMA_IMBALANCE_MIN 2
> > > > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > > > index 1cf435bbcd9c..3334a1b93fc6 100644
> > > > --- a/kernel/sched/features.h
> > > > +++ b/kernel/sched/features.h
> > > > @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> > > >   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> > > >   */
> > > >  SCHED_FEAT(SIS_PROP, true)
> > > > +SCHED_FEAT(SIS_UTIL, true)
> > > >  
> > > 
> > > I'm confused here: shouldn't we have SCHED_FEAT(SIS_PROP, false)? With SIS_UTIL enabled, SIS_PROP will have no
> > > effect, since nr is overridden by SIS_UTIL.
> > Yes, no matter how SIS_PROP is set, the result of SIS_UTIL is used to decide
> > the scan depth. We don't change the default value of SIS_PROP here, as this patch
> > tries to touch only one feature at a time, and the options can be tuned by the user
> > manually via debugfs. Besides, the goal is to replace SIS_PROP with another search
> > policy; Peter mentioned that "ideally we'd remove SIS_PROP after a few releases if
> > this works out", so I assume that changing the default value of SIS_PROP does not
> > matter in the current patch.
> > 
> 
> I had expected it to be disabled, given that SIS_PROP does the work to
> calculate nr,
I see, disabling SIS_PROP would avoid the duplicated nr calculation.
> then discards it and uses SIS_UTIL instead. If SIS_UTIL shows a
> regression and someone reports a bug, the first step would be to disable
> SIS_UTIL and enable SIS_PROP via sched_feat.
OK, I'll change it in the next version.

thanks,
Chenyu
> 
> -- 
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-06-12 16:34 [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg Chen Yu
  2022-06-13  7:40 ` Yicong Yang
@ 2022-06-22  6:36 ` K Prateek Nayak
  2022-06-24  2:07   ` Chen Yu
  2022-06-24  7:29   ` Peter Zijlstra
  2022-06-28  7:16 ` [tip: sched/core] " tip-bot2 for Chen Yu
  2 siblings, 2 replies; 12+ messages in thread
From: K Prateek Nayak @ 2022-06-22  6:36 UTC (permalink / raw)
  To: Chen Yu, Peter Zijlstra, Vincent Guittot, Mel Gorman
  Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Barry Song, Srikar Dronamraju, Len Brown, Ben Segall, Aubrey Li,
	Abel Wu, Daniel Bristot de Oliveira, Tim Chen, linux-kernel,
	Yicong Yang, Mohini Narkhede

Hello Chenyu,

I'm sorry for the delay. The testing took a while but below are
the results from testing on our system.

tl;dr

o We ran all the tests with SIS_PROP disabled.
o tbench reaches close to saturation early with 256 clients.
o schbench shows improvements for low worker counts.
o All other benchmark results seem comparable to tip.
  We don't see any serious regressions with v4.

I've added detailed benchmark results and some analysis below.

On 6/12/2022 10:04 PM, Chen Yu wrote:
> [Problem Statement]
> select_idle_cpu() might spend too much time searching for an idle CPU,
> when the system is overloaded.
> 
> The following histogram is the time spent in select_idle_cpu(),
> when running 224 instances of netperf on a system with 112 CPUs
> per LLC domain:
> 
> @usecs:
> [0]                  533 |                                                    |
> [1]                 5495 |                                                    |
> [2, 4)             12008 |                                                    |
> [4, 8)            239252 |                                                    |
> [8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
> [16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
> [32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
> [128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
> [256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
> [512, 1K)        2600472 |@@@@@@@@@                                           |
> [1K, 2K)          927912 |@@@                                                 |
> [2K, 4K)          218720 |                                                    |
> [4K, 8K)           98161 |                                                    |
> [8K, 16K)          37722 |                                                    |
> [16K, 32K)          6715 |                                                    |
> [32K, 64K)           477 |                                                    |
> [64K, 128K)            7 |                                                    |
> 
> netperf latency usecs:
> =======
> case            	load    	    Lat_99th	    std%
> TCP_RR          	thread-224	      257.39	(  0.21)
> 
> The time spent in select_idle_cpu() is visible to netperf and might have a negative
> impact.
> 
> [Symptom analysis]
> The patch [1] from Mel Gorman has been applied to track the efficiency
> of select_idle_sibling. Copy the indicators here:
> 
> SIS Search Efficiency(se_eff%):
>         A ratio expressed as a percentage of runqueues scanned versus
>         idle CPUs found. A 100% efficiency indicates that the target,
>         prev or recent CPU of a task was idle at wakeup. The lower the
>         efficiency, the more runqueues were scanned before an idle CPU
>         was found.
> 
> SIS Domain Search Efficiency(dom_eff%):
>         Similar, except only for the slower SIS
> 	patch.
> 
> SIS Fast Success Rate(fast_rate%):
>         Percentage of SIS that used target, prev or
> 	recent CPUs.
> 
> SIS Success rate(success_rate%):
>         Percentage of scans that found an idle CPU.
> 
> The test is based on Aubrey's schedtests tool, including netperf, hackbench,
> schbench and tbench.
> 
> Test on vanilla kernel:
> schedstat_parse.py -f netperf_vanilla.log
> case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
> TCP_RR	   28 threads	     99.978	      18.535	      99.995	     100.000
> TCP_RR	   56 threads	     99.397	       5.671	      99.964	     100.000
> TCP_RR	   84 threads	     21.721	       6.818	      73.632	     100.000
> TCP_RR	  112 threads	     12.500	       5.533	      59.000	     100.000
> TCP_RR	  140 threads	      8.524	       4.535	      49.020	     100.000
> TCP_RR	  168 threads	      6.438	       3.945	      40.309	      99.999
> TCP_RR	  196 threads	      5.397	       3.718	      32.320	      99.982
> TCP_RR	  224 threads	      4.874	       3.661	      25.775	      99.767
> UDP_RR	   28 threads	     99.988	      17.704	      99.997	     100.000
> UDP_RR	   56 threads	     99.528	       5.977	      99.970	     100.000
> UDP_RR	   84 threads	     24.219	       6.992	      76.479	     100.000
> UDP_RR	  112 threads	     13.907	       5.706	      62.538	     100.000
> UDP_RR	  140 threads	      9.408	       4.699	      52.519	     100.000
> UDP_RR	  168 threads	      7.095	       4.077	      44.352	     100.000
> UDP_RR	  196 threads	      5.757	       3.775	      35.764	      99.991
> UDP_RR	  224 threads	      5.124	       3.704	      28.748	      99.860
> 
> schedstat_parse.py -f schbench_vanilla.log
> (each group has 28 tasks)
> case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
> normal	   1   mthread	     99.152	       6.400	      99.941	     100.000
> normal	   2   mthreads	     97.844	       4.003	      99.908	     100.000
> normal	   3   mthreads	     96.395	       2.118	      99.917	      99.998
> normal	   4   mthreads	     55.288	       1.451	      98.615	      99.804
> normal	   5   mthreads	      7.004	       1.870	      45.597	      61.036
> normal	   6   mthreads	      3.354	       1.346	      20.777	      34.230
> normal	   7   mthreads	      2.183	       1.028	      11.257	      21.055
> normal	   8   mthreads	      1.653	       0.825	       7.849	      15.549
> 
> schedstat_parse.py -f hackbench_vanilla.log
> (each group has 28 tasks)
> case			load	        se_eff%	    dom_eff%	  fast_rate%	success_rate%
> process-pipe	     1 group	         99.991	       7.692	      99.999	     100.000
> process-pipe	    2 groups	         99.934	       4.615	      99.997	     100.000
> process-pipe	    3 groups	         99.597	       3.198	      99.987	     100.000
> process-pipe	    4 groups	         98.378	       2.464	      99.958	     100.000
> process-pipe	    5 groups	         27.474	       3.653	      89.811	      99.800
> process-pipe	    6 groups	         20.201	       4.098	      82.763	      99.570
> process-pipe	    7 groups	         16.423	       4.156	      77.398	      99.316
> process-pipe	    8 groups	         13.165	       3.920	      72.232	      98.828
> process-sockets	     1 group	         99.977	       5.882	      99.999	     100.000
> process-sockets	    2 groups	         99.927	       5.505	      99.996	     100.000
> process-sockets	    3 groups	         99.397	       3.250	      99.980	     100.000
> process-sockets	    4 groups	         79.680	       4.258	      98.864	      99.998
> process-sockets	    5 groups	          7.673	       2.503	      63.659	      92.115
> process-sockets	    6 groups	          4.642	       1.584	      58.946	      88.048
> process-sockets	    7 groups	          3.493	       1.379	      49.816	      81.164
> process-sockets	    8 groups	          3.015	       1.407	      40.845	      75.500
> threads-pipe	     1 group	         99.997	       0.000	     100.000	     100.000
> threads-pipe	    2 groups	         99.894	       2.932	      99.997	     100.000
> threads-pipe	    3 groups	         99.611	       4.117	      99.983	     100.000
> threads-pipe	    4 groups	         97.703	       2.624	      99.937	     100.000
> threads-pipe	    5 groups	         22.919	       3.623	      87.150	      99.764
> threads-pipe	    6 groups	         18.016	       4.038	      80.491	      99.557
> threads-pipe	    7 groups	         14.663	       3.991	      75.239	      99.247
> threads-pipe	    8 groups	         12.242	       3.808	      70.651	      98.644
> threads-sockets	     1 group	         99.990	       6.667	      99.999	     100.000
> threads-sockets	    2 groups	         99.940	       5.114	      99.997	     100.000
> threads-sockets	    3 groups	         99.469	       4.115	      99.977	     100.000
> threads-sockets	    4 groups	         87.528	       4.038	      99.400	     100.000
> threads-sockets	    5 groups	          6.942	       2.398	      59.244	      88.337
> threads-sockets	    6 groups	          4.359	       1.954	      49.448	      87.860
> threads-sockets	    7 groups	          2.845	       1.345	      41.198	      77.102
> threads-sockets	    8 groups	          2.871	       1.404	      38.512	      74.312
> 
> schedstat_parse.py -f tbench_vanilla.log
> case			load	      se_eff%	    dom_eff%	  fast_rate%	success_rate%
> loopback	  28 threads	       99.976	      18.369	      99.995	     100.000
> loopback	  56 threads	       99.222	       7.799	      99.934	     100.000
> loopback	  84 threads	       19.723	       6.819	      70.215	     100.000
> loopback	 112 threads	       11.283	       5.371	      55.371	      99.999
> loopback	 140 threads	        0.000	       0.000	       0.000	       0.000
> loopback	 168 threads	        0.000	       0.000	       0.000	       0.000
> loopback	 196 threads	        0.000	       0.000	       0.000	       0.000
> loopback	 224 threads	        0.000	       0.000	       0.000	       0.000
> 
> According to the test above, if the system becomes busy, the
> SIS Search Efficiency(se_eff%) drops significantly. Although some
> benchmarks would finally find an idle CPU(success_rate% = 100%), it is
> doubtful whether it is worth it to search the whole LLC domain.
> 
> [Proposal]
> It would be ideal to have a crystal ball to answer this question:
> How many CPUs must a wakeup path walk down, before it can find an idle
> CPU? Many potential metrics could be used to predict the number.
> One candidate is the sum of util_avg in this LLC domain. The benefit
> of choosing util_avg is that it is a metric of accumulated historic
> activity, which seems to be smoother than instantaneous metrics
> (such as rq->nr_running). Besides, choosing the sum of util_avg
> would help predict the load of the LLC domain more precisely, because
> SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle
> time.
> 
> In summary, the lower the util_avg is, the more select_idle_cpu()
> should scan for idle CPU, and vice versa. When the sum of util_avg
> in this LLC domain hits 85% or above, the scan stops. The reason to
> choose 85% as the threshold is that this is the imbalance_pct(117)
> when a LLC sched group is overloaded.
> 
> Introduce the quadratic function:
> 
> y = SCHED_CAPACITY_SCALE - p * x^2
> and y'= y / SCHED_CAPACITY_SCALE
> 
> x is the ratio of sum_util compared to the CPU capacity:
> x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
> y' is the ratio of CPUs to be scanned in the LLC domain,
> and the number of CPUs to scan is calculated by:
> 
> nr_scan = llc_weight * y'
> 
> Choosing quadratic function is because:
> [1] Compared to the linear function, it scans more aggressively when the
>     sum_util is low.
> [2] Compared to the exponential function, it is easier to calculate.
> [3] It seems that there is no accurate mapping between the sum of util_avg
>     and the number of CPUs to be scanned. Use heuristic scan for now.
> 
> For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
> sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
> scan_nr   112  111  108  102  93  81  65   47   25    1    0 ...
> 
> For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
> sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
> scan_nr    16   15   15   14  13  11   9    6    3    0    0 ...
> 
> Furthermore, to minimize the overhead of calculating the metrics in
> select_idle_cpu(), borrow the statistics from periodic load balance.
> As mentioned by Abel, on a platform with 112 CPUs per LLC, the
> sum_util calculated by periodic load balance after 112 ms would
> decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay
> in reflecting the latest utilization. But it is a trade-off.
> Checking the util_avg in newidle load balance would be more frequent,
> but it brings overhead - multiple CPUs write/read the per-LLC shared
> variable and introduces cache contention. Tim also mentioned that,
> it is allowed to be non-optimal in terms of scheduling for the
> short-term variations, but if there is a long-term trend in the load
> behavior, the scheduler can adjust for that.
> 
> When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan
> calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
> Mel suggested, SIS_UTIL should be enabled by default.
> 
> This patch is based on the util_avg, which is very sensitive to the
> CPU frequency invariance. There is an issue that, when the max frequency
> has been clamped, the util_avg would decay insanely fast when
> the CPU is idle. Commit addca285120b ("cpufreq: intel_pstate: Handle no_turbo
> in frequency invariance") could be used to mitigate this symptom, by adjusting
> the arch_max_freq_ratio when turbo is disabled. But this issue is still
> not thoroughly fixed, because the current code is unaware of the user-specified
> max CPU frequency.
> 
> [Test result]
> 
> netperf and tbench were launched with 25% 50% 75% 100% 125% 150%
> 175% 200% of CPU number respectively. Hackbench and schbench were launched
> by 1, 2, 4, 8 groups. Each test lasts for 100 seconds and repeats 3 times.
> 
> The following is the benchmark result comparison between
> baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare%
> indicates better performance.
> 
> Each netperf test is a:
> netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100
> netperf.throughput
> =======
> case            	load    	baseline(std%)	compare%( std%)
> TCP_RR          	28 threads	 1.00 (  0.34)	 -0.16 (  0.40)
> TCP_RR          	56 threads	 1.00 (  0.19)	 -0.02 (  0.20)
> TCP_RR          	84 threads	 1.00 (  0.39)	 -0.47 (  0.40)
> TCP_RR          	112 threads	 1.00 (  0.21)	 -0.66 (  0.22)
> TCP_RR          	140 threads	 1.00 (  0.19)	 -0.69 (  0.19)
> TCP_RR          	168 threads	 1.00 (  0.18)	 -0.48 (  0.18)
> TCP_RR          	196 threads	 1.00 (  0.16)	+194.70 ( 16.43)
> TCP_RR          	224 threads	 1.00 (  0.16)	+197.30 (  7.85)
> UDP_RR          	28 threads	 1.00 (  0.37)	 +0.35 (  0.33)
> UDP_RR          	56 threads	 1.00 ( 11.18)	 -0.32 (  0.21)
> UDP_RR          	84 threads	 1.00 (  1.46)	 -0.98 (  0.32)
> UDP_RR          	112 threads	 1.00 ( 28.85)	 -2.48 ( 19.61)
> UDP_RR          	140 threads	 1.00 (  0.70)	 -0.71 ( 14.04)
> UDP_RR          	168 threads	 1.00 ( 14.33)	 -0.26 ( 11.16)
> UDP_RR          	196 threads	 1.00 ( 12.92)	+186.92 ( 20.93)
> UDP_RR          	224 threads	 1.00 ( 11.74)	+196.79 ( 18.62)
> 
> Take the 224 threads as an example, the SIS search metrics changes are
> illustrated below:
> 
>     vanilla                    patched
>    4544492          +237.5%   15338634        sched_debug.cpu.sis_domain_search.avg
>      38539        +39686.8%   15333634        sched_debug.cpu.sis_failed.avg
>   128300000          -87.9%   15551326        sched_debug.cpu.sis_scanned.avg
>    5842896          +162.7%   15347978        sched_debug.cpu.sis_search.avg
> 
> There is -87.9% less CPU scans after patched, which indicates lower overhead.
> Besides, with this patch applied, there is -13% less rq lock contention
> in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
> .try_to_wake_up.default_wake_function.woken_wake_function.
> This might help explain the performance improvement - Because this patch allows
> the waking task to remain on the previous CPU, rather than grabbing other CPUs'
> lock.
> 
> Each hackbench test is a:
> hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100
> hackbench.throughput
> =========
> case            	load    	baseline(std%)	compare%( std%)
> process-pipe    	1 group 	 1.00 (  1.29)	 +0.57 (  0.47)
> process-pipe    	2 groups 	 1.00 (  0.27)	 +0.77 (  0.81)
> process-pipe    	4 groups 	 1.00 (  0.26)	 +1.17 (  0.02)
> process-pipe    	8 groups 	 1.00 (  0.15)	 -4.79 (  0.02)
> process-sockets 	1 group 	 1.00 (  0.63)	 -0.92 (  0.13)
> process-sockets 	2 groups 	 1.00 (  0.03)	 -0.83 (  0.14)
> process-sockets 	4 groups 	 1.00 (  0.40)	 +5.20 (  0.26)
> process-sockets 	8 groups 	 1.00 (  0.04)	 +3.52 (  0.03)
> threads-pipe    	1 group 	 1.00 (  1.28)	 +0.07 (  0.14)
> threads-pipe    	2 groups 	 1.00 (  0.22)	 -0.49 (  0.74)
> threads-pipe    	4 groups 	 1.00 (  0.05)	 +1.88 (  0.13)
> threads-pipe    	8 groups 	 1.00 (  0.09)	 -4.90 (  0.06)
> threads-sockets 	1 group 	 1.00 (  0.25)	 -0.70 (  0.53)
> threads-sockets 	2 groups 	 1.00 (  0.10)	 -0.63 (  0.26)
> threads-sockets 	4 groups 	 1.00 (  0.19)	+11.92 (  0.24)
> threads-sockets 	8 groups 	 1.00 (  0.08)	 +4.31 (  0.11)
> 
> Each tbench test is a:
> tbench -t 100 $job 127.0.0.1
> tbench.throughput
> ======
> case            	load    	baseline(std%)	compare%( std%)
> loopback        	28 threads	 1.00 (  0.06)	 -0.14 (  0.09)
> loopback        	56 threads	 1.00 (  0.03)	 -0.04 (  0.17)
> loopback        	84 threads	 1.00 (  0.05)	 +0.36 (  0.13)
> loopback        	112 threads	 1.00 (  0.03)	 +0.51 (  0.03)
> loopback        	140 threads	 1.00 (  0.02)	 -1.67 (  0.19)
> loopback        	168 threads	 1.00 (  0.38)	 +1.27 (  0.27)
> loopback        	196 threads	 1.00 (  0.11)	 +1.34 (  0.17)
> loopback        	224 threads	 1.00 (  0.11)	 +1.67 (  0.22)
> 
> Each schbench test is a:
> schbench -m $job -t 28 -r 100 -s 30000 -c 30000
> schbench.latency_90%_us
> ========
> case            	load    	baseline(std%)	compare%( std%)
> normal          	1 mthread	 1.00 ( 31.22)	 -7.36 ( 20.25)*
> normal          	2 mthreads	 1.00 (  2.45)	 -0.48 (  1.79)
> normal          	4 mthreads	 1.00 (  1.69)	 +0.45 (  0.64)
> normal          	8 mthreads	 1.00 (  5.47)	 +9.81 ( 14.28)


Following are the results from a dual socket Zen3 platform (2 x 64C/128T) running with
various NPS configurations:

Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255
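
For reference, the node layout for a given NPS mode can be cross-checked on
the machine with standard tools (generic commands, not part of the test setup
described here), e.g.:

  numactl --hardware
  lscpu | grep -i numa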

Kernel versions:
- tip:         5.19-rc2 tip sched/core
- SIS_UTIL:    5.19-rc2 tip sched/core + this patch

When we started testing, the tip was at:
commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle"

~~~~~~~~~
hackbench
~~~~~~~~~

NPS1

Test:                   tip                     SIS_UTIL
 1-groups:         4.64 (0.00 pct)         4.77 (-2.80 pct)
 2-groups:         5.22 (0.00 pct)         5.17 (0.95 pct)
 4-groups:         5.43 (0.00 pct)         5.29 (2.57 pct)
 8-groups:         5.85 (0.00 pct)         5.75 (1.70 pct)
16-groups:         7.54 (0.00 pct)         7.62 (-1.06 pct)

NPS2

Test:                   tip                     SIS_UTIL
 1-groups:         4.61 (0.00 pct)         4.79 (-3.90 pct)
 2-groups:         5.00 (0.00 pct)         4.94 (1.20 pct)
 4-groups:         5.14 (0.00 pct)         5.00 (2.72 pct)
 8-groups:         5.66 (0.00 pct)         5.49 (3.00 pct)
16-groups:         7.54 (0.00 pct)         7.33 (2.78 pct)

NPS4

Test:                   tip                     SIS_UTIL
 1-groups:         4.64 (0.00 pct)         4.69 (-1.07 pct)
 2-groups:         5.03 (0.00 pct)         4.98 (0.99 pct)
 4-groups:         5.66 (0.00 pct)         5.88 (-3.88 pct)
 8-groups:         6.16 (0.00 pct)         6.14 (0.32 pct)
16-groups:         7.37 (0.00 pct)         9.60 (-30.25 pct)    * (System overloaded)
16-groups:         7.38 (0.00 pct)         7.99 (-8.26 pct)     [Verification Run]

~~~~~~~~
schbench
~~~~~~~~

NPS1

#workers:     tip                     SIS_UTIL
  1:      23.50 (0.00 pct)        20.00 (14.89 pct)
  2:      33.00 (0.00 pct)        29.50 (10.60 pct)
  4:      43.50 (0.00 pct)        40.00 (8.04 pct)
  8:      52.50 (0.00 pct)        50.00 (4.76 pct)
 16:      70.00 (0.00 pct)        72.50 (-3.57 pct)
 32:     103.50 (0.00 pct)       100.50 (2.89 pct)
 64:     175.50 (0.00 pct)       183.00 (-4.27 pct)
128:     362.00 (0.00 pct)       368.50 (-1.79 pct)
256:     867.00 (0.00 pct)       867.00 (0.00 pct)
512:     60224.00 (0.00 pct)     58368.00 (3.08 pct)

NPS2

#workers:     tip                     SIS_UTIL
  1:      19.50 (0.00 pct)        17.00 (12.82 pct)
  2:      31.50 (0.00 pct)        21.50 (31.74 pct)
  4:      39.00 (0.00 pct)        31.50 (19.23 pct)
  8:      54.50 (0.00 pct)        46.00 (15.59 pct)
 16:      73.50 (0.00 pct)        78.00 (-6.12 pct)     *
 16:      74.00 (0.00 pct)        76.00 (-2.70 pct)     [Verification Run]
 32:     105.00 (0.00 pct)       100.00 (4.76 pct)
 64:     181.50 (0.00 pct)       176.00 (3.03 pct)
128:     368.50 (0.00 pct)       368.00 (0.13 pct)
256:     885.00 (0.00 pct)       875.00 (1.12 pct)
512:     58752.00 (0.00 pct)     59520.00 (-1.30 pct)

NPS4

#workers:     tip                     SIS_UTIL
  1:      19.00 (0.00 pct)        15.50 (18.42 pct)
  2:      32.00 (0.00 pct)        21.50 (32.81 pct)
  4:      36.50 (0.00 pct)        29.00 (20.54 pct)
  8:      47.50 (0.00 pct)        51.00 (-7.36 pct)     *
  8:      49.50 (0.00 pct)        44.50 (10.10 pct)     [Verification Run]
 16:      74.50 (0.00 pct)        78.00 (-4.69 pct)     *
 16:      81.50 (0.00 pct)        73.00 (10.42 pct)     [Verification Run]
 32:      98.50 (0.00 pct)       101.50 (-3.04 pct)
 64:     182.00 (0.00 pct)       185.50 (-1.92 pct)
128:     369.50 (0.00 pct)       384.00 (-3.92 pct)
256:     920.00 (0.00 pct)       901.00 (2.06 pct)
512:     60224.00 (0.00 pct)     59136.00 (1.80 pct)

~~~~~~
tbench
~~~~~~

NPS1

Clients:      tip                     SIS_UTIL
    1    444.41 (0.00 pct)       445.90 (0.33 pct)
    2    879.23 (0.00 pct)       871.32 (-0.89 pct)
    4    1648.83 (0.00 pct)      1648.23 (-0.03 pct)
    8    3263.81 (0.00 pct)      3251.66 (-0.37 pct)
   16    6011.19 (0.00 pct)      5997.98 (-0.21 pct)
   32    12058.31 (0.00 pct)     11625.00 (-3.59 pct)
   64    21258.21 (0.00 pct)     20847.13 (-1.93 pct)
  128    30795.27 (0.00 pct)     29286.06 (-4.90 pct)   *
  128    29848.21 (0.00 pct)     31613.76 (5.91 pct)    [Verification run]
  256    25138.43 (0.00 pct)     51160.59 (103.51 pct)
  512    51287.93 (0.00 pct)     51829.94 (1.05 pct)
 1024    53176.97 (0.00 pct)     53211.32 (0.06 pct)

NPS2

Clients:       tip                     SIS_UTIL
    1    445.45 (0.00 pct)       447.64 (0.49 pct)
    2    869.24 (0.00 pct)       868.63 (-0.07 pct)
    4    1644.28 (0.00 pct)      1632.35 (-0.72 pct)
    8    3120.83 (0.00 pct)      3157.00 (1.15 pct)
   16    5972.29 (0.00 pct)      5679.18 (-4.90 pct)    *
   16    5668.91 (0.00 pct)      5701.57 (0.57 pct)     [Verification run]
   32    11776.38 (0.00 pct)     11253.96 (-4.43 pct)   *
   32    11668.66 (0.00 pct)     11272.02 (-3.39 pct)   [Verification run]
   64    20933.15 (0.00 pct)     20717.28 (-1.03 pct)
  128    32195.00 (0.00 pct)     30400.11 (-5.57 pct)   *
  128    30248.19 (0.00 pct)     30781.22 (1.76 pct)    [Verification run]
  256    24641.52 (0.00 pct)     44940.70 (82.37 pct)
  512    50806.96 (0.00 pct)     51937.08 (2.22 pct)
 1024    51993.96 (0.00 pct)     52154.38 (0.30 pct)

NPS4

Clients:      tip                     SIS_UTIL
    1    442.10 (0.00 pct)       449.20 (1.60 pct)
    2    870.94 (0.00 pct)       875.15 (0.48 pct)
    4    1615.30 (0.00 pct)      1636.92 (1.33 pct)
    8    3195.95 (0.00 pct)      3222.69 (0.83 pct)
   16    5937.41 (0.00 pct)      5705.23 (-3.91 pct)
   32    11800.41 (0.00 pct)     11337.91 (-3.91 pct)
   64    20844.71 (0.00 pct)     20123.99 (-3.45 pct)
  128    31003.62 (0.00 pct)     30219.39 (-2.52 pct)
  256    27476.37 (0.00 pct)     49333.89 (79.55 pct)
  512    52276.72 (0.00 pct)     50807.17 (-2.81 pct)
 1024    51372.10 (0.00 pct)     51566.42 (0.37 pct)

Note: tbench results for 256 workers are known to have
run-to-run variation on the test machine. Any regression
seen at this data point can be safely ignored.

~~~~~~
Stream
~~~~~~

- 10 runs

NPS1

Test:            tip                    SIS_UTIL
 Copy:   152431.37 (0.00 pct)    165782.13 (8.75 pct)
Scale:   187983.72 (0.00 pct)    180133.46 (-4.17 pct)
  Add:   211713.09 (0.00 pct)    205588.71 (-2.89 pct)
Triad:   207302.09 (0.00 pct)    201103.81 (-2.98 pct)

NPS2

Test:           tip                     SIS_UTIL
 Copy:   134099.98 (0.00 pct)    146487.66 (9.23 pct)
Scale:   168404.01 (0.00 pct)    180551.46 (7.21 pct)
  Add:   184326.77 (0.00 pct)    197117.20 (6.93 pct)
Triad:   182707.29 (0.00 pct)    195282.60 (6.88 pct)

NPS4

Test:            tip                    SIS_UTIL
 Copy:   123058.63 (0.00 pct)    129624.17 (5.33 pct)
Scale:   178696.74 (0.00 pct)    182611.49 (2.19 pct)
  Add:   169836.95 (0.00 pct)    179869.80 (5.90 pct)
Triad:   170036.21 (0.00 pct)    177249.46 (4.24 pct)

- 100 runs

NPS1

Test:            tip                     SIS_UTIL
 Copy:   215860.05 (0.00 pct)    205953.63 (-4.58 pct)
Scale:   207886.55 (0.00 pct)    203384.29 (-2.16 pct)
  Add:   253513.05 (0.00 pct)    243351.95 (-4.00 pct)
Triad:   239471.82 (0.00 pct)    232221.90 (-3.02 pct)

NPS2

Test:            tip                     SIS_UTIL
 Copy:   223991.94 (0.00 pct)    217920.18 (-2.71 pct)
Scale:   205631.20 (0.00 pct)    213060.40 (3.61 pct)
  Add:   252292.90 (0.00 pct)    266848.26 (5.76 pct)
Triad:   239838.71 (0.00 pct)    252369.51 (5.22 pct)

NPS4

Test:            tip                     SIS_UTIL
 Copy:   225480.09 (0.00 pct)    218902.02 (-2.91 pct)
Scale:   218218.59 (0.00 pct)    210839.93 (-3.38 pct)
  Add:   273879.95 (0.00 pct)    261761.62 (-4.42 pct)
Triad:   255765.98 (0.00 pct)    246971.11 (-3.43 pct)

~~~~~~~~~~~~
ycsb-mongodb
~~~~~~~~~~~~

NPS1

sched-tip:      301330.33 (var: 3.28)
SIS_UTIL:       295360.33 (var: 0.76)    (-1.98%)

NPS2

sched-tip:      287786.00 (var: 4.24)
SIS_UTIL:       288888.33 (var: 1.58)    (+0.38%)

NPS4

sched-tip:      293671.00 (var: 0.89)
SIS_UTIL:       295682.33 (var: 0.92)    (+0.68%)


~~~~~
Notes
~~~~~

o tbench reaches close to saturation at 256 clients, which was
  previously an unreliable data point that usually showed a regression
  compared to the result with 128 clients.
o schbench improves for low worker counts, though not strictly
  because of SIS_UTIL.
o The most serious regressions seen seem to reverse on a rerun,
  suggesting some run-to-run variance at a few data points on tip
  as well as with this patch.
o Any small regressions or improvements seen are within the margin of
  run-to-run variance observed on tip as well. The results seem to be
  more stable with SIS_UTIL compared to SIS_PROP.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SIS Efficiency Stats for Hackbench
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Following are the system-wide SIS efficiency stats for SIS_PROP and SIS_UTIL
when running hackbench, with Mel's patch applied as-is on both kernels:
(https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/)

Metrics and the labels assigned for better readability:

SIS Search		: Number of calls to select_idle_sibling
SIS Domain Search	: Number of times the domain was searched (fast path failed)
SIS Scanned		: Number of runqueues scanned
SIS Failures		: Number of SIS calls that failed to find an idle CPU

SIS Logic:								   SIS_PROP	 SIS_UTIL	Diff (SIS_UTIL wrt SIS_PROP)

o 1-group

Benchmark Results (sec)                                    :                4.823         4.841		(-0.37 pct)
Number of calls to select_idle_sibling                     :              3154397       3166395		(0.38 pct)
Number of times the domain was searched (fast path failed) :               931530       1349865		(44.91 pct)
Number of runqueues scanned                                :              7846894      11026784		(40.52 pct)
Number of SIS calls that failed to find an idle CPU        :                76463        118968		(55.59 pct)
Avg. No. of runqueues scanned per domain search            :                 8.42          8.16		(-3.09 pct)

o 2-groups

Benchmark Results (sec)                                    :                4.705         4.912		(-4.40 pct)
Number of calls to select_idle_sibling                     :              3521182       4879821		(38.58 pct)
Number of times the domain was searched (fast path failed) :              2049034       2979202		(45.40 pct)
Number of runqueues scanned                                :             16717385      24743444		(48.01 pct)
Number of SIS calls that failed to find an idle CPU        :               366643        241789		(-34.05 pct)
Avg. No. of runqueues scanned per domain search            :                 8.15          8.30		(1.84 pct)

o 4-groups

Benchmark Results (sec)                                    :                5.503         5.268		(4.27 pct)
Number of calls to select_idle_sibling                     :             13293368      11006088		(-17.21 pct)
Number of times the domain was searched (fast path failed) :              5487436       4604635		(-16.09 pct)
Number of runqueues scanned                                :             53028113      43238439		(-18.46 pct)
Number of SIS calls that failed to find an idle CPU        :              1171727       1040776		(-11.18 pct)
Avg. No. of runqueues scanned per domain search            :                 9.66          9.39		(-2.80 pct)

o 8-groups

Benchmark Results (sec)                                    :                5.794         5.752		(0.72 pct)
Number of calls to select_idle_sibling                     :             26367244      24734896		(-6.19 pct)
Number of times the domain was searched (fast path failed) :             11137288       9528659		(-14.44 pct)
Number of runqueues scanned                                :            106216549      91895107		(-13.48 pct)
Number of SIS calls that failed to find an idle CPU        :              3154674       3012751		(-4.50 pct)
Avg. No. of runqueues scanned per domain search            :                 9.53          9.64		(1.15 pct)

o 16-groups

Benchmark Results (sec)                                    :                7.405         7.363		(0.57 pct)
Number of calls to select_idle_sibling                     :             57323447      49331195		(-13.94 pct)
Number of times the domain was searched (fast path failed) :             27853188      23892530		(-14.22 pct)
Number of runqueues scanned                                :            248062785     180150761		(-27.38 pct)
Number of SIS calls that failed to find an idle CPU        :             12182277      14125960		(15.96 pct)
Avg. No. of runqueues scanned per domain search            :                 8.90          7.54		(-15.28 pct)

For 16 groups, when comparing SIS_UTIL to SIS_PROP, the
"Avg. No. of runqueues scanned per domain search" goes down where we
know there is a high chance we won't find an idle CPU, but it remains
relatively high for lower group counts, where the opportunity to
find idle CPUs is greater.
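
(As a sanity check of the derived metric: "Avg. No. of runqueues scanned per
domain search" is SIS Scanned divided by SIS Domain Search, e.g.
7846894 / 931530 ~= 8.42 for the 1-group SIS_PROP run above.)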

> 
> [..snip..]
>  
>  #define NUMA_IMBALANCE_MIN 2
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 1cf435bbcd9c..3334a1b93fc6 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
>   */
>  SCHED_FEAT(SIS_PROP, true)

SIS_PROP was disabled in our testing as follows:

--
-SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_PROP, false)
--

> +SCHED_FEAT(SIS_UTIL, true)
>  
>  /*
>   * Issue a WARN when we do multiple update_rq_clock() calls
>
> [..snip..]
>

With v4 on the current tip, I don't see any need for
a special case for systems with smaller LLCs when running with
SIS_PROP disabled and SIS_UTIL enabled. Even SIS efficiency
seems to be better with SIS_UTIL for hackbench.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> 
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-06-22  6:36 ` K Prateek Nayak
@ 2022-06-24  2:07   ` Chen Yu
  2022-06-24  4:04     ` K Prateek Nayak
  2022-06-24  7:29   ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Chen Yu @ 2022-06-24  2:07 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Vincent Guittot, Mel Gorman, Ingo Molnar,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Barry Song,
	Srikar Dronamraju, Len Brown, Ben Segall, Aubrey Li, Abel Wu,
	Daniel Bristot de Oliveira, Tim Chen, linux-kernel, Yicong Yang,
	Mohini Narkhede

Hi Prateek, 
On Wed, Jun 22, 2022 at 12:06:55PM +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> I'm sorry for the delay. The testing took a while but below are
> the results from testing on our system.
> 
> tl;dr
> 
> o We ran all the tests with SIS_PROP disabled.
> o tbench reaches close to saturation early with 256 clients.
> o schbench shows improvements for low worker counts.
> o All other benchmark results seem comparable to tip.
>   We don't see any serious regressions with v4.
> 
> I've added detailed benchmark results and some analysis below.
> 
Thanks very much for the test.
> On 6/12/2022 10:04 PM, Chen Yu wrote:
> > [..snip..]
> 
> 
> Following are the results from a dual-socket Zen3 platform (2 x 64C/128T) running
> with various NPS configurations:
> 
> Following is the NUMA configuration for each NPS mode on the system:
> 
> NPS1: Each socket is a NUMA node.
>     Total 2 NUMA nodes in the dual socket machine.
> 
>     Node 0: 0-63,   128-191
>     Node 1: 64-127, 192-255
> 
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>     Total 4 NUMA nodes exist over 2 sockets.
>    
>     Node 0: 0-31,   128-159
>     Node 1: 32-63,  160-191
>     Node 2: 64-95,  192-223
>     Node 3: 96-127, 224-255
> 
> NPS4: Each socket is logically divided into 4 NUMA regions.
>     Total 8 NUMA nodes exist over 2 sockets.
>    
>     Node 0: 0-15,    128-143
>     Node 1: 16-31,   144-159
>     Node 2: 32-47,   160-175
>     Node 3: 48-63,   176-191
>     Node 4: 64-79,   192-207
>     Node 5: 80-95,   208-223
>     Node 6: 96-111,  224-239
>     Node 7: 112-127, 240-255
> 
> Kernel versions:
> - tip:         5.19-rc2 tip sched/core
> - SIS_UTIL:    5.19-rc2 tip sched/core + this patch
> 
> When we started testing, the tip was at:
> commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle"
> 
> ~~~~~~~~~
> hackbench
> ~~~~~~~~~
> 
> NPS1
> 
> Test:                   tip                     SIS_UTIL
>  1-groups:         4.64 (0.00 pct)         4.77 (-2.80 pct)
>  2-groups:         5.22 (0.00 pct)         5.17 (0.95 pct)
>  4-groups:         5.43 (0.00 pct)         5.29 (2.57 pct)
>  8-groups:         5.85 (0.00 pct)         5.75 (1.70 pct)
> 16-groups:         7.54 (0.00 pct)         7.62 (-1.06 pct)
> 
> NPS2
> 
> Test:                   tip                     SIS_UTIL
>  1-groups:         4.61 (0.00 pct)         4.79 (-3.90 pct)
>  2-groups:         5.00 (0.00 pct)         4.94 (1.20 pct)
>  4-groups:         5.14 (0.00 pct)         5.00 (2.72 pct)
>  8-groups:         5.66 (0.00 pct)         5.49 (3.00 pct)
> 16-groups:         7.54 (0.00 pct)         7.33 (2.78 pct)
> 
> NPS4
> 
> Test:                   tip                     SIS_UTIL
>  1-groups:         4.64 (0.00 pct)         4.69 (-1.07 pct)
>  2-groups:         5.03 (0.00 pct)         4.98 (0.99 pct)
>  4-groups:         5.66 (0.00 pct)         5.88 (-3.88 pct)
>  8-groups:         6.16 (0.00 pct)         6.14 (0.32 pct)
> 16-groups:         7.37 (0.00 pct)         9.60 (-30.25 pct)    * (System overloaded)
> 16-groups:         7.38 (0.00 pct)         7.99 (-8.26 pct)     [Verification Run]
> 
> ~~~~~~~~
> schbench
> ~~~~~~~~
> 
> NPS1
> 
> #workers:     tip                     SIS_UTIL
>   1:      23.50 (0.00 pct)        20.00 (14.89 pct)
>   2:      33.00 (0.00 pct)        29.50 (10.60 pct)
>   4:      43.50 (0.00 pct)        40.00 (8.04 pct)
>   8:      52.50 (0.00 pct)        50.00 (4.76 pct)
>  16:      70.00 (0.00 pct)        72.50 (-3.57 pct)
>  32:     103.50 (0.00 pct)       100.50 (2.89 pct)
>  64:     175.50 (0.00 pct)       183.00 (-4.27 pct)
> 128:     362.00 (0.00 pct)       368.50 (-1.79 pct)
> 256:     867.00 (0.00 pct)       867.00 (0.00 pct)
> 512:     60224.00 (0.00 pct)     58368.00 (3.08 pct)
> 
> NPS2
> 
> #workers:     tip                     SIS_UTIL
>   1:      19.50 (0.00 pct)        17.00 (12.82 pct)
>   2:      31.50 (0.00 pct)        21.50 (31.74 pct)
>   4:      39.00 (0.00 pct)        31.50 (19.23 pct)
>   8:      54.50 (0.00 pct)        46.00 (15.59 pct)
>  16:      73.50 (0.00 pct)        78.00 (-6.12 pct)     *
>  16:      74.00 (0.00 pct)        76.00 (-2.70 pct)     [Verification Run]
>  32:     105.00 (0.00 pct)       100.00 (4.76 pct)
>  64:     181.50 (0.00 pct)       176.00 (3.03 pct)
> 128:     368.50 (0.00 pct)       368.00 (0.13 pct)
> 256:     885.00 (0.00 pct)       875.00 (1.12 pct)
> 512:     58752.00 (0.00 pct)     59520.00 (-1.30 pct)
> 
> NPS4
> 
> #workers:     tip                     SIS_UTIL
>   1:      19.00 (0.00 pct)        15.50 (18.42 pct)
>   2:      32.00 (0.00 pct)        21.50 (32.81 pct)
>   4:      36.50 (0.00 pct)        29.00 (20.54 pct)
>   8:      47.50 (0.00 pct)        51.00 (-7.36 pct)     *
>   8:      49.50 (0.00 pct)        44.50 (10.10 pct)     [Verification Run]
>  16:      74.50 (0.00 pct)        78.00 (-4.69 pct)     *
>  16:      81.50 (0.00 pct)        73.00 (10.42 pct)     [Verification Run]
>  32:      98.50 (0.00 pct)       101.50 (-3.04 pct)
>  64:     182.00 (0.00 pct)       185.50 (-1.92 pct)
> 128:     369.50 (0.00 pct)       384.00 (-3.92 pct)
> 256:     920.00 (0.00 pct)       901.00 (2.06 pct)
> 512:     60224.00 (0.00 pct)     59136.00 (1.80 pct)
> 
> ~~~~~~
> tbench
> ~~~~~~
> 
> NPS1
> 
> Clients:      tip                     SIS_UTIL
>     1    444.41 (0.00 pct)       445.90 (0.33 pct)
>     2    879.23 (0.00 pct)       871.32 (-0.89 pct)
>     4    1648.83 (0.00 pct)      1648.23 (-0.03 pct)
>     8    3263.81 (0.00 pct)      3251.66 (-0.37 pct)
>    16    6011.19 (0.00 pct)      5997.98 (-0.21 pct)
>    32    12058.31 (0.00 pct)     11625.00 (-3.59 pct)
>    64    21258.21 (0.00 pct)     20847.13 (-1.93 pct)
>   128    30795.27 (0.00 pct)     29286.06 (-4.90 pct)   *
>   128    29848.21 (0.00 pct)     31613.76 (5.91 pct)    [Verification run]
>   256    25138.43 (0.00 pct)     51160.59 (103.51 pct)
>   512    51287.93 (0.00 pct)     51829.94 (1.05 pct)
>  1024    53176.97 (0.00 pct)     53211.32 (0.06 pct)
> 
> NPS2
> 
> Clients:       tip                     SIS_UTIL
>     1    445.45 (0.00 pct)       447.64 (0.49 pct)
>     2    869.24 (0.00 pct)       868.63 (-0.07 pct)
>     4    1644.28 (0.00 pct)      1632.35 (-0.72 pct)
>     8    3120.83 (0.00 pct)      3157.00 (1.15 pct)
>    16    5972.29 (0.00 pct)      5679.18 (-4.90 pct)    *
>    16    5668.91 (0.00 pct)      5701.57 (0.57 pct)     [Verification run]
>    32    11776.38 (0.00 pct)     11253.96 (-4.43 pct)   *
>    32    11668.66 (0.00 pct)     11272.02 (-3.39 pct)   [Verification run]
>    64    20933.15 (0.00 pct)     20717.28 (-1.03 pct)
>   128    32195.00 (0.00 pct)     30400.11 (-5.57 pct)   *
>   128    30248.19 (0.00 pct)     30781.22 (1.76 pct)    [Verification run]
>   256    24641.52 (0.00 pct)     44940.70 (82.37 pct)
>   512    50806.96 (0.00 pct)     51937.08 (2.22 pct)
>  1024    51993.96 (0.00 pct)     52154.38 (0.30 pct)
> 
> NPS4
> 
> Clients:      tip                     SIS_UTIL
>     1    442.10 (0.00 pct)       449.20 (1.60 pct)
>     2    870.94 (0.00 pct)       875.15 (0.48 pct)
>     4    1615.30 (0.00 pct)      1636.92 (1.33 pct)
>     8    3195.95 (0.00 pct)      3222.69 (0.83 pct)
>    16    5937.41 (0.00 pct)      5705.23 (-3.91 pct)
>    32    11800.41 (0.00 pct)     11337.91 (-3.91 pct)
>    64    20844.71 (0.00 pct)     20123.99 (-3.45 pct)
>   128    31003.62 (0.00 pct)     30219.39 (-2.52 pct)
>   256    27476.37 (0.00 pct)     49333.89 (79.55 pct)
>   512    52276.72 (0.00 pct)     50807.17 (-2.81 pct)
>  1024    51372.10 (0.00 pct)     51566.42 (0.37 pct)
> 
> Note: tbench results for 256 workers are known to have
> run-to-run variation on the test machine. Any regression
> seen for this data point can be safely ignored.
> 
> ~~~~~~
> Stream
> ~~~~~~
> 
> - 10 runs
> 
> NPS1
> 
> Test:            tip                    SIS_UTIL
>  Copy:   152431.37 (0.00 pct)    165782.13 (8.75 pct)
> Scale:   187983.72 (0.00 pct)    180133.46 (-4.17 pct)
>   Add:   211713.09 (0.00 pct)    205588.71 (-2.89 pct)
> Triad:   207302.09 (0.00 pct)    201103.81 (-2.98 pct)
> 
> NPS2
> 
> Test:           tip                     SIS_UTIL
>  Copy:   134099.98 (0.00 pct)    146487.66 (9.23 pct)
> Scale:   168404.01 (0.00 pct)    180551.46 (7.21 pct)
>   Add:   184326.77 (0.00 pct)    197117.20 (6.93 pct)
> Triad:   182707.29 (0.00 pct)    195282.60 (6.88 pct)
> 
> NPS4
> 
> Test:            tip                    SIS_UTIL
>  Copy:   123058.63 (0.00 pct)    129624.17 (5.33 pct)
> Scale:   178696.74 (0.00 pct)    182611.49 (2.19 pct)
>   Add:   169836.95 (0.00 pct)    179869.80 (5.90 pct)
> Triad:   170036.21 (0.00 pct)    177249.46 (4.24 pct)
> 
> - 100 runs
> 
> NPS1
> 
> Test:            tip                     SIS_UTIL
>  Copy:   215860.05 (0.00 pct)    205953.63 (-4.58 pct)
> Scale:   207886.55 (0.00 pct)    203384.29 (-2.16 pct)
>   Add:   253513.05 (0.00 pct)    243351.95 (-4.00 pct)
> Triad:   239471.82 (0.00 pct)    232221.90 (-3.02 pct)
> 
> NPS2
> 
> Test:            tip                     SIS_UTIL
>  Copy:   223991.94 (0.00 pct)    217920.18 (-2.71 pct)
> Scale:   205631.20 (0.00 pct)    213060.40 (3.61 pct)
>   Add:   252292.90 (0.00 pct)    266848.26 (5.76 pct)
> Triad:   239838.71 (0.00 pct)    252369.51 (5.22 pct)
> 
> NPS4
> 
> Test:            tip                     SIS_UTIL
>  Copy:   225480.09 (0.00 pct)    218902.02 (-2.91 pct)
> Scale:   218218.59 (0.00 pct)    210839.93 (-3.38 pct)
>   Add:   273879.95 (0.00 pct)    261761.62 (-4.42 pct)
> Triad:   255765.98 (0.00 pct)    246971.11 (-3.43 pct)
> 
> ~~~~~~~~~~~~
> ycsb-mongodb
> ~~~~~~~~~~~~
> 
> NPS1
> 
> sched-tip:      301330.33 (var: 3.28)
> SIS_UTIL:       295360.33 (var: 0.76)    (-1.98%)
> 
> NPS2
> 
> sched-tip:      287786.00 (var: 4.24)
> SIS_UTIL:       288888.33 (var: 1.58)    (+0.38%)
> 
> NPS4
> 
> sched-tip:      293671.00 (var: 0.89)
> SIS_UTIL:       295682.33 (var: 0.92)    (+0.68%)
> 
> 
> ~~~~~
> Notes
> ~~~~~
> 
> o tbench reaches close to saturation at 256 clients, which was
>   previously an unreliable data point and usually showed a regression
>   compared to the result with 128 clients.
> o schbench improves for low worker counts. This is not strictly because
>   of SIS_UTIL.
> o The most serious regressions seen seem to reverse with a rerun,
>   suggesting some run-to-run variance in a few data points on tip as
>   well as with this patch.
> o Any small regressions or improvements seen are within the margin of
>   run-to-run variance seen on tip as well. The results seem to be
>   more stable with SIS_UTIL compared to SIS_PROP.
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> SIS Efficiency Stats for Hackbench
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Following are the system wide SIS Efficiency stats for SIS_PROP and SIS_UTIL
> when running hackbench with Mel's patch applied as is on both kernels:
> (https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/)
> 
> Metrics and the labels assigned for better readability
> 
> SIS Search		: Number of calls to select_idle_sibling
> SIS Domain Search	: Number of times the domain was searched (fast path failed)
> SIS Scanned		: Number of runqueues scanned
> SIS Failures		: Number of SIS calls that failed to find an idle CPU
> 
> SIS Logic:								   SIS_PROP	 SIS_UTIL	Diff (SIS_UTIL wrt SIS_PROP)
> 
> o 1-group
> 
> Benchmark Results (sec)                                    :                4.823         4.841		(-0.37 pct)
> Number of calls to select_idle_sibling                     :              3154397       3166395		(0.38 pct)
> Number of times the domain was searched (fast path failed) :               931530       1349865		(44.91 pct)
> Number of runqueues scanned                                :              7846894      11026784		(40.52 pct)
> Number of SIS calls that failed to find an idle CPU        :                76463        118968		(55.59 pct)
> Avg. No. of runqueues scanned per domain search            :                 8.42          8.16		(-3.09 pct)
> 
> o 2-groups
> 
> Benchmark Results (sec)                                    :                4.705         4.912		(-4.40 pct)
> Number of calls to select_idle_sibling                     :              3521182       4879821		(38.58 pct)
> Number of times the domain was searched (fast path failed) :              2049034       2979202		(45.40 pct)
> Number of runqueues scanned                                :             16717385      24743444		(48.01 pct)
> Number of SIS calls that failed to find an idle CPU        :               366643        241789		(-34.05 pct)
> Avg. No. of runqueues scanned per domain search            :                 8.15          8.30		(1.84 pct)
> 
> o 4-groups
> 
> Benchmark Results (sec)                                    :                5.503         5.268		(4.27 pct)
> Number of calls to select_idle_sibling                     :             13293368      11006088		(-17.21 pct)
> Number of times the domain was searched (fast path failed) :              5487436       4604635		(-16.09 pct)
> Number of runqueues scanned                                :             53028113      43238439		(-18.46 pct)
> Number of SIS calls that failed to find an idle CPU        :              1171727       1040776		(-11.18 pct)
> Avg. No. of runqueues scanned per domain search            :                 9.66          9.39		(-2.80 pct)
> 
> o 8-groups
> 
> Benchmark Results (sec)                                    :                5.794         5.752		(0.72 pct)
> Number of calls to select_idle_sibling                     :             26367244      24734896		(-6.19 pct)
> Number of times the domain was searched (fast path failed) :             11137288       9528659		(-14.44 pct)
> Number of runqueues scanned                                :            106216549      91895107		(-13.48 pct)
> Number of SIS calls that failed to find an idle CPU        :              3154674       3012751		(-4.50 pct)
> Avg. No. of runqueues scanned per domain search            :                 9.53          9.64		(1.15 pct)
> 
> o 16-groups
> 
> Benchmark Results (sec)                                    :                7.405         7.363		(0.57 pct)
> Number of calls to select_idle_sibling                     :             57323447      49331195		(-13.94 pct)
> Number of times the domain was searched (fast path failed) :             27853188      23892530		(-14.22 pct)
> Number of runqueues scanned                                :            248062785     180150761		(-27.38 pct)
> Number of SIS calls that failed to find an idle CPU        :             12182277      14125960		(15.96 pct)
> Avg. No. of runqueues scanned per domain search            :                 8.90          7.54		(-15.28 pct)
> 
> For 16 groups, when comparing SIS_UTIL to SIS_PROP, the
> "Avg. No. of runqueues scanned per domain search" goes down when we
> know there is a high chance we won't find an idle CPU, but it remains
> relatively high for lower group counts, where the opportunity to find
> idle CPUs is greater.
> 
> > 
> > [..snip..]
> >  
> >  #define NUMA_IMBALANCE_MIN 2
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index 1cf435bbcd9c..3334a1b93fc6 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> >   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> >   */
> >  SCHED_FEAT(SIS_PROP, true)
> 
> SIS_PROP was disabled in our testing as follows:
> 
> --
> -SCHED_FEAT(SIS_PROP, true)
> +SCHED_FEAT(SIS_PROP, false)
> --
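(As an aside, on a kernel built with CONFIG_SCHED_DEBUG the same toggle can
presumably be flipped at runtime instead of patching the source, assuming
debugfs is mounted at /sys/kernel/debug:

  echo NO_SIS_PROP > /sys/kernel/debug/sched/features
  echo SIS_UTIL > /sys/kernel/debug/sched/features

which avoids a rebuild between test runs.)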
> 
> > +SCHED_FEAT(SIS_UTIL, true)
> >  
> >  /*
> >   * Issue a WARN when we do multiple update_rq_clock() calls
> >
> > [..snip..]
> >
> 
> With v4 on the current tip, I don't see any need for
> a special case for systems with smaller LLCs with
> SIS_PROP disabled and SIS_UTIL enabled. Even SIS Efficiency
> seems to be better with SIS_UTIL for hackbench.
> 
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Thanks again. Would you mind if I add this test report link into the next
patch version?

thanks,
Chenyu 
> --
> Thanks and Regards,
> Prateek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-06-24  2:07   ` Chen Yu
@ 2022-06-24  4:04     ` K Prateek Nayak
  2022-06-24 12:27       ` Chen Yu
  0 siblings, 1 reply; 12+ messages in thread
From: K Prateek Nayak @ 2022-06-24  4:04 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Vincent Guittot, Mel Gorman, Ingo Molnar,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Barry Song,
	Srikar Dronamraju, Len Brown, Ben Segall, Aubrey Li, Abel Wu,
	Daniel Bristot de Oliveira, Tim Chen, linux-kernel, Yicong Yang,
	Mohini Narkhede

Hello Chenyu,

On 6/24/2022 7:37 AM, Chen Yu wrote:

>
> [..snip..]
>> With v4 on the current tip, I don't see any need for
>> a special case for systems with smaller LLCs with
>> SIS_PROP disabled and SIS_UTIL enabled. Even SIS Efficiency
>> seems to be better with SIS_UTIL for hackbench.
>>
>> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Thanks again. Would you mind if I add this test report link into the next
> > patch version?

Sure.
I'm assuming the next version will disable SIS_PROP and only
keep SIS_UTIL enabled, which is the same configuration we ran
during this round of testing. The results should stay the same :)
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-06-22  6:36 ` K Prateek Nayak
  2022-06-24  2:07   ` Chen Yu
@ 2022-06-24  7:29   ` Peter Zijlstra
  2022-06-24 12:18     ` Chen Yu
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2022-06-24  7:29 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Chen Yu, Vincent Guittot, Mel Gorman, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Barry Song, Srikar Dronamraju,
	Len Brown, Ben Segall, Aubrey Li, Abel Wu,
	Daniel Bristot de Oliveira, Tim Chen, linux-kernel, Yicong Yang,
	Mohini Narkhede

On Wed, Jun 22, 2022 at 12:06:55PM +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> I'm sorry for the delay. The testing took a while but below are
> the results from testing on our system.
> 
> tl;dr
> 
> o We ran all the tests with SIS_PROP disabled.
> o tbench reaches close to saturation early with 256 clients.
> o schbench shows improvements for low worker counts.
> o All other benchmark results seem comparable to tip.
>   We don't see any serious regressions with v4.
> 
> > @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> >   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> >   */
> >  SCHED_FEAT(SIS_PROP, true)
> 
> SIS_PROP was disabled in our testing as follows:
> 
> --
> -SCHED_FEAT(SIS_PROP, true)
> +SCHED_FEAT(SIS_PROP, false)

So how about I make this change.

> With v4 on the current tip, I don't see any need for
> a special case for systems with smaller LLCs with
> SIS_PROP disabled and SIS_UTIL enabled. Even SIS Efficiency
> seems to be better with SIS_UTIL for hackbench.
> 
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> 

And apply this thing, let's see how it fares..

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-06-24  7:29   ` Peter Zijlstra
@ 2022-06-24 12:18     ` Chen Yu
  0 siblings, 0 replies; 12+ messages in thread
From: Chen Yu @ 2022-06-24 12:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: K Prateek Nayak, Vincent Guittot, Mel Gorman, Ingo Molnar,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Barry Song,
	Srikar Dronamraju, Len Brown, Ben Segall, Aubrey Li, Abel Wu,
	Daniel Bristot de Oliveira, Tim Chen, linux-kernel, Yicong Yang,
	Mohini Narkhede

On Fri, Jun 24, 2022 at 09:29:58AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 22, 2022 at 12:06:55PM +0530, K Prateek Nayak wrote:
> > Hello Chenyu,
> > 
> > I'm sorry for the delay. The testing took a while but below are
> > the results from testing on our system.
> > 
> > tl;dr
> > 
> > o We ran all the tests with SIS_PROP disabled.
> > o tbench reaches close to saturation early with 256 clients.
> > o schbench shows improvements for low worker counts.
> > o All other benchmark results seem comparable to tip.
> >   We don't see any serious regressions with v4.
> > 
> > > @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> > >   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> > >   */
> > >  SCHED_FEAT(SIS_PROP, true)
> > 
> > SIS_PROP was disabled in our testing as follows:
> > 
> > --
> > -SCHED_FEAT(SIS_PROP, true)
> > +SCHED_FEAT(SIS_PROP, false)
> 
> So how about I make this change.
> 
> > With v4 on the current tip, I don't see any need for
> > a special case for systems with smaller LLCs with
> > SIS_PROP disabled and SIS_UTIL enabled. Even SIS Efficiency
> > seems to be better with SIS_UTIL for hackbench.
> > 
> > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> 
> 
> And apply this thing, lets see how it fares..

OK, thanks, Peter.


Best,
Chenyu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-06-24  4:04     ` K Prateek Nayak
@ 2022-06-24 12:27       ` Chen Yu
  0 siblings, 0 replies; 12+ messages in thread
From: Chen Yu @ 2022-06-24 12:27 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Vincent Guittot, Mel Gorman, Ingo Molnar,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Barry Song,
	Srikar Dronamraju, Len Brown, Ben Segall, Aubrey Li, Abel Wu,
	Daniel Bristot de Oliveira, Tim Chen, linux-kernel, Yicong Yang,
	Mohini Narkhede

On Fri, Jun 24, 2022 at 09:34:49AM +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 6/24/2022 7:37 AM, Chen Yu wrote:
> 
> >
> > [..snip..]>
> >> With v4 on the current tip, I don't see any need for
> >> a special case for systems with smaller LLCs with
> >> SIS_PROP disabled and SIS_UTIL enabled. Even SIS Efficiency
> >> seems to be better with SIS_UTIL for hackbench.
> >>
> >> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Thanks again. Would you mind if I add this test report link into the next
> > patch version?
> 
> Sure.
> I'm assuming the next version will disable SIS_PROP and only
> keep SIS_UTIL enabled, which is the same configuration we ran
> during this round of testing. The results should stay the same :)
> --
Yes, Peter has helped me change the default value of SIS_PROP, and with
the current link in the commit log, I assume we can find your data via it.

thanks,
Chenyu
> Thanks and Regards,
> Prateek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [tip: sched/core] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  2022-06-12 16:34 [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg Chen Yu
  2022-06-13  7:40 ` Yicong Yang
  2022-06-22  6:36 ` K Prateek Nayak
@ 2022-06-28  7:16 ` tip-bot2 for Chen Yu
  2 siblings, 0 replies; 12+ messages in thread
From: tip-bot2 for Chen Yu @ 2022-06-28  7:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Tim Chen, Peter Zijlstra, Chen Yu, Yicong Yang, Mohini Narkhede,
	K Prateek Nayak, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     70fb5ccf2ebb09a0c8ebba775041567812d45f86
Gitweb:        https://git.kernel.org/tip/70fb5ccf2ebb09a0c8ebba775041567812d45f86
Author:        Chen Yu <yu.c.chen@intel.com>
AuthorDate:    Mon, 13 Jun 2022 00:34:28 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 28 Jun 2022 09:08:30 +02:00

sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg

[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU,
when the system is overloaded.

The following histogram is the time spent in select_idle_cpu(),
when running 224 instances of netperf on a system with 112 CPUs
per LLC domain:

@usecs:
[0]                  533 |                                                    |
[1]                 5495 |                                                    |
[2, 4)             12008 |                                                    |
[4, 8)            239252 |                                                    |
[8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
[16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
[512, 1K)        2600472 |@@@@@@@@@                                           |
[1K, 2K)          927912 |@@@                                                 |
[2K, 4K)          218720 |                                                    |
[4K, 8K)           98161 |                                                    |
[8K, 16K)          37722 |                                                    |
[16K, 32K)          6715 |                                                    |
[32K, 64K)           477 |                                                    |
[64K, 128K)            7 |                                                    |

netperf latency usecs:
=======
case            	load    	    Lat_99th	    std%
TCP_RR          	thread-224	      257.39	(  0.21)

The time spent in select_idle_cpu() is visible to netperf and might have a negative
impact.

[Symptom analysis]
The patch [1] from Mel Gorman has been applied to track the efficiency
of select_idle_sibling. Copy the indicators here:

SIS Search Efficiency(se_eff%):
        A ratio expressed as a percentage of runqueues scanned versus
        idle CPUs found. A 100% efficiency indicates that the target,
        prev or recent CPU of a task was idle at wakeup. The lower the
        efficiency, the more runqueues were scanned before an idle CPU
        was found.

SIS Domain Search Efficiency(dom_eff%):
        Similar, except only for the slower SIS path.

SIS Fast Success Rate(fast_rate%):
        Percentage of SIS that used the target, prev or recent CPUs.

SIS Success rate(success_rate%):
        Percentage of scans that found an idle CPU.

The test is based on Aubrey's schedtests tool, including netperf, hackbench,
schbench and tbench.

Test on vanilla kernel:
schedstat_parse.py -f netperf_vanilla.log
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
TCP_RR	   28 threads	     99.978	      18.535	      99.995	     100.000
TCP_RR	   56 threads	     99.397	       5.671	      99.964	     100.000
TCP_RR	   84 threads	     21.721	       6.818	      73.632	     100.000
TCP_RR	  112 threads	     12.500	       5.533	      59.000	     100.000
TCP_RR	  140 threads	      8.524	       4.535	      49.020	     100.000
TCP_RR	  168 threads	      6.438	       3.945	      40.309	      99.999
TCP_RR	  196 threads	      5.397	       3.718	      32.320	      99.982
TCP_RR	  224 threads	      4.874	       3.661	      25.775	      99.767
UDP_RR	   28 threads	     99.988	      17.704	      99.997	     100.000
UDP_RR	   56 threads	     99.528	       5.977	      99.970	     100.000
UDP_RR	   84 threads	     24.219	       6.992	      76.479	     100.000
UDP_RR	  112 threads	     13.907	       5.706	      62.538	     100.000
UDP_RR	  140 threads	      9.408	       4.699	      52.519	     100.000
UDP_RR	  168 threads	      7.095	       4.077	      44.352	     100.000
UDP_RR	  196 threads	      5.757	       3.775	      35.764	      99.991
UDP_RR	  224 threads	      5.124	       3.704	      28.748	      99.860

schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
normal	   1   mthread	     99.152	       6.400	      99.941	     100.000
normal	   2   mthreads	     97.844	       4.003	      99.908	     100.000
normal	   3   mthreads	     96.395	       2.118	      99.917	      99.998
normal	   4   mthreads	     55.288	       1.451	      98.615	      99.804
normal	   5   mthreads	      7.004	       1.870	      45.597	      61.036
normal	   6   mthreads	      3.354	       1.346	      20.777	      34.230
normal	   7   mthreads	      2.183	       1.028	      11.257	      21.055
normal	   8   mthreads	      1.653	       0.825	       7.849	      15.549

schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case			load	        se_eff%	    dom_eff%	  fast_rate%	success_rate%
process-pipe	     1 group	         99.991	       7.692	      99.999	     100.000
process-pipe	    2 groups	         99.934	       4.615	      99.997	     100.000
process-pipe	    3 groups	         99.597	       3.198	      99.987	     100.000
process-pipe	    4 groups	         98.378	       2.464	      99.958	     100.000
process-pipe	    5 groups	         27.474	       3.653	      89.811	      99.800
process-pipe	    6 groups	         20.201	       4.098	      82.763	      99.570
process-pipe	    7 groups	         16.423	       4.156	      77.398	      99.316
process-pipe	    8 groups	         13.165	       3.920	      72.232	      98.828
process-sockets	     1 group	         99.977	       5.882	      99.999	     100.000
process-sockets	    2 groups	         99.927	       5.505	      99.996	     100.000
process-sockets	    3 groups	         99.397	       3.250	      99.980	     100.000
process-sockets	    4 groups	         79.680	       4.258	      98.864	      99.998
process-sockets	    5 groups	          7.673	       2.503	      63.659	      92.115
process-sockets	    6 groups	          4.642	       1.584	      58.946	      88.048
process-sockets	    7 groups	          3.493	       1.379	      49.816	      81.164
process-sockets	    8 groups	          3.015	       1.407	      40.845	      75.500
threads-pipe	     1 group	         99.997	       0.000	     100.000	     100.000
threads-pipe	    2 groups	         99.894	       2.932	      99.997	     100.000
threads-pipe	    3 groups	         99.611	       4.117	      99.983	     100.000
threads-pipe	    4 groups	         97.703	       2.624	      99.937	     100.000
threads-pipe	    5 groups	         22.919	       3.623	      87.150	      99.764
threads-pipe	    6 groups	         18.016	       4.038	      80.491	      99.557
threads-pipe	    7 groups	         14.663	       3.991	      75.239	      99.247
threads-pipe	    8 groups	         12.242	       3.808	      70.651	      98.644
threads-sockets	     1 group	         99.990	       6.667	      99.999	     100.000
threads-sockets	    2 groups	         99.940	       5.114	      99.997	     100.000
threads-sockets	    3 groups	         99.469	       4.115	      99.977	     100.000
threads-sockets	    4 groups	         87.528	       4.038	      99.400	     100.000
threads-sockets	    5 groups	          6.942	       2.398	      59.244	      88.337
threads-sockets	    6 groups	          4.359	       1.954	      49.448	      87.860
threads-sockets	    7 groups	          2.845	       1.345	      41.198	      77.102
threads-sockets	    8 groups	          2.871	       1.404	      38.512	      74.312

schedstat_parse.py -f tbench_vanilla.log
case			load	      se_eff%	    dom_eff%	  fast_rate%	success_rate%
loopback	  28 threads	       99.976	      18.369	      99.995	     100.000
loopback	  56 threads	       99.222	       7.799	      99.934	     100.000
loopback	  84 threads	       19.723	       6.819	      70.215	     100.000
loopback	 112 threads	       11.283	       5.371	      55.371	      99.999
loopback	 140 threads	        0.000	       0.000	       0.000	       0.000
loopback	 168 threads	        0.000	       0.000	       0.000	       0.000
loopback	 196 threads	        0.000	       0.000	       0.000	       0.000
loopback	 224 threads	        0.000	       0.000	       0.000	       0.000

According to the test above, if the system becomes busy, the
SIS Search Efficiency (se_eff%) drops significantly. Although some
benchmarks would finally find an idle CPU (success_rate% = 100%), it is
doubtful whether it is worth searching the whole LLC domain.

[Proposal]
It would be ideal to have a crystal ball to answer this question:
how many CPUs must a wakeup path walk down before it can find an idle
CPU? Many potential metrics could be used to predict the number.
One candidate is the sum of util_avg in this LLC domain. The benefit
of choosing util_avg is that it is a metric of accumulated historic
activity, which seems to be smoother than instantaneous metrics
(such as rq->nr_running). Besides, choosing the sum of util_avg
helps predict the load of the LLC domain more precisely than
SIS_PROP, which uses one CPU's idle time to estimate the whole
LLC domain's idle time.

In summary, the lower the util_avg is, the more select_idle_cpu()
should scan for idle CPUs, and vice versa. When the sum of util_avg
in this LLC domain hits 85% or above, the scan stops. The reason for
choosing 85% as the threshold is that it corresponds to the
imbalance_pct (117) at which an LLC sched group is considered
overloaded: 100 * 100 / 117 ~= 85.

Introduce the quadratic function:

y = SCHED_CAPACITY_SCALE - p * x^2
and y'= y / SCHED_CAPACITY_SCALE

x is the ratio of sum_util compared to the CPU capacity:
x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
y' is the ratio of CPUs to be scanned in the LLC domain,
and the number of CPUs to scan is calculated by:

nr_scan = llc_weight * y'

A quadratic function was chosen because:
[1] Compared to the linear function, it scans more aggressively when the
    sum_util is low.
[2] Compared to the exponential function, it is easier to calculate.
[3] It seems that there is no accurate mapping between the sum of util_avg
    and the number of CPUs to be scanned. Use heuristic scan for now.

For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
scan_nr   112  111  108  102  93  81  65   47   25    1    0 ...

For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
scan_nr    16   15   15   14  13  11   9    6    3    0    0 ...
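
As a cross-check, the two tables above can be reproduced with a minimal
user-space sketch of the equations, assuming SCHED_CAPACITY_SCALE = 1024
and imbalance_pct = 117; it mirrors the integer arithmetic of
update_idle_cpu_scan() in the diff below, and is not the kernel code itself:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024

/*
 * nr_scan for a given LLC size and sum_util expressed as a percent of
 * the LLC capacity; pct is the sched domain's imbalance_pct.
 */
static int nr_scan(long long llc_weight, long long sum_util_pct, long long pct)
{
	/* x' = sum_util / llc_weight, scaled by SCHED_CAPACITY_SCALE */
	long long x = sum_util_pct * SCHED_CAPACITY_SCALE / 100;
	/* y = SCHED_CAPACITY_SCALE - x'^2 * pct^2 / (10000 * SCHED_CAPACITY_SCALE) */
	long long tmp = x * x * pct * pct / (10000LL * SCHED_CAPACITY_SCALE);
	long long y;

	if (tmp > SCHED_CAPACITY_SCALE)
		tmp = SCHED_CAPACITY_SCALE;
	y = SCHED_CAPACITY_SCALE - tmp;

	/* nr_scan = llc_weight * y' */
	return (int)(y * llc_weight / SCHED_CAPACITY_SCALE);
}

int main(void)
{
	int utils[] = { 0, 5, 15, 25, 35, 45, 55, 65, 75, 85, 86 };
	int i;

	for (i = 0; i < 11; i++)
		printf("sum_util%% %3d -> nr_scan %3d\n",
		       utils[i], nr_scan(112, utils[i], 117));
	return 0;
}

Changing the first argument of nr_scan() from 112 to 16 reproduces the
16-CPUs-per-LLC table.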

Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance.
As mentioned by Abel, on a platform with 112 CPUs per LLC, the
sum_util calculated by periodic load balance after 112 ms would
decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75% (PELT roughly halves the
signal every 32 ms, and 112 ms spans about 3.5 such periods), thus
bringing a delay in reflecting the latest utilization. But it is a
trade-off. Checking the util_avg in newidle load balance would be more
frequent, but it brings overhead: multiple CPUs write/read the per-LLC
shared variable, which introduces cache contention. Tim also mentioned
that it is allowed to be non-optimal in terms of scheduling for
short-term variations, but if there is a long-term trend in the load
behavior, the scheduler can adjust for that.

When SIS_UTIL is enabled, select_idle_cpu() uses the nr_scan
calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
Mel suggested, SIS_UTIL should be enabled by default.

This patch is based on the util_avg, which is very sensitive to the
CPU frequency invariance. There is an issue that, when the max frequency
has been clamped, the util_avg would decay insanely fast when
the CPU is idle. Commit addca285120b ("cpufreq: intel_pstate: Handle no_turbo
in frequency invariance") could be used to mitigate this symptom, by adjusting
the arch_max_freq_ratio when turbo is disabled. But this issue is still
not thoroughly fixed, because the current code is unaware of the user-specified
max CPU frequency.

[Test result]

netperf and tbench were launched with 25% 50% 75% 100% 125% 150%
175% 200% of CPU number respectively. Hackbench and schbench were launched
by 1, 2, 4, 8 groups. Each test lasts for 100 seconds and repeats 3 times.

The following is the benchmark result comparison between
baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare%
indicates better performance.

Each netperf test is a:
netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100
netperf.throughput
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	28 threads	 1.00 (  0.34)	 -0.16 (  0.40)
TCP_RR          	56 threads	 1.00 (  0.19)	 -0.02 (  0.20)
TCP_RR          	84 threads	 1.00 (  0.39)	 -0.47 (  0.40)
TCP_RR          	112 threads	 1.00 (  0.21)	 -0.66 (  0.22)
TCP_RR          	140 threads	 1.00 (  0.19)	 -0.69 (  0.19)
TCP_RR          	168 threads	 1.00 (  0.18)	 -0.48 (  0.18)
TCP_RR          	196 threads	 1.00 (  0.16)	+194.70 ( 16.43)
TCP_RR          	224 threads	 1.00 (  0.16)	+197.30 (  7.85)
UDP_RR          	28 threads	 1.00 (  0.37)	 +0.35 (  0.33)
UDP_RR          	56 threads	 1.00 ( 11.18)	 -0.32 (  0.21)
UDP_RR          	84 threads	 1.00 (  1.46)	 -0.98 (  0.32)
UDP_RR          	112 threads	 1.00 ( 28.85)	 -2.48 ( 19.61)
UDP_RR          	140 threads	 1.00 (  0.70)	 -0.71 ( 14.04)
UDP_RR          	168 threads	 1.00 ( 14.33)	 -0.26 ( 11.16)
UDP_RR          	196 threads	 1.00 ( 12.92)	+186.92 ( 20.93)
UDP_RR          	224 threads	 1.00 ( 11.74)	+196.79 ( 18.62)

Take the 224-thread case as an example; the SIS search metric changes are
illustrated below:

    vanilla                    patched
   4544492          +237.5%   15338634        sched_debug.cpu.sis_domain_search.avg
     38539        +39686.8%   15333634        sched_debug.cpu.sis_failed.avg
  128300000          -87.9%   15551326        sched_debug.cpu.sis_scanned.avg
   5842896          +162.7%   15347978        sched_debug.cpu.sis_search.avg

There are 87.9% fewer CPU scans after the patch, which indicates lower overhead.
Besides, with this patch applied, there is 13% less rq lock contention
in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function.
This might help explain the performance improvement: this patch allows
the waking task to remain on the previous CPU, rather than grabbing other
CPUs' locks.

Each hackbench test is a:
hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100
hackbench.throughput
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1 group 	 1.00 (  1.29)	 +0.57 (  0.47)
process-pipe    	2 groups 	 1.00 (  0.27)	 +0.77 (  0.81)
process-pipe    	4 groups 	 1.00 (  0.26)	 +1.17 (  0.02)
process-pipe    	8 groups 	 1.00 (  0.15)	 -4.79 (  0.02)
process-sockets 	1 group 	 1.00 (  0.63)	 -0.92 (  0.13)
process-sockets 	2 groups 	 1.00 (  0.03)	 -0.83 (  0.14)
process-sockets 	4 groups 	 1.00 (  0.40)	 +5.20 (  0.26)
process-sockets 	8 groups 	 1.00 (  0.04)	 +3.52 (  0.03)
threads-pipe    	1 group 	 1.00 (  1.28)	 +0.07 (  0.14)
threads-pipe    	2 groups 	 1.00 (  0.22)	 -0.49 (  0.74)
threads-pipe    	4 groups 	 1.00 (  0.05)	 +1.88 (  0.13)
threads-pipe    	8 groups 	 1.00 (  0.09)	 -4.90 (  0.06)
threads-sockets 	1 group 	 1.00 (  0.25)	 -0.70 (  0.53)
threads-sockets 	2 groups 	 1.00 (  0.10)	 -0.63 (  0.26)
threads-sockets 	4 groups 	 1.00 (  0.19)	+11.92 (  0.24)
threads-sockets 	8 groups 	 1.00 (  0.08)	 +4.31 (  0.11)

Each tbench test is a:
tbench -t 100 $job 127.0.0.1
tbench.throughput
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	28 threads	 1.00 (  0.06)	 -0.14 (  0.09)
loopback        	56 threads	 1.00 (  0.03)	 -0.04 (  0.17)
loopback        	84 threads	 1.00 (  0.05)	 +0.36 (  0.13)
loopback        	112 threads	 1.00 (  0.03)	 +0.51 (  0.03)
loopback        	140 threads	 1.00 (  0.02)	 -1.67 (  0.19)
loopback        	168 threads	 1.00 (  0.38)	 +1.27 (  0.27)
loopback        	196 threads	 1.00 (  0.11)	 +1.34 (  0.17)
loopback        	224 threads	 1.00 (  0.11)	 +1.67 (  0.22)

Each schbench test is a:
schbench -m $job -t 28 -r 100 -s 30000 -c 30000
schbench.latency_90%_us
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1 mthread	 1.00 ( 31.22)	 -7.36 ( 20.25)*
normal          	2 mthreads	 1.00 (  2.45)	 -0.48 (  1.79)
normal          	4 mthreads	 1.00 (  1.69)	 +0.45 (  0.64)
normal          	8 mthreads	 1.00 (  5.47)	 +9.81 ( 14.28)

*Considering the standard deviation, this -7.36% regression might not be valid.

Also, an OLTP workload with a commercial RDBMS has been tested, and there
is no significant change.

There were concerns that unbalanced tasks among CPUs would cause problems.
For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are
bound to CPU0~CPU6, while CPU7 is idle:

          CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
util_avg  1024    1024    1024    1024    1024    1024    1024    0

Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%,
select_idle_cpu() will not scan, thus CPU7 goes undetected during the
scan. But according to Mel, it is unlikely that CPU7 will be idle all
the time, because CPU7 could pull some tasks via CPU_NEWLY_IDLE balancing.

lkp (kernel test robot) has reported a regression on stress-ng.sock on a
very busy system. According to the sched_debug statistics, it might be
caused by SIS_UTIL terminating the scan and choosing the previous CPU
earlier, which might introduce more context switches, especially
involuntary preemptions, impacting a busy stress-ng. This regression has
shown that not all benchmarks in every scenario benefit from the idle CPU
scan limit, and it needs further investigation.

Besides, there is a slight regression in hackbench's 16-groups case when
the LLC domain has 16 CPUs. Prateek mentioned that we should scan
aggressively in an LLC domain with 16 CPUs, because the cost of searching
for an idle one among 16 CPUs is negligible. The current patch aims to
propose a generic solution and only considers the util_avg. Something
like the below could be applied on top of the current patch to fulfill
the requirement:

	if (llc_weight <= 16)
		nr_scan = nr_scan * 32 / llc_weight;

For an LLC domain with 16 CPUs, nr_scan will be expanded to twice as
large. The fewer CPUs an LLC domain has, the more nr_scan will be
expanded. This needs further investigation.
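
As an illustration, under the same assumptions as the tables above, with
llc_weight = 16 and sum_util at 55%, nr_scan would grow from 9 to
9 * 32 / 16 = 18; since only 16 CPUs exist in the domain, this
effectively means a full scan.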

There is also ongoing work[2] from Abel to filter out the busy CPUs
during wakeup, to further speed up the idle CPU scan. It could be a
follow-up optimization on top of this change.

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
---
 include/linux/sched/topology.h |  1 +-
 kernel/sched/fair.c            | 87 +++++++++++++++++++++++++++++++++-
 kernel/sched/features.h        |  3 +-
 3 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe4..816df6c 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,6 +81,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	int		nr_idle_scan;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7400600..f80ae86 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6332,6 +6332,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	struct sched_domain_shared *sd_share;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
@@ -6371,6 +6372,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 		time = cpu_clock(this);
 	}
 
+	if (sched_feat(SIS_UTIL)) {
+		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
+		if (sd_share) {
+			/* because !--nr is the condition to stop scan */
+			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
+			/* overloaded LLC is unlikely to have idle cpu/core */
+			if (nr == 1)
+				return -1;
+		}
+	}
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -9224,6 +9236,77 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+static void update_idle_cpu_scan(struct lb_env *env,
+				 unsigned long sum_util)
+{
+	struct sched_domain_shared *sd_share;
+	int llc_weight, pct;
+	u64 x, y, tmp;
+	/*
+	 * Update the number of CPUs to scan in LLC domain, which could
+	 * be used as a hint in select_idle_cpu(). The update of sd_share
+	 * could be expensive because it is within a shared cache line.
+	 * So the write of this hint only occurs during periodic load
+	 * balancing, rather than CPU_NEWLY_IDLE, because the latter
+	 * can fire way more frequently than the former.
+	 */
+	if (!sched_feat(SIS_UTIL) || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	llc_weight = per_cpu(sd_llc_size, env->dst_cpu);
+	if (env->sd->span_weight != llc_weight)
+		return;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
+	if (!sd_share)
+		return;
+
+	/*
+	 * The number of CPUs to search drops as sum_util increases, when
+	 * sum_util hits 85% or above, the scan stops.
+	 * The reason to choose 85% as the threshold is because this is the
+	 * imbalance_pct(117) when a LLC sched group is overloaded.
+	 *
+	 * let y = SCHED_CAPACITY_SCALE - p * x^2                       [1]
+	 * and y'= y / SCHED_CAPACITY_SCALE
+	 *
+	 * x is the ratio of sum_util compared to the CPU capacity:
+	 * x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
+	 * y' is the ratio of CPUs to be scanned in the LLC domain,
+	 * and the number of CPUs to scan is calculated by:
+	 *
+	 * nr_scan = llc_weight * y'                                    [2]
+	 *
+	 * When x hits the threshold of overloaded, AKA, when
+	 * x = 100 / pct, y drops to 0. According to [1],
+	 * p should be SCHED_CAPACITY_SCALE * pct^2 / 10000
+	 *
+	 * Scale x by SCHED_CAPACITY_SCALE:
+	 * x' = sum_util / llc_weight;                                  [3]
+	 *
+	 * and finally [1] becomes:
+	 * y = SCHED_CAPACITY_SCALE -
+	 *     x'^2 * pct^2 / (10000 * SCHED_CAPACITY_SCALE)            [4]
+	 *
+	 */
+	/* equation [3] */
+	x = sum_util;
+	do_div(x, llc_weight);
+
+	/* equation [4] */
+	pct = env->sd->imbalance_pct;
+	tmp = x * x * pct * pct;
+	do_div(tmp, 10000 * SCHED_CAPACITY_SCALE);
+	tmp = min_t(long, tmp, SCHED_CAPACITY_SCALE);
+	y = SCHED_CAPACITY_SCALE - tmp;
+
+	/* equation [2] */
+	y *= llc_weight;
+	do_div(y, SCHED_CAPACITY_SCALE);
+	if ((int)y != sd_share->nr_idle_scan)
+		WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -9236,6 +9319,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
+	unsigned long sum_util = 0;
 	int sg_status = 0;
 
 	do {
@@ -9268,6 +9352,7 @@ next_group:
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
 
+		sum_util += sgs->group_util;
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
@@ -9293,6 +9378,8 @@ next_group:
 		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
 		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
 	}
+
+	update_idle_cpu_scan(env, sum_util);
 }
 
 /**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1cf435b..ee7f23c 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -60,7 +60,8 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_PROP, false)
+SCHED_FEAT(SIS_UTIL, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
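
One note on the consumer side: in the select_idle_cpu() hunk above, the
hint is read as nr = READ_ONCE(sd_share->nr_idle_scan) + 1 because !--nr
is the scan loop's stop condition, so exactly nr_idle_scan runqueues are
inspected before the search gives up. A minimal stand-alone sketch of the
idiom (not the kernel loop itself, which also walks a cpumask and handles
has_idle_core):

#include <stdio.h>

int main(void)
{
	int nr_idle_scan = 4;		/* hint published by periodic load balance */
	int nr = nr_idle_scan + 1;	/* + 1 compensates for the pre-decrement */
	int cpu;

	for (cpu = 0; cpu < 112; cpu++) {
		if (!--nr)		/* bails after nr_idle_scan iterations */
			break;
		printf("inspecting cpu %d\n", cpu);
	}
	return 0;
}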

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-06-28  7:16 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-12 16:34 [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg Chen Yu
2022-06-13  7:40 ` Yicong Yang
2022-06-13  8:06   ` Chen Yu
2022-06-13  8:54     ` Mel Gorman
2022-06-13  9:08       ` Chen Yu
2022-06-22  6:36 ` K Prateek Nayak
2022-06-24  2:07   ` Chen Yu
2022-06-24  4:04     ` K Prateek Nayak
2022-06-24 12:27       ` Chen Yu
2022-06-24  7:29   ` Peter Zijlstra
2022-06-24 12:18     ` Chen Yu
2022-06-28  7:16 ` [tip: sched/core] " tip-bot2 for Chen Yu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.