linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance
@ 2023-06-12 16:17 Chen Yu
  2023-06-12 16:18 ` [RFC PATCH 1/4] sched/fair: Extract the function to get the sd_llc_shared Chen Yu
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Chen Yu @ 2023-06-12 16:17 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
  Cc: Tim Chen, Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
	Gautham R . Shenoy, Len Brown, Chen Yu, Yicong Yang,
	linux-kernel, Chen Yu

Hi,

This is an attempt to reduce the cost of newidle balance, which was
found to occupy noticeable CPU cycles on some high core count systems.

For example, by running sqlite on Intel Sapphire Rapids, which has
2 x 56C/112T = 224 CPUs:

6.69%    0.09%  sqlite3     [kernel.kallsyms]   [k] newidle_balance
5.39%    4.71%  sqlite3     [kernel.kallsyms]   [k] update_sd_lb_stats

The main idea comes from the following question raised by Tim:
Do we always have to find the busiest group and pull from it? Would
a relatively busy group be enough?

The ILB_UTIL proposal mainly adjusts the newidle balance scan depth
within the current sched domain, based on the utilization of that
domain. The more spare capacity there is in the domain, the more time
each newidle balance can spend on scanning for a busy group. Although
newidle balance already has a per-domain max_newidle_lb_cost to decide
whether to launch the balance or not, ILB_UTIL provides a finer
granularity to decide how many groups each newidle balance can scan.
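
For a rough sense of the scaling (patch 3/4 computes
nr_groups_to_scan = nr_groups * (1 - util_ratio)): assuming the LLC of the
Sapphire Rapids system above has 56 sched groups (one per core), a newidle
balance at about 50% utilization would scan roughly 28 groups, and at about
75% utilization only roughly 14, instead of all 56 groups every time.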

patch 1/4 is code cleanup.

patch 2/4 is to introduce a new variable in the sched domain to indicate
          the number of groups, which will be used by patch 3 and patch 4.

patch 3/4 is to calculate the scan depth in each periodic load balance.

patch 4/4 is to limit the scan depth based on the result of patch 3,
          and the depth will be used by newidle_balance()->
          find_busiest_group() -> update_sd_lb_stats()


According to the test results, netperf/tbench shows some improvements
when the system is underloaded, while there is no noticeable difference
for hackbench/schbench. While I'm still running more benchmarks, including
some macro-benchmarks, I'm sending this draft patch out to seek suggestions
from the community on whether this is the right thing to do and whether we
are heading in the right direction.

[We also have other wild ideas, like sorting the groups by their load
in the periodic load balance so that a later newidle_balance() can fetch
the corresponding group in O(1). This change also seems to show an
improvement according to the test results.]

Any comments would be appreciated.

Chen Yu (4):
  sched/fair: Extract the function to get the sd_llc_shared
  sched/topology: Introduce nr_groups in sched_domain to indicate the
    number of groups
  sched/fair: Calculate the scan depth for idle balance based on system
    utilization
  sched/fair: Throttle the busiest group scanning in idle load balance

 include/linux/sched/topology.h |  5 +++
 kernel/sched/fair.c            | 74 +++++++++++++++++++++++++++++-----
 kernel/sched/features.h        |  1 +
 kernel/sched/topology.c        | 10 ++++-
 4 files changed, 79 insertions(+), 11 deletions(-)

-- 
2.25.1



* [RFC PATCH 1/4] sched/fair: Extract the function to get the sd_llc_shared
  2023-06-12 16:17 [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance Chen Yu
@ 2023-06-12 16:18 ` Chen Yu
  2023-06-12 16:18 ` [RFC PATCH 2/4] sched/topology: Introduce nr_groups in sched_domain to indicate the number of groups Chen Yu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Chen Yu @ 2023-06-12 16:18 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
  Cc: Tim Chen, Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
	Gautham R . Shenoy, Len Brown, Chen Yu, Yicong Yang,
	linux-kernel, Chen Yu

Introduce get_llc_shared() to get the sd_llc_shared of dst_cpu if the
current domain is within the LLC. Let SIS_UTIL be the first user of this
function. This prepares for later use by ILB_UTIL.

No functional change is intended.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6189d1a45635..b3a24aead848 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10043,10 +10043,21 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/* Get the LLC shared information of dst CPU if doing balance in LLC */
+static struct sched_domain_shared *get_llc_shared(struct lb_env *env)
+{
+	struct sched_domain_shared *sd_share = NULL;
+
+	if (per_cpu(sd_llc_size, env->dst_cpu) == env->sd->span_weight)
+		sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
+
+	return sd_share;
+}
+
 static void update_idle_cpu_scan(struct lb_env *env,
-				 unsigned long sum_util)
+				 unsigned long sum_util,
+				 struct sched_domain_shared *sd_share)
 {
-	struct sched_domain_shared *sd_share;
 	int llc_weight, pct;
 	u64 x, y, tmp;
 	/*
@@ -10060,14 +10071,11 @@ static void update_idle_cpu_scan(struct lb_env *env,
 	if (!sched_feat(SIS_UTIL) || env->idle == CPU_NEWLY_IDLE)
 		return;
 
-	llc_weight = per_cpu(sd_llc_size, env->dst_cpu);
-	if (env->sd->span_weight != llc_weight)
-		return;
-
-	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
 	if (!sd_share)
 		return;
 
+	llc_weight = per_cpu(sd_llc_size, env->dst_cpu);
+
 	/*
 	 * The number of CPUs to search drops as sum_util increases, when
 	 * sum_util hits 85% or above, the scan stops.
@@ -10122,6 +10130,7 @@ static void update_idle_cpu_scan(struct lb_env *env,
 
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
+	struct sched_domain_shared *sd_share = get_llc_shared(env);
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
@@ -10190,7 +10199,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
 	}
 
-	update_idle_cpu_scan(env, sum_util);
+	update_idle_cpu_scan(env, sum_util, sd_share);
 }
 
 /**
-- 
2.25.1



* [RFC PATCH 2/4] sched/topology: Introduce nr_groups in sched_domain to indicate the number of groups
  2023-06-12 16:17 [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance Chen Yu
  2023-06-12 16:18 ` [RFC PATCH 1/4] sched/fair: Extract the function to get the sd_llc_shared Chen Yu
@ 2023-06-12 16:18 ` Chen Yu
  2023-06-12 16:18 ` [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization Chen Yu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Chen Yu @ 2023-06-12 16:18 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
  Cc: Tim Chen, Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
	Gautham R . Shenoy, Len Brown, Chen Yu, Yicong Yang,
	linux-kernel, Chen Yu

Record the number of sched groups within each sched domain. Prepare for
newidle_balance() scan depth calculation.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/topology.c        | 10 ++++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..1faececd5694 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -152,6 +152,7 @@ struct sched_domain {
 	struct sched_domain_shared *shared;
 
 	unsigned int span_weight;
+	unsigned int nr_groups;
 	/*
 	 * Span of all CPUs in this domain.
 	 *
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ca4472281c28..255606e88956 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1023,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 	struct cpumask *covered = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
-	int i;
+	int i, nr_groups = 0;
 
 	cpumask_clear(covered);
 
@@ -1087,6 +1087,8 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!sg)
 			goto fail;
 
+		nr_groups++;
+
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
 
@@ -1100,6 +1102,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		last->next = first;
 	}
 	sd->groups = first;
+	sd->nr_groups = nr_groups;
 
 	return 0;
 
@@ -1233,7 +1236,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 	struct sd_data *sdd = sd->private;
 	const struct cpumask *span = sched_domain_span(sd);
 	struct cpumask *covered;
-	int i;
+	int i, nr_groups = 0;
 
 	lockdep_assert_held(&sched_domains_mutex);
 	covered = sched_domains_tmpmask;
@@ -1248,6 +1251,8 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 
 		sg = get_group(i, sdd);
 
+		nr_groups++;
+
 		cpumask_or(covered, covered, sched_group_span(sg));
 
 		if (!first)
@@ -1258,6 +1263,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 	}
 	last->next = first;
 	sd->groups = first;
+	sd->nr_groups = nr_groups;
 
 	return 0;
 }
-- 
2.25.1



* [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization
  2023-06-12 16:17 [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance Chen Yu
  2023-06-12 16:18 ` [RFC PATCH 1/4] sched/fair: Extract the function to get the sd_llc_shared Chen Yu
  2023-06-12 16:18 ` [RFC PATCH 2/4] sched/topology: Introduce nr_groups in sched_domain to indicate the number of groups Chen Yu
@ 2023-06-12 16:18 ` Chen Yu
  2023-06-15  6:01   ` Gautham R. Shenoy
  2023-06-21 11:17   ` Peter Zijlstra
  2023-06-12 16:19 ` [RFC PATCH 4/4] sched/fair: Throttle the busiest group scanning in idle load balance Chen Yu
  2023-06-15  4:22 ` [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance Gautham R. Shenoy
  4 siblings, 2 replies; 17+ messages in thread
From: Chen Yu @ 2023-06-12 16:18 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
  Cc: Tim Chen, Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
	Gautham R . Shenoy, Len Brown, Chen Yu, Yicong Yang,
	linux-kernel, Chen Yu

When a CPU is about to enter idle, it invokes newidle_balance() to pull
some tasks from other runqueues. Although there is a per-domain
max_newidle_lb_cost to throttle newidle_balance(), it would be
good to further limit the scan based on overall system utilization.
The reason is that there is nothing to prevent newidle_balance() from
being launched simultaneously on multiple CPUs. Since each
newidle_balance() has to traverse all the CPUs to calculate the
statistics one by one, the total time cost of newidle_balance()
could be O(n^2). This is not good for performance or power saving.

For example, sqlite has spent quite some time in newidle_balance()
on Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
6.69%    0.09%  sqlite3     [kernel.kallsyms]   [k] newidle_balance
5.39%    4.71%  sqlite3     [kernel.kallsyms]   [k] update_sd_lb_stats

Based on this observation, limit the scan depth of newidle_balance()
by considering the utilization of the LLC domain. Let the number of
scanned groups be a linear function of the utilization ratio:

nr_groups_to_scan = nr_groups * (1 - util_ratio)
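
For example, with illustrative numbers and the integer arithmetic used
by the patch below (SCHED_CAPACITY_SCALE = 1024): on an LLC domain with
span_weight = 112 and nr_groups = 56, a sum_util of 86016 gives
util_ratio = 86016 / (112 * 1024) = 0.75, so

nr_scan = 56 - 56 * 86016 / (112 * 1024) = 56 - 42 = 14

and the next newidle balance would scan at most 14 of the 56 groups.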

Besides, save the total_load and total_capacity of the current
sched domain in each periodic load balance. These statistics
can be reused later by the CPU_NEWLY_IDLE load balance if it quits
the scan early. Introduce a sched feature ILB_UTIL to
control this.

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched/topology.h |  4 ++++
 kernel/sched/fair.c            | 34 ++++++++++++++++++++++++++++++++++
 kernel/sched/features.h        |  1 +
 3 files changed, 39 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 1faececd5694..d7b2bac9bdf3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -82,6 +82,10 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+	/* ilb scan depth and load balance statistic snapshot */
+	int		ilb_nr_scan;
+	unsigned long ilb_total_load;
+	unsigned long ilb_total_capacity;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3a24aead848..f999e838114e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10122,6 +10122,39 @@ static void update_idle_cpu_scan(struct lb_env *env,
 		WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
 }
 
+static void update_ilb_group_scan(struct lb_env *env,
+				  unsigned long sum_util,
+				  struct sched_domain_shared *sd_share,
+				  struct sd_lb_stats *sds)
+{
+	u64 tmp, nr_scan;
+
+	if (!sched_feat(ILB_UTIL) || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	if (!sd_share)
+		return;
+	/*
+	 * Limit the newidle balance scan depth based on overall system
+	 * utilization:
+	 * nr_groups_scan = nr_groups * (1 - util_ratio)
+	 * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
+	 */
+	nr_scan = env->sd->nr_groups * sum_util;
+	tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
+	do_div(nr_scan, tmp);
+	nr_scan = env->sd->nr_groups - nr_scan;
+	if ((int)nr_scan != sd_share->ilb_nr_scan)
+		WRITE_ONCE(sd_share->ilb_nr_scan, (int)nr_scan);
+
+	/* Also save the statistic snapshot of the periodic load balance */
+	if (sds->total_load != sd_share->ilb_total_load)
+		WRITE_ONCE(sd_share->ilb_total_load, sds->total_load);
+
+	if (sds->total_capacity != sd_share->ilb_total_capacity)
+		WRITE_ONCE(sd_share->ilb_total_capacity, sds->total_capacity);
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10200,6 +10233,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 
 	update_idle_cpu_scan(env, sum_util, sd_share);
+	update_ilb_group_scan(env, sum_util, sd_share, sds);
 }
 
 /**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..8f6e5b08408d 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,6 +85,7 @@ SCHED_FEAT(RT_PUSH_IPI, true)
 
 SCHED_FEAT(RT_RUNTIME_SHARE, false)
 SCHED_FEAT(LB_MIN, false)
+SCHED_FEAT(ILB_UTIL, true)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
 SCHED_FEAT(WA_IDLE, true)
-- 
2.25.1



* [RFC PATCH 4/4] sched/fair: Throttle the busiest group scanning in idle load balance
  2023-06-12 16:17 [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance Chen Yu
                   ` (2 preceding siblings ...)
  2023-06-12 16:18 ` [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization Chen Yu
@ 2023-06-12 16:19 ` Chen Yu
  2023-06-15  4:22 ` [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance Gautham R. Shenoy
  4 siblings, 0 replies; 17+ messages in thread
From: Chen Yu @ 2023-06-12 16:19 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
  Cc: Tim Chen, Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
	Gautham R . Shenoy, Len Brown, Chen Yu, Yicong Yang,
	linux-kernel, Chen Yu, kernel test robot

Scanning the whole sched domain to find the busiest group is costly
during newidle_balance(). Limit newidle_balance() to scanning only a
limited number of sched groups, to find a relatively busy group and
pull from it. The scan depth is suggested by the previous periodic
load balance, based on the overall utilization it observed.

Tested on top of sched/core v6.4-rc1,
Sapphire Rapids with 2 x 56C/112T = 224 CPUs,
cpufreq governor is set to performance, and C6 is disabled.

Overall, some improvements were observed in tbench/netperf when the
system is underloaded, with no noticeable difference in hackbench/schbench.
The percentage of time spent in newidle_balance() dropped accordingly.

[netperf]

Launches $nr instances of:
netperf -4 -H 127.0.0.1 -t $work_mode -c -C -l 100 &

nr: 56, 112, 168, 224, 280, 336, 392, 448
work_mode: TCP_RR UDP_RR

throughput
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  56-threads       1.00 (  1.98)  +18.45 ( 10.84)
TCP_RR                  112-threads      1.00 (  3.79)   +0.92 (  4.72)
TCP_RR                  168-threads      1.00 (  5.40)   -0.09 (  5.94)
TCP_RR                  224-threads      1.00 ( 42.40)   -1.37 ( 41.42)
TCP_RR                  280-threads      1.00 ( 15.95)   -0.30 ( 14.82)
TCP_RR                  336-threads      1.00 ( 27.84)   -0.08 ( 28.91)
TCP_RR                  392-threads      1.00 ( 41.85)   -0.56 ( 39.18)
TCP_RR                  448-threads      1.00 ( 45.95)   +1.62 ( 52.54)
UDP_RR                  56-threads       1.00 (  8.41)   +4.86 (  5.54)
UDP_RR                  112-threads      1.00 (  9.11)   -0.68 (  9.92)
UDP_RR                  168-threads      1.00 ( 10.48)   -0.15 ( 10.07)
UDP_RR                  224-threads      1.00 ( 40.98)   -3.80 ( 40.01)
UDP_RR                  280-threads      1.00 ( 23.50)   -0.53 ( 23.42)
UDP_RR                  336-threads      1.00 ( 35.87)   +0.38 ( 33.43)
UDP_RR                  392-threads      1.00 ( 49.47)   -0.27 ( 44.40)
UDP_RR                  448-threads      1.00 ( 53.09)   +1.81 ( 59.98)

[tbench]
tbench -t 100 $job 127.0.0.1
job: 56, 112, 168, 224, 280, 336, 392, 448

throughput
=========
loopback                56-threads       1.00 (  1.12)   +1.41 (  0.43)
loopback                112-threads      1.00 (  0.43)   +0.30 (  0.73)
loopback                168-threads      1.00 (  6.88)   -5.73 (  7.74)
loopback                224-threads      1.00 ( 12.99)  +31.32 (  0.22)
loopback                280-threads      1.00 (  0.38)   -0.94 (  0.99)
loopback                336-threads      1.00 (  0.13)   +0.06 (  0.18)
loopback                392-threads      1.00 (  0.06)   -0.09 (  0.16)
loopback                448-threads      1.00 (  0.10)   -0.13 (  0.18)

[hackbench]

hackbench -g $job --$work_type --pipe -l 200000 -s 100 -f 28
and
hackbench -g $job --$work_type -l 200000 -s 100 -f 28

job: 1, 2, 4, 8
work_type: process threads

throughput
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            1-groups         1.00 (  6.09)   +2.61 (  9.27)
process-pipe            2-groups         1.00 (  7.15)   +6.22 (  5.59)
process-pipe            4-groups         1.00 (  3.40)   +2.01 (  5.45)
process-pipe            8-groups         1.00 (  0.44)   -1.57 (  0.70)
process-sockets         1-groups         1.00 (  0.69)   +0.86 (  1.84)
process-sockets         2-groups         1.00 (  5.04)   -6.31 (  0.60)
process-sockets         4-groups         1.00 (  0.22)   +0.01 (  0.75)
process-sockets         8-groups         1.00 (  0.49)   +0.46 (  0.49)
threads-pipe            1-groups         1.00 (  1.96)   -4.86 (  6.90)
threads-pipe            2-groups         1.00 (  3.02)   +0.21 (  2.72)
threads-pipe            4-groups         1.00 (  4.83)   -1.08 (  6.41)
threads-pipe            8-groups         1.00 (  3.86)   +4.19 (  3.82)
threads-sockets         1-groups         1.00 (  2.20)   +1.65 (  1.85)
threads-sockets         2-groups         1.00 (  3.09)   -0.36 (  2.14)
threads-sockets         4-groups         1.00 (  0.99)   -2.54 (  1.86)
threads-sockets         8-groups         1.00 (  0.27)   -0.01 (  0.79)

[schbench]

schbench -m $job -t 56 -r 30
job: 1, 2, 4, 8
3 iterations

99.0th latency
========
case            	load    	baseline(std%)	compare%( std%)
normal                  1-mthreads       1.00 (  1.10)   +0.88 (  0.84)
normal                  2-mthreads       1.00 (  0.73)   +0.00 (  0.73)
normal                  4-mthreads       1.00 (  1.46)   -0.60 (  2.74)
normal                  8-mthreads       1.00 (  4.09)   +1.08 (  4.60)

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f999e838114e..272e6c224b96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10168,7 +10168,12 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
 	unsigned long sum_util = 0;
-	int sg_status = 0;
+	int sg_status = 0, nr_scan_ilb;
+	bool ilb_util_enabled = sched_feat(ILB_UTIL) && env->idle == CPU_NEWLY_IDLE &&
+	    sd_share && READ_ONCE(sd_share->ilb_total_capacity);
+
+	if (ilb_util_enabled)
+		nr_scan_ilb = sd_share->ilb_nr_scan;
 
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
@@ -10186,6 +10191,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 		update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
 
+		if (ilb_util_enabled && --nr_scan_ilb <= 0) {
+			/* borrow the statistic of previous periodic load balance */
+			sds->total_load = READ_ONCE(sd_share->ilb_total_load);
+			sds->total_capacity = READ_ONCE(sd_share->ilb_total_capacity);
+
+			break;
+		}
+
 		if (local_group)
 			goto next_group;
 
-- 
2.25.1



* Re: [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance
  2023-06-12 16:17 [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance Chen Yu
                   ` (3 preceding siblings ...)
  2023-06-12 16:19 ` [RFC PATCH 4/4] sched/fair: Throttle the busiest group scanning in idle load balance Chen Yu
@ 2023-06-15  4:22 ` Gautham R. Shenoy
  4 siblings, 0 replies; 17+ messages in thread
From: Gautham R. Shenoy @ 2023-06-15  4:22 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
	Tim Chen, Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
	Len Brown, Chen Yu, Yicong Yang, linux-kernel

Hello Chen Yu,


On Tue, Jun 13, 2023 at 12:17:53AM +0800, Chen Yu wrote:
> Hi,
> 
> This is an attempt to reduce the cost of newidle balance which is
> found to occupy noticeable CPU cycles on some high-core count systems.
> 
> For example, by running sqlite on Intel Sapphire Rapids, which has
> 2 x 56C/112T = 224 CPUs:
> 
> 6.69%    0.09%  sqlite3     [kernel.kallsyms]   [k] newidle_balance
> 5.39%    4.71%  sqlite3     [kernel.kallsyms]   [k] update_sd_lb_stats
> 
> The main idea comes from the following question raised by Tim:
> Do we always have to find the busiest group and pull from it? Would
> a relatively busy group be enough?
> 
> The proposal ILB_UTIL mainly adjusts the newidle balance scan depth
> within the current sched domain, based on the system utilization in
> this domain. The more spare time there is in the domain, the more time
> each newidle balance can spend on scanning for a busy group. Although
> the newidle balance has per domain max_newidle_lb_cost to decide
> whether to launch the balance or not, the ILB_UTIL provides a smaller
> granularity to decide how many groups each newidle balance can scan.

Thanks for the patchset. This is an interesting approach to minimise
the time spent in newidle balance when the system is relatively
busy. In the past we have seen that newly idle CPUs whose avg idle
duration is slightly higher than the max_newidle_balance_cost spend
quite a bit of time trying to find the busiest group/cpu, only to not
find any task to pull. But that is a lost opportunity to go idle. I
suppose this patch is targeted towards SMT --> MC(DIE) kinds of
topologies where we have a lot of groups in the DIE domain, but it
would be interesting to see if this can help when there are fewer
groups in each sched-domain.
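
For reference, the existing per-domain gate is roughly the following
(paraphrased from newidle_balance(); the exact code may differ):

	if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
		break;

so once avg_idle is just above that threshold, the full
update_sd_lb_stats() walk still runs, which is the cost this series is
trying to bound.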

Will try this on our setup.

> 
> patch 1/4 is code cleanup.
> 
> patch 2/4 is to introduce a new variable in sched domain to indicate the
>           number of groups, and will be used by patch 3 and patch 4.
> 
> patch 3/4 is to calculate the scan depth in each periodic load balance.
> 
> patch 4/4 is to limit the scan depth based on the result of patch 3,
>           and the depth will be used by newidle_balance()->
>           find_busiest_group() -> update_sd_lb_stats()
> 
> 
> According to the test result, netperf/tbench shows some improvements
> when the system is underloaded, while no noticeable difference from
> hackbench/schbench. While I'm trying to run more benchmarks including
> some macro-benchmarks, I send this draft patch out and seek for suggestion
> from the community if this is the right thing to do and if we are in the
> right direction.
> 
> [We also have other wild ideas like sorting the groups by their load
> in the periodic load balance, later newidle_balance() can fetch the
> corresponding group in O(1). And this change seems to get improvement
> too according to the test result].
> 
> Any comments would be appreciated.
> 
> Chen Yu (4):
>   sched/fair: Extract the function to get the sd_llc_shared
>   sched/topology: Introduce nr_groups in sched_domain to indicate the
>     number of groups
>   sched/fair: Calculate the scan depth for idle balance based on system
>     utilization
>   sched/fair: Throttle the busiest group scanning in idle load balance
> 
>  include/linux/sched/topology.h |  5 +++
>  kernel/sched/fair.c            | 74 +++++++++++++++++++++++++++++-----
>  kernel/sched/features.h        |  1 +
>  kernel/sched/topology.c        | 10 ++++-
>  4 files changed, 79 insertions(+), 11 deletions(-)
> 

--
Thanks and Regards
gautham.


* Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization
  2023-06-12 16:18 ` [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization Chen Yu
@ 2023-06-15  6:01   ` Gautham R. Shenoy
  2023-06-16  6:17     ` Chen Yu
                       ` (2 more replies)
  2023-06-21 11:17   ` Peter Zijlstra
  1 sibling, 3 replies; 17+ messages in thread
From: Gautham R. Shenoy @ 2023-06-15  6:01 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
	Tim Chen, Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
	Len Brown, Chen Yu, Yicong Yang, linux-kernel

Hello Chen Yu,


On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
> When CPU is about to enter idle, it invokes newidle_balance() to pull
> some tasks from other runqueues. Although there is per domain
> max_newidle_lb_cost to throttle the newidle_balance(), it would be
> good to further limit the scan based on overall system utilization.
> The reason is that there is no limitation for newidle_balance() to
> launch this balance simultaneously on multiple CPUs. Since each
> newidle_balance() has to traverse all the CPUs to calculate the
> statistics one by one, this total time cost on newidle_balance()
> could be O(n^2). This is not good for performance or power saving.
> 
> For example, sqlite has spent quite some time on newidle balance()
> on Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
> 6.69%    0.09%  sqlite3     [kernel.kallsyms]   [k] newidle_balance
> 5.39%    4.71%  sqlite3     [kernel.kallsyms]   [k] update_sd_lb_stats
> 
> Based on this observation, limit the scan depth of newidle_balance()
> by considering the utilization of the LLC domain. Let the number of
> scanned groups be a linear function of the utilization ratio:
>

Is there any particular reason why this is being limited only to the
LLC domain ?

On architectures where the LLC domain may not be so large (POWER9/10,
AMD), the additional cost is usually paid at the higher domains, where
the number of groups is greater than or equal to the number of groups in
the LLC domain and where the sd_span is pretty large. It would be good to
explore avoiding the scan cost on those domains as well, right?

> nr_groups_to_scan = nr_groups * (1 - util_ratio)

If we can extend this logic to higher domains, on a Zen3 1 Socket
server with 128 CPUs at the DIE domain containing 8 groups, we can
expect a significant reduction in the time spent doing
update_sg_lb_stats() at higher utilizations.

util_ratio     nr_groups_to_scan        nr_cpus_scanned
========================================================
0.9              1                       16     (-87.5%)
0.75             2                       32     (-75%)
0.5              4                       64     (-50%)
0.25             6                       96     (-25%)
0.1              7                      112     (-12.5%) 


On a Zen 4 1 socket server with 192 CPUs at the DIE domain containing
12 groups, values will be:

util_ratio     nr_groups_to_scan        nr_cpus_scanned
========================================================
0.9              1                       16     (-91%)
0.75             3                       48     (-75%)
0.5              6                       96     (-50%)
0.25             9                      144     (-25%)
0.1             10                      160     (-16.7%)
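
(In both cases each group at the DIE domain is one 16-CPU LLC, i.e.
128/8 and 192/12 CPUs per group, so nr_cpus_scanned above is simply
nr_groups_to_scan * 16.)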

> 
> Besides, save the total_load, total_capacity of the current
> sched domain in each periodic load balance. This statistic
> can be reused later by CPU_NEWLY_IDLE load balance if it quits
> the scan earlier. Introduce a sched feature ILB_UTIL to
> control this.
> 
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>  include/linux/sched/topology.h |  4 ++++
>  kernel/sched/fair.c            | 34 ++++++++++++++++++++++++++++++++++
>  kernel/sched/features.h        |  1 +
>  3 files changed, 39 insertions(+)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 1faececd5694..d7b2bac9bdf3 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -82,6 +82,10 @@ struct sched_domain_shared {
>  	atomic_t	nr_busy_cpus;
>  	int		has_idle_cores;
>  	int		nr_idle_scan;
> +	/* ilb scan depth and load balance statistic snapshot */
> +	int		ilb_nr_scan;
> +	unsigned long ilb_total_load;
> +	unsigned long ilb_total_capacity;
>  };
>  
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b3a24aead848..f999e838114e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10122,6 +10122,39 @@ static void update_idle_cpu_scan(struct lb_env *env,
>  		WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
>  }
>  
> +static void update_ilb_group_scan(struct lb_env *env,
> +				  unsigned long sum_util,
> +				  struct sched_domain_shared *sd_share,
> +				  struct sd_lb_stats *sds)
> +{
> +	u64 tmp, nr_scan;
> +
> +	if (!sched_feat(ILB_UTIL) || env->idle == CPU_NEWLY_IDLE)
> +		return;
> +
> +	if (!sd_share)
> +		return;
> +	/*
> +	 * Limit the newidle balance scan depth based on overall system
> +	 * utilization:
> +	 * nr_groups_scan = nr_groups * (1 - util_ratio)
> +	 * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
> +	 */
> +	nr_scan = env->sd->nr_groups * sum_util;
> +	tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
> +	do_div(nr_scan, tmp);
> +	nr_scan = env->sd->nr_groups - nr_scan;
> +	if ((int)nr_scan != sd_share->ilb_nr_scan)
> +		WRITE_ONCE(sd_share->ilb_nr_scan, (int)nr_scan);
> +
> +	/* Also save the statistic snapshot of the periodic load balance */
> +	if (sds->total_load != sd_share->ilb_total_load)
> +		WRITE_ONCE(sd_share->ilb_total_load, sds->total_load);
> +
> +	if (sds->total_capacity != sd_share->ilb_total_capacity)
> +		WRITE_ONCE(sd_share->ilb_total_capacity, sds->total_capacity);
> +}
> +
>  /**
>   * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
>   * @env: The load balancing environment.
> @@ -10200,6 +10233,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	}
>  
>  	update_idle_cpu_scan(env, sum_util, sd_share);
> +	update_ilb_group_scan(env, sum_util, sd_share, sds);
>  }
>  
>  /**
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c76bd3..8f6e5b08408d 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -85,6 +85,7 @@ SCHED_FEAT(RT_PUSH_IPI, true)
>  
>  SCHED_FEAT(RT_RUNTIME_SHARE, false)
>  SCHED_FEAT(LB_MIN, false)
> +SCHED_FEAT(ILB_UTIL, true)
>  SCHED_FEAT(ATTACH_AGE_LOAD, true)
>  
>  SCHED_FEAT(WA_IDLE, true)
> -- 
> 2.25.1
> 


* Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization
  2023-06-15  6:01   ` Gautham R. Shenoy
@ 2023-06-16  6:17     ` Chen Yu
  2023-06-21  7:29     ` Chen Yu
  2023-06-21 11:20     ` Peter Zijlstra
  2 siblings, 0 replies; 17+ messages in thread
From: Chen Yu @ 2023-06-16  6:17 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
	Tim Chen, Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
	Len Brown, Chen Yu, Yicong Yang, linux-kernel

Hi Gautham,
On 2023-06-15 at 11:31:07 +0530, Gautham R. Shenoy wrote:
> Hello Chen Yu,
> 
> 
> On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
> > When CPU is about to enter idle, it invokes newidle_balance() to pull
> > some tasks from other runqueues. Although there is per domain
> > max_newidle_lb_cost to throttle the newidle_balance(), it would be
> > good to further limit the scan based on overall system utilization.
> > The reason is that there is no limitation for newidle_balance() to
> > launch this balance simultaneously on multiple CPUs. Since each
> > newidle_balance() has to traverse all the CPUs to calculate the
> > statistics one by one, this total time cost on newidle_balance()
> > could be O(n^2). This is not good for performance or power saving.
> > 
> > For example, sqlite has spent quite some time on newidle balance()
> > on Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
> > 6.69%    0.09%  sqlite3     [kernel.kallsyms]   [k] newidle_balance
> > 5.39%    4.71%  sqlite3     [kernel.kallsyms]   [k] update_sd_lb_stats
> > 
> > Based on this observation, limit the scan depth of newidle_balance()
> > by considering the utilization of the LLC domain. Let the number of
> > scanned groups be a linear function of the utilization ratio:
> >
> 
> Is there any particular reason why this is being limited only to the
> LLC domain ?
> 
Thank you for your interest in this change. The original thought, as you
might know, is that our LLC has a huge number of groups.
And since SIS_UTIL has done a similar thing for the LLC, we wanted to
propose a simple prototype to demonstrate the idea.
> On architectures where the LLC domain may not be so large (POWER9/10,
> AMD), the additional cost is usually paid at the higher domains where
> the number of groups is greater / equal to the number of groups in the
> LLC domain and where sd_span is pretty large. It would be good to
> explore avoiding the scan cost on those domains as well, right?
> 
Exactly.
> > nr_groups_to_scan = nr_groups * (1 - util_ratio)
> 
> If we can extend this logic to higher domains, on a Zen3 1 Socket
> server with 128 CPUs at the DIE domain containing 8 groups, we can
> expect a significant reduction in the time spent doing
> update_sg_lb_stats() at higher utilizations.
> 
> util_ratio     nr_groups_to_scan        nr_cpus_scanned
> ========================================================
> 0.9              1                       16     (-87.5%)
> 0.75             2                       32     (-75%)
> 0.5              4                       64     (-50%)
> 0.25             6                       96     (-25%)
> 0.1              7                      112     (-12.5%) 
> 
> 
> On a Zen 4 1 socket server with 192 CPUs at the DIE domain containing
> 12 groups, values will be:
> 
> util_ratio     nr_groups_to_scan        nr_cpus_scanned
> ========================================================
> 0.9              1                       16     (-91%)
> 0.75             3                       48     (-75%)
> 0.5              6                       96     (-50%)
> 0.25             9                      144     (-25%)
> 0.1             10                      160     (-16.7%)
> 
Thanks for this information. I think we can extend the scan limitation
not only to the LLC (MC domain) but also to higher domains. I'll think
about it.

thanks,
Chenyu


* Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization
  2023-06-15  6:01   ` Gautham R. Shenoy
  2023-06-16  6:17     ` Chen Yu
@ 2023-06-21  7:29     ` Chen Yu
  2023-06-22  6:01       ` Gautham R. Shenoy
  2023-07-10 11:06       ` K Prateek Nayak
  2023-06-21 11:20     ` Peter Zijlstra
  2 siblings, 2 replies; 17+ messages in thread
From: Chen Yu @ 2023-06-21  7:29 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
	Tim Chen, Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
	Len Brown, Chen Yu, Yicong Yang, linux-kernel

Hi Gautham,
On 2023-06-15 at 11:31:07 +0530, Gautham R. Shenoy wrote:
> Hello Chen Yu,
> 
> 
> On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
> > When CPU is about to enter idle, it invokes newidle_balance() to pull
> > some tasks from other runqueues. Although there is per domain
> > max_newidle_lb_cost to throttle the newidle_balance(), it would be
> > good to further limit the scan based on overall system utilization.
> > The reason is that there is no limitation for newidle_balance() to
> > launch this balance simultaneously on multiple CPUs. Since each
> > newidle_balance() has to traverse all the CPUs to calculate the
> > statistics one by one, this total time cost on newidle_balance()
> > could be O(n^2). This is not good for performance or power saving.
> > 
> > For example, sqlite has spent quite some time on newidle balance()
> > on Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
> > 6.69%    0.09%  sqlite3     [kernel.kallsyms]   [k] newidle_balance
> > 5.39%    4.71%  sqlite3     [kernel.kallsyms]   [k] update_sd_lb_stats
> > 
> > Based on this observation, limit the scan depth of newidle_balance()
> > by considering the utilization of the LLC domain. Let the number of
> > scanned groups be a linear function of the utilization ratio:
> >
> 
> Is there any particular reason why this is being limited only to the
> LLC domain ?
> 
> On architectures where the LLC domain may not be so large (POWER9/10,
> AMD), the additional cost is usually paid at the higher domains where
> the number of groups is greater / equal to the number of groups in the
> LLC domain and where sd_span is pretty large. It would be good to
> explore avoiding the scan cost on those domains as well, right?
> 
> > nr_groups_to_scan = nr_groups * (1 - util_ratio)
> 
> If we can extend this logic to higher domains, on a Zen3 1 Socket
> server with 128 CPUs at the DIE domain containing 8 groups, we can
> expect a significant reduction in the time spent doing
> update_sg_lb_stats() at higher utilizations.
> 
> util_ratio     nr_groups_to_scan        nr_cpus_scanned
> ========================================================
> 0.9              1                       16     (-87.5%)
> 0.75             2                       32     (-75%)
> 0.5              4                       64     (-50%)
> 0.25             6                       96     (-25%)
> 0.1              7                      112     (-12.5%) 
> 
> 
> On a Zen 4 1 socket server with 192 CPUs at the DIE domain containing
> 12 groups, values will be:
> 
> util_ratio     nr_groups_to_scan        nr_cpus_scanned
> ========================================================
> 0.9              1                       16     (-91%)
> 0.75             3                       48     (-75%)
> 0.5              6                       96     (-50%)
> 0.25             9                      144     (-25%)
> 0.1             10                      160     (-16.7%)
>
I have an idea to limit the scan depth of newidle balance for big domains.
These domains should have a CPU count higher than or equal to that of the
LLC (MC domain). However, it seems that in the current kernel only domains
with the SD_SHARE_PKG_RESOURCES flag set have the shared struct
sched_domain_shared among the CPUs in the domain. This is reasonable
because the cost to access the struct sched_domain_shared is lower if
the CPUs share a cache. Since ILB_UTIL relies on the sched_domain_shared
to get the scan depth, I removed the restriction of SD_SHARE_PKG_RESOURCES
during the sched_domain_shared assignment.
If a non-LLC domain's sched_domain_shared is only used for ILB_UTIL,
the overhead should not be too high (only the periodic load balance would
write to sched_domain_shared). Here is an untested patch which shows what
I'm thinking of, and I'll do some refinement based on it:

thanks,
Chenyu

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 67b573d5bf28..ce7ffbb7b3f8 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -82,6 +82,10 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+	/* ilb scan depth and load balance statistic snapshot */
+	int		ilb_nr_scan;
+	unsigned long ilb_total_load;
+	unsigned long ilb_total_capacity;
 };
 
 struct sched_domain {
@@ -152,6 +156,7 @@ struct sched_domain {
 	struct sched_domain_shared *shared;
 
 	unsigned int span_weight;
+	unsigned int nr_groups;
 	/*
 	 * Span of all CPUs in this domain.
 	 *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d724215826ae..34619dbb2f4e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10162,6 +10162,54 @@ static void update_idle_cpu_scan(struct lb_env *env,
 		WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
 }
 
+/*
+ * Get the domain shared information of dst CPU.
+ */
+static struct sched_domain_shared *get_sd_shared(struct lb_env *env)
+{
+	/*
+	 * Do not consider the domains smaller than LLC because those
+	 * small domains have low cost on idle load balance.
+	 */
+	if (env->sd->span_weight < per_cpu(sd_llc_size, env->dst_cpu))
+		return NULL;
+
+	return env->sd->shared;
+}
+
+static void update_ilb_group_scan(struct lb_env *env,
+				  unsigned long sum_util,
+				  struct sched_domain_shared *sd_share,
+				  struct sd_lb_stats *sds)
+{
+	u64 tmp, nr_scan;
+
+	if (!sched_feat(ILB_UTIL) || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	if (!sd_share)
+		return;
+	/*
+	 * Limit the newidle balance scan depth based on overall system
+	 * utilization:
+	 * nr_groups_scan = nr_groups * (1 - util_ratio)
+	 * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
+	 */
+	nr_scan = env->sd->nr_groups * sum_util;
+	tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
+	do_div(nr_scan, tmp);
+	nr_scan = env->sd->nr_groups - nr_scan;
+	if ((int)nr_scan != sd_share->ilb_nr_scan)
+		WRITE_ONCE(sd_share->ilb_nr_scan, (int)nr_scan);
+
+	/* save the statistic snapshot of the periodic load balance */
+	if (sds->total_load != sd_share->ilb_total_load)
+		WRITE_ONCE(sd_share->ilb_total_load, sds->total_load);
+
+	if (sds->total_capacity != sd_share->ilb_total_capacity)
+		WRITE_ONCE(sd_share->ilb_total_capacity, sds->total_capacity);
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10170,11 +10218,17 @@ static void update_idle_cpu_scan(struct lb_env *env,
 
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
+	struct sched_domain_shared *sd_share = get_sd_shared(env);
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
 	unsigned long sum_util = 0;
-	int sg_status = 0;
+	int sg_status = 0, nr_scan_ilb;
+	bool ilb_util_enabled = sched_feat(ILB_UTIL) && env->idle == CPU_NEWLY_IDLE &&
+	    sd_share && READ_ONCE(sd_share->ilb_total_capacity);
+
+	if (ilb_util_enabled)
+		nr_scan_ilb = sd_share->ilb_nr_scan;
 
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
@@ -10192,6 +10246,17 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 		update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
 
+		if (ilb_util_enabled && --nr_scan_ilb <= 0) {
+			/*
+			 * Borrow the statistic of previous periodic load balance.
+			 * The data might be stale and it is a trade-off.
+			 */
+			sds->total_load = READ_ONCE(sd_share->ilb_total_load);
+			sds->total_capacity = READ_ONCE(sd_share->ilb_total_capacity);
+
+			break;
+		}
+
 		if (local_group)
 			goto next_group;
 
@@ -10239,6 +10304,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 
 	update_idle_cpu_scan(env, sum_util);
+	update_ilb_group_scan(env, sum_util, sd_share, sds);
 }
 
 /**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..8f6e5b08408d 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,6 +85,7 @@ SCHED_FEAT(RT_PUSH_IPI, true)
 
 SCHED_FEAT(RT_RUNTIME_SHARE, false)
 SCHED_FEAT(LB_MIN, false)
+SCHED_FEAT(ILB_UTIL, true)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
 SCHED_FEAT(WA_IDLE, true)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d3a3b2646ec4..98bfac5f7836 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1023,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 	struct cpumask *covered = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
-	int i;
+	int i, nr_groups = 0;
 
 	cpumask_clear(covered);
 
@@ -1087,6 +1087,8 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!sg)
 			goto fail;
 
+		nr_groups++;
+
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
 
@@ -1100,6 +1102,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		last->next = first;
 	}
 	sd->groups = first;
+	sd->nr_groups = nr_groups;
 
 	return 0;
 
@@ -1233,7 +1236,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 	struct sd_data *sdd = sd->private;
 	const struct cpumask *span = sched_domain_span(sd);
 	struct cpumask *covered;
-	int i;
+	int i, nr_groups = 0;
 
 	lockdep_assert_held(&sched_domains_mutex);
 	covered = sched_domains_tmpmask;
@@ -1248,6 +1251,8 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 
 		sg = get_group(i, sdd);
 
+		nr_groups++;
+
 		cpumask_or(covered, covered, sched_group_span(sg));
 
 		if (!first)
@@ -1258,6 +1263,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 	}
 	last->next = first;
 	sd->groups = first;
+	sd->nr_groups = nr_groups;
 
 	return 0;
 }
@@ -1641,14 +1647,12 @@ sd_init(struct sched_domain_topology_level *tl,
 	}
 
 	/*
-	 * For all levels sharing cache; connect a sched_domain_shared
+	 * For all levels; connect a sched_domain_shared
 	 * instance.
 	 */
-	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
-		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
-		atomic_inc(&sd->shared->ref);
-		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-	}
+	sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+	atomic_inc(&sd->shared->ref);
+	atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
 
 	sd->private = sdd;
 
-- 
2.25.1



* Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization
  2023-06-12 16:18 ` [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization Chen Yu
  2023-06-15  6:01   ` Gautham R. Shenoy
@ 2023-06-21 11:17   ` Peter Zijlstra
  2023-06-23 14:33     ` Chen Yu
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2023-06-21 11:17 UTC (permalink / raw)
  To: Chen Yu
  Cc: Vincent Guittot, Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman,
	Dietmar Eggemann, K Prateek Nayak, Abel Wu, Gautham R . Shenoy,
	Len Brown, Chen Yu, Yicong Yang, linux-kernel

On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
> When CPU is about to enter idle, it invokes newidle_balance() to pull
> some tasks from other runqueues. Although there is per domain
> max_newidle_lb_cost to throttle the newidle_balance(), it would be
> good to further limit the scan based on overall system utilization.
> The reason is that there is no limitation for newidle_balance() to
> launch this balance simultaneously on multiple CPUs. Since each
> newidle_balance() has to traverse all the CPUs to calculate the
> statistics one by one, this total time cost on newidle_balance()
> could be O(n^2). This is not good for performance or power saving.

Another possible solution is to keep struct sg_lb_stats in
sd->child->shared (below the NUMA domains) and put a lock around it.

Then have update_sd_lb_stats() do something like:

	struct sg_lb_stats *sgs = &sds->sgs;

	if (raw_spin_trylock(&sds->sg_lock)) {
		struct sg_lb_stats tmp;

		... collect tmp

		sds->sgs = tmp;
		raw_spin_unlock(&sds->sg_lock);
	}

	... use sgs

Then you know you've always got a 'recent' copy but avoid the concurrent
updates.


* Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization
  2023-06-15  6:01   ` Gautham R. Shenoy
  2023-06-16  6:17     ` Chen Yu
  2023-06-21  7:29     ` Chen Yu
@ 2023-06-21 11:20     ` Peter Zijlstra
  2 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2023-06-21 11:20 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Chen Yu, Vincent Guittot, Ingo Molnar, Juri Lelli, Tim Chen,
	Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
	Len Brown, Chen Yu, Yicong Yang, linux-kernel

On Thu, Jun 15, 2023 at 11:31:07AM +0530, Gautham R. Shenoy wrote:

> Is there any particular reason why this is being limited only to the
> LLC domain ?

Yeah, agreed, this should be uniform across the domains below NUMA.
Above that we already have SD_SERIALIZE anyway -- and the interval is
fairly high.


* Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization
  2023-06-21  7:29     ` Chen Yu
@ 2023-06-22  6:01       ` Gautham R. Shenoy
  2023-07-10 11:06       ` K Prateek Nayak
  1 sibling, 0 replies; 17+ messages in thread
From: Gautham R. Shenoy @ 2023-06-22  6:01 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
	Tim Chen, Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
	Len Brown, Chen Yu, Yicong Yang, linux-kernel

On Wed, Jun 21, 2023 at 03:29:27PM +0800, Chen Yu wrote:
> Hi Gautham,

[..snip..]

> >
> I have an idea to limit scan depth for newidle balance for big domains.
> These domains should have CPUs higher than/equals to LLC(MC domain).
> However it seems that in current kernel only domain with SD_SHARE_PKG_RESOURCES
> flag set will have the shared struct sched_domain_shared among the CPUs in this
> domain. And this is reasonable because the cost to access the struct sched_domain_shared
> is lower if the CPUs share cache.

Yes, this was a conscious choice. But now, we may need to explore
using it and consider the trade-off between cachelines bouncing across
LLCs while updating sd_shared on higher domains (I would still prefer
limiting it to below the NUMA domain!) vs having to recompute stuff at
the higher domains, if something has been recently computed for this
level by some other CPU in the domain.

> Since ILB_UTIL relies on the sched_domain_shared
> to get the scan depth, I removed the restriction of SD_SHARE_PKG_RESOURCES
> during sched_domain_shared assignment.

That's what Prateek was exploring as well. I wonder however, if we
should have sd_shared defined only for non-NUMA domains.

> If non-LLC domain's sched_domain_shared is only used for ILB_UTIL,
> the overhead should be not too high(only periodic load balance would
> write to sched_domain_shared).
> Here is a untest patch which shows what
> I'm thinking of, and I'll do some refinement based on this:

Thanks a lot for the patch. I will add it to our test queue.

--
Thanks and Regards
gautham.


* Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization
  2023-06-21 11:17   ` Peter Zijlstra
@ 2023-06-23 14:33     ` Chen Yu
  2023-06-23 14:43       ` Chen Yu
  0 siblings, 1 reply; 17+ messages in thread
From: Chen Yu @ 2023-06-23 14:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman,
	Dietmar Eggemann, K Prateek Nayak, Abel Wu, Gautham R . Shenoy,
	Len Brown, Chen Yu, Yicong Yang, linux-kernel

Hi Peter,
On 2023-06-21 at 13:17:21 +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
> > When CPU is about to enter idle, it invokes newidle_balance() to pull
> > some tasks from other runqueues. Although there is per domain
> > max_newidle_lb_cost to throttle the newidle_balance(), it would be
> > good to further limit the scan based on overall system utilization.
> > The reason is that there is no limitation for newidle_balance() to
> > launch this balance simultaneously on multiple CPUs. Since each
> > newidle_balance() has to traverse all the CPUs to calculate the
> > statistics one by one, this total time cost on newidle_balance()
> > could be O(n^2). This is not good for performance or power saving.
> 
> Another possible solution is to keep struct sg_lb_stats in
> sd->child->shared (below the NUMA domains) and put a lock around it.
> 
> Then have update_sd_lb_stats() do something like:
> 
> 	struct sg_lb_stats *sgs = &sds->sgs;
> 
> 	if (raw_spin_trylock(&sds->sg_lock)) {
> 		struct sg_lb_stats tmp;
> 
> 		... collect tmp
> 
> 		sds->sgs = tmp;
> 		raw_spin_unlock(&sds->sg_lock);
> 	}
> 
> 	... use sgs
> 
> Then you know you've always got a 'recent' copy but avoid the concurrent
> updates.
Thanks for taking a look and giving the suggestions! Yes, this is a good idea:
by doing this we can further limit the number of CPUs doing the update in
parallel, and allow the newidle CPU to reuse the idle load balance data from
others.
This lock only allows one CPU in that domain to iterate over the whole group
list, and the bottleneck might be how fast the CPU that grabs the lock can
finish collecting the tmp sgs data. For the MC domain it would not take too
much time, and for higher domains between the MC and NUMA domains it depends
on how many CPUs there are in that domain. I'll create one prototype based on
your suggestion and measure the test data.
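
What I have in mind is roughly the following untested sketch (the member
names are made up for illustration, and it glosses over how the per-group
sg_lb_stats get folded into sd_lb_stats):

	/* assumed new members of struct sched_domain_shared */
	raw_spinlock_t		sgs_lock;
	struct sg_lb_stats	sgs_snapshot;	/* last published stats */

	/* in update_sd_lb_stats(), with sds_shared = sd->child->shared */
	if (raw_spin_trylock(&sds_shared->sgs_lock)) {
		struct sg_lb_stats tmp;

		/* ... walk all the groups and collect tmp, as today ... */

		sds_shared->sgs_snapshot = tmp;
		raw_spin_unlock(&sds_shared->sgs_lock);
	}
	/*
	 * Use sds_shared->sgs_snapshot afterwards: either the copy we just
	 * wrote, or a recent one published by another CPU.
	 */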

thanks,
Chenyu


* Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization
  2023-06-23 14:33     ` Chen Yu
@ 2023-06-23 14:43       ` Chen Yu
  0 siblings, 0 replies; 17+ messages in thread
From: Chen Yu @ 2023-06-23 14:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman,
	Dietmar Eggemann, K Prateek Nayak, Abel Wu, Gautham R . Shenoy,
	Len Brown, Chen Yu, Yicong Yang, linux-kernel

On 2023-06-23 at 22:33:23 +0800, Chen Yu wrote:
> Hi Peter,
> On 2023-06-21 at 13:17:21 +0200, Peter Zijlstra wrote:
> > On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
> > > When CPU is about to enter idle, it invokes newidle_balance() to pull
> > > some tasks from other runqueues. Although there is per domain
> > > max_newidle_lb_cost to throttle the newidle_balance(), it would be
> > > good to further limit the scan based on overall system utilization.
> > > The reason is that there is no limitation for newidle_balance() to
> > > launch this balance simultaneously on multiple CPUs. Since each
> > > newidle_balance() has to traverse all the CPUs to calculate the
> > > statistics one by one, this total time cost on newidle_balance()
> > > could be O(n^2). This is not good for performance or power saving.
> > 
> > Another possible solution is to keep struct sg_lb_stats in
> > sd->child->shared (below the NUMA domains) and put a lock around it.
> > 
> > Then have update_sd_lb_stats() do something like:
> > 
> > 	struct sg_lb_stats *sgs = &sds->sgs;
> > 
> > 	if (raw_spin_trylock(&sds->sg_lock)) {
> > 		struct sg_lb_stats tmp;
> > 
> > 		... collect tmp
> > 
> > 		sds->sgs = tmp;
> > 		raw_spin_unlock(&sds->sg_lock);
> > 	}
> > 
> > 	... use sgs
> > 
> > Then you know you've always got a 'recent' copy but avoid the concurrent
> > updates.
> Thanks for taking a look and giving the suggestions! Yes, this is a good idea: by
> doing this we can further limit the number of CPUs doing the update in parallel, and
> allow the newidle CPU to reuse the idle load balance data collected by others.
> This lock only allows 1 CPU in that domain to iterate over the whole group list, and the
> bottleneck might depend on how fast the CPU that grabs the lock can finish
> collecting the tmp sgs data. For the MC domain it would not take too much time, and for
> higher domains between MC and NUMA it depends on how many CPUs there are in that
> domain.
I just realized that it's a trylock, so it should not block other CPUs that launch
the idle balance, but just lets 1 CPU update the 'snapshot' at a time.
I'll do some tests.
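
Something like the sketch below is roughly what I plan to test: cache the collected
statistics in the domain's sched_domain_shared and only let the CPU that wins the
trylock refresh them. This is only an illustration: the sg_lock and sgs fields are
made-up names that do not exist in struct sched_domain_shared today, and the group
walk is abbreviated.

	/* new fields in struct sched_domain_shared (made-up names) */
	raw_spinlock_t		sg_lock;
	struct sg_lb_stats	sgs;		/* last published snapshot */

	/* in update_sd_lb_stats(), for domains below NUMA */
	struct sched_domain_shared *shared = env->sd->shared;

	if (shared && raw_spin_trylock(&shared->sg_lock)) {
		struct sg_lb_stats tmp = {};

		/*
		 * Iterate over the groups and collect the stats into tmp,
		 * exactly as update_sd_lb_stats() does today.
		 */

		shared->sgs = tmp;
		raw_spin_unlock(&shared->sg_lock);
	}

	/* use shared->sgs, which may be a slightly stale snapshot */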

thanks,
Chenyu
> I'll create a prototype based on your suggestion and measure the test data.
> 
> thanks,
> Chenyu

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization
  2023-06-21  7:29     ` Chen Yu
  2023-06-22  6:01       ` Gautham R. Shenoy
@ 2023-07-10 11:06       ` K Prateek Nayak
  2023-07-10 15:37         ` Chen Yu
  1 sibling, 1 reply; 17+ messages in thread
From: K Prateek Nayak @ 2023-07-10 11:06 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
	Tim Chen, Mel Gorman, Dietmar Eggemann, Abel Wu,
	Gautham R. Shenoy, Len Brown, Chen Yu, Yicong Yang, linux-kernel

Hello Chenyu,

Thank you for sharing this extended version. Sharing the results from
testing below.

tl;dr

- tbench, netperf and unixbench-spawn see an improvement with ILB_UTIL.

- schbench (old) sees a regression in tail latency once the system is heavily
  loaded. DeathStarBench and SPECjbb also see a small drop under those
  conditions.

- Rest of the benchmark results do not vary much.

On 6/21/2023 12:59 PM, Chen Yu wrote:
> Hi Gautham,
> On 2023-06-15 at 11:31:07 +0530, Gautham R. Shenoy wrote:
>> Hello Chen Yu,
>>
>>
>> On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
>>> When CPU is about to enter idle, it invokes newidle_balance() to pull
>>> some tasks from other runqueues. Although there is per domain
>>> max_newidle_lb_cost to throttle the newidle_balance(), it would be
>>> good to further limit the scan based on overall system utilization.
>>> The reason is that there is no limitation for newidle_balance() to
>>> launch this balance simultaneously on multiple CPUs. Since each
>>> newidle_balance() has to traverse all the CPUs to calculate the
>>> statistics one by one, this total time cost on newidle_balance()
>>> could be O(n^2). This is not good for performance or power saving.
>>>
>>> For example, sqlite has spent quite some time on newidle balance()
>>> on Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
>>> 6.69%    0.09%  sqlite3     [kernel.kallsyms]   [k] newidle_balance
>>> 5.39%    4.71%  sqlite3     [kernel.kallsyms]   [k] update_sd_lb_stats
>>>
>>> Based on this observation, limit the scan depth of newidle_balance()
>>> by considering the utilization of the LLC domain. Let the number of
>>> scanned groups be a linear function of the utilization ratio:
>>>
>>
>> Is there any particular reason why this is being limited only to the
>> LLC domain ?
>>
>> On architectures where the LLC domain may not be so large (POWER9/10,
>> AMD), the additional cost is usually paid at the higher domains where
>> the number of groups is greater / equal to the number of groups in the
>> LLC domain and where sd_span is pretty large. It would be good to
>> explore avoiding the scan cost on those domains as well, right?
>>
>>> nr_groups_to_scan = nr_groups * (1 - util_ratio)
>>
>> If we can extend this logic to higher domains, on a Zen3 1 Socket
>> server with 128 CPUs at the DIE domain containing 8 groups, we can
>> expect a significant reduction in the time spent doing
>> update_sg_lb_stats() at higher utilizations.
>>
>> util_ratio     nr_groups_to_scan        nr_cpus_scanned
>> ========================================================
>> 0.9              1                       16     (-87.5%)
>> 0.75             2                       32     (-75%)
>> 0.5              4                       64     (-50%)
>> 0.25             6                       96     (-25%)
>> 0.1              7                      112     (-12.5%) 
>>
>>
>> On a Zen 4 1 socket server with 192 CPUs at the DIE domain containing
>> 12 groups, values will be:
>>
>> util_ratio     nr_groups_to_scan        nr_cpus_scanned
>> ========================================================
>> 0.9              1                       16     (-91%)
>> 0.75             3                       48     (-75%)
>> 0.5              6                       96     (-50%)
>> 0.25             9                      144     (-25%)
>> 0.1             10                      160     (-16.7%)
>>
> I have an idea to limit the scan depth of newidle balance for big domains.
> These domains should have a CPU count higher than or equal to that of the LLC (MC domain).
> However, it seems that in the current kernel only domains with the SD_SHARE_PKG_RESOURCES
> flag set have a shared struct sched_domain_shared among the CPUs in the
> domain. And this is reasonable, because the cost to access the struct sched_domain_shared
> is lower if the CPUs share cache. Since ILB_UTIL relies on the sched_domain_shared
> to get the scan depth, I removed the restriction of SD_SHARE_PKG_RESOURCES
> during sched_domain_shared assignment.
> If a non-LLC domain's sched_domain_shared is only used for ILB_UTIL,
> the overhead should not be too high (only the periodic load balance would
> write to sched_domain_shared). Here is an untested patch which shows what
> I'm thinking of, and I'll do some refinement based on this:
> 
> thanks,
> Chenyu
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 67b573d5bf28..ce7ffbb7b3f8 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -82,6 +82,10 @@ struct sched_domain_shared {
>  	atomic_t	nr_busy_cpus;
>  	int		has_idle_cores;
>  	int		nr_idle_scan;
> +	/* ilb scan depth and load balance statistic snapshot */
> +	int		ilb_nr_scan;
> +	unsigned long ilb_total_load;
> +	unsigned long ilb_total_capacity;
>  };
>  
>  struct sched_domain {
> @@ -152,6 +156,7 @@ struct sched_domain {
>  	struct sched_domain_shared *shared;
>  
>  	unsigned int span_weight;
> +	unsigned int nr_groups;
>  	/*
>  	 * Span of all CPUs in this domain.
>  	 *
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d724215826ae..34619dbb2f4e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10162,6 +10162,54 @@ static void update_idle_cpu_scan(struct lb_env *env,
>  		WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
>  }
>  
> +/*
> + * Get the domain shared information of dst CPU.
> + */
> +static struct sched_domain_shared *get_sd_shared(struct lb_env *env)
> +{
> +	/*
> +	 * Do not consider the domains smaller than LLC because those
> +	 * small domains have low cost on idle load balance.
> +	 */
> +       if (env->sd->span_weight < per_cpu(sd_llc_size, env->dst_cpu))
> +               return NULL;
> +
> +       return env->sd->shared;
> +}
> +
> +static void update_ilb_group_scan(struct lb_env *env,
> +				  unsigned long sum_util,
> +				  struct sched_domain_shared *sd_share,
> +				  struct sd_lb_stats *sds)
> +{
> +	u64 tmp, nr_scan;
> +
> +	if (!sched_feat(ILB_UTIL) || env->idle == CPU_NEWLY_IDLE)
> +		return;
> +
> +	if(!sd_share)
> +		return;
> +	/*
> +	 * Limit the newidle balance scan depth based on overall system
> +	 * utilization:
> +	 * nr_groups_scan = nr_groups * (1 - util_ratio)
> +	 * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
> +	 */
> +	nr_scan = env->sd->nr_groups * sum_util;
> +	tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
> +	do_div(nr_scan, tmp);
> +	nr_scan = env->sd->nr_groups - nr_scan;
> +	if ((int)nr_scan != sd_share->ilb_nr_scan)
> +		WRITE_ONCE(sd_share->ilb_nr_scan, (int)nr_scan);
> +
> +	/* save the statistic snapshot of the periodic load balance */
> +	if (sds->total_load != sd_share->ilb_total_load)
> +		WRITE_ONCE(sd_share->ilb_total_load, sds->total_load);
> +
> +	if (sds->total_capacity != sd_share->ilb_total_capacity)
> +		WRITE_ONCE(sd_share->ilb_total_capacity, sds->total_capacity);
> +}
> +
>  /**
>   * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
>   * @env: The load balancing environment.
> @@ -10170,11 +10218,17 @@ static void update_idle_cpu_scan(struct lb_env *env,
>  
>  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
>  {
> +	struct sched_domain_shared *sd_share = get_sd_shared(env);
>  	struct sched_group *sg = env->sd->groups;
>  	struct sg_lb_stats *local = &sds->local_stat;
>  	struct sg_lb_stats tmp_sgs;
>  	unsigned long sum_util = 0;
> -	int sg_status = 0;
> +	int sg_status = 0, nr_scan_ilb;
> +	bool ilb_util_enabled = sched_feat(ILB_UTIL) && env->idle == CPU_NEWLY_IDLE &&
> +	    sd_share && READ_ONCE(sd_share->ilb_total_capacity);
> +
> +	if (ilb_util_enabled)
> +		nr_scan_ilb = sd_share->ilb_nr_scan;
>  
>  	do {
>  		struct sg_lb_stats *sgs = &tmp_sgs;
> @@ -10192,6 +10246,17 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  
>  		update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
>  
> +		if (ilb_util_enabled && --nr_scan_ilb <= 0) {
> +			/*
> +			 * Borrow the statistic of previous periodic load balance.
> +			 * The data might be stale and it is a trade-off.
> +			 */
> +			sds->total_load = READ_ONCE(sd_share->ilb_total_load);
> +			sds->total_capacity = READ_ONCE(sd_share->ilb_total_capacity);
> +
> +			break;
> +		}
> +
>  		if (local_group)
>  			goto next_group;
>  
> @@ -10239,6 +10304,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	}
>  
>  	update_idle_cpu_scan(env, sum_util);
> +	update_ilb_group_scan(env, sum_util, sd_share, sds);
>  }
>  
>  /**
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c76bd3..8f6e5b08408d 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -85,6 +85,7 @@ SCHED_FEAT(RT_PUSH_IPI, true)
>  
>  SCHED_FEAT(RT_RUNTIME_SHARE, false)
>  SCHED_FEAT(LB_MIN, false)
> +SCHED_FEAT(ILB_UTIL, true)
>  SCHED_FEAT(ATTACH_AGE_LOAD, true)
>  
>  SCHED_FEAT(WA_IDLE, true)
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d3a3b2646ec4..98bfac5f7836 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1023,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
>  	struct cpumask *covered = sched_domains_tmpmask;
>  	struct sd_data *sdd = sd->private;
>  	struct sched_domain *sibling;
> -	int i;
> +	int i, nr_groups = 0;
>  
>  	cpumask_clear(covered);
>  
> @@ -1087,6 +1087,8 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
>  		if (!sg)
>  			goto fail;
>  
> +		nr_groups++;
> +
>  		sg_span = sched_group_span(sg);
>  		cpumask_or(covered, covered, sg_span);
>  
> @@ -1100,6 +1102,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
>  		last->next = first;
>  	}
>  	sd->groups = first;
> +	sd->nr_groups = nr_groups;
>  
>  	return 0;
>  
> @@ -1233,7 +1236,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
>  	struct sd_data *sdd = sd->private;
>  	const struct cpumask *span = sched_domain_span(sd);
>  	struct cpumask *covered;
> -	int i;
> +	int i, nr_groups = 0;
>  
>  	lockdep_assert_held(&sched_domains_mutex);
>  	covered = sched_domains_tmpmask;
> @@ -1248,6 +1251,8 @@ build_sched_groups(struct sched_domain *sd, int cpu)
>  
>  		sg = get_group(i, sdd);
>  
> +		nr_groups++;
> +
>  		cpumask_or(covered, covered, sched_group_span(sg));
>  
>  		if (!first)
> @@ -1258,6 +1263,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
>  	}
>  	last->next = first;
>  	sd->groups = first;
> +	sd->nr_groups = nr_groups;
>  
>  	return 0;
>  }
> @@ -1641,14 +1647,12 @@ sd_init(struct sched_domain_topology_level *tl,
>  	}
>  
>  	/*
> -	 * For all levels sharing cache; connect a sched_domain_shared
> +	 * For all levels; connect a sched_domain_shared
>  	 * instance.
>  	 */
> -	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
> -		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> -		atomic_inc(&sd->shared->ref);
> -		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> -	}
> +	sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> +	atomic_inc(&sd->shared->ref);
> +	atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
>  
>  	sd->private = sdd;
>  

o System Details

Dual Socket 3rd Generation EPYC System (2 x 64C/128T)

o NPS Modes

NPS Modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over 2 sockets.

    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over 2 sockets.

    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255

o Kernel Versions

- tip              - tip:sched/core at commit e2a1f85bf9f5 ("sched/psi:
                     Avoid resetting the min update period when it is
                     unnecessary")

- ILB_UTIL	   - tip:sched/core + this patch

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test:			tip			ILB_UTIL
 1-groups:	   3.92 (0.00 pct)	   3.66 (6.63 pct)
 2-groups:	   4.58 (0.00 pct)	   4.18 (8.73 pct)
 4-groups:	   4.99 (0.00 pct)	   4.46 (10.62 pct)
 8-groups:	   5.67 (0.00 pct)	   5.39 (4.93 pct)
16-groups:	   7.88 (0.00 pct)	  10.43 (-32.36 pct)

o NPS2

Test:			tip			ILB_UTIL
 1-groups:	   3.82 (0.00 pct)	   3.59 (6.02 pct)
 2-groups:	   4.40 (0.00 pct)	   4.08 (7.27 pct)
 4-groups:	   4.84 (0.00 pct)	   4.44 (8.26 pct)
 8-groups:	   5.45 (0.00 pct)	   6.32 (-15.96 pct)
16-groups:	   6.94 (0.00 pct)	  11.71 (-68.73 pct)

o NPS4

Test:			tip			ILB_UTIL
 1-groups:	   3.82 (0.00 pct)	   3.65 (4.45 pct)
 2-groups:	   4.44 (0.00 pct)	   4.19 (5.63 pct)
 4-groups:	   4.86 (0.00 pct)	   4.60 (5.34 pct)
 8-groups:	   5.42 (0.00 pct)	   5.36 (1.10 pct)
16-groups:	   6.68 (0.00 pct)	  10.09 (-51.04 pct)

~~~~~~~~~~~~~~~~~~
~ schbench (Old) ~
~~~~~~~~~~~~~~~~~~

o NPS1

#workers:	tip			ILB_UTIL
  1:	  26.00 (0.00 pct)	  26.00 (0.00 pct)
  2:	  27.00 (0.00 pct)	  28.00 (-3.70 pct)
  4:	  31.00 (0.00 pct)	  27.00 (12.90 pct)
  8:	  36.00 (0.00 pct)	  40.00 (-11.11 pct)
 16:	  49.00 (0.00 pct)	  50.00 (-2.04 pct)
 32:	  80.00 (0.00 pct)	  80.00 (0.00 pct)
 64:	 169.00 (0.00 pct)	 170.00 (-0.59 pct)
128:	 343.00 (0.00 pct)	 338.00 (1.45 pct)
256:	 42048.00 (0.00 pct)	 45760.00 (-8.82 pct)
512:	 95104.00 (0.00 pct)	 109696.00 (-15.34 pct)

o NPS2

#workers:	tip			ILB_UTIL
  1:	  23.00 (0.00 pct)	  21.00 (8.69 pct)
  2:	  24.00 (0.00 pct)	  25.00 (-4.16 pct)
  4:	  31.00 (0.00 pct)	  29.00 (6.45 pct)
  8:	  41.00 (0.00 pct)	  43.00 (-4.87 pct)
 16:	  48.00 (0.00 pct)	  50.00 (-4.16 pct)
 32:	  81.00 (0.00 pct)	  81.00 (0.00 pct)
 64:	 157.00 (0.00 pct)	 180.00 (-14.64 pct)
128:	 386.00 (0.00 pct)	 385.00 (0.25 pct)
256:	 48832.00 (0.00 pct)	 52032.00 (-6.55 pct)
512:	 92032.00 (0.00 pct)	 113024.00 (-22.80 pct)

o NPS4

#workers:	tip			ILB_UTIL
  1:	  21.00 (0.00 pct)	  23.00 (-9.52 pct)
  2:	  28.00 (0.00 pct)	  30.00 (-7.14 pct)
  4:	  32.00 (0.00 pct)	  33.00 (-3.12 pct)
  8:	  46.00 (0.00 pct)	  51.00 (-10.86 pct)
 16:	  51.00 (0.00 pct)	  54.00 (-5.88 pct)
 32:	  82.00 (0.00 pct)	  88.00 (-7.31 pct)
 64:	 173.00 (0.00 pct)	 175.00 (-1.15 pct)
128:	 396.00 (0.00 pct)	 387.00 (2.27 pct)
256:	 48832.00 (0.00 pct)	 46912.00 (3.93 pct)
512:	 95104.00 (0.00 pct)	 110720.00 (-16.41 pct)


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients:	tip			ILB_UTIL
    1	 452.49 (0.00 pct)	 449.93 (-0.56 pct)
    2	 862.44 (0.00 pct)	 875.04 (1.46 pct)
    4	 1604.27 (0.00 pct)	 1626.23 (1.36 pct)
    8	 2966.77 (0.00 pct)	 3036.80 (2.36 pct)
   16	 5176.70 (0.00 pct)	 5402.88 (4.36 pct)
   32	 8205.24 (0.00 pct)	 9256.48 (12.81 pct)
   64	 13956.71 (0.00 pct)	 15581.58 (11.64 pct)
  128	 24005.50 (0.00 pct)	 24782.63 (3.23 pct)
  256	 32457.61 (0.00 pct)	 30810.66 (-5.07 pct)
  512	 34345.24 (0.00 pct)	 40971.90 (19.29 pct)
 1024	 33432.92 (0.00 pct)	 41604.06 (24.44 pct)

o NPS2

Clients:	tip			ILB_UTIL
    1	 453.73 (0.00 pct)	 444.72 (-1.98 pct)
    2	 861.71 (0.00 pct)	 853.67 (-0.93 pct)
    4	 1599.14 (0.00 pct)	 1573.69 (-1.59 pct)
    8	 2951.03 (0.00 pct)	 3021.87 (2.40 pct)
   16	 5080.32 (0.00 pct)	 5464.64 (7.56 pct)
   32	 7900.41 (0.00 pct)	 10304.44 (30.42 pct)
   64	 14629.65 (0.00 pct)	 17083.33 (16.77 pct)
  128	 23155.88 (0.00 pct)	 25278.86 (9.16 pct)
  256	 33449.57 (0.00 pct)	 32964.11 (-1.45 pct)
  512	 33757.47 (0.00 pct)	 40951.04 (21.30 pct)
 1024	 34823.14 (0.00 pct)	 41737.76 (19.85 pct)

o NPS4

Clients:	tip			ILB_UTIL
    1	 450.14 (0.00 pct)	 451.88 (0.38 pct)
    2	 863.26 (0.00 pct)	 864.96 (0.19 pct)
    4	 1618.71 (0.00 pct)	 1632.00 (0.82 pct)
    8	 2929.35 (0.00 pct)	 3071.80 (4.86 pct)
   16	 5114.04 (0.00 pct)	 5373.74 (5.07 pct)
   32	 7912.18 (0.00 pct)	 8830.49 (11.60 pct)
   64	 14424.72 (0.00 pct)	 15598.13 (8.13 pct)
  128	 23614.97 (0.00 pct)	 24563.76 (4.01 pct)
  256	 34365.13 (0.00 pct)	 32096.70 (-6.60 pct)
  512	 34215.50 (0.00 pct)	 42068.49 (22.95 pct)
 1024	 35421.90 (0.00 pct)	 42230.56 (19.22 pct)

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

- 10 Runs:

Test:		tip			ILB_UTIL
 Copy:	 271317.35 (0.00 pct)	 304210.62 (12.12 pct)
Scale:	 205533.77 (0.00 pct)	 204155.75 (-0.67 pct)
  Add:	 221624.62 (0.00 pct)	 228757.07 (3.21 pct)
Triad:	 228500.68 (0.00 pct)	 236454.48 (3.48 pct)

- 100 Runs:

Test:		tip			ILB_UTIL
 Copy:	 317381.65 (0.00 pct)	 321587.90 (1.32 pct)
Scale:	 214145.00 (0.00 pct)	 211397.70 (-1.28 pct)
  Add:	 239243.29 (0.00 pct)	 235497.67 (-1.56 pct)
Triad:	 249477.76 (0.00 pct)	 240764.14 (-3.49 pct)

o NPS2

- 10 Runs:

Test:		tip			ILB_UTIL
 Copy:	 277761.29 (0.00 pct)	 279582.97 (0.65 pct)
Scale:	 215193.83 (0.00 pct)	 203628.71 (-5.37 pct)
  Add:	 242725.75 (0.00 pct)	 232522.80 (-4.20 pct)
Triad:	 237253.44 (0.00 pct)	 245716.42 (3.56 pct)

- 100 Runs:

Test:		tip			ILB_UTIL
 Copy:	 318082.10 (0.00 pct)	 320640.80 (0.80 pct)
Scale:	 219338.56 (0.00 pct)	 222158.47 (1.28 pct)
  Add:	 248118.20 (0.00 pct)	 254163.15 (2.43 pct)
Triad:	 247088.55 (0.00 pct)	 252459.53 (2.17 pct)

o NPS4

- 10 Runs:

Test:		tip			ILB_UTIL
 Copy:	 273307.14 (0.00 pct)	 269979.40 (-1.21 pct)
Scale:	 235715.23 (0.00 pct)	 225429.20 (-4.36 pct)
  Add:	 244500.40 (0.00 pct)	 227988.81 (-6.75 pct)
Triad:	 250600.04 (0.00 pct)	 234012.67 (-6.61 pct)

- 100 Runs:

Test:		tip			ILB_UTIL
 Copy:	 345396.19 (0.00 pct)	 335548.25 (-2.85 pct)
Scale:	 241521.63 (0.00 pct)	 228991.04 (-5.18 pct)
  Add:	 261157.86 (0.00 pct)	 247020.34 (-5.41 pct)
Triad:	 267804.99 (0.00 pct)	 258260.01 (-3.56 pct)

~~~~~~~~~~~
~ netperf ~
~~~~~~~~~~~

o NPS1

Test:			tip			ILB_UTIL
 1-clients:	 102839.97 (0.00 pct)	 101826.77 (-0.98 pct)
 2-clients:	 98428.08 (0.00 pct)	 98563.25 (0.13 pct)
 4-clients:	 92298.45 (0.00 pct)	 95310.26 (3.26 pct)
 8-clients:	 85618.41 (0.00 pct)	 87859.85 (2.61 pct)
16-clients:	 78722.18 (0.00 pct)	 79430.42 (0.89 pct)
32-clients:	 73610.75 (0.00 pct)	 76459.08 (3.86 pct)
64-clients:	 55285.07 (0.00 pct)	 64071.43 (15.89 pct)
128-clients:	 31176.92 (0.00 pct)	 37287.20 (19.59 pct)
256-clients:	 20011.44 (0.00 pct)	 31243.73 (56.12 pct)

o NPS2

Test:			tip			ILB_UTIL
 1-clients:	 103105.55 (0.00 pct)	 99162.95 (-3.82 pct)
 2-clients:	 98720.29 (0.00 pct)	 96055.84 (-2.69 pct)
 4-clients:	 92289.39 (0.00 pct)	 92818.61 (0.57 pct)
 8-clients:	 84998.63 (0.00 pct)	 86693.17 (1.99 pct)
16-clients:	 76395.81 (0.00 pct)	 77137.01 (0.97 pct)
32-clients:	 71110.89 (0.00 pct)	 70154.80 (-1.34 pct)
64-clients:	 49526.21 (0.00 pct)	 55032.79 (11.11 pct)
128-clients:	 27917.51 (0.00 pct)	 36377.03 (30.30 pct)
256-clients:	 20067.17 (0.00 pct)	 27607.78 (37.57 pct)

o NPS4

Test:			tip			ILB_UTIL
 1-clients:	 102139.49 (0.00 pct)	 103414.93 (1.24 pct)
 2-clients:	 98259.53 (0.00 pct)	 101472.40 (3.26 pct)
 4-clients:	 91576.79 (0.00 pct)	 96917.69 (5.83 pct)
 8-clients:	 84742.30 (0.00 pct)	 90389.72 (6.66 pct)
16-clients:	 79540.75 (0.00 pct)	 85183.23 (7.09 pct)
32-clients:	 71166.14 (0.00 pct)	 78511.48 (10.32 pct)
64-clients:	 51763.24 (0.00 pct)	 61334.30 (18.49 pct)
128-clients:	 27829.29 (0.00 pct)	 35989.34 (29.32 pct)
256-clients:	 24185.37 (0.00 pct)	 35769.17 (47.89 pct)

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

						tip			ILB_UTIL
Hmean     unixbench-dhry2reg-1   	  41322625.19 (   0.00%)    41202944.91 (  -0.29%)
Hmean     unixbench-dhry2reg-512	6252491108.60 (   0.00%)  6193511930.01 *  -0.94%*
Amean     unixbench-syscall-1    	   2501398.27 (   0.00%)     2558258.57 *  -2.27%*
Amean     unixbench-syscall-512  	   8120524.00 (   0.00%)     8014692.00 *   1.30%*
Hmean     unixbench-pipe-1    		   2359346.02 (   0.00%)     2395716.82 *   1.54%*
Hmean     unixbench-pipe-512		 338790322.61 (   0.00%)   339462110.52 (   0.20%)
Hmean     unixbench-spawn-1      	      4261.52 (   0.00%)        4786.09 *  12.31%*
Hmean     unixbench-spawn-512    	     64328.93 (   0.00%)       68328.36 *   6.22%*
Hmean     unixbench-execl-1      	      3677.73 (   0.00%)        3671.96 (  -0.16%)
Hmean     unixbench-execl-512    	     11984.83 (   0.00%)       13272.01 (  10.74%)

o NPS2

						tip			ILB_UTIL
Hmean     unixbench-dhry2reg-1   	  41311787.29 (   0.00%)    41209738.92 (  -0.25%)
Hmean     unixbench-dhry2reg-512	6243873272.76 (   0.00%)  6198007442.15 *  -0.73%*
Amean     unixbench-syscall-1    	   2503190.70 (   0.00%)     2559295.30 *  -2.24%*
Amean     unixbench-syscall-512  	   8012388.13 (   0.00%)     7984268.83 *   0.35%*
Hmean     unixbench-pipe-1    		   2340486.25 (   0.00%)     2395174.42 *   2.34%*
Hmean     unixbench-pipe-512		 338965319.79 (   0.00%)   339972146.39 (   0.30%)
Hmean     unixbench-spawn-1    		      5241.83 (   0.00%)        5041.98 *  -3.81%*
Hmean     unixbench-spawn-512  		     65799.86 (   0.00%)       68871.88 *   4.67%*
Hmean     unixbench-execl-1    		      3670.65 (   0.00%)        3659.10 (  -0.31%)
Hmean     unixbench-execl-512  		     13682.00 (   0.00%)       13984.58 (   2.21%)

o NPS4

						tip			ILB_UTIL
Hmean     unixbench-dhry2reg-1   	  41025577.99 (   0.00%)    41039940.89 (   0.04%)
Hmean     unixbench-dhry2reg-512	6255568261.91 (   0.00%)  6216198481.97 *  -0.63%*
Amean     unixbench-syscall-1    	   2507165.37 (   0.00%)     2553468.33 *  -1.85%*
Amean     unixbench-syscall-512  	   7458476.50 (   0.00%)     7483366.27 *  -0.33%*
Hmean     unixbench-pipe-1    		   2369301.21 (   0.00%)     2397653.84 *   1.20%*
Hmean     unixbench-pipe-512		 340299405.72 (   0.00%)   340332182.64 (   0.01%)
Hmean     unixbench-spawn-1      	      5571.78 (   0.00%)        5389.50 (  -3.27%)
Hmean     unixbench-spawn-512    	     63999.96 (   0.00%)       68343.41 *   6.79%*
Hmean     unixbench-execl-1      	      3587.15 (   0.00%)        3628.48 *   1.15%*
Hmean     unixbench-execl-512    	     14184.17 (   0.00%)       13720.55 (  -3.27%)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1:

base:			298681.00 (var: 2.31%)
ILB_UTIL		292352.67 (var: 3.31%) (-2.11%)

o NPS2:

base:			296570.00 (var: 1.01%)
ILB_UTIL		298804.67 (var: 1.50%) (0.75%)

o NPS4:

base			297181.67 (var: 0.46%)
ILB_UTIL		297495.00 (var: 0.33%) (0.10%)

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

o NPS1:

- 1 CCD

base:			1.00 (var: 0.27%)
ILB_UTIL:		1.03 (var: 0.16%) (+3.31%)

- 2 CCD

base:			1.00 (var: 0.42%)
ILB_UTIL:		1.01 (var: 0.19%) (+1.48%)

- 4 CCD

base:			1.00 (var: 0.46%)
ILB_UTIL:		0.98 (var: 0.17%) (-2.00%)

- 8 CCD

base:			1.00 (var: 0.63%)
ILB_UTIL:		0.96 (var: 0.46%) (-3.79%)

~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ SPECjbb2015 - multi-JVM ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~

max-jOPS	1.00		0.99  (-1.11%)  
critical-jOPS	1.00		0.99  (-1.06%)

--

I have a couple of theories:

o Either new_idle_balance is failing to find an overloaded busy rq as a
  result of the limit.

o Or, there is a chain reaction where pulling from a loaded rq which is not
  the most loaded will lead to more new_idle_balancing attempts, which
  degrades performance.

I'll go back and get some data to narrow down the cause. Meanwhile if
there is any specific benchmark you would like me to run on the test
system, please do let me know.

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization
  2023-07-10 11:06       ` K Prateek Nayak
@ 2023-07-10 15:37         ` Chen Yu
  2023-07-11  2:52           ` K Prateek Nayak
  0 siblings, 1 reply; 17+ messages in thread
From: Chen Yu @ 2023-07-10 15:37 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
	Tim Chen, Mel Gorman, Dietmar Eggemann, Abel Wu,
	Gautham R. Shenoy, Len Brown, Chen Yu, Yicong Yang, linux-kernel

Hi Prateek,

thanks for testing this patch,
On 2023-07-10 at 16:36:47 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> Thank you for sharing this extended version. Sharing the results from
> testing below.
> 
> tl;dr
> 
> - tbench, netperf and unixbench-spawn see an improvement with ILB_UTIL.
> 
> - schbench (old) sees a regression in tail latency once the system is heavily
>   loaded. DeathStarBench and SPECjbb also see a small drop under those
>   conditions.
> 
> - Rest of the benchmark results do not vary much.
> 
>
[...] 
> I have a couple of theories:
> 
Thanks for the insights. I agree the risk you mentioned below could impact
performance. Some thoughts below:
> o Either new_idle_balance is failing to find an overloaded busy rq as a
>   result of the limit.
> 
If the system is overloaded, ILB_UTIL finds a relatively busy rq and pulls from it.
There should not be much difference between a relatively busy rq and the busiest one,
because all rqs are quite busy, I suppose?
> o Or, there is a chain reaction where pulling from a loaded rq which is not
>   the most loaded will lead to more new_idle_balancing attempts, which
>   degrades performance.
Yeah, it could be possible that ILB_UTIL finds a relatively busy rq, but the
imbalance is not high, so ilb decides not to pull from it. However, the busiest
rq is still waiting for someone to help, and this could trigger idle load
balance more frequently.
> 
> I'll go back and get some data to narrow down the cause. Meanwhile if
> there is any specific benchmark you would like me to run on the test
> system, please do let me know.
> 
Another hint might be that, as Gautham and Peter suggested, we should apply ILB_UTIL
only to non-NUMA domains. In the above patch all the domains have sd_share, which could
bring a negative impact when accessing/writing cross-node data.
Sorry I did not post the latest version with the NUMA domain excluded previously, as
I was trying to create a prototype to further speed up idle load balance by
reusing the statistics as suggested by Peter. I will send them out once they are stable.
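
For reference, a minimal sketch of how the NUMA exclusion could look, based on the
get_sd_shared() helper in the patch above (illustrative only, the posted version may
differ; sd_init() would also keep the sched_domain_shared assignment for non-NUMA
domains only):

	static struct sched_domain_shared *get_sd_shared(struct lb_env *env)
	{
		/*
		 * Skip NUMA domains: reading/writing their shared data
		 * would touch cross-node cachelines.
		 */
		if (env->sd->flags & SD_NUMA)
			return NULL;

		/*
		 * Do not consider the domains smaller than LLC because
		 * those small domains have low cost on idle load balance.
		 */
		if (env->sd->span_weight < per_cpu(sd_llc_size, env->dst_cpu))
			return NULL;

		return env->sd->shared;
	}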

Thanks again,
Chenyu
> --
> Thanks and Regards,
> Prateek

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization
  2023-07-10 15:37         ` Chen Yu
@ 2023-07-11  2:52           ` K Prateek Nayak
  0 siblings, 0 replies; 17+ messages in thread
From: K Prateek Nayak @ 2023-07-11  2:52 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
	Tim Chen, Mel Gorman, Dietmar Eggemann, Abel Wu,
	Gautham R. Shenoy, Len Brown, Chen Yu, Yicong Yang, linux-kernel

Hello Chenyu,

Thank you for taking a look at the results.

On 7/10/2023 9:07 PM, Chen Yu wrote:
> Hi Prateek,
> 
> thanks for testing this patch,
> On 2023-07-10 at 16:36:47 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> Thank you for sharing this extended version. Sharing the results from
>> testing below.
>>
>> tl;dr
>>
>> - tbench, netperf and unixbench-spawn see an improvement with ILB_UTIL.
>>
>> - schbench (old) sees a regression in tail latency once the system is heavily
>>   loaded. DeathStarBench and SPECjbb also see a small drop under those
>>   conditions.
>>
>> - Rest of the benchmark results do not vary much.
>>
>>
> [...] 
>> I have a couple of theories:
>>
> Thanks for the insights. I agree the risk you mentioned below could impact
> performance. Some thoughts below:
>> o Either new_idle_balance is failing to find an overloaded busy rq as a
>>   result of the limit.
>>
> If the system is overloaded, ILB_UTIL finds a relatively busy rq and pulls from it.
> There should not be much difference between a relatively busy rq and the busiest one,
> because all rqs are quite busy, I suppose?
>> o Or, there is a chain reaction where pulling from a loaded rq which is not
>>   the most loaded will lead to more new_idle_balancing attempts, which
>>   degrades performance.
> Yeah, it could be possible that ILB_UTIL finds a relatively busy rq, but the
> imbalance is not high, so ilb decides not to pull from it. However, the busiest
> rq is still waiting for someone to help, and this could trigger idle load
> balance more frequently.

Yes, I was thinking along the similar lines.

>>
>> I'll go back and get some data to narrow down the cause. Meanwhile if
>> there is any specific benchmark you would like me to run on the test
>> system, please do let me know.
>>
> Another hint might be that, as Gautham and Peter suggested, we should apply ILB_UTIL
> only to non-NUMA domains. In the above patch all the domains have sd_share, which could
> bring a negative impact when accessing/writing cross-node data.
> Sorry I did not post the latest version with the NUMA domain excluded previously, as
> I was trying to create a prototype to further speed up idle load balance by
> reusing the statistics as suggested by Peter. I will send them out once they are stable.

Sure. I'll keep an eye out for the next version :)

> 
> [..snip..]
>

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-07-11  2:52 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-12 16:17 [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance Chen Yu
2023-06-12 16:18 ` [RFC PATCH 1/4] sched/fair: Extract the function to get the sd_llc_shared Chen Yu
2023-06-12 16:18 ` [RFC PATCH 2/4] sched/topology: Introduce nr_groups in sched_domain to indicate the number of groups Chen Yu
2023-06-12 16:18 ` [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization Chen Yu
2023-06-15  6:01   ` Gautham R. Shenoy
2023-06-16  6:17     ` Chen Yu
2023-06-21  7:29     ` Chen Yu
2023-06-22  6:01       ` Gautham R. Shenoy
2023-07-10 11:06       ` K Prateek Nayak
2023-07-10 15:37         ` Chen Yu
2023-07-11  2:52           ` K Prateek Nayak
2023-06-21 11:20     ` Peter Zijlstra
2023-06-21 11:17   ` Peter Zijlstra
2023-06-23 14:33     ` Chen Yu
2023-06-23 14:43       ` Chen Yu
2023-06-12 16:19 ` [RFC PATCH 4/4] sched/fair: Throttle the busiest group scanning in idle load balance Chen Yu
2023-06-15  4:22 ` [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance Gautham R. Shenoy
