* [PATCH v3] sched/fair: filter out overloaded cpus in SIS
@ 2022-05-05 12:23 Abel Wu
  2022-05-07 16:09 ` Chen Yu
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Abel Wu @ 2022-05-05 12:23 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don
  Cc: linux-kernel, Abel Wu

Try to improve the search efficiency of SIS by filtering out overloaded
cpus; as a result, the more overloaded the system is, the fewer cpus
will be searched.

The overloaded cpus are tracked through the LLC shared domain. To
regulate accesses to the shared data, the update happens mainly at
the tick. But to make it more accurate, task migrations during load
balancing are also taken into account, and these can be quite frequent
when short-running workloads cause newly-idle balancing. Since an
overloaded runqueue requires at least 2 non-idle runnable tasks, we
can have more faith in the "frequent newly-idle" case.

Benchmark
=========

Tests are done in an Intel(R) Xeon(R) Platinum 8260 CPU@2.40GHz
machine with 2 NUMA nodes each of which has 24 cores with SMT2
enabled, so 96 CPUs in total.

All of the benchmarks are done inside a normal cpu cgroup in a
clean environment with cpu turbo disabled.

Based on tip sched/core 089c02ae2771 (v5.18-rc1).

Results
=======

hackbench-process-pipes
                             vanilla		     filter
Amean     1        0.2537 (   0.00%)      0.2330 (   8.15%)
Amean     4        0.7090 (   0.00%)      0.7440 *  -4.94%*
Amean     7        0.9153 (   0.00%)      0.9040 (   1.24%)
Amean     12       1.1473 (   0.00%)      1.0857 *   5.37%*
Amean     21       2.7210 (   0.00%)      2.2320 *  17.97%*
Amean     30       4.8263 (   0.00%)      3.6170 *  25.06%*
Amean     48       7.4107 (   0.00%)      6.1130 *  17.51%*
Amean     79       9.2120 (   0.00%)      8.2350 *  10.61%*
Amean     110     10.1647 (   0.00%)      8.8043 *  13.38%*
Amean     141     11.5713 (   0.00%)     10.5867 *   8.51%*
Amean     172     13.7963 (   0.00%)     12.8120 *   7.13%*
Amean     203     15.9283 (   0.00%)     14.8703 *   6.64%*
Amean     234     17.8737 (   0.00%)     17.1053 *   4.30%*
Amean     265     19.8443 (   0.00%)     18.7870 *   5.33%*
Amean     296     22.4147 (   0.00%)     21.3943 *   4.55%*

There is a regression in the 4-groups test, a case in which lots of
busy (but not overloaded) cpus can be found in the system. Busy cpus
are not recorded in the overloaded cpu mask, so in SIS we pay the
filtering overhead for nothing. This is the worst case for this feature.

hackbench-process-sockets
                             vanilla		     filter
Amean     1        0.5343 (   0.00%)      0.5270 (   1.37%)
Amean     4        1.4500 (   0.00%)      1.4273 *   1.56%*
Amean     7        2.4813 (   0.00%)      2.4383 *   1.73%*
Amean     12       4.1357 (   0.00%)      4.0827 *   1.28%*
Amean     21       6.9707 (   0.00%)      6.9290 (   0.60%)
Amean     30       9.8373 (   0.00%)      9.6730 *   1.67%*
Amean     48      15.6233 (   0.00%)     15.3213 *   1.93%*
Amean     79      26.2763 (   0.00%)     25.3293 *   3.60%*
Amean     110     36.6170 (   0.00%)     35.8920 *   1.98%*
Amean     141     47.0720 (   0.00%)     45.8773 *   2.54%*
Amean     172     57.0580 (   0.00%)     56.1627 *   1.57%*
Amean     203     67.2040 (   0.00%)     66.4323 *   1.15%*
Amean     234     77.8897 (   0.00%)     76.6320 *   1.61%*
Amean     265     88.0437 (   0.00%)     87.1400 (   1.03%)
Amean     296     98.2387 (   0.00%)     96.8633 *   1.40%*

hackbench-thread-pipes
                             vanilla		     filter
Amean     1        0.2693 (   0.00%)      0.2800 *  -3.96%*
Amean     4        0.7843 (   0.00%)      0.7680 (   2.08%)
Amean     7        0.9287 (   0.00%)      0.9217 (   0.75%)
Amean     12       1.4443 (   0.00%)      1.3680 *   5.29%*
Amean     21       3.5150 (   0.00%)      3.1107 *  11.50%*
Amean     30       6.3997 (   0.00%)      5.2160 *  18.50%*
Amean     48       8.4183 (   0.00%)      7.8477 *   6.78%*
Amean     79      10.0713 (   0.00%)      9.2240 *   8.41%*
Amean     110     10.9940 (   0.00%)     10.1280 *   7.88%*
Amean     141     13.6347 (   0.00%)     11.9387 *  12.44%*
Amean     172     15.0523 (   0.00%)     14.4117 (   4.26%)
Amean     203     18.0710 (   0.00%)     17.3533 (   3.97%)
Amean     234     19.7413 (   0.00%)     19.8453 (  -0.53%)
Amean     265     23.1820 (   0.00%)     22.8223 (   1.55%)
Amean     296     25.3820 (   0.00%)     24.2397 (   4.50%)

hackbench-thread-sockets
                             vanilla		     filter
Amean     1        0.5893 (   0.00%)      0.5750 *   2.43%*
Amean     4        1.4853 (   0.00%)      1.4727 (   0.85%)
Amean     7        2.5353 (   0.00%)      2.5047 *   1.21%*
Amean     12       4.3003 (   0.00%)      4.1910 *   2.54%*
Amean     21       7.1930 (   0.00%)      7.1533 (   0.55%)
Amean     30      10.0983 (   0.00%)      9.9690 *   1.28%*
Amean     48      15.9853 (   0.00%)     15.6963 *   1.81%*
Amean     79      26.7537 (   0.00%)     25.9497 *   3.01%*
Amean     110     37.3850 (   0.00%)     36.6793 *   1.89%*
Amean     141     47.7730 (   0.00%)     47.0967 *   1.42%*
Amean     172     58.4280 (   0.00%)     57.5513 *   1.50%*
Amean     203     69.3093 (   0.00%)     67.7680 *   2.22%*
Amean     234     80.0190 (   0.00%)     78.2633 *   2.19%*
Amean     265     90.7237 (   0.00%)     89.1027 *   1.79%*
Amean     296    101.1153 (   0.00%)     99.2693 *   1.83%*

schbench
				   vanilla		   filter
Lat 50.0th-qrtle-1         5.00 (   0.00%)        5.00 (   0.00%)
Lat 75.0th-qrtle-1         5.00 (   0.00%)        5.00 (   0.00%)
Lat 90.0th-qrtle-1         5.00 (   0.00%)        5.00 (   0.00%)
Lat 95.0th-qrtle-1         5.00 (   0.00%)        5.00 (   0.00%)
Lat 99.0th-qrtle-1         6.00 (   0.00%)        6.00 (   0.00%)
Lat 99.5th-qrtle-1         7.00 (   0.00%)        6.00 (  14.29%)
Lat 99.9th-qrtle-1         7.00 (   0.00%)        7.00 (   0.00%)
Lat 50.0th-qrtle-2         6.00 (   0.00%)        6.00 (   0.00%)
Lat 75.0th-qrtle-2         7.00 (   0.00%)        7.00 (   0.00%)
Lat 90.0th-qrtle-2         7.00 (   0.00%)        7.00 (   0.00%)
Lat 95.0th-qrtle-2         7.00 (   0.00%)        7.00 (   0.00%)
Lat 99.0th-qrtle-2         9.00 (   0.00%)        8.00 (  11.11%)
Lat 99.5th-qrtle-2         9.00 (   0.00%)        9.00 (   0.00%)
Lat 99.9th-qrtle-2        12.00 (   0.00%)       11.00 (   8.33%)
Lat 50.0th-qrtle-4         8.00 (   0.00%)        8.00 (   0.00%)
Lat 75.0th-qrtle-4        10.00 (   0.00%)       10.00 (   0.00%)
Lat 90.0th-qrtle-4        10.00 (   0.00%)       11.00 ( -10.00%)
Lat 95.0th-qrtle-4        11.00 (   0.00%)       11.00 (   0.00%)
Lat 99.0th-qrtle-4        12.00 (   0.00%)       13.00 (  -8.33%)
Lat 99.5th-qrtle-4        16.00 (   0.00%)       14.00 (  12.50%)
Lat 99.9th-qrtle-4        17.00 (   0.00%)       15.00 (  11.76%)
Lat 50.0th-qrtle-8        13.00 (   0.00%)       13.00 (   0.00%)
Lat 75.0th-qrtle-8        16.00 (   0.00%)       16.00 (   0.00%)
Lat 90.0th-qrtle-8        18.00 (   0.00%)       18.00 (   0.00%)
Lat 95.0th-qrtle-8        19.00 (   0.00%)       18.00 (   5.26%)
Lat 99.0th-qrtle-8        24.00 (   0.00%)       21.00 (  12.50%)
Lat 99.5th-qrtle-8        28.00 (   0.00%)       26.00 (   7.14%)
Lat 99.9th-qrtle-8        33.00 (   0.00%)       32.00 (   3.03%)
Lat 50.0th-qrtle-16       20.00 (   0.00%)       20.00 (   0.00%)
Lat 75.0th-qrtle-16       28.00 (   0.00%)       28.00 (   0.00%)
Lat 90.0th-qrtle-16       32.00 (   0.00%)       32.00 (   0.00%)
Lat 95.0th-qrtle-16       34.00 (   0.00%)       34.00 (   0.00%)
Lat 99.0th-qrtle-16       40.00 (   0.00%)       40.00 (   0.00%)
Lat 99.5th-qrtle-16       44.00 (   0.00%)       44.00 (   0.00%)
Lat 99.9th-qrtle-16       53.00 (   0.00%)       67.00 ( -26.42%)
Lat 50.0th-qrtle-32       39.00 (   0.00%)       36.00 (   7.69%)
Lat 75.0th-qrtle-32       57.00 (   0.00%)       52.00 (   8.77%)
Lat 90.0th-qrtle-32       69.00 (   0.00%)       61.00 (  11.59%)
Lat 95.0th-qrtle-32       76.00 (   0.00%)       64.00 (  15.79%)
Lat 99.0th-qrtle-32       88.00 (   0.00%)       74.00 (  15.91%)
Lat 99.5th-qrtle-32       91.00 (   0.00%)       80.00 (  12.09%)
Lat 99.9th-qrtle-32      115.00 (   0.00%)      107.00 (   6.96%)
Lat 50.0th-qrtle-47       63.00 (   0.00%)       55.00 (  12.70%)
Lat 75.0th-qrtle-47       93.00 (   0.00%)       80.00 (  13.98%)
Lat 90.0th-qrtle-47      116.00 (   0.00%)       97.00 (  16.38%)
Lat 95.0th-qrtle-47      129.00 (   0.00%)      106.00 (  17.83%)
Lat 99.0th-qrtle-47      148.00 (   0.00%)      123.00 (  16.89%)
Lat 99.5th-qrtle-47      157.00 (   0.00%)      132.00 (  15.92%)
Lat 99.9th-qrtle-47      387.00 (   0.00%)      164.00 (  57.62%)

netperf-udp
				    vanilla		    filter
Hmean     send-64         183.09 (   0.00%)      182.28 (  -0.44%)
Hmean     send-128        364.68 (   0.00%)      363.12 (  -0.43%)
Hmean     send-256        715.38 (   0.00%)      716.57 (   0.17%)
Hmean     send-1024      2764.76 (   0.00%)     2779.17 (   0.52%)
Hmean     send-2048      5282.93 (   0.00%)     5220.41 *  -1.18%*
Hmean     send-3312      8282.26 (   0.00%)     8121.78 *  -1.94%*
Hmean     send-4096     10108.12 (   0.00%)    10042.98 (  -0.64%)
Hmean     send-8192     16868.49 (   0.00%)    16826.99 (  -0.25%)
Hmean     send-16384    26230.44 (   0.00%)    26271.85 (   0.16%)
Hmean     recv-64         183.09 (   0.00%)      182.28 (  -0.44%)
Hmean     recv-128        364.68 (   0.00%)      363.12 (  -0.43%)
Hmean     recv-256        715.38 (   0.00%)      716.57 (   0.17%)
Hmean     recv-1024      2764.76 (   0.00%)     2779.17 (   0.52%)
Hmean     recv-2048      5282.93 (   0.00%)     5220.39 *  -1.18%*
Hmean     recv-3312      8282.26 (   0.00%)     8121.78 *  -1.94%*
Hmean     recv-4096     10108.12 (   0.00%)    10042.97 (  -0.64%)
Hmean     recv-8192     16868.47 (   0.00%)    16826.93 (  -0.25%)
Hmean     recv-16384    26230.44 (   0.00%)    26271.75 (   0.16%)

The overhead this feature adds to the scheduler can be unfriendly to
fast context-switching workloads like netperf/tbench, but the test
results look fine.

netperf-tcp
			       vanilla		       filter
Hmean     64         863.35 (   0.00%)     1176.11 *  36.23%*
Hmean     128       1674.32 (   0.00%)     2223.37 *  32.79%*
Hmean     256       3151.03 (   0.00%)     4109.64 *  30.42%*
Hmean     1024     10281.94 (   0.00%)    12799.28 *  24.48%*
Hmean     2048     16906.05 (   0.00%)    20129.91 *  19.07%*
Hmean     3312     21246.21 (   0.00%)    24747.24 *  16.48%*
Hmean     4096     23690.57 (   0.00%)    26596.35 *  12.27%*
Hmean     8192     28758.29 (   0.00%)    30423.10 *   5.79%*
Hmean     16384    33071.06 (   0.00%)    34262.39 *   3.60%*

The suspiciously large improvement (and the regression at tbench4-128
below) needs further digging.

tbench4 Throughput
			     vanilla		     filter
Hmean     1        293.71 (   0.00%)      298.89 *   1.76%*
Hmean     2        583.25 (   0.00%)      596.00 *   2.19%*
Hmean     4       1162.40 (   0.00%)     1176.73 *   1.23%*
Hmean     8       2309.28 (   0.00%)     2332.89 *   1.02%*
Hmean     16      4517.23 (   0.00%)     4587.60 *   1.56%*
Hmean     32      7458.54 (   0.00%)     7550.19 *   1.23%*
Hmean     64      9041.62 (   0.00%)     9192.69 *   1.67%*
Hmean     128    19983.62 (   0.00%)    12228.91 * -38.81%*
Hmean     256    20054.12 (   0.00%)    20997.33 *   4.70%*
Hmean     384    19137.11 (   0.00%)    20331.14 *   6.24%*

v3:
  - removed sched-idle balance feature and focus on SIS
  - take non-CFS tasks into consideration
  - several fixes/improvements suggested by Josh Don

v2:
  - several optimizations on sched-idle balancing
  - ignore asym topos in can_migrate_task
  - add more benchmarks including SIS efficiency
  - re-organize patch as suggested by Mel

v1: https://lore.kernel.org/lkml/20220217154403.6497-5-wuyun.abel@bytedance.com/
v2: https://lore.kernel.org/lkml/20220409135104.3733193-1-wuyun.abel@bytedance.com/

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 include/linux/sched/topology.h | 12 ++++++++++
 kernel/sched/core.c            | 38 ++++++++++++++++++++++++++++++
 kernel/sched/fair.c            | 43 +++++++++++++++++++++++++++-------
 kernel/sched/idle.c            |  1 +
 kernel/sched/sched.h           |  4 ++++
 kernel/sched/topology.c        |  4 +++-
 6 files changed, 92 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..95c7ad1e05b5 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,8 +81,20 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+
+	/*
+	 * Tracking of the overloaded cpus can be heavy, so start
+	 * a new cacheline to avoid false sharing.
+	 */
+	atomic_t	nr_overloaded_cpus ____cacheline_aligned;
+	unsigned long	overloaded_cpus[]; /* Must be last */
 };
 
+static inline struct cpumask *sdo_mask(struct sched_domain_shared *sds)
+{
+	return to_cpumask(sds->overloaded_cpus);
+}
+
 struct sched_domain {
 	/* These fields must be setup */
 	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51efaabac3e4..a29801c8b363 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5320,6 +5320,42 @@ __setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
 static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
 #endif /* CONFIG_SCHED_DEBUG */
 
+#ifdef CONFIG_SMP
+static inline bool rq_overloaded(struct rq *rq)
+{
+	return rq->nr_running - rq->cfs.idle_h_nr_running > 1;
+}
+
+void update_overloaded_rq(struct rq *rq)
+{
+	struct sched_domain_shared *sds;
+	bool overloaded = rq_overloaded(rq);
+	int cpu = cpu_of(rq);
+
+	lockdep_assert_rq_held(rq);
+
+	if (rq->overloaded == overloaded)
+		return;
+
+	rcu_read_lock();
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (unlikely(!sds))
+		goto unlock;
+
+	if (overloaded) {
+		cpumask_set_cpu(cpu, sdo_mask(sds));
+		atomic_inc(&sds->nr_overloaded_cpus);
+	} else {
+		cpumask_clear_cpu(cpu, sdo_mask(sds));
+		atomic_dec(&sds->nr_overloaded_cpus);
+	}
+
+	rq->overloaded = overloaded;
+unlock:
+	rcu_read_unlock();
+}
+#endif
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -5346,6 +5382,7 @@ void scheduler_tick(void)
 		resched_latency = cpu_resched_latency(rq);
 	calc_global_load_tick(rq);
 	sched_core_tick(rq);
+	update_overloaded_rq(rq);
 
 	rq_unlock(rq, &rf);
 
@@ -9578,6 +9615,7 @@ void __init sched_init(void)
 		rq->wake_stamp = jiffies;
 		rq->wake_avg_idle = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+		rq->overloaded = 0;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4bd299d67ab..79b4ff24faee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	struct sched_domain_shared *sds = sd->shared;
+	int nr, nro, weight = sd->span_weight;
+	int i, cpu, idle_cpu = -1;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
@@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	if (!this_sd)
 		return -1;
 
+	nro = atomic_read(&sds->nr_overloaded_cpus);
+	if (nro == weight)
+		goto out;
+
+	nr = min_t(int, weight, p->nr_cpus_allowed);
+
+	/*
+	 * It's unlikely to find an idle cpu if the system is under
+	 * heavy pressure, so skip searching to save a few cycles
+	 * and relieve cache traffic.
+	 */
+	if (weight - nro < (nr >> 4) && !has_idle_core)
+		return -1;
+
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	if (nro > 1)
+		cpumask_andnot(cpus, cpus, sdo_mask(sds));
 
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
@@ -6354,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		avg_idle = this_rq->wake_avg_idle;
 		avg_cost = this_sd->avg_scan_cost + 1;
 
-		span_avg = sd->span_weight * avg_idle;
+		span_avg = weight * avg_idle;
 		if (span_avg > 4*avg_cost)
 			nr = div_u64(span_avg, avg_cost);
 		else
@@ -6378,9 +6396,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
-	if (has_idle_core)
-		set_idle_cores(target, false);
-
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		time = cpu_clock(this) - time;
 
@@ -6392,6 +6407,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 
 		update_avg(&this_sd->avg_scan_cost, time);
 	}
+out:
+	if (has_idle_core)
+		WRITE_ONCE(sds->has_idle_cores, 0);
 
 	return idle_cpu;
 }
@@ -7904,6 +7922,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
 			continue;
 
 		detach_task(p, env);
+		update_overloaded_rq(env->src_rq);
 
 		/*
 		 * Right now, this is only the second place where
@@ -8047,6 +8066,9 @@ static int detach_tasks(struct lb_env *env)
 		list_move(&p->se.group_node, tasks);
 	}
 
+	if (detached)
+		update_overloaded_rq(env->src_rq);
+
 	/*
 	 * Right now, this is one of only two places we collect this stat
 	 * so we can safely collect detach_one_task() stats here rather
@@ -8080,6 +8102,7 @@ static void attach_one_task(struct rq *rq, struct task_struct *p)
 	rq_lock(rq, &rf);
 	update_rq_clock(rq);
 	attach_task(rq, p);
+	update_overloaded_rq(rq);
 	rq_unlock(rq, &rf);
 }
 
@@ -8090,20 +8113,22 @@ static void attach_one_task(struct rq *rq, struct task_struct *p)
 static void attach_tasks(struct lb_env *env)
 {
 	struct list_head *tasks = &env->tasks;
+	struct rq *rq = env->dst_rq;
 	struct task_struct *p;
 	struct rq_flags rf;
 
-	rq_lock(env->dst_rq, &rf);
-	update_rq_clock(env->dst_rq);
+	rq_lock(rq, &rf);
+	update_rq_clock(rq);
 
 	while (!list_empty(tasks)) {
 		p = list_first_entry(tasks, struct task_struct, se.group_node);
 		list_del_init(&p->se.group_node);
 
-		attach_task(env->dst_rq, p);
+		attach_task(rq, p);
 	}
 
-	rq_unlock(env->dst_rq, &rf);
+	update_overloaded_rq(rq);
+	rq_unlock(rq, &rf);
 }
 
 #ifdef CONFIG_NO_HZ_COMMON
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index ecb0d7052877..7b65c9046a75 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -433,6 +433,7 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
 {
 	update_idle_core(rq);
+	update_overloaded_rq(rq);
 	schedstat_inc(rq->sched_goidle);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8dccb34eb190..d2b6e65cc336 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -997,6 +997,7 @@ struct rq {
 
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
+	unsigned char		overloaded;
 
 	unsigned long		misfit_task_load;
 
@@ -1830,8 +1831,11 @@ extern int sched_update_scaling(void);
 
 extern void flush_smp_call_function_from_idle(void);
 
+extern void update_overloaded_rq(struct rq *rq);
+
 #else /* !CONFIG_SMP: */
 static inline void flush_smp_call_function_from_idle(void) { }
+static inline void update_overloaded_rq(struct rq *rq) { }
 #endif
 
 #include "stats.h"
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 810750e62118..6d5291875275 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1620,6 +1620,8 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+		atomic_set(&sd->shared->nr_overloaded_cpus, 0);
+		cpumask_clear(sdo_mask(sd->shared));
 	}
 
 	sd->private = sdd;
@@ -2085,7 +2087,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sds)
 				return -ENOMEM;
-- 
2.31.1



* Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS
  2022-05-05 12:23 [PATCH v3] sched/fair: filter out overloaded cpus in SIS Abel Wu
@ 2022-05-07 16:09 ` Chen Yu
  2022-05-07 17:50   ` Abel Wu
  2022-05-10  1:14 ` Josh Don
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Chen Yu @ 2022-05-07 16:09 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don,
	Linux Kernel Mailing List

Hi Abel,
On Fri, May 6, 2022 at 1:21 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> Try to improve searching efficiency of SIS by filtering out the
> overloaded cpus, and as a result the more overloaded the system
> is, the less cpus will be searched.
>
My understanding is that this patch aims to address the following issue:
what kind of CPUs should SIS scan. And we also have another patch[1]
from another angle: how many CPUs should SIS scan. I assume the two
directions could both help speed up the SIS process, so I'm curious what
the result would be with both patches applied, and I plan to run your
patch on my system too.
> The overloaded cpus are tracked through LLC shared domain. To
> regulate accesses to the shared data, the update happens mainly
> at the tick. But in order to make it more accurate, we also take
> the task migrations into consideration during load balancing which
> can be quite frequent due to short running workload causing newly-
> idle. Since an overloaded runqueue requires at least 2 non-idle
> tasks runnable, we could have more faith on the "frequent newly-
> idle" case.
>
> Benchmark
> =========
>
> Tests are done in an Intel(R) Xeon(R) Platinum 8260 CPU@2.40GHz
> machine with 2 NUMA nodes each of which has 24 cores with SMT2
> enabled, so 96 CPUs in total.
>
> All of the benchmarks are done inside a normal cpu cgroup in a
Do you have any script that I can leverage to launch the test?
> clean environment with cpu turbo disabled.
I would recommend applying the following patch (queued for 5.19) if the
intel_pstate driver is loaded, because there seems to be a utilization
calculation issue when turbo is disabled:

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 846bb3a78788..2216b24b6f84 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1322,6 +1322,7 @@ static ssize_t store_no_turbo(struct kobject *a, struct kobj_attribute *b,
 	mutex_unlock(&intel_pstate_limits_lock);
 
 	intel_pstate_update_policies();
+	arch_set_max_freq_ratio(global.no_turbo);
 
 	mutex_unlock(&intel_pstate_driver_lock);

--
2.25.1
[cut]
>
> v3:
>   - removed sched-idle balance feature and focus on SIS
>   - take non-CFS tasks into consideration
>   - several fixes/improvement suggested by Josh Don
>
> v2:
>   - several optimizations on sched-idle balancing
>   - ignore asym topos in can_migrate_task
>   - add more benchmarks including SIS efficiency
>   - re-organize patch as suggested by Mel
>
> v1: https://lore.kernel.org/lkml/20220217154403.6497-5-wuyun.abel@bytedance.com/
> v2: https://lore.kernel.org/lkml/20220409135104.3733193-1-wuyun.abel@bytedance.com/
>
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>  include/linux/sched/topology.h | 12 ++++++++++
>  kernel/sched/core.c            | 38 ++++++++++++++++++++++++++++++
>  kernel/sched/fair.c            | 43 +++++++++++++++++++++++++++-------
>  kernel/sched/idle.c            |  1 +
>  kernel/sched/sched.h           |  4 ++++
>  kernel/sched/topology.c        |  4 +++-
>  6 files changed, 92 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 56cffe42abbc..95c7ad1e05b5 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -81,8 +81,20 @@ struct sched_domain_shared {
>         atomic_t        ref;
>         atomic_t        nr_busy_cpus;
>         int             has_idle_cores;
> +
> +       /*
> +        * Tracking of the overloaded cpus can be heavy, so start
> +        * a new cacheline to avoid false sharing.
> +        */
Although we put the following items into a different cache line from the
ones above, is it possible that there is still cache false sharing if
CPU1 is reading nr_overloaded_cpus while CPU2 is updating
overloaded_cpus?
> +       atomic_t        nr_overloaded_cpus ____cacheline_aligned;
____cacheline_aligned seems to put nr_overloaded_cpus into a data
section, which looks unnecessary. Would ____cacheline_internodealigned_in_smp
be more lightweight?
> +       unsigned long   overloaded_cpus[]; /* Must be last */
>  };
>
[cut]
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d4bd299d67ab..79b4ff24faee 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
>  {
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> -       int i, cpu, idle_cpu = -1, nr = INT_MAX;
> +       struct sched_domain_shared *sds = sd->shared;
> +       int nr, nro, weight = sd->span_weight;
> +       int i, cpu, idle_cpu = -1;
>         struct rq *this_rq = this_rq();
>         int this = smp_processor_id();
>         struct sched_domain *this_sd;
> @@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>         if (!this_sd)
>                 return -1;
>
> +       nro = atomic_read(&sds->nr_overloaded_cpus);
> +       if (nro == weight)
> +               goto out;
> +
> +       nr = min_t(int, weight, p->nr_cpus_allowed);
> +
> +       /*
> +        * It's unlikely to find an idle cpu if the system is under
> +        * heavy pressure, so skip searching to save a few cycles
> +        * and relieve cache traffic.
> +        */
> +       if (weight - nro < (nr >> 4) && !has_idle_core)
> +               return -1;
In [1] we used util_avg to check if the domain is overloaded and quit
earlier, since util_avg is more stable and contains historic data. But I
think nr_running in your patch could be used as a complementary metric
and added to update_idle_cpu_scan() in [1] IMO.
> +
>         cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +       if (nro > 1)
> +               cpumask_andnot(cpus, cpus, sdo_mask(sds));
If I understand correctly, this is the core of the optimization: SIS
filters out the busy cores. I wonder if it is possible to save historic
h_nr_running/idle_h_nr_running and use the average value (like the
calculation of avg_scan_cost)?
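
As a minimal illustration of that thought (not code from either series),
something in the spirit of how avg_scan_cost is averaged could track a
decaying per-rq value; the overloaded_avg field below is hypothetical:

static void update_overloaded_avg(struct rq *rq)
{
	/* non-idle runnable tasks, same notion as rq_overloaded() */
	u64 nr = rq->nr_running - rq->cfs.idle_h_nr_running;

	/* same update_avg() helper fair.c already uses for avg_scan_cost */
	update_avg(&rq->overloaded_avg, nr);
}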

>
>         if (sched_feat(SIS_PROP) && !has_idle_core) {
>                 u64 avg_cost, avg_idle, span_avg;
> @@ -6354,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>                 avg_idle = this_rq->wake_avg_idle;
>                 avg_cost = this_sd->avg_scan_cost + 1;
>
> -               span_avg = sd->span_weight * avg_idle;
> +               span_avg = weight * avg_idle;
>                 if (span_avg > 4*avg_cost)
>                         nr = div_u64(span_avg, avg_cost);
>                 else
[cut]
[1]  https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@intel.com/

-- 
Thanks,
Chenyu


* Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS
  2022-05-07 16:09 ` Chen Yu
@ 2022-05-07 17:50   ` Abel Wu
  2022-05-09 15:21     ` Chen Yu
  0 siblings, 1 reply; 12+ messages in thread
From: Abel Wu @ 2022-05-07 17:50 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don,
	Linux Kernel Mailing List

Hi Chen,

On 5/8/22 12:09 AM, Chen Yu Wrote:
> Hi Abel,
> On Fri, May 6, 2022 at 1:21 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
>>
>> Try to improve searching efficiency of SIS by filtering out the
>> overloaded cpus, and as a result the more overloaded the system
>> is, the less cpus will be searched.
>>
> My understanding is that, this patch aims to address the following issue:
> What kind of CPUs should the SIS  scan.
> And we also have another patch[1]  from another angle:
> How many CPUs should the SIS  scan.
> I assume the two direction could both help speed up the SIS process, so

Agreed.

> I'm curious what the result would be with both patch applied, and I planned
> to run your patch on my system too.
>> The overloaded cpus are tracked through LLC shared domain. To
>> regulate accesses to the shared data, the update happens mainly
>> at the tick. But in order to make it more accurate, we also take
>> the task migrations into consideration during load balancing which
>> can be quite frequent due to short running workload causing newly-
>> idle. Since an overloaded runqueue requires at least 2 non-idle
>> tasks runnable, we could have more faith on the "frequent newly-
>> idle" case.
>>
>> Benchmark
>> =========
>>
>> Tests are done in an Intel(R) Xeon(R) Platinum 8260 CPU@2.40GHz
>> machine with 2 NUMA nodes each of which has 24 cores with SMT2
>> enabled, so 96 CPUs in total.
>>
>> All of the benchmarks are done inside a normal cpu cgroup in a
> Do you have any script that I can leverage to launch the test?

I benchmarked the following configs in mmtests:
	config-scheduler-unbound
	config-scheduler-schbench
	config-network-netperf-stream-unbound
	config-network-tbench

more details: https://github.com/gormanm/mmtests

>> clean environment with cpu turbo disabled.
> I would recommend to apply the following patch(queued for 5.19) if
> the intel_pstate driver is loaded, because it seems that there is a
> utilization calculation
> issue when turbo is disabled:
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 846bb3a78788..2216b24b6f84 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -1322,6 +1322,7 @@ static ssize_t store_no_turbo(struct kobject *a,
> struct kobj_attribute *b,
>    mutex_unlock(&intel_pstate_limits_lock);
> 
>    intel_pstate_update_policies();
> + arch_set_max_freq_ratio(global.no_turbo);
> 
>    mutex_unlock(&intel_pstate_driver_lock);
> 
> --
> 2.25.1
> [cut]

Thanks, I will apply it before the next round of testing.

>>
>> v3:
>>    - removed sched-idle balance feature and focus on SIS
>>    - take non-CFS tasks into consideration
>>    - several fixes/improvement suggested by Josh Don
>>
>> v2:
>>    - several optimizations on sched-idle balancing
>>    - ignore asym topos in can_migrate_task
>>    - add more benchmarks including SIS efficiency
>>    - re-organize patch as suggested by Mel
>>
>> v1: https://lore.kernel.org/lkml/20220217154403.6497-5-wuyun.abel@bytedance.com/
>> v2: https://lore.kernel.org/lkml/20220409135104.3733193-1-wuyun.abel@bytedance.com/
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>> ---
>>   include/linux/sched/topology.h | 12 ++++++++++
>>   kernel/sched/core.c            | 38 ++++++++++++++++++++++++++++++
>>   kernel/sched/fair.c            | 43 +++++++++++++++++++++++++++-------
>>   kernel/sched/idle.c            |  1 +
>>   kernel/sched/sched.h           |  4 ++++
>>   kernel/sched/topology.c        |  4 +++-
>>   6 files changed, 92 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 56cffe42abbc..95c7ad1e05b5 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -81,8 +81,20 @@ struct sched_domain_shared {
>>          atomic_t        ref;
>>          atomic_t        nr_busy_cpus;
>>          int             has_idle_cores;
>> +
>> +       /*
>> +        * Tracking of the overloaded cpus can be heavy, so start
>> +        * a new cacheline to avoid false sharing.
>> +        */
> Although we put the following items into different cache line compared to
> above ones, is it possible that there is still cache false sharing if
> CPU1 is reading nr_overloaded_cpus while
> CPU2 is updating overloaded_cpus?

I think it's not false sharing, it's just cache contention. But yes, it
is still possible if the two items were mixed with others (by the
compiler) in one cacheline, which seems out of our control.

>> +       atomic_t        nr_overloaded_cpus ____cacheline_aligned;
> ____cacheline_aligned seems to put nr_overloaded_cpus into data section, which
> seems to be unnecessary. Would ____cacheline_internodealigned_in_smp
> be more lightweight?

I didn't see the difference between the two macros; it would be
appreciated if you could shed some light.

>> +       unsigned long   overloaded_cpus[]; /* Must be last */
>>   };
>>
> [cut]
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d4bd299d67ab..79b4ff24faee 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>>   static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
>>   {
>>          struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>> -       int i, cpu, idle_cpu = -1, nr = INT_MAX;
>> +       struct sched_domain_shared *sds = sd->shared;
>> +       int nr, nro, weight = sd->span_weight;
>> +       int i, cpu, idle_cpu = -1;
>>          struct rq *this_rq = this_rq();
>>          int this = smp_processor_id();
>>          struct sched_domain *this_sd;
>> @@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>          if (!this_sd)
>>                  return -1;
>>
>> +       nro = atomic_read(&sds->nr_overloaded_cpus);
>> +       if (nro == weight)
>> +               goto out;
>> +
>> +       nr = min_t(int, weight, p->nr_cpus_allowed);
>> +
>> +       /*
>> +        * It's unlikely to find an idle cpu if the system is under
>> +        * heavy pressure, so skip searching to save a few cycles
>> +        * and relieve cache traffic.
>> +        */
>> +       if (weight - nro < (nr >> 4) && !has_idle_core)
>> +               return -1;
> In [1] we used util_avg to check if the domain is overloaded and quit
> earlier, since util_avg would be
> more stable and contains historic data. But I think nr_running in your
> patch could be used as
> complementary metric and added to update_idle_cpu_scan() in [1] IMO.
>> +
>>          cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +       if (nro > 1)
>> +               cpumask_andnot(cpus, cpus, sdo_mask(sds));
> If I understand correctly, this is the core of the optimization: SIS
> filters out the busy cores. I wonder if it
> is possible to save historic h_nr_running/idle_h_nr_running and use
> the average value? (like the calculation
> of avg_scan_cost).

Yes, I have already been working on that for several days, along with
some improvements to load balance (group_has_spare). Ideally we can
finally get rid of the cache issues.

> 
>>
>>          if (sched_feat(SIS_PROP) && !has_idle_core) {
>>                  u64 avg_cost, avg_idle, span_avg;
>> @@ -6354,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>                  avg_idle = this_rq->wake_avg_idle;
>>                  avg_cost = this_sd->avg_scan_cost + 1;
>>
>> -               span_avg = sd->span_weight * avg_idle;
>> +               span_avg = weight * avg_idle;
>>                  if (span_avg > 4*avg_cost)
>>                          nr = div_u64(span_avg, avg_cost);
>>                  else
> [cut]
> [1]  https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@intel.com/
> 

I followed all 3 versions of your patch, and I think your work makes
sense. My only concern was that the depth is updated every llc_size
milliseconds, which could be a long period on large machines, and the
load can vary quickly enough to deviate from the historic value. But it
seems not to be a big deal, as we discussed in your v1 patch.

Thanks & BR,
Abel


* Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS
  2022-05-07 17:50   ` Abel Wu
@ 2022-05-09 15:21     ` Chen Yu
  2022-05-09 15:31       ` Chen Yu
  2022-05-10  2:55       ` Abel Wu
  0 siblings, 2 replies; 12+ messages in thread
From: Chen Yu @ 2022-05-09 15:21 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don,
	Linux Kernel Mailing List

On Sun, May 8, 2022 at 1:50 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> Hi Chen,
>
> On 5/8/22 12:09 AM, Chen Yu Wrote:
[cut]
> >> @@ -81,8 +81,20 @@ struct sched_domain_shared {
> >>          atomic_t        ref;
> >>          atomic_t        nr_busy_cpus;
> >>          int             has_idle_cores;
> >> +
> >> +       /*
> >> +        * Tracking of the overloaded cpus can be heavy, so start
> >> +        * a new cacheline to avoid false sharing.
> >> +        */
> > Although we put the following items into different cache line compared to
> > above ones, is it possible that there is still cache false sharing if
> > CPU1 is reading nr_overloaded_cpus while
> > CPU2 is updating overloaded_cpus?
>
> I think it's not false sharing, it's just cache contention. But yes,
> it is still possible if the two items mixed with others (by compiler)
> in one cacheline, which seems out of our control..
>
My understanding is that, since nr_overloaded_cpus starts a new cache
line, overloaded_cpus is very likely to be in the same cache line. Only
if writes to the overloaded_cpus mask are not frequent (maybe the
tick-based update is not frequent) can the reads of nr_overloaded_cpus,
which is mainly read by SIS, survive the cache false sharing. I have a
(maybe stupid) thought: could the overloaded_cpus mask and
nr_overloaded_cpus be put on 2 separate cache lines?
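
A minimal layout sketch of that 2-cache-line idea (illustrative only, not
a change from the posted patch, and assuming an alignment attribute on
the trailing flexible array is acceptable here):

struct sched_domain_shared {
	atomic_t	ref;
	atomic_t	nr_busy_cpus;
	int		has_idle_cores;

	/* counter on its own cache line, mostly read by SIS */
	atomic_t	nr_overloaded_cpus ____cacheline_aligned;

	/* mask on yet another cache line, written from tick/LB paths */
	unsigned long	overloaded_cpus[] ____cacheline_aligned; /* Must be last */
};
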
> >> +       atomic_t        nr_overloaded_cpus ____cacheline_aligned;
> > ____cacheline_aligned seems to put nr_overloaded_cpus into data section, which
> > seems to be unnecessary. Would ____cacheline_internodealigned_in_smp
> > be more lightweight?
>
> I didn't see the difference of the two macros, it would be appreciate
> if you can shed some light.
>
Sorry, I mistook ____cacheline_aligned for __cacheline_aligned, which is
put into a data section. Please ignore my previous comment.
> >> +       unsigned long   overloaded_cpus[]; /* Must be last */
> >>   };
> >>
[cut]
> >> +       /*
> >> +        * It's unlikely to find an idle cpu if the system is under
> >> +        * heavy pressure, so skip searching to save a few cycles
> >> +        * and relieve cache traffic.
> >> +        */
> >> +       if (weight - nro < (nr >> 4) && !has_idle_core)
> >> +               return -1;
> > In [1] we used util_avg to check if the domain is overloaded and quit
> > earlier, since util_avg would be
> > more stable and contains historic data. But I think nr_running in your
> > patch could be used as
> > complementary metric and added to update_idle_cpu_scan() in [1] IMO.
> >> +
> >>          cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >> +       if (nro > 1)
> >> +               cpumask_andnot(cpus, cpus, sdo_mask(sds));
> > If I understand correctly, this is the core of the optimization: SIS
> > filters out the busy cores. I wonder if it
> > is possible to save historic h_nr_running/idle_h_nr_running and use
> > the average value? (like the calculation
> > of avg_scan_cost).
>
> Yes, I have been already working on that for several days, and
> along with some improvement on load balance (group_has_spare).
> Ideally we can finally get rid out of the cache issues.
>
Ok, could you please also Cc me in the next version? I'd like to have
a try.

-- 
Thanks,
Chenyu


* Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS
  2022-05-09 15:21     ` Chen Yu
@ 2022-05-09 15:31       ` Chen Yu
  2022-05-10  2:55       ` Abel Wu
  1 sibling, 0 replies; 12+ messages in thread
From: Chen Yu @ 2022-05-09 15:31 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don,
	Linux Kernel Mailing List

On Mon, May 9, 2022 at 11:21 PM Chen Yu <yu.chen.surf@gmail.com> wrote:
>
> On Sun, May 8, 2022 at 1:50 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
> >
> > Hi Chen,
> >
> > On 5/8/22 12:09 AM, Chen Yu Wrote:
> [cut]
> > >> @@ -81,8 +81,20 @@ struct sched_domain_shared {
> > >>          atomic_t        ref;
> > >>          atomic_t        nr_busy_cpus;
> > >>          int             has_idle_cores;
> > >> +
> > >> +       /*
> > >> +        * Tracking of the overloaded cpus can be heavy, so start
> > >> +        * a new cacheline to avoid false sharing.
> > >> +        */
> > > Although we put the following items into different cache line compared to
> > > above ones, is it possible that there is still cache false sharing if
> > > CPU1 is reading nr_overloaded_cpus while
> > > CPU2 is updating overloaded_cpus?
> >
> > I think it's not false sharing, it's just cache contention. But yes,
> > it is still possible if the two items mixed with others (by compiler)
> > in one cacheline, which seems out of our control..
> >
> My understanding is that, since nr_overloaded_cpus starts with a new
> cache line,  overloaded_cpus is very likely to be in the same cache line.
> Only If the write to nr_overloaded_cpus mask is not frequent(maybe tick based
> update is not frequent), the read of nr_overloaded_cpus can survive from cache
> false sharing, which is mainly read by SIS.  I have a stupid thought
> that if nr_overloaded_cpus
> mask and nr_overloaded_cpus could be put to 2 cache lines.
Not exactly, as overloaded_cpus and nr_overloaded_cpus are updated at the same
time, it is not a false sharing case.

-- 
Thanks,
Chenyu


* Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS
  2022-05-05 12:23 [PATCH v3] sched/fair: filter out overloaded cpus in SIS Abel Wu
  2022-05-07 16:09 ` Chen Yu
@ 2022-05-10  1:14 ` Josh Don
  2022-05-10  8:03   ` Abel Wu
  2022-05-19 22:16 ` Tim Chen
  2022-05-20  6:48 ` K Prateek Nayak
  3 siblings, 1 reply; 12+ messages in thread
From: Josh Don @ 2022-05-10  1:14 UTC (permalink / raw)
  To: Abel Wu; +Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, linux-kernel

Hi Abel,

Overall this looks good, just a couple of comments.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d4bd299d67ab..79b4ff24faee 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
>  {
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> -       int i, cpu, idle_cpu = -1, nr = INT_MAX;
> +       struct sched_domain_shared *sds = sd->shared;
> +       int nr, nro, weight = sd->span_weight;
> +       int i, cpu, idle_cpu = -1;
>         struct rq *this_rq = this_rq();
>         int this = smp_processor_id();
>         struct sched_domain *this_sd;
> @@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>         if (!this_sd)
>                 return -1;
>
> +       nro = atomic_read(&sds->nr_overloaded_cpus);
> +       if (nro == weight)
> +               goto out;

This assumes that the sd we're operating on here is the LLC domain
(true for current use). Perhaps to catch future bugs from changing
this assumption, we could WARN_ON_ONCE(nro > weight).
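
For illustration, such a check would sit right before the early bail-out
(a sketch, not the posted code):

	nro = atomic_read(&sds->nr_overloaded_cpus);
	WARN_ON_ONCE(nro > weight);	/* sd assumed to be the LLC domain */
	if (nro == weight)
		goto out;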

> +
> +       nr = min_t(int, weight, p->nr_cpus_allowed);
> +
> +       /*
> +        * It's unlikely to find an idle cpu if the system is under
> +        * heavy pressure, so skip searching to save a few cycles
> +        * and relieve cache traffic.
> +        */
> +       if (weight - nro < (nr >> 4) && !has_idle_core)
> +               return -1;

nit: nr / 16 is easier to read and the compiler will do the shifting for you.

Was < intentional vs <= ? With <= you'll be able to skip the search in
the case where both sides evaluate to 0 (can happen frequently if we
have no idle cpus, and a task with a small affinity mask).

This will also get a bit confused in the case where the task has many
cpus allowed, but almost all of them on a different LLC than the one
we're considering here. Apart from caching the per-LLC
nr_cpus_allowed, we could instead use cpumask_weight(cpus) below (and
only do this in the !has_idle_core case to reduce calls to
cpumask_weight()).
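
A rough sketch of that variant, with the intersection done first so the
width reflects only the cpus allowed inside this LLC (illustrative, not
the posted patch):

	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	if (!has_idle_core) {
		/* allowed cpus inside this LLC only */
		nr = cpumask_weight(cpus);
		if (weight - nro < nr / 16)
			return -1;
	}

	if (nro > 1)
		cpumask_andnot(cpus, cpus, sdo_mask(sds));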

> +
>         cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +       if (nro > 1)
> +               cpumask_andnot(cpus, cpus, sdo_mask(sds));

Just
if (nro)
?

> @@ -6392,6 +6407,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>
>                 update_avg(&this_sd->avg_scan_cost, time);
>         }
> +out:
> +       if (has_idle_core)
> +               WRITE_ONCE(sds->has_idle_cores, 0);

nit: use set_idle_cores() instead (or, if you really want to avoid the
extra sds dereference, add a __set_idle_cores(sds, val) helper you can
call directly).
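
Something like the following is what that nit points at; the helper name
and shape are hypothetical, modeled on the existing set_idle_cores():

static inline void __set_idle_cores(struct sched_domain_shared *sds, int val)
{
	if (sds)
		WRITE_ONCE(sds->has_idle_cores, val);
}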

> @@ -7904,6 +7922,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
>                         continue;
>
>                 detach_task(p, env);
> +               update_overloaded_rq(env->src_rq);
>
>                 /*
>                  * Right now, this is only the second place where
> @@ -8047,6 +8066,9 @@ static int detach_tasks(struct lb_env *env)
>                 list_move(&p->se.group_node, tasks);
>         }
>
> +       if (detached)
> +               update_overloaded_rq(env->src_rq);
> +

Thinking about this more, I don't see an issue with moving the
update_overloaded_rq() calls to enqueue/dequeue_task, rather than here
in the attach/detach_task paths. Overloaded state only changes when we
pass the boundary of 2 runnable non-idle tasks, so thrashing of the
overloaded mask is a lot less worrisome than if it were updated on the
boundary of 1 runnable task. The attach/detach_task paths run as part
of load balancing, which can be on a millisecond time scale.
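
A hedged sketch of that alternative (not part of the posted series; the
hook placement is an assumption for illustration): the calls would move
out of the load-balance paths and into the fair-class enqueue/dequeue
paths, roughly:

	/* e.g. at the tail of enqueue_task_fair()/dequeue_task_fair() */
	update_overloaded_rq(rq);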

Best,
Josh


* Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS
  2022-05-09 15:21     ` Chen Yu
  2022-05-09 15:31       ` Chen Yu
@ 2022-05-10  2:55       ` Abel Wu
  1 sibling, 0 replies; 12+ messages in thread
From: Abel Wu @ 2022-05-10  2:55 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don,
	Linux Kernel Mailing List


On 5/9/22 11:21 PM, Chen Yu Wrote:
>> Yes, I have been already working on that for several days, and
>> along with some improvement on load balance (group_has_spare).
>> Ideally we can finally get rid out of the cache issues.
>>
> Ok, could you please also Cc me in the next version? I'd like to have
> a try.
> 

I will, thanks!

Abel


* Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS
  2022-05-10  1:14 ` Josh Don
@ 2022-05-10  8:03   ` Abel Wu
  0 siblings, 0 replies; 12+ messages in thread
From: Abel Wu @ 2022-05-10  8:03 UTC (permalink / raw)
  To: Josh Don
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, linux-kernel,
	Abel Wu, Chen Yu

Hi Josh,

On 5/10/22 9:14 AM, Josh Don Wrote:
> Hi Abel,
> 
> Overall this looks good, just a couple of comments.
> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d4bd299d67ab..79b4ff24faee 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>>   static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
>>   {
>>          struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>> -       int i, cpu, idle_cpu = -1, nr = INT_MAX;
>> +       struct sched_domain_shared *sds = sd->shared;
>> +       int nr, nro, weight = sd->span_weight;
>> +       int i, cpu, idle_cpu = -1;
>>          struct rq *this_rq = this_rq();
>>          int this = smp_processor_id();
>>          struct sched_domain *this_sd;
>> @@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>          if (!this_sd)
>>                  return -1;
>>
>> +       nro = atomic_read(&sds->nr_overloaded_cpus);
>> +       if (nro == weight)
>> +               goto out;
> 
> This assumes that the sd we're operating on here is the LLC domain
> (true for current use). Perhaps to catch future bugs from changing
> this assumption, we could WARN_ON_ONCE(nro > weight).

The @sds comes from sd->shared, so I don't think the condition will
break once we operate on other-level domains. But a quick check for
sds != NULL may be needed then, since domains can have no sds attached.

> 
>> +
>> +       nr = min_t(int, weight, p->nr_cpus_allowed);
>> +
>> +       /*
>> +        * It's unlikely to find an idle cpu if the system is under
>> +        * heavy pressure, so skip searching to save a few cycles
>> +        * and relieve cache traffic.
>> +        */
>> +       if (weight - nro < (nr >> 4) && !has_idle_core)
>> +               return -1;
> 
> nit: nr / 16 is easier to read and the compiler will do the shifting for you.

Agreed.

> 
> Was < intentional vs <= ? With <= you'll be able to skip the search in
> the case where both sides evaluate to 0 (can happen frequently if we
> have no idle cpus, and a task with a small affinity mask).

It's intentional; the idea is to unconditionally pass when there are
fewer than 16 cpus to search, in which case scalability doesn't seem to
be an issue. But I made a mistake: (weight - nro) can't be 0 here, so
it's not appropriate to use "<".

BTW, I think Chen Yu's proposal[1] on search depth limitation is a
better and more reasonable idea. And he is doing some benchmarks on the
combination of our work.

[1] https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@intel.com/

> 
> This will also get a bit confused in the case where the task has many
> cpus allowed, but almost all of them on a different LLC than the one
> we're considering here. Apart from caching the per-LLC
> nr_cpus_allowed, we could instead use cpumask_weight(cpus) below (and
> only do this in the !has_idle_core case to reduce calls to
> cpumask_weight()).

Yes, the task might have many cpus allowed on another LLC; the idea is
to use @nr as a worst-case bound. And with Chen's work, I think we can
get rid of nr_cpus_allowed.

> 
>> +
>>          cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +       if (nro > 1)
>> +               cpumask_andnot(cpus, cpus, sdo_mask(sds));
> 
> Just
> if (nro)
> ?

I think it's just not worth touching sdo_mask(sds), which causes heavy
cache traffic, if it only contains one cpu.

> 
>> @@ -6392,6 +6407,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>
>>                  update_avg(&this_sd->avg_scan_cost, time);
>>          }
>> +out:
>> +       if (has_idle_core)
>> +               WRITE_ONCE(sds->has_idle_cores, 0);
> 
> nit: use set_idle_cores() instead (or, if you really want to avoid the
> extra sds dereference, add a __set_idle_cores(sds, val) helper you can
> call directly.

OK, will do.

> 
>> @@ -7904,6 +7922,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
>>                          continue;
>>
>>                  detach_task(p, env);
>> +               update_overloaded_rq(env->src_rq);
>>
>>                  /*
>>                   * Right now, this is only the second place where
>> @@ -8047,6 +8066,9 @@ static int detach_tasks(struct lb_env *env)
>>                  list_move(&p->se.group_node, tasks);
>>          }
>>
>> +       if (detached)
>> +               update_overloaded_rq(env->src_rq);
>> +
> 
> Thinking about this more, I don't see an issue with moving the
> update_overloaded_rq() calls to enqueue/dequeue_task, rather than here
> in the attach/detach_task paths. Overloaded state only changes when we
> pass the boundary of 2 runnable non-idle tasks, so thashing of the
> overloaded mask is a lot less worrisome than if it were updated on the
> boundary of 1 runnable task. The attach/detach_task paths run as part
> of load balancing, which can be on a millisecond time scale.

It's really hard to say which one is better; I think it's more
workload-specific. It's common on our cloud servers that a long-running
workload co-exists with a short-running workload, which could flip the
status frequently.

Thanks & BR,
Abel


* Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS
  2022-05-05 12:23 [PATCH v3] sched/fair: filter out overloaded cpus in SIS Abel Wu
  2022-05-07 16:09 ` Chen Yu
  2022-05-10  1:14 ` Josh Don
@ 2022-05-19 22:16 ` Tim Chen
  2022-05-20  2:48   ` Abel Wu
  2022-05-20  6:48 ` K Prateek Nayak
  3 siblings, 1 reply; 12+ messages in thread
From: Tim Chen @ 2022-05-19 22:16 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don
  Cc: linux-kernel

On Thu, 2022-05-05 at 20:23 +0800, Abel Wu wrote:
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>  include/linux/sched/topology.h | 12 ++++++++++
>  kernel/sched/core.c            | 38 ++++++++++++++++++++++++++++++
>  kernel/sched/fair.c            | 43 +++++++++++++++++++++++++++-------
>  kernel/sched/idle.c            |  1 +
>  kernel/sched/sched.h           |  4 ++++
>  kernel/sched/topology.c        |  4 +++-
>  6 files changed, 92 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 56cffe42abbc..95c7ad1e05b5 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -81,8 +81,20 @@ struct sched_domain_shared {
>  	atomic_t	ref;
>  	atomic_t	nr_busy_cpus;
>  	int		has_idle_cores;
> +
> +	/*
> +	 * Tracking of the overloaded cpus can be heavy, so start
> +	 * a new cacheline to avoid false sharing.
> +	 */
> +	atomic_t	nr_overloaded_cpus ____cacheline_aligned;

Abel,

This is nice work. I have one comment.

The update and reading of nr_overloaded_cpus will incur cache bouncing
costs. As far as I can tell, this counter is used to determine if we
should bail out of the search for an idle CPU when the system is heavily
loaded, and I hope we can avoid using an atomic counter in these heavily
used scheduler paths. The logic to filter overloaded CPUs only needs the
overloaded_cpus[] mask and not the nr_overloaded_cpus counter.

So I recommend that you break out the logic that uses the
nr_overloaded_cpus atomic counter to detect a heavily loaded LLC into a
second patch, so that it can be evaluated on its own merits.
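
As a rough illustration of that split (a sketch based on the function in
this patch with the atomic counter simply dropped, not actual posted
code), the first patch could keep update_overloaded_rq() mask-only:

void update_overloaded_rq(struct rq *rq)
{
	struct sched_domain_shared *sds;
	bool overloaded = rq_overloaded(rq);
	int cpu = cpu_of(rq);

	lockdep_assert_rq_held(rq);

	if (rq->overloaded == overloaded)
		return;

	rcu_read_lock();
	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
	if (sds) {
		/* mask only; the counter would come in a follow-up patch */
		if (overloaded)
			cpumask_set_cpu(cpu, sdo_mask(sds));
		else
			cpumask_clear_cpu(cpu, sdo_mask(sds));

		rq->overloaded = overloaded;
	}
	rcu_read_unlock();
}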

That functionality overlaps with Chen Yu's patch to limit search depending
on load, so it will be easier to compare the two approaches if it is separated.

Otherwise, the logic in the patch to use overloaded_cpus[]
mask to filter out the overloaded cpus looks fine and complements
Chen Yu's patch.

Thanks.

Tim

> +	unsigned long	overloaded_cpus[]; /* Must be last */
>  };
>  
> 



* Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS
  2022-05-19 22:16 ` Tim Chen
@ 2022-05-20  2:48   ` Abel Wu
  0 siblings, 0 replies; 12+ messages in thread
From: Abel Wu @ 2022-05-20  2:48 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don
  Cc: linux-kernel


On 5/20/22 6:16 AM, Tim Chen Wrote:
> On Thu, 2022-05-05 at 20:23 +0800, Abel Wu wrote:
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>> ---
>>   include/linux/sched/topology.h | 12 ++++++++++
>>   kernel/sched/core.c            | 38 ++++++++++++++++++++++++++++++
>>   kernel/sched/fair.c            | 43 +++++++++++++++++++++++++++-------
>>   kernel/sched/idle.c            |  1 +
>>   kernel/sched/sched.h           |  4 ++++
>>   kernel/sched/topology.c        |  4 +++-
>>   6 files changed, 92 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 56cffe42abbc..95c7ad1e05b5 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -81,8 +81,20 @@ struct sched_domain_shared {
>>   	atomic_t	ref;
>>   	atomic_t	nr_busy_cpus;
>>   	int		has_idle_cores;
>> +
>> +	/*
>> +	 * Tracking of the overloaded cpus can be heavy, so start
>> +	 * a new cacheline to avoid false sharing.
>> +	 */
>> +	atomic_t	nr_overloaded_cpus ____cacheline_aligned;
> 
> Abel,
> 
> This is nice work. I have one comment.
> 
> The update and reading of nr_overloaded_cpus will incur cache bouncing cost.
> As far as I can tell, this counter is used to determine if we should bail out
> from the search for an idle CPU when the system is heavily loaded.  And I
> hope we can avoid using an atomic counter in these heavily used scheduler paths.
> The logic to filter overloaded CPUs only needs the overloaded_cpus[]
> mask and not the nr_overloaded_cpus counter.
> 
> So I recommend that you break out the logic that uses the nr_overloaded_cpus
> atomic counter to detect a heavily loaded LLC into a second patch, so that
> it can be evaluated on its own merit.
> 
> That functionality overlaps with Chen Yu's patch to limit search depending
> on load, so it will be easier to compare the two approaches if it is separated.
> 
> Otherwise, the logic in the patch to use overloaded_cpus[]
> mask to filter out the overloaded cpus looks fine and complements
> Chen Yu's patch.

OK, will do. And ideally the nr_overloaded_cpus atomic can be replaced
by Chen's patch.

Thanks & BR
Abel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS
  2022-05-05 12:23 [PATCH v3] sched/fair: filter out overloaded cpus in SIS Abel Wu
                   ` (2 preceding siblings ...)
  2022-05-19 22:16 ` Tim Chen
@ 2022-05-20  6:48 ` K Prateek Nayak
  2022-05-20  7:43   ` Abel Wu
  3 siblings, 1 reply; 12+ messages in thread
From: K Prateek Nayak @ 2022-05-20  6:48 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don
  Cc: linux-kernel

Hello Abel,

We tested this patch on our systems.

tl;dr

- We observed some regressions in schbench with 128 workers in
  NPS4 mode.
- tbench shows regression for 32 workers in NPS2 mode and 64 workers
  in NPS2 and NPS4 mode.
- Great improvements in schbench for low worker count.
- Overall, the performance seems to be comparable to the tip.

Below are the detailed numbers for each benchmark.

On 5/5/2022 5:53 PM, Abel Wu wrote:
> Try to improve searching efficiency of SIS by filtering out the
> overloaded cpus, and as a result the more overloaded the system
> is, the less cpus will be searched.
> 
> The overloaded cpus are tracked through LLC shared domain. To
> regulate accesses to the shared data, the update happens mainly
> at the tick. But in order to make it more accurate, we also take
> the task migrations into consideration during load balancing which
> can be quite frequent due to short running workload causing newly-
> idle. Since an overloaded runqueue requires at least 2 non-idle
> tasks runnable, we could have more faith on the "frequent newly-
> idle" case.
> 
> Benchmark
> =========
> 
> Tests are done in an Intel(R) Xeon(R) Platinum 8260 CPU@2.40GHz
> machine with 2 NUMA nodes each of which has 24 cores with SMT2
> enabled, so 96 CPUs in total.
> 
> All of the benchmarks are done inside a normal cpu cgroup in a
> clean environment with cpu turbo disabled.
> 
> Based on tip sched/core 089c02ae2771 (v5.18-rc1).
> 
> Results
> =======
> 
> hackbench-process-pipes
>                              vanilla		     filter
> Amean     1        0.2537 (   0.00%)      0.2330 (   8.15%)
> Amean     4        0.7090 (   0.00%)      0.7440 *  -4.94%*
> Amean     7        0.9153 (   0.00%)      0.9040 (   1.24%)
> Amean     12       1.1473 (   0.00%)      1.0857 *   5.37%*
> Amean     21       2.7210 (   0.00%)      2.2320 *  17.97%*
> Amean     30       4.8263 (   0.00%)      3.6170 *  25.06%*
> Amean     48       7.4107 (   0.00%)      6.1130 *  17.51%*
> Amean     79       9.2120 (   0.00%)      8.2350 *  10.61%*
> Amean     110     10.1647 (   0.00%)      8.8043 *  13.38%*
> Amean     141     11.5713 (   0.00%)     10.5867 *   8.51%*
> Amean     172     13.7963 (   0.00%)     12.8120 *   7.13%*
> Amean     203     15.9283 (   0.00%)     14.8703 *   6.64%*
> Amean     234     17.8737 (   0.00%)     17.1053 *   4.30%*
> Amean     265     19.8443 (   0.00%)     18.7870 *   5.33%*
> Amean     296     22.4147 (   0.00%)     21.3943 *   4.55%*
> 
> There is a regression in 4-groups test in which case lots of busy
> cpus can be found in the system. The busy cpus are not recorded in
> the overloaded cpu mask, so we trade overhead for nothing in SIS.
> This is the worst case of this feature.
> 
> hackbench-process-sockets
>                              vanilla		     filter
> Amean     1        0.5343 (   0.00%)      0.5270 (   1.37%)
> Amean     4        1.4500 (   0.00%)      1.4273 *   1.56%*
> Amean     7        2.4813 (   0.00%)      2.4383 *   1.73%*
> Amean     12       4.1357 (   0.00%)      4.0827 *   1.28%*
> Amean     21       6.9707 (   0.00%)      6.9290 (   0.60%)
> Amean     30       9.8373 (   0.00%)      9.6730 *   1.67%*
> Amean     48      15.6233 (   0.00%)     15.3213 *   1.93%*
> Amean     79      26.2763 (   0.00%)     25.3293 *   3.60%*
> Amean     110     36.6170 (   0.00%)     35.8920 *   1.98%*
> Amean     141     47.0720 (   0.00%)     45.8773 *   2.54%*
> Amean     172     57.0580 (   0.00%)     56.1627 *   1.57%*
> Amean     203     67.2040 (   0.00%)     66.4323 *   1.15%*
> Amean     234     77.8897 (   0.00%)     76.6320 *   1.61%*
> Amean     265     88.0437 (   0.00%)     87.1400 (   1.03%)
> Amean     296     98.2387 (   0.00%)     96.8633 *   1.40%*
> 
> hackbench-thread-pipes
>                              vanilla		     filter
> Amean     1        0.2693 (   0.00%)      0.2800 *  -3.96%*
> Amean     4        0.7843 (   0.00%)      0.7680 (   2.08%)
> Amean     7        0.9287 (   0.00%)      0.9217 (   0.75%)
> Amean     12       1.4443 (   0.00%)      1.3680 *   5.29%*
> Amean     21       3.5150 (   0.00%)      3.1107 *  11.50%*
> Amean     30       6.3997 (   0.00%)      5.2160 *  18.50%*
> Amean     48       8.4183 (   0.00%)      7.8477 *   6.78%*
> Amean     79      10.0713 (   0.00%)      9.2240 *   8.41%*
> Amean     110     10.9940 (   0.00%)     10.1280 *   7.88%*
> Amean     141     13.6347 (   0.00%)     11.9387 *  12.44%*
> Amean     172     15.0523 (   0.00%)     14.4117 (   4.26%)
> Amean     203     18.0710 (   0.00%)     17.3533 (   3.97%)
> Amean     234     19.7413 (   0.00%)     19.8453 (  -0.53%)
> Amean     265     23.1820 (   0.00%)     22.8223 (   1.55%)
> Amean     296     25.3820 (   0.00%)     24.2397 (   4.50%)
> 
> hackbench-thread-sockets
>                              vanilla		     filter
> Amean     1        0.5893 (   0.00%)      0.5750 *   2.43%*
> Amean     4        1.4853 (   0.00%)      1.4727 (   0.85%)
> Amean     7        2.5353 (   0.00%)      2.5047 *   1.21%*
> Amean     12       4.3003 (   0.00%)      4.1910 *   2.54%*
> Amean     21       7.1930 (   0.00%)      7.1533 (   0.55%)
> Amean     30      10.0983 (   0.00%)      9.9690 *   1.28%*
> Amean     48      15.9853 (   0.00%)     15.6963 *   1.81%*
> Amean     79      26.7537 (   0.00%)     25.9497 *   3.01%*
> Amean     110     37.3850 (   0.00%)     36.6793 *   1.89%*
> Amean     141     47.7730 (   0.00%)     47.0967 *   1.42%*
> Amean     172     58.4280 (   0.00%)     57.5513 *   1.50%*
> Amean     203     69.3093 (   0.00%)     67.7680 *   2.22%*
> Amean     234     80.0190 (   0.00%)     78.2633 *   2.19%*
> Amean     265     90.7237 (   0.00%)     89.1027 *   1.79%*
> Amean     296    101.1153 (   0.00%)     99.2693 *   1.83%*
> 
> schbench
> 				   vanilla		   filter
> Lat 50.0th-qrtle-1         5.00 (   0.00%)        5.00 (   0.00%)
> Lat 75.0th-qrtle-1         5.00 (   0.00%)        5.00 (   0.00%)
> Lat 90.0th-qrtle-1         5.00 (   0.00%)        5.00 (   0.00%)
> Lat 95.0th-qrtle-1         5.00 (   0.00%)        5.00 (   0.00%)
> Lat 99.0th-qrtle-1         6.00 (   0.00%)        6.00 (   0.00%)
> Lat 99.5th-qrtle-1         7.00 (   0.00%)        6.00 (  14.29%)
> Lat 99.9th-qrtle-1         7.00 (   0.00%)        7.00 (   0.00%)
> Lat 50.0th-qrtle-2         6.00 (   0.00%)        6.00 (   0.00%)
> Lat 75.0th-qrtle-2         7.00 (   0.00%)        7.00 (   0.00%)
> Lat 90.0th-qrtle-2         7.00 (   0.00%)        7.00 (   0.00%)
> Lat 95.0th-qrtle-2         7.00 (   0.00%)        7.00 (   0.00%)
> Lat 99.0th-qrtle-2         9.00 (   0.00%)        8.00 (  11.11%)
> Lat 99.5th-qrtle-2         9.00 (   0.00%)        9.00 (   0.00%)
> Lat 99.9th-qrtle-2        12.00 (   0.00%)       11.00 (   8.33%)
> Lat 50.0th-qrtle-4         8.00 (   0.00%)        8.00 (   0.00%)
> Lat 75.0th-qrtle-4        10.00 (   0.00%)       10.00 (   0.00%)
> Lat 90.0th-qrtle-4        10.00 (   0.00%)       11.00 ( -10.00%)
> Lat 95.0th-qrtle-4        11.00 (   0.00%)       11.00 (   0.00%)
> Lat 99.0th-qrtle-4        12.00 (   0.00%)       13.00 (  -8.33%)
> Lat 99.5th-qrtle-4        16.00 (   0.00%)       14.00 (  12.50%)
> Lat 99.9th-qrtle-4        17.00 (   0.00%)       15.00 (  11.76%)
> Lat 50.0th-qrtle-8        13.00 (   0.00%)       13.00 (   0.00%)
> Lat 75.0th-qrtle-8        16.00 (   0.00%)       16.00 (   0.00%)
> Lat 90.0th-qrtle-8        18.00 (   0.00%)       18.00 (   0.00%)
> Lat 95.0th-qrtle-8        19.00 (   0.00%)       18.00 (   5.26%)
> Lat 99.0th-qrtle-8        24.00 (   0.00%)       21.00 (  12.50%)
> Lat 99.5th-qrtle-8        28.00 (   0.00%)       26.00 (   7.14%)
> Lat 99.9th-qrtle-8        33.00 (   0.00%)       32.00 (   3.03%)
> Lat 50.0th-qrtle-16       20.00 (   0.00%)       20.00 (   0.00%)
> Lat 75.0th-qrtle-16       28.00 (   0.00%)       28.00 (   0.00%)
> Lat 90.0th-qrtle-16       32.00 (   0.00%)       32.00 (   0.00%)
> Lat 95.0th-qrtle-16       34.00 (   0.00%)       34.00 (   0.00%)
> Lat 99.0th-qrtle-16       40.00 (   0.00%)       40.00 (   0.00%)
> Lat 99.5th-qrtle-16       44.00 (   0.00%)       44.00 (   0.00%)
> Lat 99.9th-qrtle-16       53.00 (   0.00%)       67.00 ( -26.42%)
> Lat 50.0th-qrtle-32       39.00 (   0.00%)       36.00 (   7.69%)
> Lat 75.0th-qrtle-32       57.00 (   0.00%)       52.00 (   8.77%)
> Lat 90.0th-qrtle-32       69.00 (   0.00%)       61.00 (  11.59%)
> Lat 95.0th-qrtle-32       76.00 (   0.00%)       64.00 (  15.79%)
> Lat 99.0th-qrtle-32       88.00 (   0.00%)       74.00 (  15.91%)
> Lat 99.5th-qrtle-32       91.00 (   0.00%)       80.00 (  12.09%)
> Lat 99.9th-qrtle-32      115.00 (   0.00%)      107.00 (   6.96%)
> Lat 50.0th-qrtle-47       63.00 (   0.00%)       55.00 (  12.70%)
> Lat 75.0th-qrtle-47       93.00 (   0.00%)       80.00 (  13.98%)
> Lat 90.0th-qrtle-47      116.00 (   0.00%)       97.00 (  16.38%)
> Lat 95.0th-qrtle-47      129.00 (   0.00%)      106.00 (  17.83%)
> Lat 99.0th-qrtle-47      148.00 (   0.00%)      123.00 (  16.89%)
> Lat 99.5th-qrtle-47      157.00 (   0.00%)      132.00 (  15.92%)
> Lat 99.9th-qrtle-47      387.00 (   0.00%)      164.00 (  57.62%)
> 
> netperf-udp
> 				    vanilla		    filter
> Hmean     send-64         183.09 (   0.00%)      182.28 (  -0.44%)
> Hmean     send-128        364.68 (   0.00%)      363.12 (  -0.43%)
> Hmean     send-256        715.38 (   0.00%)      716.57 (   0.17%)
> Hmean     send-1024      2764.76 (   0.00%)     2779.17 (   0.52%)
> Hmean     send-2048      5282.93 (   0.00%)     5220.41 *  -1.18%*
> Hmean     send-3312      8282.26 (   0.00%)     8121.78 *  -1.94%*
> Hmean     send-4096     10108.12 (   0.00%)    10042.98 (  -0.64%)
> Hmean     send-8192     16868.49 (   0.00%)    16826.99 (  -0.25%)
> Hmean     send-16384    26230.44 (   0.00%)    26271.85 (   0.16%)
> Hmean     recv-64         183.09 (   0.00%)      182.28 (  -0.44%)
> Hmean     recv-128        364.68 (   0.00%)      363.12 (  -0.43%)
> Hmean     recv-256        715.38 (   0.00%)      716.57 (   0.17%)
> Hmean     recv-1024      2764.76 (   0.00%)     2779.17 (   0.52%)
> Hmean     recv-2048      5282.93 (   0.00%)     5220.39 *  -1.18%*
> Hmean     recv-3312      8282.26 (   0.00%)     8121.78 *  -1.94%*
> Hmean     recv-4096     10108.12 (   0.00%)    10042.97 (  -0.64%)
> Hmean     recv-8192     16868.47 (   0.00%)    16826.93 (  -0.25%)
> Hmean     recv-16384    26230.44 (   0.00%)    26271.75 (   0.16%)
> 
> The overhead this feature added to the scheduler can be unfriendly
> to the fast context-switching workloads like netperf/tbench. But
> the test result seems fine.
> 
> netperf-tcp
> 			       vanilla		       filter
> Hmean     64         863.35 (   0.00%)     1176.11 *  36.23%*
> Hmean     128       1674.32 (   0.00%)     2223.37 *  32.79%*
> Hmean     256       3151.03 (   0.00%)     4109.64 *  30.42%*
> Hmean     1024     10281.94 (   0.00%)    12799.28 *  24.48%*
> Hmean     2048     16906.05 (   0.00%)    20129.91 *  19.07%*
> Hmean     3312     21246.21 (   0.00%)    24747.24 *  16.48%*
> Hmean     4096     23690.57 (   0.00%)    26596.35 *  12.27%*
> Hmean     8192     28758.29 (   0.00%)    30423.10 *   5.79%*
> Hmean     16384    33071.06 (   0.00%)    34262.39 *   3.60%*
> 
> The suspicious improvement (and regression in tbench4-128) needs
> further digging..
> 
> tbench4 Throughput
> 			     vanilla		     filter
> Hmean     1        293.71 (   0.00%)      298.89 *   1.76%*
> Hmean     2        583.25 (   0.00%)      596.00 *   2.19%*
> Hmean     4       1162.40 (   0.00%)     1176.73 *   1.23%*
> Hmean     8       2309.28 (   0.00%)     2332.89 *   1.02%*
> Hmean     16      4517.23 (   0.00%)     4587.60 *   1.56%*
> Hmean     32      7458.54 (   0.00%)     7550.19 *   1.23%*
> Hmean     64      9041.62 (   0.00%)     9192.69 *   1.67%*
> Hmean     128    19983.62 (   0.00%)    12228.91 * -38.81%*
> Hmean     256    20054.12 (   0.00%)    20997.33 *   4.70%*
> Hmean     384    19137.11 (   0.00%)    20331.14 *   6.24%*

Following are the results from testing on a dual socket Zen3 system
(2 x 64C/128T) in different NPS modes.

Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over the two sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over the two sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255

Kernel versions:
- tip:      		5.18-rc1 tip sched/core
- Filter Overloaded:    5.18-rc1 tip sched/core + this patch

When we began testing, we recorded the tip at:

commit: a658353167bf "sched/fair: Revise comment about lb decision matrix"

Following are the results from the benchmark:

Note: Results marked with * are data points of concern. A rerun
for the data point has been provided on both the tip and the
patched kernel to check for any run to run variation.

~~~~~~~~~
hackbench
~~~~~~~~~

NPS1

Test:                   tip               Filter Overloaded
 1-groups:         4.64 (0.00 pct)         4.74 (-2.15 pct)
 2-groups:         5.38 (0.00 pct)         5.55 (-3.15 pct)
 4-groups:         6.15 (0.00 pct)         6.20 (-0.81 pct)
 8-groups:         7.42 (0.00 pct)         7.47 (-0.67 pct)
16-groups:        10.70 (0.00 pct)        10.60 (0.93 pct)

NPS2

Test:                   tip               Filter Overloaded
 1-groups:         4.70 (0.00 pct)         4.68 (0.42 pct)
 2-groups:         5.45 (0.00 pct)         5.46 (-0.18 pct)
 4-groups:         6.13 (0.00 pct)         6.11 (0.32 pct)
 8-groups:         7.30 (0.00 pct)         7.23 (0.95 pct)
16-groups:        10.30 (0.00 pct)        10.38 (-0.77 pct)

NPS4

Test:                   tip               Filter Overloaded
 1-groups:         4.60 (0.00 pct)         4.66 (-1.30 pct)
 2-groups:         5.41 (0.00 pct)         5.53 (-2.21 pct)
 4-groups:         6.12 (0.00 pct)         6.16 (-0.65 pct)
 8-groups:         7.22 (0.00 pct)         7.28 (-0.83 pct)
16-groups:        10.24 (0.00 pct)        10.26 (-0.19 pct)

~~~~~~~~
schbench
~~~~~~~~

NPS1

#workers:      tip                Filter Overloaded
  1:      29.00 (0.00 pct)        29.00 (0.00 pct)
  2:      28.00 (0.00 pct)        27.00 (3.57 pct)
  4:      31.50 (0.00 pct)        33.00 (-4.76 pct)
  8:      42.00 (0.00 pct)        42.50 (-1.19 pct)
 16:      56.50 (0.00 pct)        56.00 (0.88 pct)
 32:      94.50 (0.00 pct)        96.50 (-2.11 pct)
 64:     176.00 (0.00 pct)       178.50 (-1.42 pct)
128:     404.00 (0.00 pct)       418.00 (-3.46 pct)
256:     869.00 (0.00 pct)       900.00 (-3.56 pct)
512:     58432.00 (0.00 pct)     56256.00 (3.72 pct)

NPS2

#workers:      tip                Filter Overloaded
  1:      26.50 (0.00 pct)        14.00 (47.16 pct)
  2:      26.50 (0.00 pct)        14.50 (45.28 pct)
  4:      34.50 (0.00 pct)        18.00 (47.82 pct)
  8:      45.00 (0.00 pct)        30.50 (32.22 pct)
 16:      56.50 (0.00 pct)        57.00 (-0.88 pct)
 32:      95.50 (0.00 pct)        94.00 (1.57 pct)
 64:     179.00 (0.00 pct)       176.00 (1.67 pct)
128:     369.00 (0.00 pct)       411.00 (-11.38 pct)    *
128:     400.60 (0.00 pct)       412.90 (-3.07 pct)     [Verification Run]
256:     898.00 (0.00 pct)       850.00 (5.34 pct)
512:     56256.00 (0.00 pct)     59456.00 (-5.68 pct)

NPS4

#workers:      tip                Filter Overloaded
  1:      25.00 (0.00 pct)        24.50 (2.00 pct)
  2:      28.00 (0.00 pct)        24.00 (14.28 pct)
  4:      29.50 (0.00 pct)        28.50 (3.38 pct)
  8:      41.00 (0.00 pct)        36.50 (10.97 pct)
 16:      65.50 (0.00 pct)        59.00 (9.92 pct)
 32:      93.00 (0.00 pct)        95.50 (-2.68 pct)
 64:     170.50 (0.00 pct)       182.00 (-6.74 pct)     *
 64:     175.00 (0.00 pct)       176.00 (-0.57 pct)     [Verification Run]
128:     377.00 (0.00 pct)       409.50 (-8.62 pct)     *
128:     372.00 (0.00 pct)       401.00 (-7.79 pct)     [Verification Run]
256:     867.00 (0.00 pct)       940.00 (-8.41 pct)     *
256:     925.00 (0.00 pct)       892.00 (+3.45 pct)     [Verification Run]
512:     58048.00 (0.00 pct)     59456.00 (-2.42 pct)

~~~~~~
tbench
~~~~~~

NPS1

Clients:      tip                Filter Overloaded
    1    443.31 (0.00 pct)       466.32 (5.19 pct)
    2    877.32 (0.00 pct)       891.87 (1.65 pct)
    4    1665.11 (0.00 pct)      1727.98 (3.77 pct)
    8    3016.68 (0.00 pct)      3125.12 (3.59 pct)
   16    5374.30 (0.00 pct)      5414.02 (0.73 pct)
   32    8763.86 (0.00 pct)      8599.72 (-1.87 pct)
   64    15786.93 (0.00 pct)     14095.47 (-10.71 pct)  *
   64    15441.33 (0.00 pct)     15148.00 (-1.89 pct)   [Verification Run]
  128    26826.08 (0.00 pct)     27837.07 (3.76 pct)
  256    24207.35 (0.00 pct)     23769.48 (-1.80 pct)
  512    51740.58 (0.00 pct)     53369.28 (3.14 pct)
 1024    51177.82 (0.00 pct)     51928.06 (1.46 pct)

NPS2

Clients:      tip                Filter Overloaded
    1    449.49 (0.00 pct)       464.65 (3.37 pct)
    2    867.28 (0.00 pct)       898.85 (3.64 pct)
    4    1643.60 (0.00 pct)      1691.53 (2.91 pct)
    8    3047.35 (0.00 pct)      3010.65 (-1.20 pct)
   16    5340.77 (0.00 pct)      5242.42 (-1.84 pct)
   32    10536.85 (0.00 pct)     8978.74 (-14.78 pct)   *
   32    10417.46 (0.00 pct)     8375.55 (-19.60 pct)   [Verification Run]
   64    16543.23 (0.00 pct)     15357.35 (-7.16 pct)   *
   64    17101.56 (0.00 pct)     15465.73 (-9.56 pct)   [Verification Run]
  128    26400.40 (0.00 pct)     26637.87 (0.89 pct)
  256    23436.75 (0.00 pct)     24324.08 (3.78 pct)
  512    50902.60 (0.00 pct)     49159.14 (-3.42 pct)
 1024    50216.10 (0.00 pct)     50218.74 (0.00 pct)

NPS4

Clients:      tip                Filter Overloaded
    1    443.82 (0.00 pct)       458.65 (3.34 pct)
    2    849.14 (0.00 pct)       883.79 (4.08 pct)
    4    1603.26 (0.00 pct)      1658.89 (3.46 pct)
    8    2972.37 (0.00 pct)      3087.76 (3.88 pct)
   16    5277.13 (0.00 pct)      5472.11 (3.69 pct)
   32    9744.73 (0.00 pct)      9666.67 (-0.80 pct)
   64    15854.80 (0.00 pct)     13778.51 (-13.09 pct)  *
   64    15732.56 (0.00 pct)     14397.83 (-8.48 pct)   [Verification Run]
  128    26116.97 (0.00 pct)     25966.86 (-0.57 pct)
  256    22403.25 (0.00 pct)     22634.04 (1.03 pct)
  512    48317.20 (0.00 pct)     47123.73 (-2.47 pct)
 1024    50445.41 (0.00 pct)     48934.56 (-2.99 pct)

Note: tbench results for 256 workers are known to have
a great amount of run to run variation on the test
machine. Any regression seen for the data point can
be safely ignored.

~~~~~~
Stream
~~~~~~

- 10 runs

NPS1

Test:         tip                  Filter Overloaded
 Copy:   189113.11 (0.00 pct)    184006.43 (-2.70 pct)
Scale:   201190.61 (0.00 pct)    197663.80 (-1.75 pct)
  Add:   232654.21 (0.00 pct)    223226.88 (-4.05 pct)
Triad:   226583.57 (0.00 pct)    218920.27 (-3.38 pct)

NPS2

Test:          tip                 Filter Overloaded
 Copy:   155347.14 (0.00 pct)    150710.93 (-2.98 pct)
Scale:   191701.53 (0.00 pct)    181436.61 (-5.35 pct)
  Add:   210013.97 (0.00 pct)    200397.89 (-4.57 pct)
Triad:   207602.00 (0.00 pct)    198247.25 (-4.50 pct)

NPS4

Test:          tip                 Filter Overloaded
 Copy:   136421.15 (0.00 pct)    137608.05 (0.87 pct)
Scale:   191217.59 (0.00 pct)    189674.77 (-0.80 pct)
  Add:   189229.52 (0.00 pct)    188682.22 (-0.28 pct)
Triad:   188052.99 (0.00 pct)    188946.75 (0.47 pct)

- 100 runs

NPS1

Test:          tip                 Filter Overloaded
 Copy:   244693.32 (0.00 pct)    235089.48 (-3.92 pct)
Scale:   221874.99 (0.00 pct)    217716.94 (-1.87 pct)
  Add:   268363.89 (0.00 pct)    266529.22 (-0.68 pct)
Triad:   260945.24 (0.00 pct)    252780.93 (-3.12 pct)

NPS2

Test:          tip                 Filter Overloaded
 Copy:   211262.00 (0.00 pct)    240922.15 (14.03 pct)
Scale:   222493.34 (0.00 pct)    220122.09 (-1.06 pct)
  Add:   280277.17 (0.00 pct)    278002.19 (-0.81 pct)
Triad:   265860.49 (0.00 pct)    264231.43 (-0.61 pct)

NPS4

Test:          tip                 Filter Overloaded
 Copy:   250171.40 (0.00 pct)    243512.01 (-2.66 pct)
Scale:   222293.56 (0.00 pct)    224911.55 (1.17 pct)
  Add:   279222.16 (0.00 pct)    280700.91 (0.52 pct)
Triad:   262013.92 (0.00 pct)    265729.44 (1.41 pct)

~~~~~~~~~~~~
ycsb-mongodb
~~~~~~~~~~~~

NPS1

sched-tip:      303718.33 (var: 1.31)
NUMA Bal:       309396.00 (var: 1.24)    (+1.83%)

NPS2

sched-tip:      304536.33 (var: 2.46)
NUMA Bal:       305209.00 (var: 0.80)    (+0.22%)

NPS4

sched-tip:      301192.33 (var: 1.81)
NUMA Bal:       304248.00 (var: 2.05)    (+1.00%)

~~~~~
Notes
~~~~~

- schbench shows regression for 128 workers in NPS4 mode.
  The rerun shows stable results for both tip and patched
  kernel.
- tbench shows regression for 64 workers in NPS2 and NPS4
  mode. In NPS2 mode, the tip shows some run-to-run variance;
  however, the median of the 10 runs reported is lower for the
  patched kernel.
- tbench shows regression for 32 workers in NPS2 mode. The
  tip seems to report stable values here but there is a
  slight run to run variation observed in the patched kernel.

- Overall, the performance is comparable to that of the tip.
- The schbench improvements at low worker counts have the load
  balancer coming into the picture. We still have to do deeper
  analysis to see if and how this patch is helping.

I haven't run the mmtests as a part of this testing. I've made
a note of the configs you ran the numbers for. I'll try to
get numbers for the same.

> 
> v3:
>   - removed sched-idle balance feature and focus on SIS
>   - take non-CFS tasks into consideration
>   - several fixes/improvement suggested by Josh Don
> 
> v2:
>   - several optimizations on sched-idle balancing
>   - ignore asym topos in can_migrate_task
>   - add more benchmarks including SIS efficiency
>   - re-organize patch as suggested by Mel
> 
> v1: https://lore.kernel.org/lkml/20220217154403.6497-5-wuyun.abel@bytedance.com/
> v2: https://lore.kernel.org/lkml/20220409135104.3733193-1-wuyun.abel@bytedance.com/
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>  include/linux/sched/topology.h | 12 ++++++++++
>  kernel/sched/core.c            | 38 ++++++++++++++++++++++++++++++
>  kernel/sched/fair.c            | 43 +++++++++++++++++++++++++++-------
>  kernel/sched/idle.c            |  1 +
>  kernel/sched/sched.h           |  4 ++++
>  kernel/sched/topology.c        |  4 +++-
>  6 files changed, 92 insertions(+), 10 deletions(-)
> 
> [..snip..]

Let me know if there is some additional data you would like
me to gather on the test machine.
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS
  2022-05-20  6:48 ` K Prateek Nayak
@ 2022-05-20  7:43   ` Abel Wu
  0 siblings, 0 replies; 12+ messages in thread
From: Abel Wu @ 2022-05-20  7:43 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don
  Cc: linux-kernel, Abel Wu

Hi Prateek, thanks very much for your test!

On 5/20/22 2:48 PM, K Prateek Nayak Wrote:
> Hello Abel,
> 
> We tested this patch on our systems.
> 
> tl;dr
> 
> - We observed some regressions in schbench with 128 workers in
>    NPS4 mode.
> - tbench shows regression for 32 workers in NPS2 mode and 64 workers
>    in NPS2 and NPS4 mode.
> - Great improvements in schbench for low worker count.
> - Overall, the performance seems to be comparable to the tip.
> 
> Following are the results from testing on a dual socket Zen3 system
> (2 x 64C/128T) in different NPS modes.
> 
> Following is the NUMA configuration for each NPS mode on the system:
> 
> NPS1: Each socket is a NUMA node.
>      Total 2 NUMA nodes in the dual socket machine.
> 
>      Node 0: 0-63,   128-191
>      Node 1: 64-127, 192-255
> 
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>      Total 4 NUMA nodes exist over the two sockets.
>     
>      Node 0: 0-31,   128-159
>      Node 1: 32-63,  160-191
>      Node 2: 64-95,  192-223
>      Node 3: 96-127, 224-255
> 
> NPS4: Each socket is logically divided into 4 NUMA regions.
>      Total 8 NUMA nodes exist over the two sockets.
>     
>      Node 0: 0-15,    128-143
>      Node 1: 16-31,   144-159
>      Node 2: 32-47,   160-175
>      Node 3: 48-63,   176-191
>      Node 4: 64-79,   192-207
>      Node 5: 80-95,   208-223
>      Node 6: 96-111,  224-239
>      Node 7: 112-127, 240-255

I remember you replied in Chen's patch thread that the number of CPUs
in the LLC domain is always 16 irrespective of the NPS mode. This
reminds me of something that Josh was previously concerned about. The
piece of code below in SIS may hurt the performance badly:

	if (weight - nro < (nr >> 4) && !has_idle_core)
		return -1;

where nr is no more than the LLC weight, which is 16 in your test. So
this condition effectively never holds, and the following cpumask_andnot
on the LLC-shared sds->overloaded_cpus[] will almost always run, causing
heavier cache traffic that negates much of the benefit it brings.
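
To spell out the arithmetic (taking weight == 16 for this topology,
nr <= 16 as above, and nro as the count of overloaded cpus, so
nro <= 16):

	nr == 16:  16 - nro < (16 >> 4), i.e. 16 - nro < 1  =>  needs nro >= 16
	nr  < 16:  16 - nro < (nr >> 4), i.e. 16 - nro < 0  =>  cannot happen

so the early return is only taken when every CPU in the LLC is marked
overloaded, which in practice means it almost never fires here.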

The early bail-out needs to be carefully re-designed, and luckily Chen
is working on that. Besides, I also have some major improvements to
this SIS filter under testing, and I will post v4 in one or two weeks.

Thanks again for your test & best regards,
Abel

> 
> Kernel versions:
> - tip:      		5.18-rc1 tip sched/core
> - Filter Overloaded:    5.18-rc1 tip sched/core + this patch
> 
> When we began testing, we recorded the tip at:
> 
> commit: a658353167bf "sched/fair: Revise comment about lb decision matrix"
> 
> Following are the results from the benchmark:
> 
> Note: Results marked with * are data points of concern. A rerun
> for the data point has been provided on both the tip and the
> patched kernel to check for any run to run variation.
> 
> ~~~~~~~~~
> hackbench
> ~~~~~~~~~
> 
> NPS1
> 
> Test:                   tip               Filter Overloaded
>   1-groups:         4.64 (0.00 pct)         4.74 (-2.15 pct)
>   2-groups:         5.38 (0.00 pct)         5.55 (-3.15 pct)
>   4-groups:         6.15 (0.00 pct)         6.20 (-0.81 pct)
>   8-groups:         7.42 (0.00 pct)         7.47 (-0.67 pct)
> 16-groups:        10.70 (0.00 pct)        10.60 (0.93 pct)
> 
> NPS2
> 
> Test:                   tip               Filter Overloaded
>   1-groups:         4.70 (0.00 pct)         4.68 (0.42 pct)
>   2-groups:         5.45 (0.00 pct)         5.46 (-0.18 pct)
>   4-groups:         6.13 (0.00 pct)         6.11 (0.32 pct)
>   8-groups:         7.30 (0.00 pct)         7.23 (0.95 pct)
> 16-groups:        10.30 (0.00 pct)        10.38 (-0.77 pct)
> 
> NPS4
> 
> Test:                   tip               Filter Overloaded
>   1-groups:         4.60 (0.00 pct)         4.66 (-1.30 pct)
>   2-groups:         5.41 (0.00 pct)         5.53 (-2.21 pct)
>   4-groups:         6.12 (0.00 pct)         6.16 (-0.65 pct)
>   8-groups:         7.22 (0.00 pct)         7.28 (-0.83 pct)
> 16-groups:        10.24 (0.00 pct)        10.26 (-0.19 pct)
> 
> ~~~~~~~~
> schbench
> ~~~~~~~~
> 
> NPS1
> 
> #workers:      tip                Filter Overloaded
>    1:      29.00 (0.00 pct)        29.00 (0.00 pct)
>    2:      28.00 (0.00 pct)        27.00 (3.57 pct)
>    4:      31.50 (0.00 pct)        33.00 (-4.76 pct)
>    8:      42.00 (0.00 pct)        42.50 (-1.19 pct)
>   16:      56.50 (0.00 pct)        56.00 (0.88 pct)
>   32:      94.50 (0.00 pct)        96.50 (-2.11 pct)
>   64:     176.00 (0.00 pct)       178.50 (-1.42 pct)
> 128:     404.00 (0.00 pct)       418.00 (-3.46 pct)
> 256:     869.00 (0.00 pct)       900.00 (-3.56 pct)
> 512:     58432.00 (0.00 pct)     56256.00 (3.72 pct)
> 
> NPS2
> 
> #workers:      tip                Filter Overloaded
>    1:      26.50 (0.00 pct)        14.00 (47.16 pct)
>    2:      26.50 (0.00 pct)        14.50 (45.28 pct)
>    4:      34.50 (0.00 pct)        18.00 (47.82 pct)
>    8:      45.00 (0.00 pct)        30.50 (32.22 pct)
>   16:      56.50 (0.00 pct)        57.00 (-0.88 pct)
>   32:      95.50 (0.00 pct)        94.00 (1.57 pct)
>   64:     179.00 (0.00 pct)       176.00 (1.67 pct)
> 128:     369.00 (0.00 pct)       411.00 (-11.38 pct)    *
> 128:     400.60 (0.00 pct)       412.90 (-3.07 pct)     [Verification Run]
> 256:     898.00 (0.00 pct)       850.00 (5.34 pct)
> 512:     56256.00 (0.00 pct)     59456.00 (-5.68 pct)
> 
> NPS4
> 
> #workers:      tip                Filter Overloaded
>    1:      25.00 (0.00 pct)        24.50 (2.00 pct)
>    2:      28.00 (0.00 pct)        24.00 (14.28 pct)
>    4:      29.50 (0.00 pct)        28.50 (3.38 pct)
>    8:      41.00 (0.00 pct)        36.50 (10.97 pct)
>   16:      65.50 (0.00 pct)        59.00 (9.92 pct)
>   32:      93.00 (0.00 pct)        95.50 (-2.68 pct)
>   64:     170.50 (0.00 pct)       182.00 (-6.74 pct)     *
>   64:     175.00 (0.00 pct)       176.00 (-0.57 pct)     [Verification Run]
> 128:     377.00 (0.00 pct)       409.50 (-8.62 pct)     *
> 128:     372.00 (0.00 pct)       401.00 (-7.79 pct)     [Verification Run]
> 256:     867.00 (0.00 pct)       940.00 (-8.41 pct)     *
> 256:     925.00 (0.00 pct)       892.00 (+3.45 pct)     [Verification Run]
> 512:     58048.00 (0.00 pct)     59456.00 (-2.42 pct)
> 
> ~~~~~~
> tbench
> ~~~~~~
> 
> NPS1
> 
> Clients:      tip                Filter Overloaded
>      1    443.31 (0.00 pct)       466.32 (5.19 pct)
>      2    877.32 (0.00 pct)       891.87 (1.65 pct)
>      4    1665.11 (0.00 pct)      1727.98 (3.77 pct)
>      8    3016.68 (0.00 pct)      3125.12 (3.59 pct)
>     16    5374.30 (0.00 pct)      5414.02 (0.73 pct)
>     32    8763.86 (0.00 pct)      8599.72 (-1.87 pct)
>     64    15786.93 (0.00 pct)     14095.47 (-10.71 pct)  *
>     64    15441.33 (0.00 pct)     15148.00 (-1.89 pct)   [Verification Run]
>    128    26826.08 (0.00 pct)     27837.07 (3.76 pct)
>    256    24207.35 (0.00 pct)     23769.48 (-1.80 pct)
>    512    51740.58 (0.00 pct)     53369.28 (3.14 pct)
>   1024    51177.82 (0.00 pct)     51928.06 (1.46 pct)
> 
> NPS2
> 
> Clients:      tip                Filter Overloaded
>      1    449.49 (0.00 pct)       464.65 (3.37 pct)
>      2    867.28 (0.00 pct)       898.85 (3.64 pct)
>      4    1643.60 (0.00 pct)      1691.53 (2.91 pct)
>      8    3047.35 (0.00 pct)      3010.65 (-1.20 pct)
>     16    5340.77 (0.00 pct)      5242.42 (-1.84 pct)
>     32    10536.85 (0.00 pct)     8978.74 (-14.78 pct)   *
>     32    10417.46 (0.00 pct)     8375.55 (-19.60 pct)   [Verification Run]
>     64    16543.23 (0.00 pct)     15357.35 (-7.16 pct)   *
>     64    17101.56 (0.00 pct)     15465.73 (-9.56 pct)   [Verification Run]
>    128    26400.40 (0.00 pct)     26637.87 (0.89 pct)
>    256    23436.75 (0.00 pct)     24324.08 (3.78 pct)
>    512    50902.60 (0.00 pct)     49159.14 (-3.42 pct)
>   1024    50216.10 (0.00 pct)     50218.74 (0.00 pct)
> 
> NPS4
> 
> Clients:      tip                Filter Overloaded
>      1    443.82 (0.00 pct)       458.65 (3.34 pct)
>      2    849.14 (0.00 pct)       883.79 (4.08 pct)
>      4    1603.26 (0.00 pct)      1658.89 (3.46 pct)
>      8    2972.37 (0.00 pct)      3087.76 (3.88 pct)
>     16    5277.13 (0.00 pct)      5472.11 (3.69 pct)
>     32    9744.73 (0.00 pct)      9666.67 (-0.80 pct)
>     64    15854.80 (0.00 pct)     13778.51 (-13.09 pct)  *
>     64    15732.56 (0.00 pct)     14397.83 (-8.48 pct)   [Verification Run]
>    128    26116.97 (0.00 pct)     25966.86 (-0.57 pct)
>    256    22403.25 (0.00 pct)     22634.04 (1.03 pct)
>    512    48317.20 (0.00 pct)     47123.73 (-2.47 pct)
>   1024    50445.41 (0.00 pct)     48934.56 (-2.99 pct)
> 
> Note: tbench results for 256 workers are known to have
> a great amount of run to run variation on the test
> machine. Any regression seen for the data point can
> be safely ignored.
> 
> ~~~~~~
> Stream
> ~~~~~~
> 
> - 10 runs
> 
> NPS1
> 
> Test:         tip                  Filter Overloaded
>   Copy:   189113.11 (0.00 pct)    184006.43 (-2.70 pct)
> Scale:   201190.61 (0.00 pct)    197663.80 (-1.75 pct)
>    Add:   232654.21 (0.00 pct)    223226.88 (-4.05 pct)
> Triad:   226583.57 (0.00 pct)    218920.27 (-3.38 pct)
> 
> NPS2
> 
> Test:          tip                 Filter Overloaded
>   Copy:   155347.14 (0.00 pct)    150710.93 (-2.98 pct)
> Scale:   191701.53 (0.00 pct)    181436.61 (-5.35 pct)
>    Add:   210013.97 (0.00 pct)    200397.89 (-4.57 pct)
> Triad:   207602.00 (0.00 pct)    198247.25 (-4.50 pct)
> 
> NPS4
> 
> Test:          tip                 Filter Overloaded
>   Copy:   136421.15 (0.00 pct)    137608.05 (0.87 pct)
> Scale:   191217.59 (0.00 pct)    189674.77 (-0.80 pct)
>    Add:   189229.52 (0.00 pct)    188682.22 (-0.28 pct)
> Triad:   188052.99 (0.00 pct)    188946.75 (0.47 pct)
> 
> - 100 runs
> 
> NPS1
> 
> Test:          tip                 Filter Overloaded
>   Copy:   244693.32 (0.00 pct)    235089.48 (-3.92 pct)
> Scale:   221874.99 (0.00 pct)    217716.94 (-1.87 pct)
>    Add:   268363.89 (0.00 pct)    266529.22 (-0.68 pct)
> Triad:   260945.24 (0.00 pct)    252780.93 (-3.12 pct)
> 
> NPS2
> 
> Test:          tip                 Filter Overloaded
>   Copy:   211262.00 (0.00 pct)    240922.15 (14.03 pct)
> Scale:   222493.34 (0.00 pct)    220122.09 (-1.06 pct)
>    Add:   280277.17 (0.00 pct)    278002.19 (-0.81 pct)
> Triad:   265860.49 (0.00 pct)    264231.43 (-0.61 pct)
> 
> NPS4
> 
> Test:          tip                 Filter Overloaded
>   Copy:   250171.40 (0.00 pct)    243512.01 (-2.66 pct)
> Scale:   222293.56 (0.00 pct)    224911.55 (1.17 pct)
>    Add:   279222.16 (0.00 pct)    280700.91 (0.52 pct)
> Triad:   262013.92 (0.00 pct)    265729.44 (1.41 pct)
> 
> ~~~~~~~~~~~~
> ycsb-mongodb
> ~~~~~~~~~~~~
> 
> NPS1
> 
> sched-tip:      303718.33 (var: 1.31)
> NUMA Bal:       309396.00 (var: 1.24)    (+1.83%)
> 
> NPS2
> 
> sched-tip:      304536.33 (var: 2.46)
> NUMA Bal:       305209.00 (var: 0.80)    (+0.22%)
> 
> NPS4
> 
> sched-tip:      301192.33 (var: 1.81)
> NUMA Bal:       304248.00 (var: 2.05)    (+1.00%)
> 
> ~~~~~
> Notes
> ~~~~~
> 
> - schbench shows regression for 128 workers in NPS4 mode.
>    The rerun shows stable results for both tip and patched
>    kernel.
> - tbench shows regression for 64 workers in NPS2 and NPS4
>    mode. In NPS2 mode, the tip shows some run-to-run variance;
>    however, the median of the 10 runs reported is lower for the
>    patched kernel.
> - tbench shows regression for 32 workers in NPS2 mode. The
>    tip seems to report stable values here but there is a
>    slight run to run variation observed in the patched kernel.
> 
> - Overall, the performance is comparable to that of the tip.
> - The schbench improvements at low worker counts have the load
>    balancer coming into the picture. We still have to do deeper
>    analysis to see if and how this patch is helping.
> 
> I haven't run the mmtests as a part of this testing. I've made
> a note of the configs you ran the numbers for. I'll try to
> get numbers for the same.
> 
>>
>> v3:
>>    - removed sched-idle balance feature and focus on SIS
>>    - take non-CFS tasks into consideration
>>    - several fixes/improvement suggested by Josh Don
>>
>> v2:
>>    - several optimizations on sched-idle balancing
>>    - ignore asym topos in can_migrate_task
>>    - add more benchmarks including SIS efficiency
>>    - re-organize patch as suggested by Mel
>>
>> v1: https://lore.kernel.org/lkml/20220217154403.6497-5-wuyun.abel@bytedance.com/
>> v2: https://lore.kernel.org/lkml/20220409135104.3733193-1-wuyun.abel@bytedance.com/
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>> ---
>>   include/linux/sched/topology.h | 12 ++++++++++
>>   kernel/sched/core.c            | 38 ++++++++++++++++++++++++++++++
>>   kernel/sched/fair.c            | 43 +++++++++++++++++++++++++++-------
>>   kernel/sched/idle.c            |  1 +
>>   kernel/sched/sched.h           |  4 ++++
>>   kernel/sched/topology.c        |  4 +++-
>>   6 files changed, 92 insertions(+), 10 deletions(-)
>>
>> [..snip..]
> 
> Let me know if there is some additional data you would like
> me to gather on the test machine.
> --
> Thanks and Regards,
> Prateek

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-05-20  7:43 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-05 12:23 [PATCH v3] sched/fair: filter out overloaded cpus in SIS Abel Wu
2022-05-07 16:09 ` Chen Yu
2022-05-07 17:50   ` Abel Wu
2022-05-09 15:21     ` Chen Yu
2022-05-09 15:31       ` Chen Yu
2022-05-10  2:55       ` Abel Wu
2022-05-10  1:14 ` Josh Don
2022-05-10  8:03   ` Abel Wu
2022-05-19 22:16 ` Tim Chen
2022-05-20  2:48   ` Abel Wu
2022-05-20  6:48 ` K Prateek Nayak
2022-05-20  7:43   ` Abel Wu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).