* [RFC][PATCH 0/5] select_idle_sibling() wreckage
@ 2020-12-14 16:48 Peter Zijlstra
  2020-12-14 16:48 ` [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting Peter Zijlstra
                   ` (5 more replies)
  0 siblings, 6 replies; 31+ messages in thread
From: Peter Zijlstra @ 2020-12-14 16:48 UTC (permalink / raw)
  To: mgorman, vincent.guittot
  Cc: peterz, linux-kernel, aubrey.li, mingo, juri.lelli,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, benbjiang


Hai, here them patches Mel asked for. They've not (yet) been through the
robots, so there might be some build fail for configs I've not used.

Benchmark time :-)

---
 include/linux/sched/topology.h |    1
 kernel/sched/core.c            |   19 +++-
 kernel/sched/fair.c            |  171 +++++++++++------------------------------
 kernel/sched/idle.c            |    1
 kernel/sched/sched.h           |   13 ---
 5 files changed, 63 insertions(+), 142 deletions(-)



* [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2020-12-14 16:48 [RFC][PATCH 0/5] select_idle_sibling() wreckage Peter Zijlstra
@ 2020-12-14 16:48 ` Peter Zijlstra
  2020-12-15  3:36   ` Li, Aubrey
  2020-12-14 16:48 ` [RFC][PATCH 2/5] sched/fair: Make select_idle_cpu() proportional to cores Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2020-12-14 16:48 UTC (permalink / raw)
  To: mgorman, vincent.guittot
  Cc: peterz, linux-kernel, aubrey.li, mingo, juri.lelli,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, benbjiang

We compute the average cost of the total scan, but then use it as a
per-cpu scan cost when computing the scan proportion. Fix this by
properly computing a per-cpu scan cost.

This also fixes a bug where we would terminate early (!--nr, case) and
not account that cost at all.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6144,10 +6144,10 @@ static inline int select_idle_smt(struct
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	int cpu, loops = 1, nr = INT_MAX;
+	int this = smp_processor_id();
 	struct sched_domain *this_sd;
 	u64 time;
-	int this = smp_processor_id();
-	int cpu, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6175,14 +6175,19 @@ static int select_idle_cpu(struct task_s
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
-		if (!--nr)
-			return -1;
 		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
 			break;
+
+		if (loops >= nr) {
+			cpu = -1;
+			break;
+		}
+		loops++;
 	}
 
 	if (sched_feat(SIS_PROP)) {
 		time = cpu_clock(this) - time;
+		time = div_u64(time, loops);
 		update_avg(&this_sd->avg_scan_cost, time);
 	}
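
As a rough standalone illustration of the accounting change above (plain
user-space C, not kernel code; update_avg() below mimics the kernel's
divide-by-8 running average and all numbers are invented):

#include <stdio.h>

/* avg += (sample - avg) / 8, as in the kernel helper */
static void update_avg(unsigned long long *avg, unsigned long long sample)
{
	long long diff = (long long)sample - (long long)*avg;

	*avg += diff / 8;
}

int main(void)
{
	unsigned long long old_avg = 500, new_avg = 500;	/* ns */
	unsigned long long scan_ns = 4000, loops = 8;		/* one example scan */

	update_avg(&old_avg, scan_ns);		/* before: charge the whole scan */
	update_avg(&new_avg, scan_ns / loops);	/* after: charge the per-cpu cost */

	printf("old avg_scan_cost: %llu ns\n", old_avg);	/* drifts towards 4000 */
	printf("new avg_scan_cost: %llu ns\n", new_avg);	/* stays near 500 */
	return 0;
}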
 




* [RFC][PATCH 2/5] sched/fair: Make select_idle_cpu() proportional to cores
  2020-12-14 16:48 [RFC][PATCH 0/5] select_idle_sibling() wreckage Peter Zijlstra
  2020-12-14 16:48 ` [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting Peter Zijlstra
@ 2020-12-14 16:48 ` Peter Zijlstra
  2020-12-23 13:31   ` Vincent Guittot
  2020-12-14 16:48 ` [RFC][PATCH 3/5] sched/fair: Remove select_idle_smt() Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2020-12-14 16:48 UTC (permalink / raw)
  To: mgorman, vincent.guittot
  Cc: peterz, linux-kernel, aubrey.li, mingo, juri.lelli,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, benbjiang

Instead of calculating how many (logical) CPUs to scan, compute how
many cores to scan.

This changes behaviour for anything !SMT2.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   19 ++++++++++++++-----
 kernel/sched/fair.c |   12 ++++++++++--
 2 files changed, 24 insertions(+), 7 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7454,11 +7454,20 @@ int sched_cpu_activate(unsigned int cpu)
 	balance_push_set(cpu, false);
 
 #ifdef CONFIG_SCHED_SMT
-	/*
-	 * When going up, increment the number of cores with SMT present.
-	 */
-	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
-		static_branch_inc_cpuslocked(&sched_smt_present);
+	do {
+		int weight = cpumask_weight(cpu_smt_mask(cpu));
+		extern int sched_smt_weight;
+
+		if (weight > sched_smt_weight)
+			sched_smt_weight = weight;
+
+		/*
+		 * When going up, increment the number of cores with SMT present.
+		 */
+		if (weight == 2)
+			static_branch_inc_cpuslocked(&sched_smt_present);
+
+	} while (0);
 #endif
 	set_cpu_active(cpu, true);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6010,6 +6010,8 @@ static inline int find_idlest_cpu(struct
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
+int sched_smt_weight = 1;
+
 static inline void set_idle_cores(int cpu, int val)
 {
 	struct sched_domain_shared *sds;
@@ -6124,6 +6126,8 @@ static int select_idle_smt(struct task_s
 
 #else /* CONFIG_SCHED_SMT */
 
+#define sched_smt_weight	1
+
 static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	return -1;
@@ -6136,6 +6140,8 @@ static inline int select_idle_smt(struct
 
 #endif /* CONFIG_SCHED_SMT */
 
+#define sis_min_cores		2
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6166,10 +6172,12 @@ static int select_idle_cpu(struct task_s
 		avg_cost = this_sd->avg_scan_cost + 1;
 
 		span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
+		if (span_avg > sis_min_cores * avg_cost)
 			nr = div_u64(span_avg, avg_cost);
 		else
-			nr = 4;
+			nr = sis_min_cores;
+
+		nr *= sched_smt_weight;
 
 		time = cpu_clock(this);
 	}
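
Pulled out of the hunk above, the scan depth is now computed in cores and
scaled back to logical CPUs at the end; a minimal standalone sketch (the
function name and parameters are invented, and avg_idle/avg_cost stand for
the already-adjusted values rq->avg_idle/512 and avg_scan_cost+1):

#include <stdio.h>

/* Illustration only: how many logical CPUs select_idle_cpu() may scan. */
static unsigned long long sis_scan_depth(unsigned long long span_weight,
					 unsigned long long avg_idle,
					 unsigned long long avg_cost,
					 unsigned long long smt_weight)
{
	unsigned long long span_avg = span_weight * avg_idle;
	unsigned long long nr;

	if (span_avg > 2 * avg_cost)	/* sis_min_cores == 2 */
		nr = span_avg / avg_cost;
	else
		nr = 2;

	return nr * smt_weight;		/* cores -> logical CPUs */
}

int main(void)
{
	/* e.g. a 48-CPU LLC with SMT2, fuzzed avg_idle 100, avg_cost 400 */
	printf("nr = %llu\n", sis_scan_depth(48, 100, 400, 2));	/* 24 */
	return 0;
}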




* [RFC][PATCH 3/5] sched/fair: Remove select_idle_smt()
  2020-12-14 16:48 [RFC][PATCH 0/5] select_idle_sibling() wreckage Peter Zijlstra
  2020-12-14 16:48 ` [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting Peter Zijlstra
  2020-12-14 16:48 ` [RFC][PATCH 2/5] sched/fair: Make select_idle_cpu() proportional to cores Peter Zijlstra
@ 2020-12-14 16:48 ` Peter Zijlstra
  2020-12-14 16:48 ` [RFC][PATCH 4/5] sched/fair: Merge select_idle_core/cpu() Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2020-12-14 16:48 UTC (permalink / raw)
  To: mgorman, vincent.guittot
  Cc: peterz, linux-kernel, aubrey.li, mingo, juri.lelli,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, benbjiang

In order to make the next patch more readable, and to quantify the
actual effectiveness of this pass, start by removing it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   30 ------------------------------
 1 file changed, 30 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6103,27 +6103,6 @@ static int select_idle_core(struct task_
 	return -1;
 }
 
-/*
- * Scan the local SMT mask for idle CPUs.
- */
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	int cpu;
-
-	if (!static_branch_likely(&sched_smt_present))
-		return -1;
-
-	for_each_cpu(cpu, cpu_smt_mask(target)) {
-		if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
-		    !cpumask_test_cpu(cpu, sched_domain_span(sd)))
-			continue;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-			return cpu;
-	}
-
-	return -1;
-}
-
 #else /* CONFIG_SCHED_SMT */
 
 #define sched_smt_weight	1
@@ -6133,11 +6112,6 @@ static inline int select_idle_core(struc
 	return -1;
 }
 
-static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	return -1;
-}
-
 #endif /* CONFIG_SCHED_SMT */
 
 #define sis_min_cores		2
@@ -6334,10 +6308,6 @@ static int select_idle_sibling(struct ta
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
-	i = select_idle_smt(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
-
 	return target;
 }
 




* [RFC][PATCH 4/5] sched/fair: Merge select_idle_core/cpu()
  2020-12-14 16:48 [RFC][PATCH 0/5] select_idle_sibling() wreckage Peter Zijlstra
                   ` (2 preceding siblings ...)
  2020-12-14 16:48 ` [RFC][PATCH 3/5] sched/fair: Remove select_idle_smt() Peter Zijlstra
@ 2020-12-14 16:48 ` Peter Zijlstra
  2020-12-14 16:48 ` [RFC][PATCH 5/5] sched/fair: SIS_PROP the idle core scan Peter Zijlstra
  2020-12-16 12:59 ` [RFC][PATCH 0/5] select_idle_sibling() wreckage Li, Aubrey
  5 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2020-12-14 16:48 UTC (permalink / raw)
  To: mgorman, vincent.guittot
  Cc: peterz, linux-kernel, aubrey.li, mingo, juri.lelli,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, benbjiang

Both select_idle_core() and select_idle_cpu() do a loop over the same
cpumask. Observe that by clearing the already visited CPUs, we can
fold the iteration and iterate a core at a time.

All we need to do is remember any non-idle CPU we encountered while
scanning for an idle core. This way we'll only iterate every CPU once.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |  100 +++++++++++++++++++++++++++++++---------------------
 1 file changed, 60 insertions(+), 40 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6006,6 +6006,14 @@ static inline int find_idlest_cpu(struct
 	return new_cpu;
 }
 
+static inline int __select_idle_cpu(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
+{
+	if (available_idle_cpu(core) || sched_idle_cpu(core))
+		return core;
+
+	return -1;
+}
+
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
@@ -6066,40 +6074,34 @@ void __update_idle_core(struct rq *rq)
  * there are no idle cores left in the system; tracked through
  * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
  */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
-	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu;
+	bool idle = true;
+	int cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
-		return -1;
-
-	if (!test_idle_cores(target, false))
-		return -1;
+		return __select_idle_cpu(p, core, cpus, idle_cpu);
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-
-	for_each_cpu_wrap(core, cpus, target) {
-		bool idle = true;
-
-		for_each_cpu(cpu, cpu_smt_mask(core)) {
-			if (!available_idle_cpu(cpu)) {
-				idle = false;
-				break;
+	for_each_cpu(cpu, cpu_smt_mask(core)) {
+		if (!available_idle_cpu(cpu)) {
+			idle = false;
+			if (*idle_cpu == -1) {
+				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
+					*idle_cpu = cpu;
+					break;
+				}
+				continue;
 			}
+			break;
 		}
-
-		if (idle)
-			return core;
-
-		cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
+			*idle_cpu = cpu;
 	}
 
-	/*
-	 * Failed to find an idle core; stop looking for one.
-	 */
-	set_idle_cores(target, 0);
+	if (idle)
+		return core;
 
+	cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
 	return -1;
 }
 
@@ -6107,9 +6109,18 @@ static int select_idle_core(struct task_
 
 #define sched_smt_weight	1
 
-static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static inline void set_idle_cores(int cpu, int val)
 {
-	return -1;
+}
+
+static inline bool test_idle_cores(int cpu, bool def)
+{
+	return def;
+}
+
+static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
+{
+	return __select_idle_cpu(p, core, cpus, idle_cpu);
 }
 
 #endif /* CONFIG_SCHED_SMT */
@@ -6124,7 +6135,8 @@ static inline int select_idle_core(struc
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int cpu, loops = 1, nr = INT_MAX;
+	int i, cpu, idle_cpu = -1, loops = 1, nr = INT_MAX;
+	bool smt = test_idle_cores(target, false);
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
 	u64 time;
@@ -6135,7 +6147,7 @@ static int select_idle_cpu(struct task_s
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
-	if (sched_feat(SIS_PROP)) {
+	if (sched_feat(SIS_PROP) && !smt) {
 		u64 avg_cost, avg_idle, span_avg;
 
 		/*
@@ -6157,23 +6169,35 @@ static int select_idle_cpu(struct task_s
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-			break;
+		if (smt) {
+			i = select_idle_core(p, cpu, cpus, &idle_cpu);
+			if ((unsigned)i < nr_cpumask_bits)
+				return i;
+
+		} else {
+			i = __select_idle_cpu(p, cpu, cpus, &idle_cpu);
+			if ((unsigned)i < nr_cpumask_bits) {
+				idle_cpu = i;
+				break;
+			}
+		}
 
-		if (loops >= nr) {
-			cpu = -1;
+		if (loops >= nr)
 			break;
-		}
+
 		loops++;
 	}
 
-	if (sched_feat(SIS_PROP)) {
+	if (smt)
+		set_idle_cores(this, false);
+
+	if (sched_feat(SIS_PROP) && !smt) {
 		time = cpu_clock(this) - time;
 		time = div_u64(time, loops);
 		update_avg(&this_sd->avg_scan_cost, time);
 	}
 
-	return cpu;
+	return idle_cpu;
 }
 
 /*
@@ -6302,10 +6326,6 @@ static int select_idle_sibling(struct ta
 	if (!sd)
 		return target;
 
-	i = select_idle_core(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
-
 	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
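
The shape of the folded scan can be pictured with a toy user-space model
(not kernel code: 8 CPUs, 2 SMT siblings per core, a plain bitmask standing
in for the cpumask, and affinity/early-break details ignored):

#include <stdint.h>
#include <stdio.h>

#define NR_CPUS	8
#define SMT	2

int main(void)
{
	uint64_t busy = 0xB7;	/* 1011 0111: only CPUs 3 and 6 are idle */
	int idle_cpu = -1;

	/* Walk core by core; while checking a core for full idleness,
	 * remember the first idle sibling seen as a fallback. */
	for (int core = 0; core < NR_CPUS; core += SMT) {
		int core_idle = 1;

		for (int cpu = core; cpu < core + SMT; cpu++) {
			if (busy & (1ULL << cpu))
				core_idle = 0;
			else if (idle_cpu == -1)
				idle_cpu = cpu;
		}

		if (core_idle) {
			printf("fully idle core starting at CPU %d\n", core);
			return 0;
		}
		/* In the kernel the visited siblings are cleared from the
		 * cpumask here, so each CPU is inspected exactly once. */
	}

	printf("no idle core; falling back to idle CPU %d\n", idle_cpu);
	return 0;
}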




* [RFC][PATCH 5/5] sched/fair: SIS_PROP the idle core scan
  2020-12-14 16:48 [RFC][PATCH 0/5] select_idle_sibling() wreckage Peter Zijlstra
                   ` (3 preceding siblings ...)
  2020-12-14 16:48 ` [RFC][PATCH 4/5] sched/fair: Merge select_idle_core/cpu() Peter Zijlstra
@ 2020-12-14 16:48 ` Peter Zijlstra
  2020-12-16 12:59 ` [RFC][PATCH 0/5] select_idle_sibling() wreckage Li, Aubrey
  5 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2020-12-14 16:48 UTC (permalink / raw)
  To: mgorman, vincent.guittot
  Cc: peterz, linux-kernel, aubrey.li, mingo, juri.lelli,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, benbjiang

Further unify the new select_idle_cpu() loop and remove the 'smt'
selection code and unconditionally use SIS_PROP, even for idle core
searches.

This effectively brings back the effects of select_idle_smt() which we
removed a few patches ago due to always iterating the target core.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched/topology.h |    1 
 kernel/sched/fair.c            |   90 +++--------------------------------------
 kernel/sched/idle.c            |    1 
 kernel/sched/sched.h           |   13 -----
 4 files changed, 7 insertions(+), 98 deletions(-)

--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -73,7 +73,6 @@ struct sched_group;
 struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
-	int		has_idle_cores;
 };
 
 struct sched_domain {
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1581,11 +1581,9 @@ numa_type numa_classify(unsigned int imb
 
 #ifdef CONFIG_SCHED_SMT
 /* Forward declarations of select_idle_sibling helpers */
-static inline bool test_idle_cores(int cpu, bool def);
 static inline int numa_idle_core(int idle_core, int cpu)
 {
-	if (!static_branch_likely(&sched_smt_present) ||
-	    idle_core >= 0 || !test_idle_cores(cpu, false))
+	if (!static_branch_likely(&sched_smt_present) || idle_core >= 0)
 		return idle_core;
 
 	/*
@@ -6020,60 +6018,6 @@ EXPORT_SYMBOL_GPL(sched_smt_present);
 
 int sched_smt_weight = 1;
 
-static inline void set_idle_cores(int cpu, int val)
-{
-	struct sched_domain_shared *sds;
-
-	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-	if (sds)
-		WRITE_ONCE(sds->has_idle_cores, val);
-}
-
-static inline bool test_idle_cores(int cpu, bool def)
-{
-	struct sched_domain_shared *sds;
-
-	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-	if (sds)
-		return READ_ONCE(sds->has_idle_cores);
-
-	return def;
-}
-
-/*
- * Scans the local SMT mask to see if the entire core is idle, and records this
- * information in sd_llc_shared->has_idle_cores.
- *
- * Since SMT siblings share all cache levels, inspecting this limited remote
- * state should be fairly cheap.
- */
-void __update_idle_core(struct rq *rq)
-{
-	int core = cpu_of(rq);
-	int cpu;
-
-	rcu_read_lock();
-	if (test_idle_cores(core, true))
-		goto unlock;
-
-	for_each_cpu(cpu, cpu_smt_mask(core)) {
-		if (cpu == core)
-			continue;
-
-		if (!available_idle_cpu(cpu))
-			goto unlock;
-	}
-
-	set_idle_cores(core, 1);
-unlock:
-	rcu_read_unlock();
-}
-
-/*
- * Scan the entire LLC domain for idle cores; this dynamically switches off if
- * there are no idle cores left in the system; tracked through
- * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
- */
 static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
 	bool idle = true;
@@ -6109,15 +6053,6 @@ static int select_idle_core(struct task_
 
 #define sched_smt_weight	1
 
-static inline void set_idle_cores(int cpu, int val)
-{
-}
-
-static inline bool test_idle_cores(int cpu, bool def)
-{
-	return def;
-}
-
 static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
 	return __select_idle_cpu(p, core, cpus, idle_cpu);
@@ -6136,7 +6071,6 @@ static int select_idle_cpu(struct task_s
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int i, cpu, idle_cpu = -1, loops = 1, nr = INT_MAX;
-	bool smt = test_idle_cores(target, false);
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
 	u64 time;
@@ -6147,7 +6081,7 @@ static int select_idle_cpu(struct task_s
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
-	if (sched_feat(SIS_PROP) && !smt) {
+	if (sched_feat(SIS_PROP)) {
 		u64 avg_cost, avg_idle, span_avg;
 
 		/*
@@ -6169,17 +6103,10 @@ static int select_idle_cpu(struct task_s
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
-		if (smt) {
-			i = select_idle_core(p, cpu, cpus, &idle_cpu);
-			if ((unsigned)i < nr_cpumask_bits)
-				return i;
-
-		} else {
-			i = __select_idle_cpu(p, cpu, cpus, &idle_cpu);
-			if ((unsigned)i < nr_cpumask_bits) {
-				idle_cpu = i;
-				break;
-			}
+		i = select_idle_core(p, cpu, cpus, &idle_cpu);
+		if ((unsigned)i < nr_cpumask_bits) {
+			idle_cpu = i;
+			break;
 		}
 
 		if (loops >= nr)
@@ -6188,10 +6115,7 @@ static int select_idle_cpu(struct task_s
 		loops++;
 	}
 
-	if (smt)
-		set_idle_cores(this, false);
-
-	if (sched_feat(SIS_PROP) && !smt) {
+	if (sched_feat(SIS_PROP)) {
 		time = cpu_clock(this) - time;
 		time = div_u64(time, loops);
 		update_avg(&this_sd->avg_scan_cost, time);
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -428,7 +428,6 @@ static void put_prev_task_idle(struct rq
 
 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
 {
-	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
 }
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1099,19 +1099,6 @@ static inline bool is_migration_disabled
 #endif
 }
 
-#ifdef CONFIG_SCHED_SMT
-extern void __update_idle_core(struct rq *rq);
-
-static inline void update_idle_core(struct rq *rq)
-{
-	if (static_branch_unlikely(&sched_smt_present))
-		__update_idle_core(rq);
-}
-
-#else
-static inline void update_idle_core(struct rq *rq) { }
-#endif
-
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))




* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2020-12-14 16:48 ` [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting Peter Zijlstra
@ 2020-12-15  3:36   ` Li, Aubrey
  2020-12-15  7:59     ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Li, Aubrey @ 2020-12-15  3:36 UTC (permalink / raw)
  To: Peter Zijlstra, mgorman, vincent.guittot
  Cc: linux-kernel, mingo, juri.lelli, valentin.schneider, qais.yousef,
	dietmar.eggemann, rostedt, bsegall, tim.c.chen, benbjiang

On 2020/12/15 0:48, Peter Zijlstra wrote:
> We compute the average cost of the total scan, but then use it as a
> per-cpu scan cost when computing the scan proportion. Fix this by
> properly computing a per-cpu scan cost.
> 
> This also fixes a bug where we would terminate early (!--nr, case) and
> not account that cost at all.

I'm a bit worried this may introduce a regression under heavy load.
The overhead of adding another cpu_clock() and calculation becomes 
significant when sis_scan is throttled by nr.

I'm not sure whether it would be better to not account the scan cost at
all when sis_scan is throttled, that is, to remove the first cpu_clock()
as well. The avg scan cost would then keep the value it had when the
system was not very busy, and once the load comes down and span avg
idle > span avg cost, we resume accounting the cost. This should make
select_idle_cpu() a bit faster when the load is very high.

Thanks,
-Aubrey
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c |   13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6144,10 +6144,10 @@ static inline int select_idle_smt(struct
>  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
>  {
>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +	int cpu, loops = 1, nr = INT_MAX;
> +	int this = smp_processor_id();
>  	struct sched_domain *this_sd;
>  	u64 time;
> -	int this = smp_processor_id();
> -	int cpu, nr = INT_MAX;
>  
>  	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>  	if (!this_sd)
> @@ -6175,14 +6175,19 @@ static int select_idle_cpu(struct task_s
>  	}
>  
>  	for_each_cpu_wrap(cpu, cpus, target) {
> -		if (!--nr)
> -			return -1;
>  		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>  			break;
> +
> +		if (loops >= nr) {
> +			cpu = -1;
> +			break;
> +		}
> +		loops++;
>  	}
>  
>  	if (sched_feat(SIS_PROP)) {
>  		time = cpu_clock(this) - time;
> +		time = div_u64(time, loops);
>  		update_avg(&this_sd->avg_scan_cost, time);
>  	}
>  
> 
> 



* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2020-12-15  3:36   ` Li, Aubrey
@ 2020-12-15  7:59     ` Peter Zijlstra
  2020-12-15 11:45       ` Mel Gorman
                         ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Peter Zijlstra @ 2020-12-15  7:59 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: mgorman, vincent.guittot, linux-kernel, mingo, juri.lelli,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, benbjiang

On Tue, Dec 15, 2020 at 11:36:35AM +0800, Li, Aubrey wrote:
> On 2020/12/15 0:48, Peter Zijlstra wrote:
> > We compute the average cost of the total scan, but then use it as a
> > per-cpu scan cost when computing the scan proportion. Fix this by
> > properly computing a per-cpu scan cost.
> > 
> > This also fixes a bug where we would terminate early (!--nr, case) and
> > not account that cost at all.
> 
> I'm a bit worried this may introduce a regression under heavy load.
> The overhead of adding another cpu_clock() and calculation becomes 
> significant when sis_scan is throttled by nr.

The thing is, the code as it exists today makes no sense what so ever.
It's plain broken batshit.

We calculate the total scanning time (irrespective of how many CPUs we
touched), and then use that to calculate the number of cpus to scan. That's
just daft.

After this patch we calculate the avg cost of scanning 1 cpu and use
that to calculate how many cpus to scan. Which is coherent and sane.

Maybe it can be improved, but that's a completely different thing.


* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2020-12-15  7:59     ` Peter Zijlstra
@ 2020-12-15 11:45       ` Mel Gorman
  2020-12-15 12:13       ` Li, Aubrey
  2021-01-08 10:27       ` Mel Gorman
  2 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2020-12-15 11:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Li, Aubrey, vincent.guittot, linux-kernel, mingo, juri.lelli,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, benbjiang

On Tue, Dec 15, 2020 at 08:59:11AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 15, 2020 at 11:36:35AM +0800, Li, Aubrey wrote:
> > On 2020/12/15 0:48, Peter Zijlstra wrote:
> > > We compute the average cost of the total scan, but then use it as a
> > > per-cpu scan cost when computing the scan proportion. Fix this by
> > > properly computing a per-cpu scan cost.
> > > 
> > > This also fixes a bug where we would terminate early (!--nr, case) and
> > > not account that cost at all.
> > 
> > I'm a bit worried this may introduce a regression under heavy load.
> > The overhead of adding another cpu_clock() and calculation becomes 
> > significant when sis_scan is throttled by nr.
> 
> The thing is, the code as it exists today makes no sense what so ever.

Which makes it very hard to reason about or change in a "safe" manner as
all sorts of counter-intuitive effects occur.

The series is queued and running and takes 1-2 days. I haven't reviewed
the patches properly (holiday) but it'll be interesting to get some
provisional data at least.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2020-12-15  7:59     ` Peter Zijlstra
  2020-12-15 11:45       ` Mel Gorman
@ 2020-12-15 12:13       ` Li, Aubrey
  2021-01-08 10:27       ` Mel Gorman
  2 siblings, 0 replies; 31+ messages in thread
From: Li, Aubrey @ 2020-12-15 12:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mgorman, vincent.guittot, linux-kernel, mingo, juri.lelli,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, benbjiang

On 2020/12/15 15:59, Peter Zijlstra wrote:
> On Tue, Dec 15, 2020 at 11:36:35AM +0800, Li, Aubrey wrote:
>> On 2020/12/15 0:48, Peter Zijlstra wrote:
>>> We compute the average cost of the total scan, but then use it as a
>>> per-cpu scan cost when computing the scan proportion. Fix this by
>>> properly computing a per-cpu scan cost.
>>>
>>> This also fixes a bug where we would terminate early (!--nr, case) and
>>> not account that cost at all.
>>
>> I'm a bit worried this may introduce a regression under heavy load.
>> The overhead of adding another cpu_clock() and calculation becomes 
>> significant when sis_scan is throttled by nr.
> 
> The thing is, the code as it exists today makes no sense what so ever.
> It's plain broken batshit.
> 
> We calculate the total scanning time (irrespective of how many CPUs we
> touched), and then use that to calculate the number of cpus to scan. That's
> just daft.
> 
> After this patch we calculate the avg cost of scanning 1 cpu and use
> that to calculate how many cpus to scan. Which is coherent and sane.

I see and all of these make sense to me.

> 
> Maybe it can be improved, but that's a completely different thing.
> 

OK, I'll go through the workloads in hand and paste the data here.

Thanks,
-Aubrey


* Re: [RFC][PATCH 0/5] select_idle_sibling() wreckage
  2020-12-14 16:48 [RFC][PATCH 0/5] select_idle_sibling() wreckage Peter Zijlstra
                   ` (4 preceding siblings ...)
  2020-12-14 16:48 ` [RFC][PATCH 5/5] sched/fair: SIS_PROP the idle core scan Peter Zijlstra
@ 2020-12-16 12:59 ` Li, Aubrey
  2020-12-16 18:07   ` Vincent Guittot
  5 siblings, 1 reply; 31+ messages in thread
From: Li, Aubrey @ 2020-12-16 12:59 UTC (permalink / raw)
  To: Peter Zijlstra, mgorman, vincent.guittot
  Cc: linux-kernel, mingo, juri.lelli, valentin.schneider, qais.yousef,
	dietmar.eggemann, rostedt, bsegall, tim.c.chen, benbjiang

Hi Peter,

On 2020/12/15 0:48, Peter Zijlstra wrote:
> Hai, here them patches Mel asked for. They've not (yet) been through the
> robots, so there might be some build fail for configs I've not used.
> 
> Benchmark time :-)
> 

Here is the data on my side, benchmarks were tested on a x86 4 sockets system
with 24 cores per socket and 2 hyperthreads per core, total 192 CPUs.

uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads	baseline-avg	%std	patch-avg	%std
  96		1		0.78	1.0072		1.09
  144		1		0.58	1.0204		0.83
  192		1		0.66	1.0151		0.52
  240		1		2.08	0.8990		0.75

hackbench: process mode, 25600 loops, 40 file descriptors per group

  group		baseline-avg	%std	patch-avg	%std
  2(80)		1		10.02	1.0339		9.94
  3(120)	1		6.69	1.0049		6.92
  4(160)	1		6.76	0.8663		8.74
  5(200)	1		2.96	0.9651		4.28

schbench: 99th percentile latency, 16 workers per message thread

  mthread	baseline-avg	%std	patch-avg	%std
  6(96)		1		0.88	1.0055		0.81
  9(144)	1		0.59	1.0007		0.37
  12(192)	1		0.61	0.9973		0.82
  15(240)	1		25.05	0.9251		18.36

sysbench mysql throughput: read/write, table size = 10,000,000

  thread	baseline-avg	%std	patch-avg	%std
  96		1               6.62	0.9668		4.04
  144		1		9.29	0.9579		6.53
  192		1		9.52	0.9503		5.35
  240		1		8.55	0.9657		3.34

It looks like 
- hackbench has a significant improvement of 4 groups
- uperf has a significant regression of 240 threads

Please let me know if you have any interested cases I can run/rerun.

Thanks,
-Aubrey


* Re: [RFC][PATCH 0/5] select_idle_sibling() wreckage
  2020-12-16 12:59 ` [RFC][PATCH 0/5] select_idle_sibling() wreckage Li, Aubrey
@ 2020-12-16 18:07   ` Vincent Guittot
  2020-12-23 13:23     ` Vincent Guittot
  0 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2020-12-16 18:07 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Peter Zijlstra, Mel Gorman, linux-kernel, Ingo Molnar,
	Juri Lelli, Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Wed, 16 Dec 2020 at 14:00, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>
> Hi Peter,
>
> On 2020/12/15 0:48, Peter Zijlstra wrote:
> > Hai, here them patches Mel asked for. They've not (yet) been through the
> > robots, so there might be some build fail for configs I've not used.
> >
> > Benchmark time :-)
> >
>
> Here is the data on my side, benchmarks were tested on a x86 4 sockets system
> with 24 cores per socket and 2 hyperthreads per core, total 192 CPUs.
>
> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
>
>   threads       baseline-avg    %std    patch-avg       %std
>   96            1               0.78    1.0072          1.09
>   144           1               0.58    1.0204          0.83
>   192           1               0.66    1.0151          0.52
>   240           1               2.08    0.8990          0.75
>
> hackbench: process mode, 25600 loops, 40 file descriptors per group
>
>   group         baseline-avg    %std    patch-avg       %std
>   2(80)         1               10.02   1.0339          9.94
>   3(120)        1               6.69    1.0049          6.92
>   4(160)        1               6.76    0.8663          8.74
>   5(200)        1               2.96    0.9651          4.28
>
> schbench: 99th percentile latency, 16 workers per message thread
>
>   mthread       baseline-avg    %std    patch-avg       %std
>   6(96)         1               0.88    1.0055          0.81
>   9(144)        1               0.59    1.0007          0.37
>   12(192)       1               0.61    0.9973          0.82
>   15(240)       1               25.05   0.9251          18.36
>
> sysbench mysql throughput: read/write, table size = 10,000,000
>
>   thread        baseline-avg    %std    patch-avg       %std
>   96            1               6.62    0.9668          4.04
>   144           1               9.29    0.9579          6.53
>   192           1               9.52    0.9503          5.35
>   240           1               8.55    0.9657          3.34
>
> It looks like
> - hackbench has a significant improvement of 4 groups
> - uperf has a significant regression of 240 threads

Tests are still running on my side but early results shows perf
regression for hackbench

>
> Please let me know if you have any interested cases I can run/rerun.
>
> Thanks,
> -Aubrey


* Re: [RFC][PATCH 0/5] select_idle_sibling() wreckage
  2020-12-16 18:07   ` Vincent Guittot
@ 2020-12-23 13:23     ` Vincent Guittot
  2021-01-04 15:40       ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2020-12-23 13:23 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Peter Zijlstra, Mel Gorman, linux-kernel, Ingo Molnar,
	Juri Lelli, Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Wed, 16 Dec 2020 at 19:07, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Wed, 16 Dec 2020 at 14:00, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
> >
> > Hi Peter,
> >
> > On 2020/12/15 0:48, Peter Zijlstra wrote:
> > > Hai, here them patches Mel asked for. They've not (yet) been through the
> > > robots, so there might be some build fail for configs I've not used.
> > >
> > > Benchmark time :-)
> > >
> >
> > Here is the data on my side, benchmarks were tested on a x86 4 sockets system
> > with 24 cores per socket and 2 hyperthreads per core, total 192 CPUs.
> >
> > uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
> >
> >   threads       baseline-avg    %std    patch-avg       %std
> >   96            1               0.78    1.0072          1.09
> >   144           1               0.58    1.0204          0.83
> >   192           1               0.66    1.0151          0.52
> >   240           1               2.08    0.8990          0.75
> >
> > hackbench: process mode, 25600 loops, 40 file descriptors per group
> >
> >   group         baseline-avg    %std    patch-avg       %std
> >   2(80)         1               10.02   1.0339          9.94
> >   3(120)        1               6.69    1.0049          6.92
> >   4(160)        1               6.76    0.8663          8.74
> >   5(200)        1               2.96    0.9651          4.28
> >
> > schbench: 99th percentile latency, 16 workers per message thread
> >
> >   mthread       baseline-avg    %std    patch-avg       %std
> >   6(96)         1               0.88    1.0055          0.81
> >   9(144)        1               0.59    1.0007          0.37
> >   12(192)       1               0.61    0.9973          0.82
> >   15(240)       1               25.05   0.9251          18.36
> >
> > sysbench mysql throughput: read/write, table size = 10,000,000
> >
> >   thread        baseline-avg    %std    patch-avg       %std
> >   96            1               6.62    0.9668          4.04
> >   144           1               9.29    0.9579          6.53
> >   192           1               9.52    0.9503          5.35
> >   240           1               8.55    0.9657          3.34
> >
> > It looks like
> > - hackbench has a significant improvement of 4 groups
> > - uperf has a significant regression of 240 threads
>
> Tests are still running on my side but early results shows perf
> regression for hackbench

A few more results before I'm off:
On a small embedded system, the problem seems to be mainly a matter of
setting the right number of loops.

On a large SMT system: the machine on which I usually run my tests is
off for now, so I haven't been able to finalize the tests yet, but the
problem might be that we no longer loop over all cores with this
patchset compared to the current algorithm.

>
> >
> > Please let me know if you have any interested cases I can run/rerun.
> >
> > Thanks,
> > -Aubrey


* Re: [RFC][PATCH 2/5] sched/fair: Make select_idle_cpu() proportional to cores
  2020-12-14 16:48 ` [RFC][PATCH 2/5] sched/fair: Make select_idle_cpu() proportional to cores Peter Zijlstra
@ 2020-12-23 13:31   ` Vincent Guittot
  0 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2020-12-23 13:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, linux-kernel, Aubrey Li, Ingo Molnar, Juri Lelli,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Mon, 14 Dec 2020 at 18:07, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Instead of calculating how many (logical) CPUs to scan, compute how
> many cores to scan.
>
> This changes behaviour for anything !SMT2.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c |   19 ++++++++++++++-----
>  kernel/sched/fair.c |   12 ++++++++++--
>  2 files changed, 24 insertions(+), 7 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7454,11 +7454,20 @@ int sched_cpu_activate(unsigned int cpu)
>         balance_push_set(cpu, false);
>
>  #ifdef CONFIG_SCHED_SMT
> -       /*
> -        * When going up, increment the number of cores with SMT present.
> -        */
> -       if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> -               static_branch_inc_cpuslocked(&sched_smt_present);
> +       do {
> +               int weight = cpumask_weight(cpu_smt_mask(cpu));
> +               extern int sched_smt_weight;
> +
> +               if (weight > sched_smt_weight)
> +                       sched_smt_weight = weight;
> +
> +               /*
> +                * When going up, increment the number of cores with SMT present.
> +                */
> +               if (weight == 2)
> +                       static_branch_inc_cpuslocked(&sched_smt_present);
> +
> +       } while (0);
>  #endif
>         set_cpu_active(cpu, true);
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6010,6 +6010,8 @@ static inline int find_idlest_cpu(struct
>  DEFINE_STATIC_KEY_FALSE(sched_smt_present);
>  EXPORT_SYMBOL_GPL(sched_smt_present);
>
> +int sched_smt_weight = 1;
> +
>  static inline void set_idle_cores(int cpu, int val)
>  {
>         struct sched_domain_shared *sds;
> @@ -6124,6 +6126,8 @@ static int select_idle_smt(struct task_s
>
>  #else /* CONFIG_SCHED_SMT */
>
> +#define sched_smt_weight       1
> +
>  static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
>  {
>         return -1;
> @@ -6136,6 +6140,8 @@ static inline int select_idle_smt(struct
>
>  #endif /* CONFIG_SCHED_SMT */
>
> +#define sis_min_cores          2
> +
>  /*
>   * Scan the LLC domain for idle CPUs; this is dynamically regulated by
>   * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> @@ -6166,10 +6172,12 @@ static int select_idle_cpu(struct task_s
>                 avg_cost = this_sd->avg_scan_cost + 1;
>
>                 span_avg = sd->span_weight * avg_idle;
> -               if (span_avg > 4*avg_cost)
> +               if (span_avg > sis_min_cores * avg_cost)
>                         nr = div_u64(span_avg, avg_cost);
>                 else
> -                       nr = 4;
> +                       nr = sis_min_cores;
> +
> +               nr *= sched_smt_weight;

My comment applies to the final result of the patchset, where
select_idle_cpu/core are merged.
IIUC, sched_smt_weight is the max number of CPUs per core, but we
already loop over all CPUs of a core in one iteration, so we don't need
to loop more than the number of cores.

>
>                 time = cpu_clock(this);
>         }
>
>


* Re: [RFC][PATCH 0/5] select_idle_sibling() wreckage
  2020-12-23 13:23     ` Vincent Guittot
@ 2021-01-04 15:40       ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-01-04 15:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Li, Aubrey, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Juri Lelli, Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Wed, Dec 23, 2020 at 02:23:41PM +0100, Vincent Guittot wrote:
> > Tests are still running on my side but early results shows perf
> > regression for hackbench
> 
> A few more results before I'm off:
> On a small embedded system, the problem seems to be mainly a matter of
> setting the right number of loops.
> 
> On a large SMT system: the machine on which I usually run my tests is
> off for now, so I haven't been able to finalize the tests yet, but the
> problem might be that we no longer loop over all cores with this
> patchset compared to the current algorithm.
> 

Tests ran over the holidays and are available at http://www.skynet.ie/~mel/postings/peterz-20210104/dashboard.html

I am trawling through the data but, by and large, the two main
observations I've had so far are:

1. The last patch seems the most problematic and the most likely to make
   a large change, particularly to hackbench. For example;
   http://www.skynet.ie/~mel/postings/peterz-20210104/scheduler-unbound/bing2/index.html#hackbench-thread-pipes

   The idle cpu cutoff is reasonably effective even though it triggers a
   lot of false positives meaning that it may be better to treat that in
   isolation

2. The cost accounting one had variable impact. Generally it was small
   gains and losses, but tbench at low client counts is an exception as
   low thread counts saw variable impact. Some big losses, although EPYC1
   is a counter-example (toto in the dashboard).

The second issue might be responsible for the first issue, not sure.
However, it does not surprise me that proper accounting would have an
impact on the SMT depth search and likely needs tweaking.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2020-12-15  7:59     ` Peter Zijlstra
  2020-12-15 11:45       ` Mel Gorman
  2020-12-15 12:13       ` Li, Aubrey
@ 2021-01-08 10:27       ` Mel Gorman
  2021-01-08 13:01         ` Qais Yousef
                           ` (2 more replies)
  2 siblings, 3 replies; 31+ messages in thread
From: Mel Gorman @ 2021-01-08 10:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Li, Aubrey, vincent.guittot, linux-kernel, mingo, juri.lelli,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, benbjiang

On Tue, Dec 15, 2020 at 08:59:11AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 15, 2020 at 11:36:35AM +0800, Li, Aubrey wrote:
> > On 2020/12/15 0:48, Peter Zijlstra wrote:
> > > We compute the average cost of the total scan, but then use it as a
> > > per-cpu scan cost when computing the scan proportion. Fix this by
> > > properly computing a per-cpu scan cost.
> > > 
> > > This also fixes a bug where we would terminate early (!--nr, case) and
> > > not account that cost at all.
> > 
> > I'm a bit worried this may introduce a regression under heavy load.
> > The overhead of adding another cpu_clock() and calculation becomes 
> > significant when sis_scan is throttled by nr.
> 
> The thing is, the code as it exists today makes no sense what so ever.
> It's plain broken batshit.
> 
> We calculate the total scanning time (irrespective of how many CPUs we
> > touched), and then use that to calculate the number of cpus to scan. That's
> just daft.
> 
> After this patch we calculate the avg cost of scanning 1 cpu and use
> that to calculate how many cpus to scan. Which is coherent and sane.
> 
> Maybe it can be improved, but that's a completely different thing.

In general I agree with everything you said so lets talk about the last
sentence.

1. avg_scan_cost is now based on the average scan cost of a rq but
   avg_idle is still scaled to the domain size. This is a bit problematic
   because it's comparing scan cost of a single rq with the estimated
   average idle time of a domain. As a result, the scan depth can be much
   larger than it was before the patch and led to some regressions.

2. Accounting for the scan cost of success makes sense but there is a
   big difference between a scan that finds an idle CPU and one that fails.
   For failures, the scan cost is wasted CPU time where as a success
   means that an uncontested CPU is used. This can cause a search to be
   truncated earlier than it should be when the domain is lightly loaded.

The patch below attempts to deal with both of those points. The
remaining difficulty is the "fuzz" factor which is necessary to bring
large avg_idle values into a reasonable range for comparing against
avg_scan_cost. The max avg_idle value can be anything but generally the
ceiling is 2*sysctl_sched_migration_cost. I didn't quickly spot a good way
of mapping avg_idle to a range between 0 and sd->span_weight.  The obvious
one was (weight*avg_idle/(2*sched_migration_cost)) but it did not work very
well as avg_scan_cost accounts for the full cost of accessing a remote rq.
However, this could be revisited later after this series is sorted out.

Anyway, for the enumerated concerns, how about this on top for your
first patch? It seemed to work well for a few workloads on 3 machines
at least and I'd like to nail this down before considering the remaining
patches. The first run indicated that the first patch offset some of the
benefits of the other patches.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 65a2f208ada8..1e04a250e8ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6156,7 +6156,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
 	if (sched_feat(SIS_PROP)) {
-		u64 avg_cost, avg_idle, span_avg;
+		u64 avg_cost, avg_idle;
 
 		/*
 		 * Due to large variance we need a large fuzz factor;
@@ -6164,25 +6164,25 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		 */
 		avg_idle = this_rq()->avg_idle / 512;
 		avg_cost = this_sd->avg_scan_cost + 1;
-
-		span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
-			nr = div_u64(span_avg, avg_cost);
-		else
+		nr = div_u64(avg_idle, avg_cost);
+		if (nr < 4)
 			nr = 4;
 
 		time = cpu_clock(this);
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+			/* Adjust cost of a successful scan */
+			loops <<= 2;
+
 			break;
+		}
 
-		if (loops >= nr) {
+		if (++loops >= nr) {
 			cpu = -1;
 			break;
 		}
-		loops++;
 	}
 
 	if (sched_feat(SIS_PROP)) {

-- 
Mel Gorman
SUSE Labs
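
For a feel of the numbers, the depth calculation in the hunk above reduces
to the following (standalone sketch, all values invented):

#include <stdio.h>

int main(void)
{
	unsigned long long rq_avg_idle = 200000;	/* ns, this_rq()->avg_idle */
	unsigned long long avg_scan_cost = 60;		/* ns per scanned rq */

	unsigned long long avg_idle = rq_avg_idle / 512;	/* fuzz factor */
	unsigned long long avg_cost = avg_scan_cost + 1;
	unsigned long long nr = avg_idle / avg_cost;

	if (nr < 4)
		nr = 4;

	/* A hit also does loops <<= 2 before the time is divided by loops,
	 * so a successful scan is charged roughly a quarter of its per-cpu
	 * cost and the average stays low enough to keep scanning deeply
	 * while idle CPUs are actually being found. */
	printf("scan depth nr = %llu runqueues\n", nr);	/* 390 / 61 = 6 */
	return 0;
}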


* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-08 10:27       ` Mel Gorman
@ 2021-01-08 13:01         ` Qais Yousef
  2021-01-08 13:47           ` Mel Gorman
  2021-01-08 13:41         ` Vincent Guittot
  2021-01-08 20:21         ` Peter Zijlstra
  2 siblings, 1 reply; 31+ messages in thread
From: Qais Yousef @ 2021-01-08 13:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Li, Aubrey, vincent.guittot, linux-kernel, mingo,
	juri.lelli, valentin.schneider, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, benbjiang

On 01/08/21 10:27, Mel Gorman wrote:
>  	for_each_cpu_wrap(cpu, cpus, target) {
> -		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> +		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
> +			/* Adjust cost of a successful scan */
> +			loops <<= 2;
> +
>  			break;
> +		}
>  
> -		if (loops >= nr) {
> +		if (++loops >= nr) {
>  			cpu = -1;
>  			break;
>  		}
> -		loops++;

Random (out of the blue) comment.

Now this will increment loops before the comparison/break. ie: we're
effectively doing one iteration less IIRC. Should loops be initialized to
0 instead of 1?

Thanks

--
Qais Yousef
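
A tiny counting sketch of the observation above, with nr = 4 and no idle
CPU found (plain C, not the kernel loop):

#include <stdio.h>

int main(void)
{
	int nr = 4, scanned;

	/* variant in the hunk quoted above: loops starts at 1, pre-increment */
	scanned = 0;
	for (int loops = 1; ; ) {
		scanned++;		/* the idleness check happens here */
		if (++loops >= nr)
			break;
	}
	printf("loops = 1: scanned %d CPUs\n", scanned);	/* 3 */

	/* starting at 0 restores nr scans */
	scanned = 0;
	for (int loops = 0; ; ) {
		scanned++;
		if (++loops >= nr)
			break;
	}
	printf("loops = 0: scanned %d CPUs\n", scanned);	/* 4 */
	return 0;
}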


* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-08 10:27       ` Mel Gorman
  2021-01-08 13:01         ` Qais Yousef
@ 2021-01-08 13:41         ` Vincent Guittot
  2021-01-08 14:40           ` Mel Gorman
  2021-01-08 20:21         ` Peter Zijlstra
  2 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2021-01-08 13:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Li, Aubrey, linux-kernel, Ingo Molnar,
	Juri Lelli, Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Fri, 8 Jan 2021 at 11:27, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Dec 15, 2020 at 08:59:11AM +0100, Peter Zijlstra wrote:
> > On Tue, Dec 15, 2020 at 11:36:35AM +0800, Li, Aubrey wrote:
> > > On 2020/12/15 0:48, Peter Zijlstra wrote:
> > > > We compute the average cost of the total scan, but then use it as a
> > > > per-cpu scan cost when computing the scan proportion. Fix this by
> > > > properly computing a per-cpu scan cost.
> > > >
> > > > This also fixes a bug where we would terminate early (!--nr, case) and
> > > > not account that cost at all.
> > >
> > > I'm a bit worried this may introduce a regression under heavy load.
> > > The overhead of adding another cpu_clock() and calculation becomes
> > > significant when sis_scan is throttled by nr.
> >
> > The thing is, the code as it exists today makes no sense what so ever.
> > It's plain broken batshit.
> >
> > We calculate the total scanning time (irrespective of how many CPUs we
> > > touched), and then use that to calculate the number of cpus to scan. That's
> > just daft.
> >
> > After this patch we calculate the avg cost of scanning 1 cpu and use
> > that to calculate how many cpus to scan. Which is coherent and sane.
> >
> > Maybe it can be improved, but that's a completely different thing.
>
> In general I agree with everything you said so lets talk about the last
> sentence.
>
> 1. avg_scan_cost is now based on the average scan cost of a rq but
>    avg_idle is still scaled to the domain size. This is a bit problematic
>    because it's comparing scan cost of a single rq with the estimated
>    average idle time of a domain. As a result, the scan depth can be much
>    larger than it was before the patch and led to some regressions.

Point 1 makes sense to me too

>
> 2. Accounting for the scan cost of success makes sense but there is a
>    big difference between a scan that finds an idle CPU and one that fails.
>    For failures, the scan cost is wasted CPU time where as a success
>    means that an uncontested CPU is used. This can cause a search to be
>    truncated earlier than it should be when the domain is lightly loaded.

But I'm not sure I catch your problem with point 2.
We track the average cost of scanning one rq, so whether we loop over
all rqs or only a few should not impact (much) the avg_scan_cost.

Trying to bias the avg_scan_cost with:  loops <<= 2;
will just make avg_scan_cost lose any kind of meaning because it
doesn't reflect the avg cost of scanning a rq anymore.

>
> The patch below attempts to deal with both of those points. The
> remaining difficulty is the "fuzz" factor which is necessary to bring
> large avg_idle values into a reasonable range for comparing against
> avg_scan_cost. The max avg_idle value can be anything but generally the
> ceiling is 2*sysctl_sched_migration_cost. I didn't quickly spot a good way
> of mapping avg_idle to a range between 0 and sd->span_weight.  The obvious
> one was (weight*avg_idle/(2*sched_migration_cost)) but it did not work very
> well as avg_scan_cost accounts for the full cost of accessing a remote rq.
> However, this could be revisited later after this series is sorted out.
>
> Anyway, for the enumerated concerns, how about this on top for your
> first patch? It seemed to work well for a few workloads on 3 machines
> at least and I'd like to nail this down before considering the remaining
> patches. The first run indicated that the first patch offset some of the
> benefits of the other patches.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 65a2f208ada8..1e04a250e8ee 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6156,7 +6156,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>         cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
>         if (sched_feat(SIS_PROP)) {
> -               u64 avg_cost, avg_idle, span_avg;
> +               u64 avg_cost, avg_idle;
>
>                 /*
>                  * Due to large variance we need a large fuzz factor;
> @@ -6164,25 +6164,25 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>                  */
>                 avg_idle = this_rq()->avg_idle / 512;
>                 avg_cost = this_sd->avg_scan_cost + 1;
> -
> -               span_avg = sd->span_weight * avg_idle;
> -               if (span_avg > 4*avg_cost)
> -                       nr = div_u64(span_avg, avg_cost);
> -               else
> +               nr = div_u64(avg_idle, avg_cost);
> +               if (nr < 4)
>                         nr = 4;
>
>                 time = cpu_clock(this);
>         }
>
>         for_each_cpu_wrap(cpu, cpus, target) {
> -               if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> +               if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
> +                       /* Adjust cost of a successful scan */
> +                       loops <<= 2;
> +
>                         break;
> +               }
>
> -               if (loops >= nr) {
> +               if (++loops >= nr) {
>                         cpu = -1;
>                         break;
>                 }
> -               loops++;
>         }
>
>         if (sched_feat(SIS_PROP)) {
>
> --
> Mel Gorman
> SUSE Labs


* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-08 13:01         ` Qais Yousef
@ 2021-01-08 13:47           ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-01-08 13:47 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Li, Aubrey, vincent.guittot, linux-kernel, mingo,
	juri.lelli, valentin.schneider, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, benbjiang

On Fri, Jan 08, 2021 at 01:01:10PM +0000, Qais Yousef wrote:
> On 01/08/21 10:27, Mel Gorman wrote:
> >  	for_each_cpu_wrap(cpu, cpus, target) {
> > -		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > +		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
> > +			/* Adjust cost of a successful scan */
> > +			loops <<= 2;
> > +
> >  			break;
> > +		}
> >  
> > -		if (loops >= nr) {
> > +		if (++loops >= nr) {
> >  			cpu = -1;
> >  			break;
> >  		}
> > -		loops++;
> 
> Random (out of the blue) comment.
> 
> Now this will increment loops before the comparison/break. ie: we're
> effectively doing one iteration less IIRC. Should loops be initialized to
> 0 instead of 1?
> 

Yep, although in practice it'll make little difference except after a
rapid phase change when avg_idle still appears high on a per-rq basis
yet the domain is fully busy with no idle CPUs.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-08 13:41         ` Vincent Guittot
@ 2021-01-08 14:40           ` Mel Gorman
  2021-01-08 15:10             ` Vincent Guittot
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2021-01-08 14:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Li, Aubrey, linux-kernel, Ingo Molnar,
	Juri Lelli, Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Fri, Jan 08, 2021 at 02:41:19PM +0100, Vincent Guittot wrote:
> > 1. avg_scan_cost is now based on the average scan cost of a rq but
> >    avg_idle is still scaled to the domain size. This is a bit problematic
> >    because it's comparing scan cost of a single rq with the estimated
> >    average idle time of a domain. As a result, the scan depth can be much
> >    larger than it was before the patch and led to some regressions.
> 
> Point 1 makes sense to me too
> 
> >
> > 2. Accounting for the scan cost of success makes sense but there is a
> >    big difference between a scan that finds an idle CPU and one that fails.
> >    For failures, the scan cost is wasted CPU time where as a success
> >    means that an uncontested CPU is used. This can cause a search to be
> >    truncated earlier than it should be when the domain is lightly loaded.
> 
> But I'm not sure I catch your problem with point 2. We track the
> average cost to scan one rq, so whether we loop over all the rqs or
> only a few should not impact (much) the avg_scan_cost.
> 
> Trying to bias the avg_scan_cost with:  loops <<= 2;
> will just make avg_scan_cost lose any kind of meaning because it
> no longer reflects the avg cost of scanning an rq.
> 

Before the series, the avg_scan_cost did not represent the cost of
scanning an RQ either. Treating scan failures and successes equally
can cause problems with the depth of the scan conducted. Previously the
"cost" of a successful scan was 0, so successful scans allowed deeper
scans in the near future. The following partially illustrates the problem.

                          5.11.0-rc2             5.11.0-rc2             5.11.0-rc2
                       baseline-v2r1          acctscan-v2r1           altscan-v2r8
Hmean     1        429.47 (   0.00%)      420.90 *  -2.00%*      414.27 *  -3.54%*
Hmean     2        709.39 (   0.00%)      796.05 *  12.22%*      791.98 *  11.64%*
Hmean     4       1449.19 (   0.00%)     1445.14 (  -0.28%)     1319.09 *  -8.98%*
Hmean     8       2765.65 (   0.00%)     2750.07 *  -0.56%*     2756.17 *  -0.34%*
Hmean     16      5158.47 (   0.00%)     5056.59 *  -1.97%*     5030.67 *  -2.48%*
Hmean     32      8969.96 (   0.00%)     8796.96 *  -1.93%*     8768.34 *  -2.25%*
Hmean     64     11210.05 (   0.00%)     9910.39 * -11.59%*    11073.42 *  -1.22%*
Hmean     128    17978.21 (   0.00%)    17031.41 *  -5.27%*    17037.76 *  -5.23%*
Hmean     256    16143.32 (   0.00%)    15636.59 *  -3.14%*    15761.12 *  -2.37%*
Hmean     320    16388.59 (   0.00%)    15591.78 *  -4.86%*    15588.85 *  -4.88%*

Note the impact of Peter's patch (acctscan-v2r1) for 64 threads. The
machine is 2-socket (40 cores, 80 threads), so at 64 threads the load is
balancing between two domains (load balancing vs wakeup migrations).
altscan is my suggested patch on top; with Peter's patch alone there is
an 11.59% regression that becomes negligible with my patch on top.

The impact is machine-specific or specific to the CPU generation. Here
is a comparison of just the suggested alteration on a slightly older
generation.

                          5.11.0-rc2             5.11.0-rc2
                       acctscan-v2r1           altscan-v2r8
Hmean     1        155.44 (   0.00%)      183.32 *  17.94%*
Hmean     2        445.46 (   0.00%)      548.51 *  23.13%*
Hmean     4       1080.25 (   0.00%)     1112.49 *   2.98%*
Hmean     8       2253.48 (   0.00%)     2457.46 *   9.05%*
Hmean     16      3996.73 (   0.00%)     4244.59 *   6.20%*
Hmean     32      5318.93 (   0.00%)     5798.17 *   9.01%*
Hmean     64      9301.55 (   0.00%)     9563.24 *   2.81%*
Hmean     128     8560.89 (   0.00%)     8873.72 *   3.65%*
Hmean     192     8526.92 (   0.00%)     8843.43 *   3.71%*

And another 2-socket machine on a newer generation.

Hmean     1        551.16 (   0.00%)      503.75 *  -8.60%*
Hmean     2       1074.19 (   0.00%)     1078.08 *   0.36%*
Hmean     4       2024.72 (   0.00%)     2049.29 *   1.21%*
Hmean     8       3762.49 (   0.00%)     4002.24 *   6.37%*
Hmean     16      6589.98 (   0.00%)     6688.21 *   1.49%*
Hmean     32     10080.23 (   0.00%)    10270.34 *   1.89%*
Hmean     64     11349.16 (   0.00%)    12452.68 *   9.72%*
Hmean     128    21670.93 (   0.00%)    21823.70 *   0.70%*
Hmean     256    20605.62 (   0.00%)    20615.01 *   0.05%*
Hmean     320    20974.29 (   0.00%)    20565.11 *  -1.95%*

For hackbench with processes communicating via pipes on the first
machine

                          5.11.0-rc2             5.11.0-rc2
                       acctscan-v2r1           altscan-v2r8
Amean     1        0.3927 (   0.00%)      0.3943 (  -0.42%)
Amean     4        0.9247 (   0.00%)      0.9267 (  -0.22%)
Amean     7        1.4587 (   0.00%)      1.5147 *  -3.84%*
Amean     12       2.3637 (   0.00%)      2.4507 *  -3.68%*
Amean     21       4.0700 (   0.00%)      4.1757 *  -2.60%*
Amean     30       5.6573 (   0.00%)      5.7390 *  -1.44%*
Amean     48       8.9037 (   0.00%)      8.8053 *   1.10%*
Amean     79      14.9190 (   0.00%)     14.4360 *   3.24%*
Amean     110     22.5703 (   0.00%)     21.9210 (   2.88%)
Amean     141     29.2400 (   0.00%)     28.0110 *   4.20%*
Amean     172     36.3720 (   0.00%)     34.7963 (   4.33%)
Amean     203     43.5783 (   0.00%)     42.5537 *   2.35%*
Amean     234     50.3653 (   0.00%)     47.3463 *   5.99%*
Amean     265     57.6153 (   0.00%)     55.6247 *   3.46%*
Amean     296     62.7370 (   0.00%)     62.0720 (   1.06%)

Adjusting the scan cost for successes is neither a universal win nor a
universal loss, but it's closer to historical behaviour and the strict
accounting does hit corner cases. If a deep scan is finding an idle CPU,
it makes some sense to continue scanning deeply by adjusting the weight
instead of prematurely failing.

The testing of the full series previously showed that some loads never
recovered from side-effects of the first patch and the last patch in the
series introduced new problems of its own. Hence, I would like to limit
the negative impact of the first patch and, if necessary, cut the last
patch altogether.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-08 14:40           ` Mel Gorman
@ 2021-01-08 15:10             ` Vincent Guittot
  2021-01-08 16:14               ` Mel Gorman
                                 ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Vincent Guittot @ 2021-01-08 15:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Li, Aubrey, linux-kernel, Ingo Molnar,
	Juri Lelli, Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Fri, 8 Jan 2021 at 15:41, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Jan 08, 2021 at 02:41:19PM +0100, Vincent Guittot wrote:
> > > 1. avg_scan_cost is now based on the average scan cost of a rq but
> > >    avg_idle is still scaled to the domain size. This is a bit problematic
> > >    because it's comparing scan cost of a single rq with the estimated
> > >    average idle time of a domain. As a result, the scan depth can be much
> > >    larger than it was before the patch and led to some regressions.
> >
> > Point 1 makes sense to me too
> >
> > >
> > > 2. Accounting for the scan cost of success makes sense but there is a
> > >    big difference between a scan that finds an idle CPU and one that fails.
> > >    For failures, the scan cost is wasted CPU time where as a success
> > >    means that an uncontested CPU is used. This can cause a search to be
> > >    truncated earlier than it should be when the domain is lightly loaded.
> >
> > But I'm not sure I catch your problem with point 2. We track the
> > average cost to scan one rq, so whether we loop over all the rqs or
> > only a few should not impact (much) the avg_scan_cost.
> >
> > Trying to bias the avg_scan_cost with:  loops <<= 2;
> > will just make avg_scan_cost lose any kind of meaning because it
> > no longer reflects the avg cost of scanning an rq.
> >
>
> Before the series, the avg_scan_cost did not represent the cost of
> scanning an RQ either. Treating scan failures and successes equally

I agree that the previous avg_scan_cost was not representing an RQ
because it was the avg cost of scanning the full domain. And we were
comparing it with the average idle time (weighted by a few factors).
This cost was also impacted by the fact that the scan can return early
because it found a cpu. That has advantages and drawbacks but at least
stays coherent in what we are comparing.

Peter's patch wants to move to a per-rq avg scan cost. And what you're
proposing is to add a magic heuristic to bias the per-rq value, which in
the end makes it just an opaque metric.

If we really want to keep the impact of the early return then IMO we
should stay at the full-domain scan level instead of per rq.

Also, there is another problem (that I'm investigating), which is that
this_rq()->avg_idle is stalled when your cpu is busy, which means that
this avg_idle can just be a very old and meaningless value. I think
that we should decay it periodically to reflect that there is less and
less idle time (in fact no more) on this busy CPU that never goes idle.
If a cpu was idle for a long period but then a long-running task
starts, the avg_idle will stay stalled at the large value, which becomes
less and less relevant.
Conversely, a cpu with a short running/idle period task will have
a lower avg_idle even though it is more often idle.

Another thing that worries me is that we use the avg_idle of the
local cpu, which is obviously not idle otherwise it would have been
selected, to decide how much time we should spend looking for
another idle CPU. I'm not sure that's the right metric to use,
especially with a possibly stalled value.
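
A rough user-space sketch of the kind of periodic decay described here; the
decay factor, tick period and data layout are all invented for illustration
and are not a concrete proposal for the kernel code.

/*
 * While a CPU stays busy, age its avg_idle estimate every "tick" instead
 * of leaving it stalled at whatever the last idle period happened to be.
 */
#include <stdio.h>
#include <stdint.h>

#define TICK_NS         4000000ULL      /* assumed 4ms tick, for illustration */

struct rq_sim {
        uint64_t avg_idle;              /* ns, last estimate of idle duration */
        int      idle;                  /* is the CPU currently idle? */
};

/* Called from the (simulated) scheduler tick while the CPU is busy */
static void decay_avg_idle(struct rq_sim *rq)
{
        if (!rq->idle)
                rq->avg_idle -= rq->avg_idle / 8;   /* age by ~12.5% per tick */
}

int main(void)
{
        struct rq_sim rq = { .avg_idle = 500000, .idle = 0 };
        int tick;

        /* A long-running task keeps the CPU busy: the stalled 500us estimate
         * fades instead of being trusted forever. */
        for (tick = 1; tick <= 5; tick++) {
                decay_avg_idle(&rq);
                printf("after tick %d (busy for %llu ns): avg_idle = %llu ns\n",
                       tick, (unsigned long long)(tick * TICK_NS),
                       (unsigned long long)rq.avg_idle);
        }
        return 0;
}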


> can cause problems with the depth of the scan conducted. Previously the
> "cost" of a successful scan was 0, so successful scans allowed deeper
> scans in the near future. The following partially illustrates the problem.
>
>                           5.11.0-rc2             5.11.0-rc2             5.11.0-rc2
>                        baseline-v2r1          acctscan-v2r1           altscan-v2r8
> Hmean     1        429.47 (   0.00%)      420.90 *  -2.00%*      414.27 *  -3.54%*
> Hmean     2        709.39 (   0.00%)      796.05 *  12.22%*      791.98 *  11.64%*
> Hmean     4       1449.19 (   0.00%)     1445.14 (  -0.28%)     1319.09 *  -8.98%*
> Hmean     8       2765.65 (   0.00%)     2750.07 *  -0.56%*     2756.17 *  -0.34%*
> Hmean     16      5158.47 (   0.00%)     5056.59 *  -1.97%*     5030.67 *  -2.48%*
> Hmean     32      8969.96 (   0.00%)     8796.96 *  -1.93%*     8768.34 *  -2.25%*
> Hmean     64     11210.05 (   0.00%)     9910.39 * -11.59%*    11073.42 *  -1.22%*
> Hmean     128    17978.21 (   0.00%)    17031.41 *  -5.27%*    17037.76 *  -5.23%*
> Hmean     256    16143.32 (   0.00%)    15636.59 *  -3.14%*    15761.12 *  -2.37%*
> Hmean     320    16388.59 (   0.00%)    15591.78 *  -4.86%*    15588.85 *  -4.88%*
>
> Note the impact of Peter's patch (acctscan-v2r1) for 64 threads. The
> machine is 2-socket (40 cores, 80 threads), so at 64 threads the load is
> balancing between two domains (load balancing vs wakeup migrations).
> altscan is my suggested patch on top; with Peter's patch alone there is
> an 11.59% regression that becomes negligible with my patch on top.
>
> The impact is machine-specific or specific to the CPU generation. Here
> is a comparison of just the suggested alteration on a slightly older
> generation.
>
>                           5.11.0-rc2             5.11.0-rc2
>                        acctscan-v2r1           altscan-v2r8
> Hmean     1        155.44 (   0.00%)      183.32 *  17.94%*
> Hmean     2        445.46 (   0.00%)      548.51 *  23.13%*
> Hmean     4       1080.25 (   0.00%)     1112.49 *   2.98%*
> Hmean     8       2253.48 (   0.00%)     2457.46 *   9.05%*
> Hmean     16      3996.73 (   0.00%)     4244.59 *   6.20%*
> Hmean     32      5318.93 (   0.00%)     5798.17 *   9.01%*
> Hmean     64      9301.55 (   0.00%)     9563.24 *   2.81%*
> Hmean     128     8560.89 (   0.00%)     8873.72 *   3.65%*
> Hmean     192     8526.92 (   0.00%)     8843.43 *   3.71%*
>
> And another 2-socket machine on a newer generation.
>
> Hmean     1        551.16 (   0.00%)      503.75 *  -8.60%*
> Hmean     2       1074.19 (   0.00%)     1078.08 *   0.36%*
> Hmean     4       2024.72 (   0.00%)     2049.29 *   1.21%*
> Hmean     8       3762.49 (   0.00%)     4002.24 *   6.37%*
> Hmean     16      6589.98 (   0.00%)     6688.21 *   1.49%*
> Hmean     32     10080.23 (   0.00%)    10270.34 *   1.89%*
> Hmean     64     11349.16 (   0.00%)    12452.68 *   9.72%*
> Hmean     128    21670.93 (   0.00%)    21823.70 *   0.70%*
> Hmean     256    20605.62 (   0.00%)    20615.01 *   0.05%*
> Hmean     320    20974.29 (   0.00%)    20565.11 *  -1.95%*
>
> For hackbench with processes communicating via pipes on the first
> machine
>
>                           5.11.0-rc2             5.11.0-rc2
>                        acctscan-v2r1           altscan-v2r8
> Amean     1        0.3927 (   0.00%)      0.3943 (  -0.42%)
> Amean     4        0.9247 (   0.00%)      0.9267 (  -0.22%)
> Amean     7        1.4587 (   0.00%)      1.5147 *  -3.84%*
> Amean     12       2.3637 (   0.00%)      2.4507 *  -3.68%*
> Amean     21       4.0700 (   0.00%)      4.1757 *  -2.60%*
> Amean     30       5.6573 (   0.00%)      5.7390 *  -1.44%*
> Amean     48       8.9037 (   0.00%)      8.8053 *   1.10%*
> Amean     79      14.9190 (   0.00%)     14.4360 *   3.24%*
> Amean     110     22.5703 (   0.00%)     21.9210 (   2.88%)
> Amean     141     29.2400 (   0.00%)     28.0110 *   4.20%*
> Amean     172     36.3720 (   0.00%)     34.7963 (   4.33%)
> Amean     203     43.5783 (   0.00%)     42.5537 *   2.35%*
> Amean     234     50.3653 (   0.00%)     47.3463 *   5.99%*
> Amean     265     57.6153 (   0.00%)     55.6247 *   3.46%*
> Amean     296     62.7370 (   0.00%)     62.0720 (   1.06%)
>
> Adjusting the scan cost for successes is neither a universal win nor a
> universal loss, but it's closer to historical behaviour and the strict
> accounting does hit corner cases. If a deep scan is finding an idle CPU,
> it makes some sense to continue scanning deeply by adjusting the weight
> instead of prematurely failing.
>
> The testing of the full series previously showed that some loads never
> recovered from side-effects of the first patch and the last patch in the
> series introduced new problems of its own. Hence, I would like to limit
> the negative impact of the first patch and, if necessary, cut the last
> patch altogether.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-08 15:10             ` Vincent Guittot
@ 2021-01-08 16:14               ` Mel Gorman
  2021-01-11 14:36                 ` Vincent Guittot
  2021-01-08 19:45               ` Peter Zijlstra
  2021-01-08 19:49               ` Peter Zijlstra
  2 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2021-01-08 16:14 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Li, Aubrey, linux-kernel, Ingo Molnar,
	Juri Lelli, Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Fri, Jan 08, 2021 at 04:10:51PM +0100, Vincent Guittot wrote:
> > > Trying to bias the avg_scan_cost with:  loops <<= 2;
> > > will just make avg_scan_cost lose any kind of meaning because it
> > > no longer reflects the avg cost of scanning an rq.
> > >
> >
> > Before the series, the avg_scan_cost also did not represent the cost of
> > scanning a RQ before either. Treating scan failures and successes equally
> 
> I agree that the previous avg_scan_cost was not representing a RQ
> because it was the avg cost of scanning the full domain.

It was not even that. As the full domain was not necessarily scanned at
all, it simply reflected how much time was spent scanning in general. It
neither represented an rq scan cost (which is variable due to cache
traffic) nor did it represent a full domain scan.

> And we were
> comparing it with the average idle time (weighted by few factors).
> And this cost was impacted by the fact that the scan can return early
> because it found a cpu. This has advantage and drawback but at least
> stays coherent in what we are comparing
> 

Not really because it represented the cost of a scan of some number of
rqs from 1 to sd->span_weight.

> Peter's patch wants to move on per rq avg scan cost. And what you're
> proposing is to add a magic heuristic to bias the per rq which at the
> end makes this value just an opaque metric.
> 

The metric is a heuristic no matter what. The scan cost of a RQ is not
fixed as it depends on whether cache data needs to be updated or not. Also
bear in mind that the first round of results and the results that I posted
showed that Peter's patch has significant corner cases that the patch
mitigates. You also note that avg_idle is an unusual metric to compare
against because it has numerous timing artifacts. At least one of them
is that we are extrapolating the domain idle time from a single rq which
is a poor proxy measure when a domain is only partially used. There just
is not a better one available without heavy writes to sd_llc which causes
its own set of problems.

> If we really want to keep the impact of early return than IMO we
> should stay on a full domain scan level instead of a per rq.
> 

That also has the same class of side-effects. Once the scan cost of
a successful scan is strictly accounted for, there are problems. Even
tracking the success scans is costly as the CPU clock has to be read and
sd_llc has to be updated.

> Also, there is another problem (that I'm investigating)  which is that
> this_rq()->avg_idle is stalled when your cpu is busy. Which means that
> this avg_idle can just be a very old and meaningless value.

Yes, avg_idle in itself is just the average inter-arrival time between
a CPU going idle and receiving a wakeup, partially bounded roughly
by 2*sysctl_sched_migration_cost. If avg_idle is traced for each
select_idle_cpu(), it's obvious that it takes time to adjust when a
load starts.
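
A sketch of that adjustment lag, assuming an update_avg()-style EWMA at each
wakeup and a clamp standing in for the 2*sysctl_sched_migration_cost bound;
the constants are invented and this is not the kernel implementation, which
lives in the wakeup path of kernel/sched/core.c.

#include <stdio.h>
#include <stdint.h>

#define MAX_AVG_IDLE    1000000ULL      /* assumed clamp, ~2*migration cost */

static uint64_t avg_idle = 800000;      /* assumed starting estimate, ns */

static void wakeup_after_idle(uint64_t idle_ns)
{
        /* update_avg()-style EWMA: move 1/8 of the way towards the sample */
        avg_idle += ((int64_t)idle_ns - (int64_t)avg_idle) / 8;
        if (avg_idle > MAX_AVG_IDLE)
                avg_idle = MAX_AVG_IDLE;
}

int main(void)
{
        int i;

        /* A load starts and the CPU now only dozes for 10us at a time:
         * the estimate takes many wakeups to adjust, as noted above. */
        for (i = 1; i <= 8; i++) {
                wakeup_after_idle(10000);
                printf("wakeup %d: avg_idle = %llu ns\n", i,
                       (unsigned long long)avg_idle);
        }
        return 0;
}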

> I think
> that we should decay it periodically to reflect there is less and less
> idle time (in fact no more)  on this busy CPU that never goes to idle.
> If a cpu was idle for a long period but then a long running task
> starts, the avg_idle will stay stalled to the large value which is
> becoming less and less relevant.

While I get what you're saying, it does not help extrapolate what the
idleness of a domain is.

> At the opposite, a cpu with a short running/idle period task will have
> a lower avg_idle whereas it is more often idle.
> 
> Another thing that worries me, is that we use the avg_idle of the
> local cpu, which is obviously not idle otherwise it would have been
> selected, to decide how much time we should spend on looking for
> another idle CPU. I'm not sure that's the right metrics to use
> especially with a possibly stalled value.
> 

A better estimate requires heavy writes to sd_llc. The cost of that will
likely offset any benefit gained by a superior selection of a scan
depth.

Treating a successful scan cost and a failed scan cost as being equal has
too many corner cases. If we do not want to weight the successful scan
cost, then the compromise is to keep the old behaviour that accounts for
scan failures only (avoids an sd_llc write at least) but base it on the
estimated scan cost for a single rq. The fact that we do not account for
scan failures should be explicitly commented so we do not forget it
again in the future.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-08 15:10             ` Vincent Guittot
  2021-01-08 16:14               ` Mel Gorman
@ 2021-01-08 19:45               ` Peter Zijlstra
  2021-01-09 14:12                 ` Mel Gorman
  2021-01-11 14:39                 ` Vincent Guittot
  2021-01-08 19:49               ` Peter Zijlstra
  2 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2021-01-08 19:45 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Li, Aubrey, linux-kernel, Ingo Molnar, Juri Lelli,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Fri, Jan 08, 2021 at 04:10:51PM +0100, Vincent Guittot wrote:
> Also, there is another problem (that I'm investigating)  which is that
> this_rq()->avg_idle is stalled when your cpu is busy. Which means that
> this avg_idle can just be a very old and meaningless value. I think
> that we should decay it periodically to reflect there is less and less

https://lkml.kernel.org/r/20180530143105.977759909@infradead.org

:-)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-08 15:10             ` Vincent Guittot
  2021-01-08 16:14               ` Mel Gorman
  2021-01-08 19:45               ` Peter Zijlstra
@ 2021-01-08 19:49               ` Peter Zijlstra
  2021-01-11 14:52                 ` Vincent Guittot
  2 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2021-01-08 19:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Li, Aubrey, linux-kernel, Ingo Molnar, Juri Lelli,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Fri, Jan 08, 2021 at 04:10:51PM +0100, Vincent Guittot wrote:
> Another thing that worries me, is that we use the avg_idle of the
> local cpu, which is obviously not idle otherwise it would have been
> selected, to decide how much time we should spend on looking for
> another idle CPU. I'm not sure that's the right metrics to use
> especially with a possibly stalled value.

The thinking was that if this CPU has little idle time, this CPU should
not spend a lot of time searching...

That is, if we spend more time looking for places to run than we have
idle time, we're losing cycles we could have spent running (supposedly
useful) work.

The only counter argument is tail latency.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-08 10:27       ` Mel Gorman
  2021-01-08 13:01         ` Qais Yousef
  2021-01-08 13:41         ` Vincent Guittot
@ 2021-01-08 20:21         ` Peter Zijlstra
  2021-01-09 13:59           ` Mel Gorman
  2 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2021-01-08 20:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Li, Aubrey, vincent.guittot, linux-kernel, mingo, juri.lelli,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, benbjiang

On Fri, Jan 08, 2021 at 10:27:38AM +0000, Mel Gorman wrote:

> 1. avg_scan_cost is now based on the average scan cost of a rq but
>    avg_idle is still scaled to the domain size. This is a bit problematic
>    because it's comparing scan cost of a single rq with the estimated
>    average idle time of a domain. As a result, the scan depth can be much
>    larger than it was before the patch and led to some regressions.

> @@ -6164,25 +6164,25 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  		 */
>  		avg_idle = this_rq()->avg_idle / 512;
>  		avg_cost = this_sd->avg_scan_cost + 1;
> -
> -		span_avg = sd->span_weight * avg_idle;
> -		if (span_avg > 4*avg_cost)
> -			nr = div_u64(span_avg, avg_cost);
> -		else
> +		nr = div_u64(avg_idle, avg_cost);
> +		if (nr < 4)
>  			nr = 4;

Oooh, could it be I simply didn't remember how that code was supposed to
work and should kick my (much) younger self for not writing a comment?

Consider:

       span_weight * avg_idle               avg_cost
  nr = ---------------------- = avg_idle / -----------
               avg_cost                    span_weight

Where: avg_cost / span_weight ~= cost-per-rq
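
With illustrative numbers (a 40-CPU LLC, avg_idle = 2000ns after the /512
scaling and a whole-scan avg_cost of 20000ns, all invented for the example),
the two forms agree:

  nr = span_weight * avg_idle / avg_cost   = 40 * 2000 / 20000  = 4
     = avg_idle / (avg_cost / span_weight) = 2000 / (20000/40)  = 4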



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-08 20:21         ` Peter Zijlstra
@ 2021-01-09 13:59           ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-01-09 13:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Li, Aubrey, vincent.guittot, linux-kernel, mingo, juri.lelli,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, benbjiang

On Fri, Jan 08, 2021 at 09:21:48PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 08, 2021 at 10:27:38AM +0000, Mel Gorman wrote:
> 
> > 1. avg_scan_cost is now based on the average scan cost of a rq but
> >    avg_idle is still scaled to the domain size. This is a bit problematic
> >    because it's comparing scan cost of a single rq with the estimated
> >    average idle time of a domain. As a result, the scan depth can be much
> >    larger than it was before the patch and led to some regressions.
> 
> > @@ -6164,25 +6164,25 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >  		 */
> >  		avg_idle = this_rq()->avg_idle / 512;
> >  		avg_cost = this_sd->avg_scan_cost + 1;
> > -
> > -		span_avg = sd->span_weight * avg_idle;
> > -		if (span_avg > 4*avg_cost)
> > -			nr = div_u64(span_avg, avg_cost);
> > -		else
> > +		nr = div_u64(avg_idle, avg_cost);
> > +		if (nr < 4)
> >  			nr = 4;
> 
> Oooh, could it be I simply didn't remember how that code was supposed to
> work and should kick my (much) younger self for not writing a comment?
> 
> Consider:
> 
>        span_weight * avg_idle               avg_cost
>   nr = ---------------------- = avg_idle / -----------
>                avg_cost                    span_weight
> 
> Where: avg_cost / span_weight ~= cost-per-rq
> 

This would definitely make sense and I even evaluated it, but the nature
of avg_idle and the scale it works at (up to 2*sched_migration_cost)
just ended up generating lunatic values far outside the domain size.
Fitting that to the domain size ended up looking silly too, and avg_cost
does not decay. Still, in principle, it's the right direction; it's just
not what the code does right now.
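
To make the scale mismatch concrete with invented numbers: with avg_idle
saturated at 2*sysctl_sched_migration_cost (default 500us) and a per-rq scan
cost assumed, purely for illustration, to be ~20ns,

  avg_idle = 2 * 500000ns / 512  ~= 1953ns   (stalled at the clamp)
  nr       = 1953 / 20           ~= 97 rqs

which is more than twice the size of a 40-CPU LLC.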

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-08 19:45               ` Peter Zijlstra
@ 2021-01-09 14:12                 ` Mel Gorman
  2021-01-11 14:39                 ` Vincent Guittot
  1 sibling, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-01-09 14:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Li, Aubrey, linux-kernel, Ingo Molnar,
	Juri Lelli, Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Fri, Jan 08, 2021 at 08:45:44PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 08, 2021 at 04:10:51PM +0100, Vincent Guittot wrote:
> > Also, there is another problem (that I'm investigating)  which is that
> > this_rq()->avg_idle is stalled when your cpu is busy. Which means that
> > this avg_idle can just be a very old and meaningless value. I think
> > that we should decay it periodically to reflect there is less and less
> 
> https://lkml.kernel.org/r/20180530143105.977759909@infradead.org
> 
> :-)

This needs to be revived. I'm of the opinion that your initial series
needs to be split somewhat into "single scan for core" parts followed by
the "Fix depth selection". I have tests in the queue and one of them is
just patch 2 on its own. Preliminary results for patch 2 on its own do not
look bad but that's based on one test (tbench). It'll be tomorrow before
it works through variants of patch 1 which I suspect will be inconclusive
and make me more convinced it should be split out separately.

The single scan for core would be patches 2-4 of the series this thread
is about which is an orthogonal problem to avoiding repeated scans of
the same runqueues during a single wakeup. I would prefer to drop patch
5 initially because the has_idle_cores throttling mechanism for idle core
searches is reasonably effective and the first round of tests indicated
that changing it was inconclusive and should be treated separately.

The depth scan stuff would go on top because right now the depth scan is
based on magic numbers and no amount of shuffling that around will make
it less magic without an avg_idle value that decays and potentially the
scan cost also aging.  It's much less straight-forward than the single
scan aspect.

Switching idle core searches to SIS_PROP should also be treated
separately.

Thoughts? I don't want to go too far in this direction on my own because
the testing requirements are severe in terms of time. Even then no matter
how much I test, stuff will be missed as it's sensitive to domain sizes,
CPU generation, cache topology, cache characteristics, domain utilisation,
architecture, kitchen sink etc.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-08 16:14               ` Mel Gorman
@ 2021-01-11 14:36                 ` Vincent Guittot
  2021-01-11 15:58                   ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2021-01-11 14:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Li, Aubrey, linux-kernel, Ingo Molnar,
	Juri Lelli, Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Fri, 8 Jan 2021 at 17:14, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Jan 08, 2021 at 04:10:51PM +0100, Vincent Guittot wrote:
> > > > Trying to bias the avg_scan_cost with:  loops <<= 2;
> > > > will just make avg_scan_cost lose any kind of meaning because it
> > > > no longer reflects the avg cost of scanning an rq.
> > > >
> > >
> > > Before the series, the avg_scan_cost also did not represent the cost of
> > > scanning a RQ before either. Treating scan failures and successes equally
> >
> > I agree that the previous avg_scan_cost was not representing a RQ
> > because it was the avg cost of scanning the full domain.
>
> It was not even that. As the full domain was not necessarily scanned at
> all, it simply reflected how much time was spent scanning in general. It
> neither represented an rq scan cost (which is variable due to cache
> traffic) nor did it represent a full domain scan.

My point was mainly that the goal was to monitor the avg scan cost for
the domain (whatever the number of rqs effectively checked), so the
duration was impacted by a successful scan that returns early.

>
> > And we were
> > comparing it with the average idle time (weighted by few factors).
> > And this cost was impacted by the fact that the scan can return early
> > because it found a cpu. This has advantage and drawback but at least
> > stays coherent in what we are comparing
> >
>
> Not really because it represented the cost of a scan of some number of
> rqs from 1 to sd->span_weight.
>
> > Peter's patch wants to move on per rq avg scan cost. And what you're
> > proposing is to add a magic heuristic to bias the per rq which at the
> > end makes this value just an opaque metric.
> >
>
> The metric is a heuristic no matter what. The scan cost of a RQ is not
> fixed as it depends on whether cache data needs to be updated or not. Also
> bear in mind that the first round of results and the results that I posted
> showed that Peter's patch has significant corner cases that the patch
> mitigates. You also note that avg_idle is an unusual metric to compare
> against because it has numerous timing artifacts. At least one of them
> is that we are extrapolating the domain idle time from a single rq which
> is a poor proxy measure when a domain is only partially used. There just
> is not a better one available without heavy writes to sd_llc which causes
> its own set of problems.
>
> > If we really want to keep the impact of early return than IMO we
> > should stay on a full domain scan level instead of a per rq.
> >
>
> That also has the same class of side-effects. Once the scan cost of
> a successful scan is strictly accounted for, there are problems. Even
> tracking the success scans is costly as the CPU clock has to be read and
> sd_llc has to be updated.
>
> > Also, there is another problem (that I'm investigating)  which is that
> > this_rq()->avg_idle is stalled when your cpu is busy. Which means that
> > this avg_idle can just be a very old and meaningless value.
>
> Yes, avg_idle in itself is just the average inter-arrival time between
> a CPU going idle and receiving a wakeup partially bound roughly
> by 2*sysctl_sched_migration_cost. If avg_idle is traced for each
> select_idle_cpu(), it's obvious that it takes time to adjust when a
> load starts.
>
> > I think
> > that we should decay it periodically to reflect there is less and less
> > idle time (in fact no more)  on this busy CPU that never goes to idle.
> > If a cpu was idle for a long period but then a long running task
> > starts, the avg_idle will stay stalled to the large value which is
> > becoming less and less relevant.
>
> While I get what you're saying, it does not help extrapolate what the
> idleness of a domain is.

No, but it gives a more up-to-date view of the idleness of the local
cpu, which is better than a stalled value.

>
> > At the opposite, a cpu with a short running/idle period task will have
> > a lower avg_idle whereas it is more often idle.
> >
> > Another thing that worries me, is that we use the avg_idle of the
> > local cpu, which is obviously not idle otherwise it would have been
> > selected, to decide how much time we should spend on looking for
> > another idle CPU. I'm not sure that's the right metrics to use
> > especially with a possibly stalled value.
> >
>
> A better estimate requires heavy writes to sd_llc. The cost of that will
> likely offset any benefit gained by a superior selection of a scan
> depth.
>
> Treating a successful scan cost and a failed scan cost as being equal has
> too many corner cases. If we do not want to weight the successful scan
> cost, then the compromise is to keep the old behaviour that accounts for

I think that keeping the current way of computing scan_cost is the best
option for now.

> scan failures only (avoids an sd_llc write at least) but base it on the
> estimated scan cost for a single rq. The fact that we do not account for
> scan failures should be explicitly commented so we do not forget it
> again in the future.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-08 19:45               ` Peter Zijlstra
  2021-01-09 14:12                 ` Mel Gorman
@ 2021-01-11 14:39                 ` Vincent Guittot
  1 sibling, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2021-01-11 14:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Li, Aubrey, linux-kernel, Ingo Molnar, Juri Lelli,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Fri, 8 Jan 2021 at 20:45, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jan 08, 2021 at 04:10:51PM +0100, Vincent Guittot wrote:
> > Also, there is another problem (that I'm investigating)  which is that
> > this_rq()->avg_idle is stalled when your cpu is busy. Which means that
> > this avg_idle can just be a very old and meaningless value. I think
> > that we should decay it periodically to reflect there is less and less
>
> https://lkml.kernel.org/r/20180530143105.977759909@infradead.org
>
> :-)

:-)

I was thinking more of something like decaying avg_idle during
tick_fair, but at least we all agree that we can't stay with a stalled
avg_idle value.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-08 19:49               ` Peter Zijlstra
@ 2021-01-11 14:52                 ` Vincent Guittot
  0 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2021-01-11 14:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Li, Aubrey, linux-kernel, Ingo Molnar, Juri Lelli,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Fri, 8 Jan 2021 at 20:49, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jan 08, 2021 at 04:10:51PM +0100, Vincent Guittot wrote:
> > Another thing that worries me, is that we use the avg_idle of the
> > local cpu, which is obviously not idle otherwise it would have been
> > selected, to decide how much time we should spend on looking for
> > another idle CPU. I'm not sure that's the right metrics to use
> > especially with a possibly stalled value.
>
> The thinking was that if this CPU has little idle time, this CPU should
> not spend a lot of time searching...
>
> That is, if we spend more time looking for places to run than we have
> idle time, we're losing cycles we could have spent running (supposedly
> useful) work.

I understand the rationale of looking at the previous avg idle time to
decide how much time we can "waste" looking for a cpu for the waking
task. The problem is that this value is "stalled": it gives us an
idea of the duration of the next idle time but not really of when the
next idle time will happen, which can be several seconds from now.
And we may have already used this next avg_idle time for other wakeup
searches.

>
> The only counter argument is tail latency.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
  2021-01-11 14:36                 ` Vincent Guittot
@ 2021-01-11 15:58                   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-01-11 15:58 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Li, Aubrey, linux-kernel, Ingo Molnar,
	Juri Lelli, Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, Jiang Biao

On Mon, Jan 11, 2021 at 03:36:57PM +0100, Vincent Guittot wrote:
> > > <SNIP>
> > >
> > > I think
> > > that we should decay it periodically to reflect there is less and less
> > > idle time (in fact no more)  on this busy CPU that never goes to idle.
> > > If a cpu was idle for a long period but then a long running task
> > > starts, the avg_idle will stay stalled to the large value which is
> > > becoming less and less relevant.
> >
> > While I get what you're saying, it does not help extrapolate what the
> > idleness of a domain is.
> 
> No, but it gives a more up-to-date view of the idleness of the local
> cpu, which is better than a stalled value.
> 

Fair enough.

> >
> > > At the opposite, a cpu with a short running/idle period task will have
> > > a lower avg_idle whereas it is more often idle.
> > >
> > > Another thing that worries me, is that we use the avg_idle of the
> > > local cpu, which is obviously not idle otherwise it would have been
> > > selected, to decide how much time we should spend on looking for
> > > another idle CPU. I'm not sure that's the right metrics to use
> > > especially with a possibly stalled value.
> > >
> >
> > A better estimate requires heavy writes to sd_llc. The cost of that will
> > likely offset any benefit gained by a superior selection of a scan
> > depth.
> >
> > Treating a successful scan cost and a failed scan cost as being equal has
> > too many corner cases. If we do not want to weight the successful scan
> > cost, then the compromise is to keep the old behaviour that accounts for
> 
> I think that keeping the current way of computing scan_cost is the best
> option for now.
> 

I sent a series that drops this patch for the moment as well as the
SIS_PROP for selecting a core.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2021-01-11 15:58 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
2020-12-14 16:48 [RFC][PATCH 0/5] select_idle_sibling() wreckage Peter Zijlstra
2020-12-14 16:48 ` [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting Peter Zijlstra
2020-12-15  3:36   ` Li, Aubrey
2020-12-15  7:59     ` Peter Zijlstra
2020-12-15 11:45       ` Mel Gorman
2020-12-15 12:13       ` Li, Aubrey
2021-01-08 10:27       ` Mel Gorman
2021-01-08 13:01         ` Qais Yousef
2021-01-08 13:47           ` Mel Gorman
2021-01-08 13:41         ` Vincent Guittot
2021-01-08 14:40           ` Mel Gorman
2021-01-08 15:10             ` Vincent Guittot
2021-01-08 16:14               ` Mel Gorman
2021-01-11 14:36                 ` Vincent Guittot
2021-01-11 15:58                   ` Mel Gorman
2021-01-08 19:45               ` Peter Zijlstra
2021-01-09 14:12                 ` Mel Gorman
2021-01-11 14:39                 ` Vincent Guittot
2021-01-08 19:49               ` Peter Zijlstra
2021-01-11 14:52                 ` Vincent Guittot
2021-01-08 20:21         ` Peter Zijlstra
2021-01-09 13:59           ` Mel Gorman
2020-12-14 16:48 ` [RFC][PATCH 2/5] sched/fair: Make select_idle_cpu() proportional to cores Peter Zijlstra
2020-12-23 13:31   ` Vincent Guittot
2020-12-14 16:48 ` [RFC][PATCH 3/5] sched/fair: Remove select_idle_smt() Peter Zijlstra
2020-12-14 16:48 ` [RFC][PATCH 4/5] sched/fair: Merge select_idle_core/cpu() Peter Zijlstra
2020-12-14 16:48 ` [RFC][PATCH 5/5] sched/fair: SIS_PROP the idle core scan Peter Zijlstra
2020-12-16 12:59 ` [RFC][PATCH 0/5] select_idle_sibling() wreckage Li, Aubrey
2020-12-16 18:07   ` Vincent Guittot
2020-12-23 13:23     ` Vincent Guittot
2021-01-04 15:40       ` Mel Gorman
