linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/9] Modify and/or delete SIS_PROP
@ 2021-07-26 10:22 Mel Gorman
  2021-07-26 10:22 ` [PATCH 1/9] sched/fair: Track efficiency of select_idle_sibling Mel Gorman
                   ` (8 more replies)
  0 siblings, 9 replies; 19+ messages in thread
From: Mel Gorman @ 2021-07-26 10:22 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Mel Gorman

When scanning for an idle CPU, SIS_PROP limits the scan based on an average
estimated idle time for a domain but it's a somewhat inconsistent and
fuzzy heuristic. Idle time of a CPU does not necessarily correlate with
idle time of a domain, scanning for an idle core is not throttled at all
and successful scans are not accounted for.

The first three patches are accounting patches to put some numbers on the
scan efficiency. Ultimately, they may not be merged but they are useful
for determining if a patch works as advertised.

Patch 4 improves recent_used_cpu to reduce the amount of scanning that is
done in absolute terms.

Patch 5 notes that the target CPU has already been scanned with
select_idle_cpu starts and therefore should be skipped.

Patches 6-9 replace SIS_PROP with a scheme based on partially tracking
idle CPUs, first proposed by Aubrey Li and modified by this series.
It is likely to be the most controversial, not necessary a win, but the
possibility was discussed. The series is likely to be split with one
series being patches 4-5 and a second being patches 6-9.

The series is not a univeral win or loss but there are some improvements
and overall, scan efficiencies are improved. A limiting factor in the
evaluation is that tracking the statistics is expensive on its own and
all tests were run with schedstat enabled.

Depending on how this RFC is received, testing would be done without
schedstat and the series may be split.

 include/linux/sched/topology.h |  13 +++
 kernel/sched/core.c            |   7 +-
 kernel/sched/debug.c           |   8 ++
 kernel/sched/fair.c            | 149 ++++++++++++++++++---------------
 kernel/sched/features.h        |   5 --
 kernel/sched/idle.c            |   5 ++
 kernel/sched/sched.h           |  17 +++-
 kernel/sched/stats.c           |  10 ++-
 kernel/sched/topology.c        |   3 +-
 9 files changed, 133 insertions(+), 84 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/9] sched/fair: Track efficiency of select_idle_sibling
  2021-07-26 10:22 [RFC PATCH 0/9] Modify and/or delete SIS_PROP Mel Gorman
@ 2021-07-26 10:22 ` Mel Gorman
  2021-07-26 10:22 ` [PATCH 2/9] sched/fair: Track efficiency of task recent_used_cpu Mel Gorman
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2021-07-26 10:22 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Mel Gorman

select_idle_sibling is an important path that finds a nearby idle CPU on
wakeup. As it is examining other CPUs state, it can be expensive in terms
of cache usage. This patch tracks the search efficiency if schedstats
are enabled. In general, this is only useful for kernel developers but
schedstats are typically disabled by default so it is convenient for
development and mostly free otherwise.

The series can be done without this patch but the stats were used to
generate a number of useful metrics in mmtest to analyse what was
going on.

SIS Search: Number of calls to select_idle_sibling

SIS Domain Search: Number of times the domain was searched because the
	fast path failed.

SIS Scanned: Generally the number of runqueues scanned but the fast
	path counts as 1 regardless of the values for target, prev
	and recent.

SIS Domain Scanned: Number of runqueues scanned during a search of the
	LLC domain.

SIS Failures: Number of SIS calls that failed to find an idle CPU

SIS Search Efficiency: A ratio expressed as a percentage of runqueues
	scanned versus idle CPUs found. A 100% efficiency indicates that
	the target, prev or recent CPU of a task was idle at wakeup. The
	lower the efficiency, the more runqueues were scanned before an
	idle CPU was found.

SIS Domain Search Efficiency: Similar, except only for the slower SIS
	patch.

SIS Fast Success Rate: Percentage of SIS that used target, prev or
	recent CPUs.

SIS Success rate: Percentage of scans that found an idle CPU.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/debug.c |  4 ++++
 kernel/sched/fair.c  | 14 ++++++++++++++
 kernel/sched/sched.h |  6 ++++++
 kernel/sched/stats.c |  8 +++++---
 4 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 0c5ec2776ddf..603d4bc71612 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -738,6 +738,10 @@ do {									\
 		P(sched_goidle);
 		P(ttwu_count);
 		P(ttwu_local);
+		P(sis_search);
+		P(sis_domain_search);
+		P(sis_scanned);
+		P(sis_failed);
 	}
 #undef P
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44c452072a1b..cc0b451d1794 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6064,6 +6064,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 
 static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 {
+	schedstat_inc(this_rq()->sis_scanned);
 	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
 	    sched_cpu_cookie_match(cpu_rq(cpu), p))
 		return cpu;
@@ -6138,6 +6139,7 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 		return __select_idle_cpu(core, p);
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
+		schedstat_inc(this_rq()->sis_scanned);
 		if (!available_idle_cpu(cpu)) {
 			idle = false;
 			if (*idle_cpu == -1) {
@@ -6168,6 +6170,7 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 	int cpu;
 
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
+		schedstat_inc(this_rq()->sis_scanned);
 		if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
 		    !cpumask_test_cpu(cpu, sched_domain_span(sd)))
 			continue;
@@ -6334,6 +6337,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	unsigned long task_util;
 	int i, recent_used_cpu;
 
+	schedstat_inc(this_rq()->sis_search);
+
+	/*
+	 * Checking if prev, target and recent is treated as one scan. A
+	 * perfect hit on one of those is considered 100% efficiency.
+	 * Further scanning impairs efficiency.
+	 */
+	schedstat_inc(this_rq()->sis_scanned);
+
 	/*
 	 * On asymmetric system, update task utilization because we will check
 	 * that the task fits with cpu's capacity.
@@ -6414,6 +6426,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
+	schedstat_inc(this_rq()->sis_domain_search);
 	if (sched_smt_active()) {
 		has_idle_core = test_idle_cores(target, false);
 
@@ -6428,6 +6441,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	schedstat_inc(this_rq()->sis_failed);
 	return target;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14a41a243f7b..4cf307763fe9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1072,6 +1072,12 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int		ttwu_count;
 	unsigned int		ttwu_local;
+
+	/* select_idle_sibling stats */
+	unsigned int		sis_search;
+	unsigned int		sis_domain_search;
+	unsigned int		sis_scanned;
+	unsigned int		sis_failed;
 #endif
 
 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 3f93fc3b5648..7dd9b0dec437 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -10,7 +10,7 @@
  * Bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 15
+#define SCHEDSTAT_VERSION 16
 
 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -30,12 +30,14 @@ static int show_schedstat(struct seq_file *seq, void *v)
 
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+		    rq->sis_search, rq->sis_domain_search,
+		    rq->sis_scanned, rq->sis_failed);
 
 		seq_printf(seq, "\n");
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 2/9] sched/fair: Track efficiency of task recent_used_cpu
  2021-07-26 10:22 [RFC PATCH 0/9] Modify and/or delete SIS_PROP Mel Gorman
  2021-07-26 10:22 ` [PATCH 1/9] sched/fair: Track efficiency of select_idle_sibling Mel Gorman
@ 2021-07-26 10:22 ` Mel Gorman
  2021-07-26 10:22 ` [PATCH 3/9] sched/fair: Track efficiency of select_idle_core Mel Gorman
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2021-07-26 10:22 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Mel Gorman

This simply tracks the efficiency of the recent_used_cpu. The hit rate
of this matters as it can avoid a domain search. Similarly, the miss
rate matters because each miss is a penalty to the fast path. MMTests
uses this to generate three additional metrics

SIS Recent Used Hit: A recent CPU was eligible and used. Each hit is
	a domain search avoided.

SIS Recent Used Miss: A recent CPU was eligible but unavailable. Each
	time this is hit, there was a penalty to the fast path before
	a domain search happened.

SIS Recent Success Rate: A percentage of the number of hits versus
	the total attempts to use the recent CPU.

SIS Recent Attempts: The total number of times the recent CPU was examined.
	A high number of Recent Attempts with a low Success Rate implies
	the fast path is being punished severely. This could have been
	presented as a weighting of hits and misses but calculating an
	appropriate weight for misses is problematic.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/debug.c |  2 ++
 kernel/sched/fair.c  | 28 ++++++++++++++++------------
 kernel/sched/sched.h |  2 ++
 kernel/sched/stats.c |  7 ++++---
 4 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 603d4bc71612..1ec87b7bb6a9 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -742,6 +742,8 @@ do {									\
 		P(sis_domain_search);
 		P(sis_scanned);
 		P(sis_failed);
+		P(sis_recent_hit);
+		P(sis_recent_miss);
 	}
 #undef P
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cc0b451d1794..4d48dc08a49b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6388,18 +6388,22 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 
 	/* Check a recently used CPU as a potential idle candidate: */
 	recent_used_cpu = p->recent_used_cpu;
-	if (recent_used_cpu != prev &&
-	    recent_used_cpu != target &&
-	    cpus_share_cache(recent_used_cpu, target) &&
-	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
-	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
-	    asym_fits_capacity(task_util, recent_used_cpu)) {
-		/*
-		 * Replace recent_used_cpu with prev as it is a potential
-		 * candidate for the next wake:
-		 */
-		p->recent_used_cpu = prev;
-		return recent_used_cpu;
+	if (recent_used_cpu != prev && recent_used_cpu != target) {
+
+		if (cpus_share_cache(recent_used_cpu, target) &&
+		    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
+		    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
+		    asym_fits_capacity(task_util, recent_used_cpu)) {
+			/*
+			 * Replace recent_used_cpu with prev as it is a potential
+			 * candidate for the next wake:
+			 */
+			schedstat_inc(this_rq()->sis_recent_hit);
+			p->recent_used_cpu = prev;
+			return recent_used_cpu;
+		}
+
+		schedstat_inc(this_rq()->sis_recent_miss);
 	}
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4cf307763fe9..1c04d7a97dbe 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1078,6 +1078,8 @@ struct rq {
 	unsigned int		sis_domain_search;
 	unsigned int		sis_scanned;
 	unsigned int		sis_failed;
+	unsigned int		sis_recent_hit;
+	unsigned int		sis_recent_miss;
 #endif
 
 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 7dd9b0dec437..1aa648edb88b 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -10,7 +10,7 @@
  * Bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 16
+#define SCHEDSTAT_VERSION 17
 
 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -30,14 +30,15 @@ static int show_schedstat(struct seq_file *seq, void *v)
 
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u %u %u",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
 		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
 		    rq->sis_search, rq->sis_domain_search,
-		    rq->sis_scanned, rq->sis_failed);
+		    rq->sis_scanned, rq->sis_failed,
+		    rq->sis_recent_hit, rq->sis_recent_miss);
 
 		seq_printf(seq, "\n");
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 3/9] sched/fair: Track efficiency of select_idle_core
  2021-07-26 10:22 [RFC PATCH 0/9] Modify and/or delete SIS_PROP Mel Gorman
  2021-07-26 10:22 ` [PATCH 1/9] sched/fair: Track efficiency of select_idle_sibling Mel Gorman
  2021-07-26 10:22 ` [PATCH 2/9] sched/fair: Track efficiency of task recent_used_cpu Mel Gorman
@ 2021-07-26 10:22 ` Mel Gorman
  2021-07-26 10:22 ` [PATCH 4/9] sched/fair: Use prev instead of new target as recent_used_cpu Mel Gorman
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2021-07-26 10:22 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Mel Gorman

Add efficiency tracking for select_idle_core.

MMTests uses this to generate additional metrics.

SIS Core Search: The number of times a domain was searched for an idle core

SIS Core Hit: Idle core search success.

SIS Core Miss: Idle core search miss.

SIS Core Search Eff: The percentage of searches that were successful

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/debug.c | 2 ++
 kernel/sched/fair.c  | 2 ++
 kernel/sched/sched.h | 2 ++
 kernel/sched/stats.c | 7 ++++---
 4 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1ec87b7bb6a9..26bdc455e4f4 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -744,6 +744,8 @@ do {									\
 		P(sis_failed);
 		P(sis_recent_hit);
 		P(sis_recent_miss);
+		P(sis_core_search);
+		P(sis_core_failed);
 	}
 #undef P
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4d48dc08a49b..4e2979b73cec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6138,6 +6138,7 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 	if (!static_branch_likely(&sched_smt_present))
 		return __select_idle_cpu(core, p);
 
+	schedstat_inc(this_rq()->sis_core_search);
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
 		schedstat_inc(this_rq()->sis_scanned);
 		if (!available_idle_cpu(cpu)) {
@@ -6158,6 +6159,7 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 	if (idle)
 		return core;
 
+	schedstat_inc(this_rq()->sis_core_failed);
 	cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
 	return -1;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1c04d7a97dbe..e31179e6c6ff 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1080,6 +1080,8 @@ struct rq {
 	unsigned int		sis_failed;
 	unsigned int		sis_recent_hit;
 	unsigned int		sis_recent_miss;
+	unsigned int		sis_core_search;
+	unsigned int		sis_core_failed;
 #endif
 
 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 1aa648edb88b..5672f3dc7002 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -10,7 +10,7 @@
  * Bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 17
+#define SCHEDSTAT_VERSION 18
 
 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -30,7 +30,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
 
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u %u %u",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u %u %u %u %u",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
@@ -38,7 +38,8 @@ static int show_schedstat(struct seq_file *seq, void *v)
 		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
 		    rq->sis_search, rq->sis_domain_search,
 		    rq->sis_scanned, rq->sis_failed,
-		    rq->sis_recent_hit, rq->sis_recent_miss);
+		    rq->sis_recent_hit, rq->sis_recent_miss,
+		    rq->sis_core_search, rq->sis_core_failed);
 
 		seq_printf(seq, "\n");
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 4/9] sched/fair: Use prev instead of new target as recent_used_cpu
  2021-07-26 10:22 [RFC PATCH 0/9] Modify and/or delete SIS_PROP Mel Gorman
                   ` (2 preceding siblings ...)
  2021-07-26 10:22 ` [PATCH 3/9] sched/fair: Track efficiency of select_idle_core Mel Gorman
@ 2021-07-26 10:22 ` Mel Gorman
  2021-07-26 10:22 ` [PATCH 5/9] sched/fair: Avoid a second scan of target in select_idle_cpu Mel Gorman
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2021-07-26 10:22 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Mel Gorman

After select_idle_sibling, p->recent_used_cpu is set to the
new target. However on the next wakeup, prev will be the same as
recent_used_cpu unless the load balancer has moved the task since the
last wakeup. It still works, but is less efficient than it could be.
This patch preserves recent_used_cpu for longer.

The impact on SIS efficiency is tiny so the SIS statistic patches were
used to track the hit rate for using recent_used_cpu. With perf bench
pipe on a 2-socket Cascadelake machine, the hit rate went from 57.14%
to 85.32%. For more intensive wakeup loads like hackbench, the hit rate
is almost negligible but rose from 0.21% to 6.64%. For scaling loads
like tbench, the hit rate goes from almost 0% to 25.42% overall. Broadly
speaking, on tbench, the success rate is much higher for lower thread
counts and drops to almost 0 as the workload scales to towards saturation.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e2979b73cec..75ff991a460a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6390,6 +6390,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 
 	/* Check a recently used CPU as a potential idle candidate: */
 	recent_used_cpu = p->recent_used_cpu;
+	p->recent_used_cpu = prev;
 	if (recent_used_cpu != prev && recent_used_cpu != target) {
 
 		if (cpus_share_cache(recent_used_cpu, target) &&
@@ -6922,9 +6923,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-
-		if (want_affine)
-			current->recent_used_cpu = cpu;
 	}
 	rcu_read_unlock();
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 5/9] sched/fair: Avoid a second scan of target in select_idle_cpu
  2021-07-26 10:22 [RFC PATCH 0/9] Modify and/or delete SIS_PROP Mel Gorman
                   ` (3 preceding siblings ...)
  2021-07-26 10:22 ` [PATCH 4/9] sched/fair: Use prev instead of new target as recent_used_cpu Mel Gorman
@ 2021-07-26 10:22 ` Mel Gorman
  2021-07-26 10:22 ` [PATCH 6/9] sched/fair: Make select_idle_cpu() proportional to cores Mel Gorman
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2021-07-26 10:22 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Mel Gorman

When select_idle_cpu starts scanning for an idle CPU, it starts with
a target CPU that has already been checked by select_idle_sibling.
This patch starts with the next CPU instead.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 75ff991a460a..125c020746b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6254,7 +6254,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
-	for_each_cpu_wrap(cpu, cpus, target) {
+	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned int)i < nr_cpumask_bits)
-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 6/9] sched/fair: Make select_idle_cpu() proportional to cores
  2021-07-26 10:22 [RFC PATCH 0/9] Modify and/or delete SIS_PROP Mel Gorman
                   ` (4 preceding siblings ...)
  2021-07-26 10:22 ` [PATCH 5/9] sched/fair: Avoid a second scan of target in select_idle_cpu Mel Gorman
@ 2021-07-26 10:22 ` Mel Gorman
  2021-07-26 10:22 ` [PATCH 7/9] sched/fair: Enforce proportional scan limits when scanning for an idle core Mel Gorman
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2021-07-26 10:22 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Mel Gorman

From: Peter Zijlstra (Intel) <peterz@infradead.org>

Instead of calculating how many (logical) CPUs to scan, compute how
many cores to scan.

This changes behaviour for anything !SMT2.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/core.c  | 17 ++++++++++++-----
 kernel/sched/fair.c  | 11 +++++++++--
 kernel/sched/sched.h |  2 ++
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d9ff40f4661..7c073b67f1ea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8647,11 +8647,18 @@ int sched_cpu_activate(unsigned int cpu)
 	balance_push_set(cpu, false);
 
 #ifdef CONFIG_SCHED_SMT
-	/*
-	 * When going up, increment the number of cores with SMT present.
-	 */
-	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
-		static_branch_inc_cpuslocked(&sched_smt_present);
+	do {
+		int weight = cpumask_weight(cpu_smt_mask(cpu));
+
+		if (weight > sched_smt_weight)
+			sched_smt_weight = weight;
+
+		/*
+		 * When going up, increment the number of cores with SMT present.
+		 */
+		if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+			static_branch_inc_cpuslocked(&sched_smt_present);
+	} while (0);
 #endif
 	set_cpu_active(cpu, true);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 125c020746b8..20b9255ebf97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6076,6 +6076,8 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
+int __read_mostly sched_smt_weight = 1;
+
 static inline void set_idle_cores(int cpu, int val)
 {
 	struct sched_domain_shared *sds;
@@ -6194,6 +6196,8 @@ static inline bool test_idle_cores(int cpu, bool def)
 	return def;
 }
 
+#define sched_smt_weight 1
+
 static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
 	return __select_idle_cpu(core, p);
@@ -6206,6 +6210,8 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+#define sis_min_cores	2
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6246,11 +6252,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		avg_cost = this_sd->avg_scan_cost + 1;
 
 		span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
+		if (span_avg > sis_min_cores * avg_cost)
 			nr = div_u64(span_avg, avg_cost);
 		else
-			nr = 4;
+			nr = sis_min_cores;
 
+		nr *= sched_smt_weight;
 		time = cpu_clock(this);
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e31179e6c6ff..4d47a0969710 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1351,6 +1351,8 @@ do {						\
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
 
+extern int sched_smt_weight;
+
 static inline void update_idle_core(struct rq *rq)
 {
 	if (static_branch_unlikely(&sched_smt_present))
-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 7/9] sched/fair: Enforce proportional scan limits when scanning for an idle core
  2021-07-26 10:22 [RFC PATCH 0/9] Modify and/or delete SIS_PROP Mel Gorman
                   ` (5 preceding siblings ...)
  2021-07-26 10:22 ` [PATCH 6/9] sched/fair: Make select_idle_cpu() proportional to cores Mel Gorman
@ 2021-07-26 10:22 ` Mel Gorman
  2021-08-02 10:52   ` Song Bao Hua (Barry Song)
  2021-07-26 10:22 ` [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup Mel Gorman
  2021-07-26 10:22 ` [PATCH 9/9] sched/core: Delete SIS_PROP and rely on the idle cpu mask Mel Gorman
  8 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2021-07-26 10:22 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Mel Gorman

When scanning for a single CPU, the scan is limited based on the estimated
average idle time for a domain to reduce the risk that more time is spent
scanning for idle CPUs than we are idle for.

With SMT, if an idle core is expected to exist there is no scan depth
limits so the scan depth may or may not be related to average idle time.
Unfortunately has_idle_cores can be very inaccurate when workloads are
rapidly entering/exiting idle (e.g. hackbench).

As the scan depth is now proportional to cores and not CPUs, enforce
SIS_PROP for idle core scans.

The performance impact of this is variable and is neither a universal
gain nor loss. In some cases, has_idle_cores will be cleared prematurely
because the whole domain was not scanned but has_idle_cores is already
known to be an inaccurate heuristic. There is also additional cost because
time calculations are made even for an idle core scan and the delta is
calculated for both scan successes and failures. Finally, SMT siblings
may be used prematurely due to scan depth limitations.

On the flip side, scan depth is now consistent for both core and smt
scans. The reduction in scan depth improves performance in some cases
and wakeup latency is reduced in some cases.

There were few changes identified in the SIS statistics but notably,
"SIS Core Hit" was slightly reduced in tbench as thread counts increased,
presumably due to the core search depth being throttled.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 20b9255ebf97..b180205e6b25 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6232,7 +6232,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
-	if (sched_feat(SIS_PROP) && !has_idle_core) {
+	if (sched_feat(SIS_PROP)) {
 		u64 avg_cost, avg_idle, span_avg;
 		unsigned long now = jiffies;
 
@@ -6265,30 +6265,35 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned int)i < nr_cpumask_bits)
-				return i;
+				break;
 
+			nr -= sched_smt_weight;
 		} else {
-			if (!--nr)
-				return -1;
 			idle_cpu = __select_idle_cpu(cpu, p);
 			if ((unsigned int)idle_cpu < nr_cpumask_bits)
 				break;
+			nr--;
 		}
+
+		if (nr < 0)
+			break;
 	}
 
-	if (has_idle_core)
-		set_idle_cores(target, false);
+	if ((unsigned int)idle_cpu < nr_cpumask_bits) {
+		if (has_idle_core)
+			set_idle_cores(target, false);
 
-	if (sched_feat(SIS_PROP) && !has_idle_core) {
-		time = cpu_clock(this) - time;
+		if (sched_feat(SIS_PROP)) {
+			time = cpu_clock(this) - time;
 
-		/*
-		 * Account for the scan cost of wakeups against the average
-		 * idle time.
-		 */
-		this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
+			/*
+			 * Account for the scan cost of wakeups against the average
+			 * idle time.
+			 */
+			this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
 
-		update_avg(&this_sd->avg_scan_cost, time);
+			update_avg(&this_sd->avg_scan_cost, time);
+		}
 	}
 
 	return idle_cpu;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup
  2021-07-26 10:22 [RFC PATCH 0/9] Modify and/or delete SIS_PROP Mel Gorman
                   ` (6 preceding siblings ...)
  2021-07-26 10:22 ` [PATCH 7/9] sched/fair: Enforce proportional scan limits when scanning for an idle core Mel Gorman
@ 2021-07-26 10:22 ` Mel Gorman
  2021-08-02 10:41   ` Song Bao Hua (Barry Song)
  2021-07-26 10:22 ` [PATCH 9/9] sched/core: Delete SIS_PROP and rely on the idle cpu mask Mel Gorman
  8 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2021-07-26 10:22 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Mel Gorman

From: Aubrey Li <aubrey.li@linux.intel.com>

Add idle cpumask to track idle cpus in sched domain. Every time
a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
target. And if the CPU is not in idle, the CPU is cleared in idle
cpumask during scheduler tick to ratelimit idle cpumask update.

When a task wakes up to select an idle cpu, scanning idle cpumask
has lower cost than scanning all the cpus in last level cache domain,
especially when the system is heavily loaded.

v10-v11:
- Forward port to 5.13-rc5 based kernel

v9->v10:
- Update scan cost only when the idle cpumask is scanned, i.e, the
  idle cpumask is not empty

v8->v9:
- rebase on top of tip/sched/core, no code change

v7->v8:
- refine update_idle_cpumask, no functionality change
- fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y

v6->v7:
- place the whole idle cpumask mechanism under CONFIG_SMP

v5->v6:
- decouple idle cpumask update from stop_tick signal, set idle CPU
  in idle cpumask every time the CPU enters idle

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available
- move clearing idle cpumask to scheduler_tick to decouple nohz mode

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior

[mgorman@techsingularity.net: RCU protection in update_idle_cpumask]
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Qais Yousef <qais.yousef@arm.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jiang Biao <benbjiang@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched/topology.h | 13 ++++++++++
 kernel/sched/core.c            |  2 ++
 kernel/sched/fair.c            | 46 +++++++++++++++++++++++++++++++++-
 kernel/sched/idle.c            |  5 ++++
 kernel/sched/sched.h           |  4 +++
 kernel/sched/topology.c        |  3 ++-
 6 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..905e382ece1a 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -74,8 +74,21 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	/*
+	 * Span of all idle CPUs in this domain.
+	 *
+	 * NOTE: this field is variable length. (Allocated dynamically
+	 * by attaching extra space to the end of the structure,
+	 * depending on how many CPUs the kernel has booted up with)
+	 */
+	unsigned long	idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+	return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
 	/* These fields must be setup */
 	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7c073b67f1ea..2751614ce0cb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4965,6 +4965,7 @@ void scheduler_tick(void)
 
 #ifdef CONFIG_SMP
 	rq->idle_balance = idle_cpu(cpu);
+	update_idle_cpumask(cpu, rq->idle_balance);
 	trigger_load_balance(rq);
 #endif
 }
@@ -9031,6 +9032,7 @@ void __init sched_init(void)
 		rq->wake_stamp = jiffies;
 		rq->wake_avg_idle = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+		rq->last_idle_state = 1;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b180205e6b25..fe87af2ccc80 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6230,7 +6230,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	if (!this_sd)
 		return -1;
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	/*
+	 * sched_domain_shared is set only at shared cache level,
+	 * this works only because select_idle_cpu is called with
+	 * sd_llc.
+	 */
+	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
 	if (sched_feat(SIS_PROP)) {
 		u64 avg_cost, avg_idle, span_avg;
@@ -7018,6 +7023,45 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	return newidle_balance(rq, rf) != 0;
 }
+
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ *
+ * This function is called with interrupts disabled.
+ */
+void update_idle_cpumask(int cpu, bool idle)
+{
+	struct sched_domain *sd;
+	struct rq *rq = cpu_rq(cpu);
+	int idle_state;
+
+	/*
+	 * Also set SCHED_IDLE cpu in idle cpumask to
+	 * allow SCHED_IDLE cpu as a wakeup target.
+	 */
+	idle_state = idle || sched_idle_cpu(cpu);
+	/*
+	 * No need to update idle cpumask if the state
+	 * does not change.
+	 */
+	if (rq->last_idle_state == idle_state)
+		return;
+
+	rcu_read_lock();
+	sd = per_cpu(sd_llc, cpu);
+	if (unlikely(!sd))
+		goto unlock;
+
+	if (idle_state)
+		cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
+	else
+		cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
+
+	rq->last_idle_state = idle_state;
+unlock:
+	rcu_read_unlock();
+}
 #endif /* CONFIG_SMP */
 
 static unsigned long wakeup_gran(struct sched_entity *se)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 912b47aa99d8..86bfe81cc280 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -289,6 +289,11 @@ static void do_idle(void)
 			cpuhp_report_idle_dead();
 			arch_cpu_idle_dead();
 		}
+		/*
+		 * The CPU is about to go idle, set it in idle cpumask
+		 * to be a wake up target.
+		 */
+		update_idle_cpumask(cpu, true);
 
 		arch_cpu_idle_enter();
 		rcu_nocb_flush_deferred_wakeup();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4d47a0969710..2d6456fa15cb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -998,6 +998,7 @@ struct rq {
 
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
+	unsigned char		last_idle_state;
 
 	unsigned long		misfit_task_load;
 
@@ -1846,6 +1847,8 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
 
 extern int group_balance_cpu(struct sched_group *sg);
 
+void update_idle_cpumask(int cpu, bool idle);
+
 #ifdef CONFIG_SCHED_DEBUG
 void update_sched_domain_debugfs(void);
 void dirty_sched_domain_sysctl(int cpu);
@@ -1864,6 +1867,7 @@ extern void flush_smp_call_function_from_idle(void);
 
 #else /* !CONFIG_SMP: */
 static inline void flush_smp_call_function_from_idle(void) { }
+static inline void update_idle_cpumask(int cpu, bool idle) { }
 #endif
 
 #include "stats.h"
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..2075bc417b90 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1611,6 +1611,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+		cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
 	}
 
 	sd->private = sdd;
@@ -1970,7 +1971,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sds)
 				return -ENOMEM;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 9/9] sched/core: Delete SIS_PROP and rely on the idle cpu mask
  2021-07-26 10:22 [RFC PATCH 0/9] Modify and/or delete SIS_PROP Mel Gorman
                   ` (7 preceding siblings ...)
  2021-07-26 10:22 ` [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup Mel Gorman
@ 2021-07-26 10:22 ` Mel Gorman
  8 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2021-07-26 10:22 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Mel Gorman

Now that there is an idle CPU mask that is approximately up to date, the
proportional scan depth can be removed and the scan depth is limited by
the estimated number of idle CPUs instead.

The plus side of this patch is that the time accounting overhead is gone.
The downside is that in some circumstances, this will scan more than
proportional scanning depending on whether an idle core is being scanned
or the accuracy of the idle CPU mask.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/core.c     | 22 ++++----------
 kernel/sched/fair.c     | 65 ++---------------------------------------
 kernel/sched/features.h |  5 ----
 kernel/sched/sched.h    |  5 ----
 4 files changed, 8 insertions(+), 89 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2751614ce0cb..9fcf9d1ae21c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3333,9 +3333,6 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
 		if (rq->avg_idle > max)
 			rq->avg_idle = max;
 
-		rq->wake_stamp = jiffies;
-		rq->wake_avg_idle = rq->avg_idle / 2;
-
 		rq->idle_stamp = 0;
 	}
 #endif
@@ -8648,18 +8645,11 @@ int sched_cpu_activate(unsigned int cpu)
 	balance_push_set(cpu, false);
 
 #ifdef CONFIG_SCHED_SMT
-	do {
-		int weight = cpumask_weight(cpu_smt_mask(cpu));
-
-		if (weight > sched_smt_weight)
-			sched_smt_weight = weight;
-
-		/*
-		 * When going up, increment the number of cores with SMT present.
-		 */
-		if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
-			static_branch_inc_cpuslocked(&sched_smt_present);
-	} while (0);
+	/*
+	 * When going up, increment the number of cores with SMT present.
+	 */
+	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+		static_branch_inc_cpuslocked(&sched_smt_present);
 #endif
 	set_cpu_active(cpu, true);
 
@@ -9029,8 +9019,6 @@ void __init sched_init(void)
 		rq->online = 0;
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
-		rq->wake_stamp = jiffies;
-		rq->wake_avg_idle = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 		rq->last_idle_state = 1;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe87af2ccc80..70b6d840426a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6076,8 +6076,6 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
-int __read_mostly sched_smt_weight = 1;
-
 static inline void set_idle_cores(int cpu, int val)
 {
 	struct sched_domain_shared *sds;
@@ -6196,8 +6194,6 @@ static inline bool test_idle_cores(int cpu, bool def)
 	return def;
 }
 
-#define sched_smt_weight 1
-
 static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
 	return __select_idle_cpu(core, p);
@@ -6210,8 +6206,6 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
-#define sis_min_cores	2
-
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6220,12 +6214,8 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int i, cpu, idle_cpu = -1, nr = INT_MAX;
-	struct rq *this_rq = this_rq();
-	int this = smp_processor_id();
+	int i, cpu, idle_cpu = -1;
 	struct sched_domain *this_sd;
-	u64 time = 0;
-
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
 		return -1;
@@ -6237,69 +6227,20 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	 */
 	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
-	if (sched_feat(SIS_PROP)) {
-		u64 avg_cost, avg_idle, span_avg;
-		unsigned long now = jiffies;
-
-		/*
-		 * If we're busy, the assumption that the last idle period
-		 * predicts the future is flawed; age away the remaining
-		 * predicted idle time.
-		 */
-		if (unlikely(this_rq->wake_stamp < now)) {
-			while (this_rq->wake_stamp < now && this_rq->wake_avg_idle) {
-				this_rq->wake_stamp++;
-				this_rq->wake_avg_idle >>= 1;
-			}
-		}
-
-		avg_idle = this_rq->wake_avg_idle;
-		avg_cost = this_sd->avg_scan_cost + 1;
-
-		span_avg = sd->span_weight * avg_idle;
-		if (span_avg > sis_min_cores * avg_cost)
-			nr = div_u64(span_avg, avg_cost);
-		else
-			nr = sis_min_cores;
-
-		nr *= sched_smt_weight;
-		time = cpu_clock(this);
-	}
-
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned int)i < nr_cpumask_bits)
 				break;
-
-			nr -= sched_smt_weight;
 		} else {
 			idle_cpu = __select_idle_cpu(cpu, p);
 			if ((unsigned int)idle_cpu < nr_cpumask_bits)
 				break;
-			nr--;
 		}
-
-		if (nr < 0)
-			break;
 	}
 
-	if ((unsigned int)idle_cpu < nr_cpumask_bits) {
-		if (has_idle_core)
-			set_idle_cores(target, false);
-
-		if (sched_feat(SIS_PROP)) {
-			time = cpu_clock(this) - time;
-
-			/*
-			 * Account for the scan cost of wakeups against the average
-			 * idle time.
-			 */
-			this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
-
-			update_avg(&this_sd->avg_scan_cost, time);
-		}
-	}
+	if ((unsigned int)idle_cpu < nr_cpumask_bits && has_idle_core)
+		set_idle_cores(target, false);
 
 	return idle_cpu;
 }
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7f8dace0964c..4bb29c830b9d 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -52,11 +52,6 @@ SCHED_FEAT(NONTASK_CAPACITY, true)
  */
 SCHED_FEAT(TTWU_QUEUE, true)
 
-/*
- * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
- */
-SCHED_FEAT(SIS_PROP, true)
-
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2d6456fa15cb..35a0b591a2de 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1024,9 +1024,6 @@ struct rq {
 	u64			idle_stamp;
 	u64			avg_idle;
 
-	unsigned long		wake_stamp;
-	u64			wake_avg_idle;
-
 	/* This is used to determine avg_idle's max value */
 	u64			max_idle_balance_cost;
 
@@ -1352,8 +1349,6 @@ do {						\
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
 
-extern int sched_smt_weight;
-
 static inline void update_idle_core(struct rq *rq)
 {
 	if (static_branch_unlikely(&sched_smt_present))
-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup
  2021-07-26 10:22 ` [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup Mel Gorman
@ 2021-08-02 10:41   ` Song Bao Hua (Barry Song)
  2021-08-04 10:26     ` Mel Gorman
  0 siblings, 1 reply; 19+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-08-02 10:41 UTC (permalink / raw)
  To: Mel Gorman, LKML
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Valentin Schneider,
	Aubrey Li, yangyicong



> -----Original Message-----
> From: Mel Gorman [mailto:mgorman@techsingularity.net]
> Sent: Monday, July 26, 2021 10:23 PM
> To: LKML <linux-kernel@vger.kernel.org>
> Cc: Ingo Molnar <mingo@kernel.org>; Peter Zijlstra <peterz@infradead.org>;
> Vincent Guittot <vincent.guittot@linaro.org>; Valentin Schneider
> <valentin.schneider@arm.com>; Aubrey Li <aubrey.li@linux.intel.com>; Mel
> Gorman <mgorman@techsingularity.net>
> Subject: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task
> wakeup
> 
> From: Aubrey Li <aubrey.li@linux.intel.com>
> 
> Add idle cpumask to track idle cpus in sched domain. Every time
> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
> target. And if the CPU is not in idle, the CPU is cleared in idle
> cpumask during scheduler tick to ratelimit idle cpumask update.
> 
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has lower cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
> 
> v10-v11:
> - Forward port to 5.13-rc5 based kernel
> 
> v9->v10:
> - Update scan cost only when the idle cpumask is scanned, i.e, the
>   idle cpumask is not empty
> 
> v8->v9:
> - rebase on top of tip/sched/core, no code change
> 
> v7->v8:
> - refine update_idle_cpumask, no functionality change
> - fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y
> 
> v6->v7:
> - place the whole idle cpumask mechanism under CONFIG_SMP
> 
> v5->v6:
> - decouple idle cpumask update from stop_tick signal, set idle CPU
>   in idle cpumask every time the CPU enters idle
> 
> v4->v5:
> - add update_idle_cpumask for s2idle case
> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>   idle_cpumask() everywhere
> 
> v3->v4:
> - change setting idle cpumask from every idle entry to tickless idle
>   if cpu driver is available
> - move clearing idle cpumask to scheduler_tick to decouple nohz mode
> 
> v2->v3:
> - change setting idle cpumask to every idle entry, otherwise schbench
>   has a regression of 99th percentile latency
> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>   idle cpumask is ratelimited in the idle exiting path
> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target
> 
> v1->v2:
> - idle cpumask is updated in the nohz routines, by initializing idle
>   cpumask with sched_domain_span(sd), nohz=off case remains the original
>   behavior
> 
> [mgorman@techsingularity.net: RCU protection in update_idle_cpumask]
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Qais Yousef <qais.yousef@arm.com>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jiang Biao <benbjiang@gmail.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/sched/topology.h | 13 ++++++++++
>  kernel/sched/core.c            |  2 ++
>  kernel/sched/fair.c            | 46 +++++++++++++++++++++++++++++++++-
>  kernel/sched/idle.c            |  5 ++++
>  kernel/sched/sched.h           |  4 +++
>  kernel/sched/topology.c        |  3 ++-
>  6 files changed, 71 insertions(+), 2 deletions(-)
> 

Hi Mel, Aubrey,
A similar thing Yicong and me has discussed is having a mask or a count for
idle cores. And then we can only scan idle cores in this mask in
select_idle_cpu().

A obvious problem is that has_idle_cores is a bool, it can seriously lag
from the real status. I mean, after system enters the status without idle
cores, has_idle_cores could be still true.

Right now, we are setting has_idle_cores to true while cpu enters idle
and its smt sibling is also idle. But we are setting has_idle_cores to
false only after we scan all cores in a llc.

So we have thought for a while to provide an idle core mask. But never
really made a workable patch.

Mel's patch7/9 limits the number of cores which will be scanned in
select_idle_cpu(), it might somehow alleviate the problem we redundantly
scan all cores while we actually have no idle core even has_idle_cores
is true.

However, if we can get idle core mask, it could be also good to
select_idle_core()? Maybe after that, we don't have to enforce
proportional scan limits while scanning for an idle core?

> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8f0f778b7c91..905e382ece1a 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -74,8 +74,21 @@ struct sched_domain_shared {
>  	atomic_t	ref;
>  	atomic_t	nr_busy_cpus;
>  	int		has_idle_cores;
> +	/*
> +	 * Span of all idle CPUs in this domain.
> +	 *
> +	 * NOTE: this field is variable length. (Allocated dynamically
> +	 * by attaching extra space to the end of the structure,
> +	 * depending on how many CPUs the kernel has booted up with)
> +	 */
> +	unsigned long	idle_cpus_span[];
>  };
> 
> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> +{
> +	return to_cpumask(sds->idle_cpus_span);
> +}
> +
>  struct sched_domain {
>  	/* These fields must be setup */
>  	struct sched_domain __rcu *parent;	/* top domain must be null terminated
> */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7c073b67f1ea..2751614ce0cb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4965,6 +4965,7 @@ void scheduler_tick(void)
> 
>  #ifdef CONFIG_SMP
>  	rq->idle_balance = idle_cpu(cpu);
> +	update_idle_cpumask(cpu, rq->idle_balance);
>  	trigger_load_balance(rq);
>  #endif
>  }
> @@ -9031,6 +9032,7 @@ void __init sched_init(void)
>  		rq->wake_stamp = jiffies;
>  		rq->wake_avg_idle = rq->avg_idle;
>  		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
> +		rq->last_idle_state = 1;
> 
>  		INIT_LIST_HEAD(&rq->cfs_tasks);
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b180205e6b25..fe87af2ccc80 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6230,7 +6230,12 @@ static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, bool
>  	if (!this_sd)
>  		return -1;
> 
> -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +	/*
> +	 * sched_domain_shared is set only at shared cache level,
> +	 * this works only because select_idle_cpu is called with
> +	 * sd_llc.
> +	 */
> +	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> 
>  	if (sched_feat(SIS_PROP)) {
>  		u64 avg_cost, avg_idle, span_avg;
> @@ -7018,6 +7023,45 @@ balance_fair(struct rq *rq, struct task_struct *prev,
> struct rq_flags *rf)
> 
>  	return newidle_balance(rq, rf) != 0;
>  }
> +
> +/*
> + * Update cpu idle state and record this information
> + * in sd_llc_shared->idle_cpus_span.
> + *
> + * This function is called with interrupts disabled.
> + */
> +void update_idle_cpumask(int cpu, bool idle)
> +{
> +	struct sched_domain *sd;
> +	struct rq *rq = cpu_rq(cpu);
> +	int idle_state;
> +
> +	/*
> +	 * Also set SCHED_IDLE cpu in idle cpumask to
> +	 * allow SCHED_IDLE cpu as a wakeup target.
> +	 */
> +	idle_state = idle || sched_idle_cpu(cpu);
> +	/*
> +	 * No need to update idle cpumask if the state
> +	 * does not change.
> +	 */
> +	if (rq->last_idle_state == idle_state)
> +		return;
> +
> +	rcu_read_lock();
> +	sd = per_cpu(sd_llc, cpu);
> +	if (unlikely(!sd))
> +		goto unlock;
> +
> +	if (idle_state)
> +		cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> +	else
> +		cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> +
> +	rq->last_idle_state = idle_state;
> +unlock:
> +	rcu_read_unlock();
> +}
>  #endif /* CONFIG_SMP */
> 
>  static unsigned long wakeup_gran(struct sched_entity *se)
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 912b47aa99d8..86bfe81cc280 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -289,6 +289,11 @@ static void do_idle(void)
>  			cpuhp_report_idle_dead();
>  			arch_cpu_idle_dead();
>  		}
> +		/*
> +		 * The CPU is about to go idle, set it in idle cpumask
> +		 * to be a wake up target.
> +		 */
> +		update_idle_cpumask(cpu, true);
> 
>  		arch_cpu_idle_enter();
>  		rcu_nocb_flush_deferred_wakeup();
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4d47a0969710..2d6456fa15cb 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -998,6 +998,7 @@ struct rq {
> 
>  	unsigned char		nohz_idle_balance;
>  	unsigned char		idle_balance;
> +	unsigned char		last_idle_state;
> 
>  	unsigned long		misfit_task_load;
> 
> @@ -1846,6 +1847,8 @@ static inline unsigned int group_first_cpu(struct
> sched_group *group)
> 
>  extern int group_balance_cpu(struct sched_group *sg);
> 
> +void update_idle_cpumask(int cpu, bool idle);
> +
>  #ifdef CONFIG_SCHED_DEBUG
>  void update_sched_domain_debugfs(void);
>  void dirty_sched_domain_sysctl(int cpu);
> @@ -1864,6 +1867,7 @@ extern void flush_smp_call_function_from_idle(void);
> 
>  #else /* !CONFIG_SMP: */
>  static inline void flush_smp_call_function_from_idle(void) { }
> +static inline void update_idle_cpumask(int cpu, bool idle) { }
>  #endif
> 
>  #include "stats.h"
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..2075bc417b90 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1611,6 +1611,7 @@ sd_init(struct sched_domain_topology_level *tl,
>  		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
>  		atomic_inc(&sd->shared->ref);
>  		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> +		cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
>  	}
> 
>  	sd->private = sdd;
> @@ -1970,7 +1971,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
> 
>  			*per_cpu_ptr(sdd->sd, j) = sd;
> 
> -			sds = kzalloc_node(sizeof(struct sched_domain_shared),
> +			sds = kzalloc_node(sizeof(struct sched_domain_shared) +
> cpumask_size(),
>  					GFP_KERNEL, cpu_to_node(j));
>  			if (!sds)
>  				return -ENOMEM;
> --
> 2.26.2


Thanks
Barry


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 7/9] sched/fair: Enforce proportional scan limits when scanning for an idle core
  2021-07-26 10:22 ` [PATCH 7/9] sched/fair: Enforce proportional scan limits when scanning for an idle core Mel Gorman
@ 2021-08-02 10:52   ` Song Bao Hua (Barry Song)
  2021-08-04 10:22     ` Mel Gorman
  0 siblings, 1 reply; 19+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-08-02 10:52 UTC (permalink / raw)
  To: Mel Gorman, LKML
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Valentin Schneider,
	Aubrey Li, yangyicong



> -----Original Message-----
> From: Mel Gorman [mailto:mgorman@techsingularity.net]
> Sent: Monday, July 26, 2021 10:23 PM
> To: LKML <linux-kernel@vger.kernel.org>
> Cc: Ingo Molnar <mingo@kernel.org>; Peter Zijlstra <peterz@infradead.org>;
> Vincent Guittot <vincent.guittot@linaro.org>; Valentin Schneider
> <valentin.schneider@arm.com>; Aubrey Li <aubrey.li@linux.intel.com>; Mel
> Gorman <mgorman@techsingularity.net>
> Subject: [PATCH 7/9] sched/fair: Enforce proportional scan limits when scanning
> for an idle core
> 
> When scanning for a single CPU, the scan is limited based on the estimated
> average idle time for a domain to reduce the risk that more time is spent
> scanning for idle CPUs than we are idle for.
> 
> With SMT, if an idle core is expected to exist there is no scan depth
> limits so the scan depth may or may not be related to average idle time.
> Unfortunately has_idle_cores can be very inaccurate when workloads are
> rapidly entering/exiting idle (e.g. hackbench).
> 
> As the scan depth is now proportional to cores and not CPUs, enforce
> SIS_PROP for idle core scans.
> 
> The performance impact of this is variable and is neither a universal
> gain nor loss. In some cases, has_idle_cores will be cleared prematurely
> because the whole domain was not scanned but has_idle_cores is already
> known to be an inaccurate heuristic. There is also additional cost because
> time calculations are made even for an idle core scan and the delta is
> calculated for both scan successes and failures. Finally, SMT siblings
> may be used prematurely due to scan depth limitations.
> 
> On the flip side, scan depth is now consistent for both core and smt
> scans. The reduction in scan depth improves performance in some cases
> and wakeup latency is reduced in some cases.
> 
> There were few changes identified in the SIS statistics but notably,
> "SIS Core Hit" was slightly reduced in tbench as thread counts increased,
> presumably due to the core search depth being throttled.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 33 +++++++++++++++++++--------------
>  1 file changed, 19 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 20b9255ebf97..b180205e6b25 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6232,7 +6232,7 @@ static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, bool
> 
>  	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> 
> -	if (sched_feat(SIS_PROP) && !has_idle_core) {
> +	if (sched_feat(SIS_PROP)) {
>  		u64 avg_cost, avg_idle, span_avg;
>  		unsigned long now = jiffies;
> 
> @@ -6265,30 +6265,35 @@ static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, bool
>  		if (has_idle_core) {
>  			i = select_idle_core(p, cpu, cpus, &idle_cpu);
>  			if ((unsigned int)i < nr_cpumask_bits)
> -				return i;
> +				break;
> 
> +			nr -= sched_smt_weight;
>  		} else {
> -			if (!--nr)
> -				return -1;
>  			idle_cpu = __select_idle_cpu(cpu, p);
>  			if ((unsigned int)idle_cpu < nr_cpumask_bits)
>  				break;
> +			nr--;
>  		}
> +
> +		if (nr < 0)
> +			break;
>  	}
> 
> -	if (has_idle_core)
> -		set_idle_cores(target, false);
> +	if ((unsigned int)idle_cpu < nr_cpumask_bits) {
> +		if (has_idle_core)
> +			set_idle_cores(target, false);
> 

For example, if we have 16 cpus(8 SMT2 cores). In case core7 is idle,
we only have scanned core0+core1(cpu0-cpu3) and if these two cores
are not idle, but here we set has_idle_cores to false while core7 is
idle. It seems incorrect.

> -	if (sched_feat(SIS_PROP) && !has_idle_core) {
> -		time = cpu_clock(this) - time;
> +		if (sched_feat(SIS_PROP)) {
> +			time = cpu_clock(this) - time;
> 
> -		/*
> -		 * Account for the scan cost of wakeups against the average
> -		 * idle time.
> -		 */
> -		this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
> +			/*
> +			 * Account for the scan cost of wakeups against the average
> +			 * idle time.
> +			 */
> +			this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
> 
> -		update_avg(&this_sd->avg_scan_cost, time);
> +			update_avg(&this_sd->avg_scan_cost, time);
> +		}
>  	}
> 
>  	return idle_cpu;
> --
> 2.26.2


Thanks
Barry


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/9] sched/fair: Enforce proportional scan limits when scanning for an idle core
  2021-08-02 10:52   ` Song Bao Hua (Barry Song)
@ 2021-08-04 10:22     ` Mel Gorman
  0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2021-08-04 10:22 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song)
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Valentin Schneider, Aubrey Li, yangyicong

On Mon, Aug 02, 2021 at 10:52:01AM +0000, Song Bao Hua (Barry Song) wrote:
> > @@ -6265,30 +6265,35 @@ static int select_idle_cpu(struct task_struct *p, struct
> > sched_domain *sd, bool
> >  		if (has_idle_core) {
> >  			i = select_idle_core(p, cpu, cpus, &idle_cpu);
> >  			if ((unsigned int)i < nr_cpumask_bits)
> > -				return i;
> > +				break;
> > 
> > +			nr -= sched_smt_weight;
> >  		} else {
> > -			if (!--nr)
> > -				return -1;
> >  			idle_cpu = __select_idle_cpu(cpu, p);
> >  			if ((unsigned int)idle_cpu < nr_cpumask_bits)
> >  				break;
> > +			nr--;
> >  		}
> > +
> > +		if (nr < 0)
> > +			break;
> >  	}
> > 
> > -	if (has_idle_core)
> > -		set_idle_cores(target, false);
> > +	if ((unsigned int)idle_cpu < nr_cpumask_bits) {
> > +		if (has_idle_core)
> > +			set_idle_cores(target, false);
> > 
> 
> For example, if we have 16 cpus(8 SMT2 cores). In case core7 is idle,
> we only have scanned core0+core1(cpu0-cpu3) and if these two cores
> are not idle, but here we set has_idle_cores to false while core7 is
> idle. It seems incorrect.
> 

Yep, that block needs to be revisited.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup
  2021-08-02 10:41   ` Song Bao Hua (Barry Song)
@ 2021-08-04 10:26     ` Mel Gorman
  2021-08-05  0:23       ` Aubrey Li
  0 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2021-08-04 10:26 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song)
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Valentin Schneider, Aubrey Li, yangyicong

On Mon, Aug 02, 2021 at 10:41:13AM +0000, Song Bao Hua (Barry Song) wrote:
> Hi Mel, Aubrey,
> A similar thing Yicong and me has discussed is having a mask or a count for
> idle cores. And then we can only scan idle cores in this mask in
> select_idle_cpu().
> 

Either approach would require a lot of updates.

> A obvious problem is that has_idle_cores is a bool, it can seriously lag
> from the real status. I mean, after system enters the status without idle
> cores, has_idle_cores could be still true.
> 

True.

> Right now, we are setting has_idle_cores to true while cpu enters idle
> and its smt sibling is also idle. But we are setting has_idle_cores to
> false only after we scan all cores in a llc.
> 
> So we have thought for a while to provide an idle core mask. But never
> really made a workable patch.
> 
> Mel's patch7/9 limits the number of cores which will be scanned in
> select_idle_cpu(), it might somehow alleviate the problem we redundantly
> scan all cores while we actually have no idle core even has_idle_cores
> is true.
> 

I prototyped a patch that tracked the location of a known idle core and
use it as a starting hint for a search. It's cleared if the core is
selected for use. Unfortunately, it was not a universal win so was
dropped for the moment but either way, improving the accurate of
has_idle_cores without being too expensive would be niuce.

> However, if we can get idle core mask, it could be also good to
> select_idle_core()? Maybe after that, we don't have to enforce
> proportional scan limits while scanning for an idle core?
> 

To remove the limits entirely, it would have to be very accurate and
it's hard to see how that can be done without having a heavily updated
shared cache line.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup
  2021-08-04 10:26     ` Mel Gorman
@ 2021-08-05  0:23       ` Aubrey Li
  2021-09-17  3:44         ` Barry Song
  2021-09-17  4:15         ` Barry Song
  0 siblings, 2 replies; 19+ messages in thread
From: Aubrey Li @ 2021-08-05  0:23 UTC (permalink / raw)
  To: Mel Gorman, Song Bao Hua (Barry Song)
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Valentin Schneider, yangyicong

On 8/4/21 6:26 PM, Mel Gorman wrote:
> On Mon, Aug 02, 2021 at 10:41:13AM +0000, Song Bao Hua (Barry Song) wrote:
>> Hi Mel, Aubrey,
>> A similar thing Yicong and me has discussed is having a mask or a count for
>> idle cores. And then we can only scan idle cores in this mask in
>> select_idle_cpu().
>>
> 
> Either approach would require a lot of updates.
> 
>> A obvious problem is that has_idle_cores is a bool, it can seriously lag
>> from the real status. I mean, after system enters the status without idle
>> cores, has_idle_cores could be still true.
>>
> 
> True.
> 
>> Right now, we are setting has_idle_cores to true while cpu enters idle
>> and its smt sibling is also idle. But we are setting has_idle_cores to
>> false only after we scan all cores in a llc.
>>
>> So we have thought for a while to provide an idle core mask. But never
>> really made a workable patch.
>>
>> Mel's patch7/9 limits the number of cores which will be scanned in
>> select_idle_cpu(), it might somehow alleviate the problem we redundantly
>> scan all cores while we actually have no idle core even has_idle_cores
>> is true.
>>
> 
> I prototyped a patch that tracked the location of a known idle core and
> use it as a starting hint for a search. It's cleared if the core is
> selected for use. Unfortunately, it was not a universal win so was
> dropped for the moment but either way, improving the accurate of
> has_idle_cores without being too expensive would be niuce.

The idle core information is already in idle cpu mask, do we have a quick
and cheap way to extract out?


> 
>> However, if we can get idle core mask, it could be also good to
>> select_idle_core()? Maybe after that, we don't have to enforce
>> proportional scan limits while scanning for an idle core?
>>
> 
> To remove the limits entirely, it would have to be very accurate and
> it's hard to see how that can be done without having a heavily updated
> shared cache line.
> 

Bad idea, is it helpful to make a sparse cpu mask for this? :p

Thanks,
-Aubrey


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup
  2021-08-05  0:23       ` Aubrey Li
@ 2021-09-17  3:44         ` Barry Song
  2021-09-17  4:15         ` Barry Song
  1 sibling, 0 replies; 19+ messages in thread
From: Barry Song @ 2021-09-17  3:44 UTC (permalink / raw)
  To: aubrey.li
  Cc: linux-kernel, mgorman, mingo, peterz, song.bao.hua,
	valentin.schneider, vincent.guittot, yangyicong

[Aubrey wrote]
> The idle core information is already in idle cpu mask, do we have a quick
> and cheap way to extract out?

seems hard.

since we need an intermediate cpumask memory to save the result of cpumask
bitmap operations, the only way i can get the idle core and make sure "cpus"
isn't broken is:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 361927803e31..abd844bcfb86 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6187,8 +6187,14 @@ void __update_idle_core(struct rq *rq)
  */
 static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
-       bool idle = true;
-       int cpu;
+       bool idle = true, possible_idle;
+       int cpu, wb, wa;
+
+       wb = cpumask_weight(cpus);
+       cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+       wa = cpumask_weight(cpus);
+
+       possible_idle = (wa - wb == cpumask_weight(cpu_smt_mask(core)));
 
        if (!static_branch_likely(&sched_smt_present))
                return __select_idle_cpu(core, p);
@@ -6212,7 +6218,6 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
        if (idle)
                return core;
 
-       cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
        return -1;
 }

but the questions are:
1. is it really cheap?
2. how to use possible_idle? right now, select_idle_core() is also selecting idle cpu. we shouldn't simply return when possible_idle is false.

Thanks
Barry

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup
  2021-08-05  0:23       ` Aubrey Li
  2021-09-17  3:44         ` Barry Song
@ 2021-09-17  4:15         ` Barry Song
  2021-09-17  9:11           ` Aubrey Li
  1 sibling, 1 reply; 19+ messages in thread
From: Barry Song @ 2021-09-17  4:15 UTC (permalink / raw)
  To: aubrey.li
  Cc: linux-kernel, mgorman, mingo, peterz, song.bao.hua,
	valentin.schneider, vincent.guittot, yangyicong

> @@ -4965,6 +4965,7 @@ void scheduler_tick(void)
>
>  #ifdef CONFIG_SMP
> 	rq->idle_balance = idle_cpu(cpu);
> +	update_idle_cpumask(cpu, rq->idle_balance);
>  	trigger_load_balance(rq);
>  #endif
> }

might be stupid, a question bothering yicong and me is that why don't we
choose to update_idle_cpumask() while idle task exits and switches to a
normal task?
for example, before tick comes, cpu has exited from idle, but we are only
able to update it in tick. this makes idle_cpus_span inaccurate, thus we
will scan cpu which isn't actually idle in select_idle_sibling.
is it because of the huge update overhead?

Thanks
barry

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup
  2021-09-17  4:15         ` Barry Song
@ 2021-09-17  9:11           ` Aubrey Li
  2021-09-17 13:35             ` Mel Gorman
  0 siblings, 1 reply; 19+ messages in thread
From: Aubrey Li @ 2021-09-17  9:11 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-kernel, mgorman, mingo, peterz, song.bao.hua,
	valentin.schneider, vincent.guittot, yangyicong

On 9/17/21 12:15 PM, Barry Song wrote:
>> @@ -4965,6 +4965,7 @@ void scheduler_tick(void)
>>
>>  #ifdef CONFIG_SMP
>> 	rq->idle_balance = idle_cpu(cpu);
>> +	update_idle_cpumask(cpu, rq->idle_balance);
>>  	trigger_load_balance(rq);
>>  #endif
>> }
> 
> might be stupid, a question bothering yicong and me is that why don't we
> choose to update_idle_cpumask() while idle task exits and switches to a
> normal task?

I implemented that way and we discussed before(RFC v1 ?), updating a cpumask
at every enter/exit idle is more expensive than we expected, though it's
per LLC domain, Vincent saw a significant regression IIRC. You can also
take a look at nohz.idle_cpus_mask as a reference.

> for example, before tick comes, cpu has exited from idle, but we are only
> able to update it in tick. this makes idle_cpus_span inaccurate, thus we
> will scan cpu which isn't actually idle in select_idle_sibling.
> is it because of the huge update overhead?
> 

Yes, we'll have false positive but we don't miss true positive. So things
won't be worse than the current way.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup
  2021-09-17  9:11           ` Aubrey Li
@ 2021-09-17 13:35             ` Mel Gorman
  0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2021-09-17 13:35 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Barry Song, linux-kernel, mingo, peterz, song.bao.hua,
	valentin.schneider, vincent.guittot, yangyicong

On Fri, Sep 17, 2021 at 05:11:11PM +0800, Aubrey Li wrote:
> On 9/17/21 12:15 PM, Barry Song wrote:
> >> @@ -4965,6 +4965,7 @@ void scheduler_tick(void)
> >>
> >>  #ifdef CONFIG_SMP
> >> 	rq->idle_balance = idle_cpu(cpu);
> >> +	update_idle_cpumask(cpu, rq->idle_balance);
> >>  	trigger_load_balance(rq);
> >>  #endif
> >> }
> > 
> > might be stupid, a question bothering yicong and me is that why don't we
> > choose to update_idle_cpumask() while idle task exits and switches to a
> > normal task?
> 
> I implemented that way and we discussed before(RFC v1 ?), updating a cpumask
> at every enter/exit idle is more expensive than we expected, though it's
> per LLC domain, Vincent saw a significant regression IIRC. You can also
> take a look at nohz.idle_cpus_mask as a reference.
> 

It's possible to track it differently and I prototyped it some time
back. The results were mixed at the time. It helped some workloads
and was marginal on others. It appeared to help hackbench but I found
that hackbench is much more vulnerable to the wakeup_granularity and
overscheduling. For hackbench, it makes more sense to target that directly
before revisiting the alt-idlecore to see what it really helps. I'm waiting
on test results on various ways wakeup_gran can be scaled depending on
rq activity.

For alternative idle core tracking, the current 5.15-rc1 rebase
prototype looks like this

https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/commit/?h=sched-altidlecore-v2r8&id=b2af1a88245f6cbeb28343e89f3183a77b29d52d

Test results still pending and as usual the queue is busy. I swear, my
primary bottleneck for doing anything is benchmark and validation :(

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-09-17 13:36 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-26 10:22 [RFC PATCH 0/9] Modify and/or delete SIS_PROP Mel Gorman
2021-07-26 10:22 ` [PATCH 1/9] sched/fair: Track efficiency of select_idle_sibling Mel Gorman
2021-07-26 10:22 ` [PATCH 2/9] sched/fair: Track efficiency of task recent_used_cpu Mel Gorman
2021-07-26 10:22 ` [PATCH 3/9] sched/fair: Track efficiency of select_idle_core Mel Gorman
2021-07-26 10:22 ` [PATCH 4/9] sched/fair: Use prev instead of new target as recent_used_cpu Mel Gorman
2021-07-26 10:22 ` [PATCH 5/9] sched/fair: Avoid a second scan of target in select_idle_cpu Mel Gorman
2021-07-26 10:22 ` [PATCH 6/9] sched/fair: Make select_idle_cpu() proportional to cores Mel Gorman
2021-07-26 10:22 ` [PATCH 7/9] sched/fair: Enforce proportional scan limits when scanning for an idle core Mel Gorman
2021-08-02 10:52   ` Song Bao Hua (Barry Song)
2021-08-04 10:22     ` Mel Gorman
2021-07-26 10:22 ` [PATCH 8/9] sched/fair: select idle cpu from idle cpumask for task wakeup Mel Gorman
2021-08-02 10:41   ` Song Bao Hua (Barry Song)
2021-08-04 10:26     ` Mel Gorman
2021-08-05  0:23       ` Aubrey Li
2021-09-17  3:44         ` Barry Song
2021-09-17  4:15         ` Barry Song
2021-09-17  9:11           ` Aubrey Li
2021-09-17 13:35             ` Mel Gorman
2021-07-26 10:22 ` [PATCH 9/9] sched/core: Delete SIS_PROP and rely on the idle cpu mask Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).