* [PATCH v3 0/8] sched/fair: wake_affine improvements
@ 2021-05-13  7:40 Srikar Dronamraju
  2021-05-13  7:40 ` [PATCH v3 1/8] sched/fair: Update affine statistics when needed Srikar Dronamraju
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Srikar Dronamraju @ 2021-05-13  7:40 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Michael Ellerman, Gautham R Shenoy, Parth Shah,
	Aubrey Li

Changelog v2->v3:
v2: http://lore.kernel.org/lkml/20210506164543.90688-1-srikar@linux.vnet.ibm.com/t/#u
 - Rebased to tip/sched/core
		(Valentin)
 - Update schedstat if target is current CPU
				(Valentin)
 - Search for idle-cores in LLC only if idle-core is -1

Changelog v1->v2:
v1 Link: http://lore.kernel.org/lkml/20210422102326.35889-1-srikar@linux.vnet.ibm.com/t/#u
 - Fallback LLC domain has been split out as a subsequent patchset.
   					(suggested by Mel)
 - Fix a panic due to two wakeups racing for the same idle-core
					(Reported by Mel)
 - Differentiate if a LLC surely has no idle-cores(-2) vs a LLC may or
   may not have idle-cores(-1).
 - Rebased to v5.12

Recently we found that some of the benchmark numbers on Power10 were lower
than expected. Analysis showed that the problem lies in the fact that the
L2 cache on Power10 is at the core level, i.e. only 4 threads share the L2 cache.

One probable solution to the problem was worked on by Gautham, who posted
a patch that marks the MC domain as the LLC:
http://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/t/#u

Here the focus is on improving the current core scheduler's wakeup
mechanism by looking at the idle-core and nr_busy_cpus information that is
already maintained per last-level cache (aka LLC).

Hence this approach can work well with the mc-llc change too, and it can
help other architectures as well.
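
For reference, a rough sketch of the per-LLC shared state this series
builds on (field names follow the diffs below; the idle_core field
replaces the current has_idle_cores flag in patch 2):

struct sched_domain_shared {
	atomic_t	ref;
	atomic_t	nr_busy_cpus;	/* CPUs in this LLC currently running work */
	int		idle_core;	/* a CPU of an idle core, or -1/-2 (see patch 3) */
};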

Please review and provide your feedback.

Benchmark numbers are from Power10, but I have verified that we don't
regress on a Power9 setup.

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  8
Core(s) per socket:  10
Socket(s):           1
NUMA node(s):        1
Model:               1.0 (pvr 0080 0100)
Model name:          POWER10 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           64K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8K
NUMA node2 CPU(s):   0-79

Hackbench: (latency, lower is better)

v5.12-rc5
instances = 1, min = 24.102529 usecs/op, median =  usecs/op, max = 24.102529 usecs/op
instances = 2, min = 24.096112 usecs/op, median = 24.096112 usecs/op, max = 24.178903 usecs/op
instances = 4, min = 24.080541 usecs/op, median = 24.082990 usecs/op, max = 24.166873 usecs/op
instances = 8, min = 24.088969 usecs/op, median = 24.116081 usecs/op, max = 24.199853 usecs/op
instances = 16, min = 24.267228 usecs/op, median = 26.204510 usecs/op, max = 29.218360 usecs/op
instances = 32, min = 30.680071 usecs/op, median = 32.911664 usecs/op, max = 37.380470 usecs/op
instances = 64, min = 43.908331 usecs/op, median = 44.454343 usecs/op, max = 46.210298 usecs/op
instances = 80, min = 44.585754 usecs/op, median = 56.738546 usecs/op, max = 60.625826 usecs/op

v5.12-rc5 + mc-llc+
instances = 1, min = 18.676505 usecs/op, median =  usecs/op, max = 18.676505 usecs/op
instances = 2, min = 18.488627 usecs/op, median = 18.488627 usecs/op, max = 18.574946 usecs/op
instances = 4, min = 18.428399 usecs/op, median = 18.589051 usecs/op, max = 18.872548 usecs/op
instances = 8, min = 18.597389 usecs/op, median = 18.783815 usecs/op, max = 19.265532 usecs/op
instances = 16, min = 21.922350 usecs/op, median = 22.737792 usecs/op, max = 24.832429 usecs/op
instances = 32, min = 29.770446 usecs/op, median = 31.996687 usecs/op, max = 34.053042 usecs/op
instances = 64, min = 53.067842 usecs/op, median = 53.295139 usecs/op, max = 53.473059 usecs/op
instances = 80, min = 44.423288 usecs/op, median = 44.713767 usecs/op, max = 45.159761 usecs/op

v5.12-rc5 + this patchset
instances = 1, min = 19.240824 usecs/op, median =  usecs/op, max = 19.240824 usecs/op
instances = 2, min = 19.143470 usecs/op, median = 19.143470 usecs/op, max = 19.249875 usecs/op
instances = 4, min = 19.399812 usecs/op, median = 19.487433 usecs/op, max = 19.501298 usecs/op
instances = 8, min = 19.024297 usecs/op, median = 19.908682 usecs/op, max = 20.741605 usecs/op
instances = 16, min = 22.209444 usecs/op, median = 23.971275 usecs/op, max = 25.145198 usecs/op
instances = 32, min = 31.220392 usecs/op, median = 32.689189 usecs/op, max = 34.081588 usecs/op
instances = 64, min = 39.012110 usecs/op, median = 44.062042 usecs/op, max = 45.370525 usecs/op
instances = 80, min = 43.884358 usecs/op, median = 44.326417 usecs/op, max = 48.031303 usecs/op

Summary:
Both mc-llc and this patchset perform much better than vanilla v5.12-rc5.

DayTrader (throughput, higher is better)
		     v5.12-rc5   v5.12-rc5     v5.12-rc5
                                 + mc-llc      + patchset
64CPUs/1JVM/ 60Users  6373.7      7520.5        7375.6
64CPUs/1JVM/ 80Users  6742.1      7940.9        7832.9
64CPUs/1JVM/100Users  6482.2      7730.3        7538.4
64CPUs/2JVM/ 60Users  6335        8081.6        8000.2
64CPUs/2JVM/ 80Users  6360.8      8259.6        8315.4
64CPUs/2JVM/100Users  6215.6      8046.5        8049.4
64CPUs/4JVM/ 60Users  5385.4      7685.3        8013.5
64CPUs/4JVM/ 80Users  5380.8      7753.3        7868
64CPUs/4JVM/100Users  5275.2      7549.2        7620

Summary: Across all profiles, mc-llc and this patchset outperform
vanilla v5.12-rc5.
Note: Only 64 CPUs were online during this test.

schbench (latency, lower is better)
======== Running schbench -m 3 -r 30 =================
Latency percentiles (usec) runtime 10 (s) (2545 total samples)
v5.12-rc5                  |  v5.12-rc5 + mc-llc                 | v5.12-rc5 + patchset

50.0th: 56 (1301 samples)  |     50.0th: 49 (1309 samples)       | 50.0th: 53 (1285 samples)
75.0th: 76 (623 samples)   |     75.0th: 66 (628 samples)        | 75.0th: 72 (635 samples)
90.0th: 93 (371 samples)   |     90.0th: 78 (371 samples)        | 90.0th: 88 (388 samples)
95.0th: 107 (123 samples)  |     95.0th: 87 (117 samples)        | 95.0th: 94 (118 samples)
*99.0th: 12560 (102 samples)    *99.0th: 100 (97 samples)        | *99.0th: 108 (108 samples)
99.5th: 15312 (14 samples) |     99.5th: 104 (12 samples)        | 99.5th: 108 (0 samples)
99.9th: 19936 (9 samples)  |     99.9th: 106 (8 samples)         | 99.9th: 110 (8 samples)
min=13, max=20684          |     min=15, max=113                 | min=15, max=1433

Latency percentiles (usec) runtime 20 (s) (7649 total samples)

50.0th: 51 (3884 samples)  |     50.0th: 50 (3935 samples)       | 50.0th: 51 (3843 samples)
75.0th: 69 (1859 samples)  |     75.0th: 66 (1817 samples)       | 75.0th: 69 (1962 samples)
90.0th: 87 (1173 samples)  |     90.0th: 80 (1204 samples)       | 90.0th: 84 (1103 samples)
95.0th: 97 (368 samples)   |     95.0th: 87 (342 samples)        | 95.0th: 93 (386 samples)
*99.0th: 8624 (290 samples)|     *99.0th: 98 (294 samples)       | *99.0th: 107 (297 samples)
99.5th: 11344 (37 samples) |     99.5th: 102 (37 samples)        | 99.5th: 110 (39 samples)
99.9th: 18592 (31 samples) |     99.9th: 106 (30 samples)        | 99.9th: 1714 (27 samples)
min=13, max=20684          |     min=12, max=113                 | min=15, max=4456

Latency percentiles (usec) runtime 30 (s) (12785 total samples)

50.0th: 50 (6614 samples)  |     50.0th: 49 (6544 samples)       | 50.0th: 50 (6443 samples)
75.0th: 67 (3059 samples)  |     75.0th: 65 (3100 samples)       | 75.0th: 67 (3263 samples)
90.0th: 84 (1894 samples)  |     90.0th: 79 (1912 samples)       | 90.0th: 82 (1890 samples)
95.0th: 94 (586 samples)   |     95.0th: 87 (646 samples)        | 95.0th: 92 (652 samples)
*99.0th: 8304 (507 samples)|     *99.0th: 101 (496 samples)      | *99.0th: 107 (464 samples)
99.5th: 11696 (62 samples) |     99.5th: 104 (45 samples)        | 99.5th: 110 (61 samples)
99.9th: 18592 (51 samples) |     99.9th: 110 (51 samples)        | 99.9th: 1434 (47 samples)
min=12, max=21421          |     min=1, max=126                  | min=15, max=4456

Summary:
mc-llc is the best option, but this patchset also helps compared to vanilla v5.12-rc5

mongodb (threads=6) (throughput, higher is better)
					 Throughput         read        clean      update
					                    latency     latency    latency
v5.12-rc5            JVM=YCSB_CLIENTS=14  68116.05 ops/sec   1109.82 us  944.19 us  1342.29 us
v5.12-rc5            JVM=YCSB_CLIENTS=21  64802.69 ops/sec   1772.64 us  944.69 us  2099.57 us
v5.12-rc5            JVM=YCSB_CLIENTS=28  61792.78 ops/sec   2490.48 us  930.09 us  2928.03 us
v5.12-rc5            JVM=YCSB_CLIENTS=35  59604.44 ops/sec   3236.86 us  870.28 us  3787.48 us

v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=14  70948.51 ops/sec   1060.21 us  842.02 us  1289.44 us
v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=21  68732.48 ops/sec   1669.91 us  871.57 us  1975.19 us
v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=28  66674.81 ops/sec   2313.79 us  889.59 us  2702.36 us
v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=35  64397.51 ops/sec   3010.66 us  966.28 us  3484.19 us

v5.12-rc5 + patchset JVM=YCSB_CLIENTS=14  67604.51 ops/sec   1117.91 us  947.07 us  1353.41 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=21  63979.39 ops/sec   1793.63 us  869.72 us  2130.22 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=28  62032.34 ops/sec   2475.89 us  869.06 us  2922.01 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=35  60152.96 ops/sec   3203.84 us  972.00 us  3756.52 us

Summary:
mc-llc outperforms; this patchset and upstream give almost similar performance.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>

Srikar Dronamraju (8):
  sched/fair: Update affine statistics when needed
  sched/fair: Maintain the identity of idle-core
  sched/fair: Update idle-core more often
  sched/fair: Prefer idle CPU to cache affinity
  sched/fair: Use affine_idler_llc for wakeups across LLC
  sched/idle: Move busy_cpu accounting to idle callback
  sched/fair: Remove ifdefs in waker_affine_idler_llc
  sched/fair: Dont iterate if no idle CPUs

 include/linux/sched/topology.h |   2 +-
 kernel/sched/fair.c            | 220 ++++++++++++++++++++++++++-------
 kernel/sched/features.h        |   1 +
 kernel/sched/idle.c            |  33 ++++-
 kernel/sched/sched.h           |   6 +
 kernel/sched/topology.c        |   9 ++
 6 files changed, 222 insertions(+), 49 deletions(-)


base-commit: 2ea46c6fc9452ac100ad907b051d797225847e33
-- 
2.18.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 1/8] sched/fair: Update affine statistics when needed
  2021-05-13  7:40 [PATCH v3 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
@ 2021-05-13  7:40 ` Srikar Dronamraju
  2021-05-13  7:40 ` [PATCH v3 2/8] sched/fair: Maintain the identity of idle-core Srikar Dronamraju
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Srikar Dronamraju @ 2021-05-13  7:40 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah, Aubrey Li

wake_affine_idle() can return prev_cpu. Even in such a scenario, the
scheduler was going ahead and updating the wake-affine schedstats, i.e.
even if the task was not moved to the current CPU, the schedstats would
still account it as an affine wakeup.

Hence add a check before updating the schedstats.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v2-v3:
 - Update schedstat if target is current CPU
				(Valentin)

 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d75af1ecfb4..7920f2a4d257 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5908,8 +5908,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (target == nr_cpumask_bits)
 		return prev_cpu;
 
-	schedstat_inc(sd->ttwu_move_affine);
-	schedstat_inc(p->se.statistics.nr_wakeups_affine);
+	if (target == this_cpu) {
+		schedstat_inc(sd->ttwu_move_affine);
+		schedstat_inc(p->se.statistics.nr_wakeups_affine);
+	}
 	return target;
 }
 
-- 
2.18.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 2/8] sched/fair: Maintain the identity of idle-core
  2021-05-13  7:40 [PATCH v3 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
  2021-05-13  7:40 ` [PATCH v3 1/8] sched/fair: Update affine statistics when needed Srikar Dronamraju
@ 2021-05-13  7:40 ` Srikar Dronamraju
  2021-05-21 12:36   ` Vincent Guittot
  2021-05-13  7:40 ` [PATCH v3 3/8] sched/fair: Update idle-core more often Srikar Dronamraju
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Srikar Dronamraju @ 2021-05-13  7:40 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah, Aubrey Li

The scheduler maintains per-LLC info which tells if there is any idle core
in that LLC. However, this information doesn't say which core is idle.

So when iterating for idle-cores, if select_idle_core() finds an
idle-core, it doesn't try to reset this information.

Hence if there was only one idle core in the LLC and select_idle_core()
selected that idle-core, the LLC will still claim that it has an
idle-core.

Conversely, if a task is pinned with a restricted cpus_allowed list and
the LLC has multiple idle-cores, but select_idle_core() cannot find a
usable idle-core, the LLC will no longer record that it has an idle-core.

As a first step to solve this problem, let the LLC maintain the identity
of the idle core instead of just the fact that the LLC has an idle core.

This change alone solves both the problems listed above. However, other
problems exist with the current infrastructure; those will continue to
exist with this change and are handled in subsequent patches.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v2->v3:
 -  Rebase to tip/sched/core
		(Valentin)

 include/linux/sched/topology.h |  2 +-
 kernel/sched/fair.c            | 52 ++++++++++++++++++----------------
 kernel/sched/sched.h           |  3 ++
 kernel/sched/topology.c        |  7 +++++
 4 files changed, 39 insertions(+), 25 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..285165a35f21 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -73,7 +73,7 @@ struct sched_group;
 struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
-	int		has_idle_cores;
+	int		idle_core;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7920f2a4d257..c42b2b3cd08f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1578,11 +1578,11 @@ numa_type numa_classify(unsigned int imbalance_pct,
 
 #ifdef CONFIG_SCHED_SMT
 /* Forward declarations of select_idle_sibling helpers */
-static inline bool test_idle_cores(int cpu, bool def);
+static inline int get_idle_core(int cpu, int def);
 static inline int numa_idle_core(int idle_core, int cpu)
 {
 	if (!static_branch_likely(&sched_smt_present) ||
-	    idle_core >= 0 || !test_idle_cores(cpu, false))
+	    idle_core >= 0 || get_idle_core(cpu, -1) < 0)
 		return idle_core;
 
 	/*
@@ -6039,29 +6039,31 @@ static inline int __select_idle_cpu(int cpu)
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
-static inline void set_idle_cores(int cpu, int val)
+static inline void set_idle_core(int cpu, int val)
 {
 	struct sched_domain_shared *sds;
 
 	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
 	if (sds)
-		WRITE_ONCE(sds->has_idle_cores, val);
+		WRITE_ONCE(sds->idle_core, val);
 }
 
-static inline bool test_idle_cores(int cpu, bool def)
+static inline int get_idle_core(int cpu, int def)
 {
 	struct sched_domain_shared *sds;
 
-	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-	if (sds)
-		return READ_ONCE(sds->has_idle_cores);
+	if (static_branch_likely(&sched_smt_present)) {
+		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+		if (sds)
+			return READ_ONCE(sds->idle_core);
+	}
 
 	return def;
 }
 
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
- * information in sd_llc_shared->has_idle_cores.
+ * information in sd_llc_shared->idle_core.
  *
  * Since SMT siblings share all cache levels, inspecting this limited remote
  * state should be fairly cheap.
@@ -6072,7 +6074,7 @@ void __update_idle_core(struct rq *rq)
 	int cpu;
 
 	rcu_read_lock();
-	if (test_idle_cores(core, true))
+	if (get_idle_core(core, 0) >= 0)
 		goto unlock;
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -6083,7 +6085,7 @@ void __update_idle_core(struct rq *rq)
 			goto unlock;
 	}
 
-	set_idle_cores(core, 1);
+	set_idle_core(core, per_cpu(smt_id, core));
 unlock:
 	rcu_read_unlock();
 }
@@ -6091,7 +6093,7 @@ void __update_idle_core(struct rq *rq)
 /*
  * Scan the entire LLC domain for idle cores; this dynamically switches off if
  * there are no idle cores left in the system; tracked through
- * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
+ * sd_llc->shared->idle_core and enabled through update_idle_core() above.
  */
 static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
@@ -6144,11 +6146,11 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 
 #else /* CONFIG_SCHED_SMT */
 
-static inline void set_idle_cores(int cpu, int val)
+static inline void set_idle_core(int cpu, int val)
 {
 }
 
-static inline bool test_idle_cores(int cpu, bool def)
+static inline bool get_idle_core(int cpu, int def)
 {
 	return def;
 }
@@ -6170,10 +6172,11 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
  * average idle time for this rq (as found in rq->avg_idle).
  */
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	bool has_idle_core = (idle_core != -1);
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
 	u64 time;
@@ -6206,8 +6209,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
-			if ((unsigned int)i < nr_cpumask_bits)
+			if ((unsigned int)i < nr_cpumask_bits) {
+#ifdef CONFIG_SCHED_SMT
+				if ((per_cpu(smt_id, i)) == idle_core)
+					set_idle_core(i, -1);
+#endif
 				return i;
+			}
 
 		} else {
 			if (!--nr)
@@ -6218,9 +6226,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
-	if (has_idle_core)
-		set_idle_cores(this, false);
-
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		time = cpu_clock(this) - time;
 		update_avg(&this_sd->avg_scan_cost, time);
@@ -6276,10 +6281,9 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
  */
 static int select_idle_sibling(struct task_struct *p, int prev, int target)
 {
-	bool has_idle_core = false;
+	int i, recent_used_cpu, idle_core = -1;
 	struct sched_domain *sd;
 	unsigned long task_util;
-	int i, recent_used_cpu;
 
 	/*
 	 * On asymmetric system, update task utilization because we will check
@@ -6357,16 +6361,16 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		return target;
 
 	if (sched_smt_active()) {
-		has_idle_core = test_idle_cores(target, false);
+		idle_core = get_idle_core(target, -1);
 
-		if (!has_idle_core && cpus_share_cache(prev, target)) {
+		if (idle_core < 0 && cpus_share_cache(prev, target)) {
 			i = select_idle_smt(p, sd, prev);
 			if ((unsigned int)i < nr_cpumask_bits)
 				return i;
 		}
 	}
 
-	i = select_idle_cpu(p, sd, has_idle_core, target);
+	i = select_idle_cpu(p, sd, idle_core, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a189bec13729..22fbb50b036e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1491,6 +1491,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+#ifdef CONFIG_SCHED_SMT
+DECLARE_PER_CPU(int, smt_id);
+#endif
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 55a0a243e871..232fb261dfc2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -644,6 +644,9 @@ static void destroy_sched_domains(struct sched_domain *sd)
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+#ifdef CONFIG_SCHED_SMT
+DEFINE_PER_CPU(int, smt_id);
+#endif
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -667,6 +670,9 @@ static void update_top_cache_domain(int cpu)
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
+#ifdef CONFIG_SCHED_SMT
+	per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
+#endif
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_NUMA);
@@ -1497,6 +1503,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+		sd->shared->idle_core = -1;
 	}
 
 	sd->private = sdd;
-- 
2.18.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 3/8] sched/fair: Update idle-core more often
  2021-05-13  7:40 [PATCH v3 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
  2021-05-13  7:40 ` [PATCH v3 1/8] sched/fair: Update affine statistics when needed Srikar Dronamraju
  2021-05-13  7:40 ` [PATCH v3 2/8] sched/fair: Maintain the identity of idle-core Srikar Dronamraju
@ 2021-05-13  7:40 ` Srikar Dronamraju
  2021-05-13  7:40 ` [PATCH v3 4/8] sched/fair: Prefer idle CPU to cache affinity Srikar Dronamraju
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Srikar Dronamraju @ 2021-05-13  7:40 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah, Aubrey Li

Currently, when the scheduler does a load balance and pulls a task, or
when a CPU picks up a task during wakeup without having to call
select_idle_cpu(), it never checks if the target CPU is part of the
idle-core. This makes the idle-core information less accurate.

Given that the identity of the idle-core of an LLC is now maintained, it
is easy to update the idle-core as soon as the CPU picks up a task.

This change updates the idle-core whenever a CPU from the idle-core picks
up a task. If the waking CPU happens to be part of the designated
idle-core, the idle-core is set to -1 even if there are other idle-cores
in the LLC. In cases where the scheduler is sure that there are no more
idle-cores, the idle-core is set to -2.

To reduce the -1 case, whenever a CPU updates the idle-core, it will look
at the other cores in the LLC for an idle-core if the core to which it
belongs is not idle.
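
As a minimal illustration (not part of the patch), the resulting encoding
can be summarised as below; select_idle_cpu() in the diff below performs
the same test:

/*
 * idle_core:  >= 0  a CPU belonging to an idle core in this LLC
 *               -1  the tracked idle core recently turned busy;
 *                   other idle cores may still exist
 *               -2  the LLC is known to have no idle cores
 */
static inline bool llc_may_have_idle_core(int idle_core)
{
	return idle_core != -2;
}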

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v2->v3:
 - Search for idle-cores in LLC only if idle-core is -1

 kernel/sched/fair.c  | 61 +++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/idle.c  |  6 +++++
 kernel/sched/sched.h |  2 ++
 3 files changed, 63 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c42b2b3cd08f..d002bc95c0bc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6039,6 +6039,13 @@ static inline int __select_idle_cpu(int cpu)
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
+/*
+ * Value of -2 indicates there are no idle-cores in LLC.
+ * Value of -1 indicates an idle-core turned to busy recently.
+ * However there could be other idle-cores in the system.
+ * Anyother value indicates core to which the CPU(value)
+ * belongs is idle.
+ */
 static inline void set_idle_core(int cpu, int val)
 {
 	struct sched_domain_shared *sds;
@@ -6061,6 +6068,44 @@ static inline int get_idle_core(int cpu, int def)
 	return def;
 }
 
+static void set_next_idle_core(int target)
+{
+	struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, target));
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	int core, cpu;
+
+	if (!sd)
+		return;
+
+	cpumask_andnot(cpus, sched_domain_span(sd), cpu_smt_mask(target));
+	for_each_cpu_wrap(core, cpus, target) {
+		bool idle = true;
+
+		for_each_cpu(cpu, cpu_smt_mask(core)) {
+			if (!available_idle_cpu(cpu)) {
+				idle = false;
+				break;
+			}
+		}
+
+		if (idle) {
+			set_idle_core(core, per_cpu(smt_id, core));
+			return;
+		}
+
+		cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+	}
+	set_idle_core(target, -2);
+}
+
+void set_core_busy(int core)
+{
+	rcu_read_lock();
+	if (get_idle_core(core, -1) == per_cpu(smt_id, core))
+		set_idle_core(core, -1);
+	rcu_read_unlock();
+}
+
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
  * information in sd_llc_shared->idle_core.
@@ -6071,21 +6116,25 @@ static inline int get_idle_core(int cpu, int def)
 void __update_idle_core(struct rq *rq)
 {
 	int core = cpu_of(rq);
-	int cpu;
+	int cpu, idlecore;
 
 	rcu_read_lock();
-	if (get_idle_core(core, 0) >= 0)
+	idlecore = get_idle_core(core, 0);
+	if (idlecore >= 0)
 		goto unlock;
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
-		if (cpu == core)
+		if (cpu == core || available_idle_cpu(cpu))
 			continue;
 
-		if (!available_idle_cpu(cpu))
-			goto unlock;
+		if (idlecore == -1)
+			set_next_idle_core(core);
+
+		goto unlock;
 	}
 
 	set_idle_core(core, per_cpu(smt_id, core));
+
 unlock:
 	rcu_read_unlock();
 }
@@ -6176,7 +6225,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int i
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
-	bool has_idle_core = (idle_core != -1);
+	bool has_idle_core = (idle_core != -2);
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
 	u64 time;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 7ca3d3d86c2a..a9f5a8ace59e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -431,6 +431,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
+#ifdef CONFIG_SCHED_SMT
+	int cpu = rq->cpu;
+
+	if (static_branch_likely(&sched_smt_present))
+		set_core_busy(cpu);
+#endif
 }
 
 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 22fbb50b036e..98c3cfbc5d26 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1115,6 +1115,7 @@ static inline bool is_migration_disabled(struct task_struct *p)
 
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
+extern void set_core_busy(int cpu);
 
 static inline void update_idle_core(struct rq *rq)
 {
@@ -1124,6 +1125,7 @@ static inline void update_idle_core(struct rq *rq)
 
 #else
 static inline void update_idle_core(struct rq *rq) { }
+static inline void set_core_busy(int cpu) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
-- 
2.18.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 4/8] sched/fair: Prefer idle CPU to cache affinity
  2021-05-13  7:40 [PATCH v3 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (2 preceding siblings ...)
  2021-05-13  7:40 ` [PATCH v3 3/8] sched/fair: Update idle-core more often Srikar Dronamraju
@ 2021-05-13  7:40 ` Srikar Dronamraju
  2021-05-13  7:40 ` [PATCH v3 5/8] sched/fair: Use affine_idler_llc for wakeups across LLC Srikar Dronamraju
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Srikar Dronamraju @ 2021-05-13  7:40 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Michael Ellerman, Michael Neuling,
	Gautham R Shenoy, Parth Shah, Aubrey Li

Current order of preference to pick an LLC while waking a wake-affine
task:
1. Between the waker CPU and the previous CPU, prefer the LLC of the CPU
   that is idle.

2. Between the waker CPU and the previous CPU, prefer the LLC of the CPU
   that is less heavily loaded.

In the current situation, where the waker and previous CPUs are busy but
only one of their LLCs has an idle CPU, the scheduler may end up picking
an LLC with no idle CPUs. To mitigate this, add a method where the
scheduler compares the idle CPUs in the waker and previous LLCs and picks
the appropriate one.

The new method looks at the idle-core to figure out an idle LLC. If there
are no idle LLCs, it compares the ratio of busy CPUs to the total number
of CPUs in the LLC. This method is only useful for comparing 2 LLCs; if
the previous CPU and the waking CPU are in the same LLC, it is not used.
For now the new method is disabled by default.

The sync flag decides which CPU/LLC to try first: if sync is set, choose
the current LLC, else choose the previous LLC.
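
As a worked example of the busy-ratio comparison (the numbers are
hypothetical; the expression matches the one in wake_affine_idler_llc()
below):

	/* pref LLC: 6 of 8 CPUs busy (75%); try LLC: 2 of 4 CPUs busy (50%) */
	int pnr_busy = 6, pllc_size = 8;
	int tnr_busy = 2, tllc_size = 4;

	/* cross-multiplied busy-ratio comparison, avoiding a division */
	int diff = tnr_busy * pllc_size - pnr_busy * tllc_size;	/* 16 - 24 = -8 */

	/* diff < 0: the "try" LLC is proportionally less busy, so pick try_cpu */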

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog: v1->v2:
 - Swap the cpus, if the wakeup is not sync, so that a single order of
 code suffices for both sync and non-sync wakeups.

 - Mel reported a crash. Apparently two threads can race to find an
 idle-core. I now cache the idlecore. Also use compare-exchange, so
 that no 2 waking tasks contend on the same CPU.

Also Based on similar posting:
http://lore.kernel.org/lkml/20210226164029.122432-1-srikar@linux.vnet.ibm.com/t/#u
- Make WA_WAKER default (Suggested by Rik) : done in next patch
- Make WA_WAKER check more conservative: (Suggested by Rik / Peter)
- Rename WA_WAKER to WA_IDLER_LLC (Suggested by Vincent)
- s/pllc_size/tllc_size while checking for busy case: (Pointed by Dietmar)
- Add rcu_read_lock and check for validity of shared domains
- Add idle-core support

 kernel/sched/fair.c     | 66 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 2 files changed, 67 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d002bc95c0bc..d95a2c9c8797 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5893,6 +5893,59 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }
 
+static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val);
+
+static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cpu, int sync)
+{
+#ifdef CONFIG_NO_HZ_COMMON
+	int tnr_busy, tllc_size, pnr_busy, pllc_size;
+#endif
+	struct sched_domain_shared *pref_sds, *try_sds;
+	int diff, idle_core;
+
+	if (!sync)
+		swap(pref_cpu, try_cpu);
+
+	pref_sds = rcu_dereference(per_cpu(sd_llc_shared, pref_cpu));
+	try_sds = rcu_dereference(per_cpu(sd_llc_shared, try_cpu));
+	if (!pref_sds || !try_sds)
+		return nr_cpumask_bits;
+
+	if (available_idle_cpu(pref_cpu) || sched_idle_cpu(pref_cpu))
+		return pref_cpu;
+
+	idle_core = READ_ONCE(pref_sds->idle_core);
+	if (idle_core > -1 && cpumask_test_cpu(idle_core, p->cpus_ptr) &&
+				test_reset_idle_core(pref_sds, idle_core))
+		return idle_core;
+
+	if (available_idle_cpu(try_cpu) || sched_idle_cpu(try_cpu))
+		return try_cpu;
+
+	idle_core = READ_ONCE(try_sds->idle_core);
+	if (idle_core > -1 && cpumask_test_cpu(idle_core, p->cpus_ptr) &&
+				test_reset_idle_core(try_sds, idle_core))
+		return idle_core;
+
+#ifdef CONFIG_NO_HZ_COMMON
+	pnr_busy = atomic_read(&pref_sds->nr_busy_cpus);
+	tnr_busy = atomic_read(&try_sds->nr_busy_cpus);
+	pllc_size = per_cpu(sd_llc_size, pref_cpu);
+	tllc_size = per_cpu(sd_llc_size, try_cpu);
+
+	if (tnr_busy == tllc_size && pnr_busy == pllc_size)
+		return nr_cpumask_bits;
+
+	diff = tnr_busy * pllc_size - pnr_busy * tllc_size;
+	if (diff > 0)
+		return pref_cpu;
+	if (diff < 0)
+		return try_cpu;
+#endif /* CONFIG_NO_HZ_COMMON */
+
+	return nr_cpumask_bits;
+}
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int this_cpu, int prev_cpu, int sync)
 {
@@ -5901,6 +5954,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (sched_feat(WA_IDLE))
 		target = wake_affine_idle(this_cpu, prev_cpu, sync);
 
+	if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits)
+		target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);
+
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
@@ -6068,6 +6124,11 @@ static inline int get_idle_core(int cpu, int def)
 	return def;
 }
 
+static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val)
+{
+	return cmpxchg(&sds->idle_core, val, -1) == val;
+}
+
 static void set_next_idle_core(int target)
 {
 	struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, target));
@@ -6204,6 +6265,11 @@ static inline bool get_idle_core(int cpu, int def)
 	return def;
 }
 
+static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val)
+{
+	return false;
+}
+
 static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
 	return __select_idle_cpu(core);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7f8dace0964c..77e0b2c4e02c 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,6 +83,7 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
 SCHED_FEAT(WA_IDLE, true)
 SCHED_FEAT(WA_WEIGHT, true)
+SCHED_FEAT(WA_IDLER_LLC, false)
 SCHED_FEAT(WA_BIAS, true)
 
 /*
-- 
2.18.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 5/8] sched/fair: Use affine_idler_llc for wakeups across LLC
  2021-05-13  7:40 [PATCH v3 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (3 preceding siblings ...)
  2021-05-13  7:40 ` [PATCH v3 4/8] sched/fair: Prefer idle CPU to cache affinity Srikar Dronamraju
@ 2021-05-13  7:40 ` Srikar Dronamraju
  2021-05-13  7:40 ` [PATCH v3 6/8] sched/idle: Move busy_cpu accounting to idle callback Srikar Dronamraju
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Srikar Dronamraju @ 2021-05-13  7:40 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Michael Ellerman, Gautham R Shenoy, Parth Shah,
	Aubrey Li

Currently, if a waker is the only running thread on its CPU and it wakes
a wakee (both of them interacting over a pipe, aka sync wakeups) whose
previous CPU is in a different LLC, the wakee is pulled and queued on the
CPU that is running the waker.

This is true even if the previous CPU was completely idle. The rationale
is that the waker would soon relinquish the CPU and the wakee would
benefit from the cache access. However, if the previous CPU is idle, the
wakee thread will incur the latency cost of migration to the waker CPU
plus the latency cost of the waker context switching out.

Before the waker switches out, the load balancer could also kick in and
pull the wakee away, resulting in another migration cost. This mostly
hurts systems where LLCs have just one core, for example with Hackbench:
both hackbench threads end up running on the same core, resulting in CPUs
running in higher SMT modes, more load balances and non-optimal
performance.

It would be beneficial to run the wakee thread on the same CPU as the
waker only if other CPUs are busy. To achieve this, move the part of the
code that checks if the waker is the only running thread to the early
part of wake_affine_weight().

Also, after this change, wake_affine_idle() will only look at wakeups
within the LLC. For wakeups within an LLC, there should be no difference
between sync and non-sync wakeups. For wakeups across LLC boundaries, use
wake_affine_idler_llc().
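
In condensed form, the resulting wake_affine() dispatch looks roughly as
below (a sketch of the diff that follows, not additional code):

	bool share_caches = cpus_share_cache(prev_cpu, this_cpu);

	if (sched_feat(WA_IDLE) && share_caches)
		target = wake_affine_idle(this_cpu, prev_cpu);
	else if (sched_feat(WA_IDLER_LLC) && !share_caches)
		target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);

	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);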

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1->v2:
 - Update the patch subject line.

 kernel/sched/fair.c     | 22 +++++++++++-----------
 kernel/sched/features.h |  2 +-
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d95a2c9c8797..0dfe01de22d6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5823,8 +5823,7 @@ static int wake_wide(struct task_struct *p)
  *			  scheduling latency of the CPUs. This seems to work
  *			  for the overloaded case.
  */
-static int
-wake_affine_idle(int this_cpu, int prev_cpu, int sync)
+static int wake_affine_idle(int this_cpu, int prev_cpu)
 {
 	/*
 	 * If this_cpu is idle, it implies the wakeup is from interrupt
@@ -5838,15 +5837,12 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	 * a cpufreq perspective, it's better to have higher utilisation
 	 * on one CPU.
 	 */
-	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
+	if (available_idle_cpu(prev_cpu) || sched_idle_cpu(prev_cpu))
+		return prev_cpu;
 
-	if (sync && cpu_rq(this_cpu)->nr_running == 1)
+	if (available_idle_cpu(this_cpu) || sched_idle_cpu(this_cpu))
 		return this_cpu;
 
-	if (available_idle_cpu(prev_cpu))
-		return prev_cpu;
-
 	return nr_cpumask_bits;
 }
 
@@ -5862,6 +5858,9 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	if (sync) {
 		unsigned long current_load = task_h_load(current);
 
+		if (cpu_rq(this_cpu)->nr_running <= 1)
+			return this_cpu;
+
 		if (current_load > this_eff_load)
 			return this_cpu;
 
@@ -5949,12 +5948,13 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int this_cpu, int prev_cpu, int sync)
 {
+	bool share_caches = cpus_share_cache(prev_cpu, this_cpu);
 	int target = nr_cpumask_bits;
 
-	if (sched_feat(WA_IDLE))
-		target = wake_affine_idle(this_cpu, prev_cpu, sync);
+	if (sched_feat(WA_IDLE) && share_caches)
+		target = wake_affine_idle(this_cpu, prev_cpu);
 
-	if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits)
+	else if (sched_feat(WA_IDLER_LLC) && !share_caches)
 		target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);
 
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 77e0b2c4e02c..049e3f33f36a 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,7 +83,7 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
 SCHED_FEAT(WA_IDLE, true)
 SCHED_FEAT(WA_WEIGHT, true)
-SCHED_FEAT(WA_IDLER_LLC, false)
+SCHED_FEAT(WA_IDLER_LLC, true)
 SCHED_FEAT(WA_BIAS, true)
 
 /*
-- 
2.18.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-13  7:40 [PATCH v3 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (4 preceding siblings ...)
  2021-05-13  7:40 ` [PATCH v3 5/8] sched/fair: Use affine_idler_llc for wakeups across LLC Srikar Dronamraju
@ 2021-05-13  7:40 ` Srikar Dronamraju
  2021-05-21 12:37   ` Vincent Guittot
  2021-05-13  7:40 ` [PATCH v3 7/8] sched/fair: Remove ifdefs in waker_affine_idler_llc Srikar Dronamraju
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Srikar Dronamraju @ 2021-05-13  7:40 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah, Aubrey Li

Currently we account nr_busy_cpus in the nohz idle functions. There is no
reason why nr_busy_cpus should be updated only in NO_HZ_COMMON configs.
Also, the scheduler can mark a CPU as non-busy as soon as an idle-class
task starts to run, and can then mark a CPU as busy as soon as it is
woken up from idle or a new task is placed on its runqueue.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c     |  6 ++++--
 kernel/sched/idle.c     | 29 +++++++++++++++++++++++++++--
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c |  2 ++
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0dfe01de22d6..8f86359efdbd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10410,7 +10410,10 @@ static void set_cpu_sd_state_busy(int cpu)
 		goto unlock;
 	sd->nohz_idle = 0;
 
-	atomic_inc(&sd->shared->nr_busy_cpus);
+	if (sd && per_cpu(is_idle, cpu)) {
+		atomic_add_unless(&sd->shared->nr_busy_cpus, 1, per_cpu(sd_llc_size, cpu));
+		per_cpu(is_idle, cpu) = 0;
+	}
 unlock:
 	rcu_read_unlock();
 }
@@ -10440,7 +10443,6 @@ static void set_cpu_sd_state_idle(int cpu)
 		goto unlock;
 	sd->nohz_idle = 1;
 
-	atomic_dec(&sd->shared->nr_busy_cpus);
 unlock:
 	rcu_read_unlock();
 }
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index a9f5a8ace59e..c13105fe06b3 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -431,12 +431,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
-#ifdef CONFIG_SCHED_SMT
+#ifdef CONFIG_SMP
+	struct sched_domain_shared *sds;
 	int cpu = rq->cpu;
 
+#ifdef CONFIG_SCHED_SMT
 	if (static_branch_likely(&sched_smt_present))
 		set_core_busy(cpu);
 #endif
+
+	rcu_read_lock();
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds) {
+		if (per_cpu(is_idle, cpu)) {
+			atomic_inc(&sds->nr_busy_cpus);
+			per_cpu(is_idle, cpu) = 0;
+		}
+	}
+	rcu_read_unlock();
+#endif
 }
 
 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
@@ -448,9 +461,21 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
 struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	struct task_struct *next = rq->idle;
+#ifdef CONFIG_SMP
+	struct sched_domain_shared *sds;
+	int cpu = rq->cpu;
 
-	set_next_task_idle(rq, next, true);
+	rcu_read_lock();
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds) {
+		atomic_add_unless(&sds->nr_busy_cpus, -1, 0);
+		per_cpu(is_idle, cpu) = 1;
+	}
 
+	rcu_read_unlock();
+#endif
+
+	set_next_task_idle(rq, next, true);
 	return next;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 98c3cfbc5d26..b66c4dad5fd2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1496,6 +1496,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 #ifdef CONFIG_SCHED_SMT
 DECLARE_PER_CPU(int, smt_id);
 #endif
+DECLARE_PER_CPU(int, is_idle);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 232fb261dfc2..730252937712 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
 #ifdef CONFIG_SCHED_SMT
 DEFINE_PER_CPU(int, smt_id);
 #endif
+DEFINE_PER_CPU(int, is_idle);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
 #ifdef CONFIG_SCHED_SMT
 	per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
 #endif
+	per_cpu(is_idle, cpu) = 1;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_NUMA);
-- 
2.18.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 7/8] sched/fair: Remove ifdefs in waker_affine_idler_llc
  2021-05-13  7:40 [PATCH v3 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (5 preceding siblings ...)
  2021-05-13  7:40 ` [PATCH v3 6/8] sched/idle: Move busy_cpu accounting to idle callback Srikar Dronamraju
@ 2021-05-13  7:40 ` Srikar Dronamraju
  2021-05-13  7:40 ` [PATCH v3 8/8] sched/fair: Dont iterate if no idle CPUs Srikar Dronamraju
  2021-05-19  9:36 ` [PATCH v3 0/8] sched/fair: wake_affine improvements Mel Gorman
  8 siblings, 0 replies; 18+ messages in thread
From: Srikar Dronamraju @ 2021-05-13  7:40 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah, Aubrey Li

Now that the idle callbacks update nr_busy_cpus, remove the ifdefs in
wake_affine_idler_llc().

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f86359efdbd..1ca05176ad18 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5896,9 +5896,7 @@ static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val
 
 static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cpu, int sync)
 {
-#ifdef CONFIG_NO_HZ_COMMON
 	int tnr_busy, tllc_size, pnr_busy, pllc_size;
-#endif
 	struct sched_domain_shared *pref_sds, *try_sds;
 	int diff, idle_core;
 
@@ -5926,7 +5924,6 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
 				test_reset_idle_core(try_sds, idle_core))
 		return idle_core;
 
-#ifdef CONFIG_NO_HZ_COMMON
 	pnr_busy = atomic_read(&pref_sds->nr_busy_cpus);
 	tnr_busy = atomic_read(&try_sds->nr_busy_cpus);
 	pllc_size = per_cpu(sd_llc_size, pref_cpu);
@@ -5940,7 +5937,6 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
 		return pref_cpu;
 	if (diff < 0)
 		return try_cpu;
-#endif /* CONFIG_NO_HZ_COMMON */
 
 	return nr_cpumask_bits;
 }
-- 
2.18.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 8/8] sched/fair: Dont iterate if no idle CPUs
  2021-05-13  7:40 [PATCH v3 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (6 preceding siblings ...)
  2021-05-13  7:40 ` [PATCH v3 7/8] sched/fair: Remove ifdefs in waker_affine_idler_llc Srikar Dronamraju
@ 2021-05-13  7:40 ` Srikar Dronamraju
  2021-05-19  9:36 ` [PATCH v3 0/8] sched/fair: wake_affine improvements Mel Gorman
  8 siblings, 0 replies; 18+ messages in thread
From: Srikar Dronamraju @ 2021-05-13  7:40 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah, Aubrey Li

Now that the nr_busy_cpus count for an LLC is updated in the idle
callbacks, the scheduler can detect if all threads of an LLC are busy. In
such cases, it can avoid searching for idle CPUs in the LLC to run the
wakee thread.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ca05176ad18..4e6e2571537a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -730,7 +730,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
-static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu, bool idle);
 static unsigned long task_h_load(struct task_struct *p);
 static unsigned long capacity_of(int cpu);
 
@@ -5894,7 +5894,8 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 
 static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val);
 
-static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cpu, int sync)
+static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cpu,
+				int sync, bool *idle)
 {
 	int tnr_busy, tllc_size, pnr_busy, pllc_size;
 	struct sched_domain_shared *pref_sds, *try_sds;
@@ -5929,8 +5930,10 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
 	pllc_size = per_cpu(sd_llc_size, pref_cpu);
 	tllc_size = per_cpu(sd_llc_size, try_cpu);
 
-	if (tnr_busy == tllc_size && pnr_busy == pllc_size)
+	if (tnr_busy == tllc_size && pnr_busy == pllc_size) {
+		*idle = false;
 		return nr_cpumask_bits;
+	}
 
 	diff = tnr_busy * pllc_size - pnr_busy * tllc_size;
 	if (diff > 0)
@@ -5942,7 +5945,7 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
-		       int this_cpu, int prev_cpu, int sync)
+		       int this_cpu, int prev_cpu, int sync, bool *idle)
 {
 	bool share_caches = cpus_share_cache(prev_cpu, this_cpu);
 	int target = nr_cpumask_bits;
@@ -5951,7 +5954,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		target = wake_affine_idle(this_cpu, prev_cpu);
 
 	else if (sched_feat(WA_IDLER_LLC) && !share_caches)
-		target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);
+		target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync, idle);
 
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
@@ -6390,7 +6393,7 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
-static int select_idle_sibling(struct task_struct *p, int prev, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int target, bool idle)
 {
 	int i, recent_used_cpu, idle_core = -1;
 	struct sched_domain *sd;
@@ -6467,6 +6470,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		}
 	}
 
+	if (!idle)
+		return target;
+
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	if (!sd)
 		return target;
@@ -6901,6 +6907,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	int want_affine = 0;
 	/* SD_flags and WF_flags share the first nibble */
 	int sd_flag = wake_flags & 0xF;
+	bool idle = true;
 
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
@@ -6924,7 +6931,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
 		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
 			if (cpu != prev_cpu)
-				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
+				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync, &idle);
 
 			sd = NULL; /* Prefer wake_affine over balance flags */
 			break;
@@ -6941,7 +6948,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
-		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu, idle);
 
 		if (want_affine)
 			current->recent_used_cpu = cpu;
-- 
2.18.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 0/8] sched/fair: wake_affine improvements
  2021-05-13  7:40 [PATCH v3 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (7 preceding siblings ...)
  2021-05-13  7:40 ` [PATCH v3 8/8] sched/fair: Dont iterate if no idle CPUs Srikar Dronamraju
@ 2021-05-19  9:36 ` Mel Gorman
  2021-05-19 16:55   ` Srikar Dronamraju
  8 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2021-05-19  9:36 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Gautham R Shenoy, Parth Shah, Aubrey Li

On Thu, May 13, 2021 at 01:10:19PM +0530, Srikar Dronamraju wrote:
> Recently we found that some of the benchmark numbers on Power10 were lesser
> than expected. Some analysis showed that the problem lies in the fact that
> L2-Cache on Power10 is at core level i.e only 4 threads share the L2-cache.
> 
> One probable solution to the problem was worked by Gautham where he posted
> http://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/t/#u
> a patch that marks MC domain as LLC.
> 
> Here the focus is on improving the current core scheduler's wakeup
> mechanism by looking at idle-cores and nr_busy_cpus that is already
> maintained per Last level cache(aka LLC)
> 
> Hence this approach can work well with the mc-llc too. It can help
> other architectures too.
> 

It took a while for the test grid to process this series but I finally
got results of this on different machines, including Zen1-3 which have
many LLCs per NUMA node, and it's a mixed bag.  While there are some
substantial gains depending on workload and machine, there are also some
substantial losses. The most consistent in terms of showing losses was
tbench4 but in general, the mix of major gains and losses may lead to a
game of whack-a-mole regression hunting where fixes for one workload and
machine introduce regressions in another setup.

The mix of gains/losses across multiple machines makes it difficult to
reach a solid ack/nak conclusion. My biggest concern is that machines with
multiple LLCs-per-node were most likely to show major gains or major losses
depending on the workload and the treatment of LLCs is a big aspect of
what the series does. I have not read the series in detail but the idler
LLC selection appears to ignore NUMA locality and that might be part of
the problem.

Relative to your own results

Hackbench: I didn't see the same gains but that is likely related to
	architecture and the topology.

DayTrader: I think this is an IBM benchmark that I'm guessing has a
	complex software stack required to reproduce it so I cannot
	comment.

Schbench: What parameters did you use? For me, I saw some big losses on
	the 99%+ percentiles but I could be using different parameters
	to you

Mongodb: I do not have an equivalent workload. What load generator did
	you use and what parameters?

More details below

NAS Parallel Benchmark was mostly ok. Zen2 for NAS-OMP-SP showed a 20%
	regression and higher system CPU usage but it was an outlier.
	While there were some other negative results, they were relatively
	short-lived in terms of absolute time.

DBench on XFS -- mostly ok. Intel machines were neutral. Zen2 showed
	0-8% regressions depending on client count but Zen3 showed major
	gains 0-39%

SpecJBB 2005 (1 VM per NUMA node) showed variable results across both
	Intel and AMD based machines.
	2-socket Broadwell:	-10% loss to 16% gain but the gain was
				an outlier, it was mostly losses
	2-socket Haswell:	-11% loss to 2% gain, mostly losses
	2-socket Zen1:		-5% loss to 20% gain, low thread loss, high thread gain
	2-socket Zen2:		-25% loss to 44% gain, mostly losses
	2-socket Zen3:		-4% loss to 42% gain, mostly gains

Perf bench pipe showed mostly small losses and higher system CPU usage.
	Worst case was an 8% loss, but mostly it was in the 0-2% range

Git checkout (shellscript intensive) was mostly neutral on Intel but
	gained heavily on Zen*
	Zen1:	+5.48% gain
	Zen2:	+5.42% gain
	Zen3:	+17%   gain

Kernel compilation: Mix of gains and losses, mostly neutral with one
	exception

Schbench (Facebook-based benchmark sensitive to wakeup latency) fine on
	most machines but some major regressing outliers on zen3

Hackbench-process-pipes: Mostly neutral, Zen 1 showed gains, Zen 3
	showed losses

Hackbench-process-sockets: Mostly neutral but for Zen, Zen1 showed
	losses, Zen 3 showed gains (the opposite of pipes)

Hackbench-process-pipes: Bit mixed but mostly either neutral or gains

Hackbench-socket-pipes: Bit mixed but Zen1 showed mostly losses

Netperf-TCP_STREAM: Mostly slight mix except for Zen3 with 15-26% losses

        Zen 3 netperf-tcp
                                    5.13.0-rc1             5.13.0-rc1
                                       vanilla   sched-wakeidler-v3r1
        Hmean     64        2278.21 (   0.00%)     1670.67 * -26.67%*
        Hmean     128       4299.84 (   0.00%)     3199.04 * -25.60%*
        Hmean     256       8176.62 (   0.00%)     5800.95 * -29.05%*
        Hmean     1024     27050.12 (   0.00%)    20743.79 * -23.31%*
        Hmean     2048     42558.18 (   0.00%)    33819.75 * -20.53%*
        Hmean     3312     50576.24 (   0.00%)    39549.06 * -21.80%*
        Hmean     4096     57782.00 (   0.00%)    49030.97 * -15.14%*
        Hmean     8192     72400.49 (   0.00%)    53489.29 * -26.12%*
        Hmean     16384    80997.97 (   0.00%)    63521.05 * -21.58%*

Netperf-UDP_STREAM: Mostly neutral but some machines show 0-13% losses

Tbench4: Mostly leaned towards being a loss

Single-socket Skylake
                         5.13.0-rc1             5.13.0-rc1
                            vanilla   sched-wakeidler-v3r1
Hmean     1       591.86 (   0.00%)      585.56 *  -1.07%*
Hmean     2      1050.84 (   0.00%)     1069.35 *   1.76%*
Hmean     4      1546.96 (   0.00%)     1501.62 *  -2.93%*
Hmean     8      2740.19 (   0.00%)     2675.56 *  -2.36%*
Hmean     16     2450.22 (   0.00%)     2413.80 *  -1.49%*
Hmean     32     2426.44 (   0.00%)     2361.36 *  -2.68%*

	Small losses, higher system CPU usage (not shown)

2-socket Cascadelake
                          5.13.0-rc1             5.13.0-rc1
                             vanilla   sched-wakeidler-v3r1
Hmean     1        544.22 (   0.00%)      548.16 *   0.72%*
Hmean     2       1058.67 (   0.00%)     1044.67 *  -1.32%*
Hmean     4       2049.38 (   0.00%)     2017.99 *  -1.53%*
Hmean     8       4071.51 (   0.00%)     3893.55 *  -4.37%*
Hmean     16      6575.50 (   0.00%)     6576.01 (   0.01%)
Hmean     32     10185.98 (   0.00%)    10303.26 *   1.15%*
Hmean     64     12145.38 (   0.00%)    11616.73 *  -4.35%*
Hmean     128    22335.44 (   0.00%)    21765.91 *  -2.55%*
Hmean     256    20274.37 (   0.00%)    21505.92 *   6.07%*
Hmean     320    20709.22 (   0.00%)    20733.21 *   0.12%*

	Mix of gains and losses, higher system CPU usage

2-socket Broadwell
                          5.13.0-rc1             5.13.0-rc1
                             vanilla   sched-wakeidler-v3r1
Hmean     1        438.53 (   0.00%)      447.02 *   1.94%*
Hmean     2        835.80 (   0.00%)      786.98 *  -5.84%*
Hmean     4       1527.05 (   0.00%)     1436.70 *  -5.92%*
Hmean     8       2952.17 (   0.00%)     2806.30 *  -4.94%*
Hmean     16      5237.13 (   0.00%)     5191.26 *  -0.88%*
Hmean     32      9222.13 (   0.00%)     9004.89 *  -2.36%*
Hmean     64     10805.29 (   0.00%)    10342.93 *  -4.28%*
Hmean     128    18469.14 (   0.00%)    17522.78 *  -5.12%*
Hmean     256    16641.85 (   0.00%)    16278.08 *  -2.19%*
Hmean     320    16623.42 (   0.00%)    16521.47 *  -0.61%*

	Mostly small losses, slight increase CPU usage

2-socket Zen1

tbench4
                          5.13.0-rc1             5.13.0-rc1
                             vanilla   sched-wakeidler-v3r1
Hmean     1        220.27 (   0.00%)      218.67 *  -0.73%*
Hmean     2        455.18 (   0.00%)      430.82 *  -5.35%*
Hmean     4        845.38 (   0.00%)      887.05 *   4.93%*
Hmean     8       1645.02 (   0.00%)     1563.07 *  -4.98%*
Hmean     16      3109.18 (   0.00%)     3074.53 *  -1.11%*
Hmean     32      4854.40 (   0.00%)     5167.61 *   6.45%*
Hmean     64     10793.06 (   0.00%)     7767.98 * -28.03%*
Hmean     128    12398.50 (   0.00%)    15067.49 *  21.53%*
Hmean     256    16756.69 (   0.00%)    11214.53 * -33.07%*
Hmean     512    10186.47 (   0.00%)    15159.09 *  48.82%*

	Two substantial losses, one substantial gain

2-socket Zen2
tbench4
                          5.13.0-rc1             5.13.0-rc1
                             vanilla   sched-wakeidler-v3r1
Hmean     1        341.84 (   0.00%)      337.45 *  -1.28%*
Hmean     2        675.90 (   0.00%)      659.10 *  -2.49%*
Hmean     4       1312.66 (   0.00%)     1250.00 *  -4.77%*
Hmean     8       2495.62 (   0.00%)     2386.57 *  -4.37%*
Hmean     16      4237.23 (   0.00%)     4835.29 *  14.11%*
Hmean     32      8505.60 (   0.00%)     8428.12 *  -0.91%*
Hmean     64     22452.58 (   0.00%)    20637.45 *  -8.08%*
Hmean     128    32493.62 (   0.00%)    27491.73 * -15.39%*
Hmean     256    40975.73 (   0.00%)    29466.08 * -28.09%*
Hmean     512    39320.56 (   0.00%)    34480.84 * -12.31%*

	Some major losses, one major gain

2-socket Zen3
tbench4
                           5.13.0-rc1             5.13.0-rc1
                              vanilla   sched-wakeidler-v3r1
Hmean     1         764.71 (   0.00%)      771.17 *   0.85%*
Hmean     2        1536.93 (   0.00%)     1504.18 *  -2.13%*
Hmean     4        2836.19 (   0.00%)     2805.02 *  -1.10%*
Hmean     8        4726.61 (   0.00%)     4762.61 *   0.76%*
Hmean     16       8341.73 (   0.00%)     8183.48 *  -1.90%*
Hmean     32      14446.04 (   0.00%)    13628.25 *  -5.66%*
Hmean     64      21852.72 (   0.00%)    24039.33 *  10.01%*
Hmean     128     27674.40 (   0.00%)    29107.56 *   5.18%*
Hmean     256     42985.16 (   0.00%)    36482.84 * -15.13%*
Hmean     512     50210.59 (   0.00%)    40899.44 * -18.54%*
Hmean     1024    63696.89 (   0.00%)    46715.28 * -26.66%*

	Some major losses, one big gain

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 0/8] sched/fair: wake_affine improvements
  2021-05-19  9:36 ` [PATCH v3 0/8] sched/fair: wake_affine improvements Mel Gorman
@ 2021-05-19 16:55   ` Srikar Dronamraju
  0 siblings, 0 replies; 18+ messages in thread
From: Srikar Dronamraju @ 2021-05-19 16:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Gautham R Shenoy, Parth Shah, Aubrey Li

* Mel Gorman <mgorman@techsingularity.net> [2021-05-19 10:36:44]:

> On Thu, May 13, 2021 at 01:10:19PM +0530, Srikar Dronamraju wrote:
> > Recently we found that some of the benchmark numbers on Power10 were lesser
> > than expected. Some analysis showed that the problem lies in the fact that
> > L2-Cache on Power10 is at core level i.e only 4 threads share the L2-cache.
> > 
> > One probable solution to the problem was worked by Gautham where he posted
> > http://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/t/#u
> > a patch that marks MC domain as LLC.
> > 
> > Here the focus is on improving the current core scheduler's wakeup
> > mechanism by looking at idle-cores and nr_busy_cpus that is already
> > maintained per Last level cache(aka LLC)
> > 
> > Hence this approach can work well with the mc-llc too. It can help
> > other architectures too.
> > 
> 
> It took a while for the test grid to process this series but I finally
> got results of this on different machines, including Zen1-3 which have
> many LLCs per NUMA node, and it's a mixed bag.  While there are some
> substantial gains depending on workload and machine, there are also some
> substantial losses. The most consistent in terms of showing losses was
> tbench4 but in general, the mix of major gains and losses may lead to a
> game of whack-a-mole regression hunting where fixes for one workload and
> machine introduce regressions in another setup.
> 

Thanks Mel for taking the time to run some of the benchmarks and analyze
them. I do agree that we should strive for gains with as few regressions
as possible.

> The mix of gains/losses across multiple machines makes it difficult to
> reach a solid ack/nak conclusion. My biggest concern is that machines with
> multiple LLCs-per-node were most likely to show major gains or major losses
> depending on the workload and the treatment of LLCs is a big aspect of
> what the series does. I have not read the series in detail but the idler
> LLC selection appears to ignore NUMA locality and that might be part of
> the problem.
> 

Based on the interactions with Aubrey in the v2 posting, I think some of
this problem is due to searching for the *elusive* idle-core in
overcommitted systems with many cores per LLC.

My hunch was that we could cheaply search for idle-cores when CPUs have
just gone idle, but that seems to have hurt the case where the system has
a lot of small running tasks, as with tbench or netperf. It looks like we
end up with a lot of threads unnecessarily trying to set the idle-core at
the same time.
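
To make the first part concrete, the idle-entry side is roughly the
following (an illustrative sketch only, reusing the idle_core/smt_id
fields from this series; the function name is made up and the details are
simplified from __update_idle_core()):

/*
 * Sketch: when a CPU enters idle and the LLC has not recorded an
 * idle-core yet, scan the local SMT siblings and publish this core.
 * With a storm of short-running tasks many CPUs run this back to back,
 * repeatedly re-scanning their core and writing to shared LLC state.
 */
static void sketch_update_idle_core(struct rq *rq)
{
        int core = cpu_of(rq);
        struct sched_domain_shared *sds;
        int cpu;

        rcu_read_lock();
        sds = rcu_dereference(per_cpu(sd_llc_shared, core));
        if (!sds || READ_ONCE(sds->idle_core) >= 0)
                goto unlock;

        for_each_cpu(cpu, cpu_smt_mask(core)) {
                if (cpu != core && !idle_cpu(cpu))
                        goto unlock;    /* a sibling is still busy */
        }

        WRITE_ONCE(sds->idle_core, per_cpu(smt_id, core));
unlock:
        rcu_read_unlock();
}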

The reverse, where we drop the search for idle-cores if the current core
is not idle (aka set_next_idle_core), seems to affect mildly loaded
hackbench, as tasks no longer get to choose an idle-core.

For now the LLC selection does ignore NUMA locality, because that would
further complicate the current problem. Once we iron out the LLC
selection within the die, we could look at extending this further.
But please do let me know if you have ideas to try with NUMA locality.

> Relative to your own results
> 
> Hackbench: I didn't see the same gains but that is likely related to
> 	architecture and the topology.
> 

On the POWER10 machine, there is just one LLC, so the current behaviour
of pulling the thread to the same LLC hurts performance even when there
are other idle-cores in the system.

> DayTrader: I think this is an IBM benchmark that I'm guessing has a
> 	complex software stack required to reproduce it so I cannot
> 	comment.
> 

Agree.

> Schbench: What parameters did you use? For me, I saw some big losses on
> 	the 99%+ percentiles but I could be using different parameters
> 	to you
> 

While I ran schbench in multiple configurations, the numbers shown were for:
schbench -m 3 -r 30

> Mongodb: I do not have an equivalent workload. What load generator did
> 	you use and what parameters?
> 

The mongodb workload that we are running is called Workload A, which is
an update-heavy workload with 50% reads and 50% writes. It is driven via
YCSB; with 35 clients, this mongodb workload hits close to 90% utilization.

> More details below
> 
> NAS Parallel Benchmark was mostly ok. Zen2 for NAS-OMP-SP showed a 20%
> 	regression and higher system CPU usage but it was an outlier.
> 	While there were some other negative results, they were relatively
> 	short-lived in terms of absolute time.
> 
> DBench on XFS -- mostly ok. Intel machines were neutral. Zen2 showed
> 	0-8% regressions depending on client count but Zen3 showed major
> 	gains 0-39%
> 
> SpecJBB 2005 (1 VM per NUMA node) showed variable results across both
> 	Intel and AMD based machines.
> 	2-socket Broadwell:	-10% loss to 16% gain but the gain was
> 				an outlier, it was mostly losses
> 	2-socket Haswell:	-11% loss to 2% gain, mostly losses
> 	2-socket Zen1:		-5% loss to 20% gain, low thread loss, high thread gain
> 	2-socket Zen2:		-25% loss to 44% gain, mostly losses
> 	2-socket Zen3:		-4% loss to 42% gain, mostly gains
> 
> Perf bench pipe showed mostly small losses and higher system CPU usage.
> 	Worst case was an 8% loss, but mostly it was in the 0-2% range
> 
> Git checkout (shellscript intensive) was mostly neutral on Intel but
> 	gained heavily on Zen*
> 	Zen1:	+5.48% gain
> 	Zen2:	+5.42% gain
> 	Zen3:	+17%   gain
> 
> Kernel compilation: Mix of gains and losses, mostly neutral with one
> 	exception
> 
> Schbench (Facebook-based benchmark sensitive to wakeup latency) fine on
> 	most machines but some major regressing outliers on zen3
> 
> Hackbench-process-pipes: Mostly neutral, Zen 1 showed gains, Zen 3
> 	showed losses
> 
> Hackbench-process-sockets: Mostly neutral but for Zen, Zen1 showed
> 	losses, Zen 3 showed gains (the opposite of pipes)
> 
> Hackbench-process-pipes: Bit mixed but mostly either neutral or gains
> 
> Hackbench-socket-pipes: Bit mixed but Zen1 showed mostly losses
> 
> Netperf-TCP_STREAM: Mostly slight mix except for Zen3 with 15-26% losses
> 
>         Zen 3 netperf-tcp
>                                     5.13.0-rc1             5.13.0-rc1
>                                        vanilla   sched-wakeidler-v3r1
>         Hmean     64        2278.21 (   0.00%)     1670.67 * -26.67%*
>         Hmean     128       4299.84 (   0.00%)     3199.04 * -25.60%*
>         Hmean     256       8176.62 (   0.00%)     5800.95 * -29.05%*
>         Hmean     1024     27050.12 (   0.00%)    20743.79 * -23.31%*
>         Hmean     2048     42558.18 (   0.00%)    33819.75 * -20.53%*
>         Hmean     3312     50576.24 (   0.00%)    39549.06 * -21.80%*
>         Hmean     4096     57782.00 (   0.00%)    49030.97 * -15.14%*
>         Hmean     8192     72400.49 (   0.00%)    53489.29 * -26.12%*
>         Hmean     16384    80997.97 (   0.00%)    63521.05 * -21.58%*
> 
> Netperf-UDP_STREAM: Mostly neutral but some machines show 0-13% losses
> 
> Tbench4: Mostly leaned towards being a loss
> 
> Single-socket Skylake
>                          5.13.0-rc1             5.13.0-rc1
>                             vanilla   sched-wakeidler-v3r1
> Hmean     1       591.86 (   0.00%)      585.56 *  -1.07%*
> Hmean     2      1050.84 (   0.00%)     1069.35 *   1.76%*
> Hmean     4      1546.96 (   0.00%)     1501.62 *  -2.93%*
> Hmean     8      2740.19 (   0.00%)     2675.56 *  -2.36%*
> Hmean     16     2450.22 (   0.00%)     2413.80 *  -1.49%*
> Hmean     32     2426.44 (   0.00%)     2361.36 *  -2.68%*
> 
> 	Small losses, higher system CPU usage (not shown)
> 
> 2-socket Cascadelake
>                           5.13.0-rc1             5.13.0-rc1
>                              vanilla   sched-wakeidler-v3r1
> Hmean     1        544.22 (   0.00%)      548.16 *   0.72%*
> Hmean     2       1058.67 (   0.00%)     1044.67 *  -1.32%*
> Hmean     4       2049.38 (   0.00%)     2017.99 *  -1.53%*
> Hmean     8       4071.51 (   0.00%)     3893.55 *  -4.37%*
> Hmean     16      6575.50 (   0.00%)     6576.01 (   0.01%)
> Hmean     32     10185.98 (   0.00%)    10303.26 *   1.15%*
> Hmean     64     12145.38 (   0.00%)    11616.73 *  -4.35%*
> Hmean     128    22335.44 (   0.00%)    21765.91 *  -2.55%*
> Hmean     256    20274.37 (   0.00%)    21505.92 *   6.07%*
> Hmean     320    20709.22 (   0.00%)    20733.21 *   0.12%*
> 
> 	Mix of gains and losses, higher system CPU usage
> 
> 2-socket Broadwell
>                           5.13.0-rc1             5.13.0-rc1
>                              vanilla   sched-wakeidler-v3r1
> Hmean     1        438.53 (   0.00%)      447.02 *   1.94%*
> Hmean     2        835.80 (   0.00%)      786.98 *  -5.84%*
> Hmean     4       1527.05 (   0.00%)     1436.70 *  -5.92%*
> Hmean     8       2952.17 (   0.00%)     2806.30 *  -4.94%*
> Hmean     16      5237.13 (   0.00%)     5191.26 *  -0.88%*
> Hmean     32      9222.13 (   0.00%)     9004.89 *  -2.36%*
> Hmean     64     10805.29 (   0.00%)    10342.93 *  -4.28%*
> Hmean     128    18469.14 (   0.00%)    17522.78 *  -5.12%*
> Hmean     256    16641.85 (   0.00%)    16278.08 *  -2.19%*
> Hmean     320    16623.42 (   0.00%)    16521.47 *  -0.61%*
> 
> 	Mostly small losses, slight increase CPU usage
> 
> 2-socket Zen1
> 
> tbench4
>                           5.13.0-rc1             5.13.0-rc1
>                              vanilla   sched-wakeidler-v3r1
> Hmean     1        220.27 (   0.00%)      218.67 *  -0.73%*
> Hmean     2        455.18 (   0.00%)      430.82 *  -5.35%*
> Hmean     4        845.38 (   0.00%)      887.05 *   4.93%*
> Hmean     8       1645.02 (   0.00%)     1563.07 *  -4.98%*
> Hmean     16      3109.18 (   0.00%)     3074.53 *  -1.11%*
> Hmean     32      4854.40 (   0.00%)     5167.61 *   6.45%*
> Hmean     64     10793.06 (   0.00%)     7767.98 * -28.03%*
> Hmean     128    12398.50 (   0.00%)    15067.49 *  21.53%*
> Hmean     256    16756.69 (   0.00%)    11214.53 * -33.07%*
> Hmean     512    10186.47 (   0.00%)    15159.09 *  48.82%*
> 
> 	Two substantial losses, one substantial gain
> 
> 2-socket Zen2
> tbench4
>                           5.13.0-rc1             5.13.0-rc1
>                              vanilla   sched-wakeidler-v3r1
> Hmean     1        341.84 (   0.00%)      337.45 *  -1.28%*
> Hmean     2        675.90 (   0.00%)      659.10 *  -2.49%*
> Hmean     4       1312.66 (   0.00%)     1250.00 *  -4.77%*
> Hmean     8       2495.62 (   0.00%)     2386.57 *  -4.37%*
> Hmean     16      4237.23 (   0.00%)     4835.29 *  14.11%*
> Hmean     32      8505.60 (   0.00%)     8428.12 *  -0.91%*
> Hmean     64     22452.58 (   0.00%)    20637.45 *  -8.08%*
> Hmean     128    32493.62 (   0.00%)    27491.73 * -15.39%*
> Hmean     256    40975.73 (   0.00%)    29466.08 * -28.09%*
> Hmean     512    39320.56 (   0.00%)    34480.84 * -12.31%*
> 
> 	Some major losses, one major gain
> 
> 2-socket Zen3
> tbench4
>                            5.13.0-rc1             5.13.0-rc1
>                               vanilla   sched-wakeidler-v3r1
> Hmean     1         764.71 (   0.00%)      771.17 *   0.85%*
> Hmean     2        1536.93 (   0.00%)     1504.18 *  -2.13%*
> Hmean     4        2836.19 (   0.00%)     2805.02 *  -1.10%*
> Hmean     8        4726.61 (   0.00%)     4762.61 *   0.76%*
> Hmean     16       8341.73 (   0.00%)     8183.48 *  -1.90%*
> Hmean     32      14446.04 (   0.00%)    13628.25 *  -5.66%*
> Hmean     64      21852.72 (   0.00%)    24039.33 *  10.01%*
> Hmean     128     27674.40 (   0.00%)    29107.56 *   5.18%*
> Hmean     256     42985.16 (   0.00%)    36482.84 * -15.13%*
> Hmean     512     50210.59 (   0.00%)    40899.44 * -18.54%*
> Hmean     1024    63696.89 (   0.00%)    46715.28 * -26.66%*
> 
> 	Some major losses, one big gain

I have looked at the data and am trying to analyze what can be done
differently.

But thanks again for the feedback.

> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/8] sched/fair: Maintain the identity of idle-core
  2021-05-13  7:40 ` [PATCH v3 2/8] sched/fair: Maintain the identity of idle-core Srikar Dronamraju
@ 2021-05-21 12:36   ` Vincent Guittot
  2021-05-21 13:31     ` Srikar Dronamraju
  0 siblings, 1 reply; 18+ messages in thread
From: Vincent Guittot @ 2021-05-21 12:36 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah, Aubrey Li

On Thu, 13 May 2021 at 09:40, Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> The scheduler maintains per-LLC info which tells if there is any idle
> core in that LLC. However, this information doesn't indicate which core is idle.
>
> So when iterating for idle-cores, if select_idle_core() finds an
> idle-core, then it doesn't try to reset this information.
>
> So if there was only one idle core in the LLC and select_idle_core()
> selected the idle-core, the LLC will maintain that it still has an
> idle-core.
>
> Conversely, if a task is pinned and has a restricted cpus_allowed_list,
> and the LLC has multiple idle-cores but select_idle_core cannot find an
> idle-core, the LLC will no longer maintain that it has an idle-core.
>
> As a first step to solve this problem, LLC will maintain the identity of
> the idle core instead of just the information that LLC has an idle core
>
> Along with maintaining this, the change will solve both of the problems
> listed above. However, there are other problems that exist with the current
> infrastructure and those will continue to exist with this change and
> would be handled in subsequent patches.
>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Parth Shah <parth@linux.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Aubrey Li <aubrey.li@linux.intel.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> Changelog v2->v3:
>  -  Rebase to tip/sched/core
>                 (Valentin)
>
>  include/linux/sched/topology.h |  2 +-
>  kernel/sched/fair.c            | 52 ++++++++++++++++++----------------
>  kernel/sched/sched.h           |  3 ++
>  kernel/sched/topology.c        |  7 +++++
>  4 files changed, 39 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8f0f778b7c91..285165a35f21 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -73,7 +73,7 @@ struct sched_group;
>  struct sched_domain_shared {
>         atomic_t        ref;
>         atomic_t        nr_busy_cpus;
> -       int             has_idle_cores;
> +       int             idle_core;
>  };
>
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7920f2a4d257..c42b2b3cd08f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1578,11 +1578,11 @@ numa_type numa_classify(unsigned int imbalance_pct,
>
>  #ifdef CONFIG_SCHED_SMT
>  /* Forward declarations of select_idle_sibling helpers */
> -static inline bool test_idle_cores(int cpu, bool def);
> +static inline int get_idle_core(int cpu, int def);
>  static inline int numa_idle_core(int idle_core, int cpu)
>  {
>         if (!static_branch_likely(&sched_smt_present) ||
> -           idle_core >= 0 || !test_idle_cores(cpu, false))
> +           idle_core >= 0 || get_idle_core(cpu, -1) < 0)
>                 return idle_core;
>
>         /*
> @@ -6039,29 +6039,31 @@ static inline int __select_idle_cpu(int cpu)
>  DEFINE_STATIC_KEY_FALSE(sched_smt_present);
>  EXPORT_SYMBOL_GPL(sched_smt_present);
>
> -static inline void set_idle_cores(int cpu, int val)
> +static inline void set_idle_core(int cpu, int val)
>  {
>         struct sched_domain_shared *sds;
>
>         sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>         if (sds)
> -               WRITE_ONCE(sds->has_idle_cores, val);
> +               WRITE_ONCE(sds->idle_core, val);
>  }
>
> -static inline bool test_idle_cores(int cpu, bool def)
> +static inline int get_idle_core(int cpu, int def)
>  {
>         struct sched_domain_shared *sds;
>
> -       sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> -       if (sds)
> -               return READ_ONCE(sds->has_idle_cores);
> +       if (static_branch_likely(&sched_smt_present)) {

It would be good to explain why the static branch needs to be added back

> +               sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> +               if (sds)
> +                       return READ_ONCE(sds->idle_core);
> +       }
>
>         return def;
>  }
>
>  /*
>   * Scans the local SMT mask to see if the entire core is idle, and records this
> - * information in sd_llc_shared->has_idle_cores.
> + * information in sd_llc_shared->idle_core.
>   *
>   * Since SMT siblings share all cache levels, inspecting this limited remote
>   * state should be fairly cheap.
> @@ -6072,7 +6074,7 @@ void __update_idle_core(struct rq *rq)
>         int cpu;
>
>         rcu_read_lock();
> -       if (test_idle_cores(core, true))
> +       if (get_idle_core(core, 0) >= 0)
>                 goto unlock;
>
>         for_each_cpu(cpu, cpu_smt_mask(core)) {
> @@ -6083,7 +6085,7 @@ void __update_idle_core(struct rq *rq)
>                         goto unlock;
>         }
>
> -       set_idle_cores(core, 1);
> +       set_idle_core(core, per_cpu(smt_id, core));
>  unlock:
>         rcu_read_unlock();
>  }
> @@ -6091,7 +6093,7 @@ void __update_idle_core(struct rq *rq)
>  /*
>   * Scan the entire LLC domain for idle cores; this dynamically switches off if
>   * there are no idle cores left in the system; tracked through
> - * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
> + * sd_llc->shared->idle_core and enabled through update_idle_core() above.
>   */
>  static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
>  {
> @@ -6144,11 +6146,11 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
>
>  #else /* CONFIG_SCHED_SMT */
>
> -static inline void set_idle_cores(int cpu, int val)
> +static inline void set_idle_core(int cpu, int val)
>  {
>  }
>
> -static inline bool test_idle_cores(int cpu, bool def)
> +static inline bool get_idle_core(int cpu, int def)
>  {
>         return def;
>  }
> @@ -6170,10 +6172,11 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>   * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
>   * average idle time for this rq (as found in rq->avg_idle).
>   */
> -static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
> +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int idle_core, int target)
>  {
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>         int i, cpu, idle_cpu = -1, nr = INT_MAX;
> +       bool has_idle_core = (idle_core != -1);
>         int this = smp_processor_id();
>         struct sched_domain *this_sd;
>         u64 time;
> @@ -6206,8 +6209,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>         for_each_cpu_wrap(cpu, cpus, target) {
>                 if (has_idle_core) {
>                         i = select_idle_core(p, cpu, cpus, &idle_cpu);
> -                       if ((unsigned int)i < nr_cpumask_bits)
> +                       if ((unsigned int)i < nr_cpumask_bits) {
> +#ifdef CONFIG_SCHED_SMT
> +                               if ((per_cpu(smt_id, i)) == idle_core)
> +                                       set_idle_core(i, -1);
> +#endif

CPUA-core0 enters idle
All other CPUs of core0 are already idle
set idle_core = core0
CPUB-core1 enters idle
All other CPUs of core1 are already idle so core1 becomes idle

A task wakes up and select_idle_core returns CPUA-core0
then idle_core=-1

At the next wakeup, we skip select_idle_core() even though core1 is idle

Am I missing something?




>                                 return i;
> +                       }
>
>                 } else {
>                         if (!--nr)
> @@ -6218,9 +6226,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>                 }
>         }
>
> -       if (has_idle_core)
> -               set_idle_cores(this, false);
> -
>         if (sched_feat(SIS_PROP) && !has_idle_core) {
>                 time = cpu_clock(this) - time;
>                 update_avg(&this_sd->avg_scan_cost, time);
> @@ -6276,10 +6281,9 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
>   */
>  static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  {
> -       bool has_idle_core = false;
> +       int i, recent_used_cpu, idle_core = -1;
>         struct sched_domain *sd;
>         unsigned long task_util;
> -       int i, recent_used_cpu;
>
>         /*
>          * On asymmetric system, update task utilization because we will check
> @@ -6357,16 +6361,16 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>                 return target;
>
>         if (sched_smt_active()) {
> -               has_idle_core = test_idle_cores(target, false);
> +               idle_core = get_idle_core(target, -1);
>
> -               if (!has_idle_core && cpus_share_cache(prev, target)) {
> +               if (idle_core < 0 && cpus_share_cache(prev, target)) {
>                         i = select_idle_smt(p, sd, prev);
>                         if ((unsigned int)i < nr_cpumask_bits)
>                                 return i;
>                 }
>         }
>
> -       i = select_idle_cpu(p, sd, has_idle_core, target);
> +       i = select_idle_cpu(p, sd, idle_core, target);
>         if ((unsigned)i < nr_cpumask_bits)
>                 return i;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index a189bec13729..22fbb50b036e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1491,6 +1491,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DECLARE_PER_CPU(int, sd_llc_size);
>  DECLARE_PER_CPU(int, sd_llc_id);
> +#ifdef CONFIG_SCHED_SMT
> +DECLARE_PER_CPU(int, smt_id);
> +#endif
>  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 55a0a243e871..232fb261dfc2 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -644,6 +644,9 @@ static void destroy_sched_domains(struct sched_domain *sd)
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DEFINE_PER_CPU(int, sd_llc_size);
>  DEFINE_PER_CPU(int, sd_llc_id);
> +#ifdef CONFIG_SCHED_SMT
> +DEFINE_PER_CPU(int, smt_id);
> +#endif
>  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> @@ -667,6 +670,9 @@ static void update_top_cache_domain(int cpu)
>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>         per_cpu(sd_llc_size, cpu) = size;
>         per_cpu(sd_llc_id, cpu) = id;
> +#ifdef CONFIG_SCHED_SMT
> +       per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
> +#endif
>         rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>
>         sd = lowest_flag_domain(cpu, SD_NUMA);
> @@ -1497,6 +1503,7 @@ sd_init(struct sched_domain_topology_level *tl,
>                 sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
>                 atomic_inc(&sd->shared->ref);
>                 atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> +               sd->shared->idle_core = -1;
>         }
>
>         sd->private = sdd;
> --
> 2.18.2
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-13  7:40 ` [PATCH v3 6/8] sched/idle: Move busy_cpu accounting to idle callback Srikar Dronamraju
@ 2021-05-21 12:37   ` Vincent Guittot
  2021-05-21 13:21     ` Srikar Dronamraju
  0 siblings, 1 reply; 18+ messages in thread
From: Vincent Guittot @ 2021-05-21 12:37 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah, Aubrey Li

On Thu, 13 May 2021 at 09:41, Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> Currently we account nr_busy_cpus in no_hz idle functions.
> There is no reason why nr_busy_cpus should be updated only in
> NO_HZ_COMMON configs. Also, the scheduler can mark a CPU as non-busy as
> soon as an idle class task starts to run. The scheduler can then mark a
> CPU as busy as soon as it is woken up from idle or a new task is placed
> on its runqueue.
>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Parth Shah <parth@linux.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Aubrey Li <aubrey.li@linux.intel.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c     |  6 ++++--
>  kernel/sched/idle.c     | 29 +++++++++++++++++++++++++++--
>  kernel/sched/sched.h    |  1 +
>  kernel/sched/topology.c |  2 ++
>  4 files changed, 34 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0dfe01de22d6..8f86359efdbd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10410,7 +10410,10 @@ static void set_cpu_sd_state_busy(int cpu)
>                 goto unlock;
>         sd->nohz_idle = 0;
>
> -       atomic_inc(&sd->shared->nr_busy_cpus);
> +       if (sd && per_cpu(is_idle, cpu)) {
> +               atomic_add_unless(&sd->shared->nr_busy_cpus, 1, per_cpu(sd_llc_size, cpu));
> +               per_cpu(is_idle, cpu) = 0;
> +       }
>  unlock:
>         rcu_read_unlock();
>  }
> @@ -10440,7 +10443,6 @@ static void set_cpu_sd_state_idle(int cpu)
>                 goto unlock;
>         sd->nohz_idle = 1;
>
> -       atomic_dec(&sd->shared->nr_busy_cpus);
>  unlock:
>         rcu_read_unlock();
>  }
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index a9f5a8ace59e..c13105fe06b3 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -431,12 +431,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
>
>  static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
>  {
> -#ifdef CONFIG_SCHED_SMT
> +#ifdef CONFIG_SMP
> +       struct sched_domain_shared *sds;
>         int cpu = rq->cpu;
>
> +#ifdef CONFIG_SCHED_SMT
>         if (static_branch_likely(&sched_smt_present))
>                 set_core_busy(cpu);
>  #endif
> +
> +       rcu_read_lock();
> +       sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> +       if (sds) {
> +               if (per_cpu(is_idle, cpu)) {
> +                       atomic_inc(&sds->nr_busy_cpus);
> +                       per_cpu(is_idle, cpu) = 0;
> +               }
> +       }
> +       rcu_read_unlock();
> +#endif
>  }
>
>  static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
> @@ -448,9 +461,21 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
>  struct task_struct *pick_next_task_idle(struct rq *rq)
>  {
>         struct task_struct *next = rq->idle;
> +#ifdef CONFIG_SMP
> +       struct sched_domain_shared *sds;
> +       int cpu = rq->cpu;
>
> -       set_next_task_idle(rq, next, true);
> +       rcu_read_lock();
> +       sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> +       if (sds) {
> +               atomic_add_unless(&sds->nr_busy_cpus, -1, 0);
> +               per_cpu(is_idle, cpu) = 1;
> +       }

One reason to update nr_busy_cpus only during the tick, and not at each
and every sleep/wakeup, is to limit the number of atomic_inc/dec
operations in case of a storm of short-running tasks. Because in the end,
you waste more time trying to accurately follow the current state of the
CPU than doing work.

>
> +       rcu_read_unlock();
> +#endif
> +
> +       set_next_task_idle(rq, next, true);
>         return next;
>  }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 98c3cfbc5d26..b66c4dad5fd2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1496,6 +1496,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
>  #ifdef CONFIG_SCHED_SMT
>  DECLARE_PER_CPU(int, smt_id);
>  #endif
> +DECLARE_PER_CPU(int, is_idle);
>  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 232fb261dfc2..730252937712 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
>  #ifdef CONFIG_SCHED_SMT
>  DEFINE_PER_CPU(int, smt_id);
>  #endif
> +DEFINE_PER_CPU(int, is_idle);
>  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> @@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
>  #ifdef CONFIG_SCHED_SMT
>         per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
>  #endif
> +       per_cpu(is_idle, cpu) = 1;
>         rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>
>         sd = lowest_flag_domain(cpu, SD_NUMA);
> --
> 2.18.2
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-21 12:37   ` Vincent Guittot
@ 2021-05-21 13:21     ` Srikar Dronamraju
  0 siblings, 0 replies; 18+ messages in thread
From: Srikar Dronamraju @ 2021-05-21 13:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah, Aubrey Li

* Vincent Guittot <vincent.guittot@linaro.org> [2021-05-21 14:37:51]:

> On Thu, 13 May 2021 at 09:41, Srikar Dronamraju
> <srikar@linux.vnet.ibm.com> wrote:
> >
> > Currently we account nr_busy_cpus in no_hz idle functions.
> > There is no reason why nr_busy_cpus should be updated only in
> > NO_HZ_COMMON configs. Also, the scheduler can mark a CPU as non-busy as
> > soon as an idle class task starts to run. The scheduler can then mark a
> > CPU as busy as soon as it is woken up from idle or a new task is placed
> > on its runqueue.
> >
> > Cc: LKML <linux-kernel@vger.kernel.org>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Parth Shah <parth@linux.ibm.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Rik van Riel <riel@surriel.com>
> > Cc: Aubrey Li <aubrey.li@linux.intel.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  kernel/sched/fair.c     |  6 ++++--
> >  kernel/sched/idle.c     | 29 +++++++++++++++++++++++++++--
> >  kernel/sched/sched.h    |  1 +
> >  kernel/sched/topology.c |  2 ++
> >  4 files changed, 34 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0dfe01de22d6..8f86359efdbd 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10410,7 +10410,10 @@ static void set_cpu_sd_state_busy(int cpu)
> >                 goto unlock;
> >         sd->nohz_idle = 0;
> >
> > -       atomic_inc(&sd->shared->nr_busy_cpus);
> > +       if (sd && per_cpu(is_idle, cpu)) {
> > +               atomic_add_unless(&sd->shared->nr_busy_cpus, 1, per_cpu(sd_llc_size, cpu));
> > +               per_cpu(is_idle, cpu) = 0;
> > +       }
> >  unlock:
> >         rcu_read_unlock();
> >  }
> > @@ -10440,7 +10443,6 @@ static void set_cpu_sd_state_idle(int cpu)
> >                 goto unlock;
> >         sd->nohz_idle = 1;
> >
> > -       atomic_dec(&sd->shared->nr_busy_cpus);
> >  unlock:
> >         rcu_read_unlock();
> >  }
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index a9f5a8ace59e..c13105fe06b3 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -431,12 +431,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
> >
> >  static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
> >  {
> > -#ifdef CONFIG_SCHED_SMT
> > +#ifdef CONFIG_SMP
> > +       struct sched_domain_shared *sds;
> >         int cpu = rq->cpu;
> >
> > +#ifdef CONFIG_SCHED_SMT
> >         if (static_branch_likely(&sched_smt_present))
> >                 set_core_busy(cpu);
> >  #endif
> > +
> > +       rcu_read_lock();
> > +       sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > +       if (sds) {
> > +               if (per_cpu(is_idle, cpu)) {
> > +                       atomic_inc(&sds->nr_busy_cpus);
> > +                       per_cpu(is_idle, cpu) = 0;
> > +               }
> > +       }
> > +       rcu_read_unlock();
> > +#endif
> >  }
> >
> >  static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
> > @@ -448,9 +461,21 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
> >  struct task_struct *pick_next_task_idle(struct rq *rq)
> >  {
> >         struct task_struct *next = rq->idle;
> > +#ifdef CONFIG_SMP
> > +       struct sched_domain_shared *sds;
> > +       int cpu = rq->cpu;
> >
> > -       set_next_task_idle(rq, next, true);
> > +       rcu_read_lock();
> > +       sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > +       if (sds) {
> > +               atomic_add_unless(&sds->nr_busy_cpus, -1, 0);
> > +               per_cpu(is_idle, cpu) = 1;
> > +       }
> 
> One reason to update nr_busy_cpus only during the tick, and not at each
> and every sleep/wakeup, is to limit the number of atomic_inc/dec
> operations in case of a storm of short-running tasks. Because in the end,
> you waste more time trying to accurately follow the current state of the
> CPU than doing work.
> 

Yes, I do understand that for short running tasks, or if the CPUs are
entering idle for a very short interval, we are unnecessarily tracking the
number of busy_cpus.

However, let's assume we have to compare 2 LLCs and have to choose the
better one for a wakeup:
1. We can look at nr_busy_cpus which may not have been updated at every
CPU idle.
2. We can look at nr_busy_cpus which has been updated at every CPU idle.
3. We start aggregating the load of all the CPUs in the LLC.
4. Use the current method, where we only compare the load on the previous
CPU and the current CPU. However, that doesn't give much indication of
whether the other CPUs in those LLCs were free.

or probably some other method.

I thought option 2 would be better, but I am okay with option 1 too.
Please let me know which option you would prefer.
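
For reference, the way options 1/2 would consume this count at wakeup is
roughly the following (a simplified sketch of the wake_affine_idler_llc()
logic in this series, not the exact code):

/*
 * Sketch: pick the LLC with the smaller busy fraction.  The comparison
 * is cross-multiplied to avoid a division; if both LLCs are completely
 * busy, fall back to the default wake_affine path.  Caller holds
 * rcu_read_lock().
 */
static int sketch_idler_llc(int pref_cpu, int try_cpu)
{
        struct sched_domain_shared *pref_sds, *try_sds;
        int pnr_busy, tnr_busy, pllc_size, tllc_size;

        pref_sds = rcu_dereference(per_cpu(sd_llc_shared, pref_cpu));
        try_sds = rcu_dereference(per_cpu(sd_llc_shared, try_cpu));
        if (!pref_sds || !try_sds)
                return nr_cpumask_bits;

        pnr_busy = atomic_read(&pref_sds->nr_busy_cpus);
        tnr_busy = atomic_read(&try_sds->nr_busy_cpus);
        pllc_size = per_cpu(sd_llc_size, pref_cpu);
        tllc_size = per_cpu(sd_llc_size, try_cpu);

        if (pnr_busy == pllc_size && tnr_busy == tllc_size)
                return nr_cpumask_bits;

        /* Is pref_cpu's LLC proportionally less busy than try_cpu's? */
        if (tnr_busy * pllc_size - pnr_busy * tllc_size > 0)
                return pref_cpu;

        return try_cpu;
}

The staler nr_busy_cpus is (option 1), the more often this comparison can
point at the wrong LLC, which is why I leaned towards option 2.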

> >
> > +       rcu_read_unlock();
> > +#endif
> > +
> > +       set_next_task_idle(rq, next, true);
> >         return next;
> >  }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 98c3cfbc5d26..b66c4dad5fd2 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1496,6 +1496,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
> >  #ifdef CONFIG_SCHED_SMT
> >  DECLARE_PER_CPU(int, smt_id);
> >  #endif
> > +DECLARE_PER_CPU(int, is_idle);
> >  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 232fb261dfc2..730252937712 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
> >  #ifdef CONFIG_SCHED_SMT
> >  DEFINE_PER_CPU(int, smt_id);
> >  #endif
> > +DEFINE_PER_CPU(int, is_idle);
> >  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > @@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
> >  #ifdef CONFIG_SCHED_SMT
> >         per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
> >  #endif
> > +       per_cpu(is_idle, cpu) = 1;
> >         rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> >
> >         sd = lowest_flag_domain(cpu, SD_NUMA);
> > --
> > 2.18.2
> >

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/8] sched/fair: Maintain the identity of idle-core
  2021-05-21 12:36   ` Vincent Guittot
@ 2021-05-21 13:31     ` Srikar Dronamraju
  2021-05-22 12:42       ` Vincent Guittot
  0 siblings, 1 reply; 18+ messages in thread
From: Srikar Dronamraju @ 2021-05-21 13:31 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah, Aubrey Li

* Vincent Guittot <vincent.guittot@linaro.org> [2021-05-21 14:36:15]:

> On Thu, 13 May 2021 at 09:40, Srikar Dronamraju
> <srikar@linux.vnet.ibm.com> wrote:
> >
> > The scheduler maintains per-LLC info which tells if there is any idle
> > core in that LLC. However, this information doesn't indicate which core is idle.
> >
> > So when iterating for idle-cores, if select_idle_core() finds an
> > idle-core, then it doesn't try to reset this information.
> >
> > So if there was only one idle core in the LLC and select_idle_core()
> > selected the idle-core, the LLC will maintain that it still has an
> > idle-core.
> >
> > Conversely, if a task is pinned and has a restricted cpus_allowed_list,
> > and the LLC has multiple idle-cores but select_idle_core cannot find an
> > idle-core, the LLC will no longer maintain that it has an idle-core.
> >
> > As a first step to solve this problem, LLC will maintain the identity of
> > the idle core instead of just the information that LLC has an idle core
> >
> > Along with maintaining this, the change will solve both of the problems
> > listed above. However, there are other problems that exist with the current
> > infrastructure and those will continue to exist with this change and
> > would be handled in subsequent patches.
> >
> > Cc: LKML <linux-kernel@vger.kernel.org>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Parth Shah <parth@linux.ibm.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Rik van Riel <riel@surriel.com>
> > Cc: Aubrey Li <aubrey.li@linux.intel.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> > Changelog v2->v3:
> >  -  Rebase to tip/sched/core
> >                 (Valentin)
> >
> >  include/linux/sched/topology.h |  2 +-
> >  kernel/sched/fair.c            | 52 ++++++++++++++++++----------------
> >  kernel/sched/sched.h           |  3 ++
> >  kernel/sched/topology.c        |  7 +++++
> >  4 files changed, 39 insertions(+), 25 deletions(-)
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 8f0f778b7c91..285165a35f21 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -73,7 +73,7 @@ struct sched_group;
> >  struct sched_domain_shared {
> >         atomic_t        ref;
> >         atomic_t        nr_busy_cpus;
> > -       int             has_idle_cores;
> > +       int             idle_core;
> >  };
> >
> >  struct sched_domain {
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 7920f2a4d257..c42b2b3cd08f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1578,11 +1578,11 @@ numa_type numa_classify(unsigned int imbalance_pct,
> >
> >  #ifdef CONFIG_SCHED_SMT
> >  /* Forward declarations of select_idle_sibling helpers */
> > -static inline bool test_idle_cores(int cpu, bool def);
> > +static inline int get_idle_core(int cpu, int def);
> >  static inline int numa_idle_core(int idle_core, int cpu)
> >  {
> >         if (!static_branch_likely(&sched_smt_present) ||
> > -           idle_core >= 0 || !test_idle_cores(cpu, false))
> > +           idle_core >= 0 || get_idle_core(cpu, -1) < 0)
> >                 return idle_core;
> >
> >         /*
> > @@ -6039,29 +6039,31 @@ static inline int __select_idle_cpu(int cpu)
> >  DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> >  EXPORT_SYMBOL_GPL(sched_smt_present);
> >
> > -static inline void set_idle_cores(int cpu, int val)
> > +static inline void set_idle_core(int cpu, int val)
> >  {
> >         struct sched_domain_shared *sds;
> >
> >         sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> >         if (sds)
> > -               WRITE_ONCE(sds->has_idle_cores, val);
> > +               WRITE_ONCE(sds->idle_core, val);
> >  }
> >
> > -static inline bool test_idle_cores(int cpu, bool def)
> > +static inline int get_idle_core(int cpu, int def)
> >  {
> >         struct sched_domain_shared *sds;
> >
> > -       sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > -       if (sds)
> > -               return READ_ONCE(sds->has_idle_cores);
> > +       if (static_branch_likely(&sched_smt_present)) {
> 
> It would be good to explain why the static branch needs to be added back
> 

Agree, this is not needed. Will fix in the next version.
Thanks for pointing it out.

> > +               sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > +               if (sds)
> > +                       return READ_ONCE(sds->idle_core);
> > +       }
> >
> >         return def;
> >  }
> >
> >  /*
> >   * Scans the local SMT mask to see if the entire core is idle, and records this
> > - * information in sd_llc_shared->has_idle_cores.
> > + * information in sd_llc_shared->idle_core.
> >   *
> >   * Since SMT siblings share all cache levels, inspecting this limited remote
> >   * state should be fairly cheap.
> > @@ -6072,7 +6074,7 @@ void __update_idle_core(struct rq *rq)
> >         int cpu;
> >
> >         rcu_read_lock();
> > -       if (test_idle_cores(core, true))
> > +       if (get_idle_core(core, 0) >= 0)
> >                 goto unlock;
> >
> >         for_each_cpu(cpu, cpu_smt_mask(core)) {
> > @@ -6083,7 +6085,7 @@ void __update_idle_core(struct rq *rq)
> >                         goto unlock;
> >         }
> >
> > -       set_idle_cores(core, 1);
> > +       set_idle_core(core, per_cpu(smt_id, core));
> >  unlock:
> >         rcu_read_unlock();
> >  }
> > @@ -6091,7 +6093,7 @@ void __update_idle_core(struct rq *rq)
> >  /*
> >   * Scan the entire LLC domain for idle cores; this dynamically switches off if
> >   * there are no idle cores left in the system; tracked through
> > - * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
> > + * sd_llc->shared->idle_core and enabled through update_idle_core() above.
> >   */
> >  static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> >  {
> > @@ -6144,11 +6146,11 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
> >
> >  #else /* CONFIG_SCHED_SMT */
> >
> > -static inline void set_idle_cores(int cpu, int val)
> > +static inline void set_idle_core(int cpu, int val)
> >  {
> >  }
> >
> > -static inline bool test_idle_cores(int cpu, bool def)
> > +static inline bool get_idle_core(int cpu, int def)
> >  {
> >         return def;
> >  }
> > @@ -6170,10 +6172,11 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
> >   * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> >   * average idle time for this rq (as found in rq->avg_idle).
> >   */
> > -static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
> > +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int idle_core, int target)
> >  {
> >         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> >         int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > +       bool has_idle_core = (idle_core != -1);
> >         int this = smp_processor_id();
> >         struct sched_domain *this_sd;
> >         u64 time;
> > @@ -6206,8 +6209,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >         for_each_cpu_wrap(cpu, cpus, target) {
> >                 if (has_idle_core) {
> >                         i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > -                       if ((unsigned int)i < nr_cpumask_bits)
> > +                       if ((unsigned int)i < nr_cpumask_bits) {
> > +#ifdef CONFIG_SCHED_SMT
> > +                               if ((per_cpu(smt_id, i)) == idle_core)
> > +                                       set_idle_core(i, -1);
> > +#endif
> 
> CPUA-core0 enters idle
> All other CPUs of core0 are already idle
> set idle_core = core0
> CPUB-core1 enters idle
> All other CPUs of core1 are already idle so core1 becomes idle
> 
> A task wakes up and select_idle_core returns CPUA-core0
> then idle_core=-1
> 
> At the next wakeup, we skip select_idle_core() even though core1 is idle
> 
> Do I miss something ?
> 

You are right, but this is similar to what we currently do anyway. Even
without this patch, we go ahead and unconditionally set the idle-core to -1
(we don't even have a way to check whether the selected CPU came from an
idle-core). Please see the hunk I removed below.

I have tried to improve upon this for the next iteration, but we are again
seeing somewhat higher utilization, probably because of that change.

I plan to move to a cpumask-based approach in v4, with which we no longer
have to search the LLC in order to record an idle-core, yet we still know
whether any idle-cores are around. However, that will have the extra penalty
of the atomic operations that you commented on in one of my patches.

But if you have other ideas, I would be willing to try them out.
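
For illustration, here is a minimal sketch of what such a cpumask-based
tracker could look like. This is only a sketch of the idea, assuming a
hypothetical idle_cores mask added to sched_domain_shared and relying on
the atomic cpumask_set_cpu()/cpumask_clear_cpu() helpers; it is not the
code that will be posted in v4:

	struct sched_domain_shared {
		atomic_t	ref;
		atomic_t	nr_busy_cpus;
		cpumask_var_t	idle_cores;	/* hypothetical new field */
	};

	/* A core became fully idle (cf. __update_idle_core()). */
	static inline void mark_core_idle(struct sched_domain_shared *sds, int core)
	{
		/* cpumask_set_cpu() is an atomic set_bit(): the extra cost. */
		cpumask_set_cpu(core, sds->idle_cores);
	}

	/* An idle core has been claimed by a wakeup. */
	static inline void mark_core_busy(struct sched_domain_shared *sds, int core)
	{
		cpumask_clear_cpu(core, sds->idle_cores);
	}

	/* Cheap check for the wakeup fast path. */
	static inline bool llc_has_idle_core(struct sched_domain_shared *sds)
	{
		return !cpumask_empty(sds->idle_cores);
	}

With something like this, the wakeup path never has to scan the LLC merely
to learn whether an idle core exists, at the price of the atomic bit
operations above.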

> 
> 
> >                                 return i;
> > +                       }
> >
> >                 } else {
> >                         if (!--nr)
> > @@ -6218,9 +6226,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >                 }
> >         }
> >
> > -       if (has_idle_core)
> > -               set_idle_cores(this, false);
> > -

I was referring to this hunk.

> >         if (sched_feat(SIS_PROP) && !has_idle_core) {
> >                 time = cpu_clock(this) - time;
> >                 update_avg(&this_sd->avg_scan_cost, time);
> > @@ -6276,10 +6281,9 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
> >   */
> >  static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >  {
> > -       bool has_idle_core = false;
> > +       int i, recent_used_cpu, idle_core = -1;
> >         struct sched_domain *sd;
> >         unsigned long task_util;
> > -       int i, recent_used_cpu;
> >
> >         /*
> >          * On asymmetric system, update task utilization because we will check
> > @@ -6357,16 +6361,16 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >                 return target;
> >
> >         if (sched_smt_active()) {
> > -               has_idle_core = test_idle_cores(target, false);
> > +               idle_core = get_idle_core(target, -1);
> >
> > -               if (!has_idle_core && cpus_share_cache(prev, target)) {
> > +               if (idle_core < 0 && cpus_share_cache(prev, target)) {
> >                         i = select_idle_smt(p, sd, prev);
> >                         if ((unsigned int)i < nr_cpumask_bits)
> >                                 return i;
> >                 }
> >         }
> >
> > -       i = select_idle_cpu(p, sd, has_idle_core, target);
> > +       i = select_idle_cpu(p, sd, idle_core, target);
> >         if ((unsigned)i < nr_cpumask_bits)
> >                 return i;
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index a189bec13729..22fbb50b036e 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1491,6 +1491,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
> >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> >  DECLARE_PER_CPU(int, sd_llc_size);
> >  DECLARE_PER_CPU(int, sd_llc_id);
> > +#ifdef CONFIG_SCHED_SMT
> > +DECLARE_PER_CPU(int, smt_id);
> > +#endif
> >  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 55a0a243e871..232fb261dfc2 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -644,6 +644,9 @@ static void destroy_sched_domains(struct sched_domain *sd)
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> >  DEFINE_PER_CPU(int, sd_llc_size);
> >  DEFINE_PER_CPU(int, sd_llc_id);
> > +#ifdef CONFIG_SCHED_SMT
> > +DEFINE_PER_CPU(int, smt_id);
> > +#endif
> >  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > @@ -667,6 +670,9 @@ static void update_top_cache_domain(int cpu)
> >         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> >         per_cpu(sd_llc_size, cpu) = size;
> >         per_cpu(sd_llc_id, cpu) = id;
> > +#ifdef CONFIG_SCHED_SMT
> > +       per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
> > +#endif
> >         rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> >
> >         sd = lowest_flag_domain(cpu, SD_NUMA);
> > @@ -1497,6 +1503,7 @@ sd_init(struct sched_domain_topology_level *tl,
> >                 sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> >                 atomic_inc(&sd->shared->ref);
> >                 atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> > +               sd->shared->idle_core = -1;
> >         }
> >
> >         sd->private = sdd;
> > --
> > 2.18.2
> >

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v3 2/8] sched/fair: Maintain the identity of idle-core
  2021-05-21 13:31     ` Srikar Dronamraju
@ 2021-05-22 12:42       ` Vincent Guittot
  2021-05-22 14:10         ` Srikar Dronamraju
  0 siblings, 1 reply; 18+ messages in thread
From: Vincent Guittot @ 2021-05-22 12:42 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah, Aubrey Li

On Fri, 21 May 2021 at 15:31, Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> * Vincent Guittot <vincent.guittot@linaro.org> [2021-05-21 14:36:15]:
>
> > On Thu, 13 May 2021 at 09:40, Srikar Dronamraju
> > <srikar@linux.vnet.ibm.com> wrote:
> > >
> > > Scheduler maintains a per LLC info which tells if there is any idle core
> > > in that LLC. However this information doesn't provide which core is idle.
> > >
> > > So when iterating for idle-cores, if select_idle_core() finds an
> > > idle-core, then it doesn't try to reset this information.
> > >
> > > So if there was only one idle core in the LLC and select_idle_core()
> > > selected the idle-core, the LLC will maintain that it still has a
> > > idle-core.
> > >
> > > On the converse, if a task is pinned, and has a restricted
> > > cpus_allowed_list and LLC has multiple idle-cores, but select_idle_core
> > > cannot find a idle-core, LLC will no more maintain that it has an
> > > idle-core.
> > >
> > > As a first step to solve this problem, LLC will maintain the identity of
> > > the idle core instead of just the information that LLC has an idle core
> > >
> > > Along with maintaining, this change will solve both the problems listed
> > > above. However there are other problems that exist with the current
> > > infrastructure and those will continue to exist with this change and
> > > would be handled in subsequent patches.
> > >
> > > Cc: LKML <linux-kernel@vger.kernel.org>
> > > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > > Cc: Parth Shah <parth@linux.ibm.com>
> > > Cc: Ingo Molnar <mingo@kernel.org>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > Cc: Mel Gorman <mgorman@techsingularity.net>
> > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > Cc: Rik van Riel <riel@surriel.com>
> > > Cc: Aubrey Li <aubrey.li@linux.intel.com>
> > > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > > ---
> > > Changelog v2->v3:
> > >  -  Rebase to tip/sched/core
> > >                 (Valentin)
> > >
> > >  include/linux/sched/topology.h |  2 +-
> > >  kernel/sched/fair.c            | 52 ++++++++++++++++++----------------
> > >  kernel/sched/sched.h           |  3 ++
> > >  kernel/sched/topology.c        |  7 +++++
> > >  4 files changed, 39 insertions(+), 25 deletions(-)
> > >
> > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > > index 8f0f778b7c91..285165a35f21 100644
> > > --- a/include/linux/sched/topology.h
> > > +++ b/include/linux/sched/topology.h
> > > @@ -73,7 +73,7 @@ struct sched_group;
> > >  struct sched_domain_shared {
> > >         atomic_t        ref;
> > >         atomic_t        nr_busy_cpus;
> > > -       int             has_idle_cores;
> > > +       int             idle_core;
> > >  };
> > >
> > >  struct sched_domain {
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 7920f2a4d257..c42b2b3cd08f 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -1578,11 +1578,11 @@ numa_type numa_classify(unsigned int imbalance_pct,
> > >
> > >  #ifdef CONFIG_SCHED_SMT
> > >  /* Forward declarations of select_idle_sibling helpers */
> > > -static inline bool test_idle_cores(int cpu, bool def);
> > > +static inline int get_idle_core(int cpu, int def);
> > >  static inline int numa_idle_core(int idle_core, int cpu)
> > >  {
> > >         if (!static_branch_likely(&sched_smt_present) ||
> > > -           idle_core >= 0 || !test_idle_cores(cpu, false))
> > > +           idle_core >= 0 || get_idle_core(cpu, -1) < 0)
> > >                 return idle_core;
> > >
> > >         /*
> > > @@ -6039,29 +6039,31 @@ static inline int __select_idle_cpu(int cpu)
> > >  DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> > >  EXPORT_SYMBOL_GPL(sched_smt_present);
> > >
> > > -static inline void set_idle_cores(int cpu, int val)
> > > +static inline void set_idle_core(int cpu, int val)
> > >  {
> > >         struct sched_domain_shared *sds;
> > >
> > >         sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > >         if (sds)
> > > -               WRITE_ONCE(sds->has_idle_cores, val);
> > > +               WRITE_ONCE(sds->idle_core, val);
> > >  }
> > >
> > > -static inline bool test_idle_cores(int cpu, bool def)
> > > +static inline int get_idle_core(int cpu, int def)
> > >  {
> > >         struct sched_domain_shared *sds;
> > >
> > > -       sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > > -       if (sds)
> > > -               return READ_ONCE(sds->has_idle_cores);
> > > +       if (static_branch_likely(&sched_smt_present)) {
> >
> > Would be good to explain why the static branch needs to be added back
> >
>
> Agreed, this is not needed. Will fix in the next version.
> Thanks for pointing out.
>
> > > +               sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > > +               if (sds)
> > > +                       return READ_ONCE(sds->idle_core);
> > > +       }
> > >
> > >         return def;
> > >  }
> > >
> > >  /*
> > >   * Scans the local SMT mask to see if the entire core is idle, and records this
> > > - * information in sd_llc_shared->has_idle_cores.
> > > + * information in sd_llc_shared->idle_core.
> > >   *
> > >   * Since SMT siblings share all cache levels, inspecting this limited remote
> > >   * state should be fairly cheap.
> > > @@ -6072,7 +6074,7 @@ void __update_idle_core(struct rq *rq)
> > >         int cpu;
> > >
> > >         rcu_read_lock();
> > > -       if (test_idle_cores(core, true))
> > > +       if (get_idle_core(core, 0) >= 0)
> > >                 goto unlock;
> > >
> > >         for_each_cpu(cpu, cpu_smt_mask(core)) {
> > > @@ -6083,7 +6085,7 @@ void __update_idle_core(struct rq *rq)
> > >                         goto unlock;
> > >         }
> > >
> > > -       set_idle_cores(core, 1);
> > > +       set_idle_core(core, per_cpu(smt_id, core));
> > >  unlock:
> > >         rcu_read_unlock();
> > >  }
> > > @@ -6091,7 +6093,7 @@ void __update_idle_core(struct rq *rq)
> > >  /*
> > >   * Scan the entire LLC domain for idle cores; this dynamically switches off if
> > >   * there are no idle cores left in the system; tracked through
> > > - * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
> > > + * sd_llc->shared->idle_core and enabled through update_idle_core() above.
> > >   */
> > >  static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> > >  {
> > > @@ -6144,11 +6146,11 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
> > >
> > >  #else /* CONFIG_SCHED_SMT */
> > >
> > > -static inline void set_idle_cores(int cpu, int val)
> > > +static inline void set_idle_core(int cpu, int val)
> > >  {
> > >  }
> > >
> > > -static inline bool test_idle_cores(int cpu, bool def)
> > > +static inline bool get_idle_core(int cpu, int def)
> > >  {
> > >         return def;
> > >  }
> > > @@ -6170,10 +6172,11 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
> > >   * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> > >   * average idle time for this rq (as found in rq->avg_idle).
> > >   */
> > > -static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
> > > +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int idle_core, int target)
> > >  {
> > >         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> > >         int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > > +       bool has_idle_core = (idle_core != -1);
> > >         int this = smp_processor_id();
> > >         struct sched_domain *this_sd;
> > >         u64 time;
> > > @@ -6206,8 +6209,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > >         for_each_cpu_wrap(cpu, cpus, target) {
> > >                 if (has_idle_core) {
> > >                         i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > > -                       if ((unsigned int)i < nr_cpumask_bits)
> > > +                       if ((unsigned int)i < nr_cpumask_bits) {
> > > +#ifdef CONFIG_SCHED_SMT
> > > +                               if ((per_cpu(smt_id, i)) == idle_core)
> > > +                                       set_idle_core(i, -1);
> > > +#endif
> >
> > CPUA-core0 enters idle
> > All other CPUs of core0 are already idle
> > set idle_core = core0
> > CPUB-core1 enters idle
> > All other CPUs of core1 are already idle so core1 becomes idle
> >
> > A task wakes up and select_idle_core returns CPUA-core0
> > then idle_core=-1
> >
> > At the next wakeup, we skip select_idle_core() even though core1 is idle
> >
> > Do I miss something ?
> >
>
> You are right, but this is similar to what we currently do anyway. Even
> without this patch, we go ahead and unconditionally set the idle-core to -1
> (we don't even have a way to check whether the selected CPU came from an
> idle-core). Please see the hunk I removed below.

The current race window is limited to the interval between
select_idle_core() not finding an idle core and the call to
set_idle_cores(this, false) later, at the end of select_idle_cpu().

In your proposal, the race is no longer limited in time. As soon as the
first core that went idle and set idle_core is itself selected by
select_idle_core(), the idle_core information is broken.

>
> I have tried to improve upon this for the next iteration, but we are again
> seeing somewhat higher utilization, probably because of that change.
>
> I plan to move to a cpumask-based approach in v4, with which we no longer
> have to search the LLC in order to record an idle-core, yet we still know
> whether any idle-cores are around. However, that will have the extra penalty
> of the atomic operations that you commented on in one of my patches.
>
> But if you have other ideas, I would be willing to try them out.
>
> >
> >
> > >                                 return i;
> > > +                       }
> > >
> > >                 } else {
> > >                         if (!--nr)
> > > @@ -6218,9 +6226,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > >                 }
> > >         }
> > >
> > > -       if (has_idle_core)
> > > -               set_idle_cores(this, false);
> > > -
>
> I was referring to this hunk.
>
> > >         if (sched_feat(SIS_PROP) && !has_idle_core) {
> > >                 time = cpu_clock(this) - time;
> > >                 update_avg(&this_sd->avg_scan_cost, time);
> > > @@ -6276,10 +6281,9 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
> > >   */
> > >  static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > >  {
> > > -       bool has_idle_core = false;
> > > +       int i, recent_used_cpu, idle_core = -1;
> > >         struct sched_domain *sd;
> > >         unsigned long task_util;
> > > -       int i, recent_used_cpu;
> > >
> > >         /*
> > >          * On asymmetric system, update task utilization because we will check
> > > @@ -6357,16 +6361,16 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > >                 return target;
> > >
> > >         if (sched_smt_active()) {
> > > -               has_idle_core = test_idle_cores(target, false);
> > > +               idle_core = get_idle_core(target, -1);
> > >
> > > -               if (!has_idle_core && cpus_share_cache(prev, target)) {
> > > +               if (idle_core < 0 && cpus_share_cache(prev, target)) {
> > >                         i = select_idle_smt(p, sd, prev);
> > >                         if ((unsigned int)i < nr_cpumask_bits)
> > >                                 return i;
> > >                 }
> > >         }
> > >
> > > -       i = select_idle_cpu(p, sd, has_idle_core, target);
> > > +       i = select_idle_cpu(p, sd, idle_core, target);
> > >         if ((unsigned)i < nr_cpumask_bits)
> > >                 return i;
> > >
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index a189bec13729..22fbb50b036e 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -1491,6 +1491,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
> > >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > >  DECLARE_PER_CPU(int, sd_llc_size);
> > >  DECLARE_PER_CPU(int, sd_llc_id);
> > > +#ifdef CONFIG_SCHED_SMT
> > > +DECLARE_PER_CPU(int, smt_id);
> > > +#endif
> > >  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index 55a0a243e871..232fb261dfc2 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -644,6 +644,9 @@ static void destroy_sched_domains(struct sched_domain *sd)
> > >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > >  DEFINE_PER_CPU(int, sd_llc_size);
> > >  DEFINE_PER_CPU(int, sd_llc_id);
> > > +#ifdef CONFIG_SCHED_SMT
> > > +DEFINE_PER_CPU(int, smt_id);
> > > +#endif
> > >  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > > @@ -667,6 +670,9 @@ static void update_top_cache_domain(int cpu)
> > >         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> > >         per_cpu(sd_llc_size, cpu) = size;
> > >         per_cpu(sd_llc_id, cpu) = id;
> > > +#ifdef CONFIG_SCHED_SMT
> > > +       per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
> > > +#endif
> > >         rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> > >
> > >         sd = lowest_flag_domain(cpu, SD_NUMA);
> > > @@ -1497,6 +1503,7 @@ sd_init(struct sched_domain_topology_level *tl,
> > >                 sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> > >                 atomic_inc(&sd->shared->ref);
> > >                 atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> > > +               sd->shared->idle_core = -1;
> > >         }
> > >
> > >         sd->private = sdd;
> > > --
> > > 2.18.2
> > >
>
> --
> Thanks and Regards
> Srikar Dronamraju


* Re: [PATCH v3 2/8] sched/fair: Maintain the identity of idle-core
  2021-05-22 12:42       ` Vincent Guittot
@ 2021-05-22 14:10         ` Srikar Dronamraju
  2021-05-25  7:11           ` Vincent Guittot
  0 siblings, 1 reply; 18+ messages in thread
From: Srikar Dronamraju @ 2021-05-22 14:10 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah, Aubrey Li

* Vincent Guittot <vincent.guittot@linaro.org> [2021-05-22 14:42:00]:

> On Fri, 21 May 2021 at 15:31, Srikar Dronamraju
> <srikar@linux.vnet.ibm.com> wrote:
> >
> > * Vincent Guittot <vincent.guittot@linaro.org> [2021-05-21 14:36:15]:
> >
> > > On Thu, 13 May 2021 at 09:40, Srikar Dronamraju
> > > <srikar@linux.vnet.ibm.com> wrote:
> > > >
> > > > Scheduler maintains a per LLC info which tells if there is any idle core
> > > > in that LLC. However this information doesn't provide which core is idle.
> > > >
> > > > So when iterating for idle-cores, if select_idle_core() finds an
> > > > idle-core, then it doesn't try to reset this information.
> > > >
> > > > So if there was only one idle core in the LLC and select_idle_core()
> > > > selected the idle-core, the LLC will maintain that it still has a
> > > > idle-core.
> > > >
> > > > On the converse, if a task is pinned, and has a restricted
> > > > cpus_allowed_list and LLC has multiple idle-cores, but select_idle_core
> > > > cannot find a idle-core, LLC will no more maintain that it has an
> > > > idle-core.
> > > >
> > > > As a first step to solve this problem, LLC will maintain the identity of
> > > > the idle core instead of just the information that LLC has an idle core
> > > >
> > > > Along with maintaining, this change will solve both the problems listed
> > > > above. However there are other problems that exist with the current
> > > > infrastructure and those will continue to exist with this change and
> > > > would be handled in subsequent patches.
> > > >
> > > > Cc: LKML <linux-kernel@vger.kernel.org>
> > > > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > > > Cc: Parth Shah <parth@linux.ibm.com>
> > > > Cc: Ingo Molnar <mingo@kernel.org>
> > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > > Cc: Mel Gorman <mgorman@techsingularity.net>
> > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > Cc: Rik van Riel <riel@surriel.com>
> > > > Cc: Aubrey Li <aubrey.li@linux.intel.com>
> > > > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > > > ---
> > > > Changelog v2->v3:
> > > >  -  Rebase to tip/sched/core
> > > >                 (Valentin)
> > > >
> > > >  include/linux/sched/topology.h |  2 +-
> > > >  kernel/sched/fair.c            | 52 ++++++++++++++++++----------------
> > > >  kernel/sched/sched.h           |  3 ++
> > > >  kernel/sched/topology.c        |  7 +++++
> > > >  4 files changed, 39 insertions(+), 25 deletions(-)
> > > >
> > > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > > > index 8f0f778b7c91..285165a35f21 100644
> > > > --- a/include/linux/sched/topology.h
> > > > +++ b/include/linux/sched/topology.h
> > > > @@ -73,7 +73,7 @@ struct sched_group;
> > > >  struct sched_domain_shared {
> > > >         atomic_t        ref;
> > > >         atomic_t        nr_busy_cpus;
> > > > -       int             has_idle_cores;
> > > > +       int             idle_core;
> > > >  };
> > > >
> > > >  struct sched_domain {
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 7920f2a4d257..c42b2b3cd08f 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -1578,11 +1578,11 @@ numa_type numa_classify(unsigned int imbalance_pct,
> > > >
> > > >  #ifdef CONFIG_SCHED_SMT
> > > >  /* Forward declarations of select_idle_sibling helpers */
> > > > -static inline bool test_idle_cores(int cpu, bool def);
> > > > +static inline int get_idle_core(int cpu, int def);
> > > >  static inline int numa_idle_core(int idle_core, int cpu)
> > > >  {
> > > >         if (!static_branch_likely(&sched_smt_present) ||
> > > > -           idle_core >= 0 || !test_idle_cores(cpu, false))
> > > > +           idle_core >= 0 || get_idle_core(cpu, -1) < 0)
> > > >                 return idle_core;
> > > >
> > > >         /*
> > > > @@ -6039,29 +6039,31 @@ static inline int __select_idle_cpu(int cpu)
> > > >  DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> > > >  EXPORT_SYMBOL_GPL(sched_smt_present);
> > > >
> > > > -static inline void set_idle_cores(int cpu, int val)
> > > > +static inline void set_idle_core(int cpu, int val)
> > > >  {
> > > >         struct sched_domain_shared *sds;
> > > >
> > > >         sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > > >         if (sds)
> > > > -               WRITE_ONCE(sds->has_idle_cores, val);
> > > > +               WRITE_ONCE(sds->idle_core, val);
> > > >  }
> > > >
> > > > -static inline bool test_idle_cores(int cpu, bool def)
> > > > +static inline int get_idle_core(int cpu, int def)
> > > >  {
> > > >         struct sched_domain_shared *sds;
> > > >
> > > > -       sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > > > -       if (sds)
> > > > -               return READ_ONCE(sds->has_idle_cores);
> > > > +       if (static_branch_likely(&sched_smt_present)) {
> > >
> > > Would be good to explain why the static branch needs to be added back
> > >
> >
> > Agreed, this is not needed. Will fix in the next version.
> > Thanks for pointing out.
> >
> > > > +               sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > > > +               if (sds)
> > > > +                       return READ_ONCE(sds->idle_core);
> > > > +       }
> > > >
> > > >         return def;
> > > >  }
> > > >
> > > >  /*
> > > >   * Scans the local SMT mask to see if the entire core is idle, and records this
> > > > - * information in sd_llc_shared->has_idle_cores.
> > > > + * information in sd_llc_shared->idle_core.
> > > >   *
> > > >   * Since SMT siblings share all cache levels, inspecting this limited remote
> > > >   * state should be fairly cheap.
> > > > @@ -6072,7 +6074,7 @@ void __update_idle_core(struct rq *rq)
> > > >         int cpu;
> > > >
> > > >         rcu_read_lock();
> > > > -       if (test_idle_cores(core, true))
> > > > +       if (get_idle_core(core, 0) >= 0)
> > > >                 goto unlock;
> > > >
> > > >         for_each_cpu(cpu, cpu_smt_mask(core)) {
> > > > @@ -6083,7 +6085,7 @@ void __update_idle_core(struct rq *rq)
> > > >                         goto unlock;
> > > >         }
> > > >
> > > > -       set_idle_cores(core, 1);
> > > > +       set_idle_core(core, per_cpu(smt_id, core));
> > > >  unlock:
> > > >         rcu_read_unlock();
> > > >  }
> > > > @@ -6091,7 +6093,7 @@ void __update_idle_core(struct rq *rq)
> > > >  /*
> > > >   * Scan the entire LLC domain for idle cores; this dynamically switches off if
> > > >   * there are no idle cores left in the system; tracked through
> > > > - * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
> > > > + * sd_llc->shared->idle_core and enabled through update_idle_core() above.
> > > >   */
> > > >  static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> > > >  {
> > > > @@ -6144,11 +6146,11 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
> > > >
> > > >  #else /* CONFIG_SCHED_SMT */
> > > >
> > > > -static inline void set_idle_cores(int cpu, int val)
> > > > +static inline void set_idle_core(int cpu, int val)
> > > >  {
> > > >  }
> > > >
> > > > -static inline bool test_idle_cores(int cpu, bool def)
> > > > +static inline bool get_idle_core(int cpu, int def)
> > > >  {
> > > >         return def;
> > > >  }
> > > > @@ -6170,10 +6172,11 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
> > > >   * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> > > >   * average idle time for this rq (as found in rq->avg_idle).
> > > >   */
> > > > -static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
> > > > +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int idle_core, int target)
> > > >  {
> > > >         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> > > >         int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > > > +       bool has_idle_core = (idle_core != -1);
> > > >         int this = smp_processor_id();
> > > >         struct sched_domain *this_sd;
> > > >         u64 time;
> > > > @@ -6206,8 +6209,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > > >         for_each_cpu_wrap(cpu, cpus, target) {
> > > >                 if (has_idle_core) {
> > > >                         i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > > > -                       if ((unsigned int)i < nr_cpumask_bits)
> > > > +                       if ((unsigned int)i < nr_cpumask_bits) {
> > > > +#ifdef CONFIG_SCHED_SMT
> > > > +                               if ((per_cpu(smt_id, i)) == idle_core)
> > > > +                                       set_idle_core(i, -1);
> > > > +#endif
> > >
> > > CPUA-core0 enters idle
> > > All other CPUs of core0 are already idle
> > > set idle_core = core0
> > > CPUB-core1 enters idle
> > > All other CPUs of core1 are already idle so core1 becomes idle
> > >
> > > A task wakes up and select_idle_core returns CPUA-core0
> > > then idle_core=-1
> > >
> > > At the next wakeup, we skip select_idle_core() even though core1 is idle
> > >
> > > Do I miss something ?
> > >
> >
> > You are right, but this is similar to what we currently do anyway. Even
> > without this patch, we go ahead and unconditionally set the idle-core to -1
> > (we don't even have a way to check whether the selected CPU came from an
> > idle-core). Please see the hunk I removed below.
> 
> The current race window is limited to the interval between
> select_idle_core() not finding an idle core and the call to
> set_idle_cores(this, false) later, at the end of select_idle_cpu().
> 

However, let's say there was only one idle-core and select_idle_core()
finds it: we still don't reset has_idle_cores. So on the next wakeup, we
end up iterating through the whole LLC just to discover whether we have an
idle-core.

Also, even if there were more than one idle-core in the LLC but the task
had a restricted cpus_allowed list and hence could not use any of them, we
still go ahead and reset the idle-core.
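
For reference, a condensed sketch of the mainline flow being discussed
(not a verbatim copy of select_idle_cpu()); the scan is restricted to the
task's affinity mask, while has_idle_cores is an LLC-wide flag:

	/* cpus = LLC span intersected with the task's affinity */
	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	for_each_cpu_wrap(cpu, cpus, target) {
		if (has_idle_core) {
			i = select_idle_core(p, cpu, cpus, &idle_cpu);
			if ((unsigned int)i < nr_cpumask_bits)
				return i;	/* flag stays set, even if this
						 * was the only idle core */
		} else {
			/* plain per-CPU scan, bounded by nr */
		}
	}

	if (has_idle_core)
		set_idle_cores(this, false);	/* nothing found in the *restricted*
						 * mask: the LLC-wide flag is cleared
						 * although an idle core may exist
						 * outside p->cpus_ptr */

Both cases above fall out of this shape: a successful find leaves the flag
set, and a pinned task's failed scan clears it for everyone.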

> In your proposal, the race is no longer limited in time. As soon as the
> first core that went idle and set idle_core is itself selected by
> select_idle_core(), the idle_core information is broken.
> 

Yes, but with the next patch, as soon as a CPU within this LLC goes idle,
it will search for and set the right idle-core.
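
One possible shape of that follow-up (patch 3/8, "sched/fair: Update
idle-core more often") is sketched below. This is only a guess at the idea
for illustration, not the posted patch; the actual change may instead
rescan the LLC:

	void __update_idle_core(struct rq *rq)
	{
		int core = cpu_of(rq);
		int cpu;

		rcu_read_lock();
		/*
		 * Unlike the current code, do not bail out early when some
		 * idle_core is already recorded: a core that just became
		 * fully idle simply records itself.
		 */
		for_each_cpu(cpu, cpu_smt_mask(core)) {
			if (cpu == core)
				continue;
			if (!available_idle_cpu(cpu))
				goto unlock;
		}
		set_idle_core(core, per_cpu(smt_id, core));
	unlock:
		rcu_read_unlock();
	}

That would narrow the window in which idle_core points at a core that has
since been picked, at the cost of a few extra writes on idle entry.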

> >
> > I have tried to improve upon this for the next iteration, but we are again
> > seeing somewhat higher utilization, probably because of that change.
> >
> > I plan to move to a cpumask-based approach in v4, with which we no longer
> > have to search the LLC in order to record an idle-core, yet we still know
> > whether any idle-cores are around. However, that will have the extra penalty
> > of the atomic operations that you commented on in one of my patches.
> >
> > But if you have other ideas, I would be willing to try them out.
> >
> > >
> > >
> > > >                                 return i;
> > > > +                       }
> > > >
> > > >                 } else {
> > > >                         if (!--nr)
> > > > @@ -6218,9 +6226,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > > >                 }
> > > >         }
> > > >
> > > > -       if (has_idle_core)
> > > > -               set_idle_cores(this, false);
> > > > -
> >
> > I was referring to this hunk.
> >
> > > >         if (sched_feat(SIS_PROP) && !has_idle_core) {
> > > >                 time = cpu_clock(this) - time;
> > > >                 update_avg(&this_sd->avg_scan_cost, time);
> > > > @@ -6276,10 +6281,9 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
> > > >   */
> > > >  static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > > >  {
> > > > -       bool has_idle_core = false;
> > > > +       int i, recent_used_cpu, idle_core = -1;
> > > >         struct sched_domain *sd;
> > > >         unsigned long task_util;
> > > > -       int i, recent_used_cpu;
> > > >
> > > >         /*
> > > >          * On asymmetric system, update task utilization because we will check
> > > > @@ -6357,16 +6361,16 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > > >                 return target;
> > > >
> > > >         if (sched_smt_active()) {
> > > > -               has_idle_core = test_idle_cores(target, false);
> > > > +               idle_core = get_idle_core(target, -1);
> > > >
> > > > -               if (!has_idle_core && cpus_share_cache(prev, target)) {
> > > > +               if (idle_core < 0 && cpus_share_cache(prev, target)) {
> > > >                         i = select_idle_smt(p, sd, prev);
> > > >                         if ((unsigned int)i < nr_cpumask_bits)
> > > >                                 return i;
> > > >                 }
> > > >         }
> > > >
> > > > -       i = select_idle_cpu(p, sd, has_idle_core, target);
> > > > +       i = select_idle_cpu(p, sd, idle_core, target);
> > > >         if ((unsigned)i < nr_cpumask_bits)
> > > >                 return i;
> > > >
> > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > > index a189bec13729..22fbb50b036e 100644
> > > > --- a/kernel/sched/sched.h
> > > > +++ b/kernel/sched/sched.h
> > > > @@ -1491,6 +1491,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
> > > >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > > >  DECLARE_PER_CPU(int, sd_llc_size);
> > > >  DECLARE_PER_CPU(int, sd_llc_id);
> > > > +#ifdef CONFIG_SCHED_SMT
> > > > +DECLARE_PER_CPU(int, smt_id);
> > > > +#endif
> > > >  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > > >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > > >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > index 55a0a243e871..232fb261dfc2 100644
> > > > --- a/kernel/sched/topology.c
> > > > +++ b/kernel/sched/topology.c
> > > > @@ -644,6 +644,9 @@ static void destroy_sched_domains(struct sched_domain *sd)
> > > >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > > >  DEFINE_PER_CPU(int, sd_llc_size);
> > > >  DEFINE_PER_CPU(int, sd_llc_id);
> > > > +#ifdef CONFIG_SCHED_SMT
> > > > +DEFINE_PER_CPU(int, smt_id);
> > > > +#endif
> > > >  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > > >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > > >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > > > @@ -667,6 +670,9 @@ static void update_top_cache_domain(int cpu)
> > > >         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> > > >         per_cpu(sd_llc_size, cpu) = size;
> > > >         per_cpu(sd_llc_id, cpu) = id;
> > > > +#ifdef CONFIG_SCHED_SMT
> > > > +       per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
> > > > +#endif
> > > >         rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> > > >
> > > >         sd = lowest_flag_domain(cpu, SD_NUMA);
> > > > @@ -1497,6 +1503,7 @@ sd_init(struct sched_domain_topology_level *tl,
> > > >                 sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> > > >                 atomic_inc(&sd->shared->ref);
> > > >                 atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> > > > +               sd->shared->idle_core = -1;
> > > >         }
> > > >
> > > >         sd->private = sdd;
> > > > --
> > > > 2.18.2
> > > >
> >
> > --
> > Thanks and Regards
> > Srikar Dronamraju

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v3 2/8] sched/fair: Maintain the identity of idle-core
  2021-05-22 14:10         ` Srikar Dronamraju
@ 2021-05-25  7:11           ` Vincent Guittot
  0 siblings, 0 replies; 18+ messages in thread
From: Vincent Guittot @ 2021-05-25  7:11 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah, Aubrey Li

On Sat, 22 May 2021 at 16:11, Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> * Vincent Guittot <vincent.guittot@linaro.org> [2021-05-22 14:42:00]:
>
> > On Fri, 21 May 2021 at 15:31, Srikar Dronamraju
> > <srikar@linux.vnet.ibm.com> wrote:
> > >
> > > * Vincent Guittot <vincent.guittot@linaro.org> [2021-05-21 14:36:15]:
> > >
> > > > On Thu, 13 May 2021 at 09:40, Srikar Dronamraju
> > > > <srikar@linux.vnet.ibm.com> wrote:

...

> > > > > +#endif
> > > >
> > > > CPUA-core0 enters idle
> > > > All other CPUs of core0 are already idle
> > > > set idle_core = core0
> > > > CPUB-core1 enters idle
> > > > All other CPUs of core1 are already idle so core1 becomes idle
> > > >
> > > > A task wakes up and select_idle_core returns CPUA-core0
> > > > then idle_core=-1
> > > >
> > > > At the next wakeup, we skip select_idle_core() even though core1 is idle
> > > >
> > > > Do I miss something ?
> > > >
> > >
> > > You are right, but this is similar to what we currently do anyway. Even
> > > without this patch, we go ahead and unconditionally set the idle-core to -1
> > > (we don't even have a way to check whether the selected CPU came from an
> > > idle-core). Please see the hunk I removed below.
> >
> > The current race window is limited to the interval between
> > select_idle_core() not finding an idle core and the call to
> > set_idle_cores(this, false) later, at the end of select_idle_cpu().
> >
>
> However lets say there was only one idle-core, and select_idle_core() finds
> this idle-core, it just doesn't reset has-idle-core. So on the next wakeup,
> we end up iterating through the whole LLC to find if we have an idle-core.

Yes, the current algorithm clears the idle-core flag only when it has not
been able to find one, in order to stay cheap in the fast wakeup path.


>
> Also, even if there were more than one idle-core in the LLC but the task
> had a restricted cpus_allowed list and hence could not use any of them, we
> still go ahead and reset the idle-core.
>
> > In your proposal, the race is no longer limited in time. As soon as the
> > first core that went idle and set idle_core is itself selected by
> > select_idle_core(), the idle_core information is broken.
> >
>
> Yes, but with the next patch, as soon as a CPU within this LLC goes idle,
> it will search for and set the right idle-core.
>
> > >
> > > I have tried to improve upon this for the next iteration, but we are again
> > > seeing somewhat higher utilization, probably because of that change.
> > >
> > > I plan to move to a cpumask-based approach in v4, with which we no longer
> > > have to search the LLC in order to record an idle-core, yet we still know
> > > whether any idle-cores are around. However, that will have the extra penalty
> > > of the atomic operations that you commented on in one of my patches.
> > >
> > > But if you have other ideas, I would be willing to try them out.
> > >
> > > >
> > > >
> > > > >                                 return i;
> > > > > +                       }
> > > > >
> > > > >                 } else {
> > > > >                         if (!--nr)
> > > > > @@ -6218,9 +6226,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > > > >                 }
> > > > >         }
> > > > >
> > > > > -       if (has_idle_core)
> > > > > -               set_idle_cores(this, false);
> > > > > -
> > >
> > > I was referring to this hunk.
> > >
> > > > >         if (sched_feat(SIS_PROP) && !has_idle_core) {
> > > > >                 time = cpu_clock(this) - time;
> > > > >                 update_avg(&this_sd->avg_scan_cost, time);
> > > > > @@ -6276,10 +6281,9 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
> > > > >   */
> > > > >  static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > > > >  {
> > > > > -       bool has_idle_core = false;
> > > > > +       int i, recent_used_cpu, idle_core = -1;
> > > > >         struct sched_domain *sd;
> > > > >         unsigned long task_util;
> > > > > -       int i, recent_used_cpu;
> > > > >
> > > > >         /*
> > > > >          * On asymmetric system, update task utilization because we will check
> > > > > @@ -6357,16 +6361,16 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > > > >                 return target;
> > > > >
> > > > >         if (sched_smt_active()) {
> > > > > -               has_idle_core = test_idle_cores(target, false);
> > > > > +               idle_core = get_idle_core(target, -1);
> > > > >
> > > > > -               if (!has_idle_core && cpus_share_cache(prev, target)) {
> > > > > +               if (idle_core < 0 && cpus_share_cache(prev, target)) {
> > > > >                         i = select_idle_smt(p, sd, prev);
> > > > >                         if ((unsigned int)i < nr_cpumask_bits)
> > > > >                                 return i;
> > > > >                 }
> > > > >         }
> > > > >
> > > > > -       i = select_idle_cpu(p, sd, has_idle_core, target);
> > > > > +       i = select_idle_cpu(p, sd, idle_core, target);
> > > > >         if ((unsigned)i < nr_cpumask_bits)
> > > > >                 return i;
> > > > >
> > > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > > > index a189bec13729..22fbb50b036e 100644
> > > > > --- a/kernel/sched/sched.h
> > > > > +++ b/kernel/sched/sched.h
> > > > > @@ -1491,6 +1491,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
> > > > >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > > > >  DECLARE_PER_CPU(int, sd_llc_size);
> > > > >  DECLARE_PER_CPU(int, sd_llc_id);
> > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > +DECLARE_PER_CPU(int, smt_id);
> > > > > +#endif
> > > > >  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > > > >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > > > >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > > index 55a0a243e871..232fb261dfc2 100644
> > > > > --- a/kernel/sched/topology.c
> > > > > +++ b/kernel/sched/topology.c
> > > > > @@ -644,6 +644,9 @@ static void destroy_sched_domains(struct sched_domain *sd)
> > > > >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > > > >  DEFINE_PER_CPU(int, sd_llc_size);
> > > > >  DEFINE_PER_CPU(int, sd_llc_id);
> > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > +DEFINE_PER_CPU(int, smt_id);
> > > > > +#endif
> > > > >  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > > > >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > > > >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > > > > @@ -667,6 +670,9 @@ static void update_top_cache_domain(int cpu)
> > > > >         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> > > > >         per_cpu(sd_llc_size, cpu) = size;
> > > > >         per_cpu(sd_llc_id, cpu) = id;
> > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > +       per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
> > > > > +#endif
> > > > >         rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> > > > >
> > > > >         sd = lowest_flag_domain(cpu, SD_NUMA);
> > > > > @@ -1497,6 +1503,7 @@ sd_init(struct sched_domain_topology_level *tl,
> > > > >                 sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> > > > >                 atomic_inc(&sd->shared->ref);
> > > > >                 atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> > > > > +               sd->shared->idle_core = -1;
> > > > >         }
> > > > >
> > > > >         sd->private = sdd;
> > > > > --
> > > > > 2.18.2
> > > > >
> > >
> > > --
> > > Thanks and Regards
> > > Srikar Dronamraju
>
> --
> Thanks and Regards
> Srikar Dronamraju


end of thread, other threads:[~2021-05-25  7:11 UTC | newest]

Thread overview: 18+ messages
2021-05-13  7:40 [PATCH v3 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
2021-05-13  7:40 ` [PATCH v3 1/8] sched/fair: Update affine statistics when needed Srikar Dronamraju
2021-05-13  7:40 ` [PATCH v3 2/8] sched/fair: Maintain the identity of idle-core Srikar Dronamraju
2021-05-21 12:36   ` Vincent Guittot
2021-05-21 13:31     ` Srikar Dronamraju
2021-05-22 12:42       ` Vincent Guittot
2021-05-22 14:10         ` Srikar Dronamraju
2021-05-25  7:11           ` Vincent Guittot
2021-05-13  7:40 ` [PATCH v3 3/8] sched/fair: Update idle-core more often Srikar Dronamraju
2021-05-13  7:40 ` [PATCH v3 4/8] sched/fair: Prefer idle CPU to cache affinity Srikar Dronamraju
2021-05-13  7:40 ` [PATCH v3 5/8] sched/fair: Use affine_idler_llc for wakeups across LLC Srikar Dronamraju
2021-05-13  7:40 ` [PATCH v3 6/8] sched/idle: Move busy_cpu accounting to idle callback Srikar Dronamraju
2021-05-21 12:37   ` Vincent Guittot
2021-05-21 13:21     ` Srikar Dronamraju
2021-05-13  7:40 ` [PATCH v3 7/8] sched/fair: Remove ifdefs in waker_affine_idler_llc Srikar Dronamraju
2021-05-13  7:40 ` [PATCH v3 8/8] sched/fair: Dont iterate if no idle CPUs Srikar Dronamraju
2021-05-19  9:36 ` [PATCH v3 0/8] sched/fair: wake_affine improvements Mel Gorman
2021-05-19 16:55   ` Srikar Dronamraju
