* [PATCH 00/10] sched/fair: wake_affine improvements
@ 2021-04-22 10:23 Srikar Dronamraju
  2021-04-22 10:23 ` [PATCH 01/10] sched/fair: Update affine statistics when needed Srikar Dronamraju
                   ` (11 more replies)
  0 siblings, 12 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-22 10:23 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Michael Ellerman, Gautham R Shenoy, Parth Shah

Recently we found that some of the benchmark numbers on Power10 were lower
than expected. Analysis showed that the problem lies in the fact that the
L2-cache on Power10 is at the core level, i.e. only 4 threads share the L2-cache.

One probable solution to the problem was worked on by Gautham, where he posted
a patch that marks the MC domain as the LLC:
http://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/t/#u

Here the focus is on seeing if we can improve the current core scheduler's
wakeup mechanism by looking at idle-cores and nr_busy_cpus, which are already
maintained per last-level cache (aka LLC) (first 8 patches), and on exploring
the possibility of providing a fallback LLC domain that can be preferred if
the current LLC is busy (last 2 patches).
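
As a rough illustration, the LLC comparison driving the first 8 patches boils
down to something like the sketch below. This is a simplified sketch for the
cover letter only (sd_llc_shared, sd_llc_size, nr_busy_cpus and the idle_core
field are the ones used/introduced later in this series; pick_llc_sketch is an
illustrative name, and the caller is assumed to hold the RCU read lock). It is
not the exact code in the patches:

	/*
	 * Simplified sketch: prefer the LLC that has an idle core; failing
	 * that, prefer the LLC with the smaller busy ratio. Returns -1 if
	 * neither LLC is a clear winner.
	 */
	static int pick_llc_sketch(int this_cpu, int prev_cpu)
	{
		struct sched_domain_shared *tsds, *psds;
		int tnr_busy, pnr_busy, tllc_size, pllc_size;

		tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
		psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
		if (!tsds || !psds)
			return -1;

		/* An idle core in either LLC wins outright. */
		if (tsds->idle_core != -1)
			return this_cpu;
		if (psds->idle_core != -1)
			return prev_cpu;

		tnr_busy = atomic_read(&tsds->nr_busy_cpus);
		pnr_busy = atomic_read(&psds->nr_busy_cpus);
		tllc_size = per_cpu(sd_llc_size, this_cpu);
		pllc_size = per_cpu(sd_llc_size, prev_cpu);

		/* Compare pnr_busy/pllc_size vs tnr_busy/tllc_size without dividing. */
		if (pnr_busy * tllc_size > tnr_busy * pllc_size)
			return this_cpu;
		if (pnr_busy * tllc_size < tnr_busy * pllc_size)
			return prev_cpu;

		return -1;
	}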

Except for the last 2 patches, the rest of the patches should work
independently of the other proposed solution, i.e. if the mc-llc patch is
accepted, then the last two patches may not be needed for Power10. However,
they may still be helpful for other archs/platforms.

In the fallback approach, we look for a one-to-one mapping for each LLC.
However, this can easily be modified to look at all LLCs in the current
LLC's parent. Also, the fallback is only used for sync wakeups, because that
is where we expect the maximum benefit from moving the task closer to the
waker. For non-sync wakeups, it is expected that a CPU in the previous LLC
may be better off.
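
A minimal sketch of when the fallback LLC would even be considered, mirroring
the checks in patch 9 (fallback_llc_id is the field added there;
may_use_fallback_llc is an illustrative name, not the exact code in the
patches):

	/*
	 * Fallback is only attempted for sync wakeups, only if a fallback
	 * LLC was set up at topology build time, only if the previous CPU
	 * is not already within the waker LLC's parent domain, and only if
	 * the task is allowed to run there.
	 */
	static bool may_use_fallback_llc(struct task_struct *p, struct sched_domain *sd,
					 struct sched_domain_shared *tsds,
					 int prev_cpu, int sync)
	{
		int fcpu = tsds->fallback_llc_id;

		if (!sync || fcpu == -1)
			return false;
		if (cpumask_test_cpu(prev_cpu, sched_domain_span(sd->parent)))
			return false;
		return cpumask_test_cpu(fcpu, p->cpus_ptr);
	}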

Please review and provide your feedback.

Benchmark numbers are from Power10, but I have verified that we don't
regress on a Power9 setup.

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  8
Core(s) per socket:  10
Socket(s):           1
NUMA node(s):        1
Model:               1.0 (pvr 0080 0100)
Model name:          POWER10 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           64K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8K
NUMA node2 CPU(s):   0-79

Hackbench: (latency, lower is better)

v5.12-rc5
instances = 1, min = 24.102529 usecs/op, median =  usecs/op, max = 24.102529 usecs/op
instances = 2, min = 24.096112 usecs/op, median = 24.096112 usecs/op, max = 24.178903 usecs/op
instances = 4, min = 24.080541 usecs/op, median = 24.082990 usecs/op, max = 24.166873 usecs/op
instances = 8, min = 24.088969 usecs/op, median = 24.116081 usecs/op, max = 24.199853 usecs/op
instances = 16, min = 24.267228 usecs/op, median = 26.204510 usecs/op, max = 29.218360 usecs/op
instances = 32, min = 30.680071 usecs/op, median = 32.911664 usecs/op, max = 37.380470 usecs/op
instances = 64, min = 43.908331 usecs/op, median = 44.454343 usecs/op, max = 46.210298 usecs/op
instances = 80, min = 44.585754 usecs/op, median = 56.738546 usecs/op, max = 60.625826 usecs/op

v5.12-rc5 + mc-llc
instances = 1, min = 18.676505 usecs/op, median =  usecs/op, max = 18.676505 usecs/op
instances = 2, min = 18.488627 usecs/op, median = 18.488627 usecs/op, max = 18.574946 usecs/op
instances = 4, min = 18.428399 usecs/op, median = 18.589051 usecs/op, max = 18.872548 usecs/op
instances = 8, min = 18.597389 usecs/op, median = 18.783815 usecs/op, max = 19.265532 usecs/op
instances = 16, min = 21.922350 usecs/op, median = 22.737792 usecs/op, max = 24.832429 usecs/op
instances = 32, min = 29.770446 usecs/op, median = 31.996687 usecs/op, max = 34.053042 usecs/op
instances = 64, min = 53.067842 usecs/op, median = 53.295139 usecs/op, max = 53.473059 usecs/op
instances = 80, min = 44.423288 usecs/op, median = 44.713767 usecs/op, max = 45.159761 usecs/op

v5.12-rc5 + this patchset
instances = 1, min = 19.368805 usecs/op, median =  usecs/op, max = 19.368805 usecs/op
instances = 2, min = 19.423674 usecs/op, median = 19.423674 usecs/op, max = 19.506203 usecs/op
instances = 4, min = 19.454523 usecs/op, median = 19.596947 usecs/op, max = 19.863620 usecs/op
instances = 8, min = 20.005272 usecs/op, median = 20.239924 usecs/op, max = 20.878947 usecs/op
instances = 16, min = 21.856779 usecs/op, median = 24.102147 usecs/op, max = 25.496110 usecs/op
instances = 32, min = 31.460159 usecs/op, median = 32.809621 usecs/op, max = 33.939650 usecs/op
instances = 64, min = 39.506553 usecs/op, median = 43.835221 usecs/op, max = 45.645505 usecs/op
instances = 80, min = 43.805716 usecs/op, median = 44.314757 usecs/op, max = 48.910236 usecs/op

Summary:
mc-llc and this patchset seem to be performing much better than vanilla v5.12-rc5

DayTrader (throughput, higher is better)
		     v5.12-rc5   v5.12-rc5     v5.12-rc5
                                 + mc-llc      + patchset
64CPUs/1JVM/ 60Users  6373.7      7520.5        7232.3
64CPUs/1JVM/ 80Users  6742.1      7940.9        7732.8
64CPUs/1JVM/100Users  6482.2      7730.3        7540
64CPUs/2JVM/ 60Users  6335        8081.6        7914.3
64CPUs/2JVM/ 80Users  6360.8      8259.6        8138.6
64CPUs/2JVM/100Users  6215.6      8046.5        8039.2
64CPUs/4JVM/ 60Users  5385.4      7685.3        7706.1
64CPUs/4JVM/ 80Users  5380.8      7753.3        7721.5
64CPUs/4JVM/100Users  5275.2      7549.2        7608.3

Summary: Across all profiles, this patchset or mc-llc outperform
vanilla v5.12-rc5.
Note: Only 64 CPUs were online during this test.

schbench (latency, lower is better)
======== Running schbench -m 3 -r 30 =================
Latency percentiles (usec) runtime 10 (s) (2545 total samples)
v5.12-rc5                  |  v5.12-rc5 + mc-llc                 | v5.12-rc5 + patchset

50.0th: 56 (1301 samples)  |     50.0th: 49 (1309 samples)       | 50.0th: 50 (1310 samples)
75.0th: 76 (623 samples)   |     75.0th: 66 (628 samples)        | 75.0th: 68 (632 samples)
90.0th: 93 (371 samples)   |     90.0th: 78 (371 samples)        | 90.0th: 80 (354 samples)
95.0th: 107 (123 samples)  |     95.0th: 87 (117 samples)        | 95.0th: 86 (126 samples)
*99.0th: 12560 (102 samples)|    *99.0th: 100 (97 samples)        | *99.0th: 103 (97 samples)
99.5th: 15312 (14 samples) |     99.5th: 104 (12 samples)        | 99.5th: 1202 (13 samples)
99.9th: 19936 (9 samples)  |     99.9th: 106 (8 samples)         | 99.9th: 14992 (10 samples)
min=13, max=20684          |     min=15, max=113                 | min=15, max=18721

Latency percentiles (usec) runtime 20 (s) (7649 total samples)

50.0th: 51 (3884 samples)  |     50.0th: 50 (3935 samples)       | 50.0th: 49 (3841 samples)
75.0th: 69 (1859 samples)  |     75.0th: 66 (1817 samples)       | 75.0th: 67 (1965 samples)
90.0th: 87 (1173 samples)  |     90.0th: 80 (1204 samples)       | 90.0th: 78 (1134 samples)
95.0th: 97 (368 samples)   |     95.0th: 87 (342 samples)        | 95.0th: 83 (359 samples)
*99.0th: 8624 (290 samples)|     *99.0th: 98 (294 samples)       | *99.0th: 93 (296 samples)
99.5th: 11344 (37 samples) |     99.5th: 102 (37 samples)        | 99.5th: 98 (34 samples)
99.9th: 18592 (31 samples) |     99.9th: 106 (30 samples)        | 99.9th: 7544 (28 samples)
min=13, max=20684          |     min=12, max=113                 | min=13, max=18721

Latency percentiles (usec) runtime 30 (s) (12785 total samples)

50.0th: 50 (6614 samples)  |     50.0th: 49 (6544 samples)       | 50.0th: 48 (6527 samples)
75.0th: 67 (3059 samples)  |     75.0th: 65 (3100 samples)       | 75.0th: 64 (3143 samples)
90.0th: 84 (1894 samples)  |     90.0th: 79 (1912 samples)       | 90.0th: 76 (1985 samples)
95.0th: 94 (586 samples)   |     95.0th: 87 (646 samples)        | 95.0th: 81 (585 samples)
*99.0th: 8304 (507 samples)|     *99.0th: 101 (496 samples)      | *99.0th: 90 (453 samples)
99.5th: 11696 (62 samples) |     99.5th: 104 (45 samples)        | 99.5th: 94 (66 samples)
99.9th: 18592 (51 samples) |     99.9th: 110 (51 samples)        | 99.9th: 1202 (49 samples)
min=12, max=21421          |     min=1, max=126                  | min=3, max=18721

Summary:
mc-llc is the best option, but this patchset also helps compared to vanilla v5.12-rc5


mongodb (threads=6) (throughput, higher is better)
					 Throughput         read        clean      update
					                    latency     latency    latency
v5.12-rc5            JVM=YCSB_CLIENTS=14  68116.05 ops/sec   1109.82 us  944.19 us  1342.29 us
v5.12-rc5            JVM=YCSB_CLIENTS=21  64802.69 ops/sec   1772.64 us  944.69 us  2099.57 us
v5.12-rc5            JVM=YCSB_CLIENTS=28  61792.78 ops/sec   2490.48 us  930.09 us  2928.03 us
v5.12-rc5            JVM=YCSB_CLIENTS=35  59604.44 ops/sec   3236.86 us  870.28 us  3787.48 us

v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=14  70948.51 ops/sec   1060.21 us  842.02 us  1289.44 us
v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=21  68732.48 ops/sec   1669.91 us  871.57 us  1975.19 us
v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=28  66674.81 ops/sec   2313.79 us  889.59 us  2702.36 us
v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=35  64397.51 ops/sec   3010.66 us  966.28 us  3484.19 us

v5.12-rc5 + patchset JVM=YCSB_CLIENTS=14  67403.29 ops/sec   1121.80 us  797.81 us  1357.28 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=21  63952.79 ops/sec   1792.86 us  779.59 us  2130.54 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=28  62198.83 ops/sec   2469.60 us  780.00 us  2914.48 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=35  60333.81 ops/sec   3192.41 us  822.09 us  3748.24 us

Summary:
mc-llc outperforms; this patchset and upstream give almost similar performance.


Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>

Srikar Dronamraju (10):
  sched/fair: Update affine statistics when needed
  sched/fair: Maintain the identity of idle-core
  sched/fair: Update idle-core more often
  sched/fair: Prefer idle CPU to cache affinity
  sched/fair: Call wake_affine only if necessary
  sched/idle: Move busy_cpu accounting to idle callback
  sched/fair: Remove ifdefs in waker_affine_idler_llc
  sched/fair: Dont iterate if no idle CPUs
  sched/topology: Introduce fallback LLC
  powerpc/smp: Add fallback flag to powerpc MC domain

 arch/powerpc/kernel/smp.c      |   7 +-
 include/linux/sched/sd_flags.h |   7 +
 include/linux/sched/topology.h |   3 +-
 kernel/sched/fair.c            | 229 +++++++++++++++++++++++++++------
 kernel/sched/features.h        |   1 +
 kernel/sched/idle.c            |  33 ++++-
 kernel/sched/sched.h           |   6 +
 kernel/sched/topology.c        |  54 +++++++-
 8 files changed, 296 insertions(+), 44 deletions(-)

-- 
2.18.2



* [PATCH 01/10] sched/fair: Update affine statistics when needed
  2021-04-22 10:23 [PATCH 00/10] sched/fair: wake_affine improvements Srikar Dronamraju
@ 2021-04-22 10:23 ` Srikar Dronamraju
  2021-04-22 10:23 ` [PATCH 02/10] sched/fair: Maintain the identity of idle-core Srikar Dronamraju
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-22 10:23 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

wake_affine_idle() can return prev_cpu. Even in such a scenario, the
scheduler was going ahead and updating the wake-affine schedstats, i.e. even
if the task is not moved across LLC domains, schedstats would be accounted.

Hence add a check before updating schedstats.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..a258a84cfdfd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5884,8 +5884,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (target == nr_cpumask_bits)
 		return prev_cpu;
 
-	schedstat_inc(sd->ttwu_move_affine);
-	schedstat_inc(p->se.statistics.nr_wakeups_affine);
+	if (!cpus_share_cache(prev_cpu, target)) {
+		schedstat_inc(sd->ttwu_move_affine);
+		schedstat_inc(p->se.statistics.nr_wakeups_affine);
+	}
 	return target;
 }
 
-- 
2.18.2



* [PATCH 02/10] sched/fair: Maintain the identity of idle-core
  2021-04-22 10:23 [PATCH 00/10] sched/fair: wake_affine improvements Srikar Dronamraju
  2021-04-22 10:23 ` [PATCH 01/10] sched/fair: Update affine statistics when needed Srikar Dronamraju
@ 2021-04-22 10:23 ` Srikar Dronamraju
  2021-04-22 10:23 ` [PATCH 03/10] sched/fair: Update idle-core more often Srikar Dronamraju
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-22 10:23 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

The scheduler maintains per-LLC information that tells if there is any idle
core in that LLC. However, this information doesn't say which core is idle.

So when iterating for idle-cores, if select_idle_core() finds an
idle-core, it doesn't try to reset this information.

So if there was only one idle core in the LLC and select_idle_core()
selected that idle-core, the LLC will still claim that it has an
idle-core.

Conversely, if a task is pinned and has a restricted cpus_allowed_list,
and the LLC has multiple idle-cores but select_idle_core() cannot find an
idle-core in the task's allowed mask, the LLC will no longer claim that it
has an idle-core.

As a first step towards solving this problem, the LLC will maintain the
identity of the idle core instead of just the information that it has an idle core.

Along with maintaining the identity, this change solves both the problems listed
above. However, there are other problems with the current
infrastructure that will continue to exist with this change; those
will be handled in subsequent patches.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/sched/topology.h |  2 +-
 kernel/sched/fair.c            | 43 +++++++++++++++++++---------------
 kernel/sched/sched.h           |  3 +++
 kernel/sched/topology.c        |  7 ++++++
 4 files changed, 35 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..285165a35f21 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -73,7 +73,7 @@ struct sched_group;
 struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
-	int		has_idle_cores;
+	int		idle_core;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a258a84cfdfd..03083eacdaf0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1563,11 +1563,11 @@ numa_type numa_classify(unsigned int imbalance_pct,
 
 #ifdef CONFIG_SCHED_SMT
 /* Forward declarations of select_idle_sibling helpers */
-static inline bool test_idle_cores(int cpu, bool def);
+static inline int get_idle_core(int cpu, int def);
 static inline int numa_idle_core(int idle_core, int cpu)
 {
 	if (!static_branch_likely(&sched_smt_present) ||
-	    idle_core >= 0 || !test_idle_cores(cpu, false))
+	    idle_core >= 0 || get_idle_core(cpu, -1) == -1)
 		return idle_core;
 
 	/*
@@ -6015,29 +6015,31 @@ static inline int __select_idle_cpu(int cpu)
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
-static inline void set_idle_cores(int cpu, int val)
+static inline void set_idle_core(int cpu, int val)
 {
 	struct sched_domain_shared *sds;
 
 	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
 	if (sds)
-		WRITE_ONCE(sds->has_idle_cores, val);
+		WRITE_ONCE(sds->idle_core, val);
 }
 
-static inline bool test_idle_cores(int cpu, bool def)
+static inline int get_idle_core(int cpu, int def)
 {
 	struct sched_domain_shared *sds;
 
-	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-	if (sds)
-		return READ_ONCE(sds->has_idle_cores);
+	if (static_branch_likely(&sched_smt_present)) {
+		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+		if (sds)
+			return READ_ONCE(sds->idle_core);
+	}
 
 	return def;
 }
 
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
- * information in sd_llc_shared->has_idle_cores.
+ * information in sd_llc_shared->idle_core.
  *
  * Since SMT siblings share all cache levels, inspecting this limited remote
  * state should be fairly cheap.
@@ -6048,7 +6050,7 @@ void __update_idle_core(struct rq *rq)
 	int cpu;
 
 	rcu_read_lock();
-	if (test_idle_cores(core, true))
+	if (get_idle_core(core, 0) != -1)
 		goto unlock;
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -6059,7 +6061,7 @@ void __update_idle_core(struct rq *rq)
 			goto unlock;
 	}
 
-	set_idle_cores(core, 1);
+	set_idle_core(core, per_cpu(smt_id, core));
 unlock:
 	rcu_read_unlock();
 }
@@ -6067,7 +6069,7 @@ void __update_idle_core(struct rq *rq)
 /*
  * Scan the entire LLC domain for idle cores; this dynamically switches off if
  * there are no idle cores left in the system; tracked through
- * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
+ * sd_llc->shared->idle_core and enabled through update_idle_core() above.
  */
 static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
@@ -6102,11 +6104,11 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 
 #else /* CONFIG_SCHED_SMT */
 
-static inline void set_idle_cores(int cpu, int val)
+static inline void set_idle_core(int cpu, int val)
 {
 }
 
-static inline bool test_idle_cores(int cpu, bool def)
+static inline bool get_idle_core(int cpu, int def)
 {
 	return def;
 }
@@ -6127,7 +6129,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
-	bool smt = test_idle_cores(target, false);
+	int idle_core = get_idle_core(target, -1);
+	bool smt = (idle_core != -1);
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
 	u64 time;
@@ -6160,8 +6163,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (smt) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
-			if ((unsigned int)i < nr_cpumask_bits)
+			if ((unsigned int)i < nr_cpumask_bits) {
+#ifdef CONFIG_SCHED_SMT
+				if ((per_cpu(smt_id, i)) == idle_core)
+					set_idle_core(i, -1);
+#endif
 				return i;
+			}
 
 		} else {
 			if (!--nr)
@@ -6172,9 +6180,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		}
 	}
 
-	if (smt)
-		set_idle_cores(this, false);
-
 	if (sched_feat(SIS_PROP) && !smt) {
 		time = cpu_clock(this) - time;
 		update_avg(&this_sd->avg_scan_cost, time);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10a1522b1e30..46d40a281724 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1478,6 +1478,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+#ifdef CONFIG_SCHED_SMT
+DECLARE_PER_CPU(int, smt_id);
+#endif
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 09d35044bd88..8db40c8a6ad0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -644,6 +644,9 @@ static void destroy_sched_domains(struct sched_domain *sd)
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+#ifdef CONFIG_SCHED_SMT
+DEFINE_PER_CPU(int, smt_id);
+#endif
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -667,6 +670,9 @@ static void update_top_cache_domain(int cpu)
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
+#ifdef CONFIG_SCHED_SMT
+	per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
+#endif
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_NUMA);
@@ -1466,6 +1472,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+		sd->shared->idle_core = -1;
 	}
 
 	sd->private = sdd;
-- 
2.18.2



* [PATCH 03/10] sched/fair: Update idle-core more often
  2021-04-22 10:23 [PATCH 00/10] sched/fair: wake_affine improvements Srikar Dronamraju
  2021-04-22 10:23 ` [PATCH 01/10] sched/fair: Update affine statistics when needed Srikar Dronamraju
  2021-04-22 10:23 ` [PATCH 02/10] sched/fair: Maintain the identity of idle-core Srikar Dronamraju
@ 2021-04-22 10:23 ` Srikar Dronamraju
  2021-04-22 10:23 ` [PATCH 04/10] sched/fair: Prefer idle CPU to cache affinity Srikar Dronamraju
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-22 10:23 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

Currently, when the scheduler does a load balance and pulls a task, or
when a CPU picks up a task during wakeup without having to call
select_idle_cpu(), it never checks if the target CPU is part of the
idle-core. This makes the idle-core information less accurate.

Given that the identity of the idle-core for the LLC is maintained, it is easy
to update the idle-core as soon as the CPU picks up a task.

This change updates the idle-core whenever a CPU from the idle-core
picks up a task. However, if there are multiple idle-cores in the LLC
and the waking CPU happens to be part of the designated idle-core, the idle-core
is set to -1 (i.e. there are no idle-cores).

To reduce this case, whenever a CPU updates the idle-core, it will look for
other cores in the LLC for an idle-core, if the core to which it belongs
is not idle.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c  | 44 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/idle.c  |  6 ++++++
 kernel/sched/sched.h |  2 ++
 3 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03083eacdaf0..09c33cca0349 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6037,6 +6037,39 @@ static inline int get_idle_core(int cpu, int def)
 	return def;
 }
 
+static void set_next_idle_core(struct sched_domain *sd, int target)
+{
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	int core, cpu;
+
+	cpumask_andnot(cpus, sched_domain_span(sd), cpu_smt_mask(target));
+	for_each_cpu_wrap(core, cpus, target) {
+		bool idle = true;
+
+		for_each_cpu(cpu, cpu_smt_mask(core)) {
+			if (!available_idle_cpu(cpu)) {
+				idle = false;
+				break;
+			}
+		}
+
+		if (idle) {
+			set_idle_core(core, per_cpu(smt_id, core));
+			return;
+		}
+
+		cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+	}
+}
+
+void set_core_busy(int core)
+{
+	rcu_read_lock();
+	if (get_idle_core(core, -1) == per_cpu(smt_id, core))
+		set_idle_core(core, -1);
+	rcu_read_unlock();
+}
+
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
  * information in sd_llc_shared->idle_core.
@@ -6046,11 +6079,13 @@ static inline int get_idle_core(int cpu, int def)
  */
 void __update_idle_core(struct rq *rq)
 {
+	struct sched_domain *sd;
 	int core = cpu_of(rq);
 	int cpu;
 
 	rcu_read_lock();
-	if (get_idle_core(core, 0) != -1)
+	sd = rcu_dereference(per_cpu(sd_llc, core));
+	if (!sd || get_idle_core(core, 0) != -1)
 		goto unlock;
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -6058,10 +6093,15 @@ void __update_idle_core(struct rq *rq)
 			continue;
 
 		if (!available_idle_cpu(cpu))
-			goto unlock;
+			goto try_next;
 	}
 
 	set_idle_core(core, per_cpu(smt_id, core));
+	goto unlock;
+
+try_next:
+	set_next_idle_core(sd, core);
+
 unlock:
 	rcu_read_unlock();
 }
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 7199e6f23789..cc828f3efe71 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -425,6 +425,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
+#ifdef CONFIG_SCHED_SMT
+	int cpu = rq->cpu;
+
+	if (static_branch_likely(&sched_smt_present))
+		set_core_busy(cpu);
+#endif
 }
 
 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 46d40a281724..5c0bd4b0e73a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1102,6 +1102,7 @@ static inline bool is_migration_disabled(struct task_struct *p)
 
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
+extern void set_core_busy(int cpu);
 
 static inline void update_idle_core(struct rq *rq)
 {
@@ -1111,6 +1112,7 @@ static inline void update_idle_core(struct rq *rq)
 
 #else
 static inline void update_idle_core(struct rq *rq) { }
+static inline void set_core_busy(int cpu) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
-- 
2.18.2



* [PATCH 04/10] sched/fair: Prefer idle CPU to cache affinity
  2021-04-22 10:23 [PATCH 00/10] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (2 preceding siblings ...)
  2021-04-22 10:23 ` [PATCH 03/10] sched/fair: Update idle-core more often Srikar Dronamraju
@ 2021-04-22 10:23 ` Srikar Dronamraju
  2021-04-22 10:23 ` [PATCH 05/10] sched/fair: Call wake_affine only if necessary Srikar Dronamraju
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-22 10:23 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Michael Ellerman, Michael Neuling,
	Gautham R Shenoy, Parth Shah

Current order of preference to pick an LLC while waking a wake-affine
task:
1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is idle.

2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is less heavily loaded.

In the situation where both the waker and previous CPUs are busy, but
only one of their LLCs has an idle CPU, the scheduler may end up picking an LLC
with no idle CPUs. To mitigate this, add a method where the scheduler
compares idle CPUs in the waker and previous LLCs and picks the appropriate
one.

The new method looks at the idle-core to figure out an idle LLC. If there are
no idle LLCs, it compares the ratio of busy CPUs to the total number of
CPUs in the LLC. This method is only useful to compare 2 LLCs: if
the previous CPU and the waking CPU are in the same LLC, this method
would not be useful. For now the new method is disabled by default.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Based on similar posting:
http://lore.kernel.org/lkml/20210226164029.122432-1-srikar@linux.vnet.ibm.com/t/#u
Some comments in the next patch
- Make WA_WAKER default (Suggested by Rik) : done in next patch
- Make WA_WAKER check more conservative: (Suggested by Rik / Peter)
- Rename WA_WAKER to WA_IDLER_LLC (Suggested by Vincent)
- s/pllc_size/tllc_size while checking for busy case: (Pointed by Dietmar)
- Add rcu_read_lock and check for validity of shared domains
- Add idle-core support

 kernel/sched/fair.c     | 64 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 2 files changed, 65 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09c33cca0349..943621367a96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5869,6 +5869,67 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }
 
+static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_cpu, int sync)
+{
+#ifdef CONFIG_NO_HZ_COMMON
+	int pnr_busy, pllc_size, tnr_busy, tllc_size;
+#endif
+	struct sched_domain_shared *tsds, *psds;
+	int diff;
+
+	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
+	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
+	if (!tsds || !psds)
+		return nr_cpumask_bits;
+
+	if (sync) {
+		if (available_idle_cpu(this_cpu) || sched_idle_cpu(this_cpu))
+			return this_cpu;
+		if (tsds->idle_core != -1) {
+			if (cpumask_test_cpu(tsds->idle_core, p->cpus_ptr))
+				return tsds->idle_core;
+			return this_cpu;
+		}
+	}
+
+	if (available_idle_cpu(prev_cpu) || sched_idle_cpu(prev_cpu))
+		return prev_cpu;
+	if (psds->idle_core != -1) {
+		if (cpumask_test_cpu(psds->idle_core, p->cpus_ptr))
+			return psds->idle_core;
+		return prev_cpu;
+	}
+
+	if (!sync) {
+		if (available_idle_cpu(this_cpu) || sched_idle_cpu(this_cpu))
+			return this_cpu;
+		if (tsds->idle_core != -1) {
+			if (cpumask_test_cpu(tsds->idle_core, p->cpus_ptr))
+				return tsds->idle_core;
+			return this_cpu;
+		}
+	}
+
+#ifdef CONFIG_NO_HZ_COMMON
+	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
+	pnr_busy = atomic_read(&psds->nr_busy_cpus);
+
+	tllc_size = per_cpu(sd_llc_size, this_cpu);
+	pllc_size = per_cpu(sd_llc_size, prev_cpu);
+
+	if (pnr_busy == pllc_size && tnr_busy == tllc_size)
+		return nr_cpumask_bits;
+
+	diff = pnr_busy * tllc_size - tnr_busy * pllc_size;
+	if (diff > 0)
+		return this_cpu;
+	if (diff < 0)
+		return prev_cpu;
+#endif /* CONFIG_NO_HZ_COMMON */
+
+	return nr_cpumask_bits;
+}
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int this_cpu, int prev_cpu, int sync)
 {
@@ -5877,6 +5938,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (sched_feat(WA_IDLE))
 		target = wake_affine_idle(this_cpu, prev_cpu, sync);
 
+	if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits)
+		target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);
+
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1bc2b158fc51..c77349a47e01 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,6 +83,7 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
 SCHED_FEAT(WA_IDLE, true)
 SCHED_FEAT(WA_WEIGHT, true)
+SCHED_FEAT(WA_IDLER_LLC, false)
 SCHED_FEAT(WA_BIAS, true)
 
 /*
-- 
2.18.2



* [PATCH 05/10] sched/fair: Call wake_affine only if necessary
  2021-04-22 10:23 [PATCH 00/10] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (3 preceding siblings ...)
  2021-04-22 10:23 ` [PATCH 04/10] sched/fair: Prefer idle CPU to cache affinity Srikar Dronamraju
@ 2021-04-22 10:23 ` Srikar Dronamraju
  2021-04-22 10:23 ` [PATCH 06/10] sched/idle: Move busy_cpu accounting to idle callback Srikar Dronamraju
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-22 10:23 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Michael Ellerman, Gautham R Shenoy, Parth Shah

Currently, if a waker is the only running thread on its CPU and wakes a
wakee (both of them interacting over a pipe, aka sync wakeups) that was
previously running on a different LLC, then the wakee is pulled and
queued on the CPU that is running the waker.

This is true even if the previous CPU was completely idle. The rationale
is that the waker would soon relinquish the CPU and the wakee would benefit from
the cache access. However, if the previous CPU is idle, then the wakee thread
will incur the latency cost of migration to the waker CPU plus the latency cost
of the waker context switching.

Before the waker switches out, the load balancer could also kick in and pull
the wakee out, resulting in another migration cost. This will mostly hurt
systems where LLCs have just one core. For example: hackbench. Both the
threads of hackbench would then end up running on the same core,
resulting in CPUs running in higher SMT modes, more load balances and
non-optimal performance.

It would be beneficial to run the wakee thread on the same CPU as the waker
only if other CPUs are busy. To achieve this, move the part of the code
that checks if the waker is the only running thread to the early part of
wake_affine_weight().

Also, post this change, wake_affine_idle() will only look at wakeups
within the LLC. For wakeups within the LLC, there should be no difference
between sync and non-sync. For wakeups across LLC boundaries, let's use
wake_affine_idler_llc().

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c     | 22 +++++++++++-----------
 kernel/sched/features.h |  2 +-
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 943621367a96..f8c98e544418 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5799,8 +5799,7 @@ static int wake_wide(struct task_struct *p)
  *			  scheduling latency of the CPUs. This seems to work
  *			  for the overloaded case.
  */
-static int
-wake_affine_idle(int this_cpu, int prev_cpu, int sync)
+static int wake_affine_idle(int this_cpu, int prev_cpu)
 {
 	/*
 	 * If this_cpu is idle, it implies the wakeup is from interrupt
@@ -5814,15 +5813,12 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	 * a cpufreq perspective, it's better to have higher utilisation
 	 * on one CPU.
 	 */
-	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
+	if (available_idle_cpu(prev_cpu) || sched_idle_cpu(prev_cpu))
+		return prev_cpu;
 
-	if (sync && cpu_rq(this_cpu)->nr_running == 1)
+	if (available_idle_cpu(this_cpu) || sched_idle_cpu(this_cpu))
 		return this_cpu;
 
-	if (available_idle_cpu(prev_cpu))
-		return prev_cpu;
-
 	return nr_cpumask_bits;
 }
 
@@ -5838,6 +5834,9 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	if (sync) {
 		unsigned long current_load = task_h_load(current);
 
+		if (cpu_rq(this_cpu)->nr_running <= 1)
+			return this_cpu;
+
 		if (current_load > this_eff_load)
 			return this_cpu;
 
@@ -5933,12 +5932,13 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int this_cpu, int prev_cpu, int sync)
 {
+	bool share_caches = cpus_share_cache(prev_cpu, this_cpu);
 	int target = nr_cpumask_bits;
 
-	if (sched_feat(WA_IDLE))
-		target = wake_affine_idle(this_cpu, prev_cpu, sync);
+	if (sched_feat(WA_IDLE) && share_caches)
+		target = wake_affine_idle(this_cpu, prev_cpu);
 
-	if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits)
+	else if (sched_feat(WA_IDLER_LLC) && !share_caches)
 		target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);
 
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index c77349a47e01..c60a6d2b2126 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,7 +83,7 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
 SCHED_FEAT(WA_IDLE, true)
 SCHED_FEAT(WA_WEIGHT, true)
-SCHED_FEAT(WA_IDLER_LLC, false)
+SCHED_FEAT(WA_IDLER_LLC, true)
 SCHED_FEAT(WA_BIAS, true)
 
 /*
-- 
2.18.2



* [PATCH 06/10] sched/idle: Move busy_cpu accounting to idle callback
  2021-04-22 10:23 [PATCH 00/10] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (4 preceding siblings ...)
  2021-04-22 10:23 ` [PATCH 05/10] sched/fair: Call wake_affine only if necessary Srikar Dronamraju
@ 2021-04-22 10:23 ` Srikar Dronamraju
  2021-04-22 10:23 ` [PATCH 07/10] sched/fair: Remove ifdefs in waker_affine_idler_llc Srikar Dronamraju
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-22 10:23 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

Currently we account nr_busy_cpus in the nohz idle functions.
There is no reason why nr_busy_cpus should be updated in NO_HZ_COMMON
configs only. Also, the scheduler can mark a CPU as non-busy as soon as an
idle-class task starts to run. The scheduler can then mark a CPU as busy
as soon as it is woken up from idle or a new task is placed on its
runqueue.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c     |  6 ++++--
 kernel/sched/idle.c     | 29 +++++++++++++++++++++++++++--
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c |  2 ++
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8c98e544418..00bcf1d861b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10384,7 +10384,10 @@ static void set_cpu_sd_state_busy(int cpu)
 		goto unlock;
 	sd->nohz_idle = 0;
 
-	atomic_inc(&sd->shared->nr_busy_cpus);
+	if (sd && per_cpu(is_idle, cpu)) {
+		atomic_add_unless(&sd->shared->nr_busy_cpus, 1, per_cpu(sd_llc_size, cpu));
+		per_cpu(is_idle, cpu) = 0;
+	}
 unlock:
 	rcu_read_unlock();
 }
@@ -10414,7 +10417,6 @@ static void set_cpu_sd_state_idle(int cpu)
 		goto unlock;
 	sd->nohz_idle = 1;
 
-	atomic_dec(&sd->shared->nr_busy_cpus);
 unlock:
 	rcu_read_unlock();
 }
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index cc828f3efe71..e624a05e48bd 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -425,12 +425,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
-#ifdef CONFIG_SCHED_SMT
+#ifdef CONFIG_SMP
+	struct sched_domain_shared *sds;
 	int cpu = rq->cpu;
 
+#ifdef CONFIG_SCHED_SMT
 	if (static_branch_likely(&sched_smt_present))
 		set_core_busy(cpu);
 #endif
+
+	rcu_read_lock();
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds) {
+		if (per_cpu(is_idle, cpu)) {
+			atomic_inc(&sds->nr_busy_cpus);
+			per_cpu(is_idle, cpu) = 0;
+		}
+	}
+	rcu_read_unlock();
+#endif
 }
 
 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
@@ -442,9 +455,21 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
 struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	struct task_struct *next = rq->idle;
+#ifdef CONFIG_SMP
+	struct sched_domain_shared *sds;
+	int cpu = rq->cpu;
 
-	set_next_task_idle(rq, next, true);
+	rcu_read_lock();
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds) {
+		atomic_add_unless(&sds->nr_busy_cpus, -1, 0);
+		per_cpu(is_idle, cpu) = 1;
+	}
 
+	rcu_read_unlock();
+#endif
+
+	set_next_task_idle(rq, next, true);
 	return next;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5c0bd4b0e73a..baf8d9a4cb26 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1483,6 +1483,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 #ifdef CONFIG_SCHED_SMT
 DECLARE_PER_CPU(int, smt_id);
 #endif
+DECLARE_PER_CPU(int, is_idle);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8db40c8a6ad0..00e4669bb241 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
 #ifdef CONFIG_SCHED_SMT
 DEFINE_PER_CPU(int, smt_id);
 #endif
+DEFINE_PER_CPU(int, is_idle);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
 #ifdef CONFIG_SCHED_SMT
 	per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
 #endif
+	per_cpu(is_idle, cpu) = 1;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_NUMA);
-- 
2.18.2



* [PATCH 07/10] sched/fair: Remove ifdefs in waker_affine_idler_llc
  2021-04-22 10:23 [PATCH 00/10] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (5 preceding siblings ...)
  2021-04-22 10:23 ` [PATCH 06/10] sched/idle: Move busy_cpu accounting to idle callback Srikar Dronamraju
@ 2021-04-22 10:23 ` Srikar Dronamraju
  2021-04-22 10:23 ` [PATCH 08/10] sched/fair: Dont iterate if no idle CPUs Srikar Dronamraju
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-22 10:23 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

Now that the idle callbacks are updating nr_busy_cpus, remove the ifdefs in
wake_affine_idler_llc().

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 00bcf1d861b5..8f752f77b76f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5870,9 +5870,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 
 static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_cpu, int sync)
 {
-#ifdef CONFIG_NO_HZ_COMMON
 	int pnr_busy, pllc_size, tnr_busy, tllc_size;
-#endif
 	struct sched_domain_shared *tsds, *psds;
 	int diff;
 
@@ -5909,7 +5907,6 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
 		}
 	}
 
-#ifdef CONFIG_NO_HZ_COMMON
 	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
 	pnr_busy = atomic_read(&psds->nr_busy_cpus);
 
@@ -5924,7 +5921,6 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
 		return this_cpu;
 	if (diff < 0)
 		return prev_cpu;
-#endif /* CONFIG_NO_HZ_COMMON */
 
 	return nr_cpumask_bits;
 }
-- 
2.18.2



* [PATCH 08/10] sched/fair: Dont iterate if no idle CPUs
  2021-04-22 10:23 [PATCH 00/10] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (6 preceding siblings ...)
  2021-04-22 10:23 ` [PATCH 07/10] sched/fair: Remove ifdefs in waker_affine_idler_llc Srikar Dronamraju
@ 2021-04-22 10:23 ` Srikar Dronamraju
  2021-04-22 10:23 ` [PATCH 09/10] sched/topology: Introduce fallback LLC Srikar Dronamraju
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-22 10:23 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

Now that nr_busy_cpus for an LLC is updated in the idle callbacks, the
scheduler can detect if all threads of an LLC are busy. In such cases, it
can avoid searching for idle CPUs in the LLC that could run the wakee
thread.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f752f77b76f..db5dc9875e4c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -715,7 +715,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
-static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu, bool idle);
 static unsigned long task_h_load(struct task_struct *p);
 static unsigned long capacity_of(int cpu);
 
@@ -5868,7 +5868,8 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }
 
-static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_cpu, int sync)
+static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_cpu,
+				int sync, bool *idle)
 {
 	int pnr_busy, pllc_size, tnr_busy, tllc_size;
 	struct sched_domain_shared *tsds, *psds;
@@ -5913,8 +5914,10 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
 	tllc_size = per_cpu(sd_llc_size, this_cpu);
 	pllc_size = per_cpu(sd_llc_size, prev_cpu);
 
-	if (pnr_busy == pllc_size && tnr_busy == tllc_size)
+	if (pnr_busy == pllc_size && tnr_busy == tllc_size) {
+		*idle = false;
 		return nr_cpumask_bits;
+	}
 
 	diff = pnr_busy * tllc_size - tnr_busy * pllc_size;
 	if (diff > 0)
@@ -5926,7 +5929,7 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
-		       int this_cpu, int prev_cpu, int sync)
+		       int this_cpu, int prev_cpu, int sync, bool *idle)
 {
 	bool share_caches = cpus_share_cache(prev_cpu, this_cpu);
 	int target = nr_cpumask_bits;
@@ -5935,7 +5938,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		target = wake_affine_idle(this_cpu, prev_cpu);
 
 	else if (sched_feat(WA_IDLER_LLC) && !share_caches)
-		target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);
+		target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync, idle);
 
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
@@ -6333,7 +6336,7 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
-static int select_idle_sibling(struct task_struct *p, int prev, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int target, bool idle)
 {
 	struct sched_domain *sd;
 	unsigned long task_util;
@@ -6410,6 +6413,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		}
 	}
 
+	if (!idle)
+		return target;
+
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	if (!sd)
 		return target;
@@ -6818,6 +6824,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	int want_affine = 0;
 	/* SD_flags and WF_flags share the first nibble */
 	int sd_flag = wake_flags & 0xF;
+	bool idle = true;
 
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
@@ -6841,7 +6848,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
 		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
 			if (cpu != prev_cpu)
-				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
+				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync, &idle);
 
 			sd = NULL; /* Prefer wake_affine over balance flags */
 			break;
@@ -6858,7 +6865,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
-		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu, idle);
 
 		if (want_affine)
 			current->recent_used_cpu = cpu;
-- 
2.18.2



* [PATCH 09/10] sched/topology: Introduce fallback LLC
  2021-04-22 10:23 [PATCH 00/10] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (7 preceding siblings ...)
  2021-04-22 10:23 ` [PATCH 08/10] sched/fair: Dont iterate if no idle CPUs Srikar Dronamraju
@ 2021-04-22 10:23 ` Srikar Dronamraju
  2021-04-22 15:10   ` kernel test robot
  2021-04-22 10:23 ` [PATCH 10/10] powerpc/smp: Add fallback flag to powerpc MC domain Srikar Dronamraju
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-22 10:23 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Michael Ellerman, Gautham R Shenoy, Parth Shah

On some systems, LLC sizes may be too small. Some of these systems may
also support multiple cache access latency levels, i.e. between the
previous LLC and the waker LLC, there could be other LLCs that have a lower
cache access latency to the waker LLC. If the waker LLC is busy, then the
scheduler could choose to schedule a task on such an LLC.

Here is one approach to identify a static fallback LLC for each LLC on
systems that support multiple cache access latency levels. In this
approach, the fallback LLCs are decided at boot/CPU bring-up time. There
is a one-to-one mapping between an LLC and its fallback LLC. The fallback
LLC will only be used if the wakeup is a sync wakeup and the current LLC is
busier than the fallback LLC. Also, the scheduler will not choose the fallback
LLC if the previous LLC has the same cache access latency as the fallback LLC.

It is expected that the fallback LLC is part of the parent domain of the
LLC domain. Archs can choose to use a fallback LLC by setting the
SD_FALLBACK_LLC flag.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/sched/sd_flags.h |  7 ++++++
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 43 +++++++++++++++++++++++++++++---
 kernel/sched/topology.c        | 45 ++++++++++++++++++++++++++++++++--
 4 files changed, 90 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 34b21e971d77..3ca44dd421e4 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -129,6 +129,13 @@ SD_FLAG(SD_SERIALIZE, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
  */
 SD_FLAG(SD_ASYM_PACKING, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
 
+/*
+ * Consider waking task on near-by idle LLC.
+ *
+ * NEEDS_GROUPS: Load balancing flag.
+ */
+SD_FLAG(SD_FALLBACK_LLC, SDF_NEEDS_GROUPS)
+
 /*
  * Prefer to place tasks in a sibling domain
  *
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 285165a35f21..b0446191319a 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -74,6 +74,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		idle_core;
+	int		fallback_llc_id;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index db5dc9875e4c..8ea6d0183fc8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5873,7 +5873,8 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
 {
 	int pnr_busy, pllc_size, tnr_busy, tllc_size;
 	struct sched_domain_shared *tsds, *psds;
-	int diff;
+	bool try_fallback = false;
+	int diff, fcpu = -1;
 
 	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
 	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
@@ -5890,6 +5891,43 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
 		}
 	}
 
+	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
+	tllc_size = per_cpu(sd_llc_size, this_cpu);
+
+	if (sync) {
+		struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
+
+		/*
+		 * task is a target of *sync* wakeup. However there are no
+		 * idle cores in the waking CPU. Ignore fallback LLC if the
+		 * previous CPU is part of the LLC's parent domain.
+		 */
+		try_fallback = !cpumask_test_cpu(prev_cpu, sched_domain_span(sd->parent));
+		fcpu = tsds->fallback_llc_id;
+	}
+
+	if (try_fallback && fcpu != -1 && cpumask_test_cpu(fcpu, p->cpus_ptr)) {
+		struct sched_domain_shared *fsds;
+		int fnr_busy, fllc_size;
+
+		fsds = rcu_dereference(per_cpu(sd_llc_shared, fcpu));
+		if (fsds && fsds != psds) {
+			if (fsds->idle_core != -1) {
+				if (cpumask_test_cpu(fsds->idle_core, p->cpus_ptr))
+					return fsds->idle_core;
+				return fcpu;
+			}
+
+			fnr_busy = atomic_read(&fsds->nr_busy_cpus);
+			fllc_size = per_cpu(sd_llc_size, fcpu);
+			if (fnr_busy * tllc_size < tnr_busy * fllc_size) {
+				tnr_busy = fnr_busy;
+				tllc_size = fllc_size;
+				this_cpu = fcpu;
+			}
+		}
+	}
+
 	if (available_idle_cpu(prev_cpu) || sched_idle_cpu(prev_cpu))
 		return prev_cpu;
 	if (psds->idle_core != -1) {
@@ -5908,10 +5946,7 @@ static int wake_affine_idler_llc(struct task_struct *p, int this_cpu, int prev_c
 		}
 	}
 
-	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
 	pnr_busy = atomic_read(&psds->nr_busy_cpus);
-
-	tllc_size = per_cpu(sd_llc_size, this_cpu);
 	pllc_size = per_cpu(sd_llc_size, prev_cpu);
 
 	if (pnr_busy == pllc_size && tnr_busy == tllc_size) {
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 00e4669bb241..89aa8986c58b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -603,6 +603,8 @@ static void free_sched_groups(struct sched_group *sg, int free_sgc)
 
 static void destroy_sched_domain(struct sched_domain *sd)
 {
+	struct sched_domain_shared *sds = sd->shared;
+
 	/*
 	 * A normal sched domain may have multiple group references, an
 	 * overlapping domain, having private groups, only one.  Iterate,
@@ -610,8 +612,18 @@ static void destroy_sched_domain(struct sched_domain *sd)
 	 */
 	free_sched_groups(sd->groups, 1);
 
-	if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
-		kfree(sd->shared);
+	if (sds && atomic_dec_and_test(&sds->ref)) {
+		struct sched_domain_shared *next_sds;
+
+		if (sds->fallback_llc_id != -1) {
+			next_sds = rcu_dereference(per_cpu(sd_llc_shared, sds->fallback_llc_id));
+			if (next_sds && next_sds->fallback_llc_id != -1)
+				next_sds->fallback_llc_id = -1;
+
+			sds->fallback_llc_id = -1;
+		}
+		kfree(sds);
+	}
 	kfree(sd);
 }
 
@@ -663,9 +675,36 @@ static void update_top_cache_domain(int cpu)
 
 	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
 	if (sd) {
+		struct sched_domain *sd_parent = sd->parent;
+
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
 		sds = sd->shared;
+
+		if (sds->fallback_llc_id == -1 && sd_parent &&
+				sd_parent->flags & SD_FALLBACK_LLC) {
+			const struct cpumask *parent_span = sched_domain_span(sd->parent);
+			struct cpumask *span = sched_domains_tmpmask;
+			int fcpu;
+
+			/*
+			 * If LLC's parent domain has SD_FALLBACK_LLC flag
+			 * set and this LLC's fallback_llc_id is not yet
+			 * set, then walk through the LLC parent's domain to
+			 * find a fallback_llc.
+			 */
+			cpumask_andnot(span, parent_span, sched_domain_span(sd));
+			for_each_cpu_wrap(fcpu, span, cpu) {
+				struct sched_domain_shared *next_sds;
+
+				next_sds = rcu_dereference(per_cpu(sd_llc_shared, fcpu));
+				if (next_sds && next_sds->fallback_llc_id == -1) {
+					sds->fallback_llc_id = fcpu;
+					next_sds->fallback_llc_id = cpu;
+					break;
+				}
+			}
+		}
 	}
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
@@ -1370,6 +1409,7 @@ int __read_mostly		node_reclaim_distance = RECLAIM_DISTANCE;
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY	|	\
 	 SD_SHARE_PKG_RESOURCES |	\
+	 SD_FALLBACK_LLC	|	\
 	 SD_NUMA		|	\
 	 SD_ASYM_PACKING)
 
@@ -1475,6 +1515,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
 		sd->shared->idle_core = -1;
+		sd->shared->fallback_llc_id = -1;
 	}
 
 	sd->private = sdd;
-- 
2.18.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 10/10] powerpc/smp: Add fallback flag to powerpc MC domain
  2021-04-22 10:23 [PATCH 00/10] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (8 preceding siblings ...)
  2021-04-22 10:23 ` [PATCH 09/10] sched/topology: Introduce fallback LLC Srikar Dronamraju
@ 2021-04-22 10:23 ` Srikar Dronamraju
  2021-04-23  8:25 ` [PATCH 00/10] sched/fair: wake_affine improvements Mel Gorman
  2021-04-27 14:52 ` Vincent Guittot
  11 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-22 10:23 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Michael Ellerman, Gautham R Shenoy, Parth Shah

Power10 supports an MC domain. Cores within the MC domain (aka
hemisphere) have a lower cache access latency to each other than to
cores in the neighbouring MC. The last level of cache on Power10 is at
the core level (4 threads). Now that the scheduler supports a fallback
LLC domain, mark the MC domain with the SD_FALLBACK_LLC flag. With
this, each SMT4 core forms a pair (as fallback LLC) with another SMT4
core within the same MC domain.
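
Illustratively, the domain hierarchy from the topology table below is

	SMT -> CACHE (the LLC) -> MC (now SD_FALLBACK_LLC) -> DIE

so each CACHE-level LLC is paired with exactly one other LLC inside the
same MC, and a sync wakeup whose own LLC is busy may spill into that
paired LLC.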

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5a4d59a1070d..bc6386055cbe 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -950,6 +950,11 @@ static int powerpc_shared_cache_flags(void)
 	return SD_SHARE_PKG_RESOURCES;
 }
 
+static int powerpc_mc_flags(void)
+{
+	return SD_FALLBACK_LLC;
+}
+
 /*
  * We can't just pass cpu_l2_cache_mask() directly because
  * returns a non-const pointer and the compiler barfs on that.
@@ -986,7 +991,7 @@ static struct sched_domain_topology_level powerpc_topology[] = {
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
 	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
-	{ cpu_mc_mask, SD_INIT_NAME(MC) },
+	{ cpu_mc_mask, powerpc_mc_flags, SD_INIT_NAME(MC) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
-- 
2.18.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 09/10] sched/topology: Introduce fallback LLC
  2021-04-22 10:23 ` [PATCH 09/10] sched/topology: Introduce fallback LLC Srikar Dronamraju
@ 2021-04-22 15:10   ` kernel test robot
  0 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2021-04-22 15:10 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: kbuild-all, LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann

[-- Attachment #1: Type: text/plain, Size: 20542 bytes --]

Hi Srikar,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on powerpc/next]
[also build test WARNING on linus/master v5.12-rc8]
[cannot apply to tip/sched/core tip/master next-20210422]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Srikar-Dronamraju/sched-fair-wake_affine-improvements/20210422-182725
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: x86_64-randconfig-s022-20210421 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-341-g8af24329-dirty
        # https://github.com/0day-ci/linux/commit/dcb752591087d90fd4093094e42baf167fdcae7b
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Srikar-Dronamraju/sched-fair-wake_affine-improvements/20210422-182725
        git checkout dcb752591087d90fd4093094e42baf167fdcae7b
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
   kernel/sched/topology.c:106:56: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:106:56: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:106:56: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:125:60: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:125:60: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:125:60: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:148:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:148:20: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:148:20: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:431:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct perf_domain *[assigned] tmp @@     got struct perf_domain [noderef] __rcu *pd @@
   kernel/sched/topology.c:431:13: sparse:     expected struct perf_domain *[assigned] tmp
   kernel/sched/topology.c:431:13: sparse:     got struct perf_domain [noderef] __rcu *pd
   kernel/sched/topology.c:440:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct perf_domain *[assigned] tmp @@     got struct perf_domain [noderef] __rcu *pd @@
   kernel/sched/topology.c:440:13: sparse:     expected struct perf_domain *[assigned] tmp
   kernel/sched/topology.c:440:13: sparse:     got struct perf_domain [noderef] __rcu *pd
   kernel/sched/topology.c:461:19: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct perf_domain *[assigned] pd @@     got struct perf_domain [noderef] __rcu *pd @@
   kernel/sched/topology.c:461:19: sparse:     expected struct perf_domain *[assigned] pd
   kernel/sched/topology.c:461:19: sparse:     got struct perf_domain [noderef] __rcu *pd
   kernel/sched/topology.c:635:49: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:635:49: sparse:     expected struct sched_domain *parent
   kernel/sched/topology.c:635:49: sparse:     got struct sched_domain [noderef] __rcu *parent
>> kernel/sched/topology.c:678:52: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *sd_parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:678:52: sparse:     expected struct sched_domain *sd_parent
   kernel/sched/topology.c:678:52: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:686:81: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:686:81: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:686:81: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:742:50: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:742:50: sparse:     expected struct sched_domain *parent
   kernel/sched/topology.c:742:50: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:749:55: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain [noderef] __rcu *[noderef] __rcu child @@     got struct sched_domain *[assigned] tmp @@
   kernel/sched/topology.c:749:55: sparse:     expected struct sched_domain [noderef] __rcu *[noderef] __rcu child
   kernel/sched/topology.c:749:55: sparse:     got struct sched_domain *[assigned] tmp
   kernel/sched/topology.c:759:29: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:759:29: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/topology.c:759:29: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:764:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:764:20: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:764:20: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:770:33: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:770:33: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/topology.c:770:33: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:805:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *sd @@
   kernel/sched/topology.c:805:13: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/topology.c:805:13: sparse:     got struct sched_domain [noderef] __rcu *sd
   kernel/sched/topology.c:967:70: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:967:70: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:967:70: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:996:59: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:996:59: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:996:59: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1166:40: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *child @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:1166:40: sparse:     expected struct sched_domain *child
   kernel/sched/topology.c:1166:40: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1468:43: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain [noderef] __rcu *child @@     got struct sched_domain *child @@
   kernel/sched/topology.c:1468:43: sparse:     expected struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1468:43: sparse:     got struct sched_domain *child
   kernel/sched/topology.c:1955:31: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain [noderef] __rcu *parent @@     got struct sched_domain *sd @@
   kernel/sched/topology.c:1955:31: sparse:     expected struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:1955:31: sparse:     got struct sched_domain *sd
   kernel/sched/topology.c:2123:57: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2123:57: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/topology.c:2123:57: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:2140:57: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2140:57: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/topology.c:2140:57: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:59:25: sparse: sparse: dereference of noderef expression
   kernel/sched/topology.c:64:25: sparse: sparse: dereference of noderef expression
   kernel/sched/topology.c: note: in included file:
   kernel/sched/sched.h:1459:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/sched.h:1459:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/sched.h:1459:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/sched.h:1472:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/sched.h:1472:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/sched.h:1472:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/sched.h:1459:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/sched.h:1459:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/sched.h:1459:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/sched.h:1472:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/sched.h:1472:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/sched.h:1472:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:1483:19: sparse: sparse: dereference of noderef expression
--
   kernel/sched/fair.c:861:34: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_entity *se @@     got struct sched_entity [noderef] __rcu * @@
   kernel/sched/fair.c:861:34: sparse:     expected struct sched_entity *se
   kernel/sched/fair.c:861:34: sparse:     got struct sched_entity [noderef] __rcu *
   kernel/sched/fair.c:10750:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:10750:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:10750:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:4907:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/fair.c:4907:22: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/fair.c:4907:22: sparse:    struct task_struct *
   kernel/sched/fair.c:5430:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:5430:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:5430:38: sparse:     got struct task_struct [noderef] __rcu *curr
>> kernel/sched/fair.c:5905:80: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:5905:80: sparse:     expected struct sched_domain *sd
   kernel/sched/fair.c:5905:80: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:6755:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:6755:20: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:6755:20: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:6878:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:6878:9: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/fair.c:6878:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:7076:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:7076:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:7076:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:7327:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:7327:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:7327:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:8298:40: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *child @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/fair.c:8298:40: sparse:     expected struct sched_domain *child
   kernel/sched/fair.c:8298:40: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/fair.c:8791:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/fair.c:8791:22: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/fair.c:8791:22: sparse:    struct task_struct *
   kernel/sched/fair.c:10067:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:10067:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:10067:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:9723:44: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *sd_parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:9723:44: sparse:     expected struct sched_domain *sd_parent
   kernel/sched/fair.c:9723:44: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:10145:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:10145:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:10145:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:4575:31: sparse: sparse: marked inline, but without a definition
   kernel/sched/fair.c: note: in included file:
   kernel/sched/sched.h:1730:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:1730:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:1730:25: sparse:    struct task_struct *
   kernel/sched/sched.h:1885:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:1885:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:1885:9: sparse:    struct task_struct *
   kernel/sched/sched.h:1730:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:1730:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:1730:25: sparse:    struct task_struct *
   kernel/sched/sched.h:1730:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:1730:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:1730:25: sparse:    struct task_struct *

vim +678 kernel/sched/topology.c

   668	
   669	static void update_top_cache_domain(int cpu)
   670	{
   671		struct sched_domain_shared *sds = NULL;
   672		struct sched_domain *sd;
   673		int id = cpu;
   674		int size = 1;
   675	
   676		sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
   677		if (sd) {
 > 678			struct sched_domain *sd_parent = sd->parent;
   679	
   680			id = cpumask_first(sched_domain_span(sd));
   681			size = cpumask_weight(sched_domain_span(sd));
   682			sds = sd->shared;
   683	
   684			if (sds->fallback_llc_id == -1 && sd_parent &&
   685					sd_parent->flags & SD_FALLBACK_LLC) {
   686				const struct cpumask *parent_span = sched_domain_span(sd->parent);
   687				struct cpumask *span = sched_domains_tmpmask;
   688				int fcpu;
   689	
   690				/*
   691				 * If LLC's parent domain has SD_FALLBACK_LLC flag
   692				 * set and this LLC's fallback_llc_id is not yet
   693				 * set, then walk through the LLC parent's domain to
   694				 * find a fallback_llc.
   695				 */
   696				cpumask_andnot(span, parent_span, sched_domain_span(sd));
   697				for_each_cpu_wrap(fcpu, span, cpu) {
   698					struct sched_domain_shared *next_sds;
   699	
   700					next_sds = rcu_dereference(per_cpu(sd_llc_shared, fcpu));
   701					if (next_sds && next_sds->fallback_llc_id == -1) {
   702						sds->fallback_llc_id = fcpu;
   703						next_sds->fallback_llc_id = cpu;
   704						break;
   705					}
   706				}
   707			}
   708		}
   709	
   710		rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
   711		per_cpu(sd_llc_size, cpu) = size;
   712		per_cpu(sd_llc_id, cpu) = id;
   713	#ifdef CONFIG_SCHED_SMT
   714		per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
   715	#endif
   716		per_cpu(is_idle, cpu) = 1;
   717		rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
   718	
   719		sd = lowest_flag_domain(cpu, SD_NUMA);
   720		rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
   721	
   722		sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
   723		rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
   724	
   725		sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
   726		rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
   727	}
   728	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 36822 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/10] sched/fair: wake_affine improvements
  2021-04-22 10:23 [PATCH 00/10] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (9 preceding siblings ...)
  2021-04-22 10:23 ` [PATCH 10/10] powerpc/smp: Add fallback flag to powerpc MC domain Srikar Dronamraju
@ 2021-04-23  8:25 ` Mel Gorman
  2021-04-23 10:31   ` Srikar Dronamraju
  2021-04-26 10:39   ` Srikar Dronamraju
  2021-04-27 14:52 ` Vincent Guittot
  11 siblings, 2 replies; 22+ messages in thread
From: Mel Gorman @ 2021-04-23  8:25 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Gautham R Shenoy, Parth Shah

On Thu, Apr 22, 2021 at 03:53:16PM +0530, Srikar Dronamraju wrote:
> Recently we found that some of the benchmark numbers on Power10 were lesser
> than expected. Some analysis showed that the problem lies in the fact that
> L2-Cache on Power10 is at core level i.e only 4 threads share the L2-cache.
> 

I didn't get the chance to review this properly although I am suspicious
of tracking idle_core and updating that more frequently. It becomes a very
hot cache line that bounces. I did experiment with tracking an idle core
but the data either went stale too quickly or the updates incurred more
overhead than a reduced search saved.

The series also oopses a *lot* and didn't get through a run of basic
workloads on x86 on any of three machines. An example oops is

[  137.770968] BUG: unable to handle page fault for address: 000000000001a5c8
[  137.777836] #PF: supervisor read access in kernel mode
[  137.782965] #PF: error_code(0x0000) - not-present page
[  137.788097] PGD 8000004098a42067 P4D 8000004098a42067 PUD 4092e36067 PMD 40883ac067 PTE 0
[  137.796261] Oops: 0000 [#1] SMP PTI
[  137.799747] CPU: 0 PID: 14913 Comm: GC Slave Tainted: G            E     5.12.0-rc8-llcfallback-v1r1 #1
[  137.809123] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
[  137.816765] RIP: 0010:cpus_share_cache+0x22/0x30
[  137.821396] Code: fc ff 0f 0b eb 80 66 90 0f 1f 44 00 00 48 63 ff 48 63 f6 48 c7 c0 c8 a5 01 00 48 8b 0c fd 00 59 9d 9a 48 8b 14 f5 00 59 9d 9a <8b> 14 02 39 14 01 0f 94 c0 c3 0f 1f 40 00 0f 1f 44 00 00 41 57 41
[  137.840133] RSP: 0018:ffff9e0aa26a7c60 EFLAGS: 00010086
[  137.845360] RAX: 000000000001a5c8 RBX: ffff8bfbc65cf280 RCX: ffff8c3abf440000
[  137.852483] RDX: 0000000000000000 RSI: ffffffffffffffff RDI: 0000000000000015
[  137.859605] RBP: ffff8bbc478b9fe0 R08: 0000000000000000 R09: 000000000001a5b8
[  137.866730] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8bbc47c88a00
[  137.873855] R13: 0000000000000015 R14: 0000000000000001 R15: ffff8bfde0199fc0
[  137.880978] FS:  00007fa33c173700(0000) GS:ffff8bfabf400000(0000) knlGS:0000000000000000
[  137.889055] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  137.894794] CR2: 000000000001a5c8 CR3: 0000004096b22006 CR4: 00000000003706f0
[  137.901918] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  137.909041] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  137.916164] Call Trace:
[  137.918614]  select_task_rq_fair+0xb88/0x1410
[  137.922974]  try_to_wake_up+0x114/0x4b0
[  137.926813]  ? should_numa_migrate_memory+0x54/0x170
[  137.931775]  wake_up_q+0x64/0xa0
[  137.934999]  futex_wake+0x13e/0x160
[  137.938493]  do_futex+0xce/0xa80
[  137.941724]  ? asm_sysvec_call_function+0x12/0x20
[  137.946422]  ? __cgroup_account_cputime+0x24/0x30
[  137.951126]  ? update_curr+0xfa/0x1a0
[  137.954784]  ? pick_next_entity+0x70/0x110
[  137.958876]  ? pick_next_task_fair+0xd2/0x3b0
[  137.963233]  __x64_sys_futex+0x13b/0x1f0
[  137.967155]  do_syscall_64+0x33/0x80
[  137.970737]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  137.975786] RIP: 0033:0x7fc9de7e1cda
[  137.979365] Code: 00 00 b8 ca 00 00 00 0f 05 5a 5e c3 0f 1f 40 00 56 52 c7 07 00 00 00 00 81 f6 81 00 00 00 ba 01 00 00 00 b8 ca 00 00 00 0f 05 <5a> 5e c3 0f 1f 00 41 54 41 55 49 89 fc 49 89 f5 48 83 ec 18 48 89
[  137.998101] RSP: 002b:00007fa33c170af0 EFLAGS: 00000206 ORIG_RAX: 00000000000000ca
[  138.005657] RAX: ffffffffffffffda RBX: 00007fc9d80050a8 RCX: 00007fc9de7e1cda
[  138.012783] RDX: 0000000000000001 RSI: 0000000000000081 RDI: 00007fc9d8005128
[  138.019907] RBP: 00007fc9d85b9298 R08: 0000000000000000 R09: 0000000000000001
[  138.027031] R10: 0000000000000000 R11: 0000000000000206 R12: 00007fc9d8005128
[  138.034155] R13: 00007fc9d85b9268 R14: 00007fc9d8005150 R15: 00007fc9d85b8df8
[  138.041279] Modules linked in: binfmt_misc(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) dmi_sysfs(E) msr(E) intel_rapl_msr(E) intel_rapl_common(E) sb_edac(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) iTCO_wdt(E) kvm_intel(E) iTCO_vendor_support(E) ipmi_ssif(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ixgbe(E) ghash_clmulni_intel(E) xfrm_algo(E) aesni_intel(E) mdio_devres(E) crypto_simd(E) cryptd(E) joydev(E) igb(E) libphy(E) ioatdma(E) mei_me(E) i2c_i801(E) mdio(E) pcspkr(E) acpi_ipmi(E) i2c_smbus(E) lpc_ich(E) mei(E) dca(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) button(E) btrfs(E) blake2b_generic(E) libcrc32c(E) xor(E) raid6_pq(E) hid_generic(E) usbhid(E) ast(E) i2c_algo_bit(E) drm_vram_helper(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) cec(E) xhci_pci(E) drm_ttm_helper(E) ehci_pci(E) ttm(E) xhci_hcd(E) ehci_hcd(E) crc32c_intel(E) drm(E) usbcore(E) mpt3sas(E)
[  138.041333]  raid_class(E) scsi_transport_sas(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E)
[  138.140016] CR2: 000000000001a5c8

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/10] sched/fair: wake_affine improvements
  2021-04-23  8:25 ` [PATCH 00/10] sched/fair: wake_affine improvements Mel Gorman
@ 2021-04-23 10:31   ` Srikar Dronamraju
  2021-04-23 12:38     ` Mel Gorman
  2021-04-26 10:39   ` Srikar Dronamraju
  1 sibling, 1 reply; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-23 10:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Gautham R Shenoy, Parth Shah

* Mel Gorman <mgorman@techsingularity.net> [2021-04-23 09:25:32]:

> On Thu, Apr 22, 2021 at 03:53:16PM +0530, Srikar Dronamraju wrote:
> > Recently we found that some of the benchmark numbers on Power10 were lesser
> > than expected. Some analysis showed that the problem lies in the fact that
> > L2-Cache on Power10 is at core level i.e only 4 threads share the L2-cache.
> > 
> 
> I didn't get the chance to review this properly although I am suspicious
> of tracking idle_core and updating that more frequently. It becomes a very
> hot cache line that bounces. I did experiment with tracking an idle core
> but the data either went stale too quickly or the updates incurred more
> overhead than a reduced search saved.

Thanks a lot for trying this.

> The series also oopses a *lot* and didn't get through a run of basic
> workloads on x86 on any of three machines. An example oops is
> 

Can you pass me your failing config? I have somehow not been seeing this
either on x86 or on PowerPC on multiple systems.
Also if possible cat /proc/schedstat and cat
/proc/sys/kernel/sched_domain/cpu0/domain*/name

> [  137.770968] BUG: unable to handle page fault for address: 000000000001a5c8
> [  137.777836] #PF: supervisor read access in kernel mode
> [  137.782965] #PF: error_code(0x0000) - not-present page
> [  137.788097] PGD 8000004098a42067 P4D 8000004098a42067 PUD 4092e36067 PMD 40883ac067 PTE 0
> [  137.796261] Oops: 0000 [#1] SMP PTI
> [  137.799747] CPU: 0 PID: 14913 Comm: GC Slave Tainted: G            E     5.12.0-rc8-llcfallback-v1r1 #1
> [  137.809123] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
> [  137.816765] RIP: 0010:cpus_share_cache+0x22/0x30
> [  137.821396] Code: fc ff 0f 0b eb 80 66 90 0f 1f 44 00 00 48 63 ff 48 63 f6 48 c7 c0 c8 a5 01 00 48 8b 0c fd 00 59 9d 9a 48 8b 14 f5 00 59 9d 9a <8b> 14 02 39 14 01 0f 94 c0 c3 0f 1f 40 00 0f 1f 44 00 00 41 57 41

The IP says cpus_share_cache(), which takes 2 ints.
RAX is 000000000001a5c8 and the panic says
"unable to handle page fault for address: 000000000001a5c8",
so it must have faulted on "per_cpu(sd_llc_id, xx_cpu)".
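
For reference, cpus_share_cache() in kernel/sched/core.c is roughly:

	bool cpus_share_cache(int this_cpu, int that_cpu)
	{
		return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
	}

Passing -1 as a CPU makes per_cpu() index outside the per-CPU offset
table, and the resulting bogus pointer faults.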

> [  137.840133] RSP: 0018:ffff9e0aa26a7c60 EFLAGS: 00010086
> [  137.845360] RAX: 000000000001a5c8 RBX: ffff8bfbc65cf280 RCX: ffff8c3abf440000
> [  137.852483] RDX: 0000000000000000 RSI: ffffffffffffffff RDI: 0000000000000015
> [  137.859605] RBP: ffff8bbc478b9fe0 R08: 0000000000000000 R09: 000000000001a5b8
> [  137.866730] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8bbc47c88a00
> [  137.873855] R13: 0000000000000015 R14: 0000000000000001 R15: ffff8bfde0199fc0
> [  137.880978] FS:  00007fa33c173700(0000) GS:ffff8bfabf400000(0000) knlGS:0000000000000000
> [  137.889055] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  137.894794] CR2: 000000000001a5c8 CR3: 0000004096b22006 CR4: 00000000003706f0
> [  137.901918] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  137.909041] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/10] sched/fair: wake_affine improvements
  2021-04-23 10:31   ` Srikar Dronamraju
@ 2021-04-23 12:38     ` Mel Gorman
  2021-04-26 10:30       ` Srikar Dronamraju
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2021-04-23 12:38 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Gautham R Shenoy, Parth Shah

[-- Attachment #1: Type: text/plain, Size: 2000 bytes --]

On Fri, Apr 23, 2021 at 04:01:29PM +0530, Srikar Dronamraju wrote:
> > The series also oopses a *lot* and didn't get through a run of basic
> > workloads on x86 on any of three machines. An example oops is
> > 
> 
> Can you pass me your failing config? I have somehow not been seeing this
> either on x86 or on PowerPC on multiple systems.

The machines have since moved onto testing something else (Rik's patch
for newidle) but the attached config should be close enough.

> Also if possible cat /proc/schedstat and cat
> /proc/sys/kernel/sched_domain/cpu0/domain*/name
> 

For the vanilla kernel

SMT
MC
NUMA

> > [  137.770968] BUG: unable to handle page fault for address: 000000000001a5c8
> > [  137.777836] #PF: supervisor read access in kernel mode
> > [  137.782965] #PF: error_code(0x0000) - not-present page
> > [  137.788097] PGD 8000004098a42067 P4D 8000004098a42067 PUD 4092e36067 PMD 40883ac067 PTE 0
> > [  137.796261] Oops: 0000 [#1] SMP PTI
> > [  137.799747] CPU: 0 PID: 14913 Comm: GC Slave Tainted: G            E     5.12.0-rc8-llcfallback-v1r1 #1
> > [  137.809123] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
> > [  137.816765] RIP: 0010:cpus_share_cache+0x22/0x30
> > [  137.821396] Code: fc ff 0f 0b eb 80 66 90 0f 1f 44 00 00 48 63 ff 48 63 f6 48 c7 c0 c8 a5 01 00 48 8b 0c fd 00 59 9d 9a 48 8b 14 f5 00 59 9d 9a <8b> 14 02 39 14 01 0f 94 c0 c3 0f 1f 40 00 0f 1f 44 00 00 41 57 41
> 
> The IP says cpus_share_cache(), which takes 2 ints.
> RAX is 000000000001a5c8 and the panic says
> "unable to handle page fault for address: 000000000001a5c8",
> so it must have faulted on "per_cpu(sd_llc_id, xx_cpu)".
> 

More than likely, I didn't look closely because the intent was to schedule
tests to get some data and do the review later when I had time. tbench
partially completed but oopsed for high thread counts. Another load failed
completely and I didn't test beyond that but tbench for high thread counts
should be reproducible.

-- 
Mel Gorman
SUSE Labs

[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 59330 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/10] sched/fair: wake_affine improvements
  2021-04-23 12:38     ` Mel Gorman
@ 2021-04-26 10:30       ` Srikar Dronamraju
  2021-04-26 11:35         ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-26 10:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Gautham R Shenoy, Parth Shah

* Mel Gorman <mgorman@techsingularity.net> [2021-04-23 13:38:55]:

Hi Mel,

> On Fri, Apr 23, 2021 at 04:01:29PM +0530, Srikar Dronamraju wrote:
> > > The series also oopses a *lot* and didn't get through a run of basic
> > > workloads on x86 on any of three machines. An example oops is
> > > 
> > 
> > Can you pass me your failing config? I have somehow not been seeing this
> > either on x86 or on PowerPC on multiple systems.
> 
> The machines have since moved onto testing something else (Rik's patch
> for newidle) but the attached config should be close enough.
> 
> > Also if possible cat /proc/schedstat and cat
> > /proc/sys/kernel/sched_domain/cpu0/domain*/name
> > 
> 
> For the vanilla kernel
> 
> SMT
> MC
> NUMA

I was able to reproduce the problem and analyze why it would panic in
cpus_share_cache.

In my patch(es), we have code snippets like this.

	if (tsds->idle_core != -1) {
		if (cpumask_test_cpu(tsds->idle_core, p->cpus_ptr))
			return tsds->idle_core;
		return this_cpu;
	}

Here, when we tested idle_core and did the cpumask_test_cpu() check,
tsds->idle_core may not have been -1; however, by the time we return
it, tsds->idle_core could have become -1.

cpus_share_cache() then tries to look up sd_llc_id for CPU -1 and crashes.

It is easier to reproduce this on a machine with more cores in an
LLC than, say, a Power10/Power9.  Hence we are hitting this more often
on x86.

One way could be to save the idle_core to a local variable, but that
negates the whole purpose since we may end up choosing a busy CPU.  I
will find a way to fix this problem.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/10] sched/fair: wake_affine improvements
  2021-04-23  8:25 ` [PATCH 00/10] sched/fair: wake_affine improvements Mel Gorman
  2021-04-23 10:31   ` Srikar Dronamraju
@ 2021-04-26 10:39   ` Srikar Dronamraju
  2021-04-26 11:41     ` Mel Gorman
  1 sibling, 1 reply; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-26 10:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Gautham R Shenoy, Parth Shah

* Mel Gorman <mgorman@techsingularity.net> [2021-04-23 09:25:32]:

> On Thu, Apr 22, 2021 at 03:53:16PM +0530, Srikar Dronamraju wrote:
> > Recently we found that some of the benchmark numbers on Power10 were lesser
> > than expected. Some analysis showed that the problem lies in the fact that
> > L2-Cache on Power10 is at core level i.e only 4 threads share the L2-cache.
> > 
> 
> I didn't get the chance to review this properly although I am suspicious
> of tracking idle_core and updating that more frequently. It becomes a very
> hot cache line that bounces. I did experiment with tracking an idle core
> but the data either went stale too quickly or the updates incurred more
> overhead than a reduced search saved.
> 

This change does increase the number of times we read the idle-core. There
are also more places where we try to update the idle-core. However, I feel
the number of times we actually update the idle-core will now be much
lower than before, because we are mostly doing a conditional update, i.e.
we update the idle-core only if the waking CPU happens to be part of
our core.

Also, if the system is mostly lightly loaded, we check for
available_idle_cpu(), so we may not look for an idle-core at all. If the
system is running CPU-intensive tasks, the idle-core will most likely be -1.
It is only when system utilization keeps swinging between lightly loaded
and heavily loaded that we end up checking and setting the idle-core.

Do let me know your thoughts.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/10] sched/fair: wake_affine improvements
  2021-04-26 10:30       ` Srikar Dronamraju
@ 2021-04-26 11:35         ` Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2021-04-26 11:35 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Gautham R Shenoy, Parth Shah

On Mon, Apr 26, 2021 at 04:00:32PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2021-04-23 13:38:55]:
> 
> Hi Mel,
> 
> > On Fri, Apr 23, 2021 at 04:01:29PM +0530, Srikar Dronamraju wrote:
> > > > The series also oopses a *lot* and didn't get through a run of basic
> > > > workloads on x86 on any of three machines. An example oops is
> > > > 
> > > 
> > > Can you pass me your failing config. I am somehow not been seeing this
> > > either on x86 or on Powerpc on multiple systems.
> > 
> > The machines have since moved onto testing something else (Rik's patch
> > for newidle) but the attached config should be close enough.
> > 
> > > Also if possible cat /proc/schedstat and cat
> > > /proc/sys/kernel/sched_domain/cpu0/domain*/name
> > > 
> > 
> > For the vanilla kernel
> > 
> > SMT
> > MC
> > NUMA
> 
> I was able to reproduce the problem and analyze why it would panic in
> cpus_share_cache.
> 
> In my patch(es), we have code snippets like this.
> 
> 	if (tsds->idle_core != -1) {
> 		if (cpumask_test_cpu(tsds->idle_core, p->cpus_ptr))
> 			return tsds->idle_core;
> 		return this_cpu;
> 	}
> 
> Here, when we tested idle_core and did the cpumask_test_cpu() check,
> tsds->idle_core may not have been -1; however, by the time we return
> it, tsds->idle_core could have become -1.
> 
> cpus_share_cache() then tries to look up sd_llc_id for CPU -1 and crashes.
> 
> It is easier to reproduce this on a machine with more cores in an
> LLC than, say, a Power10/Power9.  Hence we are hitting this more often
> on x86.
> 
> One way could be to save the idle_core to a local variable, but that
> negates the whole purpose since we may end up choosing a busy CPU.  I
> will find a way to fix this problem.
> 

As there is no locking that protects the variable, it's inherently
race-prone. A READ_ONCE to a local variable may be your only choice.
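
Something along these lines (untested, just illustrating the snapshot):

	int idle_core = READ_ONCE(tsds->idle_core);

	if (idle_core != -1) {
		/*
		 * idle_core is a stable snapshot; tsds->idle_core may
		 * concurrently go back to -1, worst case we return a
		 * CPU that is busy by now.
		 */
		if (cpumask_test_cpu(idle_core, p->cpus_ptr))
			return idle_core;
		return this_cpu;
	}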

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/10] sched/fair: wake_affine improvements
  2021-04-26 10:39   ` Srikar Dronamraju
@ 2021-04-26 11:41     ` Mel Gorman
  2021-04-28 12:57       ` Srikar Dronamraju
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2021-04-26 11:41 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Gautham R Shenoy, Parth Shah

On Mon, Apr 26, 2021 at 04:09:40PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2021-04-23 09:25:32]:
> 
> > On Thu, Apr 22, 2021 at 03:53:16PM +0530, Srikar Dronamraju wrote:
> > > Recently we found that some of the benchmark numbers on Power10 were lesser
> > > than expected. Some analysis showed that the problem lies in the fact that
> > > L2-Cache on Power10 is at core level i.e only 4 threads share the L2-cache.
> > > 
> > 
> > I didn't get the chance to review this properly although I am suspicious
> > of tracking idle_core and updating that more frequently. It becomes a very
> > hot cache line that bounces. I did experiment with tracking an idle core
> > but the data either went stale too quickly or the updates incurred more
> > overhead than a reduced search saved.
> > 
> 
> This change does increase the number of times we read the idle-core. There
> are also more places where we try to update the idle-core. However, I feel
> the number of times we actually update the idle-core will now be much
> lower than before, because we are mostly doing a conditional update, i.e.
> we update the idle-core only if the waking CPU happens to be part of
> our core.
> 

Increased cache misses may be detectable from perf.

> Also, if the system is mostly lightly loaded, we check for
> available_idle_cpu(), so we may not look for an idle-core at all. If the
> system is running CPU-intensive tasks, the idle-core will most likely be -1.
> It is only when system utilization keeps swinging between lightly loaded
> and heavily loaded that we end up checking and setting the idle-core.
> 

But this is a "how long is a piece of string" question because the benefit
of tracking an idle core depends on both the interarrival time of wakeups,
the domain utilisation and the length of time tasks are running. When
I was looking at the area, I tracked the SIS efficiency to see how much
each change was helping. The patch no longer applies but the stats are
understood by mmtests if you wanted to forward port it.  It's possible
you would do something similar but specific to idle_core -- e.g. track
how often it's updated, how often it's read, how often a CPU is returned
and how often it's still an idle core and use those stats to calculate
hit/miss ratios.
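
e.g. something along the lines of the sketch below, mirroring the sis_*
schedstats in the patch further down (the idlecore_* field names are
purely illustrative, they do not exist anywhere):

	int idle_core = READ_ONCE(sds->idle_core);	/* sds == the LLC's sched_domain_shared */

	schedstat_inc(this_rq()->idlecore_read);		/* idle_core consulted */
	if (idle_core != -1) {
		schedstat_inc(this_rq()->idlecore_returned);	/* a core was handed back */
		if (available_idle_cpu(idle_core))
			schedstat_inc(this_rq()->idlecore_hit);	/* and it was still idle */
	}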

However, I would caution against conflating the "fallback search domain"
with the patches tracking idle core because they should be independent
of each other.

Old patch that no longer applies that was the basis for some SIS work
over a year ago is below

---8<---
From c791354b92a5723b0da14d050f942f61f0c12857 Mon Sep 17 00:00:00 2001
From: Mel Gorman <mgorman@techsingularity.net>
Date: Fri, 14 Feb 2020 19:11:16 +0000
Subject: [PATCH] sched/fair: Track efficiency of select_idle_sibling

select_idle_sibling is an important path that finds a nearby idle CPU on
wakeup. As it is examining other CPUs' state, it can be expensive in terms
of cache usage. This patch tracks the search efficiency if schedstats
are enabled. In general, this is only useful for kernel developers but
schedstats are typically disabled by default so it is convenient for
development and mostly free otherwise.

The series can be done without this patch but the stats were used to
generate a number of useful metrics in mmtest to analyse what was
going on.

SIS Search: Number of calls to select_idle_sibling

SIS Domain Search: Number of times the domain was searched because the
	fast path failed.

SIS Scanned: Generally the number of runqueues scanned but the fast
	path counts as 1 regardless of the values for target, prev
	and recent.

SIS Domain Scanned: Number of runqueues scanned during a search of the
	LLC domain.

SIS Failures: Number of SIS calls that failed to find an idle CPU

SIS Search Efficiency: A ratio expressed as a percentage of runqueues
	scanned versus idle CPUs found. A 100% efficiency indicates that
	the target, prev or recent CPU of a task was idle at wakeup. The
	lower the efficiency, the more runqueues were scanned before an
	idle CPU was found.

SIS Domain Search Efficiency: Similar, except only for the slower SIS
	path.

SIS Fast Success Rate: Percentage of SIS that used target, prev or
	recent CPUs.

SIS Success rate: Percentage of scans that found an idle CPU.
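
With this, each per-CPU line in /proc/schedstat carries the four new
counters at the end; the field order (a single line per CPU, wrapped
here for readability) is:

	cpu<N> yld_count 0 sched_count sched_goidle ttwu_count ttwu_local
	       rq_cpu_time run_delay pcount sis_search sis_domain_search
	       sis_scanned sis_failed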

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/debug.c |  4 ++++
 kernel/sched/fair.c  | 14 ++++++++++++++
 kernel/sched/sched.h |  6 ++++++
 kernel/sched/stats.c |  8 +++++---
 4 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8331bc04aea2..7af6e8a12f40 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -661,6 +661,10 @@ do {									\
 		P(sched_goidle);
 		P(ttwu_count);
 		P(ttwu_local);
+		P(sis_search);
+		P(sis_domain_search);
+		P(sis_scanned);
+		P(sis_failed);
 	}
 #undef P
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23a6fe298720..10cdd28e910e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5975,6 +5975,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 		bool idle = true;
 
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
+			schedstat_inc(this_rq()->sis_scanned);
 			if (!available_idle_cpu(cpu)) {
 				idle = false;
 				break;
@@ -6005,6 +6006,7 @@ static int select_idle_smt(struct task_struct *p, int target)
 		return -1;
 
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
+		schedstat_inc(this_rq()->sis_scanned);
 		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 			continue;
 		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
@@ -6070,6 +6072,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
 	for_each_cpu_wrap(cpu, cpus, target) {
+		schedstat_inc(this_rq()->sis_scanned);
 		if (!--nr)
 			return -1;
 		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
@@ -6126,6 +6129,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
 
+	schedstat_inc(this_rq()->sis_search);
+
+	/*
+	 * Checking if prev, target and recent is treated as one scan. A
+	 * perfect hit on one of those is considered 100% efficiency.
+	 * Further scanning impairs efficiency.
+	 */
+	schedstat_inc(this_rq()->sis_scanned);
+
 	/*
 	 * For asymmetric CPU capacity systems, our domain of interest is
 	 * sd_asym_cpucapacity rather than sd_llc.
@@ -6191,6 +6203,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
+	schedstat_inc(this_rq()->sis_domain_search);
 	i = select_idle_core(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
@@ -6203,6 +6216,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	schedstat_inc(this_rq()->sis_failed);
 	return target;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2a0caf394dd4..c1a5c27cfd72 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1006,6 +1006,12 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int		ttwu_count;
 	unsigned int		ttwu_local;
+
+	/* select_idle_sibling stats */
+	unsigned int		sis_search;
+	unsigned int		sis_domain_search;
+	unsigned int		sis_scanned;
+	unsigned int		sis_failed;
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 750fb3c67eed..390bfcc3842c 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -10,7 +10,7 @@
  * Bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 15
+#define SCHEDSTAT_VERSION 16
 
 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -30,12 +30,14 @@ static int show_schedstat(struct seq_file *seq, void *v)
 
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+		    rq->sis_search, rq->sis_domain_search,
+		    rq->sis_scanned, rq->sis_failed);
 
 		seq_printf(seq, "\n");
 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/10] sched/fair: wake_affine improvements
  2021-04-22 10:23 [PATCH 00/10] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (10 preceding siblings ...)
  2021-04-23  8:25 ` [PATCH 00/10] sched/fair: wake_affine improvements Mel Gorman
@ 2021-04-27 14:52 ` Vincent Guittot
  2021-04-28 12:49   ` Srikar Dronamraju
  11 siblings, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2021-04-27 14:52 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Dietmar Eggemann,
	Michael Ellerman, Gautham R Shenoy, Parth Shah

Hi Srikar,

On Thu, 22 Apr 2021 at 12:23, Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> Recently we found that some of the benchmark numbers on Power10 were lesser
> than expected. Some analysis showed that the problem lies in the fact that
> L2-Cache on Power10 is at core level i.e only 4 threads share the L2-cache.
>
> One probable solution to the problem was worked by Gautham where he posted
> http://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/t/#u
> a patch that marks MC domain as LLC.
>
> Here the focus is on seeing if we can improve the current core scheduler's
> wakeup mechanism by looking at idle-cores and nr_busy_cpus that is already
> maintained per Last level cache(aka LLC) (first 8 patches) + explore the
> possibility to provide a fallback LLC domain, that can be preferred if the
> current LLC is busy (last 2 patches).
>
> Except the last 2 patches, the rest patches should work independently of the
> other proposed solution. i.e if the mc-llc patch is accepted, then the last
> two patches may not be needed for Power10. However this may be helpful for
> other archs/platforms.
>
> In the fallback approach, we look for a one-to-one mapping for each LLC.
> However this can be easily modified to look for all LLC's in the current
> LLC's parent. Also fallback is only used for sync wakeups. This is because
> that is where we expect the maximum benefit of moving the task closer to the
> task. For non-sync wakeups, its expected that CPU from previous LLC may be
> better off.
>
> Request you to please review and provide your feedback.
>
> Benchmarking numbers are from Power 10 but I have verified that we don't
> regress on Power 9 setup.
>
> # lscpu
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              80
> On-line CPU(s) list: 0-79
> Thread(s) per core:  8
> Core(s) per socket:  10
> Socket(s):           1
> NUMA node(s):        1
> Model:               1.0 (pvr 0080 0100)
> Model name:          POWER10 (architected), altivec supported
> Hypervisor vendor:   pHyp
> Virtualization type: para
> L1d cache:           64K
> L1i cache:           32K
> L2 cache:            256K
> L3 cache:            8K
> NUMA node2 CPU(s):   0-79
>
> Hackbench: (latency, lower is better)
>
> v5.12-rc5
> instances = 1, min = 24.102529 usecs/op, median =  usecs/op, max = 24.102529 usecs/op
> instances = 2, min = 24.096112 usecs/op, median = 24.096112 usecs/op, max = 24.178903 usecs/op
> instances = 4, min = 24.080541 usecs/op, median = 24.082990 usecs/op, max = 24.166873 usecs/op
> instances = 8, min = 24.088969 usecs/op, median = 24.116081 usecs/op, max = 24.199853 usecs/op
> instances = 16, min = 24.267228 usecs/op, median = 26.204510 usecs/op, max = 29.218360 usecs/op
> instances = 32, min = 30.680071 usecs/op, median = 32.911664 usecs/op, max = 37.380470 usecs/op
> instances = 64, min = 43.908331 usecs/op, median = 44.454343 usecs/op, max = 46.210298 usecs/op
> instances = 80, min = 44.585754 usecs/op, median = 56.738546 usecs/op, max = 60.625826 usecs/op
>
> v5.12-rc5 + mc-llc
> instances = 1, min = 18.676505 usecs/op, median =  usecs/op, max = 18.676505 usecs/op
> instances = 2, min = 18.488627 usecs/op, median = 18.488627 usecs/op, max = 18.574946 usecs/op
> instances = 4, min = 18.428399 usecs/op, median = 18.589051 usecs/op, max = 18.872548 usecs/op
> instances = 8, min = 18.597389 usecs/op, median = 18.783815 usecs/op, max = 19.265532 usecs/op
> instances = 16, min = 21.922350 usecs/op, median = 22.737792 usecs/op, max = 24.832429 usecs/op
> instances = 32, min = 29.770446 usecs/op, median = 31.996687 usecs/op, max = 34.053042 usecs/op
> instances = 64, min = 53.067842 usecs/op, median = 53.295139 usecs/op, max = 53.473059 usecs/op
> instances = 80, min = 44.423288 usecs/op, median = 44.713767 usecs/op, max = 45.159761 usecs/op
>
> v5.12-rc5 + this patchset
> instances = 1, min = 19.368805 usecs/op, median =  usecs/op, max = 19.368805 usecs/op
> instances = 2, min = 19.423674 usecs/op, median = 19.423674 usecs/op, max = 19.506203 usecs/op
> instances = 4, min = 19.454523 usecs/op, median = 19.596947 usecs/op, max = 19.863620 usecs/op
> instances = 8, min = 20.005272 usecs/op, median = 20.239924 usecs/op, max = 20.878947 usecs/op
> instances = 16, min = 21.856779 usecs/op, median = 24.102147 usecs/op, max = 25.496110 usecs/op
> instances = 32, min = 31.460159 usecs/op, median = 32.809621 usecs/op, max = 33.939650 usecs/op
> instances = 64, min = 39.506553 usecs/op, median = 43.835221 usecs/op, max = 45.645505 usecs/op
> instances = 80, min = 43.805716 usecs/op, median = 44.314757 usecs/op, max = 48.910236 usecs/op
>
> Summary:
> mc-llc and this patchset seem to be performing much better than vanilla v5.12-rc5
>
> DayTrader (throughput, higher is better)
>                      v5.12-rc5   v5.12-rc5     v5.12-rc5
>                                  + mc-llc      + patchset
> 64CPUs/1JVM/ 60Users  6373.7      7520.5        7232.3
> 64CPUs/1JVM/ 80Users  6742.1      7940.9        7732.8
> 64CPUs/1JVM/100Users  6482.2      7730.3        7540
> 64CPUs/2JVM/ 60Users  6335        8081.6        7914.3
> 64CPUs/2JVM/ 80Users  6360.8      8259.6        8138.6
> 64CPUs/2JVM/100Users  6215.6      8046.5        8039.2
> 64CPUs/4JVM/ 60Users  5385.4      7685.3        7706.1
> 64CPUs/4JVM/ 80Users  5380.8      7753.3        7721.5
> 64CPUs/4JVM/100Users  5275.2      7549.2        7608.3
>
> Summary: Across all profiles, this patchset or mc-llc out perform
> vanilla v5.12-rc5
> Not: Only 64 cores were online during this test.
>
> schbench (latency: lesser is better)
> ======== Running schbench -m 3 -r 30 =================
> Latency percentiles (usec) runtime 10 (s) (2545 total samples)
> v5.12-rc5                  |  v5.12-rc5 + mc-llc                 | v5.12-rc5 + patchset
>
> 50.0th: 56 (1301 samples)  |     50.0th: 49 (1309 samples)       | 50.0th: 50 (1310 samples)
> 75.0th: 76 (623 samples)   |     75.0th: 66 (628 samples)        | 75.0th: 68 (632 samples)
> 90.0th: 93 (371 samples)   |     90.0th: 78 (371 samples)        | 90.0th: 80 (354 samples)
> 95.0th: 107 (123 samples)  |     95.0th: 87 (117 samples)        | 95.0th: 86 (126 samples)
> *99.0th: 12560 (102 samples)    *99.0th: 100 (97 samples)        | *99.0th: 103 (97 samples)
> 99.5th: 15312 (14 samples) |     99.5th: 104 (12 samples)        | 99.5th: 1202 (13 samples)
> 99.9th: 19936 (9 samples)  |     99.9th: 106 (8 samples)         | 99.9th: 14992 (10 samples)
> min=13, max=20684          |     min=15, max=113                 | min=15, max=18721
>
> Latency percentiles (usec) runtime 20 (s) (7649 total samples)
>
> 50.0th: 51 (3884 samples)  |     50.0th: 50 (3935 samples)       | 50.0th: 49 (3841 samples)
> 75.0th: 69 (1859 samples)  |     75.0th: 66 (1817 samples)       | 75.0th: 67 (1965 samples)
> 90.0th: 87 (1173 samples)  |     90.0th: 80 (1204 samples)       | 90.0th: 78 (1134 samples)
> 95.0th: 97 (368 samples)   |     95.0th: 87 (342 samples)        | 95.0th: 83 (359 samples)
> *99.0th: 8624 (290 samples)|     *99.0th: 98 (294 samples)       | *99.0th: 93 (296 samples)
> 99.5th: 11344 (37 samples) |     99.5th: 102 (37 samples)        | 99.5th: 98 (34 samples)
> 99.9th: 18592 (31 samples) |     99.9th: 106 (30 samples)        | 99.9th: 7544 (28 samples)
> min=13, max=20684          |     min=12, max=113                 | min=13, max=18721
>
> Latency percentiles (usec) runtime 30 (s) (12785 total samples)
>
> 50.0th: 50 (6614 samples)  |     50.0th: 49 (6544 samples)       | 50.0th: 48 (6527 samples)
> 75.0th: 67 (3059 samples)  |     75.0th: 65 (3100 samples)       | 75.0th: 64 (3143 samples)
> 90.0th: 84 (1894 samples)  |     90.0th: 79 (1912 samples)       | 90.0th: 76 (1985 samples)
> 95.0th: 94 (586 samples)   |     95.0th: 87 (646 samples)        | 95.0th: 81 (585 samples)
> *99.0th: 8304 (507 samples)|     *99.0th: 101 (496 samples)      | *99.0th: 90 (453 samples)
> 99.5th: 11696 (62 samples) |     99.5th: 104 (45 samples)        | 99.5th: 94 (66 samples)
> 99.9th: 18592 (51 samples) |     99.9th: 110 (51 samples)        | 99.9th: 1202 (49 samples)
> min=12, max=21421          |     min=1, max=126                  | min=3, max=18721
>
> Summary:
> mc-llc is the best option, but this patchset also helps compared to vanilla v5.12-rc5
>
>
> mongodb (threads=6) (throughput, higher is better)
>                                          Throughput         read        clean      update
>                                                             latency     latency    latency
> v5.12-rc5            JVM=YCSB_CLIENTS=14  68116.05 ops/sec   1109.82 us  944.19 us  1342.29 us
> v5.12-rc5            JVM=YCSB_CLIENTS=21  64802.69 ops/sec   1772.64 us  944.69 us  2099.57 us
> v5.12-rc5            JVM=YCSB_CLIENTS=28  61792.78 ops/sec   2490.48 us  930.09 us  2928.03 us
> v5.12-rc5            JVM=YCSB_CLIENTS=35  59604.44 ops/sec   3236.86 us  870.28 us  3787.48 us
>
> v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=14  70948.51 ops/sec   1060.21 us  842.02 us  1289.44 us
> v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=21  68732.48 ops/sec   1669.91 us  871.57 us  1975.19 us
> v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=28  66674.81 ops/sec   2313.79 us  889.59 us  2702.36 us
> v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=35  64397.51 ops/sec   3010.66 us  966.28 us  3484.19 us
>
> v5.12-rc5 + patchset JVM=YCSB_CLIENTS=14  67403.29 ops/sec   1121.80 us  797.81 us  1357.28 us
> v5.12-rc5 + patchset JVM=YCSB_CLIENTS=21  63952.79 ops/sec   1792.86 us  779.59 us  2130.54 us
> v5.12-rc5 + patchset JVM=YCSB_CLIENTS=28  62198.83 ops/sec   2469.60 us  780.00 us  2914.48 us
> v5.12-rc5 + patchset JVM=YCSB_CLIENTS=35  60333.81 ops/sec   3192.41 us  822.09 us  3748.24 us
>
> Summary:
> mc-llc outperforms, this patchset and upstream almost give similar performance.

So the mc-llc patch seems to be the best approach IMHO. Although the
hemispheres don't share a cache, they share enough resources that
cache-snooping is as efficient as sharing the cache.

>
>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Parth Shah <parth@linux.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Rik van Riel <riel@surriel.com>
>
> Srikar Dronamraju (10):
>   sched/fair: Update affine statistics when needed
>   sched/fair: Maintain the identity of idle-core
>   sched/fair: Update idle-core more often
>   sched/fair: Prefer idle CPU to cache affinity
>   sched/fair: Call wake_affine only if necessary
>   sched/idle: Move busy_cpu accounting to idle callback
>   sched/fair: Remove ifdefs in waker_affine_idler_llc
>   sched/fair: Dont iterate if no idle CPUs
>   sched/topology: Introduce fallback LLC
>   powerpc/smp: Add fallback flag to powerpc MC domain
>
>  arch/powerpc/kernel/smp.c      |   7 +-
>  include/linux/sched/sd_flags.h |   7 +
>  include/linux/sched/topology.h |   3 +-
>  kernel/sched/fair.c            | 229 +++++++++++++++++++++++++++------
>  kernel/sched/features.h        |   1 +
>  kernel/sched/idle.c            |  33 ++++-
>  kernel/sched/sched.h           |   6 +
>  kernel/sched/topology.c        |  54 +++++++-
>  8 files changed, 296 insertions(+), 44 deletions(-)
>
> --
> 2.18.2
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/10] sched/fair: wake_affine improvements
  2021-04-27 14:52 ` Vincent Guittot
@ 2021-04-28 12:49   ` Srikar Dronamraju
  0 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-28 12:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Dietmar Eggemann,
	Michael Ellerman, Gautham R Shenoy, Parth Shah

* Vincent Guittot <vincent.guittot@linaro.org> [2021-04-27 16:52:30]:

> Hi Srikar,

Hi Vincent, 

> On Thu, 22 Apr 2021 at 12:23, Srikar Dronamraju
> <srikar@linux.vnet.ibm.com> wrote:
> >
> > Recently we found that some of the benchmark numbers on Power10 were lesser
> > than expected. Some analysis showed that the problem lies in the fact that
> > L2-Cache on Power10 is at core level i.e only 4 threads share the L2-cache.
> >
> >
> > Summary:
> > mc-llc outperforms, this patchset and upstream almost give similar performance.
> 
> So mc-llc patch seems to be the best approach IMHO. Although the
> hemisphere don't share cache, they share enough resources so
> cache-snooping is as efficient as sharing cache
> 

Yes, mc-llc only helps specific systems like Power10, but it shows better
numbers than my posted patchset.

However, in this patchset we are looking at areas in the wakeup path (such
as preferring idler LLCs) that we could optimize and that can help other
archs too. In addition, the fallback mechanism is generic enough that we
could use it for other systems as well.

I know there are valid concerns raised by Mel, and I am working to resolve
them. Some of them are:
- How hot the idle-core field gets
- Crashes when running tbench (I was able to reproduce them with kernbench on x86)

I am also adding some more changes with which we get performance similar to
mc-llc.

> > Cc: LKML <linux-kernel@vger.kernel.org>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Parth Shah <parth@linux.ibm.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Rik van Riel <riel@surriel.com>
> >
> > Srikar Dronamraju (10):
> >   sched/fair: Update affine statistics when needed
> >   sched/fair: Maintain the identity of idle-core
> >   sched/fair: Update idle-core more often
> >   sched/fair: Prefer idle CPU to cache affinity
> >   sched/fair: Call wake_affine only if necessary
> >   sched/idle: Move busy_cpu accounting to idle callback
> >   sched/fair: Remove ifdefs in waker_affine_idler_llc
> >   sched/fair: Dont iterate if no idle CPUs
> >   sched/topology: Introduce fallback LLC
> >   powerpc/smp: Add fallback flag to powerpc MC domain
> >
> >  arch/powerpc/kernel/smp.c      |   7 +-
> >  include/linux/sched/sd_flags.h |   7 +
> >  include/linux/sched/topology.h |   3 +-
> >  kernel/sched/fair.c            | 229 +++++++++++++++++++++++++++------
> >  kernel/sched/features.h        |   1 +
> >  kernel/sched/idle.c            |  33 ++++-
> >  kernel/sched/sched.h           |   6 +
> >  kernel/sched/topology.c        |  54 +++++++-
> >  8 files changed, 296 insertions(+), 44 deletions(-)
> >
> > --
> > 2.18.2
> >

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 00/10] sched/fair: wake_affine improvements
  2021-04-26 11:41     ` Mel Gorman
@ 2021-04-28 12:57       ` Srikar Dronamraju
  0 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2021-04-28 12:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Gautham R Shenoy, Parth Shah

* Mel Gorman <mgorman@techsingularity.net> [2021-04-26 12:41:12]:

> On Mon, Apr 26, 2021 at 04:09:40PM +0530, Srikar Dronamraju wrote:
> > * Mel Gorman <mgorman@techsingularity.net> [2021-04-23 09:25:32]:

<snip>

> > 
> > This change does increase the number of times we read the idle-core.  There
> > are also more places where we try to update the idle-core. However I feel
> > the number of times, we actually update the idle-core now will be much
> > lesser than previous, because we are mostly doing a conditional update. i.e
> > we are updating the idle-core only if the waking up CPU happens to be part
> > of our core.
> > 
> 
> Increased cache misses may be detectable from perf.

I will get some cache-miss numbers before and after the patchset.
Since we may be altering the selection of CPUs, especially on systems that
have small LLCs, I expect the cache misses to differ between the two.
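
In practice something like "perf stat -e cache-misses,cache-references -a --
<benchmark>" is enough for that. If we want it per workload from inside the
harness, a plain perf_event_open() counter would do; the sketch below is
only an illustration of that API, nothing specific to this patchset, and
counting kernel-side misses may need root or a relaxed perf_event_paranoid:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
				int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(int argc, char **argv)
{
	struct perf_event_attr attr;
	unsigned long long count;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
		return 1;
	}

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_CACHE_MISSES;
	attr.disabled = 1;
	attr.inherit = 1;	/* count the benchmark's children too */

	/* current task, any CPU, kernel+user */
	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

	if (fork() == 0) {
		execvp(argv[1], &argv[1]);	/* run the benchmark */
		_exit(127);
	}
	wait(NULL);

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("cache-misses: %llu\n", count);
	close(fd);
	return 0;
}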

> 
> > Also if the system is mostly lightly loaded, we check for
> > available_idle_cpu, so we may not look for an idle-core. If the system is
> > running a CPU intensive task, then the idle-core will most likely to be -1.
> > Its only the cases where the system utilization keeps swinging between
> > lightly loaded to heavy load, that we would end up checking and setting
> > idle-core.
> > 
> 
> But this is a "how long is a piece of string" question because the benefit
> of tracking an idle core depends on both the interarrival time of wakeups,
> the domain utilisation and the length of time tasks are running. When
> I was looking at the area, I tracked the SIS efficiency to see how much
> each change was helping. The patch no longer applies but the stats are
> understood by mmtests if you wanted to forward port it.  It's possible
> you would do something similar but specific to idle_core -- e.g. track
> how often it's updated, how often it's read, how often a CPU is returned
> and how often it's still an idle core and use those stats to calculate
> hit/miss ratios.
> 
> However, I would caution against conflating the "fallback search domain"
> with the patches tracking idle core because they should be independent
> of each other.
> 
> Old patch that no longer applies that was the basis for some SIS work
> over a year ago is below
> 

Thanks, Mel, for sharing this. I will build a prototype patch along these
lines and see what it turns up.

> ---8<---
> From c791354b92a5723b0da14d050f942f61f0c12857 Mon Sep 17 00:00:00 2001
> From: Mel Gorman <mgorman@techsingularity.net>
> Date: Fri, 14 Feb 2020 19:11:16 +0000
> Subject: [PATCH] sched/fair: Track efficiency of select_idle_sibling
> 
<snip>

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-04-28 12:58 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-22 10:23 [PATCH 00/10] sched/fair: wake_affine improvements Srikar Dronamraju
2021-04-22 10:23 ` [PATCH 01/10] sched/fair: Update affine statistics when needed Srikar Dronamraju
2021-04-22 10:23 ` [PATCH 02/10] sched/fair: Maintain the identity of idle-core Srikar Dronamraju
2021-04-22 10:23 ` [PATCH 03/10] sched/fair: Update idle-core more often Srikar Dronamraju
2021-04-22 10:23 ` [PATCH 04/10] sched/fair: Prefer idle CPU to cache affinity Srikar Dronamraju
2021-04-22 10:23 ` [PATCH 05/10] sched/fair: Call wake_affine only if necessary Srikar Dronamraju
2021-04-22 10:23 ` [PATCH 06/10] sched/idle: Move busy_cpu accounting to idle callback Srikar Dronamraju
2021-04-22 10:23 ` [PATCH 07/10] sched/fair: Remove ifdefs in waker_affine_idler_llc Srikar Dronamraju
2021-04-22 10:23 ` [PATCH 08/10] sched/fair: Dont iterate if no idle CPUs Srikar Dronamraju
2021-04-22 10:23 ` [PATCH 09/10] sched/topology: Introduce fallback LLC Srikar Dronamraju
2021-04-22 15:10   ` kernel test robot
2021-04-22 10:23 ` [PATCH 10/10] powerpc/smp: Add fallback flag to powerpc MC domain Srikar Dronamraju
2021-04-23  8:25 ` [PATCH 00/10] sched/fair: wake_affine improvements Mel Gorman
2021-04-23 10:31   ` Srikar Dronamraju
2021-04-23 12:38     ` Mel Gorman
2021-04-26 10:30       ` Srikar Dronamraju
2021-04-26 11:35         ` Mel Gorman
2021-04-26 10:39   ` Srikar Dronamraju
2021-04-26 11:41     ` Mel Gorman
2021-04-28 12:57       ` Srikar Dronamraju
2021-04-27 14:52 ` Vincent Guittot
2021-04-28 12:49   ` Srikar Dronamraju

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).