* [PATCH v2 0/8] sched/fair: wake_affine improvements
@ 2021-05-06 16:45 Srikar Dronamraju
  2021-05-06 16:45 ` [PATCH v2 1/8] sched/fair: Update affine statistics when needed Srikar Dronamraju
                   ` (8 more replies)
  0 siblings, 9 replies; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-06 16:45 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Michael Ellerman, Gautham R Shenoy, Parth Shah

Changelog v1->v2:
v1 Link: http://lore.kernel.org/lkml/20210422102326.35889-1-srikar@linux.vnet.ibm.com/t/#u
 - Fallback LLC domain has been split out as a subsequent patchset.
   					(suggested by Mel)
 - Fix a panic due to two wakeups racing for the same idle-core
					(Reported by Mel)
 - Differentiate between an LLC that surely has no idle-cores (-2) vs
   an LLC that may or may not have idle-cores (-1).
 - Rebased to v5.12

Recently we found that some of the benchmark numbers on Power10 were lower
than expected. Analysis showed that the problem lies in the fact that the
L2-cache on Power10 is at the core level, i.e., only 4 threads share the
L2-cache.

One possible solution to the problem was worked on by Gautham, who
posted a patch that marks the MC domain as the LLC:
http://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/t/#u

Here the focus is on improving the current core scheduler's wakeup
mechanism by looking at the idle-core and nr_busy_cpus hints that are
already maintained per last-level cache (aka LLC).

Hence this approach can work well with mc-llc too, and it can also
help other architectures.

Please review and provide your feedback.

Benchmark numbers are from Power10, but I have verified that we don't
regress on a Power9 setup.

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  8
Core(s) per socket:  10
Socket(s):           1
NUMA node(s):        1
Model:               1.0 (pvr 0080 0100)
Model name:          POWER10 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           64K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8K
NUMA node2 CPU(s):   0-79

Hackbench: (latency, lower is better)

v5.12-rc5
instances = 1, min = 24.102529 usecs/op, median =  usecs/op, max = 24.102529 usecs/op
instances = 2, min = 24.096112 usecs/op, median = 24.096112 usecs/op, max = 24.178903 usecs/op
instances = 4, min = 24.080541 usecs/op, median = 24.082990 usecs/op, max = 24.166873 usecs/op
instances = 8, min = 24.088969 usecs/op, median = 24.116081 usecs/op, max = 24.199853 usecs/op
instances = 16, min = 24.267228 usecs/op, median = 26.204510 usecs/op, max = 29.218360 usecs/op
instances = 32, min = 30.680071 usecs/op, median = 32.911664 usecs/op, max = 37.380470 usecs/op
instances = 64, min = 43.908331 usecs/op, median = 44.454343 usecs/op, max = 46.210298 usecs/op
instances = 80, min = 44.585754 usecs/op, median = 56.738546 usecs/op, max = 60.625826 usecs/op

v5.12-rc5 + mc-llc+
instances = 1, min = 18.676505 usecs/op, median =  usecs/op, max = 18.676505 usecs/op
instances = 2, min = 18.488627 usecs/op, median = 18.488627 usecs/op, max = 18.574946 usecs/op
instances = 4, min = 18.428399 usecs/op, median = 18.589051 usecs/op, max = 18.872548 usecs/op
instances = 8, min = 18.597389 usecs/op, median = 18.783815 usecs/op, max = 19.265532 usecs/op
instances = 16, min = 21.922350 usecs/op, median = 22.737792 usecs/op, max = 24.832429 usecs/op
instances = 32, min = 29.770446 usecs/op, median = 31.996687 usecs/op, max = 34.053042 usecs/op
instances = 64, min = 53.067842 usecs/op, median = 53.295139 usecs/op, max = 53.473059 usecs/op
instances = 80, min = 44.423288 usecs/op, median = 44.713767 usecs/op, max = 45.159761 usecs/op

v5.12-rc5 + this patchset
instances = 1, min = 19.240824 usecs/op, median =  usecs/op, max = 19.240824 usecs/op
instances = 2, min = 19.143470 usecs/op, median = 19.143470 usecs/op, max = 19.249875 usecs/op
instances = 4, min = 19.399812 usecs/op, median = 19.487433 usecs/op, max = 19.501298 usecs/op
instances = 8, min = 19.024297 usecs/op, median = 19.908682 usecs/op, max = 20.741605 usecs/op
instances = 16, min = 22.209444 usecs/op, median = 23.971275 usecs/op, max = 25.145198 usecs/op
instances = 32, min = 31.220392 usecs/op, median = 32.689189 usecs/op, max = 34.081588 usecs/op
instances = 64, min = 39.012110 usecs/op, median = 44.062042 usecs/op, max = 45.370525 usecs/op
instances = 80, min = 43.884358 usecs/op, median = 44.326417 usecs/op, max = 48.031303 usecs/op

Summary:
mc-llc and this patchset both seem to perform much better than vanilla v5.12-rc5.

DayTrader (throughput, higher is better)
                      v5.12-rc5   v5.12-rc5   v5.12-rc5
                                  + mc-llc    + patchset
64CPUs/1JVM/ 60Users  6373.7      7520.5        7375.6
64CPUs/1JVM/ 80Users  6742.1      7940.9        7832.9
64CPUs/1JVM/100Users  6482.2      7730.3        7538.4
64CPUs/2JVM/ 60Users  6335        8081.6        8000.2
64CPUs/2JVM/ 80Users  6360.8      8259.6        8315.4
64CPUs/2JVM/100Users  6215.6      8046.5        8049.4
64CPUs/4JVM/ 60Users  5385.4      7685.3        8013.5
64CPUs/4JVM/ 80Users  5380.8      7753.3        7868
64CPUs/4JVM/100Users  5275.2      7549.2        7620

Summary: Across all profiles, this patchset and mc-llc outperform
vanilla v5.12-rc5.
Note: Only 64 CPUs were online during this test.

schbench (latency, lower is better)
======== Running schbench -m 3 -r 30 =================
Latency percentiles (usec) runtime 10 (s) (2545 total samples)
v5.12-rc5                  |  v5.12-rc5 + mc-llc                 | v5.12-rc5 + patchset

50.0th: 56 (1301 samples)  |     50.0th: 49 (1309 samples)       | 50.0th: 53 (1285 samples)
75.0th: 76 (623 samples)   |     75.0th: 66 (628 samples)        | 75.0th: 72 (635 samples)
90.0th: 93 (371 samples)   |     90.0th: 78 (371 samples)        | 90.0th: 88 (388 samples)
95.0th: 107 (123 samples)  |     95.0th: 87 (117 samples)        | 95.0th: 94 (118 samples)
*99.0th: 12560 (102 samples)|    *99.0th: 100 (97 samples)        | *99.0th: 108 (108 samples)
99.5th: 15312 (14 samples) |     99.5th: 104 (12 samples)        | 99.5th: 108 (0 samples)
99.9th: 19936 (9 samples)  |     99.9th: 106 (8 samples)         | 99.9th: 110 (8 samples)
min=13, max=20684          |     min=15, max=113                 | min=15, max=1433

Latency percentiles (usec) runtime 20 (s) (7649 total samples)

50.0th: 51 (3884 samples)  |     50.0th: 50 (3935 samples)       | 50.0th: 51 (3843 samples)
75.0th: 69 (1859 samples)  |     75.0th: 66 (1817 samples)       | 75.0th: 69 (1962 samples)
90.0th: 87 (1173 samples)  |     90.0th: 80 (1204 samples)       | 90.0th: 84 (1103 samples)
95.0th: 97 (368 samples)   |     95.0th: 87 (342 samples)        | 95.0th: 93 (386 samples)
*99.0th: 8624 (290 samples)|     *99.0th: 98 (294 samples)       | *99.0th: 107 (297 samples)
99.5th: 11344 (37 samples) |     99.5th: 102 (37 samples)        | 99.5th: 110 (39 samples)
99.9th: 18592 (31 samples) |     99.9th: 106 (30 samples)        | 99.9th: 1714 (27 samples)
min=13, max=20684          |     min=12, max=113                 | min=15, max=4456

Latency percentiles (usec) runtime 30 (s) (12785 total samples)

50.0th: 50 (6614 samples)  |     50.0th: 49 (6544 samples)       | 50.0th: 50 (6443 samples)
75.0th: 67 (3059 samples)  |     75.0th: 65 (3100 samples)       | 75.0th: 67 (3263 samples)
90.0th: 84 (1894 samples)  |     90.0th: 79 (1912 samples)       | 90.0th: 82 (1890 samples)
95.0th: 94 (586 samples)   |     95.0th: 87 (646 samples)        | 95.0th: 92 (652 samples)
*99.0th: 8304 (507 samples)|     *99.0th: 101 (496 samples)      | *99.0th: 107 (464 samples)
99.5th: 11696 (62 samples) |     99.5th: 104 (45 samples)        | 99.5th: 110 (61 samples)
99.9th: 18592 (51 samples) |     99.9th: 110 (51 samples)        | 99.9th: 1434 (47 samples)
min=12, max=21421          |     min=1, max=126                  | min=15, max=4456

Summary:
mc-llc is the best option, but this patchset also helps compared to vanilla v5.12-rc5.

mongodb (threads=6) (throughput, higher is better)
					 Throughput         read        clean      update
					                    latency     latency    latency
v5.12-rc5            JVM=YCSB_CLIENTS=14  68116.05 ops/sec   1109.82 us  944.19 us  1342.29 us
v5.12-rc5            JVM=YCSB_CLIENTS=21  64802.69 ops/sec   1772.64 us  944.69 us  2099.57 us
v5.12-rc5            JVM=YCSB_CLIENTS=28  61792.78 ops/sec   2490.48 us  930.09 us  2928.03 us
v5.12-rc5            JVM=YCSB_CLIENTS=35  59604.44 ops/sec   3236.86 us  870.28 us  3787.48 us

v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=14  70948.51 ops/sec   1060.21 us  842.02 us  1289.44 us
v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=21  68732.48 ops/sec   1669.91 us  871.57 us  1975.19 us
v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=28  66674.81 ops/sec   2313.79 us  889.59 us  2702.36 us
v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=35  64397.51 ops/sec   3010.66 us  966.28 us  3484.19 us

v5.12-rc5 + patchset JVM=YCSB_CLIENTS=14  67604.51 ops/sec   1117.91 us  947.07 us  1353.41 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=21  63979.39 ops/sec   1793.63 us  869.72 us  2130.22 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=28  62032.34 ops/sec   2475.89 us  869.06 us  2922.01 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=35  60152.96 ops/sec   3203.84 us  972.00 us  3756.52 us

Summary:
mc-llc outperforms; this patchset and upstream give almost similar performance.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>

Srikar Dronamraju (8):
  sched/fair: Update affine statistics when needed
  sched/fair: Maintain the identity of idle-core
  sched/fair: Update idle-core more often
  sched/fair: Prefer idle CPU to cache affinity
  sched/fair: Use affine_idler_llc for wakeups across LLC
  sched/idle: Move busy_cpu accounting to idle callback
  sched/fair: Remove ifdefs in waker_affine_idler_llc
  sched/fair: Dont iterate if no idle CPUs

 include/linux/sched/topology.h |   2 +-
 kernel/sched/fair.c            | 204 ++++++++++++++++++++++++++-------
 kernel/sched/features.h        |   1 +
 kernel/sched/idle.c            |  33 +++++-
 kernel/sched/sched.h           |   6 +
 kernel/sched/topology.c        |   9 ++
 6 files changed, 214 insertions(+), 41 deletions(-)

-- 
2.18.2



* [PATCH v2 1/8] sched/fair: Update affine statistics when needed
  2021-05-06 16:45 [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
@ 2021-05-06 16:45 ` Srikar Dronamraju
  2021-05-07 16:08   ` Valentin Schneider
  2021-05-06 16:45 ` [PATCH v2 2/8] sched/fair: Maintain the identity of idle-core Srikar Dronamraju
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-06 16:45 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

wake_affine_idle() can return prev_cpu. Even in that scenario, the
scheduler was going ahead and updating the wake-affine schedstats,
i.e., schedstats were accounted even if the task was not moved across
LLC domains.

Hence add a check before updating schedstats.
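
(If my reading of the schedstat plumbing is right, the affected
counters are the per-domain ttwu_move_affine field in /proc/schedstat
and the per-task se.statistics.nr_wakeups_affine field in
/proc/<pid>/sched when schedstats are enabled; with this fix they
should count only genuine cross-LLC pulls.)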

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..a258a84cfdfd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5884,8 +5884,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (target == nr_cpumask_bits)
 		return prev_cpu;
 
-	schedstat_inc(sd->ttwu_move_affine);
-	schedstat_inc(p->se.statistics.nr_wakeups_affine);
+	if (!cpus_share_cache(prev_cpu, target)) {
+		schedstat_inc(sd->ttwu_move_affine);
+		schedstat_inc(p->se.statistics.nr_wakeups_affine);
+	}
 	return target;
 }
 
-- 
2.18.2



* [PATCH v2 2/8] sched/fair: Maintain the identity of idle-core
  2021-05-06 16:45 [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
  2021-05-06 16:45 ` [PATCH v2 1/8] sched/fair: Update affine statistics when needed Srikar Dronamraju
@ 2021-05-06 16:45 ` Srikar Dronamraju
  2021-05-11 11:51   ` Valentin Schneider
  2021-05-06 16:45 ` [PATCH v2 3/8] sched/fair: Update idle-core more often Srikar Dronamraju
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-06 16:45 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

The scheduler maintains per-LLC info which tells if there is any idle
core in that LLC. However this information doesn't say which core is
idle.

So when iterating for idle-cores, if select_idle_core() finds an
idle-core, it doesn't try to reset this information.

Hence if there was only one idle core in the LLC and select_idle_core()
selected that idle-core, the LLC will still claim that it has an
idle-core.

Conversely, if a task is pinned and has a restricted cpus_allowed list,
and the LLC has multiple idle-cores but select_idle_core() cannot find
an idle-core the task can run on, the LLC will no longer claim that it
has an idle-core.

As a first step to solve this problem, make the LLC maintain the
identity of its idle core instead of just the information that it has
an idle core.

This change by itself solves both the problems listed above. However
there are other problems in the current infrastructure that will
continue to exist with this change; those will be handled in
subsequent patches.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/sched/topology.h |  2 +-
 kernel/sched/fair.c            | 43 +++++++++++++++++++---------------
 kernel/sched/sched.h           |  3 +++
 kernel/sched/topology.c        |  7 ++++++
 4 files changed, 35 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..285165a35f21 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -73,7 +73,7 @@ struct sched_group;
 struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
-	int		has_idle_cores;
+	int		idle_core;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a258a84cfdfd..8c9d1a210820 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1563,11 +1563,11 @@ numa_type numa_classify(unsigned int imbalance_pct,
 
 #ifdef CONFIG_SCHED_SMT
 /* Forward declarations of select_idle_sibling helpers */
-static inline bool test_idle_cores(int cpu, bool def);
+static inline int get_idle_core(int cpu, int def);
 static inline int numa_idle_core(int idle_core, int cpu)
 {
 	if (!static_branch_likely(&sched_smt_present) ||
-	    idle_core >= 0 || !test_idle_cores(cpu, false))
+	    idle_core >= 0 || get_idle_core(cpu, -1) < 0)
 		return idle_core;
 
 	/*
@@ -6015,29 +6015,31 @@ static inline int __select_idle_cpu(int cpu)
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
-static inline void set_idle_cores(int cpu, int val)
+static inline void set_idle_core(int cpu, int val)
 {
 	struct sched_domain_shared *sds;
 
 	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
 	if (sds)
-		WRITE_ONCE(sds->has_idle_cores, val);
+		WRITE_ONCE(sds->idle_core, val);
 }
 
-static inline bool test_idle_cores(int cpu, bool def)
+static inline int get_idle_core(int cpu, int def)
 {
 	struct sched_domain_shared *sds;
 
-	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-	if (sds)
-		return READ_ONCE(sds->has_idle_cores);
+	if (static_branch_likely(&sched_smt_present)) {
+		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+		if (sds)
+			return READ_ONCE(sds->idle_core);
+	}
 
 	return def;
 }
 
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
- * information in sd_llc_shared->has_idle_cores.
+ * information in sd_llc_shared->idle_core.
  *
  * Since SMT siblings share all cache levels, inspecting this limited remote
  * state should be fairly cheap.
@@ -6048,7 +6050,7 @@ void __update_idle_core(struct rq *rq)
 	int cpu;
 
 	rcu_read_lock();
-	if (test_idle_cores(core, true))
+	if (get_idle_core(core, 0) >= 0)
 		goto unlock;
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -6059,7 +6061,7 @@ void __update_idle_core(struct rq *rq)
 			goto unlock;
 	}
 
-	set_idle_cores(core, 1);
+	set_idle_core(core, per_cpu(smt_id, core));
 unlock:
 	rcu_read_unlock();
 }
@@ -6067,7 +6069,7 @@ void __update_idle_core(struct rq *rq)
 /*
  * Scan the entire LLC domain for idle cores; this dynamically switches off if
  * there are no idle cores left in the system; tracked through
- * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
+ * sd_llc->shared->idle_core and enabled through update_idle_core() above.
  */
 static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
@@ -6102,11 +6104,11 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 
 #else /* CONFIG_SCHED_SMT */
 
-static inline void set_idle_cores(int cpu, int val)
+static inline void set_idle_core(int cpu, int val)
 {
 }
 
-static inline bool test_idle_cores(int cpu, bool def)
+static inline bool get_idle_core(int cpu, int def)
 {
 	return def;
 }
@@ -6127,7 +6129,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
-	bool smt = test_idle_cores(target, false);
+	int idle_core = get_idle_core(target, -1);
+	bool smt = (idle_core != -1);
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
 	u64 time;
@@ -6160,8 +6163,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (smt) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
-			if ((unsigned int)i < nr_cpumask_bits)
+			if ((unsigned int)i < nr_cpumask_bits) {
+#ifdef CONFIG_SCHED_SMT
+				if ((per_cpu(smt_id, i)) == idle_core)
+					set_idle_core(i, -1);
+#endif
 				return i;
+			}
 
 		} else {
 			if (!--nr)
@@ -6172,9 +6180,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		}
 	}
 
-	if (smt)
-		set_idle_cores(this, false);
-
 	if (sched_feat(SIS_PROP) && !smt) {
 		time = cpu_clock(this) - time;
 		update_avg(&this_sd->avg_scan_cost, time);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10a1522b1e30..46d40a281724 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1478,6 +1478,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+#ifdef CONFIG_SCHED_SMT
+DECLARE_PER_CPU(int, smt_id);
+#endif
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 09d35044bd88..8db40c8a6ad0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -644,6 +644,9 @@ static void destroy_sched_domains(struct sched_domain *sd)
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+#ifdef CONFIG_SCHED_SMT
+DEFINE_PER_CPU(int, smt_id);
+#endif
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -667,6 +670,9 @@ static void update_top_cache_domain(int cpu)
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
+#ifdef CONFIG_SCHED_SMT
+	per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
+#endif
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_NUMA);
@@ -1466,6 +1472,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+		sd->shared->idle_core = -1;
 	}
 
 	sd->private = sdd;
-- 
2.18.2



* [PATCH v2 3/8] sched/fair: Update idle-core more often
  2021-05-06 16:45 [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
  2021-05-06 16:45 ` [PATCH v2 1/8] sched/fair: Update affine statistics when needed Srikar Dronamraju
  2021-05-06 16:45 ` [PATCH v2 2/8] sched/fair: Maintain the identity of idle-core Srikar Dronamraju
@ 2021-05-06 16:45 ` Srikar Dronamraju
  2021-05-06 16:45 ` [PATCH v2 4/8] sched/fair: Prefer idle CPU to cache affinity Srikar Dronamraju
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-06 16:45 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

Currently, when the scheduler does a load balance and pulls a task, or
when a CPU picks up a task during wakeup without having to call
select_idle_cpu(), it never checks if the target CPU is part of the
idle-core. This makes the idle-core hint less accurate.

Given that the identity of the idle-core is maintained per LLC, it's
easy to update the idle-core as soon as a CPU picks up a task.

This change updates the idle-core whenever a CPU from the idle-core
picks up a task. If there are multiple idle-cores in the LLC and the
waking CPU happens to be part of the designated idle-core, the
idle-core is set to -1 (other idle-cores may still exist). In cases
where the scheduler is sure that there are no more idle-cores, the
idle-core is set to -2.

To reduce the -1 case, whenever a CPU updates the idle-core and the
core it belongs to is not idle, it will look through the other cores
in the LLC for an idle-core.
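
To make the encoding concrete, here is a minimal standalone sketch of
the idle-core transitions described above (not kernel code; the
function names and core id are illustrative, while the -2/-1/smt_id
encoding follows the comment added in this patch).

#include <stdio.h>

/* idle_core encoding:
 *   -2   : surely no idle cores left in this LLC
 *   -1   : the cached idle-core just went busy; others may exist
 *   >= 0 : smt_id of a core known to be fully idle
 */
static int idle_core = -1;

static void core_goes_idle(int smt_id)
{
	idle_core = smt_id;		/* __update_idle_core() */
}

static void cpu_picks_up_task(int smt_id)
{
	if (idle_core == smt_id)	/* set_core_busy() */
		idle_core = -1;		/* other idle cores may remain */
}

static void scan_finds_no_idle_core(void)
{
	idle_core = -2;			/* set_next_idle_core() fallback */
}

int main(void)
{
	core_goes_idle(8);
	printf("idle_core = %d\n", idle_core);	/* 8 */
	cpu_picks_up_task(8);
	printf("idle_core = %d\n", idle_core);	/* -1 */
	scan_finds_no_idle_core();
	printf("idle_core = %d\n", idle_core);	/* -2 */
	return 0;
}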

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c  | 54 +++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/idle.c  |  6 +++++
 kernel/sched/sched.h |  2 ++
 3 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c9d1a210820..50da2363317d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6015,6 +6015,13 @@ static inline int __select_idle_cpu(int cpu)
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
+/*
+ * Value of -2 indicates there are no idle-cores in LLC.
+ * Value of -1 indicates an idle-core turned to busy recently.
+ * However there could be other idle-cores in the system.
+ * Anyother value indicates core to which the CPU(value)
+ * belongs is idle.
+ */
 static inline void set_idle_core(int cpu, int val)
 {
 	struct sched_domain_shared *sds;
@@ -6037,6 +6044,40 @@ static inline int get_idle_core(int cpu, int def)
 	return def;
 }
 
+static void set_next_idle_core(struct sched_domain *sd, int target)
+{
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	int core, cpu;
+
+	cpumask_andnot(cpus, sched_domain_span(sd), cpu_smt_mask(target));
+	for_each_cpu_wrap(core, cpus, target) {
+		bool idle = true;
+
+		for_each_cpu(cpu, cpu_smt_mask(core)) {
+			if (!available_idle_cpu(cpu)) {
+				idle = false;
+				break;
+			}
+		}
+
+		if (idle) {
+			set_idle_core(core, per_cpu(smt_id, core));
+			return;
+		}
+
+		cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+	}
+	set_idle_core(target, -2);
+}
+
+void set_core_busy(int core)
+{
+	rcu_read_lock();
+	if (get_idle_core(core, -1) == per_cpu(smt_id, core))
+		set_idle_core(core, -1);
+	rcu_read_unlock();
+}
+
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
  * information in sd_llc_shared->idle_core.
@@ -6046,11 +6087,13 @@ static inline int get_idle_core(int cpu, int def)
  */
 void __update_idle_core(struct rq *rq)
 {
+	struct sched_domain *sd;
 	int core = cpu_of(rq);
 	int cpu;
 
 	rcu_read_lock();
-	if (get_idle_core(core, 0) >= 0)
+	sd = rcu_dereference(per_cpu(sd_llc, core));
+	if (!sd || get_idle_core(core, 0) >= 0)
 		goto unlock;
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -6058,10 +6101,15 @@ void __update_idle_core(struct rq *rq)
 			continue;
 
 		if (!available_idle_cpu(cpu))
-			goto unlock;
+			goto try_next;
 	}
 
 	set_idle_core(core, per_cpu(smt_id, core));
+	goto unlock;
+
+try_next:
+	set_next_idle_core(sd, core);
+
 unlock:
 	rcu_read_unlock();
 }
@@ -6130,7 +6178,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
 	int idle_core = get_idle_core(target, -1);
-	bool smt = (idle_core != -1);
+	bool smt = (idle_core != -2);
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
 	u64 time;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 7199e6f23789..cc828f3efe71 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -425,6 +425,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
+#ifdef CONFIG_SCHED_SMT
+	int cpu = rq->cpu;
+
+	if (static_branch_likely(&sched_smt_present))
+		set_core_busy(cpu);
+#endif
 }
 
 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 46d40a281724..5c0bd4b0e73a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1102,6 +1102,7 @@ static inline bool is_migration_disabled(struct task_struct *p)
 
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
+extern void set_core_busy(int cpu);
 
 static inline void update_idle_core(struct rq *rq)
 {
@@ -1111,6 +1112,7 @@ static inline void update_idle_core(struct rq *rq)
 
 #else
 static inline void update_idle_core(struct rq *rq) { }
+static inline void set_core_busy(int cpu) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
-- 
2.18.2



* [PATCH v2 4/8] sched/fair: Prefer idle CPU to cache affinity
  2021-05-06 16:45 [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (2 preceding siblings ...)
  2021-05-06 16:45 ` [PATCH v2 3/8] sched/fair: Update idle-core more often Srikar Dronamraju
@ 2021-05-06 16:45 ` Srikar Dronamraju
  2021-05-06 16:45 ` [PATCH v2 5/8] sched/fair: Use affine_idler_llc for wakeups across LLC Srikar Dronamraju
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-06 16:45 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Michael Ellerman, Michael Neuling,
	Gautham R Shenoy, Parth Shah

The current order of preference to pick an LLC while waking a
wake-affine task is:
1. Between the waker CPU and the previous CPU, prefer the LLC of the
   CPU that is idle.

2. Between the waker CPU and the previous CPU, prefer the LLC of the
   CPU that is less loaded.

Today, when both the waker and previous CPUs are busy but only one of
their LLCs has an idle CPU, the scheduler may end up picking an LLC
with no idle CPUs. To mitigate this, add a method where the scheduler
compares the idle CPUs in the waker and previous LLCs and picks the
appropriate one.

The new method looks at the idle-core to find an idle LLC. If there
are no idle LLCs, it compares the ratio of busy CPUs to the total
number of CPUs in each LLC. This method is only useful for comparing
2 LLCs; if the previous CPU and the waking CPU are in the same LLC,
it does not apply. For now the new method is disabled by default.

The sync flag decides which CPU/LLC to try first: if sync is set,
choose the current LLC, else choose the previous LLC.
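
As a worked example with made-up numbers: if the preferred LLC has 6
of its 8 CPUs busy and the tried LLC has 3 of its 8 CPUs busy, then

	diff = tnr_busy * pllc_size - pnr_busy * tllc_size
	     = 3 * 8 - 6 * 8 = -24 < 0

so the wakeup is directed at try_cpu. The cross-multiplication
compares the ratios tnr_busy/tllc_size and pnr_busy/pllc_size without
integer division; with equal ratios (say 2 of 4 vs 4 of 8) diff is 0
and the function falls through to return nr_cpumask_bits.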

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog: v1->v2:
 - Swap the cpus, if the wakeup is not sync, so that a single order of
 code suffices for both sync and non-sync wakeups.

 - Mel reported a crash. Apparently two threads can race to find an
 idle-core. I now cache the idlecore. Also use compare-exchange, so
 that no 2 waking tasks contend on the same CPU.

Also based on a similar posting:
http://lore.kernel.org/lkml/20210226164029.122432-1-srikar@linux.vnet.ibm.com/t/#u
- Make WA_WAKER default (Suggested by Rik) : done in next patch
- Make WA_WAKER check more conservative: (Suggested by Rik / Peter)
- Rename WA_WAKER to WA_IDLER_LLC (Suggested by Vincent)
- s/pllc_size/tllc_size while checking for busy case: (Pointed by Dietmar)
- Add rcu_read_lock and check for validity of shared domains
- Add idle-core support

 kernel/sched/fair.c     | 66 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 2 files changed, 67 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 50da2363317d..72bf1996903d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5869,6 +5869,59 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }
 
+static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val);
+
+static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cpu, int sync)
+{
+#ifdef CONFIG_NO_HZ_COMMON
+	int tnr_busy, tllc_size, pnr_busy, pllc_size;
+#endif
+	struct sched_domain_shared *pref_sds, *try_sds;
+	int diff, idle_core;
+
+	if (!sync)
+		swap(pref_cpu, try_cpu);
+
+	pref_sds = rcu_dereference(per_cpu(sd_llc_shared, pref_cpu));
+	try_sds = rcu_dereference(per_cpu(sd_llc_shared, try_cpu));
+	if (!pref_sds || !try_sds)
+		return nr_cpumask_bits;
+
+	if (available_idle_cpu(pref_cpu) || sched_idle_cpu(pref_cpu))
+		return pref_cpu;
+
+	idle_core = READ_ONCE(pref_sds->idle_core);
+	if (idle_core > -1 && cpumask_test_cpu(idle_core, p->cpus_ptr) &&
+				test_reset_idle_core(pref_sds, idle_core))
+		return idle_core;
+
+	if (available_idle_cpu(try_cpu) || sched_idle_cpu(try_cpu))
+		return try_cpu;
+
+	idle_core = READ_ONCE(try_sds->idle_core);
+	if (idle_core > -1 && cpumask_test_cpu(idle_core, p->cpus_ptr) &&
+				test_reset_idle_core(try_sds, idle_core))
+		return idle_core;
+
+#ifdef CONFIG_NO_HZ_COMMON
+	pnr_busy = atomic_read(&pref_sds->nr_busy_cpus);
+	tnr_busy = atomic_read(&try_sds->nr_busy_cpus);
+	pllc_size = per_cpu(sd_llc_size, pref_cpu);
+	tllc_size = per_cpu(sd_llc_size, try_cpu);
+
+	if (tnr_busy == tllc_size && pnr_busy == pllc_size)
+		return nr_cpumask_bits;
+
+	diff = tnr_busy * pllc_size - pnr_busy * tllc_size;
+	if (diff > 0)
+		return pref_cpu;
+	if (diff < 0)
+		return try_cpu;
+#endif /* CONFIG_NO_HZ_COMMON */
+
+	return nr_cpumask_bits;
+}
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int this_cpu, int prev_cpu, int sync)
 {
@@ -5877,6 +5930,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (sched_feat(WA_IDLE))
 		target = wake_affine_idle(this_cpu, prev_cpu, sync);
 
+	if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits)
+		target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);
+
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
@@ -6044,6 +6100,11 @@ static inline int get_idle_core(int cpu, int def)
 	return def;
 }
 
+static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val)
+{
+	return cmpxchg(&sds->idle_core, val, -1) == val;
+}
+
 static void set_next_idle_core(struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
@@ -6161,6 +6222,11 @@ static inline bool get_idle_core(int cpu, int def)
 	return def;
 }
 
+static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val)
+{
+	return false;
+}
+
 static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
 	return __select_idle_cpu(core);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1bc2b158fc51..c77349a47e01 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,6 +83,7 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
 SCHED_FEAT(WA_IDLE, true)
 SCHED_FEAT(WA_WEIGHT, true)
+SCHED_FEAT(WA_IDLER_LLC, false)
 SCHED_FEAT(WA_BIAS, true)
 
 /*
-- 
2.18.2



* [PATCH v2 5/8] sched/fair: Use affine_idler_llc for wakeups across LLC
  2021-05-06 16:45 [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (3 preceding siblings ...)
  2021-05-06 16:45 ` [PATCH v2 4/8] sched/fair: Prefer idle CPU to cache affinity Srikar Dronamraju
@ 2021-05-06 16:45 ` Srikar Dronamraju
  2021-05-06 16:45 ` [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback Srikar Dronamraju
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-06 16:45 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Michael Ellerman, Gautham R Shenoy, Parth Shah

Currently, if a waker is the only running thread of its CPU and wakes
a wakee (both of them interacting over a pipe, aka sync wakeups) that
previously ran on a different LLC, the wakee is pulled and queued on
the CPU that is running the waker.

This is true even if the previous CPU was completely idle. The
rationale is that the waker will soon relinquish the CPU and the wakee
will benefit from the cache access. However, if the previous CPU is
idle, the wakee thread will incur the latency cost of migrating to the
waker CPU plus the latency cost of the waker context switching.

Before the waker switches out, the load balancer could also kick in
and pull the wakee out, resulting in another migration cost. This
mostly hurts systems where the LLC has just one core, for example
hackbench: both threads of hackbench would then end up running on the
same core, resulting in CPUs running in higher SMT modes, more load
balances and suboptimal performance.

It would be beneficial to run the wakee thread on the same CPU as the
waker only if other CPUs are busy. To achieve this, move the code that
checks if the waker is the only running thread to an early part of
wake_affine_weight().

Also, post this change, wake_affine_idle() will only look at wakeups
within the LLC. For wakeups within an LLC, there should be no
difference between sync and no-sync. For wakeups across LLC
boundaries, let's use wake_affine_idler_llc().

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1->v2:
- Update the patch subject line.

 kernel/sched/fair.c     | 22 +++++++++++-----------
 kernel/sched/features.h |  2 +-
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 72bf1996903d..c30587631a24 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5799,8 +5799,7 @@ static int wake_wide(struct task_struct *p)
  *			  scheduling latency of the CPUs. This seems to work
  *			  for the overloaded case.
  */
-static int
-wake_affine_idle(int this_cpu, int prev_cpu, int sync)
+static int wake_affine_idle(int this_cpu, int prev_cpu)
 {
 	/*
 	 * If this_cpu is idle, it implies the wakeup is from interrupt
@@ -5814,15 +5813,12 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	 * a cpufreq perspective, it's better to have higher utilisation
 	 * on one CPU.
 	 */
-	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
+	if (available_idle_cpu(prev_cpu) || sched_idle_cpu(prev_cpu))
+		return prev_cpu;
 
-	if (sync && cpu_rq(this_cpu)->nr_running == 1)
+	if (available_idle_cpu(this_cpu) || sched_idle_cpu(this_cpu))
 		return this_cpu;
 
-	if (available_idle_cpu(prev_cpu))
-		return prev_cpu;
-
 	return nr_cpumask_bits;
 }
 
@@ -5838,6 +5834,9 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	if (sync) {
 		unsigned long current_load = task_h_load(current);
 
+		if (cpu_rq(this_cpu)->nr_running <= 1)
+			return this_cpu;
+
 		if (current_load > this_eff_load)
 			return this_cpu;
 
@@ -5925,12 +5924,13 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int this_cpu, int prev_cpu, int sync)
 {
+	bool share_caches = cpus_share_cache(prev_cpu, this_cpu);
 	int target = nr_cpumask_bits;
 
-	if (sched_feat(WA_IDLE))
-		target = wake_affine_idle(this_cpu, prev_cpu, sync);
+	if (sched_feat(WA_IDLE) && share_caches)
+		target = wake_affine_idle(this_cpu, prev_cpu);
 
-	if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits)
+	else if (sched_feat(WA_IDLER_LLC) && !share_caches)
 		target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);
 
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index c77349a47e01..c60a6d2b2126 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,7 +83,7 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
 SCHED_FEAT(WA_IDLE, true)
 SCHED_FEAT(WA_WEIGHT, true)
-SCHED_FEAT(WA_IDLER_LLC, false)
+SCHED_FEAT(WA_IDLER_LLC, true)
 SCHED_FEAT(WA_BIAS, true)
 
 /*
-- 
2.18.2



* [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-06 16:45 [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (4 preceding siblings ...)
  2021-05-06 16:45 ` [PATCH v2 5/8] sched/fair: Use affine_idler_llc for wakeups across LLC Srikar Dronamraju
@ 2021-05-06 16:45 ` Srikar Dronamraju
  2021-05-11 11:51   ` Valentin Schneider
  2021-05-12  8:08   ` Aubrey Li
  2021-05-06 16:45 ` [PATCH v2 7/8] sched/fair: Remove ifdefs in waker_affine_idler_llc Srikar Dronamraju
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-06 16:45 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

Currently we account nr_busy_cpus in the nohz idle functions. There is
no reason why nr_busy_cpus should be updated in NO_HZ_COMMON configs
only. Also, the scheduler can mark a CPU as non-busy as soon as an
idle-class task starts to run, and can then mark the CPU as busy as
soon as it is woken up from idle or a new task is placed on its
runqueue.
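
The accounting relies on atomic_add_unless() so that concurrent idle
entry/exit cannot push nr_busy_cpus below 0 or above the LLC size. A
minimal userspace model of that clamped counter (stdatomic sketch, not
kernel code; llc_size and the flow below are illustrative):

#include <stdatomic.h>
#include <stdio.h>

/* Userspace model of the kernel's atomic_add_unless(v, a, u):
 * add 'a' to *v unless *v == u; returns 1 if the add happened. */
static int add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return 1;
	}
	return 0;
}

int main(void)
{
	int llc_size = 8, is_idle = 0;
	atomic_int nr_busy_cpus;

	atomic_init(&nr_busy_cpus, llc_size);	/* all CPUs busy */

	/* pick_next_task_idle(): this CPU goes idle */
	add_unless(&nr_busy_cpus, -1, 0);
	is_idle = 1;

	/* put_prev_task_idle(): real work again; the per-CPU is_idle
	 * flag prevents double accounting */
	if (is_idle) {
		add_unless(&nr_busy_cpus, 1, llc_size);
		is_idle = 0;
	}

	printf("nr_busy_cpus = %d\n", atomic_load(&nr_busy_cpus));
	return 0;
}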

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c     |  6 ++++--
 kernel/sched/idle.c     | 29 +++++++++++++++++++++++++++--
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c |  2 ++
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c30587631a24..4d3b0928fe98 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10394,7 +10394,10 @@ static void set_cpu_sd_state_busy(int cpu)
 		goto unlock;
 	sd->nohz_idle = 0;
 
-	atomic_inc(&sd->shared->nr_busy_cpus);
+	if (sd && per_cpu(is_idle, cpu)) {
+		atomic_add_unless(&sd->shared->nr_busy_cpus, 1, per_cpu(sd_llc_size, cpu));
+		per_cpu(is_idle, cpu) = 0;
+	}
 unlock:
 	rcu_read_unlock();
 }
@@ -10424,7 +10427,6 @@ static void set_cpu_sd_state_idle(int cpu)
 		goto unlock;
 	sd->nohz_idle = 1;
 
-	atomic_dec(&sd->shared->nr_busy_cpus);
 unlock:
 	rcu_read_unlock();
 }
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index cc828f3efe71..e624a05e48bd 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -425,12 +425,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
-#ifdef CONFIG_SCHED_SMT
+#ifdef CONFIG_SMP
+	struct sched_domain_shared *sds;
 	int cpu = rq->cpu;
 
+#ifdef CONFIG_SCHED_SMT
 	if (static_branch_likely(&sched_smt_present))
 		set_core_busy(cpu);
 #endif
+
+	rcu_read_lock();
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds) {
+		if (per_cpu(is_idle, cpu)) {
+			atomic_inc(&sds->nr_busy_cpus);
+			per_cpu(is_idle, cpu) = 0;
+		}
+	}
+	rcu_read_unlock();
+#endif
 }
 
 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
@@ -442,9 +455,21 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
 struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	struct task_struct *next = rq->idle;
+#ifdef CONFIG_SMP
+	struct sched_domain_shared *sds;
+	int cpu = rq->cpu;
 
-	set_next_task_idle(rq, next, true);
+	rcu_read_lock();
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds) {
+		atomic_add_unless(&sds->nr_busy_cpus, -1, 0);
+		per_cpu(is_idle, cpu) = 1;
+	}
 
+	rcu_read_unlock();
+#endif
+
+	set_next_task_idle(rq, next, true);
 	return next;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5c0bd4b0e73a..baf8d9a4cb26 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1483,6 +1483,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 #ifdef CONFIG_SCHED_SMT
 DECLARE_PER_CPU(int, smt_id);
 #endif
+DECLARE_PER_CPU(int, is_idle);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8db40c8a6ad0..00e4669bb241 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
 #ifdef CONFIG_SCHED_SMT
 DEFINE_PER_CPU(int, smt_id);
 #endif
+DEFINE_PER_CPU(int, is_idle);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
 #ifdef CONFIG_SCHED_SMT
 	per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
 #endif
+	per_cpu(is_idle, cpu) = 1;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_NUMA);
-- 
2.18.2



* [PATCH v2 7/8] sched/fair: Remove ifdefs in waker_affine_idler_llc
  2021-05-06 16:45 [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (5 preceding siblings ...)
  2021-05-06 16:45 ` [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback Srikar Dronamraju
@ 2021-05-06 16:45 ` Srikar Dronamraju
  2021-05-06 16:45 ` [PATCH v2 8/8] sched/fair: Dont iterate if no idle CPUs Srikar Dronamraju
  2021-05-06 16:53 ` [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
  8 siblings, 0 replies; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-06 16:45 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

Now that the idle callbacks update nr_busy_cpus, remove the ifdefs in
wake_affine_idler_llc().

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4d3b0928fe98..c70f0889258f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5872,9 +5872,7 @@ static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val
 
 static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cpu, int sync)
 {
-#ifdef CONFIG_NO_HZ_COMMON
 	int tnr_busy, tllc_size, pnr_busy, pllc_size;
-#endif
 	struct sched_domain_shared *pref_sds, *try_sds;
 	int diff, idle_core;
 
@@ -5902,7 +5900,6 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
 				test_reset_idle_core(try_sds, idle_core))
 		return idle_core;
 
-#ifdef CONFIG_NO_HZ_COMMON
 	pnr_busy = atomic_read(&pref_sds->nr_busy_cpus);
 	tnr_busy = atomic_read(&try_sds->nr_busy_cpus);
 	pllc_size = per_cpu(sd_llc_size, pref_cpu);
@@ -5916,7 +5913,6 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
 		return pref_cpu;
 	if (diff < 0)
 		return try_cpu;
-#endif /* CONFIG_NO_HZ_COMMON */
 
 	return nr_cpumask_bits;
 }
-- 
2.18.2



* [PATCH v2 8/8] sched/fair: Dont iterate if no idle CPUs
  2021-05-06 16:45 [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (6 preceding siblings ...)
  2021-05-06 16:45 ` [PATCH v2 7/8] sched/fair: Remove ifdefs in waker_affine_idler_llc Srikar Dronamraju
@ 2021-05-06 16:45 ` Srikar Dronamraju
  2021-05-06 16:53 ` [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
  8 siblings, 0 replies; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-06 16:45 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

Now that nr_busy_cpus for an LLC is updated in the idle callbacks, the
scheduler can detect if all threads of an LLC are busy. In such cases,
it can avoid searching the LLC for idle CPUs that can run the wakee
thread.
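
For example, if both candidate LLCs report nr_busy_cpus equal to their
sd_llc_size, wake_affine_idler_llc() clears the new 'idle' flag, and
select_idle_sibling() then returns the target CPU directly instead of
scanning the LLC for idle CPUs.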

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c70f0889258f..83104d3bd0f9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -715,7 +715,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
-static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu, bool idle);
 static unsigned long task_h_load(struct task_struct *p);
 static unsigned long capacity_of(int cpu);
 
@@ -5870,7 +5870,8 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 
 static inline bool test_reset_idle_core(struct sched_domain_shared *sds, int val);
 
-static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cpu, int sync)
+static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cpu,
+				int sync, bool *idle)
 {
 	int tnr_busy, tllc_size, pnr_busy, pllc_size;
 	struct sched_domain_shared *pref_sds, *try_sds;
@@ -5905,8 +5906,10 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
 	pllc_size = per_cpu(sd_llc_size, pref_cpu);
 	tllc_size = per_cpu(sd_llc_size, try_cpu);
 
-	if (tnr_busy == tllc_size && pnr_busy == pllc_size)
+	if (tnr_busy == tllc_size && pnr_busy == pllc_size) {
+		*idle = false;
 		return nr_cpumask_bits;
+	}
 
 	diff = tnr_busy * pllc_size - pnr_busy * tllc_size;
 	if (diff > 0)
@@ -5918,7 +5921,7 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
-		       int this_cpu, int prev_cpu, int sync)
+		       int this_cpu, int prev_cpu, int sync, bool *idle)
 {
 	bool share_caches = cpus_share_cache(prev_cpu, this_cpu);
 	int target = nr_cpumask_bits;
@@ -5927,7 +5930,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		target = wake_affine_idle(this_cpu, prev_cpu);
 
 	else if (sched_feat(WA_IDLER_LLC) && !share_caches)
-		target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync);
+		target = wake_affine_idler_llc(p, this_cpu, prev_cpu, sync, idle);
 
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
@@ -6343,7 +6346,7 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
-static int select_idle_sibling(struct task_struct *p, int prev, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int target, bool idle)
 {
 	struct sched_domain *sd;
 	unsigned long task_util;
@@ -6420,6 +6423,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		}
 	}
 
+	if (!idle)
+		return target;
+
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	if (!sd)
 		return target;
@@ -6828,6 +6834,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	int want_affine = 0;
 	/* SD_flags and WF_flags share the first nibble */
 	int sd_flag = wake_flags & 0xF;
+	bool idle = true;
 
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
@@ -6851,7 +6858,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
 		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
 			if (cpu != prev_cpu)
-				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
+				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync, &idle);
 
 			sd = NULL; /* Prefer wake_affine over balance flags */
 			break;
@@ -6868,7 +6875,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
-		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu, idle);
 
 		if (want_affine)
 			current->recent_used_cpu = cpu;
-- 
2.18.2



* Re: [PATCH v2 0/8] sched/fair: wake_affine improvements
  2021-05-06 16:45 [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
                   ` (7 preceding siblings ...)
  2021-05-06 16:45 ` [PATCH v2 8/8] sched/fair: Dont iterate if no idle CPUs Srikar Dronamraju
@ 2021-05-06 16:53 ` Srikar Dronamraju
  8 siblings, 0 replies; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-06 16:53 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Gautham R Shenoy, Parth Shah

* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2021-05-06 22:15:35]:

Mel had asked for additional profile data that could be collected
for the idle-core updates.

# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          48
On-line CPU(s) list:             0-47
Thread(s) per core:              2
Core(s) per socket:              12
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           63
Model name:                      Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Stepping:                        2
CPU MHz:                         1200.000
CPU max MHz:                     3500.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        5200.05
Virtualization:                  VT-x
L1d cache:                       768 KiB
L1i cache:                       768 KiB
L2 cache:                        6 MiB
L3 cache:                        60 MiB
NUMA node0 CPU(s):               0-11,24-35
NUMA node1 CPU(s):               12-23,36-47

# cat test.sh
#! /bin/bash
tbench_srv &

# 20 iterations of tbench on a 48 CPU, 2 node system
for i in $(seq 1 20); do
	#enable schedstat just before tbench
	echo 1 | sudo tee /proc/sys/kernel/sched_schedstats
	tbench -t 60 48 127.0.0.1
	#disable schedstat just before tbench
	echo 0 | sudo tee /proc/sys/kernel/sched_schedstats
	IDLE_COUNT=$(awk '/nr_idlecore_write/{a+=$NF}END{print a}' /proc/sched_debug)
	NR_IDLE_COUNT=$(($IDLE_COUNT-$NR_IDLE_COUNT))
	SELECT_COUNT=$(awk '/nr_idlecore_select/{a+=$NF}END{print a}' /proc/sched_debug)
	NR_SELECT_COUNT=$(($SELECT_COUNT-$NR_SELECT_COUNT))
	# select means we selected an idle core when the preferred CPU is
	# busy. write means, unconditional set of idlecore.
	# seqno total_updates=select+write select write
	echo $i $(($NR_IDLE_COUNT+$NR_SELECT_COUNT)) $NR_SELECT_COUNT $NR_IDLE_COUNT
done
#

a5e13c6df0e4 aka v5.12-rc5
seqno  nr_update_idle_core  nr_select_idle_core  nr_write_idle_core
1      11515                0                    11515
2      13439                0                    13439
3      42339                0                    42339
4      13642                0                    13642
5      44770                0                    44770
6      32402                0                    32402
7      65638                0                    65638
8      106601               0                    106601
9      99819                0                    99819
10     106754               0                    106754
11     107899               0                    107899
12     112432               0                    112432
13     125329               0                    125329
14     127363               0                    127363
15     133821               0                    133821
16     127495               0                    127495
17     133957               0                    133957
18     185021               0                    185021
19     137139               0                    137139
20     221413               0                    221413

 N           Min           Max        Median           Avg        Stddev
20         11515        221413        107899       97439.4     57524.696

Average of 1353 updates per second.


635bb392f382 aka v5.12-rc5 + patches
seqno  nr_update_idle_core  nr_select_idle_core  nr_write_idle_core
1      2112856              218                  2112638
2      1727892              84                   1727808
3      3662807              280                  3662527
4      3623563              220                  3623343
5      4972825              308                  4972517
6      3625828              258                  3625570
7      6703820              407                  6703413
8      5565289              390                  5564899
9      8376039              528                  8375511
10     6643273              405                  6642868
11     10041803             605                  10041198
12     8322148              537                  8321611
13     11941494             729                  11940765
14     10125633             704                  10124929
15     12810965             797                  12810168
16     12269912             857                  12269055
17     14798232             912                  14797320
18     14378202             980                  14377222
19     15705273             935                  15704338
20     16407305             1122                 16406183

 N           Min           Max        Median           Avg        Stddev
20       1727892      16407305       8376039     8690757.9     4718172.5

Average of 120704 updates per second, which is around 89X the rate without
the patch.

-- 
Thanks and Regards
Srikar Dronamraju

---->8----------------------------------------------------8<--------------

From e361fdd4234ff718247f0ee20f4f836ccbbc1df8 Mon Sep 17 00:00:00 2001
From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Date: Thu, 6 May 2021 19:33:51 +0530
Subject: sched: Show idlecore update stats

Not for inclusion; Just for demonstration.

nr_idlecore_write: idle-core set unconditionally.
nr_idlecore_select: wakeup found an idle core when the current CPU was
busy.

nr_idlecore updates = nr_idlecore_write + nr_idlecore_select

Not-Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/debug.c |  2 ++
 kernel/sched/fair.c  | 12 +++++++++---
 kernel/sched/sched.h |  2 ++
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 486f403a778b..b50e0c5acf46 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -715,6 +715,8 @@ do {									\
 		P(ttwu_count);
 		P(ttwu_local);
 	}
+	P(nr_idlecore_write);
+	P(nr_idlecore_select);
 #undef P
 
 	spin_lock_irqsave(&sched_debug_lock, flags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 83104d3bd0f9..4ef0b7d959d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5890,16 +5890,20 @@ static int wake_affine_idler_llc(struct task_struct *p, int pref_cpu, int try_cp
 
 	idle_core = READ_ONCE(pref_sds->idle_core);
 	if (idle_core > -1 && cpumask_test_cpu(idle_core, p->cpus_ptr) &&
-				test_reset_idle_core(pref_sds, idle_core))
+				test_reset_idle_core(pref_sds, idle_core)) {
+		schedstat_inc(cpu_rq(idle_core)->nr_idlecore_select);
 		return idle_core;
+	}
 
 	if (available_idle_cpu(try_cpu) || sched_idle_cpu(try_cpu))
 		return try_cpu;
 
 	idle_core = READ_ONCE(try_sds->idle_core);
 	if (idle_core > -1 && cpumask_test_cpu(idle_core, p->cpus_ptr) &&
-				test_reset_idle_core(try_sds, idle_core))
+				test_reset_idle_core(try_sds, idle_core)) {
+		schedstat_inc(cpu_rq(idle_core)->nr_idlecore_select);
 		return idle_core;
+	}
 
 	pnr_busy = atomic_read(&pref_sds->nr_busy_cpus);
 	tnr_busy = atomic_read(&try_sds->nr_busy_cpus);
@@ -6082,8 +6086,10 @@ static inline void set_idle_core(int cpu, int val)
 	struct sched_domain_shared *sds;
 
 	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-	if (sds)
+	if (sds) {
 		WRITE_ONCE(sds->idle_core, val);
+		schedstat_inc(cpu_rq(cpu)->nr_idlecore_write);
+	}
 }
 
 static inline int get_idle_core(int cpu, int def)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index baf8d9a4cb26..06d81a9e0a48 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1060,6 +1060,8 @@ struct rq {
 #ifdef CONFIG_SMP
 	unsigned int		nr_pinned;
 #endif
+	unsigned int		nr_idlecore_write;
+	unsigned int		nr_idlecore_select;
 	unsigned int		push_busy;
 	struct cpu_stop_work	push_work;
 };
-- 
2.25.1


* Re: [PATCH v2 1/8] sched/fair: Update affine statistics when needed
  2021-05-06 16:45 ` [PATCH v2 1/8] sched/fair: Update affine statistics when needed Srikar Dronamraju
@ 2021-05-07 16:08   ` Valentin Schneider
  2021-05-07 17:05     ` Srikar Dronamraju
  0 siblings, 1 reply; 31+ messages in thread
From: Valentin Schneider @ 2021-05-07 16:08 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah

On 06/05/21 22:15, Srikar Dronamraju wrote:
> wake_affine_idle() can return prev_cpu. Even in such a scenario,
> the scheduler was going ahead and updating schedstats related to wake
> affine, i.e. even if the task is not moved across LLC domains, the
> schedstats would still have been accounted.
>
> Hence add a check before updating schedstats.
>

I briefly glanced at the git history but didn't find any proper description
of that stat. As it stands, it counts the number of times wake_affine()
purposely steered a task towards a particular CPU (waker or wakee's prev),
so nr_wakeups_affine / nr_wakeups_affine_attempts is your wake_affine()
"success rate" - how often it could make a choice with the available data.

I could see a point in only incrementing the count if wake_affine() steers
towards the waker rather than the wakee (i.e. don't increment if choice is
prev), but then that has no link with LLC spans.
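
As a sketch (illustrative only, not from the posted series), that stricter
accounting would look like:

	/* count only wakeups actually steered to the waker's CPU */
	if (target == this_cpu) {
		schedstat_inc(sd->ttwu_move_affine);
		schedstat_inc(p->se.statistics.nr_wakeups_affine);
	}
	return target;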

> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Parth Shah <parth@linux.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Rik van Riel <riel@surriel.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..a258a84cfdfd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5884,8 +5884,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>       if (target == nr_cpumask_bits)
>               return prev_cpu;
>
> -	schedstat_inc(sd->ttwu_move_affine);
> -	schedstat_inc(p->se.statistics.nr_wakeups_affine);
> +	if (!cpus_share_cache(prev_cpu, target)) {

Per the above, why? Why not just if(target == this_cpu) ?

> +		schedstat_inc(sd->ttwu_move_affine);
> +		schedstat_inc(p->se.statistics.nr_wakeups_affine);
> +	}
>       return target;
>  }
>
> --
> 2.18.2


* Re: [PATCH v2 1/8] sched/fair: Update affine statistics when needed
  2021-05-07 16:08   ` Valentin Schneider
@ 2021-05-07 17:05     ` Srikar Dronamraju
  2021-05-11 11:51       ` Valentin Schneider
  0 siblings, 1 reply; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-07 17:05 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah

* Valentin Schneider <valentin.schneider@arm.com> [2021-05-07 17:08:17]:

> On 06/05/21 22:15, Srikar Dronamraju wrote:
> > wake_affine_idle() can return prev_cpu. Even in such a scenario,
> > the scheduler was going ahead and updating schedstats related to wake
> > affine, i.e. even if the task is not moved across LLC domains, the
> > schedstats would still have been accounted.
> >
> > Hence add a check before updating schedstats.
> >

Thanks Valentin for taking a look at the patch.

> 
> I briefly glanced at the git history but didn't find any proper description
> of that stat. As it stands, it counts the number of times wake_affine()
> purposedly steered a task towards a particular CPU (waker or wakee's prev),
> so nr_wakeups_affine / nr_wakeups_affine_attempts is your wake_affine()
> "success rate" - how often could it make a choice with the available data.
> 
> I could see a point in only incrementing the count if wake_affine() steers
> towards the waker rather than the wakee (i.e. don't increment if choice is
> prev), but then that has no link with LLC spans

Let's say the prev CPU and this CPU were part of the same LLC, and the prev
CPU was busy (or busier than this CPU); should we consider this as a wake
affine? If prev was idle, we would surely have considered the prev CPU. Also
since both are part of the same LLC, we can't say this CPU is more affine
than the prev CPU. Or maybe I am confusing wake_affine with cache_affine.

> 
> > Cc: LKML <linux-kernel@vger.kernel.org>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Parth Shah <parth@linux.ibm.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Rik van Riel <riel@surriel.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  kernel/sched/fair.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 794c2cb945f8..a258a84cfdfd 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5884,8 +5884,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >       if (target == nr_cpumask_bits)
> >               return prev_cpu;
> >
> > -	schedstat_inc(sd->ttwu_move_affine);
> > -	schedstat_inc(p->se.statistics.nr_wakeups_affine);
> > +	if (!cpus_share_cache(prev_cpu, target)) {
> 
> Per the above, why? Why not just if(target == this_cpu) ?

We could use target == this_cpu. However if prev CPU and this CPU share the
same LLC, then should we consider moving to this_cpu as an affine wakeup?

I could probably have moved this patch later in the patch series, but one
of the patches, the one that introduces wake_affine_idler_llc(), may end up
returning none of this_cpu, prev_cpu or nr_cpumask_bits. In such a case,
where it returns a CPU closer to this_cpu, I would still mark it as wake_affine.

> 
> > +		schedstat_inc(sd->ttwu_move_affine);
> > +		schedstat_inc(p->se.statistics.nr_wakeups_affine);
> > +	}
> >       return target;
> >  }
> >
> > --
> > 2.18.2

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 2/8] sched/fair: Maintain the identity of idle-core
  2021-05-06 16:45 ` [PATCH v2 2/8] sched/fair: Maintain the identity of idle-core Srikar Dronamraju
@ 2021-05-11 11:51   ` Valentin Schneider
  2021-05-11 16:27     ` Srikar Dronamraju
  0 siblings, 1 reply; 31+ messages in thread
From: Valentin Schneider @ 2021-05-11 11:51 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah

On 06/05/21 22:15, Srikar Dronamraju wrote:
> The scheduler maintains per-LLC info which tells if there is any idle core
> in that LLC. However this information doesn't provide which core is idle.
>
> So when iterating for idle-cores, if select_idle_core() finds an
> idle-core, then it doesn't try to reset this information.
>
> So if there was only one idle core in the LLC and select_idle_core()
> selected the idle-core, the LLC will maintain that it still has an
> idle-core.
>

That would be rectified at the next select_idle_cpu() call, so that would
be a fight between extra instrumentation overhead vs extra work at next
wakeup.

> On the converse, if a task is pinned and has a restricted
> cpus_allowed_list, and the LLC has multiple idle-cores but select_idle_core()
> cannot find an idle-core, the LLC will no longer maintain that it has an
> idle-core.
>

This however does sound icky.

> As a first step to solve this problem, the LLC will maintain the identity of
> the idle core instead of just the information that the LLC has an idle core.
>
> Along with maintaining the identity, this change solves both of the problems
> listed above. However there are other problems that exist with the current
> infrastructure; those will continue to exist with this change and will be
> handled in subsequent patches.
>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Parth Shah <parth@linux.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Rik van Riel <riel@surriel.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> @@ -6127,7 +6129,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  {
>       struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>       int i, cpu, idle_cpu = -1, nr = INT_MAX;
> -	bool smt = test_idle_cores(target, false);
> +	int idle_core = get_idle_core(target, -1);
> +	bool smt = (idle_core != -1);

test_idle_cores() tells you if there's at least one idle core in the
target's LLC. AFAICT get_idle_core() only tells you whether the target's
core is idle, which is not the same thing.

Note that this code has recently been changed by Rik in

  c722f35b513f ("sched/fair: Bring back select_idle_smt(), but differently")

so as annoying as it is you should probably go try this out / rebase your
series on top of it (as a rule of thumb for core scheduler stuff you should
use https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git -b
tip/sched/core as a base).


* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-06 16:45 ` [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback Srikar Dronamraju
@ 2021-05-11 11:51   ` Valentin Schneider
  2021-05-11 16:55     ` Srikar Dronamraju
  2021-05-12  0:32     ` Aubrey Li
  2021-05-12  8:08   ` Aubrey Li
  1 sibling, 2 replies; 31+ messages in thread
From: Valentin Schneider @ 2021-05-11 11:51 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah, Aubrey Li

On 06/05/21 22:15, Srikar Dronamraju wrote:
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8db40c8a6ad0..00e4669bb241 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
>  #ifdef CONFIG_SCHED_SMT
>  DEFINE_PER_CPU(int, smt_id);
>  #endif
> +DEFINE_PER_CPU(int, is_idle);

This + patch 8 immediately reminds me of Aubrey's patch:

  http://lore.kernel.org/r/1615872606-56087-1-git-send-email-aubrey.li@intel.com

last I looked it seemed OK, even the test bot seems happy. Aubrey, did you
have any more work to do on that one (other than rebasing)?

>  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> @@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
>  #ifdef CONFIG_SCHED_SMT
>       per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
>  #endif
> +	per_cpu(is_idle, cpu) = 1;
>       rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>
>       sd = lowest_flag_domain(cpu, SD_NUMA);
> --
> 2.18.2


* Re: [PATCH v2 1/8] sched/fair: Update affine statistics when needed
  2021-05-07 17:05     ` Srikar Dronamraju
@ 2021-05-11 11:51       ` Valentin Schneider
  2021-05-11 16:22         ` Srikar Dronamraju
  0 siblings, 1 reply; 31+ messages in thread
From: Valentin Schneider @ 2021-05-11 11:51 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah

On 07/05/21 22:35, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@arm.com> [2021-05-07 17:08:17]:
>
>> On 06/05/21 22:15, Srikar Dronamraju wrote:
>> > wake_affine_idle() can return prev_cpu. Even in such a scenario,
>> > the scheduler was going ahead and updating schedstats related to wake
>> > affine, i.e. even if the task is not moved across LLC domains, the
>> > schedstats would still have been accounted.
>> >
>> > Hence add a check before updating schedstats.
>> >
>
> Thanks Valentin for taking a look at the patch.
>
>>
>> I briefly glanced at the git history but didn't find any proper description
>> of that stat. As it stands, it counts the number of times wake_affine()
>> purposedly steered a task towards a particular CPU (waker or wakee's prev),
>> so nr_wakeups_affine / nr_wakeups_affine_attempts is your wake_affine()
>> "success rate" - how often could it make a choice with the available data.
>>
>> I could see a point in only incrementing the count if wake_affine() steers
>> towards the waker rather than the wakee (i.e. don't increment if choice is
>> prev), but then that has no link with LLC spans
>
> Let's say the prev CPU and this CPU were part of the same LLC, and the prev
> CPU was busy (or busier than this CPU); should we consider this as a wake
> affine? If prev was idle, we would surely have considered the prev CPU. Also
> since both are part of the same LLC, we can't say this CPU is more affine
> than the prev CPU. Or maybe I am confusing wake_affine with cache_affine.
>

SD_WAKE_AFFINE says: "Consider waking task on waking CPU.", with that I
read wake_affine() as: "should I place the wakee close to the waker or
close to its previous CPU?". This can be yes or no even if both are in the
same LLC.

>>
>> > Cc: LKML <linux-kernel@vger.kernel.org>
>> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
>> > Cc: Parth Shah <parth@linux.ibm.com>
>> > Cc: Ingo Molnar <mingo@kernel.org>
>> > Cc: Peter Zijlstra <peterz@infradead.org>
>> > Cc: Valentin Schneider <valentin.schneider@arm.com>
>> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> > Cc: Mel Gorman <mgorman@techsingularity.net>
>> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> > Cc: Rik van Riel <riel@surriel.com>
>> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>> > ---
>> >  kernel/sched/fair.c | 6 ++++--
>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 794c2cb945f8..a258a84cfdfd 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -5884,8 +5884,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>> >       if (target == nr_cpumask_bits)
>> >               return prev_cpu;
>> >
>> > -	schedstat_inc(sd->ttwu_move_affine);
>> > -	schedstat_inc(p->se.statistics.nr_wakeups_affine);
>> > +	if (!cpus_share_cache(prev_cpu, target)) {
>>
>> Per the above, why? Why not just if(target == this_cpu) ?
>
> We could use target == this_cpu. However if prev CPU and this CPU share the
> same LLC, then should we consider moving to this_cpu as an affine wakeup?
>

It would make sense if it's a sync wakeup, which wake_affine() does try to
do ATM (regardless of LLC actually, if I'm reading it correctly).
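
For context, the sync shortcut referred to here is the existing check in
wake_affine_idle() (quoting the v5.12 logic from memory, so treat it as
approximate):

	/* a sync wakeup pulls to this_cpu if only the waker is running */
	if (sync && cpu_rq(this_cpu)->nr_running == 1)
		return this_cpu;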


* Re: [PATCH v2 1/8] sched/fair: Update affine statistics when needed
  2021-05-11 11:51       ` Valentin Schneider
@ 2021-05-11 16:22         ` Srikar Dronamraju
  0 siblings, 0 replies; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-11 16:22 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah

* Valentin Schneider <valentin.schneider@arm.com> [2021-05-11 12:51:52]:

> On 07/05/21 22:35, Srikar Dronamraju wrote:
> > * Valentin Schneider <valentin.schneider@arm.com> [2021-05-07 17:08:17]:
> >
> >> On 06/05/21 22:15, Srikar Dronamraju wrote:
> >> > wake_affine_idle() can return prev_cpu. Even in such a scenario,
> >> > the scheduler was going ahead and updating schedstats related to wake
> >> > affine, i.e. even if the task is not moved across LLC domains, the
> >> > schedstats would still have been accounted.
> >
<snip>
> > Let's say the prev CPU and this CPU were part of the same LLC, and the prev
> > CPU was busy (or busier than this CPU); should we consider this as a wake
> > affine? If prev was idle, we would surely have considered the prev CPU. Also
> > since both are part of the same LLC, we can't say this CPU is more affine
> > than the prev CPU. Or maybe I am confusing wake_affine with cache_affine.
> >
> 
> SD_WAKE_AFFINE says: "Consider waking task on waking CPU.", with that I
> read wake_affine() as: "should I place the wakee close to the waker or
> close to its previous CPU?". This can be yes or no even if both are in the
> same LLC.
> 

Okay.

<snip>

> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@ -5884,8 +5884,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >> >       if (target == nr_cpumask_bits)
> >> >               return prev_cpu;
> >> >
> >> > -	schedstat_inc(sd->ttwu_move_affine);
> >> > -	schedstat_inc(p->se.statistics.nr_wakeups_affine);
> >> > +	if (!cpus_share_cache(prev_cpu, target)) {
> >>
> >> Per the above, why? Why not just if(target == this_cpu) ?
> >
> > We could use target == this_cpu. However if prev CPU and this CPU share the
> > same LLC, then should we consider moving to this_cpu as an affine wakeup?
> >
> 
> It would make sense if it's a sync wakeup, which wake_affine() does try to
> do ATM (regardless of LLC actually, if I'm reading it correctly).

Okay, I will replace the cpus_share_cache check with target == this_cpu.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 2/8] sched/fair: Maintain the identity of idle-core
  2021-05-11 11:51   ` Valentin Schneider
@ 2021-05-11 16:27     ` Srikar Dronamraju
  0 siblings, 0 replies; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-11 16:27 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah

* Valentin Schneider <valentin.schneider@arm.com> [2021-05-11 12:51:26]:

> On 06/05/21 22:15, Srikar Dronamraju wrote:
> > The scheduler maintains per-LLC info which tells if there is any idle core
> > in that LLC. However this information doesn't provide which core is idle.
> >
> 

<snip>

> > On the converse, if a task is pinned and has a restricted
> > cpus_allowed_list, and the LLC has multiple idle-cores but select_idle_core()
> > cannot find an idle-core, the LLC will no longer maintain that it has an
> > idle-core.
> >
> 
> This however does sound icky.
> 

<snip>

> > @@ -6127,7 +6129,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >  {
> >       struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> >       int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > -	bool smt = test_idle_cores(target, false);
> > +	int idle_core = get_idle_core(target, -1);
> > +	bool smt = (idle_core != -1);
> 
> test_idle_cores() tells you if there's at least one idle core in the
> target's LLC. AFAICT get_idle_core() only tells you whether the target's
> core is idle, which is not the same thing.
> 
> Note that this code has recently been changed by Rik in
> 
>   c722f35b513f ("sched/fair: Bring back select_idle_smt(), but differently")
> 
> so as annoying as it is you should probably go try this out / rebase your
> series on top of it (as a rule of thumb for core scheduler stuff you should
> use https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git -b
> tip/sched/core as a base).

Yes I did notice it, I will rebase the next version on top of tip/sched/core

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-11 11:51   ` Valentin Schneider
@ 2021-05-11 16:55     ` Srikar Dronamraju
  2021-05-12  0:32     ` Aubrey Li
  1 sibling, 0 replies; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-11 16:55 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Vincent Guittot, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah, Aubrey Li

* Valentin Schneider <valentin.schneider@arm.com> [2021-05-11 12:51:41]:

> On 06/05/21 22:15, Srikar Dronamraju wrote:
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 8db40c8a6ad0..00e4669bb241 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
> >  #ifdef CONFIG_SCHED_SMT
> >  DEFINE_PER_CPU(int, smt_id);
> >  #endif
> > +DEFINE_PER_CPU(int, is_idle);
> 
> This + patch 8 immediately reminds me of Aubrey's patch:
> 
>   http://lore.kernel.org/r/1615872606-56087-1-git-send-email-aubrey.li@intel.com
> 
> last I looked it seemed OK, even the test bot seems happy. Aubrey, did you
> have any more work to do on that one (other than rebasing)?
> 

The above patch is also aimed at similar problems.

However I feel this patch series has a few differences:
- We use nr_busy_cpus in this patchset in wake_affine_idler_llc() to
  differentiate between 2 LLCs and choose the LLC that is more lightly
  loaded (a rough sketch of that comparison follows after this list).

- Except for the per-cpu is_idle, it gets this done with the existing
  infrastructure. (And we could move is_idle into the rq struct too.)

- Mel had reservations on the per-LLC idle-core; I would think a per-LLC
  idle mask would mean even more updates and dirtying. Every time a CPU
  goes idle or comes out of idle, the mask would be dirtied. Though the
  number of times this new per-LLC CPU mask is read is probably going to
  be less than for idle-cores with this patch series.

- In the said implementation, the idleness of a CPU is computed at every
  check, which may not be necessary if handled in the idle callbacks.
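
For reference, a rough sketch of the nr_busy_cpus based choice from the
first point above (simplified; the llc-size scaling is an assumption made
here to keep differently sized LLCs comparable, not necessarily what the
series does):

	pnr_busy = atomic_read(&pref_sds->nr_busy_cpus);
	tnr_busy = atomic_read(&try_sds->nr_busy_cpus);
	pllc_size = per_cpu(sd_llc_size, pref_cpu);
	tllc_size = per_cpu(sd_llc_size, try_cpu);

	/* prefer the LLC with proportionally fewer busy CPUs */
	if (pnr_busy * tllc_size <= tnr_busy * pllc_size)
		return pref_cpu;
	return try_cpu;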

> >  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > @@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
> >  #ifdef CONFIG_SCHED_SMT
> >       per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
> >  #endif
> > +	per_cpu(is_idle, cpu) = 1;
> >       rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> >
> >       sd = lowest_flag_domain(cpu, SD_NUMA);
> > --
> > 2.18.2

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-11 11:51   ` Valentin Schneider
  2021-05-11 16:55     ` Srikar Dronamraju
@ 2021-05-12  0:32     ` Aubrey Li
  1 sibling, 0 replies; 31+ messages in thread
From: Aubrey Li @ 2021-05-12  0:32 UTC (permalink / raw)
  To: Valentin Schneider, Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Thomas Gleixner, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah, Aubrey Li

Hi Valentin,

On 5/11/21 7:51 PM, Valentin Schneider wrote:
> On 06/05/21 22:15, Srikar Dronamraju wrote:
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 8db40c8a6ad0..00e4669bb241 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
>>  #ifdef CONFIG_SCHED_SMT
>>  DEFINE_PER_CPU(int, smt_id);
>>  #endif
>> +DEFINE_PER_CPU(int, is_idle);
> 
> This + patch 8 immediately reminds me of Aubrey's patch:
> 
>   http://lore.kernel.org/r/1615872606-56087-1-git-send-email-aubrey.li@intel.com
> 
> last I looked it seemed OK, even the test bot seems happy. Aubrey, did you
> have any more work to do on that one (other than rebasing)?
> 

Thanks for mentioning this patch. In terms of the patch itself, I don't have
any more work to do on it, other than rebasing it onto Rik's
bring-back-select_idle_smt() patch.

But I have some other ideas based on this patch that I need to verify. For
example, if we have an idle_cpu mask when we scan for an idle CPU, we should
be able to remove the SIS_PROP feature entirely. This removes the scan cost
computation and should reduce the task wakeup latency. But I need a while to
understand some benchmark regressions.
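
A minimal sketch of that idea (idle_cpus_mask here is a hypothetical
per-LLC mask, kept up to date from the idle entry/exit paths):

	/*
	 * With an up-to-date idle mask, the wakeup scan reduces to a
	 * cpumask intersection and needs no SIS_PROP cost model.
	 */
	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);

	cpumask_and(cpus, sds->idle_cpus_mask, p->cpus_ptr);
	return cpumask_first(cpus);	/* >= nr_cpu_ids means no idle CPU */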

Thanks,
-Aubrey


* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-06 16:45 ` [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback Srikar Dronamraju
  2021-05-11 11:51   ` Valentin Schneider
@ 2021-05-12  8:08   ` Aubrey Li
  2021-05-13  7:31     ` Srikar Dronamraju
  1 sibling, 1 reply; 31+ messages in thread
From: Aubrey Li @ 2021-05-12  8:08 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Gautham R Shenoy, Parth Shah

On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
> Currently we account nr_busy_cpus in no_hz idle functions.
> There is no reason why nr_busy_cpus should be updated in NO_HZ_COMMON
> configs only. Also the scheduler can mark a CPU as non-busy as soon as an
> idle class task starts to run. The scheduler can then mark a CPU as busy
> as soon as it's woken up from idle or a new task is placed on its
> runqueue.

IIRC, we discussed this before, if a SCHED_IDLE task is placed on the
CPU's runqueue, this CPU should be still taken as a wakeup target.

Also, for those frequent context-switching tasks with very short idle,
it's expensive for scheduler to mark idle/busy every time, that's why
my patch only marks idle every time and marks busy ratelimited in
scheduler tick.
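
As a rough sketch (invented helper name and rq field, not the actual posted
patch), the tick-ratelimited busy marking looks like:

static void mark_cpu_busy_ratelimited(struct rq *rq)
{
	struct sched_domain_shared *sds;

	/* bound the update cost: mark busy at most once per jiffy */
	if (rq->last_busy_mark == jiffies)	/* hypothetical rq field */
		return;
	rq->last_busy_mark = jiffies;

	rcu_read_lock();
	sds = rcu_dereference(per_cpu(sd_llc_shared, rq->cpu));
	if (sds && per_cpu(is_idle, rq->cpu)) {
		atomic_inc(&sds->nr_busy_cpus);
		per_cpu(is_idle, rq->cpu) = 0;
	}
	rcu_read_unlock();
}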

Thanks,
-Aubrey

> 
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Parth Shah <parth@linux.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Rik van Riel <riel@surriel.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c     |  6 ++++--
>  kernel/sched/idle.c     | 29 +++++++++++++++++++++++++++--
>  kernel/sched/sched.h    |  1 +
>  kernel/sched/topology.c |  2 ++
>  4 files changed, 34 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c30587631a24..4d3b0928fe98 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10394,7 +10394,10 @@ static void set_cpu_sd_state_busy(int cpu)
>  		goto unlock;
>  	sd->nohz_idle = 0;
>  
> -	atomic_inc(&sd->shared->nr_busy_cpus);
> +	if (sd && per_cpu(is_idle, cpu)) {
> +		atomic_add_unless(&sd->shared->nr_busy_cpus, 1, per_cpu(sd_llc_size, cpu));
> +		per_cpu(is_idle, cpu) = 0;
> +	}
>  unlock:
>  	rcu_read_unlock();
>  }
> @@ -10424,7 +10427,6 @@ static void set_cpu_sd_state_idle(int cpu)
>  		goto unlock;
>  	sd->nohz_idle = 1;
>  
> -	atomic_dec(&sd->shared->nr_busy_cpus);
>  unlock:
>  	rcu_read_unlock();
>  }
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index cc828f3efe71..e624a05e48bd 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -425,12 +425,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
>  
>  static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
>  {
> -#ifdef CONFIG_SCHED_SMT
> +#ifdef CONFIG_SMP
> +	struct sched_domain_shared *sds;
>  	int cpu = rq->cpu;
>  
> +#ifdef CONFIG_SCHED_SMT
>  	if (static_branch_likely(&sched_smt_present))
>  		set_core_busy(cpu);
>  #endif
> +
> +	rcu_read_lock();
> +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> +	if (sds) {
> +		if (per_cpu(is_idle, cpu)) {
> +			atomic_inc(&sds->nr_busy_cpus);
> +			per_cpu(is_idle, cpu) = 0;
> +		}
> +	}
> +	rcu_read_unlock();
> +#endif
>  }
>  
>  static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
> @@ -442,9 +455,21 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
>  struct task_struct *pick_next_task_idle(struct rq *rq)
>  {
>  	struct task_struct *next = rq->idle;
> +#ifdef CONFIG_SMP
> +	struct sched_domain_shared *sds;
> +	int cpu = rq->cpu;
>  
> -	set_next_task_idle(rq, next, true);
> +	rcu_read_lock();
> +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> +	if (sds) {
> +		atomic_add_unless(&sds->nr_busy_cpus, -1, 0);
> +		per_cpu(is_idle, cpu) = 1;
> +	}
>  
> +	rcu_read_unlock();
> +#endif
> +
> +	set_next_task_idle(rq, next, true);
>  	return next;
>  }
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5c0bd4b0e73a..baf8d9a4cb26 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1483,6 +1483,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
>  #ifdef CONFIG_SCHED_SMT
>  DECLARE_PER_CPU(int, smt_id);
>  #endif
> +DECLARE_PER_CPU(int, is_idle);
>  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8db40c8a6ad0..00e4669bb241 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
>  #ifdef CONFIG_SCHED_SMT
>  DEFINE_PER_CPU(int, smt_id);
>  #endif
> +DEFINE_PER_CPU(int, is_idle);
>  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> @@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
>  #ifdef CONFIG_SCHED_SMT
>  	per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
>  #endif
> +	per_cpu(is_idle, cpu) = 1;
>  	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>  
>  	sd = lowest_flag_domain(cpu, SD_NUMA);
> 



* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-12  8:08   ` Aubrey Li
@ 2021-05-13  7:31     ` Srikar Dronamraju
  2021-05-14  4:11       ` Aubrey Li
  0 siblings, 1 reply; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-13  7:31 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

* Aubrey Li <aubrey.li@linux.intel.com> [2021-05-12 16:08:24]:

> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
> > Currently we account nr_busy_cpus in no_hz idle functions.
> > There is no reason why nr_busy_cpus should be updated in NO_HZ_COMMON
> > configs only. Also the scheduler can mark a CPU as non-busy as soon as an
> > idle class task starts to run. The scheduler can then mark a CPU as busy
> > as soon as it's woken up from idle or a new task is placed on its
> > runqueue.
> 
> IIRC, we discussed this before, if a SCHED_IDLE task is placed on the
> CPU's runqueue, this CPU should be still taken as a wakeup target.
> 

Yes, this CPU is still a wakeup target; it's only when this CPU is busy
that we look at other CPUs.

> Also, for those frequent context-switching tasks with very short idle,
> it's expensive for scheduler to mark idle/busy every time, that's why
> my patch only marks idle every time and marks busy ratelimited in
> scheduler tick.
> 

I have tried a few tasks with very short idle times, and updating nr_busy
every time doesn't seem to be impacting. In fact, it seems to help in
picking the idler LLC more often.

> Thanks,
> -Aubrey
> 
> > 
> > Cc: LKML <linux-kernel@vger.kernel.org>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Parth Shah <parth@linux.ibm.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Rik van Riel <riel@surriel.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  kernel/sched/fair.c     |  6 ++++--
> >  kernel/sched/idle.c     | 29 +++++++++++++++++++++++++++--
> >  kernel/sched/sched.h    |  1 +
> >  kernel/sched/topology.c |  2 ++
> >  4 files changed, 34 insertions(+), 4 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c30587631a24..4d3b0928fe98 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10394,7 +10394,10 @@ static void set_cpu_sd_state_busy(int cpu)
> >  		goto unlock;
> >  	sd->nohz_idle = 0;
> >  
> > -	atomic_inc(&sd->shared->nr_busy_cpus);
> > +	if (sd && per_cpu(is_idle, cpu)) {
> > +		atomic_add_unless(&sd->shared->nr_busy_cpus, 1, per_cpu(sd_llc_size, cpu));
> > +		per_cpu(is_idle, cpu) = 0;
> > +	}
> >  unlock:
> >  	rcu_read_unlock();
> >  }
> > @@ -10424,7 +10427,6 @@ static void set_cpu_sd_state_idle(int cpu)
> >  		goto unlock;
> >  	sd->nohz_idle = 1;
> >  
> > -	atomic_dec(&sd->shared->nr_busy_cpus);
> >  unlock:
> >  	rcu_read_unlock();
> >  }
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index cc828f3efe71..e624a05e48bd 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -425,12 +425,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
> >  
> >  static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
> >  {
> > -#ifdef CONFIG_SCHED_SMT
> > +#ifdef CONFIG_SMP
> > +	struct sched_domain_shared *sds;
> >  	int cpu = rq->cpu;
> >  
> > +#ifdef CONFIG_SCHED_SMT
> >  	if (static_branch_likely(&sched_smt_present))
> >  		set_core_busy(cpu);
> >  #endif
> > +
> > +	rcu_read_lock();
> > +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > +	if (sds) {
> > +		if (per_cpu(is_idle, cpu)) {
> > +			atomic_inc(&sds->nr_busy_cpus);
> > +			per_cpu(is_idle, cpu) = 0;
> > +		}
> > +	}
> > +	rcu_read_unlock();
> > +#endif
> >  }
> >  
> >  static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
> > @@ -442,9 +455,21 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
> >  struct task_struct *pick_next_task_idle(struct rq *rq)
> >  {
> >  	struct task_struct *next = rq->idle;
> > +#ifdef CONFIG_SMP
> > +	struct sched_domain_shared *sds;
> > +	int cpu = rq->cpu;
> >  
> > -	set_next_task_idle(rq, next, true);
> > +	rcu_read_lock();
> > +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > +	if (sds) {
> > +		atomic_add_unless(&sds->nr_busy_cpus, -1, 0);
> > +		per_cpu(is_idle, cpu) = 1;
> > +	}
> >  
> > +	rcu_read_unlock();
> > +#endif
> > +
> > +	set_next_task_idle(rq, next, true);
> >  	return next;
> >  }
> >  
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 5c0bd4b0e73a..baf8d9a4cb26 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1483,6 +1483,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
> >  #ifdef CONFIG_SCHED_SMT
> >  DECLARE_PER_CPU(int, smt_id);
> >  #endif
> > +DECLARE_PER_CPU(int, is_idle);
> >  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> >  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 8db40c8a6ad0..00e4669bb241 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
> >  #ifdef CONFIG_SCHED_SMT
> >  DEFINE_PER_CPU(int, smt_id);
> >  #endif
> > +DEFINE_PER_CPU(int, is_idle);
> >  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > @@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
> >  #ifdef CONFIG_SCHED_SMT
> >  	per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
> >  #endif
> > +	per_cpu(is_idle, cpu) = 1;
> >  	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> >  
> >  	sd = lowest_flag_domain(cpu, SD_NUMA);
> > 
> 

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-13  7:31     ` Srikar Dronamraju
@ 2021-05-14  4:11       ` Aubrey Li
  2021-05-17 10:40         ` Srikar Dronamraju
  0 siblings, 1 reply; 31+ messages in thread
From: Aubrey Li @ 2021-05-14  4:11 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

On 5/13/21 3:31 PM, Srikar Dronamraju wrote:
> * Aubrey Li <aubrey.li@linux.intel.com> [2021-05-12 16:08:24]:
> 
>> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
>>> Currently we account nr_busy_cpus in no_hz idle functions.
>>> There is no reason why nr_busy_cpus should be updated in NO_HZ_COMMON
>>> configs only. Also the scheduler can mark a CPU as non-busy as soon as an
>>> idle class task starts to run. The scheduler can then mark a CPU as busy
>>> as soon as it's woken up from idle or a new task is placed on its
>>> runqueue.
>>
>> IIRC, we discussed this before, if a SCHED_IDLE task is placed on the
>> CPU's runqueue, this CPU should be still taken as a wakeup target.
>>
> 
> > Yes, this CPU is still a wakeup target; it's only when this CPU is busy
> > that we look at other CPUs.
> 
>> Also, for those frequent context-switching tasks with very short idle,
>> it's expensive for scheduler to mark idle/busy every time, that's why
>> my patch only marks idle every time and marks busy ratelimited in
>> scheduler tick.
>>
> 
> > I have tried a few tasks with very short idle times, and updating nr_busy
> > every time doesn't seem to be impacting. In fact, it seems to help in
> > picking the idler LLC more often.
> 

How many CPUs in your LLC?

This is a system with 192 CPUs, 4 nodes and each node has 48 CPUs in LLC
domain.

It looks like for netperf both TCP and UDP cases have the notable change
under 2 x overcommit, it may be not interesting though.


hackbench(48 tasks per group)
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	group-1 	 1.00 (  6.74)	 -4.61 (  8.97)
process-pipe    	group-2 	 1.00 ( 36.84)	+11.53 ( 26.35)
process-pipe    	group-3 	 1.00 ( 24.97)	+12.21 ( 19.05)
process-pipe    	group-4 	 1.00 ( 18.27)	 -2.62 ( 17.60)
process-pipe    	group-8 	 1.00 (  4.33)	 -2.22 (  3.08)
process-sockets 	group-1 	 1.00 (  7.88)	-20.26 ( 15.97)
process-sockets 	group-2 	 1.00 (  5.38)	-19.41 (  9.25)
process-sockets 	group-3 	 1.00 (  4.22)	 -5.70 (  3.00)
process-sockets 	group-4 	 1.00 (  1.44)	 -1.80 (  0.79)
process-sockets 	group-8 	 1.00 (  0.44)	 -2.86 (  0.06)
threads-pipe    	group-1 	 1.00 (  5.43)	 -3.69 (  3.59)
threads-pipe    	group-2 	 1.00 ( 18.00)	 -2.69 ( 16.79)
threads-pipe    	group-3 	 1.00 ( 21.72)	 -9.01 ( 21.34)
threads-pipe    	group-4 	 1.00 ( 21.58)	 -6.43 ( 16.26)
threads-pipe    	group-8 	 1.00 (  3.05)	 -0.15 (  2.31)
threads-sockets 	group-1 	 1.00 ( 14.51)	 -5.35 ( 13.85)
threads-sockets 	group-2 	 1.00 (  3.97)	-24.15 (  4.40)
threads-sockets 	group-3 	 1.00 (  4.97)	 -9.05 (  2.46)
threads-sockets 	group-4 	 1.00 (  1.98)	 -3.44 (  0.49)
threads-sockets 	group-8 	 1.00 (  0.37)	 -2.13 (  0.20)

netperf
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	thread-48	 1.00 (  3.84)	 -2.20 (  3.83)
TCP_RR          	thread-96	 1.00 (  5.22)	 -4.97 (  3.90)
TCP_RR          	thread-144	 1.00 (  7.97)	 -0.75 (  4.39)
TCP_RR          	thread-192	 1.00 (  3.03)	 -0.67 (  4.40)
TCP_RR          	thread-384	 1.00 ( 22.27)	-14.15 ( 36.28)
UDP_RR          	thread-48	 1.00 (  2.08)	 -0.39 (  2.29)
UDP_RR          	thread-96	 1.00 (  2.48)	 -4.26 ( 16.06)
UDP_RR          	thread-144	 1.00 ( 49.50)	 -3.28 ( 34.86)
UDP_RR          	thread-192	 1.00 (  6.39)	 +8.07 ( 88.15)
UDP_RR          	thread-384	 1.00 ( 31.54)	-12.76 ( 35.98)


* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-14  4:11       ` Aubrey Li
@ 2021-05-17 10:40         ` Srikar Dronamraju
  2021-05-17 12:48           ` Aubrey Li
  0 siblings, 1 reply; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-17 10:40 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

* Aubrey Li <aubrey.li@linux.intel.com> [2021-05-14 12:11:50]:

> On 5/13/21 3:31 PM, Srikar Dronamraju wrote:
> > * Aubrey Li <aubrey.li@linux.intel.com> [2021-05-12 16:08:24]:
> >> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:

<snip>

> >> Also, for those frequent context-switching tasks with very short idle,
> >> it's expensive for scheduler to mark idle/busy every time, that's why
> >> my patch only marks idle every time and marks busy ratelimited in
> >> scheduler tick.
> >>
> > 
> > I have tried a few tasks with very short idle times, and updating nr_busy
> > every time doesn't seem to be impacting. In fact, it seems to help in
> > picking the idler LLC more often.
> > 
> 
> How many CPUs in your LLC?

I have tried with X86, 48 CPUs, 2 nodes, each having 24 CPUs in LLC
+
POWER10, Multiple CPUs with 4 CPUs in LLC
+
POWER9, Multiple CPUs with 8 CPUs in LLC

> 
> This is a system with 192 CPUs, 4 nodes and each node has 48 CPUs in LLC
> domain.
> 

Okay,

> It looks like for netperf both TCP and UDP cases have the notable change
> under 2 x overcommit, it may be not interesting though.
> 
> 

I believe the extra load on this 24 core LLC could be because we may end up
trying to set the idle-core, even when there is no idle core available.

If possible, can you please give a try with v3 with the call to
set_next_idle_core commented out?


-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-17 10:40         ` Srikar Dronamraju
@ 2021-05-17 12:48           ` Aubrey Li
  2021-05-17 12:57             ` Srikar Dronamraju
  0 siblings, 1 reply; 31+ messages in thread
From: Aubrey Li @ 2021-05-17 12:48 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

On 5/17/21 6:40 PM, Srikar Dronamraju wrote:
> * Aubrey Li <aubrey.li@linux.intel.com> [2021-05-14 12:11:50]:
> 
>> On 5/13/21 3:31 PM, Srikar Dronamraju wrote:
>>> * Aubrey Li <aubrey.li@linux.intel.com> [2021-05-12 16:08:24]:
>>>> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
> 
> <snip>
> 
>>>> Also, for those frequent context-switching tasks with very short idle,
>>>> it's expensive for scheduler to mark idle/busy every time, that's why
>>>> my patch only marks idle every time and marks busy ratelimited in
>>>> scheduler tick.
>>>>
>>>
>>> I have tried a few tasks with very short idle times, and updating nr_busy
>>> every time doesn't seem to be impacting. In fact, it seems to help in
>>> picking the idler LLC more often.
>>>
>>
>> How many CPUs in your LLC?
> 
> I have tried with X86, 48 CPUs, 2 nodes, each having 24 CPUs in LLC
> +
> POWER10, Multiple CPUs with 4 CPUs in LLC
> +
> POWER9, Multiple CPUs with 8 CPUs in LLC
> 
>>
>> This is a system with 192 CPUs, 4 nodes and each node has 48 CPUs in LLC
>> domain.
>>
> 
> Okay,
> 
>> It looks like for netperf both TCP and UDP cases have the notable change
>> under 2 x overcommit, it may be not interesting though.
>>
>>
> 
> I believe the extra load on this 24 core LLC could be because we may end up
> trying to set the idle-core, even when there is no idle core available.
> 
> If possible, can you please give a try with v3 with the call to
> set_next_idle_core commented out?
> 
> 

v3 seems not to be applicable on tip/sched/core 915a2bc3c6b7?


* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-17 12:48           ` Aubrey Li
@ 2021-05-17 12:57             ` Srikar Dronamraju
  2021-05-18  0:59               ` Aubrey Li
  0 siblings, 1 reply; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-17 12:57 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

* Aubrey Li <aubrey.li@linux.intel.com> [2021-05-17 20:48:46]:

> On 5/17/21 6:40 PM, Srikar Dronamraju wrote:
> > * Aubrey Li <aubrey.li@linux.intel.com> [2021-05-14 12:11:50]:
> > 
> >> On 5/13/21 3:31 PM, Srikar Dronamraju wrote:
> >>> * Aubrey Li <aubrey.li@linux.intel.com> [2021-05-12 16:08:24]:
> >>>> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
> > 
> > <snip>
> > 
> >>>> Also, for those frequent context-switching tasks with very short idle,
> >>>> it's expensive for scheduler to mark idle/busy every time, that's why
> >>>> my patch only marks idle every time and marks busy ratelimited in
> >>>> scheduler tick.
> >>>>
> >>>
> >>> I have tried a few tasks with very short idle times, and updating nr_busy
> >>> every time doesn't seem to be impacting. In fact, it seems to help in
> >>> picking the idler LLC more often.
> >>>
> >>
> >> How many CPUs in your LLC?
> > 
> > I have tried with X86, 48 CPUs, 2 nodes, each having 24 CPUs in LLC
> > +
> > POWER10, Multiple CPUs with 4 CPUs in LLC
> > +
> > POWER9, Multiple CPUs with 8 CPUs in LLC
> > 
> >>
> >> This is a system with 192 CPUs, 4 nodes and each node has 48 CPUs in LLC
> >> domain.
> >>
> > 
> > Okay,
> > 
> >> It looks like for netperf both TCP and UDP cases have the notable change
> >> under 2 x overcommit, it may be not interesting though.
> >>
> >>
> > 
> > I believe the extra load on this 24 core LLC could be because we may end up
> > trying to set the idle-core, even when there is no idle core available.
> > 
> > If possible, can you please give a try with v3 with the call to
> > set_next_idle_core commented out?
> > 
> > 
> 
> v3 seems not to be applicable on tip/sched/core 915a2bc3c6b7?

I had applied on top of 2ea46c6fc9452ac100ad907b051d797225847e33
which was tag: sched-core-2021-04-28

The only conflict you get on today's tip is Gautham's one line patch.
Gautham's patch replaced 'this' with 'target'.

The 2nd patch does away with that line

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-17 12:57             ` Srikar Dronamraju
@ 2021-05-18  0:59               ` Aubrey Li
  2021-05-18  4:00                 ` Srikar Dronamraju
  0 siblings, 1 reply; 31+ messages in thread
From: Aubrey Li @ 2021-05-18  0:59 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

On 5/17/21 8:57 PM, Srikar Dronamraju wrote:
> * Aubrey Li <aubrey.li@linux.intel.com> [2021-05-17 20:48:46]:
> 
>> On 5/17/21 6:40 PM, Srikar Dronamraju wrote:
>>> * Aubrey Li <aubrey.li@linux.intel.com> [2021-05-14 12:11:50]:
>>>
>>>> On 5/13/21 3:31 PM, Srikar Dronamraju wrote:
>>>>> * Aubrey Li <aubrey.li@linux.intel.com> [2021-05-12 16:08:24]:
>>>>>> On 5/7/21 12:45 AM, Srikar Dronamraju wrote:
>>>
>>> <snip>
>>>
>>>>>> Also, for those frequent context-switching tasks with very short idle,
>>>>>> it's expensive for scheduler to mark idle/busy every time, that's why
>>>>>> my patch only marks idle every time and marks busy ratelimited in
>>>>>> scheduler tick.
>>>>>>
>>>>>
>>>>> I have tried a few tasks with very short idle times, and updating nr_busy
>>>>> every time doesn't seem to be impacting. In fact, it seems to help in
>>>>> picking the idler LLC more often.
>>>>>
>>>>
>>>> How many CPUs in your LLC?
>>>
>>> I have tried with X86, 48 CPUs, 2 nodes, each having 24 CPUs in LLC
>>> +
>>> POWER10, Multiple CPUs with 4 CPUs in LLC
>>> +
>>> POWER9, Multiple CPUs with 8 CPUs in LLC
>>>
>>>>
>>>> This is a system with 192 CPUs, 4 nodes and each node has 48 CPUs in LLC
>>>> domain.
>>>>
>>>
>>> Okay,
>>>
>>>> It looks like for netperf both TCP and UDP cases have the notable change
>>>> under 2 x overcommit, it may be not interesting though.
>>>>
>>>>
>>>
>>> I believe the extra load on this 24 core LLC could be because we may end up
>>> trying to set the idle-core, even when there is no idle core available.
>>>
>>> If possible, can you please give a try with v3 with the call to
>>> set_next_idle_core commented out?
>>>
>>>
>>
>> v3 seems not to be applicable on tip/sched/core 915a2bc3c6b7?
> 
> I had applied on top of 2ea46c6fc9452ac100ad907b051d797225847e33
> which was tag: sched-core-2021-04-28
> 
> The only conflict you get on today's tip is Gautham's one line patch.
> Gautham's patch replaced 'this' with 'target'.
> 
> The 2nd patch does away with that line
> 

This is v3. It looks like hackbench gets better. And netperf still has
some notable changes under 2 x overcommit cases.


hackbench (48 tasks per group)
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	group-1 	 1.00 (  4.51)	 +1.36 (  4.26)
process-pipe    	group-2 	 1.00 ( 18.73)	 -9.66 ( 31.15)
process-pipe    	group-3 	 1.00 ( 23.67)	 +8.52 ( 21.13)
process-pipe    	group-4 	 1.00 ( 14.65)	+17.12 ( 25.23)
process-pipe    	group-8 	 1.00 (  3.11)	+16.41 (  5.94)
process-sockets 	group-1 	 1.00 (  8.83)	 +1.53 ( 11.93)
process-sockets 	group-2 	 1.00 (  5.32)	-15.43 (  7.17)
process-sockets 	group-3 	 1.00 (  4.79)	 -4.14 (  1.90)
process-sockets 	group-4 	 1.00 (  2.39)	 +4.37 (  1.31)
process-sockets 	group-8 	 1.00 (  0.38)	 +4.41 (  0.05)
threads-pipe    	group-1 	 1.00 (  3.06)	 -1.57 (  3.71)
threads-pipe    	group-2 	 1.00 ( 17.41)	 -2.16 ( 15.29)
threads-pipe    	group-3 	 1.00 ( 17.94)	+19.86 ( 13.24)
threads-pipe    	group-4 	 1.00 ( 15.38)	 +3.71 ( 11.97)
threads-pipe    	group-8 	 1.00 (  2.72)	+13.40 (  8.43)
threads-sockets 	group-1 	 1.00 (  8.51)	 -2.73 ( 17.48)
threads-sockets 	group-2 	 1.00 (  5.44)	-12.04 (  5.91)
threads-sockets 	group-3 	 1.00 (  4.38)	 -5.00 (  1.48)
threads-sockets 	group-4 	 1.00 (  1.08)	 +4.46 (  1.15)
threads-sockets 	group-8 	 1.00 (  0.61)	 +5.12 (  0.20)

netperf
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	thread-48	 1.00 (  3.79)	 +4.69 (  4.03)
TCP_RR          	thread-96	 1.00 (  4.98)	 -6.74 (  3.59)
TCP_RR          	thread-144	 1.00 (  6.04)	 -2.36 (  3.57)
TCP_RR          	thread-192	 1.00 (  4.97)	 -0.44 (  4.89)
TCP_RR          	thread-384	 1.00 ( 19.87)	-19.12 ( 28.99)
UDP_RR          	thread-48	 1.00 ( 12.54)	 -2.73 (  1.59)
UDP_RR          	thread-96	 1.00 (  6.51)	 -6.66 ( 10.42)
UDP_RR          	thread-144	 1.00 ( 45.41)	 -3.81 ( 31.37)
UDP_RR          	thread-192	 1.00 ( 32.06)	 +3.07 ( 71.89)
UDP_RR          	thread-384	 1.00 ( 29.57)	-21.52 ( 35.50)
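
As an aside, the "mark idle every time, mark busy rate-limited in the tick"
scheme quoted above could look roughly like the sketch below; the helper
names and the per-CPU flag are illustrative assumptions, not the actual
patch:

static DEFINE_PER_CPU(bool, busy_marked);

static void account_idle_entry(struct sched_domain_shared *sds)
{
	/* Cheap enough to run on every idle entry. */
	if (this_cpu_read(busy_marked)) {
		atomic_dec(&sds->nr_busy_cpus);
		this_cpu_write(busy_marked, false);
	}
}

static void account_busy_from_tick(struct sched_domain_shared *sds)
{
	/* Called only from the tick, so busy is marked at most once per tick. */
	if (!this_cpu_read(busy_marked)) {
		atomic_inc(&sds->nr_busy_cpus);
		this_cpu_write(busy_marked, true);
	}
}

The point is that short idle/busy flips only touch a per-CPU flag; the
shared atomic is pushed in the busy direction at most once per tick.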


* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-18  0:59               ` Aubrey Li
@ 2021-05-18  4:00                 ` Srikar Dronamraju
  2021-05-18  6:05                   ` Aubrey Li
  0 siblings, 1 reply; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-18  4:00 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

* Aubrey Li <aubrey.li@linux.intel.com> [2021-05-18 08:59:00]:

> <snip>
> 
> This is with v3. It looks like hackbench gets better, while netperf still
> shows some notable changes in the 2 x overcommit cases.
> 

Thanks, Aubrey, for the results. The netperf (2X) case does seem to regress.
I was actually expecting the results to get better with overcommit.
Can you confirm if this was just v3, or v3 + set_next_idle_core
disabled?

> 
> <snip>

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-18  4:00                 ` Srikar Dronamraju
@ 2021-05-18  6:05                   ` Aubrey Li
  2021-05-18  7:18                     ` Srikar Dronamraju
  0 siblings, 1 reply; 31+ messages in thread
From: Aubrey Li @ 2021-05-18  6:05 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

On 5/18/21 12:00 PM, Srikar Dronamraju wrote:
> * Aubrey Li <aubrey.li@linux.intel.com> [2021-05-18 08:59:00]:
> 
>> <snip>
>>
>> This is with v3. It looks like hackbench gets better, while netperf still
>> shows some notable changes in the 2 x overcommit cases.
>>
> 
> Thanks, Aubrey, for the results. The netperf (2X) case does seem to regress.
> I was actually expecting the results to get better with overcommit.
> Can you confirm if this was just v3, or v3 + set_next_idle_core
> disabled?

Do you actually mean set_idle_cores (not set_next_idle_core)? Gautham's patch
changed "this" to "target" in set_idle_cores, and I removed that change to
apply v3-2-8-sched-fair-Maintain-the-identity-of-idle-core.patch on
tip/sched/core commit 915a2bc3c6b7.


* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-18  6:05                   ` Aubrey Li
@ 2021-05-18  7:18                     ` Srikar Dronamraju
  2021-05-19  9:43                       ` Aubrey Li
  0 siblings, 1 reply; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-18  7:18 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

* Aubrey Li <aubrey.li@linux.intel.com> [2021-05-18 14:05:56]:

> On 5/18/21 12:00 PM, Srikar Dronamraju wrote:

<snip>
> >>
> >> This is with v3. It looks like hackbench gets better, while netperf still
> >> shows some notable changes in the 2 x overcommit cases.
> >>
> > 
> > Thanks, Aubrey, for the results. The netperf (2X) case does seem to regress.
> > I was actually expecting the results to get better with overcommit.
> > Can you confirm if this was just v3, or v3 + set_next_idle_core
> > disabled?
> 
> Do you actually mean set_idle_cores (not set_next_idle_core)? Gautham's patch
> changed "this" to "target" in set_idle_cores, and I removed that change to
> apply v3-2-8-sched-fair-Maintain-the-identity-of-idle-core.patch on
> tip/sched/core commit 915a2bc3c6b7.

That's correct.

In the 3rd patch, I had introduced set_next_idle_core,
which is supposed to set idle_cores in the LLC.
What I suspected is that this one is causing issues in your 48-CPU LLC.

I expect set_next_idle_core to be spending a lot of time in your scenario.
I was planning for something like the below on top of my patch.
With this, we don't look for an idle core if we already know that we won't
find one. But in the meanwhile, I had asked if you could drop the call to
set_next_idle_core.

-- 
Thanks and Regards
Srikar Dronamraju

------------>8-----------------8<--------------------------

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bee8e5225d99..2e2113262647 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6207,6 +6207,10 @@ static void set_next_idle_core(int target)
 	if (!sd)
 		return;
 
+	/* Half or more of the LLC's CPUs busy: assume no idle core is left */
+	if (atomic_read(&sd->shared->nr_busy_cpus) * 2 >= per_cpu(sd_llc_size, target))
+		goto out;
+
 	cpumask_andnot(cpus, sched_domain_span(sd), cpu_smt_mask(target));
 	for_each_cpu_wrap(core, cpus, target) {
 		bool idle = true;
@@ -6225,6 +6228,7 @@ static void set_next_idle_core(int target)
 
 		cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
 	}
+out:
 	set_idle_core(target, -2);
 }
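
To make the arithmetic concrete, a minimal sketch of the predicate above
(standalone C with hypothetical names; the patch reads nr_busy_cpus and
sd_llc_size directly):

static bool llc_probably_has_no_idle_core(int nr_busy, int llc_size,
					  int smt_width)
{
	/*
	 * With SMT2 on a 48-CPU LLC: 24 busy CPUs * 2 >= 48, i.e. on
	 * average one busy thread per core, so a fully idle core is
	 * unlikely and the scan can be skipped.
	 */
	return nr_busy * smt_width >= llc_size;
}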
 



* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-18  7:18                     ` Srikar Dronamraju
@ 2021-05-19  9:43                       ` Aubrey Li
  2021-05-19 17:34                         ` Srikar Dronamraju
  0 siblings, 1 reply; 31+ messages in thread
From: Aubrey Li @ 2021-05-19  9:43 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

On 5/18/21 3:18 PM, Srikar Dronamraju wrote:

> <snip>
> 
> That's correct.
> 
> In the 3rd patch, I had introduced set_next_idle_core,
> which is supposed to set idle_cores in the LLC.
> What I suspected is that this one is causing issues in your 48-CPU LLC.
> 
> I expect set_next_idle_core to be spending a lot of time in your scenario.
> I was planning for something like the below on top of my patch.
> With this, we don't look for an idle core if we already know that we won't
> find one. But in the meanwhile, I had asked if you could drop the call to
> set_next_idle_core.
> 

+	if (atomic_read(&sd->shared->nr_busy_cpus) * 2 >= per_cpu(sd_llc_size, target))
+		goto out;

Does this have side effects if the waker and wakee are coalesced on a
portion of the cores? Also, is the 2 an SMT2 assumption?

I did some quick testing on this; it looks like the netperf 2x regressions
are indeed improved, but the two hackbench mid-load cases get worse.

process-sockets 	group-2 	 1.00 (  5.32)	-18.40 (  7.32)
threads-sockets 	group-2 	 1.00 (  5.44)	-20.44 (  4.60)

Thanks,
-Aubrey


* Re: [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback
  2021-05-19  9:43                       ` Aubrey Li
@ 2021-05-19 17:34                         ` Srikar Dronamraju
  0 siblings, 0 replies; 31+ messages in thread
From: Srikar Dronamraju @ 2021-05-19 17:34 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Gautham R Shenoy, Parth Shah

* Aubrey Li <aubrey.li@linux.intel.com> [2021-05-19 17:43:55]:

> On 5/18/21 3:18 PM, Srikar Dronamraju wrote:
> 
> > <snip>
> > 
> > That's correct.
> > 
> > In the 3rd patch, I had introduced set_next_idle_core,
> > which is supposed to set idle_cores in the LLC.
> > What I suspected is that this one is causing issues in your 48-CPU LLC.
> > 
> > I expect set_next_idle_core to be spending a lot of time in your scenario.
> > I was planning for something like the below on top of my patch.
> > With this, we don't look for an idle core if we already know that we won't
> > find one. But in the meanwhile, I had asked if you could drop the call to
> > set_next_idle_core.
> > 
> 
> +	if (atomic_read(&sd->shared->nr_busy_cpus) * 2 >= per_cpu(sd_llc_size, target))
> +		goto out;
> 
> Does this have side effects if the waker and wakee are coalesced on a
> portion of the cores? Also, is the 2 an SMT2 assumption?

The above line was just a hack to see if things change when we stop spending
time searching for an idle core. Since you were running on Intel, I had
hard-coded it to 2; a topology-aware variant is sketched below.
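
For reference, a sketch of how the hard-coded 2 could instead be derived
from the topology (illustrative and untested; cpu_smt_mask() and
cpumask_weight() are the usual kernel helpers):

	int smt_width = cpumask_weight(cpu_smt_mask(target));

	if (atomic_read(&sd->shared->nr_busy_cpus) * smt_width >=
	    per_cpu(sd_llc_size, target))
		goto out;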

> 
> I did some quick testing on this; it looks like the netperf 2x regressions
> are indeed improved, but the two hackbench mid-load cases get worse.
> 

In the mid-load case, there was a chance of an idle core being around. Since
we don't set it to an idle core or to -2, i.e. idle-core is set to -1, it
may or may not get selected; see the sketch of the encoding below.
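
The idle-core field discussed here is effectively tri-state. A reader-side
sketch of the semantics (illustrative, not the literal patch code, and
assuming the per-LLC field is named sds->idle_core):

	/*
	 *  >= 0 : id of a known idle core in this LLC
	 *  -1   : unknown -- an idle core may or may not exist
	 *  -2   : known that this LLC has no idle core
	 */
	int idle_core = READ_ONCE(sds->idle_core);

	if (idle_core == -2)
		return -1;		/* skip the idle-core scan entirely */
	if (idle_core >= 0)
		return idle_core;	/* use the cached idle core */
	/* idle_core == -1: fall back to scanning for an idle core */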

> process-sockets 	group-2 	 1.00 (  5.32)	-18.40 (  7.32)
> threads-sockets 	group-2 	 1.00 (  5.44)	-20.44 (  4.60)

-- 
Thanks and Regards
Srikar Dronamraju


end of thread

Thread overview: 31+ messages
2021-05-06 16:45 [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 1/8] sched/fair: Update affine statistics when needed Srikar Dronamraju
2021-05-07 16:08   ` Valentin Schneider
2021-05-07 17:05     ` Srikar Dronamraju
2021-05-11 11:51       ` Valentin Schneider
2021-05-11 16:22         ` Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 2/8] sched/fair: Maintain the identity of idle-core Srikar Dronamraju
2021-05-11 11:51   ` Valentin Schneider
2021-05-11 16:27     ` Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 3/8] sched/fair: Update idle-core more often Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 4/8] sched/fair: Prefer idle CPU to cache affinity Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 5/8] sched/fair: Use affine_idler_llc for wakeups across LLC Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback Srikar Dronamraju
2021-05-11 11:51   ` Valentin Schneider
2021-05-11 16:55     ` Srikar Dronamraju
2021-05-12  0:32     ` Aubrey Li
2021-05-12  8:08   ` Aubrey Li
2021-05-13  7:31     ` Srikar Dronamraju
2021-05-14  4:11       ` Aubrey Li
2021-05-17 10:40         ` Srikar Dronamraju
2021-05-17 12:48           ` Aubrey Li
2021-05-17 12:57             ` Srikar Dronamraju
2021-05-18  0:59               ` Aubrey Li
2021-05-18  4:00                 ` Srikar Dronamraju
2021-05-18  6:05                   ` Aubrey Li
2021-05-18  7:18                     ` Srikar Dronamraju
2021-05-19  9:43                       ` Aubrey Li
2021-05-19 17:34                         ` Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 7/8] sched/fair: Remove ifdefs in waker_affine_idler_llc Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 8/8] sched/fair: Dont iterate if no idle CPUs Srikar Dronamraju
2021-05-06 16:53 ` [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
