* [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support
@ 2016-07-25 13:34 Morten Rasmussen
  2016-07-25 13:34 ` [PATCH v3 01/13] sched: Fix power to capacity renaming in comment Morten Rasmussen
                   ` (12 more replies)
  0 siblings, 13 replies; 45+ messages in thread
From: Morten Rasmussen @ 2016-07-25 13:34 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

Hi,

The scheduler is currently not doing much to help performance on systems with
asymmetric compute capacities (read ARM big.LITTLE). This series improves the
situation with a few tweaks, mainly to the task wake-up path, so that compute
capacity is considered at wake-up and not just whether a cpu is idle on these
systems. This gives us consistent, and potentially higher, throughput in
partially utilized scenarios. SMP behaviour and performance should be
unaffected.

Test 0:
	for i in `seq 1 10`; \
	       do sysbench --test=cpu --max-time=3 --num-threads=1 run; \
	       done \
	| awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
	       END {print "Average events: " sum/runs}'

Target: ARM TC2 (2xA15+3xA7)

	(Higher is better)
tip:	Average events: 151.8 
patch:	Average events: 217.9

Target: ARM Juno (2xA57+4xA53)

	(Higher is better)
tip:	Average events: 1737.7
patch:	Average events: 1952.5

Test 1:
	perf stat --null --repeat 10 -- \
	perf bench sched messaging -g 50 -l 5000

Target: Intel IVB-EP (2*10*2)

tip:    4.632455615 seconds time elapsed ( +-  0.84% )
patch:  4.532470694 seconds time elapsed ( +-  1.28% )

Target: ARM TC2 A7-only (3xA7) (-l 1000)

tip:    61.554834175 seconds time elapsed ( +-  0.11% )
patch:  62.633350367 seconds time elapsed ( +-  0.12% )

Notes:

Active migration of tasks away from small capacity cpus isn't addressed
in this set, although it is necessary for consistent throughput in other
scenarios on asymmetric cpu capacity systems.

The infrastructure to enable capacity awareness for arm64 and arm is not
provided here but will be based on Juri's DT bindings patch set [1]. A
combined preview branch is available [2]. The test results above are
based on [2].

[1] https://lkml.org/lkml/2016/7/19/419
[2] git://linux-arm.org/linux-power.git capacity_awareness_v3_arm64_v1

Patch   1-4: Generic fixes and clean-ups.
Patch  5-13: Improve capacity awareness.

Tested-by: Koan-Sin Tan <freedom.tan@mediatek.com>
Tested-by: Keita Kobayashi <keita.kobayashi.ym@renesas.com>

v3:

- Changed SD_ASYM_CPUCAPACITY sched_domain flag semantics as suggested
  by PeterZ.

- Dropped arm specific patches for setting cpu capacity as these are
  superseded by Juri's patches [1].

- Changed capacity-aware pulling during load-balance to use sched_group
  min capacity instead of max as suggested by Sai.

v2: https://lkml.org/lkml/2016/6/22/614

- Dropped patch ignoring wakee_flips for pid=0 for now as we can not
  distinguish cpu time processing irqs from idle time.

- Dropped disabling WAKE_AFFINE as suggested by Vincent to allow more
  scenarios to use fast-path (select_idle_sibling()). Asymmetric wake
  conditions adjusted accordingly.

- Changed use of new SD_ASYM_CPUCAPACITY slightly. Now enables
  SD_BALANCE_WAKE.

- Minor clean-ups and rebased to more recent tip/sched/core.

v1: https://lkml.org/lkml/2014/5/23/621

Dietmar Eggemann (1):
  sched: Store maximum per-cpu capacity in root domain

Morten Rasmussen (12):
  sched: Fix power to capacity renaming in comment
  sched/fair: Consistent use of prev_cpu in wakeup path
  sched/fair: Optimize find_idlest_cpu() when there is no choice
  sched/core: Remove unnecessary null-pointer check
  sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  sched/core: Pass child domain into sd_init
  sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems
  sched/fair: Let asymmetric cpu configurations balance at wake-up
  sched/fair: Compute task/cpu utilization at wake-up more correctly
  sched/fair: Consider spare capacity in find_idlest_group()
  sched: Add per-cpu min capacity to sched_group_capacity
  sched/fair: Avoid pulling tasks from non-overloaded higher capacity
    groups

 include/linux/sched.h |   3 +-
 kernel/sched/core.c   |  37 +++++++--
 kernel/sched/fair.c   | 213 ++++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h  |   5 +-
 4 files changed, 227 insertions(+), 31 deletions(-)

-- 
1.9.1


* [PATCH v3 01/13] sched: Fix power to capacity renaming in comment
  2016-07-25 13:34 [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
@ 2016-07-25 13:34 ` Morten Rasmussen
  2016-07-25 13:34 ` [PATCH v3 02/13] sched/fair: Consistent use of prev_cpu in wakeup path Morten Rasmussen
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Morten Rasmussen @ 2016-07-25 13:34 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

It seems that this one escaped Nico's renaming of cpu_power to
cpu_capacity a while back.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d99218a1e043..930784dd7cd8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1021,7 +1021,7 @@ extern void wake_up_q(struct wake_q_head *head);
 #define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
 #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
-#define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu power */
+#define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu capacity */
 #define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
-- 
1.9.1


* [PATCH v3 02/13] sched/fair: Consistent use of prev_cpu in wakeup path
  2016-07-25 13:34 [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
  2016-07-25 13:34 ` [PATCH v3 01/13] sched: Fix power to capacity renaming in comment Morten Rasmussen
@ 2016-07-25 13:34 ` Morten Rasmussen
  2016-07-25 13:34 ` [PATCH v3 03/13] sched/fair: Optimize find_idlest_cpu() when there is no choice Morten Rasmussen
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Morten Rasmussen @ 2016-07-25 13:34 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen, Rik van Riel

In commit ac66f5477239 ("sched/numa: Introduce migrate_swap()")
select_task_rq() got a 'cpu' argument to enable overriding of prev_cpu
in special cases (NUMA task swapping). However, the
select_task_rq_fair() helper functions wake_affine() and
select_idle_sibling() still use task_cpu(p) directly to work out
prev_cpu, which leads to inconsistencies.

This patch passes prev_cpu (potentially overridden by NUMA code) into
the helper functions to ensure prev_cpu is indeed the same cpu
everywhere in the wakeup path.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
cc: Rik van Riel <riel@redhat.com>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4088eedea763..e7d50acd6028 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -656,7 +656,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 #ifdef CONFIG_SMP
-static int select_idle_sibling(struct task_struct *p, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
 
 /*
@@ -1514,7 +1514,8 @@ static void task_numa_compare(struct task_numa_env *env,
 	 * Call select_idle_sibling to maybe find a better one.
 	 */
 	if (!cur)
-		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+		env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
+						   env->dst_cpu);
 
 assign:
 	task_numa_assign(env, cur, imp);
@@ -5091,18 +5092,18 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
-static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
+static int wake_affine(struct sched_domain *sd, struct task_struct *p,
+		       int prev_cpu, int sync)
 {
 	s64 this_load, load;
 	s64 this_eff_load, prev_eff_load;
-	int idx, this_cpu, prev_cpu;
+	int idx, this_cpu;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
 
 	idx	  = sd->wake_idx;
 	this_cpu  = smp_processor_id();
-	prev_cpu  = task_cpu(p);
 	load	  = source_load(prev_cpu, idx);
 	this_load = target_load(this_cpu, idx);
 
@@ -5267,11 +5268,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int target)
 {
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i = task_cpu(p);
 
 	if (idle_cpu(target))
 		return target;
@@ -5279,8 +5279,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	/*
 	 * If the prevous cpu is cache affine and idle, don't be stupid.
 	 */
-	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
-		return i;
+	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+		return prev;
 
 	/*
 	 * Otherwise, iterate the domains and find an eligible idle cpu.
@@ -5301,6 +5301,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
+			int i;
+
 			if (!cpumask_intersects(sched_group_cpus(sg),
 						tsk_cpus_allowed(p)))
 				goto next;
@@ -5409,13 +5411,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (affine_sd) {
 		sd = NULL; /* Prefer wake_affine over balance flags */
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+		if (cpu != prev_cpu && wake_affine(affine_sd, p, prev_cpu, sync))
 			new_cpu = cpu;
 	}
 
 	if (!sd) {
 		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
-			new_cpu = select_idle_sibling(p, new_cpu);
+			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
 	} else while (sd) {
 		struct sched_group *group;
-- 
1.9.1


* [PATCH v3 03/13] sched/fair: Optimize find_idlest_cpu() when there is no choice
  2016-07-25 13:34 [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
  2016-07-25 13:34 ` [PATCH v3 01/13] sched: Fix power to capacity renaming in comment Morten Rasmussen
  2016-07-25 13:34 ` [PATCH v3 02/13] sched/fair: Consistent use of prev_cpu in wakeup path Morten Rasmussen
@ 2016-07-25 13:34 ` Morten Rasmussen
  2016-07-25 13:34 ` [PATCH v3 04/13] sched/core: Remove unnecessary null-pointer check Morten Rasmussen
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Morten Rasmussen @ 2016-07-25 13:34 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

In the current find_idlest_group()/find_idlest_cpu() search we
eventually end up calling find_idlest_cpu() in a sched_group containing
only one cpu. Checking idle-states becomes pointless when there is no
alternative, so bail out instead.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7d50acd6028..c1c0aa8b300b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5229,6 +5229,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	int shallowest_idle_cpu = -1;
 	int i;
 
+	/* Check if we have any choice */
+	if (group->group_weight == 1)
+		return cpumask_first(sched_group_cpus(group));
+
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
 		if (idle_cpu(i)) {
-- 
1.9.1


* [PATCH v3 04/13] sched/core: Remove unnecessary null-pointer check
  2016-07-25 13:34 [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (2 preceding siblings ...)
  2016-07-25 13:34 ` [PATCH v3 03/13] sched/fair: Optimize find_idlest_cpu() when there is no choice Morten Rasmussen
@ 2016-07-25 13:34 ` Morten Rasmussen
  2016-08-18 10:56   ` [tip:sched/core] sched/core: Remove unnecessary NULL-pointer check tip-bot for Morten Rasmussen
  2016-07-25 13:34 ` [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Morten Rasmussen
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Morten Rasmussen @ 2016-07-25 13:34 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

Checking if the sched_domain pointer returned by sd_init() is NULL seems
pointless as sd_init() neither checks whether it is valid to begin with
nor sets it to NULL.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4e9617a7e7d9..0a40beb46841 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6824,8 +6824,6 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		struct sched_domain *child, int cpu)
 {
 	struct sched_domain *sd = sd_init(tl, cpu);
-	if (!sd)
-		return child;
 
 	cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
 	if (child) {
-- 
1.9.1


* [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  2016-07-25 13:34 [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (3 preceding siblings ...)
  2016-07-25 13:34 ` [PATCH v3 04/13] sched/core: Remove unnecessary null-pointer check Morten Rasmussen
@ 2016-07-25 13:34 ` Morten Rasmussen
  2016-08-15 10:54   ` Peter Zijlstra
                     ` (2 more replies)
  2016-07-25 13:34 ` [PATCH v3 06/13] sched/core: Pass child domain into sd_init Morten Rasmussen
                   ` (7 subsequent siblings)
  12 siblings, 3 replies; 45+ messages in thread
From: Morten Rasmussen @ 2016-07-25 13:34 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

Add a topology flag to the sched_domain hierarchy indicating the lowest
domain level where the full range of cpu capacities is represented by
the domain members for asymmetric capacity topologies (e.g. ARM
big.LITTLE).

The flag is intended to indicate that extra care should be taken when
placing tasks on cpus and that this level spans all the different types
of cpus found in the system (no need to look further up the domain
hierarchy). This information is currently only available by iterating
over the capacities of all the cpus at parent levels in the
sched_domain hierarchy.

SD 2      [  0      1      2      3]  SD_ASYM_CPUCAPACITY

SD 1      [  0      1] [   2      3]  !SD_ASYM_CPUCAPACITY

cpu:         0      1      2      3
capacity:  756    756   1024   1024

If the topology in the example above is duplicated to create an eight
cpu example with a third sched_domain level on top (SD 3), this level
should not have the flag set (!SD_ASYM_CPUCAPACITY) as its two groups
would both have all cpu capacities represented within them.
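
For illustration only, assuming the same capacities as above, the
duplicated eight-cpu topology would look like this:

SD 3      [  0      1      2      3      4      5      6      7]  !SD_ASYM_CPUCAPACITY

SD 2      [  0      1      2      3] [  4      5      6      7]  SD_ASYM_CPUCAPACITY

SD 1      [  0      1] [  2      3] [  4      5] [  6      7]  !SD_ASYM_CPUCAPACITY

cpu:         0      1      2      3      4      5      6      7
capacity:  756    756   1024   1024    756    756   1024   1024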

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 930784dd7cd8..4e0c47af9b05 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1021,6 +1021,7 @@ extern void wake_up_q(struct wake_q_head *head);
 #define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
 #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
+#define SD_ASYM_CPUCAPACITY	0x0040  /* Groups have different max cpu capacities */
 #define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu capacity */
 #define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0a40beb46841..1a22fc9e3788 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5692,6 +5692,7 @@ static int sd_degenerate(struct sched_domain *sd)
 			 SD_BALANCE_FORK |
 			 SD_BALANCE_EXEC |
 			 SD_SHARE_CPUCAPACITY |
+			 SD_ASYM_CPUCAPACITY |
 			 SD_SHARE_PKG_RESOURCES |
 			 SD_SHARE_POWERDOMAIN)) {
 		if (sd->groups != sd->groups->next)
@@ -5722,6 +5723,7 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 				SD_BALANCE_NEWIDLE |
 				SD_BALANCE_FORK |
 				SD_BALANCE_EXEC |
+				SD_ASYM_CPUCAPACITY |
 				SD_SHARE_CPUCAPACITY |
 				SD_SHARE_PKG_RESOURCES |
 				SD_PREFER_SIBLING |
@@ -6336,14 +6338,16 @@ static int sched_domains_curr_level;
  * SD_NUMA                - describes NUMA topologies
  * SD_SHARE_POWERDOMAIN   - describes shared power domain
  *
- * Odd one out:
+ * Odd ones out:
  * SD_ASYM_PACKING        - describes SMT quirks
+ * SD_ASYM_CPUCAPACITY    - describes mixed capacity topologies
  */
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY |		\
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA |			\
 	 SD_ASYM_PACKING |		\
+	 SD_ASYM_CPUCAPACITY |		\
 	 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
-- 
1.9.1


* [PATCH v3 06/13] sched/core: Pass child domain into sd_init
  2016-07-25 13:34 [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (4 preceding siblings ...)
  2016-07-25 13:34 ` [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Morten Rasmussen
@ 2016-07-25 13:34 ` Morten Rasmussen
  2016-08-18 10:57   ` [tip:sched/core] sched/core: Pass child domain into sd_init() tip-bot for Morten Rasmussen
  2016-07-25 13:34 ` [PATCH v3 07/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems Morten Rasmussen
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Morten Rasmussen @ 2016-07-25 13:34 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

If behavioural sched_domain flags depend on topology flags set at higher
domain levels we need a way to update the child domain flags. Moving the
child pointer assignment inside sd_init() should make that possible.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a22fc9e3788..ec3f7aef321f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6351,7 +6351,8 @@ static int sched_domains_curr_level;
 	 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
-sd_init(struct sched_domain_topology_level *tl, int cpu)
+sd_init(struct sched_domain_topology_level *tl,
+	struct sched_domain *child, int cpu)
 {
 	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
 	int sd_weight, sd_flags = 0;
@@ -6403,6 +6404,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		.smt_gain		= 0,
 		.max_newidle_lb_cost	= 0,
 		.next_decay_max_lb_cost	= jiffies,
+		.child			= child,
 #ifdef CONFIG_SCHED_DEBUG
 		.name			= tl->name,
 #endif
@@ -6827,14 +6829,13 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		const struct cpumask *cpu_map, struct sched_domain_attr *attr,
 		struct sched_domain *child, int cpu)
 {
-	struct sched_domain *sd = sd_init(tl, cpu);
+	struct sched_domain *sd = sd_init(tl, child, cpu);
 
 	cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
 	if (child) {
 		sd->level = child->level + 1;
 		sched_domain_level_max = max(sched_domain_level_max, sd->level);
 		child->parent = sd;
-		sd->child = child;
 
 		if (!cpumask_subset(sched_domain_span(child),
 				    sched_domain_span(sd))) {
-- 
1.9.1


* [PATCH v3 07/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems
  2016-07-25 13:34 [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (5 preceding siblings ...)
  2016-07-25 13:34 ` [PATCH v3 06/13] sched/core: Pass child domain into sd_init Morten Rasmussen
@ 2016-07-25 13:34 ` Morten Rasmussen
  2016-08-18 10:57   ` [tip:sched/core] sched/core: " tip-bot for Morten Rasmussen
  2016-07-25 13:34 ` [PATCH v3 08/13] sched: Store maximum per-cpu capacity in root domain Morten Rasmussen
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Morten Rasmussen @ 2016-07-25 13:34 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

A domain with the SD_ASYM_CPUCAPACITY flag set indicates that
sched_groups at this level and below do not include cpus of all
available capacities (e.g. groups containing little-only or big-only
cpus in big.LITTLE systems). It is therefore necessary to put more
effort into finding an appropriate cpu at task wake-up by enabling
balancing at wake-up (SD_BALANCE_WAKE) on all lower (child) levels.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec3f7aef321f..a0a74b2d9f41 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6414,6 +6414,13 @@ sd_init(struct sched_domain_topology_level *tl,
 	 * Convert topological properties into behaviour.
 	 */
 
+	if (sd->flags & SD_ASYM_CPUCAPACITY) {
+		struct sched_domain *t = sd;
+
+		for_each_lower_domain(t)
+			t->flags |= SD_BALANCE_WAKE;
+	}
+
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
 		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
-- 
1.9.1


* [PATCH v3 08/13] sched: Store maximum per-cpu capacity in root domain
  2016-07-25 13:34 [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (6 preceding siblings ...)
  2016-07-25 13:34 ` [PATCH v3 07/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems Morten Rasmussen
@ 2016-07-25 13:34 ` Morten Rasmussen
  2016-08-01 18:53   ` Dietmar Eggemann
  2016-07-25 13:34 ` [PATCH v3 09/13] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Morten Rasmussen @ 2016-07-25 13:34 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

To be able to compare the capacity of the target cpu with the highest
available cpu capacity, store the maximum per-cpu capacity in the root
domain.

The max per-cpu capacity should be 1024 for all systems except SMT,
where the capacity is currently based on smt_gain and the number of
hardware threads and is <1024. If SMT can be brought to work with a
per-thread capacity of 1024, this patch can be dropped and replaced by a
hard-coded max capacity of 1024 (=SCHED_CAPACITY_SCALE).
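
As a rough worked example (not taken from a real machine): with the
default smt_gain of 1178 and two hardware threads, each thread currently
ends up with a capacity of about 1178/2 = 589 < 1024, which is why the
maximum has to be tracked rather than simply assumed to be
SCHED_CAPACITY_SCALE.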

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c  | 12 ++++++++++++
 kernel/sched/sched.h |  2 ++
 2 files changed, 14 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a0a74b2d9f41..600f1296aa4c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6873,6 +6873,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 	enum s_alloc alloc_state;
 	struct sched_domain *sd;
 	struct s_data d;
+	struct rq *rq = NULL;
 	int i, ret = -ENOMEM;
 
 	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
@@ -6923,11 +6924,22 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 	/* Attach the domains */
 	rcu_read_lock();
 	for_each_cpu(i, cpu_map) {
+		rq = cpu_rq(i);
 		sd = *per_cpu_ptr(d.sd, i);
+
+		/* Use READ_ONCE/WRITE_ONCE to avoid load/store tearing */
+		if (rq->cpu_capacity_orig > READ_ONCE(rq->rd->max_cpu_capacity))
+			WRITE_ONCE(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig);
+
 		cpu_attach_domain(sd, d.rd, i);
 	}
 	rcu_read_unlock();
 
+	if (rq) {
+		pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
+			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
+	}
+
 	ret = 0;
 error:
 	__free_domain_allocs(&d, alloc_state, cpu_map);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f44da95c70cd..444d8f38743f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -565,6 +565,8 @@ struct root_domain {
 	 */
 	cpumask_var_t rto_mask;
 	struct cpupri cpupri;
+
+	unsigned long max_cpu_capacity;
 };
 
 extern struct root_domain def_root_domain;
-- 
1.9.1


* [PATCH v3 09/13] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-07-25 13:34 [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (7 preceding siblings ...)
  2016-07-25 13:34 ` [PATCH v3 08/13] sched: Store maximum per-cpu capacity in root domain Morten Rasmussen
@ 2016-07-25 13:34 ` Morten Rasmussen
  2016-08-15 13:39   ` Peter Zijlstra
  2016-08-18 10:58   ` [tip:sched/core] sched/fair: Let asymmetric CPU " tip-bot for Morten Rasmussen
  2016-07-25 13:34 ` [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly Morten Rasmussen
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 45+ messages in thread
From: Morten Rasmussen @ 2016-07-25 13:34 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
configurations SD_WAKE_AFFINE is only desirable if the waking task's
compute demand (utilization) is suitable for the waking cpu and the
previous cpu, and all cpus within their respective
SD_SHARE_PKG_RESOURCES domains (sd_llc). If not, let wakeup balancing
take over (find_idlest_{group, cpu}()).

This patch makes affine wake-ups conditional on whether both the waker
cpu and prev_cpu have sufficient capacity for the waking task, assuming
that the cpu capacities within an SD_SHARE_PKG_RESOURCES domain (sd_llc)
are homogeneous.
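
To put numbers on the condition added below: with capacity_margin = 1280,

	min_cap * 1024 < task_util(p) * capacity_margin

rejects the affine fast-path roughly when the task's utilization exceeds
1024/1280 = 80% of the smaller of the two candidate cpu capacities. The
preceding check, max_cap - min_cap < max_cap >> 3, keeps wake_affine
enabled whenever that smaller capacity is within 12.5% of the highest
cpu capacity in the system, i.e. on (nearly) symmetric systems.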

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1c0aa8b300b..9e217eff3daf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
 #endif
 
+/*
+ * The margin used when comparing utilization with cpu capacity:
+ * util * 1024 < capacity * margin
+ */
+unsigned int capacity_margin = 1280; /* ~20% */
+
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
 	lw->weight += inc;
@@ -5366,6 +5372,25 @@ static int cpu_util(int cpu)
 	return (util >= capacity) ? capacity : util;
 }
 
+static inline int task_util(struct task_struct *p)
+{
+	return p->se.avg.util_avg;
+}
+
+static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
+{
+	long min_cap, max_cap;
+
+	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
+	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
+
+	/* Minimum capacity is close to max, no need to abort wake_affine */
+	if (max_cap - min_cap < max_cap >> 3)
+		return 0;
+
+	return min_cap * 1024 < task_util(p) * capacity_margin;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5389,7 +5414,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
-		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
+			      && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
 	}
 
 	rcu_read_lock();
-- 
1.9.1


* [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly
  2016-07-25 13:34 [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (8 preceding siblings ...)
  2016-07-25 13:34 ` [PATCH v3 09/13] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
@ 2016-07-25 13:34 ` Morten Rasmussen
  2016-08-15 14:23   ` Peter Zijlstra
  2016-07-25 13:34 ` [PATCH v3 11/13] sched/fair: Consider spare capacity in find_idlest_group() Morten Rasmussen
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Morten Rasmussen @ 2016-07-25 13:34 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

At task wake-up load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking which may have
been updated in the meantime. Basically, the task's own utilization
hasn't yet accounted for the sleep decay, while the cfs_rq may have
(partially). Estimating the cfs_rq utilization in case the task is
migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg
is therefore incorrect as the two load-tracking signals aren't time
synchronized (different last update).

To solve this problem, this patch introduces task_util_wake(), which
computes the decayed task utilization based on the previous cpu's last
load-tracking update. It is done without having to take the rq lock,
similar to how it is done in remove_entity_load_avg().
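
A note on the arithmetic in the new helper: last_update_time is in ns,
so the delta >>= 20 below converts the time since the cfs_rq's last
update into ~1ms (1024us) decay periods before passing it to
decay_load(). Discarding the sub-period remainder is what gives the
absolute util_avg error bound of 1024*1024/LOAD_AVG_MAX ~= 22 (for
LOAD_AVG_MAX = 47742) mentioned in the comment.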

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9e217eff3daf..9c6ec3bf75ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5377,6 +5377,75 @@ static inline int task_util(struct task_struct *p)
 	return p->se.avg.util_avg;
 }
 
+/*
+ * task_util_wake: Returns an updated estimate of the utilization contribution
+ * of a waking task. At wake-up the task blocked utilization contribution
+ * (cfs_rq->avg) may have decayed while the utilization tracking of the task
+ * (se->avg) hasn't yet.
+ * Note that this estimate isn't perfectly accurate as the 1ms boundaries used
+ * for updating util_avg in __update_load_avg() are not considered here. This
+ * results in an error of up to 1ms utilization decay/accumulation which leads
+ * to an absolute util_avg error margin of 1024*1024/LOAD_AVG_MAX ~= 22
+ * (for LOAD_AVG_MAX = 47742).
+ */
+static inline int task_util_wake(struct task_struct *p)
+{
+	struct cfs_rq *prev_cfs_rq = &task_rq(p)->cfs;
+	struct sched_avg *psa = &p->se.avg;
+	u64 cfs_rq_last_update, p_last_update, delta;
+	u32 util_decayed;
+
+	p_last_update = psa->last_update_time;
+
+	/*
+	 * Task on rq (exec()) should be load-tracking aligned already.
+	 * New tasks have no history and should use the init value.
+	 */
+	if (p->se.on_rq || !p_last_update)
+		return task_util(p);
+
+	cfs_rq_last_update = cfs_rq_last_update_time(prev_cfs_rq);
+	delta = cfs_rq_last_update - p_last_update;
+
+	if ((s64)delta <= 0)
+		return task_util(p);
+
+	delta >>= 20;
+
+	if (!delta)
+		return task_util(p);
+
+	util_decayed = decay_load((u64)psa->util_sum, delta);
+	util_decayed /= LOAD_AVG_MAX;
+
+	/*
+	 * psa->util_avg can be slightly out of date as it is only updated
+	 * when a 1ms boundary is crossed.
+	 * See 'decayed' in __update_load_avg()
+	 */
+	util_decayed = min_t(unsigned long, util_decayed, task_util(p));
+
+	return util_decayed;
+}
+
+/*
+ * cpu_util_wake: Compute cpu utilization with any contributions from
+ * the waking task p removed.
+ */
+static int cpu_util_wake(int cpu, struct task_struct *p)
+{
+	unsigned long util, capacity;
+
+	/* Task has no contribution or is new */
+	if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
+		return cpu_util(cpu);
+
+	capacity = capacity_orig_of(cpu);
+	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util_wake(p), 0);
+
+	return (util >= capacity) ? capacity : util;
+}
+
 static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 {
 	long min_cap, max_cap;
-- 
1.9.1


* [PATCH v3 11/13] sched/fair: Consider spare capacity in find_idlest_group()
  2016-07-25 13:34 [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (9 preceding siblings ...)
  2016-07-25 13:34 ` [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly Morten Rasmussen
@ 2016-07-25 13:34 ` Morten Rasmussen
  2016-08-16 13:57   ` Vincent Guittot
  2016-07-25 13:34 ` [PATCH v3 12/13] sched: Add per-cpu min capacity to sched_group_capacity Morten Rasmussen
  2016-07-25 13:34 ` [PATCH v3 13/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups Morten Rasmussen
  12 siblings, 1 reply; 45+ messages in thread
From: Morten Rasmussen @ 2016-07-25 13:34 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

In low-utilization scenarios comparing relative loads in
find_idlest_group() doesn't always lead to the optimal choice.
Systems with groups containing different numbers of cpus and/or cpus of
different compute capacity are significantly better off when considering
spare capacity rather than relative load in those scenarios.

In addition to the existing load-based search, an alternative candidate
sched_group based on spare capacity is found and selected instead if
sufficient spare capacity exists. If not, existing behaviour is
preserved.
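
Roughly, with the checks added below: the local group is kept when its
best-cpu spare capacity covers at least half of the task's utilization
and is within the domain's imbalance_pct of the best remote group. For
example, with the common imbalance_pct of 125 the effective imbalance is
112, so 112*this_spare > 100*most_spare means a remote group needs ~12%
more spare capacity to be picked instead. A cpu's spare capacity is only
considered at all if it exceeds 1/8 of that cpu's capacity.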

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 41 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c6ec3bf75ce..e3654409d099 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5164,6 +5164,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	return 1;
 }
 
+static inline int task_util(struct task_struct *p);
+static int cpu_util_wake(int cpu, struct task_struct *p);
+
+static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
+{
+	return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+}
+
 /*
  * find_idlest_group finds and returns the least busy CPU group within the
  * domain.
@@ -5173,7 +5181,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
+	struct sched_group *most_spare_sg = NULL;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long most_spare = 0, this_spare = 0;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
@@ -5181,7 +5191,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load;
+		unsigned long load, avg_load, spare_cap, max_spare_cap;
 		int local_group;
 		int i;
 
@@ -5193,8 +5203,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_cpus(group));
 
-		/* Tally up the load of all CPUs in the group */
+		/*
+		 * Tally up the load of all CPUs in the group and find
+		 * the group containing the cpu with most spare capacity.
+		 */
 		avg_load = 0;
+		max_spare_cap = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
@@ -5204,6 +5218,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 				load = target_load(i, load_idx);
 
 			avg_load += load;
+
+			spare_cap = capacity_spare_wake(i, p);
+
+			if (spare_cap > max_spare_cap &&
+			    spare_cap > capacity_of(i) >> 3) {
+				max_spare_cap = spare_cap;
+			}
 		}
 
 		/* Adjust by relative CPU capacity of the group */
@@ -5211,12 +5232,27 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		if (local_group) {
 			this_load = avg_load;
-		} else if (avg_load < min_load) {
-			min_load = avg_load;
-			idlest = group;
+			this_spare = max_spare_cap;
+		} else {
+			if (avg_load < min_load) {
+				min_load = avg_load;
+				idlest = group;
+			}
+
+			if (most_spare < max_spare_cap) {
+				most_spare = max_spare_cap;
+				most_spare_sg = group;
+			}
 		}
 	} while (group = group->next, group != sd->groups);
 
+	/* Found a significant amount of spare capacity. */
+	if (this_spare > task_util(p) / 2 &&
+	    imbalance*this_spare > 100*most_spare)
+		return NULL;
+	else if (most_spare > task_util(p) / 2)
+		return most_spare_sg;
+
 	if (!idlest || 100*this_load < imbalance*min_load)
 		return NULL;
 	return idlest;
-- 
1.9.1


* [PATCH v3 12/13] sched: Add per-cpu min capacity to sched_group_capacity
  2016-07-25 13:34 [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (10 preceding siblings ...)
  2016-07-25 13:34 ` [PATCH v3 11/13] sched/fair: Consider spare capacity in find_idlest_group() Morten Rasmussen
@ 2016-07-25 13:34 ` Morten Rasmussen
  2016-07-25 13:34 ` [PATCH v3 13/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups Morten Rasmussen
  12 siblings, 0 replies; 45+ messages in thread
From: Morten Rasmussen @ 2016-07-25 13:34 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

struct sched_group_capacity currently represents the compute capacity
sum of all cpus in the sched_group. Unless it is divided by the
group_weight to get the average capacity per cpu, it hides differences
in cpu capacity for mixed capacity systems (e.g. high RT/IRQ utilization
or ARM big.LITTLE). But even the average may not be sufficient if the
group covers cpus of different capacities. Instead, by extending struct
sched_group_capacity to indicate the min per-cpu capacity in the group,
a suitable group for a given task utilization can more easily be found,
such that cpus with reduced capacity can be avoided for tasks with high
utilization (not implemented by this patch).
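
As a made-up example: a group spanning one cpu of capacity 1024 and one
of capacity 756 gets a capacity sum of 1780 and an average of 890,
neither of which tells us whether a task with a utilization of, say, 800
fits on every cpu in the group. min_capacity = 756 captures exactly that
worst case.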

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c  |  3 ++-
 kernel/sched/fair.c  | 17 ++++++++++++-----
 kernel/sched/sched.h |  3 ++-
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 600f1296aa4c..745817970162 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5632,7 +5632,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 		printk(KERN_CONT " %*pbl",
 		       cpumask_pr_args(sched_group_cpus(group)));
 		if (group->sgc->capacity != SCHED_CAPACITY_SCALE) {
-			printk(KERN_CONT " (cpu_capacity = %d)",
+			printk(KERN_CONT " (cpu_capacity = %lu)",
 				group->sgc->capacity);
 		}
 
@@ -6101,6 +6101,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		 * die on a /0 trap.
 		 */
 		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
+		sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
 
 		/*
 		 * Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e3654409d099..f3019b85b6e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6731,13 +6731,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 
 	cpu_rq(cpu)->cpu_capacity = capacity;
 	sdg->sgc->capacity = capacity;
+	sdg->sgc->min_capacity = capacity;
 }
 
 void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long capacity;
+	unsigned long capacity, min_capacity;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -6750,6 +6751,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 	}
 
 	capacity = 0;
+	min_capacity = ULONG_MAX;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -6774,11 +6776,12 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			 */
 			if (unlikely(!rq->sd)) {
 				capacity += capacity_of(cpu);
-				continue;
+			} else {
+				sgc = rq->sd->groups->sgc;
+				capacity += sgc->capacity;
 			}
 
-			sgc = rq->sd->groups->sgc;
-			capacity += sgc->capacity;
+			min_capacity = min(capacity, min_capacity);
 		}
 	} else  {
 		/*
@@ -6788,12 +6791,16 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 		group = child->groups;
 		do {
-			capacity += group->sgc->capacity;
+			struct sched_group_capacity *sgc = group->sgc;
+
+			capacity += sgc->capacity;
+			min_capacity = min(sgc->min_capacity, min_capacity);
 			group = group->next;
 		} while (group != child->groups);
 	}
 
 	sdg->sgc->capacity = capacity;
+	sdg->sgc->min_capacity = min_capacity;
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 444d8f38743f..75368a5eb836 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -869,7 +869,8 @@ struct sched_group_capacity {
 	 * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
 	 * for a single CPU.
 	 */
-	unsigned int capacity;
+	unsigned long capacity;
+	unsigned long min_capacity; /* Min per-cpu capacity in group */
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
 	/*
-- 
1.9.1


* [PATCH v3 13/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
  2016-07-25 13:34 [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (11 preceding siblings ...)
  2016-07-25 13:34 ` [PATCH v3 12/13] sched: Add per-cpu min capacity to sched_group_capacity Morten Rasmussen
@ 2016-07-25 13:34 ` Morten Rasmussen
  12 siblings, 0 replies; 45+ messages in thread
From: Morten Rasmussen @ 2016-07-25 13:34 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

For asymmetric cpu capacity systems it is counter-productive for
throughput if low capacity cpus are pulling tasks from non-overloaded
cpus with higher capacity. The assumption is that higher cpu capacity is
preferred over running alone in a group with lower cpu capacity.

This patch rejects higher cpu capacity groups with one or fewer tasks
per cpu as the potential busiest group, which could otherwise lead to a
series of failing load-balancing attempts culminating in a forced
migration.
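
To put a number on "higher per-cpu capacity": group_smaller_cpu_capacity()
below treats sg as smaller than ref when

	sg->sgc->min_capacity * capacity_margin < ref->sgc->min_capacity * 1024

i.e. when sg's min capacity is below roughly 1024/1280 = 80% of ref's.
Taking the example capacities used earlier in the series, 756 * 1280 =
967680 < 1024 * 1024 = 1048576, so a little-only group is considered
smaller than a big-only group.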

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f3019b85b6e3..df07235f743d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6895,6 +6895,17 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 	return false;
 }
 
+/*
+ * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-cpu capacity than sched_group ref.
+ */
+static inline bool
+group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+	return sg->sgc->min_capacity * capacity_margin <
+						ref->sgc->min_capacity * 1024;
+}
+
 static inline enum
 group_type group_classify(struct sched_group *group,
 			  struct sg_lb_stats *sgs)
@@ -6998,6 +7009,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (sgs->avg_load <= busiest->avg_load)
 		return false;
 
+	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
+		goto asym_packing;
+
+	/*
+	 * Candidate sg has no more than one task per cpu and
+	 * has higher per-cpu capacity. Migrating tasks to less
+	 * capable cpus may harm throughput. Maximize throughput,
+	 * power/energy consequences are not considered.
+	 */
+	if (sgs->sum_nr_running <= sgs->group_weight &&
+	    group_smaller_cpu_capacity(sds->local, sg))
+		return false;
+
+asym_packing:
 	/* This is the busiest node in its class. */
 	if (!(env->sd->flags & SD_ASYM_PACKING))
 		return true;
-- 
1.9.1


* Re: [PATCH v3 08/13] sched: Store maximum per-cpu capacity in root domain
  2016-07-25 13:34 ` [PATCH v3 08/13] sched: Store maximum per-cpu capacity in root domain Morten Rasmussen
@ 2016-08-01 18:53   ` Dietmar Eggemann
  2016-08-16 12:24     ` Vincent Guittot
  2016-08-18 10:58     ` [tip:sched/core] sched/core: Store maximum per-CPU " tip-bot for Dietmar Eggemann
  0 siblings, 2 replies; 45+ messages in thread
From: Dietmar Eggemann @ 2016-08-01 18:53 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: yuyang.du, vincent.guittot, mgalbraith, sgurrappadi, freedom.tan,
	keita.kobayashi.ym, linux-kernel

On 25/07/16 14:34, Morten Rasmussen wrote:

[...]

> @@ -6923,11 +6924,22 @@ static int build_sched_domains(const struct cpumask *cpu_map,
>  	/* Attach the domains */
>  	rcu_read_lock();
>  	for_each_cpu(i, cpu_map) {
> +		rq = cpu_rq(i);
>  		sd = *per_cpu_ptr(d.sd, i);
> +
> +		/* Use READ_ONCE/WRITE_ONCE to avoid load/store tearing */
> +		if (rq->cpu_capacity_orig > READ_ONCE(rq->rd->max_cpu_capacity))
> +			WRITE_ONCE(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig);

We have to use d.rd rather than rq->rd here since from v3 on we have
this if condition in front of the cpu_attach_domain() call, which
replaces rq->rd with d.rd. Fixed patch below.

> +
>  		cpu_attach_domain(sd, d.rd, i);
>  	}
>  	rcu_read_unlock();
>  
> +	if (rq) {
> +		pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
> +			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
> +	}
> +
>  	ret = 0;
>  error:
>  	__free_domain_allocs(&d, alloc_state, cpu_map);

[...]

-- >8 --

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

To be able to compare the capacity of the target cpu with the highest
available cpu capacity, store the maximum per-cpu capacity in the root
domain.

The max per-cpu capacity should be 1024 for all systems except SMT,
where the capacity is currently based on smt_gain and the number of
hardware threads and is <1024. If SMT can be brought to work with a
per-thread capacity of 1024, this patch can be dropped and replaced by a
hard-coded max capacity of 1024 (=SCHED_CAPACITY_SCALE).

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c  | 12 ++++++++++++
 kernel/sched/sched.h |  2 ++
 2 files changed, 14 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a0a74b2d9f41..db03e6226d54 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6873,6 +6873,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 	enum s_alloc alloc_state;
 	struct sched_domain *sd;
 	struct s_data d;
+	struct rq *rq = NULL;
 	int i, ret = -ENOMEM;
 
 	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
@@ -6923,11 +6924,22 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 	/* Attach the domains */
 	rcu_read_lock();
 	for_each_cpu(i, cpu_map) {
+		rq = cpu_rq(i);
 		sd = *per_cpu_ptr(d.sd, i);
+
+		/* Use READ_ONCE/WRITE_ONCE to avoid load/store tearing */
+		if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
+			WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
+
 		cpu_attach_domain(sd, d.rd, i);
 	}
 	rcu_read_unlock();
 
+	if (rq) {
+		pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
+			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
+	}
+
 	ret = 0;
 error:
 	__free_domain_allocs(&d, alloc_state, cpu_map);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f44da95c70cd..444d8f38743f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -565,6 +565,8 @@ struct root_domain {
 	 */
 	cpumask_var_t rto_mask;
 	struct cpupri cpupri;
+
+	unsigned long max_cpu_capacity;
 };
 
 extern struct root_domain def_root_domain;
-- 
1.9.1


* Re: [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  2016-07-25 13:34 ` [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Morten Rasmussen
@ 2016-08-15 10:54   ` Peter Zijlstra
  2016-08-15 11:43     ` Morten Rasmussen
  2016-08-18 10:56     ` [tip:sched/core] sched/core: Clarify SD_flags comment tip-bot for Peter Zijlstra
  2016-08-17  8:42   ` [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Wanpeng Li
  2016-08-18 10:56   ` [tip:sched/core] sched/core: " tip-bot for Morten Rasmussen
  2 siblings, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2016-08-15 10:54 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel

On Mon, Jul 25, 2016 at 02:34:22PM +0100, Morten Rasmussen wrote:
> @@ -6336,14 +6338,16 @@ static int sched_domains_curr_level;
>   * SD_NUMA                - describes NUMA topologies
>   * SD_SHARE_POWERDOMAIN   - describes shared power domain
>   *
> - * Odd one out:
> + * Odd ones out:
>   * SD_ASYM_PACKING        - describes SMT quirks
> + * SD_ASYM_CPUCAPACITY    - describes mixed capacity topologies
>   */

So I'm not sure the new CPUCAPACITY is 'odd'.

That said, the comment is very terse and doesn't explain why PACKING is
odd.

IIRC the distinction is that the 'normal' ones only describe topology,
while the ASYM_PACKING one also prescribes behaviour. It is odd in the
way that it doesn't only describe things.

This ASYM_CPUCAPACITY otoh is purely descriptive, it doesn't prescribe
how to deal with it.

Does something like so clarify things?

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6355,13 +6355,19 @@ static int sched_domains_curr_level;
 /*
  * SD_flags allowed in topology descriptions.
  *
- * SD_SHARE_CPUCAPACITY      - describes SMT topologies
- * SD_SHARE_PKG_RESOURCES - describes shared caches
- * SD_NUMA                - describes NUMA topologies
- * SD_SHARE_POWERDOMAIN   - describes shared power domain
+ * These flags are purely descriptive of the topology and do not prescribe
+ * behaviour. Behaviour is artificial and mapped in the below sd_init()
+ * function:
  *
- * Odd one out:
- * SD_ASYM_PACKING        - describes SMT quirks
+ *   SD_SHARE_CPUCAPACITY   - describes SMT topologies
+ *   SD_SHARE_PKG_RESOURCES - describes shared caches
+ *   SD_NUMA                - describes NUMA topologies
+ *   SD_SHARE_POWERDOMAIN   - describes shared power domain
+ *
+ * Odd one out, which beside describing the topology has a quirk also
+ * prescribes the desired behaviour that goes along with it:
+ *
+ *   SD_ASYM_PACKING        - describes SMT quirks
  */
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY |		\


* Re: [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  2016-08-15 10:54   ` Peter Zijlstra
@ 2016-08-15 11:43     ` Morten Rasmussen
  2016-08-18 10:56     ` [tip:sched/core] sched/core: Clarify SD_flags comment tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: Morten Rasmussen @ 2016-08-15 11:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel

On Mon, Aug 15, 2016 at 12:54:59PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 25, 2016 at 02:34:22PM +0100, Morten Rasmussen wrote:
> > @@ -6336,14 +6338,16 @@ static int sched_domains_curr_level;
> >   * SD_NUMA                - describes NUMA topologies
> >   * SD_SHARE_POWERDOMAIN   - describes shared power domain
> >   *
> > - * Odd one out:
> > + * Odd ones out:
> >   * SD_ASYM_PACKING        - describes SMT quirks
> > + * SD_ASYM_CPUCAPACITY    - describes mixed capacity topologies
> >   */
> 
> So I'm not sure the new CPUCAPACITY is 'odd'.
> 
> That said, the comment is very terse and doesn't explain why PACKING is
> odd.
> 
> IIRC the distinction is that the 'normal' ones only describe topology,
> while the ASYM_PACKING one also prescribes behaviour. It is odd in the
> way that it doesn't only describe things.
> 
> This ASYM_CPUCAPACITY otoh is purely descriptive, it doesn't prescribe
> how to deal with it.

I think I initially put it in as an 'odd' flag due to the somewhat
strange semantics in the previous versions, but now that it is fixed I
agree that it belongs together with purely descriptive flags.

> 
> Does something like so  clarify things?
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6355,13 +6355,19 @@ static int sched_domains_curr_level;
>  /*
>   * SD_flags allowed in topology descriptions.
>   *
> - * SD_SHARE_CPUCAPACITY      - describes SMT topologies
> - * SD_SHARE_PKG_RESOURCES - describes shared caches
> - * SD_NUMA                - describes NUMA topologies
> - * SD_SHARE_POWERDOMAIN   - describes shared power domain
> + * These flags are purely descriptive of the topology and do not prescribe
> + * behaviour. Behaviour is artificial and mapped in the below sd_init()
> + * function:
>   *
> - * Odd one out:
> - * SD_ASYM_PACKING        - describes SMT quirks
> + *   SD_SHARE_CPUCAPACITY   - describes SMT topologies
> + *   SD_SHARE_PKG_RESOURCES - describes shared caches
> + *   SD_NUMA                - describes NUMA topologies
> + *   SD_SHARE_POWERDOMAIN   - describes shared power domain
> + *
> + * Odd one out, which beside describing the topology has a quirk also
> + * prescribes the desired behaviour that goes along with it:
> + *
> + *   SD_ASYM_PACKING        - describes SMT quirks
>   */
>  #define TOPOLOGY_SD_FLAGS		\
>  	(SD_SHARE_CPUCAPACITY |		\

I like it :)

Morten

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 09/13] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-07-25 13:34 ` [PATCH v3 09/13] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
@ 2016-08-15 13:39   ` Peter Zijlstra
  2016-08-15 15:01     ` Morten Rasmussen
  2016-08-18 10:58   ` [tip:sched/core] sched/fair: Let asymmetric CPU " tip-bot for Morten Rasmussen
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2016-08-15 13:39 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel

On Mon, Jul 25, 2016 at 02:34:26PM +0100, Morten Rasmussen wrote:


Because I forgot _again_, I added:

/*
 * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
 * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
 *
 * In that case WAKE_AFFINE doesn't make sense and we'll let
 * BALANCE_WAKE sort things out.
 */

> +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> +{
> +	long min_cap, max_cap;
> +
> +	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
> +	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;

There's a tiny hole here, which I'm fairly sure we don't care about. If
@p last ran on @prev_cpu before @prev_cpu was split from @rd this
doesn't 'work' right.

> +	/* Minimum capacity is close to max, no need to abort wake_affine */
> +	if (max_cap - min_cap < max_cap >> 3)
> +		return 0;
> +
> +	return min_cap * 1024 < task_util(p) * capacity_margin;
> +}
> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -5389,7 +5414,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  
>  	if (sd_flag & SD_BALANCE_WAKE) {
>  		record_wakee(p);
> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> +		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> +			      && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>  	}
>  
>  	rcu_read_lock();
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread
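
For readers skimming the thread, the arithmetic in wake_cap() above can be
tried out in isolation with the small userspace sketch below. The capacity
values (756/1024) are taken from the big.LITTLE example used elsewhere in
this series; the capacity_margin of 1280 (~20%) is an assumption about the
value the series uses, not something defined in this hunk.

#include <stdio.h>

/* Illustrative values only; see the patch above for the real helpers. */
#define CAPACITY_MARGIN	1280	/* assumed ~20% margin, as used in this series */

static int wake_cap_demo(long min_cap, long max_cap, long task_util)
{
	/* Capacities are close enough: keep the wake_affine fast path. */
	if (max_cap - min_cap < max_cap >> 3)
		return 0;

	/* Otherwise bail out if the task does not fit min_cap with margin. */
	return min_cap * 1024 < task_util * CAPACITY_MARGIN;
}

int main(void)
{
	/* big.LITTLE-like example: little = 756, big = 1024 */
	printf("small task:  %d\n", wake_cap_demo(756, 1024, 100));	/* 0: fits */
	printf("big task:    %d\n", wake_cap_demo(756, 1024, 700));	/* 1: slow path */
	/* SMP example: all CPUs 1024, difference < 1024>>3 = 128 */
	printf("SMP system:  %d\n", wake_cap_demo(1024, 1024, 700));	/* 0 */
	return 0;
}

On an SMP system min_cap equals max_cap, so the first check always returns 0
and the wake_affine fast path is left untouched.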

* Re: [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly
  2016-07-25 13:34 ` [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly Morten Rasmussen
@ 2016-08-15 14:23   ` Peter Zijlstra
  2016-08-15 15:42     ` Morten Rasmussen
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2016-08-15 14:23 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel

On Mon, Jul 25, 2016 at 02:34:27PM +0100, Morten Rasmussen wrote:
> To solve this problem, this patch introduces task_util_wake() which
> computes the decayed task utilization based on the previous cpu's last
> load-tracking update. It is done without having to
> take the rq lock, similar to how it is done in remove_entity_load_avg().

But unlike that function, it doesn't actually use __update_load_avg().
Why not?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 09/13] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-08-15 13:39   ` Peter Zijlstra
@ 2016-08-15 15:01     ` Morten Rasmussen
  2016-08-15 15:10       ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Morten Rasmussen @ 2016-08-15 15:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel

On Mon, Aug 15, 2016 at 03:39:49PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 25, 2016 at 02:34:26PM +0100, Morten Rasmussen wrote:
> 
> 
> Because I forgot _again_, I added:
> 
> /*
>  * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
>  * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
>  *
>  * In that case WAKE_AFFINE doesn't make sense and we'll let
>  * BALANCE_WAKE sort things out.
>  */

Thanks.

> 
> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > +{
> > +	long min_cap, max_cap;
> > +
> > +	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
> > +	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
> 
> There's a tiny hole here, which I'm fairly sure we don't care about. If
> @p last ran on @prev_cpu before @prev_cpu was split from @rd this
> doesn't 'work' right.

I hadn't considered that. What is 'working right' in this scenario?
Ignoring @prev_cpu as it isn't a valid option anymore?

In that case, since @prev_cpu is only used as part of the min() it should
only cause min_cap to be potentially smaller than it should be, not
larger. It could lead us to let BALANCE_WAKE take over in scenarios
where select_idle_sibling() would have been sufficient, but it should
harm.

However, as you say, I'm not sure if we care that much.

Talking about @rd, I discussed with Juri and Dietmar the other week
whether the root_domain is RCU protected, and if we therefore have to
move the call to wake_cap() after the rcu_read_lock() below. I haven't
yet done thorough investigation to find the answer. Should it be
protected?

> 
> > +	/* Minimum capacity is close to max, no need to abort wake_affine */
> > +	if (max_cap - min_cap < max_cap >> 3)
> > +		return 0;
> > +
> > +	return min_cap * 1024 < task_util(p) * capacity_margin;
> > +}
> > +
> >  /*
> >   * select_task_rq_fair: Select target runqueue for the waking task in domains
> >   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> > @@ -5389,7 +5414,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >  
> >  	if (sd_flag & SD_BALANCE_WAKE) {
> >  		record_wakee(p);
> > -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> > +		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> > +			      && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> >  	}
> >  
> >  	rcu_read_lock();
> > -- 
> > 1.9.1
> > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 09/13] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-08-15 15:01     ` Morten Rasmussen
@ 2016-08-15 15:10       ` Peter Zijlstra
  2016-08-15 15:30         ` Morten Rasmussen
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2016-08-15 15:10 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel

On Mon, Aug 15, 2016 at 04:01:34PM +0100, Morten Rasmussen wrote:
> On Mon, Aug 15, 2016 at 03:39:49PM +0200, Peter Zijlstra wrote:
> > > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > > +{
> > > +	long min_cap, max_cap;
> > > +
> > > +	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
> > > +	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
> > 
> > There's a tiny hole here, which I'm fairly sure we don't care about. If
> > @p last ran on @prev_cpu before @prev_cpu was split from @rd this
> > doesn't 'work' right.
> 
> I hadn't considered that. What is 'working right' in this scenario?
> Ignoring @prev_cpu as it isn't a valid option anymore?

Probably, yeah.

> In that case, since @prev_cpu is only used as part of the min() it should
> only cause min_cap to be potentially smaller than it should be, not
> larger. It could lead us to let BALANCE_WAKE take over in scenarios
> where select_idle_sibling() would have been sufficient, but it should
> harm.

+not, right?

> However, as you say, I'm not sure if we care that much.

Yeah, don't think so, it's extremely unlikely to happen, almost nobody
mucks about with root_domains anyway. And those that do, do so once to
setup things and then leave them be.

> Talking about @rd, I discussed with Juri and Dietmar the other week
> whether the root_domain is RCU protected, and if we therefore have to
> move the call to wake_cap() after the rcu_read_lock() below. I haven't
> yet done thorough investigation to find the answer. Should it be
> protected?

Yeah, I think either RCU or RCU-sched, I forever forget.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 09/13] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-08-15 15:10       ` Peter Zijlstra
@ 2016-08-15 15:30         ` Morten Rasmussen
  0 siblings, 0 replies; 45+ messages in thread
From: Morten Rasmussen @ 2016-08-15 15:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel

On Mon, Aug 15, 2016 at 05:10:06PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 15, 2016 at 04:01:34PM +0100, Morten Rasmussen wrote:
> > On Mon, Aug 15, 2016 at 03:39:49PM +0200, Peter Zijlstra wrote:
> > > > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > > > +{
> > > > +	long min_cap, max_cap;
> > > > +
> > > > +	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
> > > > +	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
> > > 
> > > There's a tiny hole here, which I'm fairly sure we don't care about. If
> > > @p last ran on @prev_cpu before @prev_cpu was split from @rd this
> > > doesn't 'work' right.
> > 
> > I hadn't considered that. What is 'working right' in this scenario?
> > Ignoring @prev_cpu as it isn't a valid option anymore?
> 
> Probably, yeah.
> 
> > In that case, since @prev_cpu is only used as part of the min() it should
> > only cause min_cap to be potentially smaller than it should be, not
> > larger. It could lead us to let BALANCE_WAKE take over in scenarios
> > where select_idle_sibling() would have been sufficient, but it should
> > harm.
> 
> +not, right?

Yes :)

> 
> > However, as you say, I'm not sure if we care that much.
> 
> Yeah, don't think so, it's extremely unlikely to happen, almost nobody
> mucks about with root_domains anyway. And those that do, do so once to
> setup things and then leave them be.
> 
> > Talking about @rd, I discussed with Juri and Dietmar the other week
> > whether the root_domain is RCU protected, and if we therefore have to
> > move the call to wake_cap() after the rcu_read_lock() below. I haven't
> > yet done thorough investigation to find the answer. Should it be
> > protected?
> 
> Yeah, I think either RCU or RCU-sched, I forever forget.

Okay. Should I send an updated version?

^ permalink raw reply	[flat|nested] 45+ messages in thread
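
To make the open question above concrete, one possible shape of the change
being discussed is simply to take rcu_read_lock() before the SD_BALANCE_WAKE
block, so that the rd dereference inside wake_cap() sits within the read-side
critical section. This is only an untested sketch of the option under
discussion, not a posted patch, and the thread leaves open whether it is
needed at all:

	rcu_read_lock();

	if (sd_flag & SD_BALANCE_WAKE) {
		record_wakee(p);
		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
			      && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
	}

	/*
	 * The existing for_each_domain(cpu, tmp) walk continues below,
	 * now under the same rcu_read_lock() as before.
	 */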

* Re: [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly
  2016-08-15 14:23   ` Peter Zijlstra
@ 2016-08-15 15:42     ` Morten Rasmussen
  2016-08-18  8:40       ` Morten Rasmussen
  0 siblings, 1 reply; 45+ messages in thread
From: Morten Rasmussen @ 2016-08-15 15:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel

On Mon, Aug 15, 2016 at 04:23:42PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 25, 2016 at 02:34:27PM +0100, Morten Rasmussen wrote:
> > To solve this problem, this patch introduces task_util_wake() which
> > computes the decayed task utilization based on the previous cpu's last
> > load-tracking update. It is done without having to
> > take the rq lock, similar to how it is done in remove_entity_load_avg().
> 
> But unlike that function, it doesn't actually use __update_load_avg().
> Why not?

Fair question :)

We currently exploit the fact that the task utilization is _not_ updated
in wake-up balancing to make sure we don't under-estimate the capacity
requirements for tasks that have slept for a while. If we update it, we
lose the non-decayed 'peak' utilization, but I guess we could just
store it somewhere when we do the wake-up decay.

I thought there was a better reason when I wrote the patch, but I don't
recall right now. I will look into it again and see if we can use
__update_load_avg() to do a proper update instead of doing things twice.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 08/13] sched: Store maximum per-cpu capacity in root domain
  2016-08-01 18:53   ` Dietmar Eggemann
@ 2016-08-16 12:24     ` Vincent Guittot
  2016-08-18 10:58     ` [tip:sched/core] sched/core: Store maximum per-CPU " tip-bot for Dietmar Eggemann
  1 sibling, 0 replies; 45+ messages in thread
From: Vincent Guittot @ 2016-08-16 12:24 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Morten Rasmussen, Peter Zijlstra, mingo, Yuyang Du, mgalbraith,
	Sai Charan Gurrappadi, freedom.tan, keita.kobayashi.ym,
	linux-kernel

On 1 August 2016 at 20:53, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 25/07/16 14:34, Morten Rasmussen wrote:
>
> [...]
>
>> @@ -6923,11 +6924,22 @@ static int build_sched_domains(const struct cpumask *cpu_map,
>>       /* Attach the domains */
>>       rcu_read_lock();
>>       for_each_cpu(i, cpu_map) {
>> +             rq = cpu_rq(i);
>>               sd = *per_cpu_ptr(d.sd, i);
>> +
>> +             /* Use READ_ONCE/WRITE_ONCE to avoid load/store tearing */
>> +             if (rq->cpu_capacity_orig > READ_ONCE(rq->rd->max_cpu_capacity))
>> +                     WRITE_ONCE(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig);
>
> We have to use d.rd rather than rq->rd here since from v3 on we have this if
> condition in front of the cpu_attach_domain() call which replaces
> rq->rd with d.rd. Fixed patch below.
>
>> +
>>               cpu_attach_domain(sd, d.rd, i);
>>       }
>>       rcu_read_unlock();
>>
>> +     if (rq) {
>> +             pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
>> +                     cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
>> +     }
>> +
>>       ret = 0;
>>  error:
>>       __free_domain_allocs(&d, alloc_state, cpu_map);
>
> [...]
>
> -- >8 --
>
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>
> To be able to compare the capacity of the target cpu with the highest
> available cpu capacity, store the maximum per-cpu capacity in the root
> domain.
>
> The max per-cpu capacity should be 1024 for all systems except SMT,
> where the capacity is currently based on smt_gain and the number of
> hardware threads and is <1024. If SMT can be brought to work with a
> per-thread capacity of 1024, this patch can be dropped and replaced by a
> hard-coded max capacity of 1024 (=SCHED_CAPACITY_SCALE).

FWIW, Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/core.c  | 12 ++++++++++++
>  kernel/sched/sched.h |  2 ++
>  2 files changed, 14 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a0a74b2d9f41..db03e6226d54 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6873,6 +6873,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
>         enum s_alloc alloc_state;
>         struct sched_domain *sd;
>         struct s_data d;
> +       struct rq *rq = NULL;
>         int i, ret = -ENOMEM;
>
>         alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
> @@ -6923,11 +6924,22 @@ static int build_sched_domains(const struct cpumask *cpu_map,
>         /* Attach the domains */
>         rcu_read_lock();
>         for_each_cpu(i, cpu_map) {
> +               rq = cpu_rq(i);
>                 sd = *per_cpu_ptr(d.sd, i);
> +
> +               /* Use READ_ONCE/WRITE_ONCE to avoid load/store tearing */
> +               if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
> +                       WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
> +
>                 cpu_attach_domain(sd, d.rd, i);
>         }
>         rcu_read_unlock();
>
> +       if (rq) {
> +               pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
> +                       cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
> +       }
> +
>         ret = 0;
>  error:
>         __free_domain_allocs(&d, alloc_state, cpu_map);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f44da95c70cd..444d8f38743f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -565,6 +565,8 @@ struct root_domain {
>          */
>         cpumask_var_t rto_mask;
>         struct cpupri cpupri;
> +
> +       unsigned long max_cpu_capacity;
>  };
>
>  extern struct root_domain def_root_domain;
> --
> 1.9.1

^ permalink raw reply	[flat|nested] 45+ messages in thread
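
As a side note on the READ_ONCE()/WRITE_ONCE() pair in the patch above: the
pattern is a plain lockless running maximum, where the writer (serialized by
the domain-rebuild path) publishes updates that concurrent readers such as
wake_cap() may load at any time. A rough userspace analogue, with C11 relaxed
atomics standing in for the kernel accessors, might look like this
(illustrative only):

#include <stdatomic.h>
#include <stdio.h>

/* One writer publishes a running maximum; readers load it locklessly. */
static _Atomic unsigned long max_cpu_capacity;

static void note_cpu_capacity(unsigned long cap)
{
	if (cap > atomic_load_explicit(&max_cpu_capacity, memory_order_relaxed))
		atomic_store_explicit(&max_cpu_capacity, cap, memory_order_relaxed);
}

int main(void)
{
	unsigned long caps[] = { 756, 756, 1024, 1024 };	/* example big.LITTLE capacities */

	for (unsigned int i = 0; i < sizeof(caps) / sizeof(caps[0]); i++)
		note_cpu_capacity(caps[i]);

	printf("max cpu_capacity = %lu\n",
	       atomic_load_explicit(&max_cpu_capacity, memory_order_relaxed));
	return 0;
}

Since there is effectively a single writer, no compare-and-swap loop is
needed; the accessors merely prevent load/store tearing.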

* Re: [PATCH v3 11/13] sched/fair: Consider spare capacity in find_idlest_group()
  2016-07-25 13:34 ` [PATCH v3 11/13] sched/fair: Consider spare capacity in find_idlest_group() Morten Rasmussen
@ 2016-08-16 13:57   ` Vincent Guittot
  2016-08-18 11:16     ` Morten Rasmussen
  0 siblings, 1 reply; 45+ messages in thread
From: Vincent Guittot @ 2016-08-16 13:57 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	Sai Charan Gurrappadi, Koan-Sin Tan,
	小林敬太,
	linux-kernel

On 25 July 2016 at 15:34, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> In low-utilization scenarios comparing relative loads in
> find_idlest_group() doesn't always lead to the most optimum choice.
> Systems with groups containing different numbers of cpus and/or cpus of
> different compute capacity are significantly better off when considering
> spare capacity rather than relative load in those scenarios.
>
> In addition to existing load based search an alternative spare capacity
> based candidate sched_group is found and selected instead if sufficient
> spare capacity exists. If not, existing behaviour is preserved.
>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 41 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9c6ec3bf75ce..e3654409d099 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5164,6 +5164,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>         return 1;
>  }
>
> +static inline int task_util(struct task_struct *p);
> +static int cpu_util_wake(int cpu, struct task_struct *p);
> +
> +static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> +{
> +       return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
> +}
> +
>  /*
>   * find_idlest_group finds and returns the least busy CPU group within the
>   * domain.
> @@ -5173,7 +5181,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>                   int this_cpu, int sd_flag)
>  {
>         struct sched_group *idlest = NULL, *group = sd->groups;
> +       struct sched_group *most_spare_sg = NULL;
>         unsigned long min_load = ULONG_MAX, this_load = 0;
> +       unsigned long most_spare = 0, this_spare = 0;
>         int load_idx = sd->forkexec_idx;
>         int imbalance = 100 + (sd->imbalance_pct-100)/2;
>
> @@ -5181,7 +5191,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>                 load_idx = sd->wake_idx;
>
>         do {
> -               unsigned long load, avg_load;
> +               unsigned long load, avg_load, spare_cap, max_spare_cap;
>                 int local_group;
>                 int i;
>
> @@ -5193,8 +5203,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>                 local_group = cpumask_test_cpu(this_cpu,
>                                                sched_group_cpus(group));
>
> -               /* Tally up the load of all CPUs in the group */
> +               /*
> +                * Tally up the load of all CPUs in the group and find
> +                * the group containing the cpu with most spare capacity.
> +                */
>                 avg_load = 0;
> +               max_spare_cap = 0;
>
>                 for_each_cpu(i, sched_group_cpus(group)) {
>                         /* Bias balancing toward cpus of our domain */
> @@ -5204,6 +5218,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>                                 load = target_load(i, load_idx);
>
>                         avg_load += load;
> +
> +                       spare_cap = capacity_spare_wake(i, p);
> +
> +                       if (spare_cap > max_spare_cap &&
> +                           spare_cap > capacity_of(i) >> 3) {

This condition probably needs some description. You're not only
looking for the max spare capacity but also for a significant spare
capacity (more than 12.5% of cpu_capacity_orig). Can't this additional
test lead to a strange situation where a CPU with more spare capacity
is not selected because of the 12.5% condition, whereas another with
less spare capacity is selected because its capacity_orig is lower?

> +                               max_spare_cap = spare_cap;
> +                       }
>                 }
>
>                 /* Adjust by relative CPU capacity of the group */
> @@ -5211,12 +5232,27 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>
>                 if (local_group) {
>                         this_load = avg_load;
> -               } else if (avg_load < min_load) {
> -                       min_load = avg_load;
> -                       idlest = group;
> +                       this_spare = max_spare_cap;
> +               } else {
> +                       if (avg_load < min_load) {
> +                               min_load = avg_load;
> +                               idlest = group;
> +                       }
> +
> +                       if (most_spare < max_spare_cap) {
> +                               most_spare = max_spare_cap;
> +                               most_spare_sg = group;
> +                       }
>                 }
>         } while (group = group->next, group != sd->groups);
>
> +       /* Found a significant amount of spare capacity. */

It may be worth explaining the threshold at which it becomes better to
choose the group with the most spare capacity instead of the least
loaded group.

> +       if (this_spare > task_util(p) / 2 &&
> +           imbalance*this_spare > 100*most_spare)
> +               return NULL;
> +       else if (most_spare > task_util(p) / 2)
> +               return most_spare_sg;
> +
>         if (!idlest || 100*this_load < imbalance*min_load)
>                 return NULL;
>         return idlest;
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 45+ messages in thread
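
For reference, the two thresholds being questioned above can be played with
in a standalone sketch. The imbalance_pct of 125 (giving imbalance = 112) is
only the common default, and the capacity numbers are invented; nothing here
is taken from the patch or from a real trace:

#include <stdio.h>

/* Toy walk-through of the spare-capacity checks discussed above. */
static const char *pick(long task_util, long this_spare, long most_spare)
{
	long imbalance = 100 + (125 - 100) / 2;		/* = 112 */

	if (this_spare > task_util / 2 &&
	    imbalance * this_spare > 100 * most_spare)
		return "local group (NULL)";
	if (most_spare > task_util / 2)
		return "most-spare group";
	return "fall back to load-based choice";
}

int main(void)
{
	/* local group is close enough: ~12% slack in favour of staying put */
	printf("%s\n", pick(200, 300, 320));	/* local group (NULL) */
	/* remote group has clearly more headroom for the task */
	printf("%s\n", pick(200, 300, 400));	/* most-spare group */
	/* nobody has spare room for half the task's utilization */
	printf("%s\n", pick(800, 100, 150));	/* fall back to load-based choice */
	return 0;
}

In words: the local group is kept if it can hold at least half the task's
utilization and its spare capacity is within ~12% of the best remote group;
otherwise the group with the most spare capacity wins, provided it too can
hold half the task.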

* Re: [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  2016-07-25 13:34 ` [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Morten Rasmussen
  2016-08-15 10:54   ` Peter Zijlstra
@ 2016-08-17  8:42   ` Wanpeng Li
  2016-08-17  9:23     ` Morten Rasmussen
  2016-08-18 10:56   ` [tip:sched/core] sched/core: " tip-bot for Morten Rasmussen
  2 siblings, 1 reply; 45+ messages in thread
From: Wanpeng Li @ 2016-08-17  8:42 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Yuyang Du,
	Vincent Guittot, Mike Galbraith, sgurrappadi, freedom.tan,
	keita.kobayashi.ym, linux-kernel

2016-07-25 21:34 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> Add a topology flag to the sched_domain hierarchy indicating the lowest
> domain level where the full range of cpu capacities is represented by
> the domain members for asymmetric capacity topologies (e.g. ARM
> big.LITTLE).
>
> The flag is intended to indicate that extra care should be taken when
> placing tasks on cpus and this level spans all the different types of
> cpus found in the system (no need to look further up the domain
> hierarchy). This information is currently only available through
> iterating through the capacities of all the cpus at parent levels in the
> sched_domain hierarchy.
>
> SD 2      [  0      1      2      3]  SD_ASYM_CPUCAPACITY
>
> SD 1      [  0      1] [   2      3]  !SD_ASYM_CPUCAPACITY
>
> cpu:         0      1      2      3
> capacity:  756    756   1024   1024
>
> If the topology in the example above is duplicated to create an eight
> cpu example with third sched_domain level on top (SD 3), this level
> should not have the flag set (!SD_ASYM_CPUCAPACITY) as its two groups
> would both have all cpu capacities represented within them.

I didn't find the place where SD_ASYM_CPUCAPACITY is set on any SDs in
this patchset, but you have test results in the cover letter. What did I
miss?

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  2016-08-17  8:42   ` [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Wanpeng Li
@ 2016-08-17  9:23     ` Morten Rasmussen
  2016-08-17  9:26       ` Wanpeng Li
  0 siblings, 1 reply; 45+ messages in thread
From: Morten Rasmussen @ 2016-08-17  9:23 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Yuyang Du,
	Vincent Guittot, Mike Galbraith, sgurrappadi, freedom.tan,
	keita.kobayashi.ym, linux-kernel

On Wed, Aug 17, 2016 at 04:42:36PM +0800, Wanpeng Li wrote:
> 2016-07-25 21:34 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> > Add a topology flag to the sched_domain hierarchy indicating the lowest
> > domain level where the full range of cpu capacities is represented by
> > the domain members for asymmetric capacity topologies (e.g. ARM
> > big.LITTLE).
> >
> > The flag is intended to indicate that extra care should be taken when
> > placing tasks on cpus and this level spans all the different types of
> > cpus found in the system (no need to look further up the domain
> > hierarchy). This information is currently only available through
> > iterating through the capacities of all the cpus at parent levels in the
> > sched_domain hierarchy.
> >
> > SD 2      [  0      1      2      3]  SD_ASYM_CPUCAPACITY
> >
> > SD 1      [  0      1] [   2      3]  !SD_ASYM_CPUCAPACITY
> >
> > cpu:         0      1      2      3
> > capacity:  756    756   1024   1024
> >
> > If the topology in the example above is duplicated to create an eight
> > cpu example with third sched_domain level on top (SD 3), this level
> > should not have the flag set (!SD_ASYM_CPUCAPACITY) as its two groups
> > would both have all cpu capacities represented within them.
> 
> I didn't find the place where SD_ASYM_CPUCAPACITY is set on any SDs in
> this patchset, but you have test results in the cover letter. What did I
> miss?

The flag is supposed to be set by arch-specific code. I included a few
patches in v1 and v2 that set the flag for arch/arm. However, since they
are hopefully soon to be superseded by patches from Juri I dropped them
from the v3 posting and provided a pointer to branch containing this patch
set, Juri's patches, and few additional glue patches instead that
enabled the flag when necessary for arch/arm and arch/arm64.

Sorry for the confusion.
Morten

^ permalink raw reply	[flat|nested] 45+ messages in thread
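
Since the question of where the flag gets set comes up here, below is a
minimal sketch of what the arch-side wiring could look like: a topology table
whose top level reports SD_ASYM_CPUCAPACITY through its sd_flags callback,
installed with set_sched_topology(). This is not part of the posted series
(the real arm/arm64 wiring is in Juri's DT-based patches referenced above),
and cpus_have_asym_capacity() is a made-up helper:

#include <linux/init.h>
#include <linux/sched.h>
#include <linux/topology.h>

/* Made-up helper: arch-specific detection of differing cpu capacities,
 * e.g. based on capacity values parsed from DT as in Juri's series. */
static bool cpus_have_asym_capacity(void)
{
	return true;	/* pretend big and little cpus were found */
}

static int example_die_flags(void)
{
	return cpus_have_asym_capacity() ? SD_ASYM_CPUCAPACITY : 0;
}

static struct sched_domain_topology_level example_topology[] = {
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, example_die_flags, SD_INIT_NAME(DIE) },
	{ NULL, },
};

/* called from arch init code once cpu capacities are known */
static void __init example_set_topology(void)
{
	set_sched_topology(example_topology);
}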

* Re: [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  2016-08-17  9:23     ` Morten Rasmussen
@ 2016-08-17  9:26       ` Wanpeng Li
  0 siblings, 0 replies; 45+ messages in thread
From: Wanpeng Li @ 2016-08-17  9:26 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Yuyang Du,
	Vincent Guittot, Mike Galbraith, sgurrappadi, Koan-Sin Tan,
	小林敬太,
	linux-kernel

2016-08-17 17:23 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> On Wed, Aug 17, 2016 at 04:42:36PM +0800, Wanpeng Li wrote:
>> 2016-07-25 21:34 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
>> > Add a topology flag to the sched_domain hierarchy indicating the lowest
>> > domain level where the full range of cpu capacities is represented by
>> > the domain members for asymmetric capacity topologies (e.g. ARM
>> > big.LITTLE).
>> >
>> > The flag is intended to indicate that extra care should be taken when
>> > placing tasks on cpus and this level spans all the different types of
>> > cpus found in the system (no need to look further up the domain
>> > hierarchy). This information is currently only available through
>> > iterating through the capacities of all the cpus at parent levels in the
>> > sched_domain hierarchy.
>> >
>> > SD 2      [  0      1      2      3]  SD_ASYM_CPUCAPACITY
>> >
>> > SD 1      [  0      1] [   2      3]  !SD_ASYM_CPUCAPACITY
>> >
>> > cpu:         0      1      2      3
>> > capacity:  756    756   1024   1024
>> >
>> > If the topology in the example above is duplicated to create an eight
>> > cpu example with third sched_domain level on top (SD 3), this level
>> > should not have the flag set (!SD_ASYM_CPUCAPACITY) as its two groups
>> > would both have all cpu capacities represented within them.
>>
>> I didn't find the place where SD_ASYM_CPUCAPACITY is set on any SDs in
>> this patchset, but you have test results in the cover letter. What did I
>> miss?
>
> The flag is supposed to be set by arch-specific code. I included a few
> patches in v1 and v2 that set the flag for arch/arm. However, since they
> are hopefully soon to be superseded by patches from Juri, I dropped them
> from the v3 posting and instead provided a pointer to a branch containing
> this patch set, Juri's patches, and a few additional glue patches that
> enable the flag when necessary for arch/arm and arch/arm64.

Ah, ok, thanks for the information. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly
  2016-08-15 15:42     ` Morten Rasmussen
@ 2016-08-18  8:40       ` Morten Rasmussen
  2016-08-18 10:24         ` Morten Rasmussen
  0 siblings, 1 reply; 45+ messages in thread
From: Morten Rasmussen @ 2016-08-18  8:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel

On Mon, Aug 15, 2016 at 04:42:37PM +0100, Morten Rasmussen wrote:
> On Mon, Aug 15, 2016 at 04:23:42PM +0200, Peter Zijlstra wrote:
> > But unlike that function, it doesn't actually use __update_load_avg().
> > Why not?
> 
> Fair question :)
> 
> We currently exploit the fact that the task utilization is _not_ updated
> in wake-up balancing to make sure we don't under-estimate the capacity
> requirements for tasks that have slept for a while. If we update it, we
> lose the non-decayed 'peak' utilization, but I guess we could just
> store it somewhere when we do the wake-up decay.
> 
> I thought there was a better reason when I wrote the patch, but I don't
> recall right now. I will look into it again and see if we can use
> __update_load_avg() to do a proper update instead of doing things twice.

AFAICT, we should be able to synchronize the task utilization to the
previous rq utilization using __update_load_avg() as you suggest. The
patch below should work as a replacement without any changes to
subsequent patches. It doesn't solve the under-estimation issue, but I
have another patch for that.

---8<---

>From 43226a896fad077c3ab4932f797df17159779d6e Mon Sep 17 00:00:00 2001
From: Morten Rasmussen <morten.rasmussen@arm.com>
Date: Thu, 28 Apr 2016 09:52:35 +0100
Subject: [PATCH] sched/fair: Compute task/cpu utilization at wake-up more
 correctly

At task wake-up load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking which may have
been updated in the meantime. Basically, the task's own utilization
hasn't yet accounted for the sleep decay, while the cfs_rq may have
(partially). Estimating the cfs_rq utilization in case the task is
migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg
is therefore incorrect as the two load-tracking signals aren't time
synchronized (different last update).

To solve this problem, this patch synchronizes the task utilization with
its previous rq before the task utilization is used in the wake-up path.
Currently the update/synchronization is done _after_ the task has been
placed by select_task_rq_fair(). The synchronization is done without
having to take the rq lock using the existing mechanism used in
remove_entity_load_avg().

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9e217eff3daf..8b6b8f9da28d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3119,13 +3119,25 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
 #endif
 
 /*
+ * Synchronize entity load avg of dequeued entity without locking
+ * the previous rq.
+ */
+void sync_entity_load_avg(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 last_update_time;
+
+	last_update_time = cfs_rq_last_update_time(cfs_rq);
+	__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+}
+
+/*
  * Task first catches up with cfs_rq, and then subtract
  * itself from the cfs_rq (task must be off the queue now).
  */
 void remove_entity_load_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	u64 last_update_time;
 
 	/*
 	 * tasks cannot exit without having gone through wake_up_new_task() ->
@@ -3137,9 +3149,7 @@ void remove_entity_load_avg(struct sched_entity *se)
 	 * calls this.
 	 */
 
-	last_update_time = cfs_rq_last_update_time(cfs_rq);
-
-	__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+	sync_entity_load_avg(se);
 	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
 	atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
 }
@@ -5377,6 +5387,24 @@ static inline int task_util(struct task_struct *p)
 	return p->se.avg.util_avg;
 }
 
+/*
+ * cpu_util_wake: Compute cpu utilization with any contributions from
+ * the waking task p removed.
+ */
+static int cpu_util_wake(int cpu, struct task_struct *p)
+{
+	unsigned long util, capacity;
+
+	/* Task has no contribution or is new */
+	if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
+		return cpu_util(cpu);
+
+	capacity = capacity_orig_of(cpu);
+	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+
+	return (util >= capacity) ? capacity : util;
+}
+
 static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 {
 	long min_cap, max_cap;
@@ -5388,6 +5416,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	if (max_cap - min_cap < max_cap >> 3)
 		return 0;
 
+	/* Bring task utilization in sync with prev_cpu */
+	sync_entity_load_avg(&p->se);
+
 	return min_cap * 1024 < task_util(p) * capacity_margin;
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 45+ messages in thread
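
To illustrate the cpu_util_wake() arithmetic above with concrete numbers (all
invented): the waking task's own contribution is subtracted from its previous
CPU's utilization and the result is clamped to [0, capacity_orig]. A tiny
standalone sketch:

#include <stdio.h>

/* Toy version of the subtraction done by cpu_util_wake() above. */
static long cpu_util_wake_demo(long cpu_util, long task_util, long capacity_orig)
{
	long util = cpu_util - task_util;

	if (util < 0)
		util = 0;
	return util >= capacity_orig ? capacity_orig : util;
}

int main(void)
{
	/* prev_cpu busy with others too: 700 total, 300 of it is the waking task */
	printf("%ld\n", cpu_util_wake_demo(700, 300, 1024));	/* 400 */
	/* stale/overlapping signals can make the difference negative: clamp to 0 */
	printf("%ld\n", cpu_util_wake_demo(250, 300, 1024));	/* 0 */
	return 0;
}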

* Re: [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly
  2016-08-18  8:40       ` Morten Rasmussen
@ 2016-08-18 10:24         ` Morten Rasmussen
  2016-08-18 11:46           ` Wanpeng Li
  0 siblings, 1 reply; 45+ messages in thread
From: Morten Rasmussen @ 2016-08-18 10:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel

On Thu, Aug 18, 2016 at 09:40:55AM +0100, Morten Rasmussen wrote:
> On Mon, Aug 15, 2016 at 04:42:37PM +0100, Morten Rasmussen wrote:
> > On Mon, Aug 15, 2016 at 04:23:42PM +0200, Peter Zijlstra wrote:
> > > But unlike that function, it doesn't actually use __update_load_avg().
> > > Why not?
> > 
> > Fair question :)
> > 
> > We currently exploit the fact that the task utilization is _not_ updated
> > in wake-up balancing to make sure we don't under-estimate the capacity
> > requirements for tasks that have slept for a while. If we update it, we
> > loose the non-decayed 'peak' utilization, but I guess we could just
> > store it somewhere when we do the wake-up decay.
> > 
> > I thought there was a better reason when I wrote the patch, but I don't
> > recall right now. I will look into it again and see if we can use
> > __update_load_avg() to do a proper update instead of doing things twice.
> 
> AFAICT, we should be able to synchronize the task utilization to the
> previous rq utilization using __update_load_avg() as you suggest. The
> patch below should work as a replacement without any changes to
> subsequent patches. It doesn't solve the under-estimation issue, but I
> have another patch for that.

And here is a possible solution to the under-estimation issue. The patch
would have to go at the end of this set.

---8<---

>From 5bc918995c6c589b833ba1f189a8b92fa22202ae Mon Sep 17 00:00:00 2001
From: Morten Rasmussen <morten.rasmussen@arm.com>
Date: Wed, 17 Aug 2016 15:30:43 +0100
Subject: [PATCH] sched/fair: Track peak per-entity utilization

When using PELT (per-entity load tracking), placing tasks at wake-up
based on the decayed utilization (due to sleep) leads to
under-estimation of the task's true utilization. This could mean
putting the task on a cpu with less available capacity than is actually
needed. This issue can be mitigated by using 'peak' utilization instead
of the decayed utilization for placement decisions, e.g. at task
wake-up.

The 'peak' utilization metric, util_peak, tracks util_avg when the task
is running and retains its previous value while the task is
blocked/waiting on the rq. It is instantly updated to track util_avg
again as soon as the task is running again.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h |  2 +-
 kernel/sched/fair.c   | 18 ++++++++++++++----
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4e0c47af9b05..40e427d1d378 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1281,7 +1281,7 @@ struct load_weight {
 struct sched_avg {
 	u64 last_update_time, load_sum;
 	u32 util_sum, period_contrib;
-	unsigned long load_avg, util_avg;
+	unsigned long load_avg, util_avg, util_peak;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 11b250531ed4..8462a3d455ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -692,6 +692,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
 	 */
 	sa->util_avg = 0;
+	sa->util_peak = 0;
 	sa->util_sum = 0;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
@@ -744,6 +745,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
 		} else {
 			sa->util_avg = cap;
 		}
+		sa->util_peak = sa->util_avg;
 		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
 	}
 
@@ -2806,6 +2808,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
 	}
 
+	if (running || sa->util_avg > sa->util_peak)
+		sa->util_peak = sa->util_avg;
+
 	return decayed;
 }
 
@@ -5174,7 +5179,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	return 1;
 }
 
-static inline int task_util(struct task_struct *p);
+static inline int task_util_peak(struct task_struct *p);
 static int cpu_util_wake(int cpu, struct task_struct *p);
 
 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -5257,10 +5262,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	} while (group = group->next, group != sd->groups);
 
 	/* Found a significant amount of spare capacity. */
-	if (this_spare > task_util(p) / 2 &&
+	if (this_spare > task_util_peak(p) / 2 &&
 	    imbalance*this_spare > 100*most_spare)
 		return NULL;
-	else if (most_spare > task_util(p) / 2)
+	else if (most_spare > task_util_peak(p) / 2)
 		return most_spare_sg;
 
 	if (!idlest || 100*this_load < imbalance*min_load)
@@ -5423,6 +5428,11 @@ static inline int task_util(struct task_struct *p)
 	return p->se.avg.util_avg;
 }
 
+static inline int task_util_peak(struct task_struct *p)
+{
+	return p->se.avg.util_peak;
+}
+
 /*
  * cpu_util_wake: Compute cpu utilization with any contributions from
  * the waking task p removed.
@@ -5455,7 +5465,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	/* Bring task utilization in sync with prev_cpu */
 	sync_entity_load_avg(&p->se);
 
-	return min_cap * 1024 < task_util(p) * capacity_margin;
+	return min_cap * 1024 < task_util_peak(p) * capacity_margin;
 }
 
 /*
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 45+ messages in thread
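
To see why a 'peak' value helps, recall that PELT roughly halves a blocked
task's util_avg every 32ms. The standalone sketch below is not the kernel's
__update_load_avg(), just the per-32ms geometric decay applied to an invented
starting value; it shows how quickly the decayed signal drops while util_peak
keeps the value last seen while running:

#include <math.h>
#include <stdio.h>

int main(void)
{
	double util_running = 400.0;	/* util_avg when the task went to sleep */
	double util_peak = util_running;

	for (int slept_ms = 0; slept_ms <= 128; slept_ms += 32) {
		/* approximate PELT decay: halved every 32ms of sleep */
		double util_avg = util_running * pow(0.5, slept_ms / 32.0);

		printf("slept %3d ms: util_avg ~%3.0f  util_peak %3.0f\n",
		       slept_ms, util_avg, util_peak);
	}
	return 0;
}

Compile with -lm. After three to four decay periods (96-128 ms of sleep) a
task that needed roughly 40% of a big CPU looks like a ~2-5% task to the
decayed signal, which is exactly the under-estimation the patch is aimed at.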

* [tip:sched/core] sched/core: Clarify SD_flags comment
  2016-08-15 10:54   ` Peter Zijlstra
  2016-08-15 11:43     ` Morten Rasmussen
@ 2016-08-18 10:56     ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot for Peter Zijlstra @ 2016-08-18 10:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mingo, tglx, torvalds, peterz, efault, hpa,
	morten.rasmussen

Commit-ID:  94f438c84e850570f28dd36588a0d7f73b991e44
Gitweb:     http://git.kernel.org/tip/94f438c84e850570f28dd36588a0d7f73b991e44
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 15 Aug 2016 12:54:59 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Aug 2016 11:26:52 +0200

sched/core: Clarify SD_flags comment

The SD_flags comment is very terse and doesn't explain why PACKING is
odd.

IIRC the distinction is that the 'normal' ones only describe topology,
while the ASYM_PACKING one also prescribes behaviour. It is odd in the
way that it doesn't only describe things.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/20160815105459.GS6879@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b6b23c..54fff81 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6355,13 +6355,19 @@ static int sched_domains_curr_level;
 /*
  * SD_flags allowed in topology descriptions.
  *
- * SD_SHARE_CPUCAPACITY      - describes SMT topologies
- * SD_SHARE_PKG_RESOURCES - describes shared caches
- * SD_NUMA                - describes NUMA topologies
- * SD_SHARE_POWERDOMAIN   - describes shared power domain
+ * These flags are purely descriptive of the topology and do not prescribe
+ * behaviour. Behaviour is artificial and mapped in the below sd_init()
+ * function:
  *
- * Odd one out:
- * SD_ASYM_PACKING        - describes SMT quirks
+ *   SD_SHARE_CPUCAPACITY   - describes SMT topologies
+ *   SD_SHARE_PKG_RESOURCES - describes shared caches
+ *   SD_NUMA                - describes NUMA topologies
+ *   SD_SHARE_POWERDOMAIN   - describes shared power domain
+ *
+ * Odd one out, which beside describing the topology has a quirk also
+ * prescribes the desired behaviour that goes along with it:
+ *
+ *   SD_ASYM_PACKING        - describes SMT quirks
  */
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY |		\

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [tip:sched/core] sched/core: Remove unnecessary NULL-pointer check
  2016-07-25 13:34 ` [PATCH v3 04/13] sched/core: Remove unnecessary null-pointer check Morten Rasmussen
@ 2016-08-18 10:56   ` tip-bot for Morten Rasmussen
  0 siblings, 0 replies; 45+ messages in thread
From: tip-bot for Morten Rasmussen @ 2016-08-18 10:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, peterz, linux-kernel, efault, tglx, morten.rasmussen, mingo,
	torvalds

Commit-ID:  0e6d2a67a41321b3ef650b780a279a37855de08e
Gitweb:     http://git.kernel.org/tip/0e6d2a67a41321b3ef650b780a279a37855de08e
Author:     Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate: Mon, 25 Jul 2016 14:34:21 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Aug 2016 11:26:53 +0200

sched/core: Remove unnecessary NULL-pointer check

Checking if the sched_domain pointer returned by sd_init() is NULL seems
pointless as sd_init() neither checks if it is valid to begin with nor
sets it to NULL.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1469453670-2660-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 54fff81..1b2dd52 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6854,8 +6854,6 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		struct sched_domain *child, int cpu)
 {
 	struct sched_domain *sd = sd_init(tl, cpu);
-	if (!sd)
-		return child;
 
 	cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
 	if (child) {

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [tip:sched/core] sched/core: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  2016-07-25 13:34 ` [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Morten Rasmussen
  2016-08-15 10:54   ` Peter Zijlstra
  2016-08-17  8:42   ` [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Wanpeng Li
@ 2016-08-18 10:56   ` tip-bot for Morten Rasmussen
  2 siblings, 0 replies; 45+ messages in thread
From: tip-bot for Morten Rasmussen @ 2016-08-18 10:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, mingo, torvalds, hpa, efault, linux-kernel, peterz,
	morten.rasmussen

Commit-ID:  1f6e6c7cb9bcd58abb5ee11243e0eefe6b36fc8e
Gitweb:     http://git.kernel.org/tip/1f6e6c7cb9bcd58abb5ee11243e0eefe6b36fc8e
Author:     Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate: Mon, 25 Jul 2016 14:34:22 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Aug 2016 11:26:53 +0200

sched/core: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag

Add a topology flag to the sched_domain hierarchy indicating the lowest
domain level where the full range of CPU capacities is represented by
the domain members for asymmetric capacity topologies (e.g. ARM
big.LITTLE).

The flag is intended to indicate that extra care should be taken when
placing tasks on CPUs and this level spans all the different types of
CPUs found in the system (no need to look further up the domain
hierarchy). This information is currently only available through
iterating through the capacities of all the CPUs at parent levels in the
sched_domain hierarchy.

  SD 2      [  0      1      2      3]  SD_ASYM_CPUCAPACITY

  SD 1      [  0      1] [   2      3]  !SD_ASYM_CPUCAPACITY

  CPU:         0      1      2      3
  capacity:  756    756   1024   1024

If the topology in the example above is duplicated to create an eight
CPU example with third sched_domain level on top (SD 3), this level
should not have the flag set (!SD_ASYM_CPUCAPACITY) as its two groups
would both have all CPU capacities represented within them.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1469453670-2660-6-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7f64e89..d750240 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1022,6 +1022,7 @@ extern void wake_up_q(struct wake_q_head *head);
 #define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
 #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
+#define SD_ASYM_CPUCAPACITY	0x0040  /* Groups have different max cpu capacities */
 #define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu capacity */
 #define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1b2dd52..46bfb90 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5716,6 +5716,7 @@ static int sd_degenerate(struct sched_domain *sd)
 			 SD_BALANCE_FORK |
 			 SD_BALANCE_EXEC |
 			 SD_SHARE_CPUCAPACITY |
+			 SD_ASYM_CPUCAPACITY |
 			 SD_SHARE_PKG_RESOURCES |
 			 SD_SHARE_POWERDOMAIN)) {
 		if (sd->groups != sd->groups->next)
@@ -5746,6 +5747,7 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 				SD_BALANCE_NEWIDLE |
 				SD_BALANCE_FORK |
 				SD_BALANCE_EXEC |
+				SD_ASYM_CPUCAPACITY |
 				SD_SHARE_CPUCAPACITY |
 				SD_SHARE_PKG_RESOURCES |
 				SD_PREFER_SIBLING |
@@ -6363,6 +6365,7 @@ static int sched_domains_curr_level;
  *   SD_SHARE_PKG_RESOURCES - describes shared caches
  *   SD_NUMA                - describes NUMA topologies
  *   SD_SHARE_POWERDOMAIN   - describes shared power domain
+ *   SD_ASYM_CPUCAPACITY    - describes mixed capacity topologies
  *
  * Odd one out, which beside describing the topology has a quirk also
  * prescribes the desired behaviour that goes along with it:
@@ -6374,6 +6377,7 @@ static int sched_domains_curr_level;
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA |			\
 	 SD_ASYM_PACKING |		\
+	 SD_ASYM_CPUCAPACITY |		\
 	 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [tip:sched/core] sched/core: Pass child domain into sd_init()
  2016-07-25 13:34 ` [PATCH v3 06/13] sched/core: Pass child domain into sd_init Morten Rasmussen
@ 2016-08-18 10:57   ` tip-bot for Morten Rasmussen
  0 siblings, 0 replies; 45+ messages in thread
From: tip-bot for Morten Rasmussen @ 2016-08-18 10:57 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: morten.rasmussen, linux-kernel, torvalds, hpa, tglx, efault,
	mingo, peterz

Commit-ID:  3676b13e8524c576825fe1e731e347dba0083888
Gitweb:     http://git.kernel.org/tip/3676b13e8524c576825fe1e731e347dba0083888
Author:     Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate: Mon, 25 Jul 2016 14:34:23 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Aug 2016 11:26:54 +0200

sched/core: Pass child domain into sd_init()

If behavioural sched_domain flags depend on topology flags set at higher
domain levels we need a way to update the child domain flags. Moving the
child pointer assignment inside sd_init() should make that possible.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1469453670-2660-7-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 46bfb90..5739465 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6381,7 +6381,8 @@ static int sched_domains_curr_level;
 	 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
-sd_init(struct sched_domain_topology_level *tl, int cpu)
+sd_init(struct sched_domain_topology_level *tl,
+	struct sched_domain *child, int cpu)
 {
 	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
 	int sd_weight, sd_flags = 0;
@@ -6433,6 +6434,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		.smt_gain		= 0,
 		.max_newidle_lb_cost	= 0,
 		.next_decay_max_lb_cost	= jiffies,
+		.child			= child,
 #ifdef CONFIG_SCHED_DEBUG
 		.name			= tl->name,
 #endif
@@ -6857,14 +6859,13 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		const struct cpumask *cpu_map, struct sched_domain_attr *attr,
 		struct sched_domain *child, int cpu)
 {
-	struct sched_domain *sd = sd_init(tl, cpu);
+	struct sched_domain *sd = sd_init(tl, child, cpu);
 
 	cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
 	if (child) {
 		sd->level = child->level + 1;
 		sched_domain_level_max = max(sched_domain_level_max, sd->level);
 		child->parent = sd;
-		sd->child = child;
 
 		if (!cpumask_subset(sched_domain_span(child),
 				    sched_domain_span(sd))) {

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [tip:sched/core] sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems
  2016-07-25 13:34 ` [PATCH v3 07/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems Morten Rasmussen
@ 2016-08-18 10:57   ` tip-bot for Morten Rasmussen
  0 siblings, 0 replies; 45+ messages in thread
From: tip-bot for Morten Rasmussen @ 2016-08-18 10:57 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, hpa, efault, mingo, peterz, tglx, morten.rasmussen,
	linux-kernel

Commit-ID:  9ee1cda5ee25c7dd82acf25892e0d229e818f8c7
Gitweb:     http://git.kernel.org/tip/9ee1cda5ee25c7dd82acf25892e0d229e818f8c7
Author:     Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate: Mon, 25 Jul 2016 14:34:24 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Aug 2016 11:26:55 +0200

sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems

A domain with the SD_ASYM_CPUCAPACITY flag set indicates that
sched_groups at this level and below do not include CPUs of all
capacities available (e.g. group containing little-only or big-only CPUs
in big.LITTLE systems). It is therefore necessary to put in more effort
in finding an appropriate CPU at task wake-up by enabling balancing at
wake-up (SD_BALANCE_WAKE) on all lower (child) levels.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1469453670-2660-8-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5739465..4695df6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6444,6 +6444,13 @@ sd_init(struct sched_domain_topology_level *tl,
 	 * Convert topological properties into behaviour.
 	 */
 
+	if (sd->flags & SD_ASYM_CPUCAPACITY) {
+		struct sched_domain *t = sd;
+
+		for_each_lower_domain(t)
+			t->flags |= SD_BALANCE_WAKE;
+	}
+
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
 		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [tip:sched/core] sched/core: Store maximum per-CPU capacity in root domain
  2016-08-01 18:53   ` Dietmar Eggemann
  2016-08-16 12:24     ` Vincent Guittot
@ 2016-08-18 10:58     ` tip-bot for Dietmar Eggemann
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot for Dietmar Eggemann @ 2016-08-18 10:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, tglx, linux-kernel, dietmar.eggemann, hpa, peterz,
	mingo, efault, morten.rasmussen

Commit-ID:  cd92bfd3b8cb0ec2ee825e55a3aee704cd55aea9
Gitweb:     http://git.kernel.org/tip/cd92bfd3b8cb0ec2ee825e55a3aee704cd55aea9
Author:     Dietmar Eggemann <dietmar.eggemann@arm.com>
AuthorDate: Mon, 1 Aug 2016 19:53:35 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Aug 2016 11:26:55 +0200

sched/core: Store maximum per-CPU capacity in root domain

To be able to compare the capacity of the target CPU with the highest
available CPU capacity, store the maximum per-CPU capacity in the root
domain.

The max per-CPU capacity should be 1024 for all systems except SMT,
where the capacity is currently based on smt_gain and the number of
hardware threads and is <1024. If SMT can be brought to work with a
per-thread capacity of 1024, this patch can be dropped and replaced by a
hard-coded max capacity of 1024 (=SCHED_CAPACITY_SCALE).
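
For instance (the numbers are purely illustrative): a smt_gain of 1178
(~15% above 1024) spread over two hardware threads gives each thread a
capacity of roughly 1178 / 2 = 589, which is why the maximum has to be
tracked rather than assumed to be 1024.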

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/26c69258-9947-f830-a53e-0c54e7750646@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 12 ++++++++++++
 kernel/sched/sched.h |  2 ++
 2 files changed, 14 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4695df6..6924314 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6903,6 +6903,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 	enum s_alloc alloc_state;
 	struct sched_domain *sd;
 	struct s_data d;
+	struct rq *rq = NULL;
 	int i, ret = -ENOMEM;
 
 	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
@@ -6953,11 +6954,22 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 	/* Attach the domains */
 	rcu_read_lock();
 	for_each_cpu(i, cpu_map) {
+		rq = cpu_rq(i);
 		sd = *per_cpu_ptr(d.sd, i);
+
+		/* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
+		if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
+			WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
+
 		cpu_attach_domain(sd, d.rd, i);
 	}
 	rcu_read_unlock();
 
+	if (rq) {
+		pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
+			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
+	}
+
 	ret = 0;
 error:
 	__free_domain_allocs(&d, alloc_state, cpu_map);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index afe76d0..420c05d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -565,6 +565,8 @@ struct root_domain {
 	 */
 	cpumask_var_t rto_mask;
 	struct cpupri cpupri;
+
+	unsigned long max_cpu_capacity;
 };
 
 extern struct root_domain def_root_domain;

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [tip:sched/core] sched/fair: Let asymmetric CPU configurations balance at wake-up
  2016-07-25 13:34 ` [PATCH v3 09/13] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
  2016-08-15 13:39   ` Peter Zijlstra
@ 2016-08-18 10:58   ` tip-bot for Morten Rasmussen
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot for Morten Rasmussen @ 2016-08-18 10:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, mingo, vincent.guittot, tglx, peterz, linux-kernel,
	torvalds, morten.rasmussen, efault

Commit-ID:  3273163c6775c4c21823985304c2364b08ca6ea2
Gitweb:     http://git.kernel.org/tip/3273163c6775c4c21823985304c2364b08ca6ea2
Author:     Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate: Mon, 25 Jul 2016 14:34:26 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Aug 2016 11:26:56 +0200

sched/fair: Let asymmetric CPU configurations balance at wake-up

Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
configurations, SD_WAKE_AFFINE is only desirable if the waking task's
compute demand (utilization) is suitable for the waking CPU, the
previous CPU, and all CPUs within their respective
SD_SHARE_PKG_RESOURCES domains (sd_llc). If not, let wakeup balancing
take over (find_idlest_{group, cpu}()).

This patch makes affine wake-ups conditional on whether both the waker
CPU and the previous CPU have sufficient capacity for the waking task,
assuming that the CPU capacities within an SD_SHARE_PKG_RESOURCES
domain (sd_llc) are homogeneous.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1469453670-2660-10-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index acdc351..61d4854 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
 #endif
 
+/*
+ * The margin used when comparing utilization with CPU capacity:
+ * util * 1024 < capacity * margin
+ */
+unsigned int capacity_margin = 1280; /* ~20% */
+
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
 	lw->weight += inc;
@@ -5376,6 +5382,32 @@ static int cpu_util(int cpu)
 	return (util >= capacity) ? capacity : util;
 }
 
+static inline int task_util(struct task_struct *p)
+{
+	return p->se.avg.util_avg;
+}
+
+/*
+ * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
+ * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
+ *
+ * In that case WAKE_AFFINE doesn't make sense and we'll let
+ * BALANCE_WAKE sort things out.
+ */
+static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
+{
+	long min_cap, max_cap;
+
+	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
+	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
+
+	/* Minimum capacity is close to max, no need to abort wake_affine */
+	if (max_cap - min_cap < max_cap >> 3)
+		return 0;
+
+	return min_cap * 1024 < task_util(p) * capacity_margin;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5399,7 +5431,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
-		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
+			      && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
 	}
 
 	rcu_read_lock();
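
As a worked example of the margin check above (numbers invented for
illustration): with capacity_margin = 1280 (~20%), a task with
task_util = 300 still fits a pair of CPUs whose smaller original
capacity is 430, since 430 * 1024 = 440320 is not less than
300 * 1280 = 384000, so wake_cap() returns 0 and affine wake-ups remain
possible. With task_util = 400 the right-hand side becomes 512000,
wake_cap() returns 1 and the decision is handed to
find_idlest_group()/find_idlest_cpu() via SD_BALANCE_WAKE (assuming the
capacity spread check did not already bail out, i.e. max_cap - min_cap
is at least max_cap/8).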

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 11/13] sched/fair: Consider spare capacity in find_idlest_group()
  2016-08-16 13:57   ` Vincent Guittot
@ 2016-08-18 11:16     ` Morten Rasmussen
  2016-08-18 12:28       ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Morten Rasmussen @ 2016-08-18 11:16 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	Sai Charan Gurrappadi, Koan-Sin Tan,
	小林敬太,
	linux-kernel

On Tue, Aug 16, 2016 at 03:57:06PM +0200, Vincent Guittot wrote:
> On 25 July 2016 at 15:34, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > In low-utilization scenarios comparing relative loads in
> > find_idlest_group() doesn't always lead to the most optimum choice.
> > Systems with groups containing different numbers of cpus and/or cpus of
> > different compute capacity are significantly better off when considering
> > spare capacity rather than relative load in those scenarios.
> >
> > In addition to existing load based search an alternative spare capacity
> > based candidate sched_group is found and selected instead if sufficient
> > spare capacity exists. If not, existing behaviour is preserved.
> >
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 41 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9c6ec3bf75ce..e3654409d099 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5164,6 +5164,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >         return 1;
> >  }
> >
> > +static inline int task_util(struct task_struct *p);
> > +static int cpu_util_wake(int cpu, struct task_struct *p);
> > +
> > +static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> > +{
> > +       return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
> > +}
> > +
> >  /*
> >   * find_idlest_group finds and returns the least busy CPU group within the
> >   * domain.
> > @@ -5173,7 +5181,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >                   int this_cpu, int sd_flag)
> >  {
> >         struct sched_group *idlest = NULL, *group = sd->groups;
> > +       struct sched_group *most_spare_sg = NULL;
> >         unsigned long min_load = ULONG_MAX, this_load = 0;
> > +       unsigned long most_spare = 0, this_spare = 0;
> >         int load_idx = sd->forkexec_idx;
> >         int imbalance = 100 + (sd->imbalance_pct-100)/2;
> >
> > @@ -5181,7 +5191,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >                 load_idx = sd->wake_idx;
> >
> >         do {
> > -               unsigned long load, avg_load;
> > +               unsigned long load, avg_load, spare_cap, max_spare_cap;
> >                 int local_group;
> >                 int i;
> >
> > @@ -5193,8 +5203,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >                 local_group = cpumask_test_cpu(this_cpu,
> >                                                sched_group_cpus(group));
> >
> > -               /* Tally up the load of all CPUs in the group */
> > +               /*
> > +                * Tally up the load of all CPUs in the group and find
> > +                * the group containing the cpu with most spare capacity.
> > +                */
> >                 avg_load = 0;
> > +               max_spare_cap = 0;
> >
> >                 for_each_cpu(i, sched_group_cpus(group)) {
> >                         /* Bias balancing toward cpus of our domain */
> > @@ -5204,6 +5218,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >                                 load = target_load(i, load_idx);
> >
> >                         avg_load += load;
> > +
> > +                       spare_cap = capacity_spare_wake(i, p);
> > +
> > +                       if (spare_cap > max_spare_cap &&
> > +                           spare_cap > capacity_of(i) >> 3) {
> 
> This condition probably needs some descriptions. You're not only
> looking for max spare capacity but also a significant spare capacity
> (more than 12.5% of cpu_capacity_orig). Can't this additional test
> lead to some strange situation where a CPU with more spare capacity
> will not be selected because of this 12.5% condition whereas another
> with less spare capacity will be selected because its capacity_orig is
> lower ?

Right, the reason why I added the 12.5% check is that I thought we
wouldn't want to pack cpus too aggressively. You are right that we could
reject a 1024 capacity cpu with a spare capacity of 100 and pick a 512
capacity cpu with a spare capacity of 65.
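
To make that concrete, a small stand-alone sketch; the helper and the
numbers are invented for illustration, it just mirrors the
'spare_cap > capacity_of(i) >> 3' test quoted above:

#include <stdio.h>

/* Illustrative only: mirrors the 'spare_cap > capacity_of(i) >> 3' test. */
static int spare_is_significant(unsigned long spare_cap, unsigned long cap)
{
	/* require more than 12.5% of the cpu's capacity to be spare */
	return spare_cap > (cap >> 3);
}

int main(void)
{
	/* big cpu: capacity 1024, spare 100 -> 100 > 128 is false, rejected */
	printf("big:    %d\n", spare_is_significant(100, 1024));
	/* little cpu: capacity 512, spare 65 -> 65 > 64 is true, accepted */
	printf("little: %d\n", spare_is_significant(65, 512));
	return 0;
}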

Using an absolute margin instead, e.g. 128 instead of 12.5%, would fix
this particular issue but create another one, as we would reject low
capacity cpus even at very low utilization, for example cpus with a
capacity of 256. I thought that the relative threshold was more generic
and would cause less trouble, but I'm happy to discuss alternatives.

From a latency perspective it might not be a bad idea to stay away from
cpus that already have some utilization, even if they have more capacity
available, as the task is more likely to end up waiting on the rq. For
throughput tasks you would of course want it the other way around.

> 
> > +                               max_spare_cap = spare_cap;
> > +                       }
> >                 }
> >
> >                 /* Adjust by relative CPU capacity of the group */
> > @@ -5211,12 +5232,27 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >
> >                 if (local_group) {
> >                         this_load = avg_load;
> > -               } else if (avg_load < min_load) {
> > -                       min_load = avg_load;
> > -                       idlest = group;
> > +                       this_spare = max_spare_cap;
> > +               } else {
> > +                       if (avg_load < min_load) {
> > +                               min_load = avg_load;
> > +                               idlest = group;
> > +                       }
> > +
> > +                       if (most_spare < max_spare_cap) {
> > +                               most_spare = max_spare_cap;
> > +                               most_spare_sg = group;
> > +                       }
> >                 }
> >         } while (group = group->next, group != sd->groups);
> >
> > +       /* Found a significant amount of spare capacity. */
> 
> It may worth explaining the threshold when it becomes better to choose
> the most spare group instead of the least loaded group.

Yes. I admit that the threshold is somewhat randomly chosen. Based on a
few experiments I found that requiring enough spare capacity to fit the
task completely was too conservative. We would bail out and go with the
least loaded groups very often, especially for new tasks, despite the
spare capacity only being slightly too small. Allowing a small degree of
stuffing of the task seemed better. Choosing the least loaded group
instead doesn't give any better throughput for the waking task unless it
has high priority. For overall throughput, the cpus with the most spare
capacity should be the better choice.

Should I just add a comment saying that we want to allow a little bit of
task stuffing to better accommodate new tasks and get better overall
throughput, or should we investigate the threshold further?
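
For what it's worth, here is a stand-alone sketch of how the two checks
in the hunk quoted below interact; all numbers are invented:

#include <stdio.h>

/* Illustrative only: same structure as the two checks in the hunk below. */
int main(void)
{
	unsigned long task_util = 200;	/* invented utilization of the waking task */
	unsigned long this_spare = 40;	/* spare capacity in the local group */
	unsigned long most_spare = 120;	/* best spare capacity among other groups */
	unsigned long imbalance = 117;	/* 100 + (imbalance_pct - 100) / 2 */

	if (this_spare > task_util / 2 &&
	    imbalance * this_spare > 100 * most_spare)
		printf("local group is good enough, no migration\n");
	else if (most_spare > task_util / 2)
		printf("pick the group with the most spare capacity\n");
	else
		printf("fall back to the load-based choice\n");

	return 0;
}

With these numbers the remote group is picked even though its spare
capacity (120) is smaller than the task's utilization (200), i.e. the
small degree of task stuffing described above.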


> 
> > +       if (this_spare > task_util(p) / 2 &&
> > +           imbalance*this_spare > 100*most_spare)
> > +               return NULL;
> > +       else if (most_spare > task_util(p) / 2)
> > +               return most_spare_sg;
> > +
> >         if (!idlest || 100*this_load < imbalance*min_load)
> >                 return NULL;
> >         return idlest;
> > --
> > 1.9.1
> >

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly
  2016-08-18 10:24         ` Morten Rasmussen
@ 2016-08-18 11:46           ` Wanpeng Li
  2016-08-18 13:45             ` Morten Rasmussen
  0 siblings, 1 reply; 45+ messages in thread
From: Wanpeng Li @ 2016-08-18 11:46 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Yuyang Du,
	Vincent Guittot, Mike Galbraith, sgurrappadi, Koan-Sin Tan,
	小林敬太,
	linux-kernel

2016-08-18 18:24 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> On Thu, Aug 18, 2016 at 09:40:55AM +0100, Morten Rasmussen wrote:
>> On Mon, Aug 15, 2016 at 04:42:37PM +0100, Morten Rasmussen wrote:
>> > On Mon, Aug 15, 2016 at 04:23:42PM +0200, Peter Zijlstra wrote:
>> > > But unlike that function, it doesn't actually use __update_load_avg().
>> > > Why not?
>> >
>> > Fair question :)
>> >
>> > We currently exploit the fact that the task utilization is _not_ updated
>> > in wake-up balancing to make sure we don't under-estimate the capacity
>> > requirements for tasks that have slept for a while. If we update it, we
>> > loose the non-decayed 'peak' utilization, but I guess we could just
>> > store it somewhere when we do the wake-up decay.
>> >
>> > I thought there was a better reason when I wrote the patch, but I don't
>> > recall right now. I will look into it again and see if we can use
>> > __update_load_avg() to do a proper update instead of doing things twice.
>>
>> AFAICT, we should be able to synchronize the task utilization to the
>> previous rq utilization using __update_load_avg() as you suggest. The
>> patch below is should work as a replacement without any changes to
>> subsequent patches. It doesn't solve the under-estimation issue, but I
>> have another patch for that.
>
> And here is a possible solution to the under-estimation issue. The patch
> would have to go at the end of this set.
>
> ---8<---
>
> From 5bc918995c6c589b833ba1f189a8b92fa22202ae Mon Sep 17 00:00:00 2001
> From: Morten Rasmussen <morten.rasmussen@arm.com>
> Date: Wed, 17 Aug 2016 15:30:43 +0100
> Subject: [PATCH] sched/fair: Track peak per-entity utilization
>
> When using PELT (per-entity load tracking) utilization to place tasks at
> wake-up using the decayed utilization (due to sleep) leads to
> under-estimation of true utilization of the task. This could mean
> putting the task on a cpu with less available capacity than is actually
> needed. This issue can be mitigated by using 'peak' utilization instead
> of the decayed utilization for placement decisions, e.g. at task
> wake-up.
>
> The 'peak' utilization metric, util_peak, tracks util_avg when the task
> is running and retains its previous value while the task is
> blocked/waiting on the rq. It is instantly updated to track util_avg
> again as soon as the task running again.

Maybe this will lead to disabling wake affine due to a spiked peak
value for a task with a low average load.

Regards,
Wanpeng Li

>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  include/linux/sched.h |  2 +-
>  kernel/sched/fair.c   | 18 ++++++++++++++----
>  2 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4e0c47af9b05..40e427d1d378 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1281,7 +1281,7 @@ struct load_weight {
>  struct sched_avg {
>         u64 last_update_time, load_sum;
>         u32 util_sum, period_contrib;
> -       unsigned long load_avg, util_avg;
> +       unsigned long load_avg, util_avg, util_peak;
>  };
>
>  #ifdef CONFIG_SCHEDSTATS
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 11b250531ed4..8462a3d455ff 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -692,6 +692,7 @@ void init_entity_runnable_average(struct sched_entity *se)
>          * At this point, util_avg won't be used in select_task_rq_fair anyway
>          */
>         sa->util_avg = 0;
> +       sa->util_peak = 0;
>         sa->util_sum = 0;
>         /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
>  }
> @@ -744,6 +745,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
>                 } else {
>                         sa->util_avg = cap;
>                 }
> +               sa->util_peak = sa->util_avg;
>                 sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>         }
>
> @@ -2806,6 +2808,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>                 sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
>         }
>
> +       if (running || sa->util_avg > sa->util_peak)
> +               sa->util_peak = sa->util_avg;
> +
>         return decayed;
>  }
>
> @@ -5174,7 +5179,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>         return 1;
>  }
>
> -static inline int task_util(struct task_struct *p);
> +static inline int task_util_peak(struct task_struct *p);
>  static int cpu_util_wake(int cpu, struct task_struct *p);
>
>  static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> @@ -5257,10 +5262,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>         } while (group = group->next, group != sd->groups);
>
>         /* Found a significant amount of spare capacity. */
> -       if (this_spare > task_util(p) / 2 &&
> +       if (this_spare > task_util_peak(p) / 2 &&
>             imbalance*this_spare > 100*most_spare)
>                 return NULL;
> -       else if (most_spare > task_util(p) / 2)
> +       else if (most_spare > task_util_peak(p) / 2)
>                 return most_spare_sg;
>
>         if (!idlest || 100*this_load < imbalance*min_load)
> @@ -5423,6 +5428,11 @@ static inline int task_util(struct task_struct *p)
>         return p->se.avg.util_avg;
>  }
>
> +static inline int task_util_peak(struct task_struct *p)
> +{
> +       return p->se.avg.util_peak;
> +}
> +
>  /*
>   * cpu_util_wake: Compute cpu utilization with any contributions from
>   * the waking task p removed.
> @@ -5455,7 +5465,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>         /* Bring task utilization in sync with prev_cpu */
>         sync_entity_load_avg(&p->se);
>
> -       return min_cap * 1024 < task_util(p) * capacity_margin;
> +       return min_cap * 1024 < task_util_peak(p) * capacity_margin;
>  }
>
>  /*
> --
> 1.9.1
>



-- 
Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 11/13] sched/fair: Consider spare capacity in find_idlest_group()
  2016-08-18 11:16     ` Morten Rasmussen
@ 2016-08-18 12:28       ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2016-08-18 12:28 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	Sai Charan Gurrappadi, Koan-Sin Tan,
	小林敬太,
	linux-kernel

On Thu, Aug 18, 2016 at 12:16:33PM +0100, Morten Rasmussen wrote:
> On Tue, Aug 16, 2016 at 03:57:06PM +0200, Vincent Guittot wrote:
> > > @@ -5204,6 +5218,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> > >                                 load = target_load(i, load_idx);
> > >
> > >                         avg_load += load;
> > > +
> > > +                       spare_cap = capacity_spare_wake(i, p);
> > > +
> > > +                       if (spare_cap > max_spare_cap &&
> > > +                           spare_cap > capacity_of(i) >> 3) {
> > 
> > This condition probably needs some descriptions. You're not only
> > looking for max spare capacity but also a significant spare capacity
> > (more than 12.5% of cpu_capacity_orig). Can't this additional test
> > lead to some strange situation where a CPU with more spare capacity
> > will not be selected because of this 12.5% condition whereas another
> > with less spare capacity will be selected because its capacity_orig is
> > lower ?
> 
> Right, the reason why I added the 12.5% check is that I thought we
> wouldn't want to pack cpus too aggressively. You are right that we could
> reject a 1024 capacity with a spare capacity of 100 and pick a 512
> capacity cpu with a spare capacity of 65.

You could of course track both.. but complexity. At the very least I
agree with Vincent in that this very much deserves a comment.

> From a latency perspective it might not be a bad idea staying away from
> cpus with a utilization even if they have more capacity available as the
> task is more likely to end up waiting on the rq. For throughput tasks
> you would of course want it the other way around.

(debug) tuning-knob ;-)

> > > @@ -5211,12 +5232,27 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> > >
> > >                 if (local_group) {
> > >                         this_load = avg_load;
> > > -               } else if (avg_load < min_load) {
> > > -                       min_load = avg_load;
> > > -                       idlest = group;
> > > +                       this_spare = max_spare_cap;
> > > +               } else {
> > > +                       if (avg_load < min_load) {
> > > +                               min_load = avg_load;
> > > +                               idlest = group;
> > > +                       }
> > > +
> > > +                       if (most_spare < max_spare_cap) {
> > > +                               most_spare = max_spare_cap;
> > > +                               most_spare_sg = group;
> > > +                       }
> > >                 }
> > >         } while (group = group->next, group != sd->groups);
> > >
> > > +       /* Found a significant amount of spare capacity. */
> > 
> > It may worth explaining the threshold when it becomes better to choose
> > the most spare group instead of the least loaded group.
> 
> Yes. I admit that the threshold is somewhat randomly chosen. Based on a
> few experiments I found that requiring enough spare capacity to fit the
> task completely was too conservative. We would bail out and go with the
> least loaded groups very often, especially for new tasks, despite the
> spare capacity only being slightly too small. Allowing a small degree of
> stuffing of the task seemed better. Choosing the least loaded group
> instead doesn't give any better throughput for the waking task unless it
> has high priority. For overall throughput, the most spare capacity cpus
> should be the better choice.
> 
> Should I just add a comment saying that we want to allow a little bit of
> task stuffing to accommodate better for new tasks and have better overall
> throughput, or should we investigate the threshold further?

A comment would certainly be nice..

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly
  2016-08-18 11:46           ` Wanpeng Li
@ 2016-08-18 13:45             ` Morten Rasmussen
  2016-08-19  1:43               ` Wanpeng Li
  0 siblings, 1 reply; 45+ messages in thread
From: Morten Rasmussen @ 2016-08-18 13:45 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Yuyang Du,
	Vincent Guittot, Mike Galbraith, sgurrappadi, Koan-Sin Tan,
	小林敬太,
	linux-kernel

On Thu, Aug 18, 2016 at 07:46:44PM +0800, Wanpeng Li wrote:
> 2016-08-18 18:24 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> > On Thu, Aug 18, 2016 at 09:40:55AM +0100, Morten Rasmussen wrote:
> >> On Mon, Aug 15, 2016 at 04:42:37PM +0100, Morten Rasmussen wrote:
> >> > On Mon, Aug 15, 2016 at 04:23:42PM +0200, Peter Zijlstra wrote:
> >> > > But unlike that function, it doesn't actually use __update_load_avg().
> >> > > Why not?
> >> >
> >> > Fair question :)
> >> >
> >> > We currently exploit the fact that the task utilization is _not_ updated
> >> > in wake-up balancing to make sure we don't under-estimate the capacity
> >> > requirements for tasks that have slept for a while. If we update it, we
> >> > loose the non-decayed 'peak' utilization, but I guess we could just
> >> > store it somewhere when we do the wake-up decay.
> >> >
> >> > I thought there was a better reason when I wrote the patch, but I don't
> >> > recall right now. I will look into it again and see if we can use
> >> > __update_load_avg() to do a proper update instead of doing things twice.
> >>
> >> AFAICT, we should be able to synchronize the task utilization to the
> >> previous rq utilization using __update_load_avg() as you suggest. The
> >> patch below is should work as a replacement without any changes to
> >> subsequent patches. It doesn't solve the under-estimation issue, but I
> >> have another patch for that.
> >
> > And here is a possible solution to the under-estimation issue. The patch
> > would have to go at the end of this set.
> >
> > ---8<---
> >
> > From 5bc918995c6c589b833ba1f189a8b92fa22202ae Mon Sep 17 00:00:00 2001
> > From: Morten Rasmussen <morten.rasmussen@arm.com>
> > Date: Wed, 17 Aug 2016 15:30:43 +0100
> > Subject: [PATCH] sched/fair: Track peak per-entity utilization
> >
> > When using PELT (per-entity load tracking) utilization to place tasks at
> > wake-up using the decayed utilization (due to sleep) leads to
> > under-estimation of true utilization of the task. This could mean
> > putting the task on a cpu with less available capacity than is actually
> > needed. This issue can be mitigated by using 'peak' utilization instead
> > of the decayed utilization for placement decisions, e.g. at task
> > wake-up.
> >
> > The 'peak' utilization metric, util_peak, tracks util_avg when the task
> > is running and retains its previous value while the task is
> > blocked/waiting on the rq. It is instantly updated to track util_avg
> > again as soon as the task running again.
> 
> Maybe this will lead to disable wake affine due to a spike peak value
> for a low average load task.

I assume you are referring to using task_util_peak() instead of
task_util() in wake_cap()?

The peak value should never exceed the util_avg accumulated by the task
last time it ran. So any spike has to be caused by the task accumulating
more utilization last time it ran. We don't know if it is a spike or a
more permanent change in behaviour, so we have to guess. So a spike on
an asymmetric system could cause us to disable wake affine in some
circumstances (either prev_cpu or the waker cpu has to have low compute
capacity) for the following wake-up.

SMP should be unaffected as we should bail out on the previous
condition.

The counter-example is a task with a fairly long busy period and a much
longer period (cycle). Its util_avg might have decayed away since the
last activation, so it appears very small at wake-up and we end up
putting it on a low capacity cpu every time, even though it keeps the
cpu busy for a long time every time it wakes up.
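
If it helps, a toy model of that counter-example; the decay below is
completely made up and has nothing to do with the real PELT half-life or
__update_load_avg():

#include <stdio.h>

/*
 * Toy model only: a periodic task builds up ~600 worth of utilization while
 * running, then sleeps long enough for util_avg to decay. The proposed
 * util_peak keeps the value from the last busy period instead.
 */
int main(void)
{
	unsigned long util_avg = 600;
	unsigned long util_peak = util_avg;
	int i;

	for (i = 0; i < 2; i++)
		util_avg /= 2;	/* invented decay while sleeping */

	printf("at wake-up: util_avg=%lu util_peak=%lu\n", util_avg, util_peak);
	/*
	 * A placement decision based on util_avg (150) could pick a cpu far
	 * too small for the ~600 the task will use once it runs again.
	 */
	return 0;
}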

Did that answer your question?

Thanks,
Morten

> 
> Regards,
> Wanpeng Li
> 
> >
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  include/linux/sched.h |  2 +-
> >  kernel/sched/fair.c   | 18 ++++++++++++++----
> >  2 files changed, 15 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 4e0c47af9b05..40e427d1d378 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1281,7 +1281,7 @@ struct load_weight {
> >  struct sched_avg {
> >         u64 last_update_time, load_sum;
> >         u32 util_sum, period_contrib;
> > -       unsigned long load_avg, util_avg;
> > +       unsigned long load_avg, util_avg, util_peak;
> >  };
> >
> >  #ifdef CONFIG_SCHEDSTATS
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 11b250531ed4..8462a3d455ff 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -692,6 +692,7 @@ void init_entity_runnable_average(struct sched_entity *se)
> >          * At this point, util_avg won't be used in select_task_rq_fair anyway
> >          */
> >         sa->util_avg = 0;
> > +       sa->util_peak = 0;
> >         sa->util_sum = 0;
> >         /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
> >  }
> > @@ -744,6 +745,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
> >                 } else {
> >                         sa->util_avg = cap;
> >                 }
> > +               sa->util_peak = sa->util_avg;
> >                 sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
> >         }
> >
> > @@ -2806,6 +2808,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >                 sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
> >         }
> >
> > +       if (running || sa->util_avg > sa->util_peak)
> > +               sa->util_peak = sa->util_avg;
> > +
> >         return decayed;
> >  }
> >
> > @@ -5174,7 +5179,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >         return 1;
> >  }
> >
> > -static inline int task_util(struct task_struct *p);
> > +static inline int task_util_peak(struct task_struct *p);
> >  static int cpu_util_wake(int cpu, struct task_struct *p);
> >
> >  static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> > @@ -5257,10 +5262,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >         } while (group = group->next, group != sd->groups);
> >
> >         /* Found a significant amount of spare capacity. */
> > -       if (this_spare > task_util(p) / 2 &&
> > +       if (this_spare > task_util_peak(p) / 2 &&
> >             imbalance*this_spare > 100*most_spare)
> >                 return NULL;
> > -       else if (most_spare > task_util(p) / 2)
> > +       else if (most_spare > task_util_peak(p) / 2)
> >                 return most_spare_sg;
> >
> >         if (!idlest || 100*this_load < imbalance*min_load)
> > @@ -5423,6 +5428,11 @@ static inline int task_util(struct task_struct *p)
> >         return p->se.avg.util_avg;
> >  }
> >
> > +static inline int task_util_peak(struct task_struct *p)
> > +{
> > +       return p->se.avg.util_peak;
> > +}
> > +
> >  /*
> >   * cpu_util_wake: Compute cpu utilization with any contributions from
> >   * the waking task p removed.
> > @@ -5455,7 +5465,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> >         /* Bring task utilization in sync with prev_cpu */
> >         sync_entity_load_avg(&p->se);
> >
> > -       return min_cap * 1024 < task_util(p) * capacity_margin;
> > +       return min_cap * 1024 < task_util_peak(p) * capacity_margin;
> >  }
> >
> >  /*
> > --
> > 1.9.1
> >
> 
> 
> 
> -- 
> Regards,
> Wanpeng Li

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly
  2016-08-18 13:45             ` Morten Rasmussen
@ 2016-08-19  1:43               ` Wanpeng Li
  2016-08-19 14:03                 ` Morten Rasmussen
  0 siblings, 1 reply; 45+ messages in thread
From: Wanpeng Li @ 2016-08-19  1:43 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Yuyang Du,
	Vincent Guittot, Mike Galbraith, sgurrappadi, Koan-Sin Tan,
	小林敬太,
	linux-kernel

2016-08-18 21:45 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> On Thu, Aug 18, 2016 at 07:46:44PM +0800, Wanpeng Li wrote:
>> 2016-08-18 18:24 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
>> > On Thu, Aug 18, 2016 at 09:40:55AM +0100, Morten Rasmussen wrote:
>> >> On Mon, Aug 15, 2016 at 04:42:37PM +0100, Morten Rasmussen wrote:
>> >> > On Mon, Aug 15, 2016 at 04:23:42PM +0200, Peter Zijlstra wrote:
>> >> > > But unlike that function, it doesn't actually use __update_load_avg().
>> >> > > Why not?
>> >> >
>> >> > Fair question :)
>> >> >
>> >> > We currently exploit the fact that the task utilization is _not_ updated
>> >> > in wake-up balancing to make sure we don't under-estimate the capacity
>> >> > requirements for tasks that have slept for a while. If we update it, we
>> >> > loose the non-decayed 'peak' utilization, but I guess we could just
>> >> > store it somewhere when we do the wake-up decay.
>> >> >
>> >> > I thought there was a better reason when I wrote the patch, but I don't
>> >> > recall right now. I will look into it again and see if we can use
>> >> > __update_load_avg() to do a proper update instead of doing things twice.
>> >>
>> >> AFAICT, we should be able to synchronize the task utilization to the
>> >> previous rq utilization using __update_load_avg() as you suggest. The
>> >> patch below is should work as a replacement without any changes to
>> >> subsequent patches. It doesn't solve the under-estimation issue, but I
>> >> have another patch for that.
>> >
>> > And here is a possible solution to the under-estimation issue. The patch
>> > would have to go at the end of this set.
>> >
>> > ---8<---
>> >
>> > From 5bc918995c6c589b833ba1f189a8b92fa22202ae Mon Sep 17 00:00:00 2001
>> > From: Morten Rasmussen <morten.rasmussen@arm.com>
>> > Date: Wed, 17 Aug 2016 15:30:43 +0100
>> > Subject: [PATCH] sched/fair: Track peak per-entity utilization
>> >
>> > When using PELT (per-entity load tracking) utilization to place tasks at
>> > wake-up using the decayed utilization (due to sleep) leads to
>> > under-estimation of true utilization of the task. This could mean
>> > putting the task on a cpu with less available capacity than is actually
>> > needed. This issue can be mitigated by using 'peak' utilization instead
>> > of the decayed utilization for placement decisions, e.g. at task
>> > wake-up.
>> >
>> > The 'peak' utilization metric, util_peak, tracks util_avg when the task
>> > is running and retains its previous value while the task is
>> > blocked/waiting on the rq. It is instantly updated to track util_avg
>> > again as soon as the task running again.
>>
>> Maybe this will lead to disable wake affine due to a spike peak value
>> for a low average load task.
>
> I assume you are referring to using task_util_peak() instead of
> task_util() in wake_cap()?

Yes.

>
> The peak value should never exceed the util_avg accumulated by the task
> last time it ran. So any spike has to be caused by the task accumulating
> more utilization last time it ran. We don't know if it a spike or a more

I see.

> permanent change in behaviour, so we have to guess. So a spike on an
> asymmetric system could cause us to disable wake affine in some
> circumstances (either prev_cpu or waker cpu has to be low compute
> capacity) for the following wake-up.
>
> SMP should be unaffected as we should bail out on the previous
> condition.

Why capacity_orig instead of capacity, since it is checked at each
wake-up and the rt class/interrupts may already have consumed a lot of
the cpu's capacity?

>
> The counter-example is task with a fairly long busy period and a much
> longer period (cycle). Its util_avg might have decayed away since the
> last activation so it appears very small at wake-up and we end up
> putting it on a low capacity cpu every time even though it keeps the cpu
> busy for a long time every time it wakes up.

Agreed, that's the reason for under-estimation concern.

>
> Did that answer your question?

Yeah, thanks for the clarification.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly
  2016-08-19  1:43               ` Wanpeng Li
@ 2016-08-19 14:03                 ` Morten Rasmussen
  2016-08-22  1:48                   ` Wanpeng Li
  0 siblings, 1 reply; 45+ messages in thread
From: Morten Rasmussen @ 2016-08-19 14:03 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Yuyang Du,
	Vincent Guittot, Mike Galbraith, sgurrappadi, Koan-Sin Tan,
	小林敬太,
	linux-kernel

On Fri, Aug 19, 2016 at 09:43:00AM +0800, Wanpeng Li wrote:
> 2016-08-18 21:45 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> > I assume you are referring to using task_util_peak() instead of
> > task_util() in wake_cap()?
> 
> Yes.
> 
> >
> > The peak value should never exceed the util_avg accumulated by the task
> > last time it ran. So any spike has to be caused by the task accumulating
> > more utilization last time it ran. We don't know if it a spike or a more
> 
> I see.
> 
> > permanent change in behaviour, so we have to guess. So a spike on an
> > asymmetric system could cause us to disable wake affine in some
> > circumstances (either prev_cpu or waker cpu has to be low compute
> > capacity) for the following wake-up.
> >
> > SMP should be unaffected as we should bail out on the previous
> > condition.
> 
> Why capacity_orig instead of capacity since it is checked each time
> wakeup and maybe rt class/interrupt have already occupied many cpu
> utilization.

We could switch to capacity for this condition if we also change the
spare capacity evaluation in find_idlest_group() to do the same. It
would open up for SMP systems to take the find_idlest_group() route if the
SD_BALANCE_WAKE flag is set.

The reason why I have avoided capacity and used capacity_orig instead
is that in previous discussions about scheduling behaviour under
rt/dl/irq pressure it has not been clear to me whether we want to move tasks
away from cpus with capacity < capacity_orig or not. The choice depends
on the use-case.

In some cases taking rt/dl/irq pressure into account is more complicated
as we don't know the capacities available in a sched_group without
iterating over all the cpus. However, I don't think it would complicate
these patches. It is more a question of whether everyone is happy with
additional conditions in their wake-up path. I guess we could make it a
sched_feature if people are interested?

In short, I used capacity_orig to play it safe ;-)
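
To illustrate the difference, here is a stand-alone toy model; the
helpers and all numbers below are invented and only mimic the shape of
capacity_orig_of()/capacity_of():

#include <stdio.h>

/* Invented values: cpu0 is a big cpu under heavy rt/irq pressure. */
static unsigned long capacity_orig_of(int cpu)
{
	return cpu == 0 ? 1024 : 446;
}

static unsigned long capacity_of(int cpu)
{
	unsigned long rt_pressure = (cpu == 0) ? 700 : 0;

	return capacity_orig_of(cpu) - rt_pressure;
}

int main(void)
{
	unsigned long task_util = 350, capacity_margin = 1280;
	int cpu;

	for (cpu = 0; cpu < 2; cpu++) {
		printf("cpu%d fits: by capacity_orig=%d by capacity=%d\n", cpu,
		       capacity_orig_of(cpu) * 1024 >= task_util * capacity_margin,
		       capacity_of(cpu) * 1024 >= task_util * capacity_margin);
	}
	return 0;
}

With these numbers a big cpu under heavy rt/irq pressure still 'fits'
the task going by capacity_orig but not going by capacity, which is
exactly the behavioural choice discussed above.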

> > The counter-example is task with a fairly long busy period and a much
> > longer period (cycle). Its util_avg might have decayed away since the
> > last activation so it appears very small at wake-up and we end up
> > putting it on a low capacity cpu every time even though it keeps the cpu
> > busy for a long time every time it wakes up.
> 
> Agreed, that's the reason for under-estimation concern.
> 
> >
> > Did that answer your question?
> 
> Yeah, thanks for the clarification.

You are welcome.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly
  2016-08-19 14:03                 ` Morten Rasmussen
@ 2016-08-22  1:48                   ` Wanpeng Li
  2016-08-22 11:29                     ` Morten Rasmussen
  0 siblings, 1 reply; 45+ messages in thread
From: Wanpeng Li @ 2016-08-22  1:48 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Yuyang Du,
	Vincent Guittot, Mike Galbraith, sgurrappadi, Koan-Sin Tan,
	小林敬太,
	linux-kernel

2016-08-19 22:03 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> On Fri, Aug 19, 2016 at 09:43:00AM +0800, Wanpeng Li wrote:
>> 2016-08-18 21:45 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
>> > I assume you are referring to using task_util_peak() instead of
>> > task_util() in wake_cap()?
>>
>> Yes.
>>
>> >
>> > The peak value should never exceed the util_avg accumulated by the task
>> > last time it ran. So any spike has to be caused by the task accumulating
>> > more utilization last time it ran. We don't know if it a spike or a more
>>
>> I see.
>>
>> > permanent change in behaviour, so we have to guess. So a spike on an
>> > asymmetric system could cause us to disable wake affine in some
>> > circumstances (either prev_cpu or waker cpu has to be low compute
>> > capacity) for the following wake-up.
>> >
>> > SMP should be unaffected as we should bail out on the previous
>> > condition.
>>
>> Why capacity_orig instead of capacity since it is checked each time
>> wakeup and maybe rt class/interrupt have already occupied many cpu
>> utilization.
>
> We could switch to capacity for this condition if we also change the
> spare capacity evaluation in find_idlest_group() to do the same. It
> would open up for SMP systems to take find_idlest_group() route if the
> SD_BALANCE_WAKE flag is set.
>
> The reason why I have avoided capacity and used capacity_orig instead
> is that in previous discussions about scheduling behaviour under
> rt/dl/irq pressure it has been clear to me whether we want to move tasks
> away from cpus with capacity < capacity_orig or not. The choice depends
> on the use-case.
>
> In some cases taking rt/dl/irq pressure into account is more complicated
> as we don't know the capacities available in a sched_group without
> iterating over all the cpus. However, I don't think it would complicate
> these patches. It is more a question whether everyone are happy with
> additional conditions in their wake-up path. I guess we could make it a
> sched_feature if people are interested?
>
> In short, I used capacity_orig to play it safe ;-)

Actually you mixed capacity_orig and capacity when evaluating max spare cap.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly
  2016-08-22  1:48                   ` Wanpeng Li
@ 2016-08-22 11:29                     ` Morten Rasmussen
  0 siblings, 0 replies; 45+ messages in thread
From: Morten Rasmussen @ 2016-08-22 11:29 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Yuyang Du,
	Vincent Guittot, Mike Galbraith, sgurrappadi, Koan-Sin Tan,
	小林敬太,
	linux-kernel

On Mon, Aug 22, 2016 at 09:48:19AM +0800, Wanpeng Li wrote:
> 2016-08-19 22:03 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> > On Fri, Aug 19, 2016 at 09:43:00AM +0800, Wanpeng Li wrote:
> >> 2016-08-18 21:45 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> >> > I assume you are referring to using task_util_peak() instead of
> >> > task_util() in wake_cap()?
> >>
> >> Yes.
> >>
> >> >
> >> > The peak value should never exceed the util_avg accumulated by the task
> >> > last time it ran. So any spike has to be caused by the task accumulating
> >> > more utilization last time it ran. We don't know if it a spike or a more
> >>
> >> I see.
> >>
> >> > permanent change in behaviour, so we have to guess. So a spike on an
> >> > asymmetric system could cause us to disable wake affine in some
> >> > circumstances (either prev_cpu or waker cpu has to be low compute
> >> > capacity) for the following wake-up.
> >> >
> >> > SMP should be unaffected as we should bail out on the previous
> >> > condition.
> >>
> >> Why capacity_orig instead of capacity since it is checked each time
> >> wakeup and maybe rt class/interrupt have already occupied many cpu
> >> utilization.
> >
> > We could switch to capacity for this condition if we also change the
> > spare capacity evaluation in find_idlest_group() to do the same. It
> > would open up for SMP systems to take find_idlest_group() route if the
> > SD_BALANCE_WAKE flag is set.
> >
> > The reason why I have avoided capacity and used capacity_orig instead
> > is that in previous discussions about scheduling behaviour under
> > rt/dl/irq pressure it has been clear to me whether we want to move tasks
> > away from cpus with capacity < capacity_orig or not. The choice depends
> > on the use-case.
> >
> > In some cases taking rt/dl/irq pressure into account is more complicated
> > as we don't know the capacities available in a sched_group without
> > iterating over all the cpus. However, I don't think it would complicate
> > these patches. It is more a question whether everyone are happy with
> > additional conditions in their wake-up path. I guess we could make it a
> > sched_feature if people are interested?
> >
> > In short, I used capacity_orig to play it safe ;-)
> 
> Actually you mixed capacity_orig and capacity when evaluating max spare cap.

Right, that is a mistake. Thanks for pointing that out :-)

Morten

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2016-08-22 11:29 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-25 13:34 [PATCH v3 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
2016-07-25 13:34 ` [PATCH v3 01/13] sched: Fix power to capacity renaming in comment Morten Rasmussen
2016-07-25 13:34 ` [PATCH v3 02/13] sched/fair: Consistent use of prev_cpu in wakeup path Morten Rasmussen
2016-07-25 13:34 ` [PATCH v3 03/13] sched/fair: Optimize find_idlest_cpu() when there is no choice Morten Rasmussen
2016-07-25 13:34 ` [PATCH v3 04/13] sched/core: Remove unnecessary null-pointer check Morten Rasmussen
2016-08-18 10:56   ` [tip:sched/core] sched/core: Remove unnecessary NULL-pointer check tip-bot for Morten Rasmussen
2016-07-25 13:34 ` [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Morten Rasmussen
2016-08-15 10:54   ` Peter Zijlstra
2016-08-15 11:43     ` Morten Rasmussen
2016-08-18 10:56     ` [tip:sched/core] sched/core: Clarify SD_flags comment tip-bot for Peter Zijlstra
2016-08-17  8:42   ` [PATCH v3 05/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Wanpeng Li
2016-08-17  9:23     ` Morten Rasmussen
2016-08-17  9:26       ` Wanpeng Li
2016-08-18 10:56   ` [tip:sched/core] sched/core: " tip-bot for Morten Rasmussen
2016-07-25 13:34 ` [PATCH v3 06/13] sched/core: Pass child domain into sd_init Morten Rasmussen
2016-08-18 10:57   ` [tip:sched/core] sched/core: Pass child domain into sd_init() tip-bot for Morten Rasmussen
2016-07-25 13:34 ` [PATCH v3 07/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems Morten Rasmussen
2016-08-18 10:57   ` [tip:sched/core] sched/core: " tip-bot for Morten Rasmussen
2016-07-25 13:34 ` [PATCH v3 08/13] sched: Store maximum per-cpu capacity in root domain Morten Rasmussen
2016-08-01 18:53   ` Dietmar Eggemann
2016-08-16 12:24     ` Vincent Guittot
2016-08-18 10:58     ` [tip:sched/core] sched/core: Store maximum per-CPU " tip-bot for Dietmar Eggemann
2016-07-25 13:34 ` [PATCH v3 09/13] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
2016-08-15 13:39   ` Peter Zijlstra
2016-08-15 15:01     ` Morten Rasmussen
2016-08-15 15:10       ` Peter Zijlstra
2016-08-15 15:30         ` Morten Rasmussen
2016-08-18 10:58   ` [tip:sched/core] sched/fair: Let asymmetric CPU " tip-bot for Morten Rasmussen
2016-07-25 13:34 ` [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly Morten Rasmussen
2016-08-15 14:23   ` Peter Zijlstra
2016-08-15 15:42     ` Morten Rasmussen
2016-08-18  8:40       ` Morten Rasmussen
2016-08-18 10:24         ` Morten Rasmussen
2016-08-18 11:46           ` Wanpeng Li
2016-08-18 13:45             ` Morten Rasmussen
2016-08-19  1:43               ` Wanpeng Li
2016-08-19 14:03                 ` Morten Rasmussen
2016-08-22  1:48                   ` Wanpeng Li
2016-08-22 11:29                     ` Morten Rasmussen
2016-07-25 13:34 ` [PATCH v3 11/13] sched/fair: Consider spare capacity in find_idlest_group() Morten Rasmussen
2016-08-16 13:57   ` Vincent Guittot
2016-08-18 11:16     ` Morten Rasmussen
2016-08-18 12:28       ` Peter Zijlstra
2016-07-25 13:34 ` [PATCH v3 12/13] sched: Add per-cpu min capacity to sched_group_capacity Morten Rasmussen
2016-07-25 13:34 ` [PATCH v3 13/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups Morten Rasmussen
