[PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support
@ 2016-05-23 10:58 Morten Rasmussen
  2016-05-23 10:58 ` [PATCH 01/16] sched: Fix power to capacity renaming in comment Morten Rasmussen
                   ` (15 more replies)
  0 siblings, 16 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

Hi,

The scheduler is currently not doing much to help performance on systems with
asymmetric compute capacities (read ARM big.LITTLE). This series improves the
situation with a few tweaks mainly to the task wake-up path that considers
compute capacity at wake-up and not just whether a cpu is idle for these
systems. This gives us consistent, and potentially higher, throughput in
partially utilized scenarious. SMP behaviour and performance should be
unaffected.

Test 0:
	for i in `seq 1 10`; \
	       do sysbench --test=cpu --max-time=3 --num-threads=1 run; \
	       done \
	| awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
	       END {print "Average events: " sum/runs}'

Target: ARM TC2 (2xA15+3xA7)

	(Higher is better)
tip:	Average events: 150.2
patch:	Average events: 217.9

Test 1:
	perf stat --null --repeat 10 -- \
	perf bench sched messaging -g 50 -l 5000

Target: Intel IVB-EP (2*10*2)

tip:	4.831538935 seconds time elapsed ( +-  1.58% )
patch:	4.839951382 seconds time elapsed ( +-  1.01% )

Target: ARM TC2 A7-only (3xA7) (-l 1000)

tip:	61.406552538 seconds time elapsed ( +-  0.12% )
patch:	61.589263159 seconds time elapsed ( +-  0.22% )

Active migration of tasks away from small capacity cpus isn't addressed
in this set although it is necessary for consistent throughput in other
scenarios on asymmetric cpu capacity systems.

Patch   1-4: Generic fixes and clean-ups.
Patch  5-13: Improve capacity awareness.
Patch 14-16: Arch features for arm to enable asymmetric capacity support.

Dietmar Eggemann (1):
  sched: Store maximum per-cpu capacity in root domain

Morten Rasmussen (15):
  sched: Fix power to capacity renaming in comment
  sched/fair: Consistent use of prev_cpu in wakeup path
  sched/fair: Disregard idle task wakee_flips in wake_wide
  sched/fair: Optimize find_idlest_cpu() when there is no choice
  sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  sched: Disable WAKE_AFFINE for asymmetric configurations
  sched: Make SD_BALANCE_WAKE a topology flag
  sched/fair: Let asymmetric cpu configurations balance at wake-up
  sched/fair: Compute task/cpu utilization at wake-up more correctly
  sched/fair: Consider spare capacity in find_idlest_group()
  sched: Add per-cpu max capacity to sched_group_capacity
  sched/fair: Avoid pulling tasks from non-overloaded higher capacity
    groups
  arm: Set SD_ASYM_CPUCAPACITY for big.LITTLE platforms
  arm: Set SD_BALANCE_WAKE flag for asymmetric capacity systems
  arm: Update arch_scale_cpu_capacity() to reflect change to define

 arch/arm/include/asm/topology.h |   5 +
 arch/arm/kernel/topology.c      |  25 ++++-
 include/linux/sched.h           |   3 +-
 kernel/sched/core.c             |  25 ++++-
 kernel/sched/fair.c             | 217 ++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h            |   5 +-
 6 files changed, 250 insertions(+), 30 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH 01/16] sched: Fix power to capacity renaming in comment
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-05-23 10:58 ` [PATCH 02/16] sched/fair: Consistent use of prev_cpu in wakeup path Morten Rasmussen
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

It is seems that this one escaped Nico's renaming of cpu_power to
cpu_capacity a while back.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 38526b6..463b91f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1015,7 +1015,7 @@ extern void wake_up_q(struct wake_q_head *head);
 #define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
 #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
-#define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu power */
+#define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu capacity */
 #define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 02/16] sched/fair: Consistent use of prev_cpu in wakeup path
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
  2016-05-23 10:58 ` [PATCH 01/16] sched: Fix power to capacity renaming in comment Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-06-01 19:49   ` Peter Zijlstra
  2016-05-23 10:58 ` [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide Morten Rasmussen
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

In commit ac66f5477239 ("sched/numa: Introduce migrate_swap()")
select_task_rq() got a 'cpu' argument to enable overriding of prev_cpu
in special cases (NUMA task swapping). However, the
select_task_rq_fair() helper functions: wake_affine() and
select_idle_sibling(), still use task_cpu(p) directly to work out
prev_cpu which leads to inconsistencies.

This patch passes prev_cpu (potentially overridden by NUMA code) into
the helper functions to ensure prev_cpu is indeed the same cpu
everywhere in the wakeup path.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 218f8e8..c49e25a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -656,7 +656,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 #ifdef CONFIG_SMP
-static int select_idle_sibling(struct task_struct *p, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
 
 /*
@@ -1502,7 +1502,8 @@ static void task_numa_compare(struct task_numa_env *env,
 	 * Call select_idle_sibling to maybe find a better one.
 	 */
 	if (!cur)
-		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+		env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
+						   env->dst_cpu);
 
 assign:
 	assigned = true;
@@ -5013,18 +5014,18 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
-static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
+static int wake_affine(struct sched_domain *sd, struct task_struct *p,
+		       int prev_cpu, int sync)
 {
 	s64 this_load, load;
 	s64 this_eff_load, prev_eff_load;
-	int idx, this_cpu, prev_cpu;
+	int idx, this_cpu;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
 
 	idx	  = sd->wake_idx;
 	this_cpu  = smp_processor_id();
-	prev_cpu  = task_cpu(p);
 	load	  = source_load(prev_cpu, idx);
 	this_load = target_load(this_cpu, idx);
 
@@ -5189,11 +5190,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int target)
 {
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i = task_cpu(p);
 
 	if (idle_cpu(target))
 		return target;
@@ -5201,8 +5201,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	/*
 	 * If the prevous cpu is cache affine and idle, don't be stupid.
 	 */
-	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
-		return i;
+	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+		return prev;
 
 	/*
 	 * Otherwise, iterate the domains and find an eligible idle cpu.
@@ -5223,6 +5223,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
+			int i;
+
 			if (!cpumask_intersects(sched_group_cpus(sg),
 						tsk_cpus_allowed(p)))
 				goto next;
@@ -5331,13 +5333,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (affine_sd) {
 		sd = NULL; /* Prefer wake_affine over balance flags */
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+		if (cpu != prev_cpu && wake_affine(affine_sd, p, prev_cpu, sync))
 			new_cpu = cpu;
 	}
 
 	if (!sd) {
 		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
-			new_cpu = select_idle_sibling(p, new_cpu);
+			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
 	} else while (sd) {
 		struct sched_group *group;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
  2016-05-23 10:58 ` [PATCH 01/16] sched: Fix power to capacity renaming in comment Morten Rasmussen
  2016-05-23 10:58 ` [PATCH 02/16] sched/fair: Consistent use of prev_cpu in wakeup path Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-05-23 11:12   ` Mike Galbraith
  2016-05-23 10:58 ` [PATCH 04/16] sched/fair: Optimize find_idlest_cpu() when there is no choice Morten Rasmussen
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

wake_wide() is based on task wakee_flips of the waker and the wakee to
decide whether an affine wakeup is desirable. On lightly loaded systems
the waker is frequently the idle task (pid=0) which can accumulate a lot
of wakee_flips in that scenario. It makes little sense to prevent affine
wakeups on an idle cpu due to the idle task wakee_flips, so it makes
more sense to ignore them in wake_wide().

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c49e25a..0fe3020 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5007,6 +5007,10 @@ static int wake_wide(struct task_struct *p)
 	unsigned int slave = p->wakee_flips;
 	int factor = this_cpu_read(sd_llc_size);
 
+	/* Don't let the idle task prevent affine wakeups */
+	if (is_idle_task(current))
+		return 0;
+
 	if (master < slave)
 		swap(master, slave);
 	if (slave < factor || master < slave * factor)
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 04/16] sched/fair: Optimize find_idlest_cpu() when there is no choice
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (2 preceding siblings ...)
  2016-05-23 10:58 ` [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-05-24  6:29   ` Mike Galbraith
  2016-06-01 19:59   ` Peter Zijlstra
  2016-05-23 10:58 ` [PATCH 05/16] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Morten Rasmussen
                   ` (11 subsequent siblings)
  15 siblings, 2 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

In the current find_idlest_group()/find_idlest_cpu() search we end up
calling find_idlest_cpu() in a sched_group containing only one cpu in
the end. Checking idle-states becomes pointless when there is no
alternative, so bail out instead.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fe3020..564215d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5155,6 +5155,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	int shallowest_idle_cpu = -1;
 	int i;
 
+	/* Check if we have any choice */
+	if (group->group_weight == 1) {
+		return cpumask_first(sched_group_cpus(group));
+	}
+
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
 		if (idle_cpu(i)) {
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 05/16] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (3 preceding siblings ...)
  2016-05-23 10:58 ` [PATCH 04/16] sched/fair: Optimize find_idlest_cpu() when there is no choice Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-05-23 10:58 ` [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations Morten Rasmussen
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

Add a topology flag to the sched_domain hierarchy indicating which
domains span cpus of different capacity (e.g. big.LITTLE). This
information is currently only available through iterating through the
capacities of all the cpus.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 463b91f..37752db 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1015,6 +1015,7 @@ extern void wake_up_q(struct wake_q_head *head);
 #define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
 #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
+#define SD_ASYM_CPUCAPACITY	0x0040  /* Domain contains cpus of different capacity*/
 #define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu capacity */
 #define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 404c078..d9619a3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5659,6 +5659,7 @@ static int sd_degenerate(struct sched_domain *sd)
 			 SD_BALANCE_FORK |
 			 SD_BALANCE_EXEC |
 			 SD_SHARE_CPUCAPACITY |
+			 SD_ASYM_CPUCAPACITY |
 			 SD_SHARE_PKG_RESOURCES |
 			 SD_SHARE_POWERDOMAIN)) {
 		if (sd->groups != sd->groups->next)
@@ -5689,6 +5690,7 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 				SD_BALANCE_NEWIDLE |
 				SD_BALANCE_FORK |
 				SD_BALANCE_EXEC |
+				SD_ASYM_CPUCAPACITY |
 				SD_SHARE_CPUCAPACITY |
 				SD_SHARE_PKG_RESOURCES |
 				SD_PREFER_SIBLING |
@@ -6303,14 +6305,16 @@ static int sched_domains_curr_level;
  * SD_NUMA                - describes NUMA topologies
  * SD_SHARE_POWERDOMAIN   - describes shared power domain
  *
- * Odd one out:
+ * Odd ones out:
  * SD_ASYM_PACKING        - describes SMT quirks
+ * SD_ASYM_CPUCAPACITY    - describes mixed capacity topologies
  */
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY |		\
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA |			\
 	 SD_ASYM_PACKING |		\
+	 SD_ASYM_CPUCAPACITY |		\
 	 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (4 preceding siblings ...)
  2016-05-23 10:58 ` [PATCH 05/16] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-05-24  9:10   ` Vincent Guittot
  2016-05-23 10:58 ` [PATCH 07/16] sched: Make SD_BALANCE_WAKE a topology flag Morten Rasmussen
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

If the system has cpu of different compute capacities (e.g. big.LITTLE)
let affine wakeups be constrained to cpus of the same type.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d9619a3..558ec4a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6410,6 +6410,9 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		sd->idle_idx = 1;
 	}
 
+	if (sd->flags & SD_ASYM_CPUCAPACITY)
+		sd->flags &= ~SD_WAKE_AFFINE;
+
 	sd->private = &tl->data;
 
 	return sd;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 07/16] sched: Make SD_BALANCE_WAKE a topology flag
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (5 preceding siblings ...)
  2016-05-23 10:58 ` [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-05-24 23:52   ` Yuyang Du
  2016-06-01 20:18   ` Peter Zijlstra
  2016-05-23 10:58 ` [PATCH 08/16] sched: Store maximum per-cpu capacity in root domain Morten Rasmussen
                   ` (8 subsequent siblings)
  15 siblings, 2 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

For systems with the SD_ASYM_CPUCAPACITY flag set on higher level in the
sched_domain hierarchy we need a way to enable wake-up balancing for the
lower levels as well as we may want to balance tasks that don't fit the
capacity of the previous cpu.

We have the option of introducing a new topology flag to express this
requirement, or let the existing SD_BALANCE_WAKE flag be set by the
architecture as a topology flag. The former means introducing yet
another flag, the latter breaks the current meaning of topology flags.
None of the options are really desirable.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 558ec4a..8014b4a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5658,6 +5658,7 @@ static int sd_degenerate(struct sched_domain *sd)
 			 SD_BALANCE_NEWIDLE |
 			 SD_BALANCE_FORK |
 			 SD_BALANCE_EXEC |
+			 SD_BALANCE_WAKE |
 			 SD_SHARE_CPUCAPACITY |
 			 SD_ASYM_CPUCAPACITY |
 			 SD_SHARE_PKG_RESOURCES |
@@ -5690,6 +5691,7 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 				SD_BALANCE_NEWIDLE |
 				SD_BALANCE_FORK |
 				SD_BALANCE_EXEC |
+				SD_BALANCE_WAKE |
 				SD_ASYM_CPUCAPACITY |
 				SD_SHARE_CPUCAPACITY |
 				SD_SHARE_PKG_RESOURCES |
@@ -6308,6 +6310,7 @@ static int sched_domains_curr_level;
  * Odd ones out:
  * SD_ASYM_PACKING        - describes SMT quirks
  * SD_ASYM_CPUCAPACITY    - describes mixed capacity topologies
+ * SD_BALANCE_WAKE	  - controls wake-up balancing (expensive)
  */
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY |		\
@@ -6315,6 +6318,7 @@ static int sched_domains_curr_level;
 	 SD_NUMA |			\
 	 SD_ASYM_PACKING |		\
 	 SD_ASYM_CPUCAPACITY |		\
+	 SD_BALANCE_WAKE |		\
 	 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 08/16] sched: Store maximum per-cpu capacity in root domain
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (6 preceding siblings ...)
  2016-05-23 10:58 ` [PATCH 07/16] sched: Make SD_BALANCE_WAKE a topology flag Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-05-23 10:58 ` [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

To be able to compare the capacity of the target cpu with the highest
available cpu capacity, store the maximum per-cpu capacity in the root
domain.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c  | 9 +++++++++
 kernel/sched/sched.h | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8014b4a..1d4059c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6841,6 +6841,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 	enum s_alloc alloc_state;
 	struct sched_domain *sd;
 	struct s_data d;
+	struct rq *rq = NULL;
 	int i, ret = -ENOMEM;
 
 	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
@@ -6891,11 +6892,19 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 	/* Attach the domains */
 	rcu_read_lock();
 	for_each_cpu(i, cpu_map) {
+		rq = cpu_rq(i);
 		sd = *per_cpu_ptr(d.sd, i);
 		cpu_attach_domain(sd, d.rd, i);
+
+		if (rq->cpu_capacity_orig > rq->rd->max_cpu_capacity)
+			rq->rd->max_cpu_capacity = rq->cpu_capacity_orig;
 	}
 	rcu_read_unlock();
 
+	if (rq)
+		pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
+			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
+
 	ret = 0;
 error:
 	__free_domain_allocs(&d, alloc_state, cpu_map);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e51145e..72150c2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -564,6 +564,8 @@ struct root_domain {
 	 */
 	cpumask_var_t rto_mask;
 	struct cpupri cpupri;
+
+	unsigned long max_cpu_capacity;
 };
 
 extern struct root_domain def_root_domain;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (7 preceding siblings ...)
  2016-05-23 10:58 ` [PATCH 08/16] sched: Store maximum per-cpu capacity in root domain Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-05-24  0:04   ` Yuyang Du
                     ` (3 more replies)
  2016-05-23 10:58 ` [PATCH 10/16] sched/fair: Compute task/cpu utilization at wake-up more correctly Morten Rasmussen
                   ` (6 subsequent siblings)
  15 siblings, 4 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
configurations SD_WAKE_AFFINE is only desirable if the waking task's
compute demand (utilization) is suitable for the cpu capacities
available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
balancing take over (find_idlest_{group, cpu}()).

The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
containing cpus with different capacities. This is enforced by a
previous patch based on the SD_ASYM_CPUCAPACITY flag.

Ideally, we shouldn't set 'want_affine' in the first place, but we don't
know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
traversing them.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 564215d..ce44fa7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
 #endif
 
+/*
+ * The margin used when comparing utilization with cpu capacity:
+ * util * 1024 < capacity * margin
+ */
+unsigned int capacity_margin = 1280; /* ~20% */
+
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
 	lw->weight += inc;
@@ -5293,6 +5299,25 @@ static int cpu_util(int cpu)
 	return (util >= capacity) ? capacity : util;
 }
 
+static inline int task_util(struct task_struct *p)
+{
+	return p->se.avg.util_avg;
+}
+
+static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
+{
+	long delta;
+	long prev_cap = capacity_of(prev_cpu);
+
+	delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
+
+	/* prev_cpu is fairly close to max, no need to abort wake_affine */
+	if (delta < prev_cap >> 3)
+		return 0;
+
+	return prev_cap * 1024 < task_util(p) * capacity_margin;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5316,7 +5341,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
-		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
+			      && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
 	}
 
 	rcu_read_lock();
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 10/16] sched/fair: Compute task/cpu utilization at wake-up more correctly
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (8 preceding siblings ...)
  2016-05-23 10:58 ` [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-05-23 10:58 ` [PATCH 11/16] sched/fair: Consider spare capacity in find_idlest_group() Morten Rasmussen
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

At task wake-up load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking which may have
been updated in the meantime. Basically, the task's own utilization
hasn't yet accounted for the sleep decay, while the cfs_rq may have
(partially). Estimating the cfs_rq utilization in case the task is
migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg
is therefore incorrect as the two load-tracking signals aren't time
synchronized (different last update).

To solve this problem, this patch introduces task_util_wake() which
computes the decayed task utilization based on the last update of the
previous cpu's last load-tracking update. It is done without having to
take the rq lock, similar to how it is done in remove_entity_load_avg().

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce44fa7..6d3369a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5304,6 +5304,75 @@ static inline int task_util(struct task_struct *p)
 	return p->se.avg.util_avg;
 }
 
+/*
+ * task_util_wake: Returns an updated estimate of the utilization contribution
+ * of a waking task. At wake-up the task blocked utilization contribution
+ * (cfs_rq->avg) may have decayed while the utilization tracking of the task
+ * (se->avg) hasn't yet.
+ * Note that this estimate isn't perfectly accurate as the 1ms boundaries used
+ * for updating util_avg in __update_load_avg() are not considered here. This
+ * results in an error of up to 1ms utilization decay/accumulation which leads
+ * to an absolute util_avg error margin of 1024*1024/LOAD_AVG_MAX ~= 22
+ * (for LOAD_AVG_MAX = 47742).
+ */
+static inline int task_util_wake(struct task_struct *p)
+{
+	struct cfs_rq *prev_cfs_rq = &task_rq(p)->cfs;
+	struct sched_avg *psa = &p->se.avg;
+	u64 cfs_rq_last_update, p_last_update, delta;
+	u32 util_decayed;
+
+	p_last_update = psa->last_update_time;
+
+	/*
+	 * Task on rq (exec()) should be load-tracking aligned already.
+	 * New tasks have no history and should use the init value.
+	 */
+	if (p->se.on_rq || !p_last_update)
+		return task_util(p);
+
+	cfs_rq_last_update = cfs_rq_last_update_time(prev_cfs_rq);
+	delta = cfs_rq_last_update - p_last_update;
+
+	if ((s64)delta <= 0)
+		return task_util(p);
+
+	delta >>= 20;
+
+	if (!delta)
+		return task_util(p);
+
+	util_decayed = decay_load((u64)psa->util_sum, delta);
+	util_decayed /= LOAD_AVG_MAX;
+
+	/*
+	 * psa->util_avg can be slightly out of date as it is only updated
+	 * when a 1ms boundary is crossed.
+	 * See 'decayed' in __update_load_avg()
+	 */
+	util_decayed = min_t(unsigned long, util_decayed, task_util(p));
+
+	return util_decayed;
+}
+
+/*
+ * cpu_util_wake: Compute cpu utilization with any contributions from
+ * the waking task p removed.
+ */
+static int cpu_util_wake(int cpu, struct task_struct *p)
+{
+	unsigned long util, capacity;
+
+	/* Task has no contribution or is new */
+	if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
+		return cpu_util(cpu);
+
+	capacity = capacity_orig_of(cpu);
+	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util_wake(p), 0);
+
+	return (util >= capacity) ? capacity : util;
+}
+
 static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 {
 	long delta;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 11/16] sched/fair: Consider spare capacity in find_idlest_group()
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (9 preceding siblings ...)
  2016-05-23 10:58 ` [PATCH 10/16] sched/fair: Compute task/cpu utilization at wake-up more correctly Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-05-23 10:58 ` [PATCH 12/16] sched: Add per-cpu max capacity to sched_group_capacity Morten Rasmussen
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

In low-utilization scenarios comparing relative loads in
find_idlest_group() doesn't always lead to the most optimum choice.
Systems with groups containing different numbers of cpus and/or cpus of
different compute capacity are significantly better off when considering
spare capacity rather than relative load in those scenarios.

In addition to existing load based search an alternative spare capacity
based candidate sched_group is found and selected instead if sufficient
spare capacity exists. If not, existing behaviour is preserved.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 41 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d3369a..d4c5339 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5090,6 +5090,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	return 1;
 }
 
+static inline int task_util(struct task_struct *p);
+static int cpu_util_wake(int cpu, struct task_struct *p);
+
+static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
+{
+	return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+}
+
 /*
  * find_idlest_group finds and returns the least busy CPU group within the
  * domain.
@@ -5099,7 +5107,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
+	struct sched_group *most_spare_sg = NULL;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long most_spare = 0, this_spare = 0;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
@@ -5107,7 +5117,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load;
+		unsigned long load, avg_load, spare_cap, max_spare_cap;
 		int local_group;
 		int i;
 
@@ -5119,8 +5129,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_cpus(group));
 
-		/* Tally up the load of all CPUs in the group */
+		/*
+		 * Tally up the load of all CPUs in the group and find
+		 * the group containing the cpu with most spare capacity.
+		 */
 		avg_load = 0;
+		max_spare_cap = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
@@ -5130,6 +5144,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 				load = target_load(i, load_idx);
 
 			avg_load += load;
+
+			spare_cap = capacity_spare_wake(i, p);
+
+			if (spare_cap > max_spare_cap &&
+			    spare_cap > capacity_of(i) >> 3) {
+				max_spare_cap = spare_cap;
+			}
 		}
 
 		/* Adjust by relative CPU capacity of the group */
@@ -5137,12 +5158,27 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		if (local_group) {
 			this_load = avg_load;
-		} else if (avg_load < min_load) {
-			min_load = avg_load;
-			idlest = group;
+			this_spare = max_spare_cap;
+		} else {
+			if (avg_load < min_load) {
+				min_load = avg_load;
+				idlest = group;
+			}
+
+			if (most_spare < max_spare_cap) {
+				most_spare = max_spare_cap;
+				most_spare_sg = group;
+			}
 		}
 	} while (group = group->next, group != sd->groups);
 
+	/* Found a significant amount of spare capacity. */
+	if (this_spare > task_util(p) / 2 &&
+	    imbalance*this_spare > 100*most_spare)
+		return NULL;
+	else if (most_spare > task_util(p) / 2)
+		return most_spare_sg;
+
 	if (!idlest || 100*this_load < imbalance*min_load)
 		return NULL;
 	return idlest;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 12/16] sched: Add per-cpu max capacity to sched_group_capacity
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (10 preceding siblings ...)
  2016-05-23 10:58 ` [PATCH 11/16] sched/fair: Consider spare capacity in find_idlest_group() Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-05-23 10:58 ` [PATCH 13/16] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups Morten Rasmussen
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

struct sched_group_capacity currently represents the compute capacity
sum of all cpus in the sched_group. Unless it is divided by the
group_weight to get the average capacity per cpu it hides differences in
cpu capacity for mixed capacity systems (e.g. high RT/IRQ utilization or
ARM big.LITTLE). But even the average may not be sufficient if the group
covers cpus of different capacities. Instead, by extending struct
sched_group_capacity to indicate max per-cpu capacity in the group a
suitable group for a given task utilization can easily be found such
that cpus with reduced capacity can be avoided for tasks with high
utilization (not implemented by this patch).

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c  |  3 ++-
 kernel/sched/fair.c  | 17 ++++++++++++-----
 kernel/sched/sched.h |  3 ++-
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d4059c..4dd5cd7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5599,7 +5599,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 		printk(KERN_CONT " %*pbl",
 		       cpumask_pr_args(sched_group_cpus(group)));
 		if (group->sgc->capacity != SCHED_CAPACITY_SCALE) {
-			printk(KERN_CONT " (cpu_capacity = %d)",
+			printk(KERN_CONT " (cpu_capacity = %lu)",
 				group->sgc->capacity);
 		}
 
@@ -6070,6 +6070,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		 * die on a /0 trap.
 		 */
 		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
+		sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
 
 		/*
 		 * Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4c5339..0dd2d9c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6658,13 +6658,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 
 	cpu_rq(cpu)->cpu_capacity = capacity;
 	sdg->sgc->capacity = capacity;
+	sdg->sgc->max_capacity = capacity;
 }
 
 void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long capacity;
+	unsigned long capacity, max_capacity;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -6677,6 +6678,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 	}
 
 	capacity = 0;
+	max_capacity = 0;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -6701,11 +6703,12 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			 */
 			if (unlikely(!rq->sd)) {
 				capacity += capacity_of(cpu);
-				continue;
+			} else {
+				sgc = rq->sd->groups->sgc;
+				capacity += sgc->capacity;
 			}
 
-			sgc = rq->sd->groups->sgc;
-			capacity += sgc->capacity;
+			max_capacity = max(capacity, max_capacity);
 		}
 	} else  {
 		/*
@@ -6715,12 +6718,16 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 		group = child->groups;
 		do {
-			capacity += group->sgc->capacity;
+			struct sched_group_capacity *sgc = group->sgc;
+
+			capacity += sgc->capacity;
+			max_capacity = max(sgc->max_capacity, max_capacity);
 			group = group->next;
 		} while (group != child->groups);
 	}
 
 	sdg->sgc->capacity = capacity;
+	sdg->sgc->max_capacity = max_capacity;
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 72150c2..359f689 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -868,7 +868,8 @@ struct sched_group_capacity {
 	 * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
 	 * for a single CPU.
 	 */
-	unsigned int capacity;
+	unsigned long capacity;
+	unsigned long max_capacity; /* Max per-cpu capacity in group */
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
 	/*
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 13/16] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (11 preceding siblings ...)
  2016-05-23 10:58 ` [PATCH 12/16] sched: Add per-cpu max capacity to sched_group_capacity Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-05-23 10:58 ` [PATCH 14/16] arm: Set SD_ASYM_CPUCAPACITY for big.LITTLE platforms Morten Rasmussen
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

For asymmetric cpu capacity systems it is counter-productive for
throughput if low capacity cpus are pulling tasks from non-overloaded
cpus with higher capacity. The assumption is that higher cpu capacity is
preferred over running alone in a group with lower cpu capacity.

This patch rejects higher cpu capacity groups with one or less task per
cpu as potential busiest group which could otherwise lead to a series of
failing load-balancing attempts leading to a force-migration.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0dd2d9c..5213bb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6822,6 +6822,17 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 	return false;
 }
 
+/*
+ * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-cpu capacity than sched_group ref.
+ */
+static inline bool
+group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+	return sg->sgc->max_capacity * capacity_margin <
+						ref->sgc->max_capacity * 1024;
+}
+
 static inline enum
 group_type group_classify(struct sched_group *group,
 			  struct sg_lb_stats *sgs)
@@ -6925,6 +6936,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (sgs->avg_load <= busiest->avg_load)
 		return false;
 
+	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
+		goto asym_packing;
+
+	/* Candidate sg has no more than one task per cpu and has
+	 * higher per-cpu capacity. Migrating tasks to less capable
+	 * cpus may harm throughput. Maximize throughput,
+	 * power/energy consequences are not considered.
+	 */
+	if (sgs->sum_nr_running <= sgs->group_weight &&
+	    group_smaller_cpu_capacity(sds->local, sg))
+		return false;
+
+asym_packing:
 	/* This is the busiest node in its class. */
 	if (!(env->sd->flags & SD_ASYM_PACKING))
 		return true;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 14/16] arm: Set SD_ASYM_CPUCAPACITY for big.LITTLE platforms
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (12 preceding siblings ...)
  2016-05-23 10:58 ` [PATCH 13/16] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-05-23 10:58 ` [PATCH 15/16] arm: Set SD_BALANCE_WAKE flag for asymmetric capacity systems Morten Rasmussen
  2016-05-23 10:58 ` [PATCH 16/16] arm: Update arch_scale_cpu_capacity() to reflect change to define Morten Rasmussen
  15 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen, Russell King

Set the SD_ASYM_CPUCAPACITY flag for DIE level sched_domain on
big.LITTLE systems.

cc: Russell King <linux@arm.linux.org.uk>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 arch/arm/kernel/topology.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index ec279d1..1b81b31 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -78,6 +78,7 @@ static unsigned long *__cpu_capacity;
 #define cpu_capacity(cpu)	__cpu_capacity[cpu]
 
 static unsigned long middle_capacity = 1;
+static unsigned int big_little;
 
 /*
  * Iterate all CPUs' descriptor in DT and compute the efficiency
@@ -151,6 +152,8 @@ static void __init parse_dt_topology(void)
 		middle_capacity = ((max_capacity / 3)
 				>> (SCHED_CAPACITY_SHIFT-1)) + 1;
 
+	if (max_capacity && max_capacity != min_capacity)
+		big_little = 1;
 }
 
 /*
@@ -280,12 +283,17 @@ static inline int cpu_corepower_flags(void)
 	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
 }
 
+static inline int arm_cpu_cpu_flags(void)
+{
+	return big_little ? SD_ASYM_CPUCAPACITY : 0;
+}
+
 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
 	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ cpu_cpu_mask, arm_cpu_cpu_flags, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 15/16] arm: Set SD_BALANCE_WAKE flag for asymmetric capacity systems
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (13 preceding siblings ...)
  2016-05-23 10:58 ` [PATCH 14/16] arm: Set SD_ASYM_CPUCAPACITY for big.LITTLE platforms Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  2016-05-23 10:58 ` [PATCH 16/16] arm: Update arch_scale_cpu_capacity() to reflect change to define Morten Rasmussen
  15 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen, Russell King

Asymmetric capacity systems (e.g. ARM big.LITTLE) can not rely
exclusively on fast idle-based task placement at wake-up in all
scenarios. Enable SD_BALANCE_WAKE to have the option to do
load/utilization based wake-up task placement (existing tasks) if affine
wake-up fails.

cc: Russell King <linux@arm.linux.org.uk>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 arch/arm/kernel/topology.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 1b81b31..32d0a1b 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -280,18 +280,27 @@ void store_cpu_topology(unsigned int cpuid)
 
 static inline int cpu_corepower_flags(void)
 {
-	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
+	int flags = SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
+
+	return big_little ? flags | SD_BALANCE_WAKE : flags;
+}
+
+static inline int arm_cpu_core_flags(void)
+{
+	int flags = SD_SHARE_PKG_RESOURCES;
+
+	return big_little ? flags | SD_BALANCE_WAKE : flags;
 }
 
 static inline int arm_cpu_cpu_flags(void)
 {
-	return big_little ? SD_ASYM_CPUCAPACITY : 0;
+	return big_little ? SD_ASYM_CPUCAPACITY | SD_BALANCE_WAKE : 0;
 }
 
 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
 	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
-	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+	{ cpu_coregroup_mask, arm_cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
 	{ cpu_cpu_mask, arm_cpu_cpu_flags, SD_INIT_NAME(DIE) },
 	{ NULL, },
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 16/16] arm: Update arch_scale_cpu_capacity() to reflect change to define
  2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (14 preceding siblings ...)
  2016-05-23 10:58 ` [PATCH 15/16] arm: Set SD_BALANCE_WAKE flag for asymmetric capacity systems Morten Rasmussen
@ 2016-05-23 10:58 ` Morten Rasmussen
  15 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 10:58 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen, Russell King

arch_scale_cpu_capacity() is no longer a weak function but a #define
instead. Include the #define in topology.h.

cc: Russell King <linux@arm.linux.org.uk>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 arch/arm/include/asm/topology.h | 5 +++++
 arch/arm/kernel/topology.c      | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 370f7a7..a09a105 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -24,6 +24,11 @@ void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
 
+#define arch_scale_cpu_capacity arm_arch_scale_cpu_capacity
+struct sched_domain;
+extern
+unsigned long arm_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
+
 #else
 
 static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 32d0a1b..dc5f8aa 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
 
-unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+unsigned long arm_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
 	return per_cpu(cpu_scale, cpu);
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide
  2016-05-23 10:58 ` [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide Morten Rasmussen
@ 2016-05-23 11:12   ` Mike Galbraith
  2016-05-23 12:00     ` Morten Rasmussen
  0 siblings, 1 reply; 61+ messages in thread
From: Mike Galbraith @ 2016-05-23 11:12 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, linux-kernel

On Mon, 2016-05-23 at 11:58 +0100, Morten Rasmussen wrote:
> wake_wide() is based on task wakee_flips of the waker and the wakee to
> decide whether an affine wakeup is desirable. On lightly loaded systems
> the waker is frequently the idle task (pid=0) which can accumulate a lot
> of wakee_flips in that scenario. It makes little sense to prevent affine
> wakeups on an idle cpu due to the idle task wakee_flips, so it makes
> more sense to ignore them in wake_wide().

You sure?  What's the difference between a task flipping enough to
warrant spreading the load, and an interrupt source doing the same? 
 I've both witnessed firsthand, and received user confirmation of this
very thing improving utilization.

> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c49e25a..0fe3020 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5007,6 +5007,10 @@ static int wake_wide(struct task_struct *p)
>  	unsigned int slave = p->wakee_flips;
>  	int factor = this_cpu_read(sd_llc_size);
>  
> +	/* Don't let the idle task prevent affine wakeups */
> +	if (is_idle_task(current))
> +		return 0;
> +
>  	if (master < slave)
>  		swap(master, slave);
>  	if (slave < factor || master < slave * factor)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide
  2016-05-23 11:12   ` Mike Galbraith
@ 2016-05-23 12:00     ` Morten Rasmussen
  2016-05-23 13:00       ` Mike Galbraith
                         ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 12:00 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: peterz, mingo, dietmar.eggemann, yuyang.du, vincent.guittot,
	linux-kernel

On Mon, May 23, 2016 at 01:12:07PM +0200, Mike Galbraith wrote:
> On Mon, 2016-05-23 at 11:58 +0100, Morten Rasmussen wrote:
> > wake_wide() is based on task wakee_flips of the waker and the wakee to
> > decide whether an affine wakeup is desirable. On lightly loaded systems
> > the waker is frequently the idle task (pid=0) which can accumulate a lot
> > of wakee_flips in that scenario. It makes little sense to prevent affine
> > wakeups on an idle cpu due to the idle task wakee_flips, so it makes
> > more sense to ignore them in wake_wide().
> 
> You sure?  What's the difference between a task flipping enough to
> warrant spreading the load, and an interrupt source doing the same? 
>  I've both witnessed firsthand, and received user confirmation of this
> very thing improving utilization.

Right, I didn't consider the interrupt source scenario, my fault.

The problem then seems to be distinguishing truly idle and busy doing
interrupts. The issue that I observe is that wake_wide() likes pushing
tasks around in lightly scenarios which isn't desirable for power
management. Selecting the same cpu again may potentially let others
reach deeper C-state.

With that in mind I will if I can do better. Suggestions are welcome :-)

> 
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> > 
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c49e25a..0fe3020 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5007,6 +5007,10 @@ static int wake_wide(struct task_struct *p)
> >  	unsigned int slave = p->wakee_flips;
> >  	int factor = this_cpu_read(sd_llc_size);
> >  
> > +	/* Don't let the idle task prevent affine wakeups */
> > +	if (is_idle_task(current))
> > +		return 0;
> > +
> >  	if (master < slave)
> >  		swap(master, slave);
> >  	if (slave < factor || master < slave * factor)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide
  2016-05-23 12:00     ` Morten Rasmussen
@ 2016-05-23 13:00       ` Mike Galbraith
  2016-05-23 14:10         ` Morten Rasmussen
  2016-05-23 23:04       ` Yuyang Du
  2016-06-01 19:57       ` Peter Zijlstra
  2 siblings, 1 reply; 61+ messages in thread
From: Mike Galbraith @ 2016-05-23 13:00 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, dietmar.eggemann, yuyang.du, vincent.guittot,
	linux-kernel

On Mon, 2016-05-23 at 13:00 +0100, Morten Rasmussen wrote:

> The problem then seems to be distinguishing truly idle and busy doing
> interrupts. The issue that I observe is that wake_wide() likes pushing
> tasks around in lightly scenarios which isn't desirable for power
> management. Selecting the same cpu again may potentially let others
> reach deeper C-state.
> 
> With that in mind I will if I can do better. Suggestions are welcome :-)

None here.  For big boxen that are highly idle, you'd likely want to
shut down nodes and consolidate load, but otoh, all that slows response
to burst, which I hate.  I prefer race to idle, let power gating do its
job.  If I had a server farm with enough capacity vs load variability
to worry about, I suspect I'd become highly interested in routing.

	-Mike

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide
  2016-05-23 13:00       ` Mike Galbraith
@ 2016-05-23 14:10         ` Morten Rasmussen
  2016-05-23 15:42           ` Mike Galbraith
  0 siblings, 1 reply; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-23 14:10 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: peterz, mingo, dietmar.eggemann, yuyang.du, vincent.guittot,
	linux-kernel

On Mon, May 23, 2016 at 03:00:46PM +0200, Mike Galbraith wrote:
> On Mon, 2016-05-23 at 13:00 +0100, Morten Rasmussen wrote:
> 
> > The problem then seems to be distinguishing truly idle and busy doing
> > interrupts. The issue that I observe is that wake_wide() likes pushing
> > tasks around in lightly scenarios which isn't desirable for power
> > management. Selecting the same cpu again may potentially let others
> > reach deeper C-state.
> > 
> > With that in mind I will if I can do better. Suggestions are welcome :-)
> 
> None here.  For big boxen that are highly idle, you'd likely want to
> shut down nodes and consolidate load, but otoh, all that slows response
> to burst, which I hate.  I prefer race to idle, let power gating do its
> job.  If I had a server farm with enough capacity vs load variability
> to worry about, I suspect I'd become highly interested in routing.

I don't disagree for systems of that scale, but at the other end of the
spectrum it is a single SoC we are trying squeeze the best possible
mileage out of. That implies optimizing for power gating to reach deeper
C-states when possible by consolidating idle-time and grouping
idle cpus. Migrating task unnecessarily isn't helping us in achieving
that, unfortunately :-(

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide
  2016-05-23 14:10         ` Morten Rasmussen
@ 2016-05-23 15:42           ` Mike Galbraith
  2016-05-23 23:17             ` Yuyang Du
  0 siblings, 1 reply; 61+ messages in thread
From: Mike Galbraith @ 2016-05-23 15:42 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, dietmar.eggemann, yuyang.du, vincent.guittot,
	linux-kernel

On Mon, 2016-05-23 at 15:10 +0100, Morten Rasmussen wrote:
> On Mon, May 23, 2016 at 03:00:46PM +0200, Mike Galbraith wrote:
> > On Mon, 2016-05-23 at 13:00 +0100, Morten Rasmussen wrote:
> > 
> > > The problem then seems to be distinguishing truly idle and busy doing
> > > interrupts. The issue that I observe is that wake_wide() likes pushing
> > > tasks around in lightly scenarios which isn't desirable for power
> > > management. Selecting the same cpu again may potentially let others
> > > reach deeper C-state.
> > > 
> > > With that in mind I will if I can do better. Suggestions are welcome :-)
> > 
> > None here.  For big boxen that are highly idle, you'd likely want to
> > shut down nodes and consolidate load, but otoh, all that slows response
> > to burst, which I hate.  I prefer race to idle, let power gating do its
> > job.  If I had a server farm with enough capacity vs load variability
> > to worry about, I suspect I'd become highly interested in routing.
> 
> I don't disagree for systems of that scale, but at the other end of the
> spectrum it is a single SoC we are trying squeeze the best possible
> mileage out of. That implies optimizing for power gating to reach deeper
> C-states when possible by consolidating idle-time and grouping
> idle cpus. Migrating task unnecessarily isn't helping us in achieving
> that, unfortunately :-(

Yup, the goals are pretty much mutually exclusive.  For your goal, you
want more of an allocator like behavior, where stacking of tasks is bad
only once there's too much overlap (ie latency, defining is hard), and
allocation always has the same order (expand rightward or such for the
general case, adding little/big complexity for arm).  For mine, current
behavior is good, avoid stacking like the plague.

	-Mike

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide
  2016-05-23 12:00     ` Morten Rasmussen
  2016-05-23 13:00       ` Mike Galbraith
@ 2016-05-23 23:04       ` Yuyang Du
  2016-06-01 19:57       ` Peter Zijlstra
  2 siblings, 0 replies; 61+ messages in thread
From: Yuyang Du @ 2016-05-23 23:04 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Mike Galbraith, peterz, mingo, dietmar.eggemann, vincent.guittot,
	linux-kernel

On Mon, May 23, 2016 at 01:00:10PM +0100, Morten Rasmussen wrote:
> On Mon, May 23, 2016 at 01:12:07PM +0200, Mike Galbraith wrote:
> > On Mon, 2016-05-23 at 11:58 +0100, Morten Rasmussen wrote:
> > > wake_wide() is based on task wakee_flips of the waker and the wakee to
> > > decide whether an affine wakeup is desirable. On lightly loaded systems
> > > the waker is frequently the idle task (pid=0) which can accumulate a lot
> > > of wakee_flips in that scenario. It makes little sense to prevent affine
> > > wakeups on an idle cpu due to the idle task wakee_flips, so it makes
> > > more sense to ignore them in wake_wide().
> > 
> > You sure?  What's the difference between a task flipping enough to
> > warrant spreading the load, and an interrupt source doing the same? 
> >  I've both witnessed firsthand, and received user confirmation of this
> > very thing improving utilization.
> 
> Right, I didn't consider the interrupt source scenario, my fault.
> 
> The problem then seems to be distinguishing truly idle and busy doing
> interrupts. The issue that I observe is that wake_wide() likes pushing
> tasks around in lightly scenarios which isn't desirable for power
> management. Selecting the same cpu again may potentially let others
> reach deeper C-state.
> 
> With that in mind I will if I can do better. Suggestions are welcome :-)
 
On mobile, the factor is as small as 2 to 4, may easily be exceeded,
so decay at HZ may be too slow.

> > 
> > > cc: Ingo Molnar <mingo@redhat.com>
> > > cc: Peter Zijlstra <peterz@infradead.org>
> > > 
> > > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > > ---
> > >  kernel/sched/fair.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index c49e25a..0fe3020 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -5007,6 +5007,10 @@ static int wake_wide(struct task_struct *p)
> > >  	unsigned int slave = p->wakee_flips;
> > >  	int factor = this_cpu_read(sd_llc_size);
> > >  
> > > +	/* Don't let the idle task prevent affine wakeups */
> > > +	if (is_idle_task(current))
> > > +		return 0;
> > > +
> > >  	if (master < slave)
> > >  		swap(master, slave);
> > >  	if (slave < factor || master < slave * factor)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide
  2016-05-23 15:42           ` Mike Galbraith
@ 2016-05-23 23:17             ` Yuyang Du
  0 siblings, 0 replies; 61+ messages in thread
From: Yuyang Du @ 2016-05-23 23:17 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Morten Rasmussen, peterz, mingo, dietmar.eggemann,
	vincent.guittot, linux-kernel

On Mon, May 23, 2016 at 05:42:20PM +0200, Mike Galbraith wrote:
> On Mon, 2016-05-23 at 15:10 +0100, Morten Rasmussen wrote:
> > On Mon, May 23, 2016 at 03:00:46PM +0200, Mike Galbraith wrote:
> > > On Mon, 2016-05-23 at 13:00 +0100, Morten Rasmussen wrote:
> > > 
> > > > The problem then seems to be distinguishing truly idle and busy doing
> > > > interrupts. The issue that I observe is that wake_wide() likes pushing
> > > > tasks around in lightly scenarios which isn't desirable for power
> > > > management. Selecting the same cpu again may potentially let others
> > > > reach deeper C-state.
> > > > 
> > > > With that in mind I will if I can do better. Suggestions are welcome :-)
> > > 
> > > None here.  For big boxen that are highly idle, you'd likely want to
> > > shut down nodes and consolidate load, but otoh, all that slows response
> > > to burst, which I hate.  I prefer race to idle, let power gating do its
> > > job.  If I had a server farm with enough capacity vs load variability
> > > to worry about, I suspect I'd become highly interested in routing.
> > 
> > I don't disagree for systems of that scale, but at the other end of the
> > spectrum it is a single SoC we are trying squeeze the best possible
> > mileage out of. That implies optimizing for power gating to reach deeper
> > C-states when possible by consolidating idle-time and grouping
> > idle cpus. Migrating task unnecessarily isn't helping us in achieving
> > that, unfortunately :-(
> 
> Yup, the goals are pretty much mutually exclusive.  For your goal, you
> want more of an allocator like behavior, where stacking of tasks is bad
> only once there's too much overlap (ie latency, defining is hard), and
> allocation always has the same order (expand rightward or such for the
> general case, adding little/big complexity for arm).  For mine, current
> behavior is good, avoid stacking like the plague.

I'd be happy to have a switch to either one goal.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-05-23 10:58 ` [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
@ 2016-05-24  0:04   ` Yuyang Du
  2016-05-24  8:10     ` Morten Rasmussen
  2016-05-24  7:03   ` Mike Galbraith
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 61+ messages in thread
From: Yuyang Du @ 2016-05-24  0:04 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, dietmar.eggemann, vincent.guittot, mgalbraith,
	linux-kernel

On Mon, May 23, 2016 at 11:58:51AM +0100, Morten Rasmussen wrote:
> Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> configurations SD_WAKE_AFFINE is only desirable if the waking task's
> compute demand (utilization) is suitable for the cpu capacities
> available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> balancing take over (find_idlest_{group, cpu}()).
> 
> The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
> containing cpus with different capacities. This is enforced by a
> previous patch based on the SD_ASYM_CPUCAPACITY flag.
> 
> Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> traversing them.
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
>  1 file changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 564215d..ce44fa7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
>  unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
>  #endif
>  
> +/*
> + * The margin used when comparing utilization with cpu capacity:
> + * util * 1024 < capacity * margin
> + */
> +unsigned int capacity_margin = 1280; /* ~20% */
> +
>  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
>  {
>  	lw->weight += inc;
> @@ -5293,6 +5299,25 @@ static int cpu_util(int cpu)
>  	return (util >= capacity) ? capacity : util;
>  }
>  
> +static inline int task_util(struct task_struct *p)
> +{
> +	return p->se.avg.util_avg;
> +}
> +
> +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> +{
> +	long delta;
> +	long prev_cap = capacity_of(prev_cpu);
> +
> +	delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
> +
> +	/* prev_cpu is fairly close to max, no need to abort wake_affine */
> +	if (delta < prev_cap >> 3)
> +		return 0;

delta can be negative? still return 0?

> +
> +	return prev_cap * 1024 < task_util(p) * capacity_margin;
> +}

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 04/16] sched/fair: Optimize find_idlest_cpu() when there is no choice
  2016-05-23 10:58 ` [PATCH 04/16] sched/fair: Optimize find_idlest_cpu() when there is no choice Morten Rasmussen
@ 2016-05-24  6:29   ` Mike Galbraith
  2016-05-24  8:05     ` Morten Rasmussen
  2016-06-01 19:59   ` Peter Zijlstra
  1 sibling, 1 reply; 61+ messages in thread
From: Mike Galbraith @ 2016-05-24  6:29 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, linux-kernel

On Mon, 2016-05-23 at 11:58 +0100, Morten Rasmussen wrote:
> In the current find_idlest_group()/find_idlest_cpu() search we end up
> calling find_idlest_cpu() in a sched_group containing only one cpu in
> the end. Checking idle-states becomes pointless when there is no
> alternative, so bail out instead.
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fe3020..564215d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5155,6 +5155,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  > 	> int shallowest_idle_cpu = -1;
>  > 	> int i;
>  
> +> 	> /* Check if we have any choice */
> +> 	> if (group->group_weight == 1) {
> +> 	> 	> return cpumask_first(sched_group_cpus(group));
> +> 	> }
> +

Hm, if task isn't allowed there, too bad?

>  	/* Traverse only the allowed CPUs */
>  > 	> for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
>  > 	> 	> if (idle_cpu(i)) {

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-05-23 10:58 ` [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
  2016-05-24  0:04   ` Yuyang Du
@ 2016-05-24  7:03   ` Mike Galbraith
  2016-05-24  7:15     ` Mike Galbraith
  2016-05-25  6:57   ` Wanpeng Li
  2016-06-02 14:21   ` Peter Zijlstra
  3 siblings, 1 reply; 61+ messages in thread
From: Mike Galbraith @ 2016-05-24  7:03 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, linux-kernel

On Mon, 2016-05-23 at 11:58 +0100, Morten Rasmussen wrote:
> Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> configurations SD_WAKE_AFFINE is only desirable if the waking task's
> compute demand (utilization) is suitable for the cpu capacities
> available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> balancing take over (find_idlest_{group, cpu}()).
> 
> The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
> containing cpus with different capacities. This is enforced by a
> previous patch based on the SD_ASYM_CPUCAPACITY flag.
> 
> Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> traversing them.

This doesn't look like it's restricted to big/little setups, so could
overrule wake_wide() wanting to NAK a x-node pull.

> > 
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> > 
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
> >  1 file changed, 27 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 564215d..ce44fa7 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
> >  unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
> >  #endif
> >  
> > +/*
> > + * The margin used when comparing utilization with cpu capacity:
> > + * util * 1024 < capacity * margin
> > + */
> > +unsigned int capacity_margin = 1280; /* ~20% */
> > +
> >  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
> >  {
> >  > > 	> > lw->weight += inc;
> > @@ -5293,6 +5299,25 @@ static int cpu_util(int cpu)
> >  > > 	> > return (util >= capacity) ? capacity : util;
> >  }
> >  
> > +static inline int task_util(struct task_struct *p)
> > +{
> > +> > 	> > return p->se.avg.util_avg;
> > +}
> > +
> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > +{
> > +> > 	> > long delta;
> > +> > 	> > long prev_cap = capacity_of(prev_cpu);
> > +
> > +> > 	> > delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
> > +
> > +> > 	> > /* prev_cpu is fairly close to max, no need to abort wake_affine */
> > +> > 	> > if (delta < prev_cap >> 3)
> > +> > 	> > 	> > return 0;
> > +
> > +> > 	> > return prev_cap * 1024 < task_util(p) * capacity_margin;
> > +}
> > +
> >  /*
> >   * select_task_rq_fair: Select target runqueue for the waking task in domains
> >   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> > @@ -5316,7 +5341,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >  
> >  > > 	> > if (sd_flag & SD_BALANCE_WAKE) {
> >  > > 	> > 	> > record_wakee(p);
> > -> > 	> > 	> > want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> > +> > 	> > 	> > want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> > +> > 	> > 	> > 	> >       && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> >  > > 	> > }
> >  
> >  > > 	> > rcu_read_lock();

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-05-24  7:03   ` Mike Galbraith
@ 2016-05-24  7:15     ` Mike Galbraith
  0 siblings, 0 replies; 61+ messages in thread
From: Mike Galbraith @ 2016-05-24  7:15 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, linux-kernel

On Tue, 2016-05-24 at 09:03 +0200, Mike Galbraith wrote:

> This doesn't look like it's restricted to big/little setups, so could
> overrule wake_wide() wanting to NAK a x-node pull.

Bah, nevermind.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 04/16] sched/fair: Optimize find_idlest_cpu() when there is no choice
  2016-05-24  6:29   ` Mike Galbraith
@ 2016-05-24  8:05     ` Morten Rasmussen
  2016-05-24  8:12       ` Mike Galbraith
  0 siblings, 1 reply; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-24  8:05 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: peterz, mingo, dietmar.eggemann, yuyang.du, vincent.guittot,
	linux-kernel

On Tue, May 24, 2016 at 08:29:05AM +0200, Mike Galbraith wrote:
> On Mon, 2016-05-23 at 11:58 +0100, Morten Rasmussen wrote:
> > In the current find_idlest_group()/find_idlest_cpu() search we end up
> > calling find_idlest_cpu() in a sched_group containing only one cpu in
> > the end. Checking idle-states becomes pointless when there is no
> > alternative, so bail out instead.
> > 
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> > 
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0fe3020..564215d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5155,6 +5155,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> >  > 	> int shallowest_idle_cpu = -1;
> >  > 	> int i;
> >  
> > +> 	> /* Check if we have any choice */
> > +> 	> if (group->group_weight == 1) {
> > +> 	> 	> return cpumask_first(sched_group_cpus(group));
> > +> 	> }
> > +
> 
> Hm, if task isn't allowed there, too bad?

Is that possible for single-cpu groups? I thought we skipped groups with
no cpus allowed in find_idlest_group():

                /* Skip over this group if it has no CPUs allowed */
                if (!cpumask_intersects(sched_group_cpus(group),
                                        tsk_cpus_allowed(p)))
                        continue;

Since the group has at least one cpu allowed and only contains one cpu,
that cpu must be allowed. No?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-05-24  0:04   ` Yuyang Du
@ 2016-05-24  8:10     ` Morten Rasmussen
  0 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-24  8:10 UTC (permalink / raw)
  To: Yuyang Du
  Cc: peterz, mingo, dietmar.eggemann, vincent.guittot, mgalbraith,
	linux-kernel

On Tue, May 24, 2016 at 08:04:24AM +0800, Yuyang Du wrote:
> On Mon, May 23, 2016 at 11:58:51AM +0100, Morten Rasmussen wrote:
> > Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> > SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> > configurations SD_WAKE_AFFINE is only desirable if the waking task's
> > compute demand (utilization) is suitable for the cpu capacities
> > available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> > balancing take over (find_idlest_{group, cpu}()).
> > 
> > The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
> > containing cpus with different capacities. This is enforced by a
> > previous patch based on the SD_ASYM_CPUCAPACITY flag.
> > 
> > Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> > know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> > traversing them.
> > 
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> > 
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
> >  1 file changed, 27 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 564215d..ce44fa7 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
> >  unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
> >  #endif
> >  
> > +/*
> > + * The margin used when comparing utilization with cpu capacity:
> > + * util * 1024 < capacity * margin
> > + */
> > +unsigned int capacity_margin = 1280; /* ~20% */
> > +
> >  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
> >  {
> >  	lw->weight += inc;
> > @@ -5293,6 +5299,25 @@ static int cpu_util(int cpu)
> >  	return (util >= capacity) ? capacity : util;
> >  }
> >  
> > +static inline int task_util(struct task_struct *p)
> > +{
> > +	return p->se.avg.util_avg;
> > +}
> > +
> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > +{
> > +	long delta;
> > +	long prev_cap = capacity_of(prev_cpu);
> > +
> > +	delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
> > +
> > +	/* prev_cpu is fairly close to max, no need to abort wake_affine */
> > +	if (delta < prev_cap >> 3)
> > +		return 0;
> 
> delta can be negative? still return 0?

I could add an abs() around delta.

Do you have a specific scenario in mind? Under normal circumstances, I
don't think it can be negative?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 04/16] sched/fair: Optimize find_idlest_cpu() when there is no choice
  2016-05-24  8:05     ` Morten Rasmussen
@ 2016-05-24  8:12       ` Mike Galbraith
  0 siblings, 0 replies; 61+ messages in thread
From: Mike Galbraith @ 2016-05-24  8:12 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, dietmar.eggemann, yuyang.du, vincent.guittot,
	linux-kernel

On Tue, 2016-05-24 at 09:05 +0100, Morten Rasmussen wrote:
> On Tue, May 24, 2016 at 08:29:05AM +0200, Mike Galbraith wrote:
> >  
> > > +> > > > 	> > > > /* Check if we have any choice */
> > > +> > > > 	> > > > if (group->group_weight == 1) {
> > > +> > > > 	> > > > > > > 	> > > > return cpumask_first(sched_group_cpus(group));
> > > +> > > > 	> > > > }
> > > +
> > 
> > Hm, if task isn't allowed there, too bad?
> 
> Is that possible for single-cpu groups? I thought we skipped groups with
> no cpus allowed in find_idlest_group():
> 
>                 /* Skip over this group if it has no CPUs allowed */
>                 if (!cpumask_intersects(sched_group_cpus(group),
>                                         tsk_cpus_allowed(p)))
>                         continue;
> 
> Since the group has at least one cpu allowed and only contains one cpu,
> that cpu must be allowed. No?

Yup, you're right, handled before we got there.

	-Mike

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations
  2016-05-23 10:58 ` [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations Morten Rasmussen
@ 2016-05-24  9:10   ` Vincent Guittot
  2016-05-24 10:29     ` Morten Rasmussen
  0 siblings, 1 reply; 61+ messages in thread
From: Vincent Guittot @ 2016-05-24  9:10 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On 23 May 2016 at 12:58, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> If the system has cpu of different compute capacities (e.g. big.LITTLE)
> let affine wakeups be constrained to cpus of the same type.

Can you explain why you don't want wake affine with cpus with
different compute capacity ?

>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/core.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d9619a3..558ec4a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6410,6 +6410,9 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
>                 sd->idle_idx = 1;
>         }
>
> +       if (sd->flags & SD_ASYM_CPUCAPACITY)
> +               sd->flags &= ~SD_WAKE_AFFINE;
> +
>         sd->private = &tl->data;
>
>         return sd;
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations
  2016-05-24  9:10   ` Vincent Guittot
@ 2016-05-24 10:29     ` Morten Rasmussen
  2016-05-24 12:12       ` Vincent Guittot
  0 siblings, 1 reply; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-24 10:29 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On Tue, May 24, 2016 at 11:10:28AM +0200, Vincent Guittot wrote:
> On 23 May 2016 at 12:58, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > If the system has cpu of different compute capacities (e.g. big.LITTLE)
> > let affine wakeups be constrained to cpus of the same type.
> 
> Can you explain why you don't want wake affine with cpus with
> different compute capacity ?

I should have made the overall idea a bit more clear. The idea is to
deal with cross-capacity migrations in the find_idlest_{group, cpu}{}
path so we don't have to touch select_idle_sibling().
select_idle_sibling() is critical for wake-up latency, and I'm assumed
that people wouldn't like adding extra overhead in there to deal with
capacity and utilization.

So the overall idea is that symmetric capacity systems, everything
should work as normal. For asymmetric capacity systems, we restrict
select_idle_sibling() to only look among same-capacity cpus and then use
wake_cap() to use find_idlest_{group, cpu}() to look wider if we think
should look for cpu with higher capacity than the previous one. So, for
assymmetric cpus we take one of the two routes depending on whether a
cpu of the same capacity as the previous one is okay.

Do that make any sense?

> 
> >
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/core.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d9619a3..558ec4a 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6410,6 +6410,9 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
> >                 sd->idle_idx = 1;
> >         }
> >
> > +       if (sd->flags & SD_ASYM_CPUCAPACITY)
> > +               sd->flags &= ~SD_WAKE_AFFINE;
> > +
> >         sd->private = &tl->data;
> >
> >         return sd;
> > --
> > 1.9.1
> >

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations
  2016-05-24 10:29     ` Morten Rasmussen
@ 2016-05-24 12:12       ` Vincent Guittot
  2016-05-24 13:16         ` Morten Rasmussen
  0 siblings, 1 reply; 61+ messages in thread
From: Vincent Guittot @ 2016-05-24 12:12 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On 24 May 2016 at 12:29, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, May 24, 2016 at 11:10:28AM +0200, Vincent Guittot wrote:
>> On 23 May 2016 at 12:58, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > If the system has cpu of different compute capacities (e.g. big.LITTLE)
>> > let affine wakeups be constrained to cpus of the same type.
>>
>> Can you explain why you don't want wake affine with cpus with
>> different compute capacity ?
>
> I should have made the overall idea a bit more clear. The idea is to
> deal with cross-capacity migrations in the find_idlest_{group, cpu}{}
> path so we don't have to touch select_idle_sibling().
> select_idle_sibling() is critical for wake-up latency, and I'm assumed
> that people wouldn't like adding extra overhead in there to deal with
> capacity and utilization.

So this means that we will never use the quick path of
select_idle_sibling for cross capacity migration but always the one
with extra overhead?
Patch 9 adds more tests for enabling wake_affine path. Can't it also
be used for cross capacity migration ? so we can use wake_affine if
the task or the cpus (even with different capacity) doesn't need this
extra overhead

>
> So the overall idea is that symmetric capacity systems, everything
> should work as normal. For asymmetric capacity systems, we restrict
> select_idle_sibling() to only look among same-capacity cpus and then use
> wake_cap() to use find_idlest_{group, cpu}() to look wider if we think
> should look for cpu with higher capacity than the previous one. So, for
> assymmetric cpus we take one of the two routes depending on whether a
> cpu of the same capacity as the previous one is okay.
>
> Do that make any sense?
>
>>
>> >
>> > cc: Ingo Molnar <mingo@redhat.com>
>> > cc: Peter Zijlstra <peterz@infradead.org>
>> >
>> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> > ---
>> >  kernel/sched/core.c | 3 +++
>> >  1 file changed, 3 insertions(+)
>> >
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index d9619a3..558ec4a 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -6410,6 +6410,9 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
>> >                 sd->idle_idx = 1;
>> >         }
>> >
>> > +       if (sd->flags & SD_ASYM_CPUCAPACITY)
>> > +               sd->flags &= ~SD_WAKE_AFFINE;
>> > +
>> >         sd->private = &tl->data;
>> >
>> >         return sd;
>> > --
>> > 1.9.1
>> >

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations
  2016-05-24 12:12       ` Vincent Guittot
@ 2016-05-24 13:16         ` Morten Rasmussen
  2016-05-24 13:27           ` Vincent Guittot
  0 siblings, 1 reply; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-24 13:16 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On Tue, May 24, 2016 at 02:12:38PM +0200, Vincent Guittot wrote:
> On 24 May 2016 at 12:29, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Tue, May 24, 2016 at 11:10:28AM +0200, Vincent Guittot wrote:
> >> On 23 May 2016 at 12:58, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> > If the system has cpu of different compute capacities (e.g. big.LITTLE)
> >> > let affine wakeups be constrained to cpus of the same type.
> >>
> >> Can you explain why you don't want wake affine with cpus with
> >> different compute capacity ?
> >
> > I should have made the overall idea a bit more clear. The idea is to
> > deal with cross-capacity migrations in the find_idlest_{group, cpu}{}
> > path so we don't have to touch select_idle_sibling().
> > select_idle_sibling() is critical for wake-up latency, and I'm assumed
> > that people wouldn't like adding extra overhead in there to deal with
> > capacity and utilization.
> 
> So this means that we will never use the quick path of
> select_idle_sibling for cross capacity migration but always the one
> with extra overhead?

Yes. select_idle_sibling() is only used to choose among equal capacity
cpus (capacity_orig).

> Patch 9 adds more tests for enabling wake_affine path. Can't it also
> be used for cross capacity migration ? so we can use wake_affine if
> the task or the cpus (even with different capacity) doesn't need this
> extra overhead

The test in patch 9 is to determine whether we are happy with the
capacity of the previous cpu, or we should go look for one with more
capacity. I don't see how we can use select_idle_sibling() unmodified
for sched domains containing cpus of different capacity to select an
appropriate cpu. It is just picking an idle cpu, it might have high
capacity or low, it wouldn't care.

How would you avoid the overhead of checking capacity and utilization of
the cpus and still pick an appropriate cpu?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations
  2016-05-24 13:16         ` Morten Rasmussen
@ 2016-05-24 13:27           ` Vincent Guittot
  2016-05-24 13:36             ` Morten Rasmussen
  0 siblings, 1 reply; 61+ messages in thread
From: Vincent Guittot @ 2016-05-24 13:27 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On 24 May 2016 at 15:16, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, May 24, 2016 at 02:12:38PM +0200, Vincent Guittot wrote:
>> On 24 May 2016 at 12:29, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Tue, May 24, 2016 at 11:10:28AM +0200, Vincent Guittot wrote:
>> >> On 23 May 2016 at 12:58, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> >> > If the system has cpu of different compute capacities (e.g. big.LITTLE)
>> >> > let affine wakeups be constrained to cpus of the same type.
>> >>
>> >> Can you explain why you don't want wake affine with cpus with
>> >> different compute capacity ?
>> >
>> > I should have made the overall idea a bit more clear. The idea is to
>> > deal with cross-capacity migrations in the find_idlest_{group, cpu}{}
>> > path so we don't have to touch select_idle_sibling().
>> > select_idle_sibling() is critical for wake-up latency, and I'm assumed
>> > that people wouldn't like adding extra overhead in there to deal with
>> > capacity and utilization.
>>
>> So this means that we will never use the quick path of
>> select_idle_sibling for cross capacity migration but always the one
>> with extra overhead?
>
> Yes. select_idle_sibling() is only used to choose among equal capacity
> cpus (capacity_orig).
>
>> Patch 9 adds more tests for enabling wake_affine path. Can't it also
>> be used for cross capacity migration ? so we can use wake_affine if
>> the task or the cpus (even with different capacity) doesn't need this
>> extra overhead
>
> The test in patch 9 is to determine whether we are happy with the
> capacity of the previous cpu, or we should go look for one with more
> capacity. I don't see how we can use select_idle_sibling() unmodified
> for sched domains containing cpus of different capacity to select an
> appropriate cpu. It is just picking an idle cpu, it might have high
> capacity or low, it wouldn't care.
>
> How would you avoid the overhead of checking capacity and utilization of
> the cpus and still pick an appropriate cpu?

My point is that there is some wake up case where we don't care about
the capacity and utilization of cpus even for cross capacity migration
and we will never take benefit of this fast path.
You have added an extra check for setting want_affine in patch 9 which
uses capacity and utilization of cpu to disable this fast path when a
task needs more capacity than available. Can't you use this function
to disable the want_affine for cross-capacity migration situation that
cares of the capacity and need the full scan of sched_domain but keep
it enable for other cases ?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations
  2016-05-24 13:27           ` Vincent Guittot
@ 2016-05-24 13:36             ` Morten Rasmussen
  2016-05-24 13:52               ` Vincent Guittot
  0 siblings, 1 reply; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-24 13:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On Tue, May 24, 2016 at 03:27:05PM +0200, Vincent Guittot wrote:
> On 24 May 2016 at 15:16, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Tue, May 24, 2016 at 02:12:38PM +0200, Vincent Guittot wrote:
> >> On 24 May 2016 at 12:29, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> > On Tue, May 24, 2016 at 11:10:28AM +0200, Vincent Guittot wrote:
> >> >> On 23 May 2016 at 12:58, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> >> > If the system has cpu of different compute capacities (e.g. big.LITTLE)
> >> >> > let affine wakeups be constrained to cpus of the same type.
> >> >>
> >> >> Can you explain why you don't want wake affine with cpus with
> >> >> different compute capacity ?
> >> >
> >> > I should have made the overall idea a bit more clear. The idea is to
> >> > deal with cross-capacity migrations in the find_idlest_{group, cpu}{}
> >> > path so we don't have to touch select_idle_sibling().
> >> > select_idle_sibling() is critical for wake-up latency, and I'm assumed
> >> > that people wouldn't like adding extra overhead in there to deal with
> >> > capacity and utilization.
> >>
> >> So this means that we will never use the quick path of
> >> select_idle_sibling for cross capacity migration but always the one
> >> with extra overhead?
> >
> > Yes. select_idle_sibling() is only used to choose among equal capacity
> > cpus (capacity_orig).
> >
> >> Patch 9 adds more tests for enabling wake_affine path. Can't it also
> >> be used for cross capacity migration ? so we can use wake_affine if
> >> the task or the cpus (even with different capacity) doesn't need this
> >> extra overhead
> >
> > The test in patch 9 is to determine whether we are happy with the
> > capacity of the previous cpu, or we should go look for one with more
> > capacity. I don't see how we can use select_idle_sibling() unmodified
> > for sched domains containing cpus of different capacity to select an
> > appropriate cpu. It is just picking an idle cpu, it might have high
> > capacity or low, it wouldn't care.
> >
> > How would you avoid the overhead of checking capacity and utilization of
> > the cpus and still pick an appropriate cpu?
> 
> My point is that there is some wake up case where we don't care about
> the capacity and utilization of cpus even for cross capacity migration
> and we will never take benefit of this fast path.
> You have added an extra check for setting want_affine in patch 9 which
> uses capacity and utilization of cpu to disable this fast path when a
> task needs more capacity than available. Can't you use this function
> to disable the want_affine for cross-capacity migration situation that
> cares of the capacity and need the full scan of sched_domain but keep
> it enable for other cases ?

It is not clear to me what the other cases are. What kind of cases do
you have in mind?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations
  2016-05-24 13:36             ` Morten Rasmussen
@ 2016-05-24 13:52               ` Vincent Guittot
  2016-05-24 15:02                 ` Morten Rasmussen
  0 siblings, 1 reply; 61+ messages in thread
From: Vincent Guittot @ 2016-05-24 13:52 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On 24 May 2016 at 15:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, May 24, 2016 at 03:27:05PM +0200, Vincent Guittot wrote:
>> On 24 May 2016 at 15:16, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Tue, May 24, 2016 at 02:12:38PM +0200, Vincent Guittot wrote:
>> >> On 24 May 2016 at 12:29, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> >> > On Tue, May 24, 2016 at 11:10:28AM +0200, Vincent Guittot wrote:
>> >> >> On 23 May 2016 at 12:58, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> >> >> > If the system has cpu of different compute capacities (e.g. big.LITTLE)
>> >> >> > let affine wakeups be constrained to cpus of the same type.
>> >> >>
>> >> >> Can you explain why you don't want wake affine with cpus with
>> >> >> different compute capacity ?
>> >> >
>> >> > I should have made the overall idea a bit more clear. The idea is to
>> >> > deal with cross-capacity migrations in the find_idlest_{group, cpu}{}
>> >> > path so we don't have to touch select_idle_sibling().
>> >> > select_idle_sibling() is critical for wake-up latency, and I'm assumed
>> >> > that people wouldn't like adding extra overhead in there to deal with
>> >> > capacity and utilization.
>> >>
>> >> So this means that we will never use the quick path of
>> >> select_idle_sibling for cross capacity migration but always the one
>> >> with extra overhead?
>> >
>> > Yes. select_idle_sibling() is only used to choose among equal capacity
>> > cpus (capacity_orig).
>> >
>> >> Patch 9 adds more tests for enabling wake_affine path. Can't it also
>> >> be used for cross capacity migration ? so we can use wake_affine if
>> >> the task or the cpus (even with different capacity) doesn't need this
>> >> extra overhead
>> >
>> > The test in patch 9 is to determine whether we are happy with the
>> > capacity of the previous cpu, or we should go look for one with more
>> > capacity. I don't see how we can use select_idle_sibling() unmodified
>> > for sched domains containing cpus of different capacity to select an
>> > appropriate cpu. It is just picking an idle cpu, it might have high
>> > capacity or low, it wouldn't care.
>> >
>> > How would you avoid the overhead of checking capacity and utilization of
>> > the cpus and still pick an appropriate cpu?
>>
>> My point is that there is some wake up case where we don't care about
>> the capacity and utilization of cpus even for cross capacity migration
>> and we will never take benefit of this fast path.
>> You have added an extra check for setting want_affine in patch 9 which
>> uses capacity and utilization of cpu to disable this fast path when a
>> task needs more capacity than available. Can't you use this function
>> to disable the want_affine for cross-capacity migration situation that
>> cares of the capacity and need the full scan of sched_domain but keep
>> it enable for other cases ?
>
> It is not clear to me what the other cases are. What kind of cases do
> you have in mind?

As an example, you have a task A that have to be on a big CPU because
of the requirement of compute capacity, that wakes up a task B that
can run on any cpu according to its utilization. The fast wake up path
is fine for task B whatever prev cpu is.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations
  2016-05-24 13:52               ` Vincent Guittot
@ 2016-05-24 15:02                 ` Morten Rasmussen
  2016-05-24 15:53                   ` Vincent Guittot
  0 siblings, 1 reply; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-24 15:02 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On Tue, May 24, 2016 at 03:52:00PM +0200, Vincent Guittot wrote:
> On 24 May 2016 at 15:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Tue, May 24, 2016 at 03:27:05PM +0200, Vincent Guittot wrote:
> >> On 24 May 2016 at 15:16, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> > On Tue, May 24, 2016 at 02:12:38PM +0200, Vincent Guittot wrote:
> >> >> On 24 May 2016 at 12:29, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> >> > On Tue, May 24, 2016 at 11:10:28AM +0200, Vincent Guittot wrote:
> >> >> >> On 23 May 2016 at 12:58, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> >> >> > If the system has cpu of different compute capacities (e.g. big.LITTLE)
> >> >> >> > let affine wakeups be constrained to cpus of the same type.
> >> >> >>
> >> >> >> Can you explain why you don't want wake affine with cpus with
> >> >> >> different compute capacity ?
> >> >> >
> >> >> > I should have made the overall idea a bit more clear. The idea is to
> >> >> > deal with cross-capacity migrations in the find_idlest_{group, cpu}{}
> >> >> > path so we don't have to touch select_idle_sibling().
> >> >> > select_idle_sibling() is critical for wake-up latency, and I'm assumed
> >> >> > that people wouldn't like adding extra overhead in there to deal with
> >> >> > capacity and utilization.
> >> >>
> >> >> So this means that we will never use the quick path of
> >> >> select_idle_sibling for cross capacity migration but always the one
> >> >> with extra overhead?
> >> >
> >> > Yes. select_idle_sibling() is only used to choose among equal capacity
> >> > cpus (capacity_orig).
> >> >
> >> >> Patch 9 adds more tests for enabling wake_affine path. Can't it also
> >> >> be used for cross capacity migration ? so we can use wake_affine if
> >> >> the task or the cpus (even with different capacity) doesn't need this
> >> >> extra overhead
> >> >
> >> > The test in patch 9 is to determine whether we are happy with the
> >> > capacity of the previous cpu, or we should go look for one with more
> >> > capacity. I don't see how we can use select_idle_sibling() unmodified
> >> > for sched domains containing cpus of different capacity to select an
> >> > appropriate cpu. It is just picking an idle cpu, it might have high
> >> > capacity or low, it wouldn't care.
> >> >
> >> > How would you avoid the overhead of checking capacity and utilization of
> >> > the cpus and still pick an appropriate cpu?
> >>
> >> My point is that there is some wake up case where we don't care about
> >> the capacity and utilization of cpus even for cross capacity migration
> >> and we will never take benefit of this fast path.
> >> You have added an extra check for setting want_affine in patch 9 which
> >> uses capacity and utilization of cpu to disable this fast path when a
> >> task needs more capacity than available. Can't you use this function
> >> to disable the want_affine for cross-capacity migration situation that
> >> cares of the capacity and need the full scan of sched_domain but keep
> >> it enable for other cases ?
> >
> > It is not clear to me what the other cases are. What kind of cases do
> > you have in mind?
> 
> As an example, you have a task A that have to be on a big CPU because
> of the requirement of compute capacity, that wakes up a task B that
> can run on any cpu according to its utilization. The fast wake up path
> is fine for task B whatever prev cpu is.

In that case, we will take always take fast path (select_idle_sibling())
for task B if wake_wide() allows it, which should be fine.

wake_cap() will return true as the B's prev_cpu is either a big cpu
(first criteria) or have sufficient capacity for B (second criteria).
Given that wake_wide() allows returns false as well and there are no
restrictions, want_affine will be true. Depending on where wake_affine()
sends us, we will use select_idle_sibling() to search around B's
prev_cpu or this cpu (where task A is running).

We avoid the overhead of looking for cpu capacity and utilization, but
we have restricted the search space for select_idle_sibling(). In case
B's prev_cpu is a little cpu, the choice whether we looks for little or
big capacity cpus depends on the wake_affine()'s decision. So the search
space isn't as wide as it could be.

To expand the search space we would have be able to adjust the
sched_domain level at which select_idle_sibling() is operating, so we
can look at same-capacity cpus only in the fast path for tasks like A,
and look at all cpus for tasks like B. It could possibly be done, if we
dare touching select_idle_sibling() ;-) I still have to look at those
patches PeterZ posted a while back.

TLDR; The fast path should already be used for task B, but the cpu
search space is restricted to a specific subset of cpus selected by
wake_affine() which isn't ideal, but much less invasive in terms of code
changes. 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations
  2016-05-24 15:02                 ` Morten Rasmussen
@ 2016-05-24 15:53                   ` Vincent Guittot
  2016-05-25  9:12                     ` Morten Rasmussen
  0 siblings, 1 reply; 61+ messages in thread
From: Vincent Guittot @ 2016-05-24 15:53 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On 24 May 2016 at 17:02, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, May 24, 2016 at 03:52:00PM +0200, Vincent Guittot wrote:
>> On 24 May 2016 at 15:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Tue, May 24, 2016 at 03:27:05PM +0200, Vincent Guittot wrote:
>> >> On 24 May 2016 at 15:16, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> >> > On Tue, May 24, 2016 at 02:12:38PM +0200, Vincent Guittot wrote:
>> >> >> On 24 May 2016 at 12:29, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> >> >> > On Tue, May 24, 2016 at 11:10:28AM +0200, Vincent Guittot wrote:
>> >> >> >> On 23 May 2016 at 12:58, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> >> >> >> > If the system has cpu of different compute capacities (e.g. big.LITTLE)
>> >> >> >> > let affine wakeups be constrained to cpus of the same type.
>> >> >> >>
>> >> >> >> Can you explain why you don't want wake affine with cpus with
>> >> >> >> different compute capacity ?
>> >> >> >
>> >> >> > I should have made the overall idea a bit more clear. The idea is to
>> >> >> > deal with cross-capacity migrations in the find_idlest_{group, cpu}{}
>> >> >> > path so we don't have to touch select_idle_sibling().
>> >> >> > select_idle_sibling() is critical for wake-up latency, and I'm assumed
>> >> >> > that people wouldn't like adding extra overhead in there to deal with
>> >> >> > capacity and utilization.
>> >> >>
>> >> >> So this means that we will never use the quick path of
>> >> >> select_idle_sibling for cross capacity migration but always the one
>> >> >> with extra overhead?
>> >> >
>> >> > Yes. select_idle_sibling() is only used to choose among equal capacity
>> >> > cpus (capacity_orig).
>> >> >
>> >> >> Patch 9 adds more tests for enabling wake_affine path. Can't it also
>> >> >> be used for cross capacity migration ? so we can use wake_affine if
>> >> >> the task or the cpus (even with different capacity) doesn't need this
>> >> >> extra overhead
>> >> >
>> >> > The test in patch 9 is to determine whether we are happy with the
>> >> > capacity of the previous cpu, or we should go look for one with more
>> >> > capacity. I don't see how we can use select_idle_sibling() unmodified
>> >> > for sched domains containing cpus of different capacity to select an
>> >> > appropriate cpu. It is just picking an idle cpu, it might have high
>> >> > capacity or low, it wouldn't care.
>> >> >
>> >> > How would you avoid the overhead of checking capacity and utilization of
>> >> > the cpus and still pick an appropriate cpu?
>> >>
>> >> My point is that there is some wake up case where we don't care about
>> >> the capacity and utilization of cpus even for cross capacity migration
>> >> and we will never take benefit of this fast path.
>> >> You have added an extra check for setting want_affine in patch 9 which
>> >> uses capacity and utilization of cpu to disable this fast path when a
>> >> task needs more capacity than available. Can't you use this function
>> >> to disable the want_affine for cross-capacity migration situation that
>> >> cares of the capacity and need the full scan of sched_domain but keep
>> >> it enable for other cases ?
>> >
>> > It is not clear to me what the other cases are. What kind of cases do
>> > you have in mind?
>>
>> As an example, you have a task A that have to be on a big CPU because
>> of the requirement of compute capacity, that wakes up a task B that
>> can run on any cpu according to its utilization. The fast wake up path
>> is fine for task B whatever prev cpu is.
>
> In that case, we will take always take fast path (select_idle_sibling())
> for task B if wake_wide() allows it, which should be fine.

Even if want_affine is set, the wake up of task B will not use the fast path.
The affine_sd will not be set because the sched_domain, which have
both cpus, will not have the SD_WAKE_AFFINE flag according to this
patch, isn't it ?
So task B can't use the fast path whereas nothing prevent him to take
benefit of it

Am I missing something ?

>
> wake_cap() will return true as the B's prev_cpu is either a big cpu
> (first criteria) or have sufficient capacity for B (second criteria).
> Given that wake_wide() allows returns false as well and there are no
> restrictions, want_affine will be true. Depending on where wake_affine()
> sends us, we will use select_idle_sibling() to search around B's
> prev_cpu or this cpu (where task A is running).
>
> We avoid the overhead of looking for cpu capacity and utilization, but
> we have restricted the search space for select_idle_sibling(). In case
> B's prev_cpu is a little cpu, the choice whether we looks for little or
> big capacity cpus depends on the wake_affine()'s decision. So the search
> space isn't as wide as it could be.
>
> To expand the search space we would have be able to adjust the
> sched_domain level at which select_idle_sibling() is operating, so we
> can look at same-capacity cpus only in the fast path for tasks like A,
> and look at all cpus for tasks like B. It could possibly be done, if we
> dare touching select_idle_sibling() ;-) I still have to look at those
> patches PeterZ posted a while back.
>
> TLDR; The fast path should already be used for task B, but the cpu
> search space is restricted to a specific subset of cpus selected by
> wake_affine() which isn't ideal, but much less invasive in terms of code
> changes.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/16] sched: Make SD_BALANCE_WAKE a topology flag
  2016-05-23 10:58 ` [PATCH 07/16] sched: Make SD_BALANCE_WAKE a topology flag Morten Rasmussen
@ 2016-05-24 23:52   ` Yuyang Du
  2016-05-25  9:27     ` Morten Rasmussen
  2016-06-01 20:18   ` Peter Zijlstra
  1 sibling, 1 reply; 61+ messages in thread
From: Yuyang Du @ 2016-05-24 23:52 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, dietmar.eggemann, vincent.guittot, mgalbraith,
	linux-kernel

On Mon, May 23, 2016 at 11:58:49AM +0100, Morten Rasmussen wrote:
> For systems with the SD_ASYM_CPUCAPACITY flag set on higher level in the
> sched_domain hierarchy we need a way to enable wake-up balancing for the
> lower levels as well as we may want to balance tasks that don't fit the
> capacity of the previous cpu.
> 
> We have the option of introducing a new topology flag to express this
> requirement, or let the existing SD_BALANCE_WAKE flag be set by the
> architecture as a topology flag. The former means introducing yet
> another flag, the latter breaks the current meaning of topology flags.
> None of the options are really desirable.
 
I'd propose to replace SD_WAKE_AFFINE with SD_BALANCE_WAKE. And the
SD_WAKE_AFFINE semantic is simply "waker allowed":

waker_allowed = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));

This can be implemented without current functionality change.

>From there, the choice between waker and wakee, and fast path
select_idle_sibling() and the rest slow path should be reworked, which
I am thinking about.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-05-23 10:58 ` [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
  2016-05-24  0:04   ` Yuyang Du
  2016-05-24  7:03   ` Mike Galbraith
@ 2016-05-25  6:57   ` Wanpeng Li
  2016-05-25  9:49     ` Morten Rasmussen
  2016-06-02 14:21   ` Peter Zijlstra
  3 siblings, 1 reply; 61+ messages in thread
From: Wanpeng Li @ 2016-05-25  6:57 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, yuyang.du,
	Vincent Guittot, Mike Galbraith, linux-kernel

2016-05-23 18:58 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> configurations SD_WAKE_AFFINE is only desirable if the waking task's
> compute demand (utilization) is suitable for the cpu capacities
> available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> balancing take over (find_idlest_{group, cpu}()).
>
> The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
> containing cpus with different capacities. This is enforced by a
> previous patch based on the SD_ASYM_CPUCAPACITY flag.
>
> Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> traversing them.
>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
>  1 file changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 564215d..ce44fa7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
>  unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
>  #endif
>
> +/*
> + * The margin used when comparing utilization with cpu capacity:
> + * util * 1024 < capacity * margin
> + */
> +unsigned int capacity_margin = 1280; /* ~20% */
> +
>  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
>  {
>         lw->weight += inc;
> @@ -5293,6 +5299,25 @@ static int cpu_util(int cpu)
>         return (util >= capacity) ? capacity : util;
>  }
>
> +static inline int task_util(struct task_struct *p)
> +{
> +       return p->se.avg.util_avg;
> +}
> +
> +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> +{
> +       long delta;
> +       long prev_cap = capacity_of(prev_cpu);
> +
> +       delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
> +
> +       /* prev_cpu is fairly close to max, no need to abort wake_affine */
> +       if (delta < prev_cap >> 3)
> +               return 0;
> +
> +       return prev_cap * 1024 < task_util(p) * capacity_margin;
> +}

If one task util_avg is SCHED_CAPACITY_SCALE and running on x86 box w/
SMT enabled, then each HT has capacity 589, wake_cap() will result in
always not wake affine, right?

Regards,
Wanpeng Li

> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -5316,7 +5341,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>
>         if (sd_flag & SD_BALANCE_WAKE) {
>                 record_wakee(p);
> -               want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> +               want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> +                             && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>         }
>
>         rcu_read_lock();
> --
> 1.9.1
>



-- 
Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations
  2016-05-24 15:53                   ` Vincent Guittot
@ 2016-05-25  9:12                     ` Morten Rasmussen
  2016-05-26  6:45                       ` Vincent Guittot
  0 siblings, 1 reply; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-25  9:12 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On Tue, May 24, 2016 at 05:53:27PM +0200, Vincent Guittot wrote:
> On 24 May 2016 at 17:02, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Tue, May 24, 2016 at 03:52:00PM +0200, Vincent Guittot wrote:
> >> On 24 May 2016 at 15:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> > On Tue, May 24, 2016 at 03:27:05PM +0200, Vincent Guittot wrote:
> >> >> On 24 May 2016 at 15:16, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> >> > On Tue, May 24, 2016 at 02:12:38PM +0200, Vincent Guittot wrote:
> >> >> >> On 24 May 2016 at 12:29, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> >> >> > On Tue, May 24, 2016 at 11:10:28AM +0200, Vincent Guittot wrote:
> >> >> >> >> On 23 May 2016 at 12:58, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> >> >> >> > If the system has cpu of different compute capacities (e.g. big.LITTLE)
> >> >> >> >> > let affine wakeups be constrained to cpus of the same type.
> >> >> >> >>
> >> >> >> >> Can you explain why you don't want wake affine with cpus with
> >> >> >> >> different compute capacity ?
> >> >> >> >
> >> >> >> > I should have made the overall idea a bit more clear. The idea is to
> >> >> >> > deal with cross-capacity migrations in the find_idlest_{group, cpu}{}
> >> >> >> > path so we don't have to touch select_idle_sibling().
> >> >> >> > select_idle_sibling() is critical for wake-up latency, and I'm assumed
> >> >> >> > that people wouldn't like adding extra overhead in there to deal with
> >> >> >> > capacity and utilization.
> >> >> >>
> >> >> >> So this means that we will never use the quick path of
> >> >> >> select_idle_sibling for cross capacity migration but always the one
> >> >> >> with extra overhead?
> >> >> >
> >> >> > Yes. select_idle_sibling() is only used to choose among equal capacity
> >> >> > cpus (capacity_orig).
> >> >> >
> >> >> >> Patch 9 adds more tests for enabling wake_affine path. Can't it also
> >> >> >> be used for cross capacity migration ? so we can use wake_affine if
> >> >> >> the task or the cpus (even with different capacity) doesn't need this
> >> >> >> extra overhead
> >> >> >
> >> >> > The test in patch 9 is to determine whether we are happy with the
> >> >> > capacity of the previous cpu, or we should go look for one with more
> >> >> > capacity. I don't see how we can use select_idle_sibling() unmodified
> >> >> > for sched domains containing cpus of different capacity to select an
> >> >> > appropriate cpu. It is just picking an idle cpu, it might have high
> >> >> > capacity or low, it wouldn't care.
> >> >> >
> >> >> > How would you avoid the overhead of checking capacity and utilization of
> >> >> > the cpus and still pick an appropriate cpu?
> >> >>
> >> >> My point is that there is some wake up case where we don't care about
> >> >> the capacity and utilization of cpus even for cross capacity migration
> >> >> and we will never take benefit of this fast path.
> >> >> You have added an extra check for setting want_affine in patch 9 which
> >> >> uses capacity and utilization of cpu to disable this fast path when a
> >> >> task needs more capacity than available. Can't you use this function
> >> >> to disable the want_affine for cross-capacity migration situation that
> >> >> cares of the capacity and need the full scan of sched_domain but keep
> >> >> it enable for other cases ?
> >> >
> >> > It is not clear to me what the other cases are. What kind of cases do
> >> > you have in mind?
> >>
> >> As an example, you have a task A that have to be on a big CPU because
> >> of the requirement of compute capacity, that wakes up a task B that
> >> can run on any cpu according to its utilization. The fast wake up path
> >> is fine for task B whatever prev cpu is.
> >
> > In that case, we will take always take fast path (select_idle_sibling())
> > for task B if wake_wide() allows it, which should be fine.
> 
> Even if want_affine is set, the wake up of task B will not use the fast path.
> The affine_sd will not be set because the sched_domain, which have
> both cpus, will not have the SD_WAKE_AFFINE flag according to this
> patch, isn't it ?
> So task B can't use the fast path whereas nothing prevent him to take
> benefit of it
> 
> Am I missing something ?

No, I think you are right. Very good point. The cpumask test with
sched_domain_span() will of cause return false. So yes, in this case the
slow path is taken. It isn't wrong as such, just slower for asymmetric
capacity systems :-)

It is clearly not as optimized for asymmetric capacity systems as it
could be, but my focus was to not ruin existing behaviour and minimize
overhead for others. There are a lot of different routes through those
conditions in the first half of select_task_rq_fair() that aren't
obvious. I worry that some users depend on them and that I don't
see/understand all of them.

If people agree on changing things, it is fine with me. I just tried to
avoid getting the patches shot down on that account ;-)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/16] sched: Make SD_BALANCE_WAKE a topology flag
  2016-05-24 23:52   ` Yuyang Du
@ 2016-05-25  9:27     ` Morten Rasmussen
  0 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-25  9:27 UTC (permalink / raw)
  To: Yuyang Du
  Cc: peterz, mingo, dietmar.eggemann, vincent.guittot, mgalbraith,
	linux-kernel

On Wed, May 25, 2016 at 07:52:49AM +0800, Yuyang Du wrote:
> On Mon, May 23, 2016 at 11:58:49AM +0100, Morten Rasmussen wrote:
> > For systems with the SD_ASYM_CPUCAPACITY flag set on higher level in the
> > sched_domain hierarchy we need a way to enable wake-up balancing for the
> > lower levels as well as we may want to balance tasks that don't fit the
> > capacity of the previous cpu.
> > 
> > We have the option of introducing a new topology flag to express this
> > requirement, or let the existing SD_BALANCE_WAKE flag be set by the
> > architecture as a topology flag. The former means introducing yet
> > another flag, the latter breaks the current meaning of topology flags.
> > None of the options are really desirable.
>  
> I'd propose to replace SD_WAKE_AFFINE with SD_BALANCE_WAKE. And the
> SD_WAKE_AFFINE semantic is simply "waker allowed":
> 
> waker_allowed = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> 
> This can be implemented without current functionality change.
> 
> From there, the choice between waker and wakee, and fast path
> select_idle_sibling() and the rest slow path should be reworked, which
> I am thinking about.

I don't really understand how that would work. If you change the
semantics of the flags you don't preserve current behaviour. To me it
sounds like at total rewrite of everything.

SD_BALANCE_WAKE controls whether we go slow path or not in case
want_affine is false. SD_WAKE_AFFINE controls whether we should consider
waking up near the waker instead of always waking up near the previous
cpu.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-05-25  6:57   ` Wanpeng Li
@ 2016-05-25  9:49     ` Morten Rasmussen
  2016-05-25 10:29       ` Wanpeng Li
  0 siblings, 1 reply; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-25  9:49 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, yuyang.du,
	Vincent Guittot, Mike Galbraith, linux-kernel

On Wed, May 25, 2016 at 02:57:00PM +0800, Wanpeng Li wrote:
> 2016-05-23 18:58 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> > Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> > SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> > configurations SD_WAKE_AFFINE is only desirable if the waking task's
> > compute demand (utilization) is suitable for the cpu capacities
> > available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> > balancing take over (find_idlest_{group, cpu}()).
> >
> > The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
> > containing cpus with different capacities. This is enforced by a
> > previous patch based on the SD_ASYM_CPUCAPACITY flag.
> >
> > Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> > know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> > traversing them.
> >
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
> >  1 file changed, 27 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 564215d..ce44fa7 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
> >  unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
> >  #endif
> >
> > +/*
> > + * The margin used when comparing utilization with cpu capacity:
> > + * util * 1024 < capacity * margin
> > + */
> > +unsigned int capacity_margin = 1280; /* ~20% */
> > +
> >  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
> >  {
> >         lw->weight += inc;
> > @@ -5293,6 +5299,25 @@ static int cpu_util(int cpu)
> >         return (util >= capacity) ? capacity : util;
> >  }
> >
> > +static inline int task_util(struct task_struct *p)
> > +{
> > +       return p->se.avg.util_avg;
> > +}
> > +
> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > +{
> > +       long delta;
> > +       long prev_cap = capacity_of(prev_cpu);
> > +
> > +       delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
> > +
> > +       /* prev_cpu is fairly close to max, no need to abort wake_affine */
> > +       if (delta < prev_cap >> 3)
> > +               return 0;
> > +
> > +       return prev_cap * 1024 < task_util(p) * capacity_margin;
> > +}
> 
> If one task util_avg is SCHED_CAPACITY_SCALE and running on x86 box w/
> SMT enabled, then each HT has capacity 589, wake_cap() will result in
> always not wake affine, right?

The idea is that SMT systems would bail out already at the previous
condition. We should have max_cpu_capacity == prev_cap == 589, delta
should then be zero and make the first condition true and make
wake_cap() always return 0 for any system with symmetric capacities
regardless of their actual capacity values.

Note that this isn't entirely true as I used capacity_of() for prev_cap,
if I change that to capacity_orig_of() it should be true.

By making the !wake_cap() condition always true for want_affine, we
should preserve existing behaviour for SMT/SMP. The only overhead is the
capacity delta computation and comparison, which should be cheap.

Does that make sense?

Btw, task util_avg == SCHED_CAPACITY_SCALE should only be possible
temporarily, it should decay to util_avg <=
capacity_orig_of(task_cpu(p)) over time. That doesn't affect your
question though as the second condition would still evaluate true if
util_avg == capacity_orig_of(task_cpu(p)), but as said above the first
condition should bail out before we get here.

Morten

> > +
> >  /*
> >   * select_task_rq_fair: Select target runqueue for the waking task in domains
> >   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> > @@ -5316,7 +5341,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >
> >         if (sd_flag & SD_BALANCE_WAKE) {
> >                 record_wakee(p);
> > -               want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> > +               want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> > +                             && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> >         }
> >
> >         rcu_read_lock();
> > --
> > 1.9.1
> >

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-05-25  9:49     ` Morten Rasmussen
@ 2016-05-25 10:29       ` Wanpeng Li
  2016-05-25 10:54         ` Morten Rasmussen
  0 siblings, 1 reply; 61+ messages in thread
From: Wanpeng Li @ 2016-05-25 10:29 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, yuyang.du,
	Vincent Guittot, Mike Galbraith, linux-kernel

2016-05-25 17:49 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> On Wed, May 25, 2016 at 02:57:00PM +0800, Wanpeng Li wrote:
>> 2016-05-23 18:58 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
>> > Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
>> > SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
>> > configurations SD_WAKE_AFFINE is only desirable if the waking task's
>> > compute demand (utilization) is suitable for the cpu capacities
>> > available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
>> > balancing take over (find_idlest_{group, cpu}()).
>> >
>> > The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
>> > containing cpus with different capacities. This is enforced by a
>> > previous patch based on the SD_ASYM_CPUCAPACITY flag.
>> >
>> > Ideally, we shouldn't set 'want_affine' in the first place, but we don't
>> > know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
>> > traversing them.
>> >
>> > cc: Ingo Molnar <mingo@redhat.com>
>> > cc: Peter Zijlstra <peterz@infradead.org>
>> >
>> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> > ---
>> >  kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
>> >  1 file changed, 27 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 564215d..ce44fa7 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
>> >  unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
>> >  #endif
>> >
>> > +/*
>> > + * The margin used when comparing utilization with cpu capacity:
>> > + * util * 1024 < capacity * margin
>> > + */
>> > +unsigned int capacity_margin = 1280; /* ~20% */
>> > +
>> >  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
>> >  {
>> >         lw->weight += inc;
>> > @@ -5293,6 +5299,25 @@ static int cpu_util(int cpu)
>> >         return (util >= capacity) ? capacity : util;
>> >  }
>> >
>> > +static inline int task_util(struct task_struct *p)
>> > +{
>> > +       return p->se.avg.util_avg;
>> > +}
>> > +
>> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>> > +{
>> > +       long delta;
>> > +       long prev_cap = capacity_of(prev_cpu);
>> > +
>> > +       delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
>> > +
>> > +       /* prev_cpu is fairly close to max, no need to abort wake_affine */
>> > +       if (delta < prev_cap >> 3)
>> > +               return 0;
>> > +
>> > +       return prev_cap * 1024 < task_util(p) * capacity_margin;
>> > +}
>>
>> If one task util_avg is SCHED_CAPACITY_SCALE and running on x86 box w/
>> SMT enabled, then each HT has capacity 589, wake_cap() will result in
>> always not wake affine, right?
>
> The idea is that SMT systems would bail out already at the previous
> condition. We should have max_cpu_capacity == prev_cap == 589, delta
> should then be zero and make the first condition true and make
> wake_cap() always return 0 for any system with symmetric capacities
> regardless of their actual capacity values.
>
> Note that this isn't entirely true as I used capacity_of() for prev_cap,
> if I change that to capacity_orig_of() it should be true.
>
> By making the !wake_cap() condition always true for want_affine, we
> should preserve existing behaviour for SMT/SMP. The only overhead is the
> capacity delta computation and comparison, which should be cheap.
>
> Does that make sense?

Fair enough, thanks for your explanation.

>
> Btw, task util_avg == SCHED_CAPACITY_SCALE should only be possible
> temporarily, it should decay to util_avg <=
> capacity_orig_of(task_cpu(p)) over time. That doesn't affect your

Sorry, I didn't find it will decay to capacity_orig in
__update_load_avg(), could you elaborate?

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-05-25 10:29       ` Wanpeng Li
@ 2016-05-25 10:54         ` Morten Rasmussen
  2016-05-25 11:18           ` Wanpeng Li
  0 siblings, 1 reply; 61+ messages in thread
From: Morten Rasmussen @ 2016-05-25 10:54 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, yuyang.du,
	Vincent Guittot, Mike Galbraith, linux-kernel

On Wed, May 25, 2016 at 06:29:33PM +0800, Wanpeng Li wrote:
> 2016-05-25 17:49 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> > On Wed, May 25, 2016 at 02:57:00PM +0800, Wanpeng Li wrote:
> >> 2016-05-23 18:58 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> >> > Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> >> > SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> >> > configurations SD_WAKE_AFFINE is only desirable if the waking task's
> >> > compute demand (utilization) is suitable for the cpu capacities
> >> > available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> >> > balancing take over (find_idlest_{group, cpu}()).
> >> >
> >> > The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
> >> > containing cpus with different capacities. This is enforced by a
> >> > previous patch based on the SD_ASYM_CPUCAPACITY flag.
> >> >
> >> > Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> >> > know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> >> > traversing them.
> >> >
> >> > cc: Ingo Molnar <mingo@redhat.com>
> >> > cc: Peter Zijlstra <peterz@infradead.org>
> >> >
> >> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> >> > ---
> >> >  kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
> >> >  1 file changed, 27 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> > index 564215d..ce44fa7 100644
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
> >> >  unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
> >> >  #endif
> >> >
> >> > +/*
> >> > + * The margin used when comparing utilization with cpu capacity:
> >> > + * util * 1024 < capacity * margin
> >> > + */
> >> > +unsigned int capacity_margin = 1280; /* ~20% */
> >> > +
> >> >  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
> >> >  {
> >> >         lw->weight += inc;
> >> > @@ -5293,6 +5299,25 @@ static int cpu_util(int cpu)
> >> >         return (util >= capacity) ? capacity : util;
> >> >  }
> >> >
> >> > +static inline int task_util(struct task_struct *p)
> >> > +{
> >> > +       return p->se.avg.util_avg;
> >> > +}
> >> > +
> >> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> >> > +{
> >> > +       long delta;
> >> > +       long prev_cap = capacity_of(prev_cpu);
> >> > +
> >> > +       delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
> >> > +
> >> > +       /* prev_cpu is fairly close to max, no need to abort wake_affine */
> >> > +       if (delta < prev_cap >> 3)
> >> > +               return 0;
> >> > +
> >> > +       return prev_cap * 1024 < task_util(p) * capacity_margin;
> >> > +}
> >>
> >> If one task util_avg is SCHED_CAPACITY_SCALE and running on x86 box w/
> >> SMT enabled, then each HT has capacity 589, wake_cap() will result in
> >> always not wake affine, right?
> >
> > The idea is that SMT systems would bail out already at the previous
> > condition. We should have max_cpu_capacity == prev_cap == 589, delta
> > should then be zero and make the first condition true and make
> > wake_cap() always return 0 for any system with symmetric capacities
> > regardless of their actual capacity values.
> >
> > Note that this isn't entirely true as I used capacity_of() for prev_cap,
> > if I change that to capacity_orig_of() it should be true.
> >
> > By making the !wake_cap() condition always true for want_affine, we
> > should preserve existing behaviour for SMT/SMP. The only overhead is the
> > capacity delta computation and comparison, which should be cheap.
> >
> > Does that make sense?
> 
> Fair enough, thanks for your explanation.
> 
> >
> > Btw, task util_avg == SCHED_CAPACITY_SCALE should only be possible
> > temporarily, it should decay to util_avg <=
> > capacity_orig_of(task_cpu(p)) over time. That doesn't affect your
> 
> Sorry, I didn't find it will decay to capacity_orig in
> __update_load_avg(), could you elaborate?

I should have checked the code before writing that :-( I thought the
scaling by arch_scale_cpu_capacity() in __update_load_avg() would do
that, but it turns out that the default implementation of
arch_scale_cpu_capacity() doesn't do that when we pass a NULL pointer
for the sched_domain, it would have returned smt_gain/span_weight ==
capacity_orig_of(cpu) otherwise.

Sorry for the confusion, though I'm not sure if it is right to return
SCHED_CAPACITY_SCALE for SMT systems.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-05-25 10:54         ` Morten Rasmussen
@ 2016-05-25 11:18           ` Wanpeng Li
  0 siblings, 0 replies; 61+ messages in thread
From: Wanpeng Li @ 2016-05-25 11:18 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, yuyang.du,
	Vincent Guittot, Mike Galbraith, linux-kernel

2016-05-25 18:54 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> On Wed, May 25, 2016 at 06:29:33PM +0800, Wanpeng Li wrote:
>> 2016-05-25 17:49 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
>> > On Wed, May 25, 2016 at 02:57:00PM +0800, Wanpeng Li wrote:
>> >> 2016-05-23 18:58 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
>> >> > Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
>> >> > SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
>> >> > configurations SD_WAKE_AFFINE is only desirable if the waking task's
>> >> > compute demand (utilization) is suitable for the cpu capacities
>> >> > available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
>> >> > balancing take over (find_idlest_{group, cpu}()).
>> >> >
>> >> > The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
>> >> > containing cpus with different capacities. This is enforced by a
>> >> > previous patch based on the SD_ASYM_CPUCAPACITY flag.
>> >> >
>> >> > Ideally, we shouldn't set 'want_affine' in the first place, but we don't
>> >> > know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
>> >> > traversing them.
>> >> >
>> >> > cc: Ingo Molnar <mingo@redhat.com>
>> >> > cc: Peter Zijlstra <peterz@infradead.org>
>> >> >
>> >> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> >> > ---
>> >> >  kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
>> >> >  1 file changed, 27 insertions(+), 1 deletion(-)
>> >> >
>> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> >> > index 564215d..ce44fa7 100644
>> >> > --- a/kernel/sched/fair.c
>> >> > +++ b/kernel/sched/fair.c
>> >> > @@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
>> >> >  unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
>> >> >  #endif
>> >> >
>> >> > +/*
>> >> > + * The margin used when comparing utilization with cpu capacity:
>> >> > + * util * 1024 < capacity * margin
>> >> > + */
>> >> > +unsigned int capacity_margin = 1280; /* ~20% */
>> >> > +
>> >> >  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
>> >> >  {
>> >> >         lw->weight += inc;
>> >> > @@ -5293,6 +5299,25 @@ static int cpu_util(int cpu)
>> >> >         return (util >= capacity) ? capacity : util;
>> >> >  }
>> >> >
>> >> > +static inline int task_util(struct task_struct *p)
>> >> > +{
>> >> > +       return p->se.avg.util_avg;
>> >> > +}
>> >> > +
>> >> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>> >> > +{
>> >> > +       long delta;
>> >> > +       long prev_cap = capacity_of(prev_cpu);
>> >> > +
>> >> > +       delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
>> >> > +
>> >> > +       /* prev_cpu is fairly close to max, no need to abort wake_affine */
>> >> > +       if (delta < prev_cap >> 3)
>> >> > +               return 0;
>> >> > +
>> >> > +       return prev_cap * 1024 < task_util(p) * capacity_margin;
>> >> > +}
>> >>
>> >> If one task util_avg is SCHED_CAPACITY_SCALE and running on x86 box w/
>> >> SMT enabled, then each HT has capacity 589, wake_cap() will result in
>> >> always not wake affine, right?
>> >
>> > The idea is that SMT systems would bail out already at the previous
>> > condition. We should have max_cpu_capacity == prev_cap == 589, delta
>> > should then be zero and make the first condition true and make
>> > wake_cap() always return 0 for any system with symmetric capacities
>> > regardless of their actual capacity values.
>> >
>> > Note that this isn't entirely true as I used capacity_of() for prev_cap,
>> > if I change that to capacity_orig_of() it should be true.
>> >
>> > By making the !wake_cap() condition always true for want_affine, we
>> > should preserve existing behaviour for SMT/SMP. The only overhead is the
>> > capacity delta computation and comparison, which should be cheap.
>> >
>> > Does that make sense?
>>
>> Fair enough, thanks for your explanation.
>>
>> >
>> > Btw, task util_avg == SCHED_CAPACITY_SCALE should only be possible
>> > temporarily, it should decay to util_avg <=
>> > capacity_orig_of(task_cpu(p)) over time. That doesn't affect your
>>
>> Sorry, I didn't find it will decay to capacity_orig in
>> __update_load_avg(), could you elaborate?
>
> I should have checked the code before writing that :-( I thought the
> scaling by arch_scale_cpu_capacity() in __update_load_avg() would do
> that, but it turns out that the default implementation of
> arch_scale_cpu_capacity() doesn't do that when we pass a NULL pointer
> for the sched_domain, it would have returned smt_gain/span_weight ==
> capacity_orig_of(cpu) otherwise.

Thanks for the explanation. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations
  2016-05-25  9:12                     ` Morten Rasmussen
@ 2016-05-26  6:45                       ` Vincent Guittot
  2016-06-07 16:50                         ` Morten Rasmussen
  0 siblings, 1 reply; 61+ messages in thread
From: Vincent Guittot @ 2016-05-26  6:45 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On 25 May 2016 at 11:12, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, May 24, 2016 at 05:53:27PM +0200, Vincent Guittot wrote:
>> On 24 May 2016 at 17:02, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Tue, May 24, 2016 at 03:52:00PM +0200, Vincent Guittot wrote:
>> >> On 24 May 2016 at 15:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> >> > On Tue, May 24, 2016 at 03:27:05PM +0200, Vincent Guittot wrote:
>> >> >> On 24 May 2016 at 15:16, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> >> >> > On Tue, May 24, 2016 at 02:12:38PM +0200, Vincent Guittot wrote:
>> >> >> >> On 24 May 2016 at 12:29, Morten Rasmussen <morten.rasmussen@arm.com> wrote:

[snip]

>> >> > It is not clear to me what the other cases are. What kind of cases do
>> >> > you have in mind?
>> >>
>> >> As an example, you have a task A that have to be on a big CPU because
>> >> of the requirement of compute capacity, that wakes up a task B that
>> >> can run on any cpu according to its utilization. The fast wake up path
>> >> is fine for task B whatever prev cpu is.
>> >
>> > In that case, we will take always take fast path (select_idle_sibling())
>> > for task B if wake_wide() allows it, which should be fine.
>>
>> Even if want_affine is set, the wake up of task B will not use the fast path.
>> The affine_sd will not be set because the sched_domain, which have
>> both cpus, will not have the SD_WAKE_AFFINE flag according to this
>> patch, isn't it ?
>> So task B can't use the fast path whereas nothing prevent him to take
>> benefit of it
>>
>> Am I missing something ?
>
> No, I think you are right. Very good point. The cpumask test with
> sched_domain_span() will of cause return false. So yes, in this case the
> slow path is taken. It isn't wrong as such, just slower for asymmetric
> capacity systems :-)
>

So, I still don't see why the function wake_cap that is introduce by
patch 9 can't be used for testing cross capacity migration at wake up
?
The only reason for which we would like to skip fast wake up path for
cross capacity migration, is whether the task needs more capacity than
the capacity of cpu or prev_cpu. You already do that for prev_cpu,
can't you add same test for cpu ?

> It is clearly not as optimized for asymmetric capacity systems as it
> could be, but my focus was to not ruin existing behaviour and minimize
> overhead for others. There are a lot of different routes through those
> conditions in the first half of select_task_rq_fair() that aren't
> obvious. I worry that some users depend on them and that I don't
> see/understand all of them.
>
> If people agree on changing things, it is fine with me. I just tried to
> avoid getting the patches shot down on that account ;-)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 02/16] sched/fair: Consistent use of prev_cpu in wakeup path
  2016-05-23 10:58 ` [PATCH 02/16] sched/fair: Consistent use of prev_cpu in wakeup path Morten Rasmussen
@ 2016-06-01 19:49   ` Peter Zijlstra
  0 siblings, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2016-06-01 19:49 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Rik van Riel

On Mon, May 23, 2016 at 11:58:44AM +0100, Morten Rasmussen wrote:
> In commit ac66f5477239 ("sched/numa: Introduce migrate_swap()")
> select_task_rq() got a 'cpu' argument to enable overriding of prev_cpu
> in special cases (NUMA task swapping). However, the
> select_task_rq_fair() helper functions: wake_affine() and
> select_idle_sibling(), still use task_cpu(p) directly to work out
> prev_cpu which leads to inconsistencies.
> 
> This patch passes prev_cpu (potentially overridden by NUMA code) into
> the helper functions to ensure prev_cpu is indeed the same cpu
> everywhere in the wakeup path.

Rik, can you get this ran through the NUMA benchmarks?

> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 24 +++++++++++++-----------
>  1 file changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 218f8e8..c49e25a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -656,7 +656,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  }
>  
>  #ifdef CONFIG_SMP
> -static int select_idle_sibling(struct task_struct *p, int cpu);
> +static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
>  static unsigned long task_h_load(struct task_struct *p);
>  
>  /*
> @@ -1502,7 +1502,8 @@ static void task_numa_compare(struct task_numa_env *env,
>  	 * Call select_idle_sibling to maybe find a better one.
>  	 */
>  	if (!cur)
> -		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
> +		env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
> +						   env->dst_cpu);
>  
>  assign:
>  	assigned = true;
> @@ -5013,18 +5014,18 @@ static int wake_wide(struct task_struct *p)
>  	return 1;
>  }
>  
> -static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> +static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> +		       int prev_cpu, int sync)
>  {
>  	s64 this_load, load;
>  	s64 this_eff_load, prev_eff_load;
> -	int idx, this_cpu, prev_cpu;
> +	int idx, this_cpu;
>  	struct task_group *tg;
>  	unsigned long weight;
>  	int balanced;
>  
>  	idx	  = sd->wake_idx;
>  	this_cpu  = smp_processor_id();
> -	prev_cpu  = task_cpu(p);
>  	load	  = source_load(prev_cpu, idx);
>  	this_load = target_load(this_cpu, idx);
>  
> @@ -5189,11 +5190,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  /*
>   * Try and locate an idle CPU in the sched_domain.
>   */
> -static int select_idle_sibling(struct task_struct *p, int target)
> +static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  {
>  	struct sched_domain *sd;
>  	struct sched_group *sg;
> -	int i = task_cpu(p);
>  
>  	if (idle_cpu(target))
>  		return target;
> @@ -5201,8 +5201,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  	/*
>  	 * If the prevous cpu is cache affine and idle, don't be stupid.
>  	 */
> -	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
> -		return i;
> +	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
> +		return prev;
>  
>  	/*
>  	 * Otherwise, iterate the domains and find an eligible idle cpu.
> @@ -5223,6 +5223,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  	for_each_lower_domain(sd) {
>  		sg = sd->groups;
>  		do {
> +			int i;
> +
>  			if (!cpumask_intersects(sched_group_cpus(sg),
>  						tsk_cpus_allowed(p)))
>  				goto next;
> @@ -5331,13 +5333,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  
>  	if (affine_sd) {
>  		sd = NULL; /* Prefer wake_affine over balance flags */
> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> +		if (cpu != prev_cpu && wake_affine(affine_sd, p, prev_cpu, sync))
>  			new_cpu = cpu;
>  	}
>  
>  	if (!sd) {
>  		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
> -			new_cpu = select_idle_sibling(p, new_cpu);
> +			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
>  
>  	} else while (sd) {
>  		struct sched_group *group;
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide
  2016-05-23 12:00     ` Morten Rasmussen
  2016-05-23 13:00       ` Mike Galbraith
  2016-05-23 23:04       ` Yuyang Du
@ 2016-06-01 19:57       ` Peter Zijlstra
  2016-06-02  8:05         ` Peter Zijlstra
  2 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2016-06-01 19:57 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Mike Galbraith, mingo, dietmar.eggemann, yuyang.du,
	vincent.guittot, linux-kernel

On Mon, May 23, 2016 at 01:00:10PM +0100, Morten Rasmussen wrote:
> On Mon, May 23, 2016 at 01:12:07PM +0200, Mike Galbraith wrote:
> > On Mon, 2016-05-23 at 11:58 +0100, Morten Rasmussen wrote:
> > > wake_wide() is based on task wakee_flips of the waker and the wakee to
> > > decide whether an affine wakeup is desirable. On lightly loaded systems
> > > the waker is frequently the idle task (pid=0) which can accumulate a lot
> > > of wakee_flips in that scenario. It makes little sense to prevent affine
> > > wakeups on an idle cpu due to the idle task wakee_flips, so it makes
> > > more sense to ignore them in wake_wide().
> > 
> > You sure?  What's the difference between a task flipping enough to
> > warrant spreading the load, and an interrupt source doing the same? 
> >  I've both witnessed firsthand, and received user confirmation of this
> > very thing improving utilization.
> 
> Right, I didn't consider the interrupt source scenario, my fault.
> 
> The problem then seems to be distinguishing truly idle and busy doing
> interrupts. The issue that I observe is that wake_wide() likes pushing
> tasks around in lightly scenarios which isn't desirable for power
> management. Selecting the same cpu again may potentially let others
> reach deeper C-state.
> 
> With that in mind I will if I can do better. Suggestions are welcome :-)

Seeing how we always so select_idle_siblings() after affine_sd, the only
wake_affine movement that matters is cross-llc.

So intra-llc wakeups can avoid the movement, no?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 04/16] sched/fair: Optimize find_idlest_cpu() when there is no choice
  2016-05-23 10:58 ` [PATCH 04/16] sched/fair: Optimize find_idlest_cpu() when there is no choice Morten Rasmussen
  2016-05-24  6:29   ` Mike Galbraith
@ 2016-06-01 19:59   ` Peter Zijlstra
  2016-06-07 14:25     ` Morten Rasmussen
  1 sibling, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2016-06-01 19:59 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Mon, May 23, 2016 at 11:58:46AM +0100, Morten Rasmussen wrote:
> In the current find_idlest_group()/find_idlest_cpu() search we end up
> calling find_idlest_cpu() in a sched_group containing only one cpu in
> the end. Checking idle-states becomes pointless when there is no
> alternative, so bail out instead.
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fe3020..564215d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5155,6 +5155,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  	int shallowest_idle_cpu = -1;
>  	int i;
>  
> +	/* Check if we have any choice */
> +	if (group->group_weight == 1) {
> +		return cpumask_first(sched_group_cpus(group));
> +	}

superfluous brackets, also isn't @this_cpu supposed to be part of the
sched_group_cpu(group) mask? 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/16] sched: Make SD_BALANCE_WAKE a topology flag
  2016-05-23 10:58 ` [PATCH 07/16] sched: Make SD_BALANCE_WAKE a topology flag Morten Rasmussen
  2016-05-24 23:52   ` Yuyang Du
@ 2016-06-01 20:18   ` Peter Zijlstra
  2016-06-08  8:45     ` Morten Rasmussen
  1 sibling, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2016-06-01 20:18 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Mon, May 23, 2016 at 11:58:49AM +0100, Morten Rasmussen wrote:
> For systems with the SD_ASYM_CPUCAPACITY flag set on higher level in the
> sched_domain hierarchy we need a way to enable wake-up balancing for the
> lower levels as well as we may want to balance tasks that don't fit the
> capacity of the previous cpu.
> 
> We have the option of introducing a new topology flag to express this
> requirement, or let the existing SD_BALANCE_WAKE flag be set by the
> architecture as a topology flag. The former means introducing yet
> another flag, the latter breaks the current meaning of topology flags.
> None of the options are really desirable.

So why can't you couple this to ASYM_CAPACITY? If that's set anywhere,
add BALANCE_WAKE as appropriate?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide
  2016-06-01 19:57       ` Peter Zijlstra
@ 2016-06-02  8:05         ` Peter Zijlstra
  2016-06-07 12:08           ` Morten Rasmussen
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2016-06-02  8:05 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Mike Galbraith, mingo, dietmar.eggemann, yuyang.du,
	vincent.guittot, linux-kernel

On Wed, Jun 01, 2016 at 09:57:23PM +0200, Peter Zijlstra wrote:
> On Mon, May 23, 2016 at 01:00:10PM +0100, Morten Rasmussen wrote:
> > On Mon, May 23, 2016 at 01:12:07PM +0200, Mike Galbraith wrote:
> > > On Mon, 2016-05-23 at 11:58 +0100, Morten Rasmussen wrote:
> > > > wake_wide() is based on task wakee_flips of the waker and the wakee to
> > > > decide whether an affine wakeup is desirable. On lightly loaded systems
> > > > the waker is frequently the idle task (pid=0) which can accumulate a lot
> > > > of wakee_flips in that scenario. It makes little sense to prevent affine
> > > > wakeups on an idle cpu due to the idle task wakee_flips, so it makes
> > > > more sense to ignore them in wake_wide().
> > > 
> > > You sure?  What's the difference between a task flipping enough to
> > > warrant spreading the load, and an interrupt source doing the same? 
> > >  I've both witnessed firsthand, and received user confirmation of this
> > > very thing improving utilization.
> > 
> > Right, I didn't consider the interrupt source scenario, my fault.
> > 
> > The problem then seems to be distinguishing truly idle and busy doing
> > interrupts. The issue that I observe is that wake_wide() likes pushing
> > tasks around in lightly scenarios which isn't desirable for power
> > management. Selecting the same cpu again may potentially let others
> > reach deeper C-state.
> > 
> > With that in mind I will if I can do better. Suggestions are welcome :-)
> 
> Seeing how we always so select_idle_siblings() after affine_sd, the only
> wake_affine movement that matters is cross-llc.
> 
> So intra-llc wakeups can avoid the movement, no?

Won't help I think; the interrupt that got us in this situation will
already have wrecked your idle time/state to begin with. You really want
to help interrupt routing.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-05-23 10:58 ` [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
                     ` (2 preceding siblings ...)
  2016-05-25  6:57   ` Wanpeng Li
@ 2016-06-02 14:21   ` Peter Zijlstra
  2016-06-08 11:29     ` Morten Rasmussen
  3 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2016-06-02 14:21 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Mon, May 23, 2016 at 11:58:51AM +0100, Morten Rasmussen wrote:
> Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> configurations SD_WAKE_AFFINE is only desirable if the waking task's
> compute demand (utilization) is suitable for the cpu capacities
> available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> balancing take over (find_idlest_{group, cpu}()).
> 
> The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
> containing cpus with different capacities. This is enforced by a
> previous patch based on the SD_ASYM_CPUCAPACITY flag.
> 
> Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> traversing them.

I'm a bit confused...

Lets assume a 2+2 big.little thing with shared LLC:


	---------- SD2 ----------

	-- SD1 --	-- SD1 --

	0	1	2	3


SD1: WAKE_AFFINE, BALANCE_WAKE
SD2: ASYM_CAPACITY, BALANCE_WAKE

t0 used to run on cpu1, t0 used to run on cpu2

cpu0 wakes t0:

  want_affine = 1
  SD1:
    WAKE_AFFINE
      cpumask_test_cpu(prev_cpu, sd_mask) == true
    affine_sd = SD1
    break;

  affine_sd != NULL -> affine-wakeup

cpu0 wakes t1:

  want_affine = 1
  SD1:
    WAKE_AFFINE
      cpumask_test_cpu(prev_cpu, sd_mask) == false
  SD2:
    BALANCE_WAKE
      sd = SD2

  affine_sd == NULL, sd == SD2 -> find_idlest_*()


All without this patch...

So what is this thing doing?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide
  2016-06-02  8:05         ` Peter Zijlstra
@ 2016-06-07 12:08           ` Morten Rasmussen
  0 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-06-07 12:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, mingo, dietmar.eggemann, yuyang.du,
	vincent.guittot, linux-kernel

On Thu, Jun 02, 2016 at 10:05:09AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 01, 2016 at 09:57:23PM +0200, Peter Zijlstra wrote:
> > On Mon, May 23, 2016 at 01:00:10PM +0100, Morten Rasmussen wrote:
> > > On Mon, May 23, 2016 at 01:12:07PM +0200, Mike Galbraith wrote:
> > > > On Mon, 2016-05-23 at 11:58 +0100, Morten Rasmussen wrote:
> > > > > wake_wide() is based on task wakee_flips of the waker and the wakee to
> > > > > decide whether an affine wakeup is desirable. On lightly loaded systems
> > > > > the waker is frequently the idle task (pid=0) which can accumulate a lot
> > > > > of wakee_flips in that scenario. It makes little sense to prevent affine
> > > > > wakeups on an idle cpu due to the idle task wakee_flips, so it makes
> > > > > more sense to ignore them in wake_wide().
> > > > 
> > > > You sure?  What's the difference between a task flipping enough to
> > > > warrant spreading the load, and an interrupt source doing the same? 
> > > >  I've both witnessed firsthand, and received user confirmation of this
> > > > very thing improving utilization.
> > > 
> > > Right, I didn't consider the interrupt source scenario, my fault.
> > > 
> > > The problem then seems to be distinguishing truly idle and busy doing
> > > interrupts. The issue that I observe is that wake_wide() likes pushing
> > > tasks around in lightly scenarios which isn't desirable for power
> > > management. Selecting the same cpu again may potentially let others
> > > reach deeper C-state.
> > > 
> > > With that in mind I will if I can do better. Suggestions are welcome :-)
> > 
> > Seeing how we always so select_idle_siblings() after affine_sd, the only
> > wake_affine movement that matters is cross-llc.
> > 
> > So intra-llc wakeups can avoid the movement, no?

I think so. I don't see a point in setting affine_sd if it is an
intra-llc wakeup. Maybe make it conditional on tmp->flags &
SD_SHARE_PKG_RESOURCES ?

I would help minimizing some intra-llc wakeup migrations, but not
inter-llc migrations in largely idle scenarios.

> Won't help I think; the interrupt that got us in this situation will
> already have wrecked your idle time/state to begin with. You really want
> to help interrupt routing.

Improved interrupt routing would be nice, but we need to get the
scheduler to not migrate the tasks away from the interrupts in those
partially idle scenarios if we should have chance optimizing idle
time/states by getting the interrupt routed to the cpu where the
consumer task is.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 04/16] sched/fair: Optimize find_idlest_cpu() when there is no choice
  2016-06-01 19:59   ` Peter Zijlstra
@ 2016-06-07 14:25     ` Morten Rasmussen
  0 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-06-07 14:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Wed, Jun 01, 2016 at 09:59:20PM +0200, Peter Zijlstra wrote:
> On Mon, May 23, 2016 at 11:58:46AM +0100, Morten Rasmussen wrote:
> > In the current find_idlest_group()/find_idlest_cpu() search we end up
> > calling find_idlest_cpu() in a sched_group containing only one cpu in
> > the end. Checking idle-states becomes pointless when there is no
> > alternative, so bail out instead.
> > 
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> > 
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0fe3020..564215d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5155,6 +5155,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> >  	int shallowest_idle_cpu = -1;
> >  	int i;
> >  
> > +	/* Check if we have any choice */
> > +	if (group->group_weight == 1) {
> > +		return cpumask_first(sched_group_cpus(group));
> > +	}
> 
> superfluous brackets, also isn't @this_cpu supposed to be part of the
> sched_group_cpu(group) mask?

I'll kill the brackets.

IIUC, @this_cpu is never part of sched_group_cpus(group).
find_idlest_group() never returns the local group, so if we manage to
get to find_idlest_cpu(), the group would not contain @this_cpu.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations
  2016-05-26  6:45                       ` Vincent Guittot
@ 2016-06-07 16:50                         ` Morten Rasmussen
  0 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-06-07 16:50 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On Thu, May 26, 2016 at 08:45:03AM +0200, Vincent Guittot wrote:
> On 25 May 2016 at 11:12, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Tue, May 24, 2016 at 05:53:27PM +0200, Vincent Guittot wrote:
> >> On 24 May 2016 at 17:02, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> > On Tue, May 24, 2016 at 03:52:00PM +0200, Vincent Guittot wrote:
> >> >> On 24 May 2016 at 15:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> >> > On Tue, May 24, 2016 at 03:27:05PM +0200, Vincent Guittot wrote:
> >> >> >> On 24 May 2016 at 15:16, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> >> >> > On Tue, May 24, 2016 at 02:12:38PM +0200, Vincent Guittot wrote:
> >> >> >> >> On 24 May 2016 at 12:29, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> 
> [snip]
> 
> >> >> > It is not clear to me what the other cases are. What kind of cases do
> >> >> > you have in mind?
> >> >>
> >> >> As an example, you have a task A that have to be on a big CPU because
> >> >> of the requirement of compute capacity, that wakes up a task B that
> >> >> can run on any cpu according to its utilization. The fast wake up path
> >> >> is fine for task B whatever prev cpu is.
> >> >
> >> > In that case, we will take always take fast path (select_idle_sibling())
> >> > for task B if wake_wide() allows it, which should be fine.
> >>
> >> Even if want_affine is set, the wake up of task B will not use the fast path.
> >> The affine_sd will not be set because the sched_domain, which have
> >> both cpus, will not have the SD_WAKE_AFFINE flag according to this
> >> patch, isn't it ?
> >> So task B can't use the fast path whereas nothing prevent him to take
> >> benefit of it
> >>
> >> Am I missing something ?
> >
> > No, I think you are right. Very good point. The cpumask test with
> > sched_domain_span() will of cause return false. So yes, in this case the
> > slow path is taken. It isn't wrong as such, just slower for asymmetric
> > capacity systems :-)
> >
> 
> So, I still don't see why the function wake_cap that is introduce by
> patch 9 can't be used for testing cross capacity migration at wake up
> ?
> The only reason for which we would like to skip fast wake up path for
> cross capacity migration, is whether the task needs more capacity than
> the capacity of cpu or prev_cpu. You already do that for prev_cpu,
> can't you add same test for cpu ?

I think I finally see what you mean. Thanks for your patience :-)

It should be safe to let wake_affine be cross-capacity as long as we
only take the fast path when both prev_cpu and this_cpu have sufficient
capacity for the task. select_idle_sibling() can never end up looking at
cpus with different capacities than those of this_cpu and prev_cpu.

I will give that a try. I have a few ideas on how it can extended to use
the fast path in a few additional cases as well.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/16] sched: Make SD_BALANCE_WAKE a topology flag
  2016-06-01 20:18   ` Peter Zijlstra
@ 2016-06-08  8:45     ` Morten Rasmussen
  0 siblings, 0 replies; 61+ messages in thread
From: Morten Rasmussen @ 2016-06-08  8:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Wed, Jun 01, 2016 at 10:18:17PM +0200, Peter Zijlstra wrote:
> On Mon, May 23, 2016 at 11:58:49AM +0100, Morten Rasmussen wrote:
> > For systems with the SD_ASYM_CPUCAPACITY flag set on higher level in the
> > sched_domain hierarchy we need a way to enable wake-up balancing for the
> > lower levels as well as we may want to balance tasks that don't fit the
> > capacity of the previous cpu.
> > 
> > We have the option of introducing a new topology flag to express this
> > requirement, or let the existing SD_BALANCE_WAKE flag be set by the
> > architecture as a topology flag. The former means introducing yet
> > another flag, the latter breaks the current meaning of topology flags.
> > None of the options are really desirable.
> 
> So why can't you couple this to ASYM_CAPACITY? If that's set anywhere,
> add BALANCE_WAKE as appropriate?

That should be possible. It is set at the sched_domain levels that span
cpus of different capacity, but we need to set BALANCE_WAKE on the level
below as well. So we would have to introduce a dependency between flags
set at different levels.

However, following the discussion with Vincent on enabling WAKE_AFFINE
across cpus of different capacity, I might be able to repurpose
ASYM_CPUCAPACITY to set BALANCE_WAKE instead which would be simpler I
think.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-06-02 14:21   ` Peter Zijlstra
@ 2016-06-08 11:29     ` Morten Rasmussen
  2016-06-08 14:36       ` Peter Zijlstra
  0 siblings, 1 reply; 61+ messages in thread
From: Morten Rasmussen @ 2016-06-08 11:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Thu, Jun 02, 2016 at 04:21:05PM +0200, Peter Zijlstra wrote:
> On Mon, May 23, 2016 at 11:58:51AM +0100, Morten Rasmussen wrote:
> > Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> > SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> > configurations SD_WAKE_AFFINE is only desirable if the waking task's
> > compute demand (utilization) is suitable for the cpu capacities
> > available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> > balancing take over (find_idlest_{group, cpu}()).
> > 
> > The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
> > containing cpus with different capacities. This is enforced by a
> > previous patch based on the SD_ASYM_CPUCAPACITY flag.
> > 
> > Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> > know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> > traversing them.
> 
> I'm a bit confused...
> 
> Lets assume a 2+2 big.little thing with shared LLC:
> 
> 
> 	---------- SD2 ----------
> 
> 	-- SD1 --	-- SD1 --
> 
> 	0	1	2	3
> 
> 
> SD1: WAKE_AFFINE, BALANCE_WAKE
> SD2: ASYM_CAPACITY, BALANCE_WAKE
> 
> t0 used to run on cpu1, t0 used to run on cpu2
> 
> cpu0 wakes t0:
> 
>   want_affine = 1
>   SD1:
>     WAKE_AFFINE
>       cpumask_test_cpu(prev_cpu, sd_mask) == true
>     affine_sd = SD1
>     break;
> 
>   affine_sd != NULL -> affine-wakeup
> 
> cpu0 wakes t1:
> 
>   want_affine = 1
>   SD1:
>     WAKE_AFFINE
>       cpumask_test_cpu(prev_cpu, sd_mask) == false
>   SD2:
>     BALANCE_WAKE
>       sd = SD2
> 
>   affine_sd == NULL, sd == SD2 -> find_idlest_*()
> 
> 
> All without this patch...
> 
> So what is this thing doing?

Not very much in those cases, but it makes one important difference in
one case. We could do fine without the patch if we could assume that all
tasks are already in the right SD according their PELT utilization and
if not they will be woken up by a cpu in the right SD (so we do
find_idlest_*()). But we can't :-(

Let's take your example above and add that t0 should really be running
on cpu2/3 due to its utilization, assuming SD1[01] are little cpus and
SD1[23] are big cpus. In that case we would still do affine-wakeup and
stick the task on cpu0 despite it being a little cpu.

To avoid that, this patch sets want_affine = 0 in that case so we go
find_idlest_*() to give the task a chance of being put on cpu2/3. The
patch is also setting want_affine = 0 for other cases which are already
taking the find_idlest_*() route due to the cpumask test as illustrated
by your example above.

We can have the current scenarios:

b = big cpu capacity/task util
l = little cpu capacity/task util
x = don't care

case	task util	prev_cpu	this_cpu	wakeup
-------------------------------------------------------------------
1	b		b		b		affine (b)
2	b		b		l		slow (b)
3	b		l		b		slow (b)
4	b		l		l		slow (b)
5	l		b		b		affine (x)
6	l		b		l		slow (x)
7	l		l		b		slow (x)
8	l		l		l		affine (x)

Without the patch we would do affine-wakeup on little in case 4, where
we want to wake up on a big cpu. We only do affine-wakeup when both
this_cpu and prev_cpu have the same capacity and that capacity is
sufficient.

Vincent pointed out that it is overly restrictive as it is perfectly
safe to do affine-wakeup in case 6 and 7, where the waker and the
previous cpu have sufficient capacity but they are not the same.

If we made wake_affine() consider cpu capacity, it should be possible to
do affine-wakeup even for case 2 and 3, leaving us with only case 4
requiring the find_idles_*() route.

There are more cases for taking the slow wakeup path if you have more
than two cpu capacities to deal with, but I'm going to spare you the
full detailed table ;-)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-06-08 11:29     ` Morten Rasmussen
@ 2016-06-08 14:36       ` Peter Zijlstra
  0 siblings, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2016-06-08 14:36 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Wed, Jun 08, 2016 at 12:29:52PM +0100, Morten Rasmussen wrote:
> Let's take your example above and add that t0 should really be running
> on cpu2/3 due to its utilization, assuming SD1[01] are little cpus and
> SD1[23] are big cpus. In that case we would still do affine-wakeup and
> stick the task on cpu0 despite it being a little cpu.

Ah indeed.

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2016-06-08 14:36 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-23 10:58 [PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
2016-05-23 10:58 ` [PATCH 01/16] sched: Fix power to capacity renaming in comment Morten Rasmussen
2016-05-23 10:58 ` [PATCH 02/16] sched/fair: Consistent use of prev_cpu in wakeup path Morten Rasmussen
2016-06-01 19:49   ` Peter Zijlstra
2016-05-23 10:58 ` [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide Morten Rasmussen
2016-05-23 11:12   ` Mike Galbraith
2016-05-23 12:00     ` Morten Rasmussen
2016-05-23 13:00       ` Mike Galbraith
2016-05-23 14:10         ` Morten Rasmussen
2016-05-23 15:42           ` Mike Galbraith
2016-05-23 23:17             ` Yuyang Du
2016-05-23 23:04       ` Yuyang Du
2016-06-01 19:57       ` Peter Zijlstra
2016-06-02  8:05         ` Peter Zijlstra
2016-06-07 12:08           ` Morten Rasmussen
2016-05-23 10:58 ` [PATCH 04/16] sched/fair: Optimize find_idlest_cpu() when there is no choice Morten Rasmussen
2016-05-24  6:29   ` Mike Galbraith
2016-05-24  8:05     ` Morten Rasmussen
2016-05-24  8:12       ` Mike Galbraith
2016-06-01 19:59   ` Peter Zijlstra
2016-06-07 14:25     ` Morten Rasmussen
2016-05-23 10:58 ` [PATCH 05/16] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Morten Rasmussen
2016-05-23 10:58 ` [PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations Morten Rasmussen
2016-05-24  9:10   ` Vincent Guittot
2016-05-24 10:29     ` Morten Rasmussen
2016-05-24 12:12       ` Vincent Guittot
2016-05-24 13:16         ` Morten Rasmussen
2016-05-24 13:27           ` Vincent Guittot
2016-05-24 13:36             ` Morten Rasmussen
2016-05-24 13:52               ` Vincent Guittot
2016-05-24 15:02                 ` Morten Rasmussen
2016-05-24 15:53                   ` Vincent Guittot
2016-05-25  9:12                     ` Morten Rasmussen
2016-05-26  6:45                       ` Vincent Guittot
2016-06-07 16:50                         ` Morten Rasmussen
2016-05-23 10:58 ` [PATCH 07/16] sched: Make SD_BALANCE_WAKE a topology flag Morten Rasmussen
2016-05-24 23:52   ` Yuyang Du
2016-05-25  9:27     ` Morten Rasmussen
2016-06-01 20:18   ` Peter Zijlstra
2016-06-08  8:45     ` Morten Rasmussen
2016-05-23 10:58 ` [PATCH 08/16] sched: Store maximum per-cpu capacity in root domain Morten Rasmussen
2016-05-23 10:58 ` [PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
2016-05-24  0:04   ` Yuyang Du
2016-05-24  8:10     ` Morten Rasmussen
2016-05-24  7:03   ` Mike Galbraith
2016-05-24  7:15     ` Mike Galbraith
2016-05-25  6:57   ` Wanpeng Li
2016-05-25  9:49     ` Morten Rasmussen
2016-05-25 10:29       ` Wanpeng Li
2016-05-25 10:54         ` Morten Rasmussen
2016-05-25 11:18           ` Wanpeng Li
2016-06-02 14:21   ` Peter Zijlstra
2016-06-08 11:29     ` Morten Rasmussen
2016-06-08 14:36       ` Peter Zijlstra
2016-05-23 10:58 ` [PATCH 10/16] sched/fair: Compute task/cpu utilization at wake-up more correctly Morten Rasmussen
2016-05-23 10:58 ` [PATCH 11/16] sched/fair: Consider spare capacity in find_idlest_group() Morten Rasmussen
2016-05-23 10:58 ` [PATCH 12/16] sched: Add per-cpu max capacity to sched_group_capacity Morten Rasmussen
2016-05-23 10:58 ` [PATCH 13/16] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups Morten Rasmussen
2016-05-23 10:58 ` [PATCH 14/16] arm: Set SD_ASYM_CPUCAPACITY for big.LITTLE platforms Morten Rasmussen
2016-05-23 10:58 ` [PATCH 15/16] arm: Set SD_BALANCE_WAKE flag for asymmetric capacity systems Morten Rasmussen
2016-05-23 10:58 ` [PATCH 16/16] arm: Update arch_scale_cpu_capacity() to reflect change to define Morten Rasmussen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).